Heterogeneous Ensembles for the Missing Feature Problem

Nanni, Loris; Fantozzi, Carlo
2013

Abstract

Missing values are ubiquitous in real-world datasets. In this work, we show how to handle them with heterogeneous ensembles of classifiers that outperform state-of-the-art solutions. Several approaches are compared across multiple datasets. State-of-the-art classifiers, e.g., SVM and RotBoost, are first tested in combination with the Expectation-Maximization (EM) imputation method; the classifiers are then combined to build ensembles. Using the Wilcoxon signed-rank test (null hypothesis rejected at a significance level of 0.05), we show that our best heterogeneous ensembles, obtained by combining a forest of decision trees (a method that requires no dataset-specific tuning) with a cluster-based imputation method, outperform two dataset-tuned solutions: a stand-alone SVM classifier and a random subspace of SVMs, both based on LibSVM, the most widely used SVM toolbox in the world. Our heterogeneous ensembles also outperform a recent cluster-based imputation method for handling missing values (a method which has been shown to outperform several other state-of-the-art imputation approaches) when both the training set and the testing set contain 10% missing values.
Proceedings of the Northeast Decision Sciences Institute, 2013
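
The abstract above refers to several building blocks: imputation of missing values, a random subspace of SVM classifiers, a forest of decision trees, and a comparison of per-dataset results with the Wilcoxon signed-rank test at the 0.05 significance level. The sketch below is not the authors' code; it is a minimal scikit-learn/SciPy illustration of that kind of experiment, with mean imputation standing in for the paper's EM and cluster-based imputation methods, three toy datasets in place of the paper's benchmark, and illustrative hyperparameters.

import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def with_missing(X, rate=0.10):
    """Blank out `rate` of the entries uniformly at random (missing completely at random)."""
    X = X.astype(float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

# Stand-ins for the two kinds of pipelines mentioned in the abstract:
# a random subspace of SVMs, and a forest of decision trees. Both are
# preceded here by simple mean imputation (the paper uses EM and a
# cluster-based imputation method instead).
svm_subspace = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    BaggingClassifier(
        estimator=SVC(kernel="rbf", C=1.0),  # `base_estimator=` on scikit-learn < 1.2
        n_estimators=25,
        max_features=0.5,   # each SVM is trained on a random half of the features
        bootstrap=False,    # no sample bootstrapping: random subspace method
    ),
)
tree_forest = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# Evaluate both pipelines on each dataset after injecting 10% missing values.
acc_svm, acc_forest = [], []
for loader in (load_iris, load_wine, load_breast_cancer):
    X, y = loader(return_X_y=True)
    X = with_missing(X, rate=0.10)
    acc_svm.append(cross_val_score(svm_subspace, X, y, cv=5).mean())
    acc_forest.append(cross_val_score(tree_forest, X, y, cv=5).mean())

# Paired, non-parametric comparison across datasets. With only three toy
# datasets the test cannot reach significance; the paper's comparison is
# run over a much larger benchmark.
stat, p = wilcoxon(acc_forest, acc_svm)
verdict = "significant at 0.05" if p < 0.05 else "not significant at 0.05"
print(f"forest vs. SVM subspace: W = {stat:.1f}, p = {p:.3f} ({verdict})")
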
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/2552888