Heterogeneous Ensembles for the Missing Feature Problem
Nanni, Loris; Fantozzi, Carlo
2013
Abstract
Missing values are ubiquitous in real-world datasets. In this work, we show how to handle them with heterogeneous ensembles of classifiers that outperform state-of-the-art solutions. Several approaches are compared across several different datasets. State-of-the-art classifiers such as SVM and RotBoost are tested first, coupled with the Expectation-Maximization (EM) imputation method. The classifiers are then combined to build ensembles. Using the Wilcoxon signed-rank test (significance level 0.05), we show that our best heterogeneous ensembles, obtained by combining a forest of decision trees (a method that requires no dataset-specific tuning) with a cluster-based imputation method, outperform two dataset-tuned solutions: a stand-alone SVM classifier and a random subspace of SVMs, both based on LibSVM, the most widely used SVM toolbox. Our heterogeneous ensembles also outperform a recent cluster-based imputation method for handling missing values (a method that has been shown to outperform several other state-of-the-art imputation approaches) when both the training set and the testing set contain 10% missing values.
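The experimental protocol summarized above — inject missing values, impute, classify, then compare methods with a paired Wilcoxon signed-rank test — can be sketched as follows. This is a minimal illustration, not the authors' implementation: mean imputation stands in for the EM and cluster-based imputation methods, `load_breast_cancer` is an arbitrary stand-in dataset, and the classifier settings are defaults rather than the paper's configurations.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset, not from the paper

# Inject ~10% missing values uniformly at random, mirroring the 10% setting
# mentioned in the abstract.
X = X.copy()
X[rng.random(X.shape) < 0.10] = np.nan

# Impute-then-classify pipelines: a forest of decision trees (no dataset-specific
# tuning) versus an SVM. Mean imputation is a placeholder for EM / cluster-based
# imputation.
forest = make_pipeline(SimpleImputer(strategy="mean"),
                       RandomForestClassifier(n_estimators=100, random_state=0))
svm = make_pipeline(SimpleImputer(strategy="mean"), SVC(gamma="scale"))

# Per-fold accuracies, then a paired Wilcoxon signed-rank test at alpha = 0.05.
scores_forest = cross_val_score(forest, X, y, cv=10)
scores_svm = cross_val_score(svm, X, y, cv=10)
stat, p = wilcoxon(scores_forest, scores_svm, zero_method="zsplit")
print(f"forest: {scores_forest.mean():.3f}  svm: {scores_svm.mean():.3f}  p = {p:.3f}")
```

The paired test is applied to per-fold scores here; in the paper the comparison is across datasets. Rejecting the null hypothesis at the 0.05 level indicates a statistically significant difference between the two methods.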
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.