We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, ADABOOST. MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multilabel text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
Discretizing continuous attributes in AdaBoost for text categorization
SEBASTIANI, FABRIZIO;SPERDUTI, ALESSANDRO
2003
Abstract
We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, ADABOOST. MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multilabel text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.