3-D Environment to Represent Textual Documents for Duplicate Detection and Collection Examination
DI NUNZIO, GIORGIO MARIA
2005
Abstract
As massive collections of textual documents become increasingly available in digital format, organizing and classifying these documents (for example, in a Digital Library Management System (DLMS)) becomes an important issue. A suitable graphical representation of documents would therefore help system designers during raw data exploration and help users interpret results more clearly. Automatic Text Categorization (ATC), the task of organizing large collections of documents into predefined categories by means of Machine Learning (ML) methods, is a potential (and rarely explored) field of application for Visual Data Exploration (VDE) techniques. Starting from a recent approach that represents documents with only two dimensions, we study the possibilities of an enhanced three-dimensional plot. In particular, we tackle the problem of detecting duplicates within document collections and examine how visualizing these duplicates may help both system designers and users.