3-D Environment to Represent Textual Documents for Duplicate Detection and Collection Examination

Di Nunzio, Giorgio Maria

As massive collections of textual documents become increasingly available in digital format, organizing and classifying these documents (for example, in a Digital Library Management System (DLMS)) becomes an important issue. A suitable graphical representation of documents would therefore help system designers during raw data exploration and help users interpret results more clearly. Automatic Text Categorization (ATC), the task of organizing large collections of documents into predefined categories by means of Machine Learning (ML) methods, is a potential and rarely explored field of application for Visual Data Exploration (VDE) techniques. Starting from a recent approach that represents documents with only two dimensions, we study the possibilities of an enhanced three-dimensional plot. In particular, we tackle the problem of detecting duplicates within document collections and examine how visualizing these duplicates may help both system designers and users.