An important step in cluster analysis is the evaluation oft he quality of a given clustering through structural mea-sures of goodness. Measures that do not require additional information for their evaluation (but the clustering itself), called internal measures, are commonly used because of their generality. The most widely used internal measure is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, unfeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this paper, we present the first scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distances. Our algorithm approximates the silhouette coefficient within a mere additive error O (ε) with probability 1 − δ using a very small number of distance calculations, for any fixed ε, δ ∈ (0, 1). We also provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, thus making our estimation approach applicable to big data scenarios. An extensive experimental evaluation provides evidence that our algorithm returns highly accurate silhouette estimates,unlike competing heuristics, while running in a fraction of the time required by the exact computation.
Scalable Distributed Approximation of Internal Measures for Clustering Evaluation
Altieri, Federico;Pietracaprina, Andrea
;Pucci, Geppino
;Vandin, Fabio
2021
Abstract
An important step in cluster analysis is the evaluation oft he quality of a given clustering through structural mea-sures of goodness. Measures that do not require additional information for their evaluation (but the clustering itself), called internal measures, are commonly used because of their generality. The most widely used internal measure is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, unfeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this paper, we present the first scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distances. Our algorithm approximates the silhouette coefficient within a mere additive error O (ε) with probability 1 − δ using a very small number of distance calculations, for any fixed ε, δ ∈ (0, 1). We also provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, thus making our estimation approach applicable to big data scenarios. An extensive experimental evaluation provides evidence that our algorithm returns highly accurate silhouette estimates,unlike competing heuristics, while running in a fraction of the time required by the exact computation.Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.