An important step in cluster analysis is the evaluation oft he quality of a given clustering through structural mea-sures of goodness. Measures that do not require additional information for their evaluation (but the clustering itself), called internal measures, are commonly used because of their generality. The most widely used internal measure is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, unfeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this paper, we present the first scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distances. Our algorithm approximates the silhouette coefficient within a mere additive error O (ε) with probability 1 − δ using a very small number of distance calculations, for any fixed ε, δ ∈ (0, 1). We also provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, thus making our estimation approach applicable to big data scenarios. An extensive experimental evaluation provides evidence that our algorithm returns highly accurate silhouette estimates,unlike competing heuristics, while running in a fraction of the time required by the exact computation.

Scalable Distributed Approximation of Internal Measures for Clustering Evaluation

Altieri, Federico;Pietracaprina, Andrea
;
Pucci, Geppino
;
Vandin, Fabio
2021

Abstract

An important step in cluster analysis is the evaluation oft he quality of a given clustering through structural mea-sures of goodness. Measures that do not require additional information for their evaluation (but the clustering itself), called internal measures, are commonly used because of their generality. The most widely used internal measure is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, unfeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this paper, we present the first scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distances. Our algorithm approximates the silhouette coefficient within a mere additive error O (ε) with probability 1 − δ using a very small number of distance calculations, for any fixed ε, δ ∈ (0, 1). We also provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, thus making our estimation approach applicable to big data scenarios. An extensive experimental evaluation provides evidence that our algorithm returns highly accurate silhouette estimates,unlike competing heuristics, while running in a fraction of the time required by the exact computation.
2021
Proceedings of the 2021 SIAM International Conference on Data Mining, SDM 2021
SIAM International Conference on Data Mining, SDM 2021
978-1-61197-670-0
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3401589
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 2
  • ???jsp.display-item.citation.isi??? ND
  • OpenAlex ND
social impact