Scalable Distributed Approximation of Internal Measures for Clustering Evaluation

Altieri, Federico; Pietracaprina, Andrea; Pucci, Geppino; Vandin, Fabio

doi:10.1137/1.9781611976700.73

An important step in cluster analysis is the evaluation oft he quality of a given clustering through structural mea-sures of goodness. Measures that do not require additional information for their evaluation (but the clustering itself), called internal measures, are commonly used because of their generality. The most widely used internal measure is the silhouette coeﬃcient, whose naive computation requires a quadratic number of distance calculations, unfeasible for massive datasets. Surprisingly, there are no known general methods to eﬃciently approximate the silhouette coeﬃcient of a clustering with rigorously provable high accuracy. In this paper, we present the ﬁrst scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distances. Our algorithm approximates the silhouette coeﬃcient within a mere additive error O (ε) with probability 1 − δ using a very small number of distance calculations, for any ﬁxed ε, δ ∈ (0, 1). We also provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, thus making our estimation approach applicable to big data scenarios. An extensive experimental evaluation provides evidence that our algorithm returns highly accurate silhouette estimates,unlike competing heuristics, while running in a fraction of the time required by the exact computation.