Implementing Distributed Approximate Similarity Joins using Locality Sensitive Hashing

Aumuller, M.; Ceccarello, M.

doi:10.5441/002/edbt.2022.07

Similarity joins are a basic primitive in data mining. Given two sets of points, we are interested in reporting all pairs of points whose similarity is above a user-defined threshold. Solving the problem naively entails verifying all possible pairs, which can be infeasible for large inputs. In such contexts, Locality Sensitive Hashing (LSH) is often considered to reduce the number of pairs to verify. However, while it provides subquadratic running time, large input sets make it nevertheless necessary to resort to distributed computing. Hu, Yi, and Tao (PODS’17, TODS’19) proposed a nearly load-optimal LSH-based join algorithm and provided a small-scale experimental study in a distributed setting. This paper provides further analysis of their approach. It shows that the load-minimizing parameter settings by Hu et al. incur too much local work, rendering it impractical. To remedy this drawback, we propose two approaches: The first distributes work in a data-independent way, while the second adapts to the data distribution using LSH. Both schemes then use LSH to solve subproblems locally. This allows to balance load and the amount of local work. Through an experimental evaluation, we show that the transition from theory to practice for Hu et al.’s approach is challenging: it is hard to strike a good tradeoff between the load and the amount of local work of each processor, load balancing is itself an issue, and LSH may introduce duplicates in the output. Our extensive experimental evaluation is supported by an efficient open source implementation of all the methods we test. Our results highlight the need for an holistic approach: only focusing on the load, as tradition in the MPC model, might not make efficient use the available resources and better trade-offs between local work and load are possible.