Adaptive MapReduce Similarity Joins

Silvestri, Francesco
2018

Abstract

Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all points x ∈ S and y ∈ R with distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity join, and in particular two recent lines of work have made exciting progress on LSH-based join performance. Hu, Tao, and Yi (PODS 17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumüller, and Pagh (SODA 17) showed a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving them significantly on more structured data. We show that this adaptive strategy carries over to the parallel setting, combining the advantages of both approaches. In particular, we show that a simple modification to Hu et al.'s algorithm achieves bounds that depend on the density of points in the dataset as well as the total size of the output. Our algorithm uses no extra parameters over other LSH approaches (in particular, its execution does not depend on the structure of the dataset), and is likely to be efficient in practice.
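To make the problem statement concrete, the following is a minimal sketch of a plain (non-adaptive) LSH-based similarity join for Hamming distance using bit-sampling hash functions. It is not the algorithm of this paper; the function and parameter names (hamming_lsh_join, num_tables, num_bits) are illustrative assumptions, and close pairs are only reported with high probability when enough hash tables are used.

```python
import random
from collections import defaultdict

def hamming(x, y):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(x, y))

def hamming_lsh_join(S, R, r, dim, num_tables=10, num_bits=8, seed=0):
    """Report pairs (x, y) in S x R with Hamming distance at most r.

    Each of the num_tables hash functions samples num_bits coordinates
    (bit sampling, a classic LSH family for Hamming distance); points
    colliding in at least one table are verified with an exact distance
    computation. A close pair can be missed if it never collides, so this
    is a randomized, high-probability procedure, not an exact join.
    """
    rng = random.Random(seed)
    results = set()
    for _ in range(num_tables):
        coords = [rng.randrange(dim) for _ in range(num_bits)]
        buckets = defaultdict(lambda: ([], []))  # key -> (points of S, points of R)
        for x in S:
            buckets[tuple(x[c] for c in coords)][0].append(x)
        for y in R:
            buckets[tuple(y[c] for c in coords)][1].append(y)
        for s_pts, r_pts in buckets.values():
            for x in s_pts:
                for y in r_pts:
                    if hamming(x, y) <= r:
                        results.add((x, y))
    return results

if __name__ == "__main__":
    dim = 16
    rng = random.Random(1)
    S = [tuple(rng.randint(0, 1) for _ in range(dim)) for _ in range(50)]
    R = [tuple(rng.randint(0, 1) for _ in range(dim)) for _ in range(50)]
    print(len(hamming_lsh_join(S, R, r=3, dim=dim)))
```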
Proc. 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond
ISBN: 9781450357036
Files in this record:

File: 1804.05615.pdf
Access: open access
Description: Arxiv pre-print
Type: Preprint (submitted version)
License: Free access
Size: 156.9 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3277697
Citations
  • Scopus: 5