Turning on a DIME: Estimating Dimension Importance for Dense Information Retrieval
Faggioli G.; Ferro N.
2024
Abstract
Dense Information Retrieval approaches are considered state-of-the-art and are based on projecting queries and documents into a latent space, where each dimension encodes a latent characteristic of the text. In this paper, we state the Manifold Clustering (MC) Hypothesis: projecting queries and documents onto a subspace of the original representation space can improve retrieval effectiveness. Based on the MC hypothesis, we define Dimension IMportance Estimators (DIMEs). A DIME operates on the query representation to estimate the expected importance of each dimension. These estimates can be used to truncate the representation to only the most important dimensions. We describe two DIMEs, one based on the response generated by a Large Language Model (LLM), and one that relies on the user’s active feedback. Our experiments show that the LLM-based DIME enables performance improvements of up to +11.5% (moving from 0.675 to 0.752 nDCG@10) compared to the baseline methods using all dimensions. The DIME based on active feedback outperforms the baseline by an even larger margin, up to +0.224 nDCG@10 points (+58.6%, moving from 0.384 to 0.608).
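To make the truncation idea concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not say how importance is computed, so the sketch assumes a hypothetical estimator: the element-wise product of the query embedding and a reference embedding (for example, the embedding of an LLM-generated answer or of a document the user marked as relevant). The function names, the `keep_fraction` parameter, and the toy vectors are all illustrative assumptions.

```python
import numpy as np


def dime_scores(query_vec, reference_vec):
    # ASSUMPTION: importance of dimension i is query_i * reference_i.
    # The abstract does not specify the estimator; this is a stand-in.
    return query_vec * reference_vec


def truncate_query(query_vec, importance, keep_fraction):
    # Keep the top-scoring dimensions and zero out the rest.
    # With inner-product scoring, zeroing a query dimension has the same
    # effect as dropping that dimension from both query and documents.
    k = max(1, int(round(keep_fraction * query_vec.shape[0])))
    top = np.argsort(-importance, kind="stable")[:k]
    masked = np.zeros_like(query_vec)
    masked[top] = query_vec[top]
    return masked


def rank(doc_matrix, query_vec):
    # Rank documents by inner product with the query, best first.
    scores = doc_matrix @ query_vec
    return np.argsort(-scores, kind="stable")


# Toy data (made up for illustration only).
query = np.array([1.0, 1.0, 1.0, 1.0])
reference = np.array([0.9, 0.8, -0.5, 0.1])  # hypothetical feedback embedding
docs = np.array([
    [0.2, 0.1, 0.9, 0.9],  # doc 0: strong only on the low-importance dims
    [0.8, 0.7, 0.0, 0.1],  # doc 1: strong on the high-importance dims
])

importance = dime_scores(query, reference)
truncated = truncate_query(query, importance, keep_fraction=0.5)

full_ranking = rank(docs, query)
dime_ranking = rank(docs, truncated)
```

In this toy setup the full query ranks doc 0 first, while the truncated query ranks doc 1 first, because the two retained dimensions are the ones where the query and the reference embedding agree. Whether such a re-ranking helps in practice depends on the quality of the reference signal, which is what the paper's experiments evaluate.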