
A Cost-Effective Framework to Evaluate LLM-Generated Relevance Judgements

Merlo S.; Marchesin S.; Faggioli G.; Ferro N.
2025

Abstract

Large Language Models (LLMs) have had a major impact on many research fields, including Information Retrieval (IR), where they are used for several sub-tasks, such as query rewriting and retrieval-augmented generation. At the same time, the research community is investigating whether and how LLMs can support, or even replace, humans in generating relevance judgments. Indeed, generating relevance judgments automatically, or integrating an LLM into the annotation process, would make it possible to increase the number of evaluation collections, including for scenarios where the annotation process is particularly challenging. To validate the relevance judgments produced by an LLM, they are compared with human-made relevance judgments, measuring the inter-assessor agreement between the human and the LLM. Our work introduces an innovative framework for estimating the quality of LLM-generated relevance judgments that provides statistical guarantees while minimizing human involvement. The proposed framework allows one to: i) estimate the quality of LLM-generated relevance judgments with a given confidence while minimizing human involvement; and ii) estimate the quality of LLM-generated relevance judgments with a fixed budget while providing bounds on the estimate. Our experimental results on three well-known IR collections, using multiple LLMs as assessors, show that it is sufficient to assess 16% of the LLM-generated relevance judgments to estimate the LLM's performance with 95% confidence.
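Since no full text is attached to this record, the exact estimation procedure is not reproduced here. The sketch below is only a minimal illustration of the general idea stated in the abstract: treat the human-LLM agreement as an unknown proportion and use a concentration bound (here Hoeffding's inequality, an illustrative assumption, not necessarily the bound used in the paper) to compute how many randomly sampled LLM-generated judgments a human would need to re-assess to estimate that agreement within a chosen margin at 95% confidence. All function names and parameter values are hypothetical.

```python
import math


def hoeffding_sample_size(margin: float, confidence: float) -> int:
    """Number of human-checked judgments needed so that the observed
    human-LLM agreement rate is within `margin` of the true agreement
    with probability at least `confidence`.

    Illustrative assumption based on Hoeffding's inequality:
        P(|p_hat - p| >= margin) <= 2 * exp(-2 * n * margin**2)
    This is not necessarily the estimator proposed in the paper.
    """
    alpha = 1.0 - confidence
    return math.ceil(math.log(2.0 / alpha) / (2.0 * margin ** 2))


def observed_agreement(llm_labels, human_labels) -> float:
    """Fraction of sampled judgments where the LLM and the human agree."""
    assert len(llm_labels) == len(human_labels) and llm_labels
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    return matches / len(llm_labels)


if __name__ == "__main__":
    # Example: a +/-0.05 margin at 95% confidence requires about 738
    # human-checked judgments under this bound.
    n = hoeffding_sample_size(margin=0.05, confidence=0.95)
    print(f"Human-checked judgments needed: {n}")
```

Under this simple bound the required number of human checks depends only on the margin and the confidence level, not on the collection size; the framework described in the abstract presumably refines this kind of reasoning, since it reports that assessing 16% of the LLM-generated judgments sufficed in the experiments.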
CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management
34th ACM International Conference on Information and Knowledge Management, CIKM 2025
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3571881