Don't Use LLMs to Make Relevance Judgments

Authors

Ian Soboroff

DOI:

https://doi.org/10.54195/irrj.19625

Keywords:

evaluation, automatic evaluation, relevance, llm

Abstract

Relevance judgments and other truth data for information retrieval (IR) evaluations are created manually. There is a strong temptation to use large language models (LLMs) as proxies for human judges. However, letting the LLM write your truth data handicaps the evaluation by setting that LLM as a ceiling on performance. There are ways to use LLMs in the relevance assessment process, but just generating relevance judgments with a prompt isn’t one of them.

Published

2025-03-25

Issue

Vol. 1 No. 1 (2025)

Section

Articles

How to Cite

Soboroff, I. (2025). Don’t Use LLMs to Make Relevance Judgments. Information Retrieval Research, 1(1), 29-46. https://doi.org/10.54195/irrj.19625