Don't Use LLMs to Make Relevance Judgments
DOI: https://doi.org/10.54195/irrj.19625

Keywords: evaluation, automatic evaluation, relevance, llm

Abstract
Relevance judgments and other truth data for information retrieval (IR) evaluations are created manually. There is a strong temptation to use large language models (LLMs) as proxies for human judges. However, letting an LLM write your truth data handicaps the evaluation by setting that LLM as a ceiling on measurable performance. There are ways to use LLMs in the relevance assessment process, but simply generating relevance judgments with a prompt isn't one of them.
License
Copyright (c) 2025 Ian Soboroff (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.