Don't Use LLMs to Make Relevance Judgments
DOI: https://doi.org/10.54195/irrj.19625

Keywords: evaluation, automatic evaluation, relevance, llm

Abstract
Relevance judgments and other truth data for information retrieval (IR) evaluations are created manually. There is a strong temptation to use large language models (LLMs) as proxies for human judges. However, letting an LLM write your truth data handicaps the evaluation by setting that LLM as a ceiling on measurable performance. There are ways to use LLMs in the relevance assessment process, but simply generating relevance judgments with a prompt isn't one of them.
License
Copyright (c) 2025 Ian Soboroff (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.