LLMs Instead of Human Judges? A Large Scale Empirical Study Across 20 NLP Evaluation Tasks

MCML Authors

Philipp Mondorf

→ Group Barbara Plank
AI and Computational Linguistics

Barbara Plank

Prof. Dr.

Core PI

AI and Computational Linguistics

Abstract

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.
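To make the abstract's notion of "variance across datasets in its correlation to human judgments" concrete, here is a minimal sketch of one way such a comparison could be computed. It assumes Spearman rank correlation as the agreement measure and uses invented toy scores; the dataset names, the `judgments` structure, and the helper functions are hypothetical and are not taken from JUDGE-BENCH or the paper's code.

```python
from statistics import mean, pstdev


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: human and model scores for the same five items.
judgments = {
    "dataset_a": {"human": [1, 2, 3, 4, 5], "model": [1, 2, 3, 4, 5]},
    "dataset_b": {"human": [1, 2, 3, 4, 5], "model": [2, 1, 4, 3, 5]},
    "dataset_c": {"human": [1, 2, 3, 4, 5], "model": [5, 4, 3, 2, 1]},
}

# One correlation per dataset, then the spread of those correlations.
per_dataset = {
    name: spearman(scores["human"], scores["model"])
    for name, scores in judgments.items()
}
spread = pstdev(list(per_dataset.values()))

for name, rho in per_dataset.items():
    print(f"{name}: rho = {rho:.2f}")
print(f"spread across datasets: {spread:.2f}")
```

In this toy setup the same model agrees perfectly with humans on one dataset and is perfectly inverted on another, so a single averaged score would hide how unreliable it is as a judge. Categorical annotations would need a different agreement measure than rank correlation.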

inproceedings BBB+25

ACL 2025

63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria, Jul 27-Aug 01, 2025.

Authors

A. Bavaresco • R. Bernardi • L. Bertolazzi • D. Elliott • R. Fernández • A. Gatt • E. Ghaleb • M. Giulianelli • M. Hanna • A. Koller • A. F. T. Martins • P. Mondorf • V. Neplenbroek • S. Pezzelle • B. Plank • D. Schlangen • A. Suglia • A. K. Surikuchi • E. Takmaz • A. Testoni