
BoN Appetit Team at LeWiDi-2025: Best-of-N Test-Time Scaling Can Not Stomach Annotation Disagreements (Yet)

Abstract

Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation. To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding. We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements. We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N sampling method. The two benchmark methods improve LLM performance consistently on the LeWiDi tasks, but the Best-of-N method does not. Our experiments suggest that the Best-of-N method does not currently transfer from mathematics to LeWiDi tasks, and we analyze potential reasons for this gap.
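The three methods named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: each LLM sample is taken to be a predicted soft-label distribution over annotation classes, Model Averaging is read as the element-wise mean of the samples, Majority Voting as the most frequent sampled prediction, and Best-of-N as the sample ranked highest by a scoring function. The function names and the `score` callable are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

# Assumption: one LLM sample = one predicted soft-label distribution.
Distribution = tuple[float, ...]


def model_averaging(samples: Sequence[Distribution]) -> Distribution:
    """Element-wise mean of the N sampled distributions."""
    n = len(samples)
    return tuple(sum(column) / n for column in zip(*samples))


def majority_voting(samples: Sequence[Distribution]) -> Distribution:
    """Most frequent sampled prediction (ties go to the first one seen)."""
    return Counter(samples).most_common(1)[0][0]


def best_of_n(
    samples: Sequence[Distribution],
    score: Callable[[Distribution], float],
) -> Distribution:
    """Sample with the highest score under a hypothetical scorer."""
    return max(samples, key=score)


# Four hypothetical samples for a binary annotation task.
samples = [(0.5, 0.5), (1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]

averaged = model_averaging(samples)
voted = majority_voting(samples)
# Toy scorer that simply prefers mass on the first class.
picked = best_of_n(samples, score=lambda dist: dist[0])
```

Under these assumptions, the two aggregation methods combine all N samples, while Best-of-N depends entirely on how well the scorer ranks individual samples.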


Preprint

Oct. 2025

Authors

T. Ruiz • S. Peng • B. Plank • C. Schwemmer

Research Area

 B2 | Natural Language Processing

BibTeXKey: RPB+25
