Who Flips? Self- And Cross-Model Counterarguments Reveal Answer Instability in LLMs

MCML Authors

Nafiseh Nikeghbal

→ Group Jana Diesner
Human-Centered Computing

Amir Hossein Kargaran

→ Group Hinrich Schütze
Computational Linguistics

Shaghayegh Kolli

→ Group Jana Diesner
Human-Centered Computing

Jana Diesner

Prof. Dr.

Collaborating PI

Human-Centered Computing

Abstract

Standard accuracy benchmarks test whether large language models (LLMs) reach the correct answer, but not whether they keep that answer when presented with a plausible counterargument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge it with a coherent argument for an incorrect option and measure whether the model flips. The setup isolates argumentative content from overt social pressure and varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that accuracy alone does not capture. Self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Moreover, pooling challenges across models can yield stronger adversarial examples than any single source. We further construct MAXFLIP, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MAXFLIP to support stability evaluation alongside standard accuracy benchmarks.
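
To make the reported quantities concrete, here is a minimal sketch of how a flip rate and a percentage-point (pp) difference could be computed. It is an illustration, not the authors' released code. The `Trial` record, its field names, and the rule that a flip is any post-challenge answer that differs from the correct one are assumptions; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record layout; field names are assumptions for illustration.
    correct: str          # gold answer option, e.g. "A"
    initial: str          # model's answer before the challenge
    after_challenge: str  # model's answer after seeing the counterargument


def flip_rate(trials):
    """Percentage of initially correct trials whose answer changed after the challenge."""
    # Only questions the model first answered correctly are eligible.
    eligible = [t for t in trials if t.initial == t.correct]
    if not eligible:
        raise ValueError("no initially correct trials to evaluate")
    # Assumed flip definition: the post-challenge answer is no longer the correct one.
    flips = sum(1 for t in eligible if t.after_challenge != t.correct)
    return 100.0 * flips / len(eligible)


def pp_delta(rate_a, rate_b):
    """Difference between two rates in percentage points (not a relative change)."""
    return rate_a - rate_b


# Toy data, invented for this example only.
trials = [
    Trial("A", "A", "B"),  # flipped
    Trial("A", "A", "A"),  # stable
    Trial("C", "B", "C"),  # initially wrong, excluded
    Trial("D", "D", "D"),  # stable
    Trial("B", "B", "C"),  # flipped
]
rate = flip_rate(trials)
```

Under these assumptions, a self-attribution effect would be `pp_delta(rate_with_attribution, rate_without_attribution)`, with each rate computed on its own set of trials.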

inproceedings NKK+26a

AI4GOOD @ICML 2026

Workshop on Trustworthy AI for Good at the 43rd International Conference on Machine Learning. Seoul, South Korea, Jul 06-11, 2026. To be published. Preprint available.