Causal Methods for LLM Development and Evaluation

MCML Authors

Dennis Frauen

Dr.

* Former Member

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Marie Brockschmidt

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Haorui Ma

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Yuchen Ma

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Abdurahman Maarouf

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Maresa Schröder

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Jonas Schweisthal

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Yuxin Wang

→ Group Stefan Feuerriegel
Artificial Intelligence in Management

Stefan Feuerriegel

Prof. Dr.

Core PI

Artificial Intelligence in Management

Abstract

Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of adding a data domain during pretraining? How do annotator preferences change when LLMs generate text in a different style? Should a prompt be routed to a larger or smaller model given inference cost constraints? In general, causal methods are well suited to such settings, where interventions change outcomes, but, surprisingly, they are underrepresented in LLM development. Our contribution is threefold: (1) We explain how causal methods can help improve modern LLM development and evaluation: LLM development relies heavily on logged data, which are often subject to confounding and distribution shifts; evaluation uses learned but potentially biased judges; and deployment environments are non-stationary. These conditions make purely predictive approaches fragile and create opportunities for principled identification and estimation methods from causal inference. (2) We further map opportunities for causal methods across the entire LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation. (3) We discuss new research opportunities around leveraging causal methods for LLM development and evaluation. Overall, we argue that causal methods are potentially underutilized in the LLM development and evaluation pipeline, even though such methods can ensure a reliable and scientifically grounded design.

inproceedings FBH+26

KDD 2026

32nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Jeju Island, Republic of Korea, Aug 09-13, 2026. To be published. Preprint available.

Authors

D. Frauen • M. Brockschmidt • K. Hess • H. Ma • Y. Ma • A. Maarouf • M. Schröder • J. Schweisthal • Y. Wang • A. Deviyani • S. Parbhoo • R. G. Krishnan • S. Feuerriegel

Links

arXiv

Research Area

A1 | Statistical Foundations & Explainability

BibTeXKey: FBH+26
