
Fairness-Aware Reward Optimization

MCML Authors


Stefanie Jegelka

Prof. Dr.

Principal Investigator

Abstract

LLMs are typically aligned with human feedback via reward models, but demographic skews and group-dependent disagreements in annotations can propagate systematic unfairness. We introduce Fairness-Aware Reward Optimization (FARO), a principled framework for training reward models under demographic parity (DP), equalized odds (EO), or counterfactual fairness (CF) constraints. Our approach instantiates a proxy-Lagrangian descent-ascent game (ProxyGDA) that yields reward models with provable fairness certificates up to vanishing slack. We provide the first theoretical analysis of reward-level fairness in alignment, establishing: (i) guarantees that FARO-trained rewards satisfy DP, EO, or CF; (ii) a formal accuracy-fairness trade-off induced by KL-regularized RL fine-tuning; and (iii) the existence of Pareto-optimal solutions along this trade-off. Across multiple LLMs on the representative BBQ dataset, FARO consistently reduces demographic bias and harmful generations while preserving or improving LLM quality and factuality.
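To make the training game concrete, the sketch below shows one plausible reading of a proxy-Lagrangian descent-ascent loop: a reward model descends on a penalized preference loss while a Lagrange multiplier ascends on a demographic-parity constraint with slack. Everything here (the toy model, the DP proxy, the synthetic data) is an illustrative assumption, not the paper's actual implementation.

# Hypothetical sketch of a proxy-Lagrangian descent-ascent loop for
# fairness-constrained reward-model training, in the spirit of ProxyGDA.
# All names and the synthetic data are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    # Toy scalar reward head over fixed-size feature vectors.
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen, r_rejected):
    # Standard Bradley-Terry loss for pairwise preference data.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dp_gap(rewards, groups):
    # Demographic-parity proxy: gap in mean reward between two groups.
    return (rewards[groups == 0].mean() - rewards[groups == 1].mean()).abs()

dim, batch, slack = 16, 64, 0.05
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam, eta_lam = 0.0, 0.1   # Lagrange multiplier and its ascent step size

for step in range(200):
    # Synthetic batch: chosen/rejected features plus a binary group label.
    x_c, x_r = torch.randn(batch, dim), torch.randn(batch, dim)
    groups = torch.randint(0, 2, (batch,))

    r_c, r_r = model(x_c), model(x_r)
    gap = dp_gap(torch.cat([r_c, r_r]), torch.cat([groups, groups]))

    # Descent player: minimize preference loss plus the penalized constraint.
    loss = preference_loss(r_c, r_r) + lam * torch.relu(gap - slack)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Ascent player: raise the multiplier while the constraint is violated,
    # projecting back to lam >= 0.
    lam = max(0.0, lam + eta_lam * (gap.item() - slack))

At convergence of such a game, the multiplier settles so that the constraint holds up to the slack, which mirrors the "fairness certificates up to vanishing slack" claimed in the abstract.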

Preprint

Nov. 2025

Authors

C. L. Choi • V. Subramaniam • A. Torralba • P. Isola • S. Jegelka

Links

PDF

Research Area

A3 | Computational Models

BibTeX Key: CST+25
