
Don't Walk the Line: Boundary Guidance for Filtered Generation

MCML Authors

Sarah Ball

→ Group Frauke Kreuter
Social Data Science and AI

Abstract

Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

inproceedings BH26a

ICML 2026

43rd International Conference on Machine Learning. Seoul, South Korea, Jul 06-11, 2026. To be published. Preprint available.

Authors

S. Ball • A. Haupt

Links

arXiv

Research Area

C4 | Computational Social Sciences

BibTeXKey: BH26a
