Safety Cost of Steering Vectors Is Separable and Reducible
Abstract
Steering vectors are a lightweight tool for controlling LLM behavior. However, emerging evidence shows that steering vectors can unintentionally compromise a model's safety mechanisms and increase compliance with harmful requests, and no effective mitigation yet exists. In this work, we show that this safety degradation arises from a separable component of the vector that disrupts the model's safety mechanisms but contributes little to the steering objective. We identify and remove this safety-degrading component by formulating the task as a constrained optimization problem, solved through primal-dual updates, subject to preserving the intended steering effect and bounding false refusal. The resulting solution is both interpretable and surgical: the optimization recovers a single direction whose ablation from the steering vector restores model safety with minimal utility cost. Across models, steering behaviors, and attack suites, including unseen attack types, our method substantially reduces steering-induced safety degradation while preserving the original steering effect with minimal impact on false refusal. Our method offers a post-hoc correction to steering vectors that mitigates their safety cost, and more broadly, it provides a general recipe for applying activation-level model interventions without paying a safety tax.
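The abstract describes the correction as ablating a single direction from the steering vector. A minimal sketch of that ablation step alone, written as an orthogonal projection, might look like the following. The function name, the example vectors, and the direction itself are hypothetical; in the paper the direction is recovered by the constrained optimization, which is not reproduced here.

```python
import math

def ablate_direction(vector, direction):
    """Return `vector` with its component along `direction` projected out."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [x / norm for x in direction]
    # Scalar projection of the vector onto the unit direction.
    coeff = sum(v * u for v, u in zip(vector, unit))
    # Subtract the projected component; the result is orthogonal to `direction`.
    return [v - coeff * u for v, u in zip(vector, unit)]

# Hypothetical values for illustration only (not taken from the paper).
steering_vector = [0.5, -1.0, 2.0, 0.25]
safety_direction = [1.0, 0.0, 1.0, 0.0]
cleaned = ablate_direction(steering_vector, safety_direction)
```

After the projection, the cleaned vector has no component along the chosen direction, while coordinates orthogonal to that direction are left unchanged.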
inproceedings LK26a
CompLearn @ICML 2026
2nd Workshop on Compositional Learning: Safety, Interpretability, and Agents at the 43rd International Conference on Machine Learning. Seoul, South Korea, Jul 06-11, 2026. To be published. Preprint available.
Authors
Y. Li • G. Kasneci
BibTeXKey: LK26a