
Safety Cost of Steering Vectors Is Separable and Reducible

MCML Authors

Yuxiao Li

→ Group Gjergji Kasneci
Responsible Data Science

Gjergji Kasneci

Prof. Dr.

Core PI

Responsible Data Science

Abstract

Steering vectors are a lightweight tool for controlling LLM behavior. However, emerging evidence shows that steering vectors can unintentionally compromise a model's safety mechanisms and increase compliance with harmful requests, while no effective mitigation yet exists. In this work, we show that this safety degradation arises from a separable component in the vector that disrupts the model's safety mechanisms but contributes little to the steering objective. We identify and remove this safety-degrading component, formulating the task as a constrained optimization problem solved through primal-dual updates, subject to preserving the intended steering effect and bounding false refusal. The resulting solution is both interpretable and surgical: the optimization recovers a single direction whose ablation from the steering vector restores model safety with minimal utility cost. Across models, steering behaviors, and attack suites, including unseen attack types, our method substantially reduces steering-induced safety degradation while preserving the original steering effect with minimal impact on false refusal. Our method offers a post-hoc correction to steering vectors that mitigates their safety cost, and more broadly, it provides a general recipe for applying activation-level model interventions without paying a safety tax.
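
The ablation step mentioned in the abstract, removing a single direction from a steering vector, can be illustrated with a plain orthogonal projection. The following is a minimal sketch and not the authors' implementation: the function name `ablate_direction`, the vector dimension, and the random vectors are all assumptions made for illustration. In the paper the direction is recovered by the constrained optimization, which is not reproduced here.

```python
import numpy as np


def ablate_direction(steering_vector, direction):
    """Return steering_vector with its component along `direction` removed.

    Hypothetical helper: a standard orthogonal projection,
    v' = v - (v . d_hat) d_hat, where d_hat is `direction` normalized.
    """
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = direction / norm
    return steering_vector - np.dot(steering_vector, unit) * unit


# Random stand-ins (assumptions): a real steering vector would come from
# model activations, and the direction from the paper's optimization.
rng = np.random.default_rng(0)
v = rng.normal(size=8)  # placeholder steering vector
d = rng.normal(size=8)  # placeholder safety-degrading direction

v_clean = ablate_direction(v, d)
```

After the projection, `v_clean` has no component along `d`, while every component of `v` orthogonal to `d` is left unchanged. This is the sense in which removing a single direction can be a surgical edit.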

inproceedings LK26a

CompLearn @ICML 2026

2nd Workshop on Compositional Learning: Safety, Interpretability, and Agents at the 43rd International Conference on Machine Learning. Seoul, South Korea, Jul 06-11, 2026. To be published. Preprint available.

Authors

Y. Li • G. Kasneci

Links

URL

Research Area

A1 | Statistical Foundations & Explainability

BibTeXKey: LK26a
