Home  | Publications | WWS+25

Enhancing LLM Steering Through Sparse Autoencoder-Based Vector Refinement

MCML Authors

Abstract

Steering has emerged as a promising approach in controlling large language models (LLMs) without modifying model parameters. However, most existing steering methods rely on large-scale datasets to learn clear behavioral information, which limits their applicability in many real-world scenarios. The steering vectors extracted from small dataset often contain task-irrelevant noising features, which degrades their effectiveness. To refine the steering vectors learned from limited data, we introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV) that leverages SAEs to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features. Extensive experiments demonstrate that the proposed SAE-RSV substantially outperforms all the baseline methods including supervised fine-tuning. Our findings show that effective steering vector can be constructed from limited training data by refining the original steering vector through SAEs.

misc


Preprint

Sep. 2025

Authors

A. Wang • X. Wu • D. Shu • Y. Ma • N. Liu

Links


Research Area

 A3 | Computational Models

BibTeXKey: WWS+25

Back to Top