
Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models


Abstract

Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer blocks where ethical-framework-related pathways first converge and then diverge. Gating non-target branches at these loci blocks downstream propagation while leaving upstream computations intact. We find that this intervention alone increases reasoning in the targeted ethical framework. To achieve fine-grained control, we adapt Common Spatial Patterns to the residual stream and extract, for each branch-point layer, a pair of directions that discriminate between utilitarian and deontological frameworks. We then introduce Dual Logit Calibration, a closed-form, minimum-ℓ2-norm update that moves the residual within this two-dimensional subspace so that the resulting directional projections align with user-specified preference weights. Experiments on real-life moral dilemmas show that our method reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines while providing an interpretable mechanism.
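The abstract describes Dual Logit Calibration only at a high level, but the stated constraints (closed-form, minimum ℓ2 norm, projections onto a two-dimensional framework subspace matching preference weights) pin down one natural reading: a least-norm shift of the residual constrained to hit target projections. The sketch below illustrates that reading in NumPy; the function name, the direction matrix `D` (columns: utilitarian and deontological directions), and the interpretation of preference weights as target projections `t` are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def dual_logit_calibrate(h, D, t):
    """Minimum-l2-norm shift of residual h so that its projections onto
    the two framework directions (the columns of D) equal the targets t.

    Solves: min ||delta||_2  s.t.  D^T (h + delta) = t,
    whose closed form is delta = D (D^T D)^{-1} (t - D^T h).
    """
    p = D.T @ h                                 # current projections
    delta = D @ np.linalg.solve(D.T @ D, t - p) # least-norm correction
    return h + delta

# Toy usage with random stand-ins for a residual and two directions.
rng = np.random.default_rng(0)
h = rng.normal(size=768)                        # hypothetical residual
D = rng.normal(size=(768, 2))                   # hypothetical direction pair
t = np.array([0.8, 0.2])                        # user preference weights
h_new = dual_logit_calibrate(h, D, t)
```

Because the correction lies entirely in the span of `D`, every component of the residual orthogonal to the framework subspace is left untouched, which matches the abstract's claim that upstream/general computation is preserved.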



Preprint

May 2026

Authors

C. Yuan, Z. Zhang, G. Kasneci

Links

arXiv GitHub

Research Area

 A1 | Statistical Foundations & Explainability

BibTeX Key: YZK26
