
Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models


Abstract

Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer blocks where ethical-framework-related pathways first converge and then diverge. Gating non-target branches at these loci blocks downstream propagation while leaving upstream computations intact. We find that this intervention alone increases reasoning in the targeted ethical framework. To achieve fine-grained control, we adapt Common Spatial Patterns to the residual stream and extract, for each branch-point layer, a pair of directions that discriminate between utilitarian and deontological frameworks. We then introduce Dual Logit Calibration, a closed-form, minimum-ℓ2-norm update that moves the residual within this two-dimensional subspace so that the resulting directional projections align with user-specified preference weights. Experiments on real-life moral dilemmas show that our method reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines while providing an interpretable mechanism.
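The abstract describes Dual Logit Calibration only at a high level, but the stated constraints (closed-form, minimum ℓ2 norm, projections onto a two-dimensional framework subspace matching preference weights) pin down one natural reading: a least-norm shift of the residual constrained to hit target projections. The sketch below illustrates that reading in NumPy; the function name, the direction matrix `D` (columns: utilitarian and deontological directions), and the interpretation of preference weights as target projections `t` are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def dual_logit_calibrate(h, D, t):
    """Minimum-l2-norm shift of residual h so that its projections onto
    the two framework directions (the columns of D) equal the targets t.

    Solves: min ||delta||_2  s.t.  D^T (h + delta) = t,
    whose closed form is delta = D (D^T D)^{-1} (t - D^T h).
    """
    p = D.T @ h                                 # current projections
    delta = D @ np.linalg.solve(D.T @ D, t - p) # least-norm correction
    return h + delta

# Toy usage with random stand-ins for a residual and two directions.
rng = np.random.default_rng(0)
h = rng.normal(size=768)                        # hypothetical residual
D = rng.normal(size=(768, 2))                   # hypothetical direction pair
t = np.array([0.8, 0.2])                        # user preference weights
h_new = dual_logit_calibrate(h, D, t)
```

Because the correction lies entirely in the span of `D`, every component of the residual orthogonal to the framework subspace is left untouched, which matches the abstract's claim that upstream/general computation is preserved.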



Preprint

May 2026

Authors

C. Yuan, Z. Zhang, G. Kasneci

Links

arXiv GitHub

Research Area

 A1 | Statistical Foundations & Explainability

BibTeX Key: YZK26
