
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

MCML Authors

Prof. Dr. Frauke Kreuter, Principal Investigator

Abstract

Conversational large language models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. This paper aims to deepen our understanding of how different jailbreak types circumvent safeguards by analyzing model activations on different jailbreak inputs. We find that a jailbreak vector extracted from a single class of jailbreaks can mitigate the effectiveness of jailbreaks from other, semantically dissimilar classes. This suggests that diverse jailbreaks may exploit a common internal mechanism. We investigate one candidate mechanism, harmfulness feature suppression, and find evidence that effective jailbreaks noticeably reduce a model’s perception of prompt harmfulness. These insights pave the way for developing more robust jailbreak countermeasures and lay the groundwork for a deeper, mechanistic understanding of jailbreak dynamics in language models.

inproceedings BKP26


EACL 2026

19th Conference of the European Chapter of the Association for Computational Linguistics. Rabat, Morocco, Mar 24-29, 2026.
A Conference

Authors

S. Ball • F. Kreuter • N. Panickssery

Research Area

 C4 | Computational Social Sciences

BibTeXKey: BKP26
