Karsten Roth
* Former Member
Multimodal Large Language Models (MLLMs) exhibit in-context learning (ICL) abilities, yet we lack an understanding of how these models actually perform multimodal ICL. We train modern transformer models on synthetic classification tasks, systematically varying data statistics and model architecture. We find that pretraining on a highly diverse primary modality installs the ICL circuit, so that the secondary modality can attain comparable ICL with much lower data complexity. Scaling up the multimodal decoder improves ICL capacity, while the encoder for the second modality sets the ceiling: weak representations bottleneck multimodal ICL. At fixed data complexity, rotary position embeddings (RoPE) actively harm ICL by disrupting attention circuits. Through mechanistic analysis with progress measurements that track the formation of ICL circuits, we demonstrate that both unimodal and multimodal ICL rely on a common induction-style circuit that copies the label from the in-context exemplar that matches the query. Multimodal training primarily refines this behavior rather than introducing new circuitry. These results offer a clear, mechanism-level account and practical levers for engineering ICL in modern multimodal transformers.
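As a loose illustration of the induction-style mechanism described above (matching a query against in-context exemplars and copying the matched label), here is a minimal NumPy sketch of a synthetic classification episode with an attention-like match-and-copy readout. The episode generator, parameter choices, and soft matching are illustrative assumptions, not the paper's actual setup or code.

```python
# Illustrative sketch only: a synthetic few-shot classification episode and an
# induction-style "match-and-copy" readout. All names and parameters here
# (num_classes, shots, dim, noise, temperature) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def make_episode(num_classes=8, shots=4, dim=32, noise=0.1):
    """Sample class prototypes, noisy in-context exemplars, and a query."""
    prototypes = rng.normal(size=(num_classes, dim))
    labels = rng.integers(0, num_classes, size=shots)
    exemplars = prototypes[labels] + noise * rng.normal(size=(shots, dim))
    query_label = rng.choice(labels)  # query belongs to one in-context class
    query = prototypes[query_label] + noise * rng.normal(size=dim)
    return exemplars, labels, query, query_label

def induction_readout(exemplars, labels, query, temperature=0.1):
    """Softly match the query against exemplars, then copy the label that
    receives the most attention weight (an induction-style label copy)."""
    scores = exemplars @ query                       # attention-like similarity
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores / temperature)
    weights /= weights.sum()
    votes = np.zeros(labels.max() + 1)
    np.add.at(votes, labels, weights)                # aggregate weight per label
    return votes.argmax()

trials, correct = 1000, 0
for _ in range(trials):
    exemplars, labels, query, query_label = make_episode()
    correct += induction_readout(exemplars, labels, query) == query_label
print(f"match-and-copy accuracy: {correct / trials:.2f}")
```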
inproceedings, BibTeX key: HRB+25