Karsten Roth
* Former Member
Multimodal Large Language Models (MLLMs) exhibit in-context learning (ICL) abilities, yet we lack an understanding of how these models actually perform multimodal ICL. We train modern transformer models on synthetic classification tasks, systematically varying data statistics and model architecture. We find that pretraining on a highly diverse primary modality installs the ICL circuit, so that the secondary modality can attain comparable ICL with much lower data complexity. Scaling up the multimodal decoder improves ICL capacity, while the encoder for the second modality sets the ceiling: weak representations bottleneck multimodal ICL. At fixed data complexity, rotary position embeddings (RoPE) actively harm ICL by disrupting attention circuits. Through mechanistic analysis with progress measurements that track the formation of ICL circuits, we demonstrate that both unimodal and multimodal ICL rely on a common induction-style circuit that copies the label from the in-context exemplar that matches the query. Multimodal training primarily refines this behavior rather than introducing new circuitry. These results offer a clear, mechanism-level account and practical levers for engineering ICL in modern multimodal transformers.
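As a loose illustration of the induction-style mechanism described above (matching a query against in-context exemplars and copying the matched label), here is a minimal NumPy sketch of a synthetic classification episode with an attention-like match-and-copy readout. The episode generator, parameter choices, and soft matching are illustrative assumptions, not the paper's actual setup or code.

```python
# Illustrative sketch only: a synthetic few-shot classification episode and an
# induction-style "match-and-copy" readout. All names and parameters here
# (num_classes, shots, dim, noise, temperature) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def make_episode(num_classes=8, shots=4, dim=32, noise=0.1):
    """Sample class prototypes, noisy in-context exemplars, and a query."""
    prototypes = rng.normal(size=(num_classes, dim))
    labels = rng.integers(0, num_classes, size=shots)
    exemplars = prototypes[labels] + noise * rng.normal(size=(shots, dim))
    query_label = rng.choice(labels)  # query belongs to one in-context class
    query = prototypes[query_label] + noise * rng.normal(size=dim)
    return exemplars, labels, query, query_label

def induction_readout(exemplars, labels, query, temperature=0.1):
    """Softly match the query against exemplars, then copy the label that
    receives the most attention weight (an induction-style label copy)."""
    scores = exemplars @ query                       # attention-like similarity
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores / temperature)
    weights /= weights.sum()
    votes = np.zeros(labels.max() + 1)
    np.add.at(votes, labels, weights)                # aggregate weight per label
    return votes.argmax()

trials, correct = 1000, 0
for _ in range(trials):
    exemplars, labels, query, query_label = make_episode()
    correct += induction_readout(exemplars, labels, query) == query_label
print(f"match-and-copy accuracy: {correct / trials:.2f}")
```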
inproceedings, BibTeX key: HRB+25