Attributions All the Way Down? The Metagame of Interpretability

MCML Authors

Fabian Fumagalli

Prof. Dr.

Thomas Bayes Fellow

* Former Thomas Bayes Fellow

Abstract

We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution ϕ(f) explaining a model f, we measure the directional influence of feature j on the attribution of feature i, denoted as meta-attribution φ_{j→i}(f), by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
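To make the idea concrete, the following is a minimal sketch of one plausible reading of the construction, not the authors' implementation. It assumes that the first-order attribution ϕ is the exact Shapley value, and that the metagame for a target feature i assigns to each coalition S the attribution of i in the game restricted to S (zero when i is not in S). Both assumptions, along with the names `toy_model`, `shapley`, and `meta_attributions`, are illustrative and do not come from the paper; its actual definition may differ.

```python
from itertools import combinations
from math import factorial


def shapley(value, players):
    """Exact Shapley values of `value` (a function on frozensets) over `players`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_model(s):
    """Hypothetical toy game: two additive terms plus one 0-1 interaction."""
    out = 0.0
    if 0 in s:
        out += 1.0
    if 2 in s:
        out += 3.0
    if 0 in s and 1 in s:
        out += 2.0
    return out


def meta_attributions(value, players, i):
    """Shapley values of the assumed metagame S -> phi_i(value restricted to S)."""

    def metagame(s):
        if i not in s:
            return 0.0  # assumption: an absent feature receives no attribution
        return shapley(value, sorted(s))[i]

    return shapley(metagame, players)


players = [0, 1, 2]
phi = shapley(toy_model, players)                # first-order attributions
meta = meta_attributions(toy_model, players, 0)  # influence of each j on feature 0
print(phi)
print(meta)
```

Under these assumptions, the efficiency property of the Shapley value implies that the meta-attributions for feature i sum to its first-order attribution, which mirrors the hierarchical decomposition described in the abstract. The enumeration is exponential in the number of features, so the sketch only suits very small toy games.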


Preprint

May 2026

Authors

H. Baniecki • P. Biecek • F. Fumagalli

Links

arXiv GitHub

Research Area

 A1 | Statistical Foundations & Explainability

BibTeXKey: BBF26
