Home | Research | Groups | Stefanie Jegelka

Research Group Stefanie Jegelka

Stefanie Jegelka

Prof. Dr.

Principal Investigator

A3 | Computational Models

Foundations of Deep Neural Networks

Stefanie Jegelka

is a Humboldt Professor at TU Munich.

Her research is in algorithmic machine learning, and spans modeling, optimization algorithms, theory and applications. In particular, she has been working on exploiting mathematical structure for discrete and combinatorial machine learning problems, for robustness and for scaling machine learning algorithms.

Team members @MCML

PhD Students

Andreas Bergmeister

A3 | Computational Models
→ Group Stefanie Jegelka

Foundations of Deep Neural Networks

Vincent Bürgin

A3 | Computational Models
→ Group Stefanie Jegelka

Foundations of Deep Neural Networks

Valerie Engelmayer

A3 | Computational Models
→ Group Stefanie Jegelka

Foundations of Deep Neural Networks

Daniel Herbst

A3 | Computational Models
→ Group Stefanie Jegelka

Foundations of Deep Neural Networks

Eduardo Santos Escriche

A3 | Computational Models
→ Group Stefanie Jegelka

Foundations of Deep Neural Networks

Publications @MCML

2025

[19]

F. Kiwitt, B. Tahmasebi and S. Jegelka.
Symmetries in Weight Space Learning: To Retain or Remove?
HiLD @ICML 2025 - Workshop on High-dimensional Learning Dynamics at the 42nd International Conference on Machine Learning (ICML 2025). Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. URL

Abstract

Weight space learning, an emerging paradigm that seeks to understand neural networks through their space of parameters (weights), has shown promise in a variety of applications, including but not limited to predicting model behavior and addressing privacy concerns. However, weight spaces often exhibit inherent symmetries that impact both theory and practice, such as the scale and rotational invariances found in the Low-Rank Adaptation (LoRA) method, which is the state-of-the-art fine-tuning algorithm for Large Language Models (LLMs). In this work, we investigate a general weight space learning problem under symmetries, focusing on a fundamental question: What is the appropriate formulation for this problem in the presence of symmetries (such as those in LoRA), and should redundant representations that encode the same end-to-end function be removed? We address this question by fully characterizing a new space of symmetric weights, demonstrating that the relevance of redundancy depends on the function being predicted. Specifically, we show that end-to-end symmetries (such as those in LoRA) should not always be removed, as doing so may compromise the universality of the weight space learning problem. To our knowledge, this is the first time this phenomenon has been formally identified and presented, yielding insights into a broad class of weight space learning problems.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[18]

A. Soleymani, B. Tahmasebi, S. Jegelka and P. Jaillet.
Learning with Exact Invariances in Polynomial Time.
ICML 2025 - 42nd International Conference on Machine Learning. Vancouver, Canada, Jul 13-19, 2025. To be published. Preprint available. arXiv

Abstract

We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with emph{exact} invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (not approximate) invariances in this context. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[17]

A. Bergmeister, M. K. Lal, S. Jegelka and S. Sra.
A projection-based framework for gradient-free and parallel learning.
Preprint (Jun. 2025). arXiv

Abstract

We present a feasibility-seeking approach to neural network training. This mathematical optimization framework is distinct from conventional gradient-based loss minimization and uses projection operators and iterative projection algorithms. We reformulate training as a large-scale feasibility problem: finding network parameters and states that satisfy local constraints derived from its elementary operations. Training then involves projecting onto these constraints, a local operation that can be parallelized across the network. We introduce PJAX, a JAX-based software framework that enables this paradigm. PJAX composes projection operators for elementary operations, automatically deriving the solution operators for the feasibility problems (akin to autodiff for derivatives). It inherently supports GPU/TPU acceleration, provides a familiar NumPy-like API, and is extensible. We train diverse architectures (MLPs, CNNs, RNNs) on standard benchmarks using PJAX, demonstrating its functionality and generality. Our results show that this approach is as a compelling alternative to gradient-based training, with clear advantages in parallelism and the ability to handle non-differentiable operations.

MCML Authors

Andreas Bergmeister

A3 | Computational Models
→ Group Stefanie Jegelka

Foundations of Deep Neural Networks

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

Suvrit Sra

Prof. Dr.

A2 | Mathematical Foundations

Resource Aware Machine Learning

[16]

E. S. E. Eduardo Santos Escriche and S. Jegelka.
Learning equivariant models by discovering symmetries with learnable augmentations.
Preprint (Jun. 2025). arXiv

Abstract

Recently, a trend has emerged that favors learning relevant symmetries from data in geometric domains instead of designing constrained architectures. To do so, two popular options are (1) to modify the training protocol, e.g., with a specific loss and data augmentations (soft equivariance), or (2) to ignore equivariance and infer it only implicitly. However, both options have limitations: soft equivariance requires a priori knowledge about relevant symmetries, while inferring symmetries merely via the task and larger data lacks interpretability. To address both limitations, we propose SEMoLA, an end-to-end approach that jointly (1) discovers a priori unknown symmetries in the data via learnable data augmentations, and (2) softly encodes the respective approximate equivariance into an arbitrary unconstrained model. Hence, it does not need prior knowledge about symmetries, it offers interpretability, and it maintains robustness to distribution shifts. Empirically, we demonstrate the ability of SEMoLA to robustly discover relevant symmetries while achieving high prediction accuracy across various datasets, encompassing multiple data modalities and underlying symmetry groups.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[15]

X. Guo, A. Li, Y. Wang, S. Jegelka and Y. Wang.
G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning.
Preprint (May. 2025). arXiv GitHub

Abstract

Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs’ graph reasoning abilities. To enable RL training, we curate Erdõs, the largest graph reasoning dataset to date comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training data and 5k test data, all drived from real-world graphs. With RL on Erdõs, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[14]

L. Fang, Y. Wang, Z. Liu, C. Zhang, S. Jegelka, J. Gao, B. Ding and Y. Wang.
What is Wrong with Perplexity for Long-context Language Modeling?
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL GitHub

Abstract

Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose textbf{LongPPL}, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce textbf{LongCE} (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[13]

L. Rauchwerger, S. Jegelka and R. Levie.
Generalization, Expressivity, and Universality of Graph Neural Networks on Attributed Graphs.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

We analyze the universality and generalization of graph neural networks (GNNs) on attributed graphs, i.e., with node attributes. To this end, we propose pseudometrics over the space of all attributed graphs that describe the fine-grained expressivity of GNNs. Namely, GNNs are both Lipschitz continuous with respect to our pseudometrics and can separate attributed graphs that are distant in the metric. Moreover, we prove that the space of all attributed graphs is relatively compact with respect to our metrics. Based on these properties, we prove a universal approximation theorem for GNNs and generalization bounds for GNNs on any data distribution of attributed graphs. The proposed metrics compute the similarity between the structures of attributed graphs via a hierarchical optimal transport between computation trees. Our work extends and unites previous approaches which either derived theory only for graphs with no attributes, derived compact metrics under which GNNs are continuous but without separation power, or derived metrics under which GNNs are continuous and separate points but the space of graphs is not relatively compact, which prevents universal approximation and generalization analysis.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[12]

B. Tahmasebi and S. Jegelka.
Generalization Bounds for Canonicalization: A Comparative Study with Group Averaging.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. URL

Abstract

Canonicalization, a popular method for generating invariant or equivariant function classes from arbitrary function sets, involves initial data projection onto a reduced input space subset, followed by applying any learning method to the projected dataset. Despite recent research on the expressive power and continuity of functions represented by canonicalization, its generalization capabilities remain less explored. This paper addresses this gap by theoretically examining the generalization benefits and sample complexity of canonicalization, comparing them with group averaging, another popular technique for creating invariant or equivariant function classes. Our findings reveal two distinct regimes where canonicalization may outperform or underperform compared to group averaging, with precise quantification of this phase transition in terms of sample size, group action characteristics, and a newly introduced concept of alignment. To the best of our knowledge, this study represents the first theoretical exploration of such behavior, offering insights into the relative effectiveness of canonicalization and group averaging under varying conditions.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[11]

D. Herbst and S. Jegelka.
Higher-Order Graphon Neural Networks: Approximation and Cut Distance.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. Spotlight Presentation. URL

Abstract

Graph limit models, like graphons for limits of dense graphs, have recently been used to study size transferability of graph neural networks (GNNs). While most literature focuses on message passing GNNs (MPNNs), in this work we attend to the more powerful higher-order GNNs. First, we extend the -WL test for graphons (Böker, 2023) to the graphon-signal space and introduce signal-weighted homomorphism densities as a key tool. As an exemplary focus, we generalize Invariant Graph Networks (IGNs) to graphons, proposing Invariant Graphon Networks (IWNs) defined via a subset of the IGN basis corresponding to bounded linear operators. Even with this restricted basis, we show that IWNs of order are at least as powerful as the -WL test, and we establish universal approximation results for graphon-signals in distances. This significantly extends the prior work of Cai & Wang (2022), showing that IWNs—a subset of their IGN-small—retain effectively the same expressivity as the full IGN basis in the limit. In contrast to their approach, our blueprint of IWNs also aligns better with the geometry of graphon space, for example facilitating comparability to MPNNs. We highlight that, while typical higher-order GNNs are discontinuous w.r.t. cut distance—which causes their lack of convergence and is inherently tied to the definition of -WL—their transferability remains comparable to MPNNs.

MCML Authors

Daniel Herbst

A3 | Computational Models
→ Group Stefanie Jegelka

Foundations of Deep Neural Networks

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[10]

Q. Zhang, Y. Wang, J. Cui, X. Pan, Q. Lei, S. Jegelka and Y. Wang.
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness.
ICLR 2025 - 13th International Conference on Learning Representations. Singapore, Apr 24-28, 2025. To be published. Preprint available. URL

Abstract

Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantics, resulting in unclear attributions of model behavior. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to compromise accuracy. In this work, we challenge the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance. Across multiple robust learning scenarios-including input and label noise, few-shot learning, and out-of-domain generalization-our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide empirical and theoretical understandings on the robustness gains of feature monosemanticity. Our preliminary analysis suggests that monosemanticity, by promoting better separation of feature representations, leads to more robust decision boundaries. This diverse evidence highlights the generality of monosemanticity in improving model robustness. As a first step in this new direction, we embark on exploring the learning benefits of monosemanticity beyond interpretability, supporting the long-standing hypothesis of linking interpretability and robustness.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

2024

[9]

G. Ma, Y. Wang, D. Lim, S. Jegelka and Y. Wang.
A Canonicalization Perspective on Invariant and Equivariant Learning.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

In many applications, we desire neural networks to exhibit invariance or equivariance to certain groups due to symmetries inherent in the data. Recently, frame-averaging methods emerged to be a unified framework for attaining symmetries efficiently by averaging over input-dependent subsets of the group, i.e., frames. What we currently lack is a principled understanding of the design of frames. In this work, we introduce a canonicalization perspective that provides an essential and complete view of the design of frames. Canonicalization is a classic approach for attaining invariance by mapping inputs to their canonical forms. We show that there exists an inherent connection between frames and canonical forms. Leveraging this connection, we can efficiently compare the complexity of frames as well as determine the optimality of certain frames. Guided by this principle, we design novel frames for eigenvectors that are strictly superior to existing methods – some are even optimal – both theoretically and empirically. The reduction to the canonicalization perspective further uncovers equivalences between previous methods. These observations suggest that canonicalization provides a fundamental understanding of existing frame-averaging methods and unifies existing equivariant and invariant learning methods.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[8]

Y. Wang, K. Hu, S. Gupta, Z. Ye, Y. Wang and S. Jegelka.
Understanding the Role of Equivariance in Self-supervised Learning.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

Contrastive learning has been a leading paradigm for self-supervised learning, but it is widely observed that it comes at the price of sacrificing useful features (eg colors) by being invariant to data augmentations. Given this limitation, there has been a surge of interest in equivariant self-supervised learning (E-SSL) that learns features to be augmentation-aware. However, even for the simplest rotation prediction method, there is a lack of rigorous understanding of why, when, and how E-SSL learns useful features for downstream tasks. To bridge this gap between practice and theory, we establish an information-theoretic perspective to understand the generalization ability of E-SSL. In particular, we identify a critical explaining-away effect in E-SSL that creates a synergy between the equivariant and classification tasks. This synergy effect encourages models to extract class-relevant features to improve its equivariant prediction, which, in turn, benefits downstream tasks requiring semantic features. Based on this perspective, we theoretically analyze the influence of data transformations and reveal several principles for practical designs of E-SSL. Our theory not only aligns well with existing E-SSL methods but also sheds light on new directions by exploring the benefits of model equivariance. We believe that a theoretically grounded understanding on the role of equivariance would inspire more principled and advanced designs in this field.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[7]

Y. Wang, Y. Wu, Z. Wei, S. Jegelka and Y. Wang.
A Theoretical Understanding of Self-Correction through In-context Alignment.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL GitHub

Abstract

Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step does make a large difference. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[6]

M. Yau, N. Karalias, E. Lu, J. Xu and S. Jegelka.
Are Graph Neural Networks Optimal Approximation Algorithms?
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. URL

Abstract

In this work we design graph neural network architectures that capture optimal approximation algorithms for a large class of combinatorial optimization problems, using powerful algorithmic tools from semidefinite programming (SDP). Concretely, we prove that polynomial-sized message-passing algorithms can represent the most powerful polynomial time algorithms for Max Constraint Satisfaction Problems assuming the Unique Games Conjecture. We leverage this result to construct efficient graph neural network architectures, OptGNN, that obtain high-quality approximate solutions on landmark combinatorial optimization problems such as Max-Cut, Min-Vertex-Cover, and Max-3-SAT. Our approach achieves strong empirical results across a wide range of real-world and synthetic datasets against solvers and neural baselines. Finally, we take advantage of OptGNN’s ability to capture convex relaxations to design an algorithm for producing bounds on the optimal solution from the learned embeddings of OptGNN.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[5]

A. H. Berger, L. Lux, N. Stucki, V. Bürgin, S. Shit, A. Banaszaka, D. Rückert, U. Bauer and J. C. Paetzold.
Topologically faithful multi-class segmentation in medical images.
MICCAI 2024 - 27th International Conference on Medical Image Computing and Computer Assisted Intervention. Marrakesh, Morocco, Oct 06-10, 2024. DOI

Abstract

Topological accuracy in medical image segmentation is a highly important property for downstream applications such as network analysis and flow modeling in vessels or cell counting. Recently, significant methodological advancements have brought well-founded concepts from algebraic topology to binary segmentation. However, these approaches have been underexplored in multi-class segmentation scenarios, where topological errors are common. We propose a general loss function for topologically faithful multi-class segmentation extending the recent Betti matching concept, which is based on induced matchings of persistence barcodes. We project the N-class segmentation problem to N single-class segmentation tasks, which allows us to use 1-parameter persistent homology, making training of neural networks computationally feasible. We validate our method on a comprehensive set of four medical datasets with highly variant topological characteristics. Our loss formulation significantly enhances topological correctness in cardiac, cell, artery-vein, and Circle of Willis segmentation.

MCML Authors

Laurin Lux

C1 | Medicine
→ Group Daniel Rückert

Artificial Intelligence in Healthcare and Medicine

Nico Stucki

A2 | Mathematical Foundations
→ Group Ulrich Bauer

Applied Topology and Geometry

Vincent Bürgin

A3 | Computational Models
→ Group Stefanie Jegelka

Foundations of Deep Neural Networks

Daniel Rückert

Prof. Dr.

C1 | Medicine

Artificial Intelligence in Healthcare and Medicine

Ulrich Bauer

Prof. Dr.

A2 | Mathematical Foundations

Applied Topology and Geometry

[4]

K. Gatmiry, Z. Li, S. J. Reddi and S. Jegelka.
Simplicity Bias via Global Convergence of Sharpness Minimization.
Preprint (Oct. 2024). arXiv

Abstract

The remarkable generalization ability of neural networks is usually attributed to the implicit bias of SGD, which often yields models with lower complexity using simpler (e.g. linear) and low-rank features. Recent works have provided empirical and theoretical evidence for the bias of particular variants of SGD (such as label noise SGD) toward flatter regions of the loss landscape. Despite the folklore intuition that flat solutions are ‘simple’, the connection with the simplicity of the final trained model (e.g. low-rank) is not well understood. In this work, we take a step toward bridging this gap by studying the simplicity structure that arises from minimizers of the sharpness for a class of two-layer neural networks. We show that, for any high dimensional training data and certain activations, with small enough step size, label noise SGD always converges to a network that replicates a single linear feature across all neurons; thereby, implying a simple rank one feature matrix. To obtain this result, our main technical contribution is to show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks. Along the way, we discover a novel property – a local geodesic convexity – of the trace of Hessian of the loss at approximate stationary points on the manifold of zero loss, which links sharpness to the geometry of the manifold. This tool may be of independent interest.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[3]

K. Gatmiry, N. Saunshi, S. J. Reddi, S. Jegelka and S. Kumar.
On the Role of Depth and Looping for In-Context Learning with Task Diversity.
Preprint (Oct. 2024). arXiv

Abstract

The intriguing in-context learning (ICL) abilities of deep Transformer models have lately garnered significant attention. By studying in-context linear regression on unimodal Gaussian data, recent empirical and theoretical works have argued that ICL emerges from Transformers’ abilities to simulate learning algorithms like gradient descent. However, these works fail to capture the remarkable ability of Transformers to learn multiple tasks in context. To this end, we study in-context learning for linear regression with diverse tasks, characterized by data covariance matrices with condition numbers ranging from [1,κ], and highlight the importance of depth in this setting. More specifically, (a) we show theoretical lower bounds of log(κ) (or κ√) linear attention layers in the unrestricted (or restricted) attention setting and, (b) we show that multilayer Transformers can indeed solve such tasks with a number of layers that matches the lower bounds. However, we show that this expressivity of multilayer Transformer comes at the price of robustness. In particular, multilayer Transformers are not robust to even distributional shifts as small as O(e−L) in Wasserstein distance, where L is the depth of the network. We then demonstrate that Looped Transformers – a special class of multilayer Transformers with weight-sharing – not only exhibit similar expressive power but are also provably robust under mild assumptions. Besides out-of-distribution generalization, we also show that Looped Transformers are the only models that exhibit a monotonic behavior of loss with respect to depth.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[2]

T. Putterman, D. Lim, Y. Gelberg, S. Jegelka and H. Maron.
Learning on LoRAs: GL-Equivariant Processing of Low-Rank Weight Spaces for Large Finetuned Models.
Preprint (Oct. 2024). arXiv

Abstract

Low-rank adaptations (LoRAs) have revolutionized the finetuning of large foundation models, enabling efficient adaptation even with limited computational resources. The resulting proliferation of LoRAs presents exciting opportunities for applying machine learning techniques that take these low-rank weights themselves as inputs. In this paper, we investigate the potential of Learning on LoRAs (LoL), a paradigm where LoRA weights serve as input to machine learning models. For instance, an LoL model that takes in LoRA weights as inputs could predict the performance of the finetuned model on downstream tasks, detect potentially harmful finetunes, or even generate novel model edits without traditional training methods. We first identify the inherent parameter symmetries of low rank decompositions of weights, which differ significantly from the parameter symmetries of standard neural networks. To efficiently process LoRA weights, we develop several symmetry-aware invariant or equivariant LoL models, using tools such as canonicalization, invariant featurization, and equivariant layers. We finetune thousands of text-to-image diffusion models and language models to collect datasets of LoRAs. In numerical experiments on these datasets, we show that our LoL architectures are capable of processing low rank weight decompositions to predict CLIP score, finetuning data attributes, finetuning data membership, and accuracy on downstream tasks.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks

[1]

M. Yau, E. Akyürek, J. Mao, J. B. Tenenbaum, S. Jegelka and J. Andreas.
Learning Linear Attention in Polynomial Time.
Preprint (Oct. 2024). arXiv

Abstract

Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machine (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key–value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.

MCML Authors

Stefanie Jegelka

Prof. Dr.

A3 | Computational Models

Foundations of Deep Neural Networks