19.02.2026

COSMOS – Teaching Vision-Language Models to Look Beyond the Obvious

MCML Research Insight - With Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, and Zeynep Akata

Today’s AI systems are remarkably good at recognizing what stands out most. Yet understanding how details relate to each other remains surprisingly difficult. Imagine asking an AI assistant to describe a crowded street photo. It quickly mentions the car in the foreground — but misses the cyclist waiting at the light or the small sign that changes the meaning of the moment.

 Key Insight


AI often captures the obvious first — but meaning lives in the relationships between details.


MCML Junior Members Sanghwan Kim and Rui Xiao, former members Mariana-Iuliana Georgescu and Stephan Alaniz, and MCML PI Zeynep Akata address this challenge in their work on COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training, presented at CVPR 2025. Their approach explores how AI can learn from both the big picture and fine-grained context to connect images and language more thoughtfully.


The Idea Behind the Paper

Most modern vision–language models learn by comparing entire images with entire captions. This strategy has enabled impressive progress, but it can overlook subtle context. Smaller objects, background actions, or nuanced descriptions may not influence the model’s understanding enough when everything is treated as a single unit.
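To make that concrete: the dominant recipe (popularized by models like CLIP) scores whole images against whole captions with a contrastive objective. The following is a minimal, illustrative sketch of that global loss, assuming PyTorch; names like img_emb are placeholders, not code from the paper.

```python
# Minimal, illustrative global contrastive loss (assumed PyTorch; not the
# paper's exact code). Matching image/caption pairs lie on the diagonal.
import torch
import torch.nn.functional as F

def global_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) embeddings of whole images / captions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)        # caption -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```

Because each image and caption is reduced to a single vector here, a small sign or a background cyclist contributes little to the score, which is exactly the blind spot COSMOS targets.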

COSMOS starts from a simple question: What if AI learned from both the whole scene and the smaller pieces inside it? Instead of focusing only on overall similarity, the research encourages models to explore how individual image regions relate to individual parts of language — shifting from surface matching toward deeper understanding.


 Core Idea


Understanding emerges when models learn to connect parts, not just compare wholes.


Teaching AI to See in Context

Rather than designing a completely new architecture, the researchers rethought how existing models learn. The training setup works a bit like a classroom:

  • A teacher model focuses on stable, high-level understanding.
  • A student model explores finer details by looking at smaller image regions and individual sentences (see the sketch after this list).
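In self-distillation setups of this kind, the teacher is typically not trained by gradient descent but kept as a slowly moving average of the student, which is what makes its targets stable. Here is a minimal sketch of that idea, assuming PyTorch; the momentum value and names are illustrative, not taken from the paper.

```python
# Illustrative EMA ("momentum") teacher update, as commonly used in
# self-distillation (an assumption about the recipe, not the paper's code).
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def update_teacher(teacher: nn.Module, student: nn.Module, momentum: float = 0.999):
    """The teacher slowly tracks the student, providing stable high-level
    targets while the student learns from image crops and single sentences."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

# Typical usage: start the teacher as a frozen copy of the student, then
# call update_teacher(teacher, student) after every optimizer step.
# teacher = copy.deepcopy(student).requires_grad_(False)
```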

A key ingredient is cross-attention, which allows visual information and language to guide each other during training. In simple terms, the model learns to ask which words describe a specific part of an image — and which visual details support a sentence. You can think of it like watching a film with subtitles, where each line of dialogue connects to a precise moment on screen.
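In code, the mechanism behind this is ordinary cross-attention, with text tokens as queries and image patches as keys and values. The sketch below shows the generic building block, assuming PyTorch; the class and its shapes are illustrative, not the paper's exact module.

```python
# Generic cross-attention block (assumed PyTorch; an illustrative building
# block, not the paper's exact module).
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Text tokens act as queries over image patches: each word effectively
    asks 'which parts of the image support me?'"""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, image_patches):
        # query = text, key/value = image; the returned weights are the
        # word-to-patch attention maps (the kind visualized in Figure 2).
        out, weights = self.attn(text_tokens, image_patches, image_patches)
        return out, weights

# Example: 2 captions of 12 tokens attending over 196 image patches (dim 512).
block = CrossAttention(dim=512)
fused, maps = block(torch.randn(2, 12, 512), torch.randn(2, 196, 512))
print(maps.shape)  # torch.Size([2, 12, 196]): one map per word over patches
```

A useful side effect of this design is interpretability: the attention weights themselves show which image regions each word relies on.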


Figure 1: An overview of COSMOS. Left: The VLM pre-training mechanism is based on the student-teacher framework with contrastive loss for multi-modal alignment, and cross-modality self-distillation loss for fine-grained representation learning. Right: The architecture of the student and teacher model with cross-attention modules that extract cross-modality information from the student.
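The two losses in Figure 1 can be read as one combined objective: a contrastive term that aligns the modalities globally, plus a self-distillation term that pulls the student's embeddings toward the teacher's stable targets. The sketch below shows that structure only; the weighting, the exact distillation loss, and the names are assumptions, and the precise formulation is in the paper.

```python
# Illustrative combination of the two training signals in Figure 1
# (assumed PyTorch; weights, names, and the exact distillation form are
# placeholders, not the paper's formulation).
import torch
import torch.nn.functional as F

def combined_loss(s_img, s_txt, t_img, t_txt, distill_weight=1.0, tau=0.07):
    """s_*: student embeddings, t_*: teacher embeddings, all (batch, dim)."""
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)

    # (1) Contrastive alignment: matching image/caption pairs on the diagonal.
    logits = s_img @ s_txt.t() / tau
    labels = torch.arange(s_img.size(0), device=s_img.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))

    # (2) Cross-modality self-distillation: maximize cosine similarity between
    # the student's view of one modality and the teacher's view of the other.
    # Teacher targets are detached so gradients flow only through the student.
    distill = -((s_img * t_txt.detach()).sum(-1).mean()
                + (s_txt * t_img.detach()).sum(-1).mean())

    return contrastive + distill_weight * distill
```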


Key Findings & Results

This change in training strategy led to noticeable improvements. Models trained with COSMOS became better at recognizing subtle details and distinguishing scenes that look similar but carry different meanings. Importantly, these gains did not come from simply scaling model size or adding more data — they came from improving how the model learns.


Figure 2: Qualitative results. Visualization of attention maps in the cross-attention modules. Attention weights are normalized between 0 and 1.


 Takeaway


Sometimes innovation in AI is not about building bigger systems — but about helping them learn to pay attention.


Why This Matters

As AI systems become part of everyday tools — from accessibility features and search engines to creative applications — understanding context becomes increasingly important. Missing small visual cues can lead to inaccurate descriptions or misleading outputs.

COSMOS reflects a broader shift in AI research: progress is not only about building larger models, but also about designing smarter ways for machines to learn. By helping AI connect images and language more precisely, this work moves technology toward more reliable and nuanced understanding.


Further Reading & Reference

If you would like to learn more about how COSMOS connects visual details and language through smarter training strategies, you can explore the full paper presented at CVPR 2025 — one of the leading international conferences in computer vision.

A* Conference
S. Kim, R. Xiao, M.-I. Georgescu, S. Alaniz and Z. Akata.
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025.

Share Your Research!


Get in touch with us!

Are you an MCML Junior Member and interested in showcasing your research on our blog?

We’re happy to feature your work—get in touch with us to present your paper.
