07.08.2025


Precise and Subject-Specific Attribute Control in AI Image Generation

MCML Research Insight - With Vincent Tao Hu and Björn Ommer

Text-to-image (T2I) models like Stable Diffusion have become masters at turning prompts like “a happy man and a red car” into vivid, detailed images. But what if you want the man to look just a little older, or the car to appear slightly more luxurious without changing anything else? Until now, that level of subtle, subject-specific control was surprisingly hard.

A new method led by first author Stefan Andreas Baumann, developed in collaboration with co-authors Felix Krause, Michael Neumayr, Nick Stracke, and Melvin Sevi, as well as MCML Junior Member Vincent Tao Hu and MCML PI Björn Ommer, changes this: it provides smooth, precise dials for individual attributes, like age, mood, or color, applied to specific subjects in an image.


«Currently, a fundamental gap exists: no method provides fine-grained modulation and subject-specific localization simultaneously.»


Stefan Andreas Baumann et al.

The Insight: Words as Vectors You Can Gently Push

These models rely on CLIP, whose text encoder turns each word in a prompt into a vector embedding - a numerical representation of meaning in a high-dimensional embedding space. The authors discovered that within this space, you can identify semantic directions corresponding to attributes like older, happier, or more expensive.

Want to make just the man older, not the woman? You can shift the embedding of the word “man” slightly along the “age” direction in the embedding space. The result: only the man changes in the generated image - smoothly and precisely (see Figure 1).
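To make this concrete, here is a minimal sketch (not the authors' released code) of what such a subject-specific edit could look like with the diffusers library: the token-wise CLIP text embeddings of the prompt are computed once, the embedding at the “man” token is shifted along a pre-identified “age” direction, and the modified embeddings are fed to an otherwise unchanged Stable Diffusion pipeline. The checkpoint name, the age_direction.pt file, and the scale value are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

# Any Stable Diffusion checkpoint with a CLIP text encoder works in principle;
# "runwayml/stable-diffusion-v1-5" is just a common choice.
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

prompt = "a photo of a man and a red car"

# Compute the token-wise CLIP text embeddings for the prompt.
tok = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    embeds = pipe.text_encoder(tok.input_ids.to(device))[0]  # shape (1, 77, 768)

# Locate the token for "man" so the edit stays subject-specific.
man_id = pipe.tokenizer.encode("man", add_special_tokens=False)[0]
man_pos = (tok.input_ids[0] == man_id).nonzero()[0].item()

# "age_direction.pt" is a hypothetical file holding a pre-identified semantic
# direction (see the sketch in the next section for one way to obtain it).
# "scale" controls how strongly the attribute is expressed and can be varied
# continuously; 4.0 is an arbitrary example value.
age_direction = torch.load("age_direction.pt").to(device)  # shape (768,)
scale = 4.0
edited = embeds.clone()
edited[0, man_pos] += scale * age_direction

# Generate from the edited embeddings; the diffusion model itself is untouched.
image = pipe(prompt_embeds=edited, num_inference_steps=30).images[0]
image.save("older_man_same_car.png")
```

Varying scale continuously slides the subject between less and more of the attribute, and because only the text embedding changes, generation cost is unaffected.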


Figure 1: The authors augment the prompt input of image generation models with fine-grained control of attribute expression in generated images (unmodified images are marked in green) in a subject-specific manner without additional cost during generation. Previous methods only allow either fine-grained expression control or fine-grained localization when starting from the image generated from a basic prompt.


Two Ways to Find These Directions

  1. Semantic Prompt Differences: The team compares prompts like “man” vs. “old man” and reads off the direction in which the embedding changes - no training required (see the sketch after this list).
  2. Learning from the Model Itself: For more robust control, they generate images from slightly altered prompts, observe how the model’s noise predictions change, and optimize for the direction that causes just that effect - like reverse-engineering the model’s thought process.
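Below is a rough sketch of the first, optimization-free route, simplified from the paper's description; the model name, the prompt pairs, and the choice to average differences at the subject token are illustrative assumptions rather than the authors' exact recipe.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative model choice: the CLIP text encoder used by Stable Diffusion v1.x.
model_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id).eval()

def subject_token_embedding(prompt: str, subject: str = "man") -> torch.Tensor:
    """Return the token-wise CLIP text embedding at the subject token's position."""
    tok = tokenizer(
        prompt,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        embeds = text_encoder(tok.input_ids)[0]  # shape (1, 77, 768)
    subject_id = tokenizer.encode(subject, add_special_tokens=False)[0]
    pos = (tok.input_ids[0] == subject_id).nonzero()[0].item()
    return embeds[0, pos]

# Contrastive prompt pairs for the "age" attribute (illustrative choices).
prompt_pairs = [
    ("a photo of a young man", "a photo of an old man"),
    ("a portrait of a young man", "a portrait of an old man"),
    ("a young man standing in a park", "an old man standing in a park"),
]

# Average the per-pair differences at the subject token so that
# prompt-specific variation largely cancels out.
diffs = [
    subject_token_embedding(pos_prompt) - subject_token_embedding(neg_prompt)
    for neg_prompt, pos_prompt in prompt_pairs
]
age_direction = torch.stack(diffs).mean(dim=0)  # shape (768,)
torch.save(age_direction, "age_direction.pt")
```

The saved vector can then be applied at generation time as in the earlier sketch; the second route instead optimizes a direction against the model's noise predictions, which the authors describe as more robust.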

«Since we only modify the tokenwise CLIP text embedding along pre-identified directions, we enable more fine-grained manipulation at no additional cost in the generation process.»


Stefan Andreas Baumann et al.

Why It’s a Big Deal

This method doesn’t require retraining or modifying the model at all. It plugs right into existing T2I systems, adds zero overhead during generation, and works even on real photos.

It gives creators intuitive, fine-grained control over how things appear, not just what appears. That means more expressive storytelling, more precise edits, and better alignment with human intent.


Interested in Exploring Further?

Check out the full paper, presented at CVPR 2025 - the IEEE/CVF Conference on Computer Vision and Pattern Recognition, an A* conference and one of the highest-ranked venues in AI and machine learning.

S. A. Baumann, F. Krause, M. Neumayr, N. Stracke, M. Sevi, V. T. Hu and B. Ommer.
Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions.
CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, Jun 11-15, 2025. URL. GitHub.
Abstract

In recent years, advances in text-to-image (T2I) diffusion models have substantially elevated the quality of their generated images. However, achieving fine-grained control over attributes remains a challenge due to the limitations of natural language prompts (such as no continuous set of intermediate descriptions existing between “person” and “old person”). Even though many methods were introduced that augment the model or generation process to enable such control, methods that do not require a fixed reference image are limited to either enabling global fine-grained attribute expression control or coarse attribute expression control localized to specific subjects, not both simultaneously. We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models. Based on this observation, we introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts. We demonstrate that these directions can be used to augment the prompt text input with fine-grained control over attributes of specific subjects in a compositional manner (control over multiple attributes of a single subject) without having to adapt the diffusion model.

MCML Authors
Felix Krause - Computer Vision & Learning

Dr. Vincent Tao Hu - Computer Vision & Learning

Prof. Dr. Björn Ommer - Computer Vision & Learning

Explore an overview of the method, visual examples, and detailed explanations at the project website or try the method interactively in your browser.

Project Website
Colab Demo

