24.11.2025
Research Stay at Stanford University
Kun Yuan – Funded by the MCML AI X-Change Program
During my research stay at Stanford University from July to September 2025, I had the pleasure of being part of the research group led by Assistant Professor Serena Yeung in the Department of Biomedical Data Science. My two-month stay in California gave me the opportunity to investigate how public scientific articles can be leveraged to build biomedical vision-language foundation models, culminating in a paper accepted to ML4H. Together with the group, I also laid the groundwork for several projects.

“Stanford Engineering 1925–2025” centennial projection—a reminder of the university’s long tradition of innovation and its vibrant campus atmosphere.
The collaboration between Munich and Serena Yeung’s group at Stanford quickly grew into a unified research effort on biomedical vision-language models, spanning three tightly connected projects. Together we developed a zoom-in approach for scientific figures, explored long-context pretraining from full scientific articles, and probed whether modern VLMs genuinely perceive or simply recall visual patterns. Each project drew directly on Munich’s strength in large-scale scientific data mining and Stanford’s expertise in multimodal modeling and clinical AI, creating a partnership that not only produced concrete research outputs but also set the foundation for deeper joint work moving forward. All of these achievements were made possible through the generous support of the MCML AI X-Change program.

The Stanford Oval on a bright summer day, with the red “S” in full bloom
The Department of Biomedical Data Science at Stanford, and in particular Serena Yeung’s group, is an example of a research environment where machine learning and medicine come together in a very concrete way. The group develops methods that connect modern computer vision and vision-language models with real clinical and biomedical questions: practical needs in healthcare often drive new algorithmic ideas, and new models in turn open up new possibilities for how medical data can be used. Working closely with clinicians, hospitals, and large-scale datasets, the lab sits at the intersection of methodology and application, and it is at the forefront of building multimodal, clinically grounded AI systems that aim not only to perform well on benchmarks but also to make a real impact in medical practice.

The iconic “Red Hoop” fountain near the Stanford Engineering buildings, catching the late-afternoon light as water patterns shimmer in the breeze
Scientific Figures as Multimodal Training Grounds
We built a unified training corpus that links entire figures, their constituent regions, captions, and surrounding text, enabling foundation-style pretraining tailored to biomedical imagery rather than generic vision-language alignment. This data-centric view treats scientific figures as structured documents: global context, panel content, and localized evidence are all captured and supervised jointly, yielding models that transfer more reliably across modalities and tasks.
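To make the structure concrete, the sketch below shows one way such a corpus entry could be organized; the class and field names (FigureRecord, Panel, Region, inline_mentions) are illustrative assumptions rather than the project’s actual schema.

```python
# A minimal sketch (not the project's actual schema) of one corpus entry
# linking a figure, its panels and regions, and the surrounding article text.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Region:
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) pixel coordinates
    description: str                  # localized description of this region

@dataclass
class Panel:
    label: str                        # e.g. "A", "B"
    bbox: Tuple[int, int, int, int]
    summary: str                      # panel-level summary text
    regions: List[Region] = field(default_factory=list)

@dataclass
class FigureRecord:
    image_path: str
    caption: str                      # full figure caption (global context)
    inline_mentions: List[str]        # article sentences that reference the figure
    panels: List[Panel] = field(default_factory=list)
```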
From Panel to Pixel: Learning to Read Biomedical Figures
We developed a zoom-in pretraining strategy that decomposes multi-panel images into semantically meaningful regions paired with hierarchical descriptions. The resulting representation mirrors how experts read a figure: from global context to panel summaries, then down to localized features. This hierarchy provides stronger supervision than coarse image–caption pairs, improving both panel-level retrieval and classification and region-level grounding.
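The sketch below illustrates, under an assumed dict-based annotation format, how zoom-in training pairs could be assembled from a figure, its panels, and their regions; the function name and annotation layout are illustrative, not the exact pipeline used in the project.

```python
# A minimal sketch of assembling hierarchical zoom-in pairs from one figure.
from PIL import Image

def zoom_in_pairs(image_path: str, caption: str, panels: list) -> list:
    """Yield (image, text, level) pairs from figure level down to regions.

    `panels` is assumed to be a list of dicts like:
      {"bbox": (x0, y0, x1, y1), "summary": str,
       "regions": [{"bbox": (x0, y0, x1, y1), "description": str}, ...]}
    """
    figure = Image.open(image_path).convert("RGB")
    pairs = [(figure, caption, "figure")]               # global context
    for panel in panels:
        panel_img = figure.crop(panel["bbox"])
        pairs.append((panel_img, panel["summary"], "panel"))
        for region in panel.get("regions", []):
            region_img = figure.crop(region["bbox"])    # zoom in further
            pairs.append((region_img, region["description"], "region"))
    return pairs
```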

A framed view of Hoover Tower through the massive “Stone River”
Do Vision-Language Models Perceive or Recall?
To probe whether large vision–language models truly perceive or merely recall patterns, we designed controlled visual-illusion and perturbation tests. Models that excel on standard benchmarks showed consistent failures under these perceptual stressors. The results argue that robustness beyond memorization requires fine-grained datasets and localized reasoning pathways, precisely the kind supplied by zoom-in, region-aware pretraining.
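As a rough illustration of the probing logic, the sketch below contrasts a model’s answers before and after a perturbation that changes the correct answer; the `ask` wrapper, the perturbation, and the ground-truth answers are hypothetical placeholders, not the exact protocol of the study.

```python
# A minimal sketch of the perceive-vs-recall probe described above.
def probe_recall(ask, image, question, perturb, gt_original, gt_perturbed):
    """Check whether a VLM updates its answer when the visual evidence changes.

    The perturbation is chosen so that the correct answer changes from
    `gt_original` to `gt_perturbed`. A model that keeps giving `gt_original`
    on the perturbed image is likely recalling a pattern, not perceiving.
    """
    answer_before = ask(image, question)
    answer_after = ask(perturb(image), question)
    return {
        "correct_on_original": answer_before == gt_original,
        "tracks_visual_change": answer_after == gt_perturbed,
        "suspected_recall": answer_before == gt_original and answer_after == gt_original,
    }

# Example with a trivial left/right swap (horizontal flip) as the perturbation:
#   from PIL import Image, ImageOps
#   result = probe_recall(my_vlm_ask, Image.open("panel.png"),
#                         "Is the highlighted lesion on the left or the right?",
#                         ImageOps.mirror, gt_original="left", gt_perturbed="right")
```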

Late-night teamwork in the lab—proof that some of the best ideas (and the best memories) happen long after midnight with good friends and shared excitement for the project.
Beyond the research, living in Stanford for the summer was an experience of its own. The long, warm evenings, the quiet walks through the palm-lined campus, and the weekend trips around the Bay Area made the stay feel both energizing and grounding. I met a group of friends, including Ph.D. students, interns, and visitors from all over the world, who turned everyday moments into memories, whether it was late-night debugging sessions, spontaneous dinners in Palo Alto, or simple conversations on the way back from the lab. The city, the campus, and the people created a sense of community that made the months in California feel remarkably full, and it is something I will carry with me long after the research stay has ended.

I am deeply grateful to the MCML AI X-Change program for making this entire experience possible. Their support not only enabled the research itself, but also gave me the chance to immerse myself in a new environment, build lasting collaborations, and grow both personally and scientifically.
#ai-x-change #blog #navab