From Panel to Pixel: Zoom-in Vision-Language Pretraining From Biomedical Scientific Literature

Abstract

There is growing interest in biomedical vision-language models trained on scientific literature. However, most pipelines compress rich multi-panel figures and long captions into coarse figure-level pairs, discarding the fine-grained correspondences clinicians rely on when zooming into local structures. We introduce Panel2Patch, a data pipeline that mines hierarchical structure from multi-panel, marker-heavy biomedical figures and their surrounding text, and converts them into multi-granular supervision. Given figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs aligned image-text pairs at the figure, panel, and region levels, preserving local semantics instead of treating each figure as a single sample. Built on this corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases in a shared embedding space. Applying Panel2Patch to a small subset of literature figures yields substantially better performance than prior pipelines, demonstrating that exploiting hierarchical figure structure can provide more effective supervision with less pretraining data.
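To make the three granularities in the abstract concrete, the following is a minimal sketch of how one parsed figure might be flattened into figure-, panel-, and region-level image-text pairs. It is not the Panel2Patch implementation: the class names (`Figure`, `Panel`, `Region`, `Pair`), the `flatten` function, the bounding-box convention, and the placeholder texts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: (left, top, right, bottom) in figure pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class Region:
    box: Box        # area singled out by a visual marker (assumed already detected)
    phrase: str     # short region-focused phrase tied to that marker


@dataclass
class Panel:
    box: Box
    sub_caption: str
    regions: List[Region] = field(default_factory=list)


@dataclass
class Figure:
    box: Box
    caption: str
    panels: List[Panel] = field(default_factory=list)


@dataclass
class Pair:
    level: str      # "figure", "panel", or "region"
    box: Box        # crop to take from the original figure image
    text: str       # text aligned with that crop


def flatten(figure: Figure) -> List[Pair]:
    """Turn one hierarchical figure into pairs at all three levels."""
    pairs = [Pair("figure", figure.box, figure.caption)]
    for panel in figure.panels:
        pairs.append(Pair("panel", panel.box, panel.sub_caption))
        for region in panel.regions:
            pairs.append(Pair("region", region.box, region.phrase))
    return pairs


# Placeholder example; the texts are invented and carry no clinical meaning.
example = Figure(
    box=(0, 0, 800, 400),
    caption="Full caption of a two-panel figure.",
    panels=[
        Panel(
            box=(0, 0, 400, 400),
            sub_caption="(A) Sub-caption for the left panel.",
            regions=[Region(box=(120, 80, 200, 160), phrase="structure indicated by the arrow")],
        ),
        Panel(box=(400, 0, 800, 400), sub_caption="(B) Sub-caption for the right panel."),
    ],
)

pairs = flatten(example)
for pair in pairs:
    print(pair.level, pair.box, pair.text)
```

Under these assumptions, a single figure contributes one sample per node of its hierarchy instead of one sample in total, which is the sense in which figure structure can yield more supervision from fewer figures. The `level` tag is what a granularity-aware objective could condition on.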

inproceedings YSC+26


CVPR 2026

IEEE/CVF Conference on Computer Vision and Pattern Recognition. Denver, CO, USA, Jun 03-07, 2026. To be published. Preprint available.
A* Conference

Authors

K. Yuan • M. Sun • Z. Chen • A. Lozano • X. He • S. Li • N. Navab • X. Sun • N. Padoy • S. Yeung-Levy

Research Area

 C1 | Medicine

BibTeXKey: YSC+26