Independent Benchmarking of Prompt-Based Medical Segmentation Models
Abstract
Medical image segmentation is rapidly shifting toward vision(-language) foundation models that unify diverse modalities and tasks within a single framework. In this work, we systematically benchmark high-impact vision-language and segment-anything-based architectures across multiple clinically relevant CT and MRI tasks. We show that while these models achieve strong performance, each comes with specific advantages and disadvantages. Non-3D models are highly flexible but require substantial user guidance and are prone to over- or under-detection. 3D architectures offer more reliable volumetric consistency overall, but can still suffer from detection problems. Vision-language models appear sensitive to the coverage of their training data, whereas click-prompted SAM-based models are more universal and show some, albeit limited, ability to address zero-shot targets. When tested with more complex text prompts, most vision-language models show a lack of semantic language understanding. Overall, these models hold considerable promise but still exhibit limitations. Our work highlights key areas where future research is needed to advance vision(-language) foundation models.
Preprint
Oct. 2025
Authors
A. C. Erdur • D. Scholz • J. A. Buchner • D. Bernhardt • S. E. Combs • B. Wiestler • D. Rückert • J. C. Peeken
BibTeXKey: ESB+25