
RSCLIP for Training-Free Open-Vocabulary Remote Sensing Image Semantic Segmentation

MCML Authors


Xiaoxiang Zhu

Prof. Dr.

Principal Investigator

Abstract

Open-vocabulary semantic segmentation (OVSS) is revolutionizing the image segmentation domain by overcoming rigid predefined category constraints. However, OVSS methods lack sensitivity to local regions due to the text-image alignment of CLIP, and suffer from significant object scale variations in remote sensing (RS) images. To address these issues, we propose RSCLIP, an RS domain-adaptive, training-free semantic segmentation model built on three key innovations: neighbor-aware patches (NAP), semantic correlation enhancement (SCE), and multi-head multi-scale attention (MMA). Specifically, NAP adaptively aggregates feature information from each patch and its neighbors, while SCE strengthens each patch's attention to patches within the same semantic region, thereby improving local discriminability. Moreover, MMA introduces multi-head multi-scale attention to OVSS tasks, where different heads employ distinct dilation rates to capture features at various scales. Extensive experiments on RS datasets demonstrate that RSCLIP significantly outperforms state-of-the-art methods both quantitatively and qualitatively. In particular, our method achieves superior performance on the LoveDA, Potsdam, and UDD5 datasets while exhibiting strong generalization across multiple datasets. Our method significantly boosts performance without requiring additional data or extra training.
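To make the described modules more concrete, the following is a minimal, hypothetical PyTorch sketch of the two spatial ideas mentioned in the abstract: neighbor-aware aggregation of patch features (in the spirit of NAP) and per-head context gathering at distinct dilation rates (in the spirit of MMA). The function names, pooling-based approximations, and feature shapes are illustrative assumptions and not the authors' implementation, which operates on CLIP's attention maps.

```python
import torch
import torch.nn.functional as F


def neighbor_aware_patches(feat: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    """Illustrative NAP-style aggregation (assumption, not the paper's code):
    each patch feature is blended with its spatial neighbors.
    feat: (B, C, H, W) grid of patch features."""
    pad = kernel_size // 2
    return F.avg_pool2d(feat, kernel_size, stride=1, padding=pad)


def multi_scale_heads(feat: torch.Tensor, dilations=(1, 2, 3)) -> torch.Tensor:
    """Illustrative MMA-style multi-scale context (assumption): each 'head'
    gathers a 3x3 neighborhood at a different dilation rate, approximated
    here by dilated unfolding and averaging."""
    B, C, H, W = feat.shape
    outs = []
    for d in dilations:
        # 3x3 neighborhood with dilation d; padding d keeps the spatial size.
        patches = F.unfold(feat, kernel_size=3, dilation=d, padding=d)  # (B, C*9, H*W)
        patches = patches.view(B, C, 9, H * W).mean(dim=2).view(B, C, H, W)
        outs.append(patches)
    # Fuse the per-head (per-scale) responses; a simple mean is used here.
    return torch.stack(outs, dim=1).mean(dim=1)


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)   # e.g. CLIP patch features reshaped to a grid
    x = neighbor_aware_patches(x)    # NAP-style local aggregation
    y = multi_scale_heads(x)         # MMA-style multi-scale context
    print(y.shape)                   # torch.Size([1, 64, 32, 32])
```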



Preprint

Sep. 2025

Authors

S. Wang • X. Sun • J. Han • X. Zhu

Links

DOI GitHub

Research Area

 C3 | Physics and Geo Sciences

BibTeX Key: WSH+25a
