Earth observation (EO) data encompass a vast range of remotely sensed information, including multisensor and multitemporal data, and play an indispensable role in understanding our planet’s dynamics. Recently, vision–language models (VLMs) have achieved remarkable success in perception and reasoning tasks, bringing new insights and opportunities to the EO field. However, their potential for EO applications, especially scientific regression applications, remains largely unexplored. This article addresses that gap by systematically examining the challenges and opportunities of adapting VLMs for EO regression tasks (see Figure 1). The discussion first contrasts the distinctive properties of EO data with those of conventional computer vision (CV) datasets, then identifies four core obstacles in applying VLMs to EO regression: 1) the absence of dedicated benchmarks, 2) the mismatch between discrete and continuous representations, 3) error accumulation, and 4) the suboptimal nature of text-centric training objectives for numerical tasks. Next, a series of methodological insights and subtle potential pitfalls are explored. Finally, the article outlines promising future directions for designing robust, domain-aware solutions. The findings highlight the promise of VLMs for scientific regression in EO, setting the stage for more precise and interpretable modeling of critical environmental processes.
BibTeXKey: XZ25