Recent advances in multimodal large language models (MLLMs) offer new opportunities for Earth observation (EO) tasks by enhancing reasoning and analysis capabilities. However, fair and systematic evaluation of these models remains challenging. Existing assessments often suffer from dataset biases, which can lead to overestimation of model performance and inconsistent comparisons across MLLMs. To address this issue, we introduce ChatEarthBench, a comprehensive benchmark dataset specifically designed for zero-shot evaluation of MLLMs in EO. ChatEarthBench comprises 10 image-text datasets spanning three data modalities. Importantly, these datasets are unseen by the MLLMs evaluated in our work, enabling rigorous and fair zero-shot evaluation across diverse real-world EO tasks. By systematically analyzing MLLM performance across these tasks, we provide critical insights into the models' capabilities and limitations. Our findings offer essential guidance for developing more robust and generalizable MLLMs for EO applications.