Monocular height estimation is among the most efficient and cost-effective approaches for three-dimensional perception in remote sensing, and it has gained significant attention with the advent of deep learning. However, training neural networks for this task requires a large volume of annotated data, while high-quality labels are scarce and typically available only in developed regions. Models trained solely on such data often exhibit limited generalizability, constraining their applicability at large scales. In this work, we address this limitation for the first time by incorporating imperfect labels from out-of-domain regions into the training of pixel-wise height estimation networks for buildings. These labels may be incomplete, inexact, or inaccurate relative to high-quality annotations. We introduce an ensemble-based pipeline that can be integrated with any monocular height estimation network. To cope with the challenges posed by noisy labels, domain shifts, and the long-tailed distribution of height values, we design the architecture and loss functions to exploit the information embedded in imperfect labels. This is achieved through weak supervision with balanced soft losses and ordinal constraints. Extensive experiments are conducted on two datasets with different spatial resolutions: DFC23 (0.5–1 m) and GBH (3 m). The proposed pipeline demonstrates more balanced performance across domains, reducing the average root mean square error by up to 22.94% on DFC23 and 18.62% on GBH compared with baseline methods. The contribution of each design component is further verified through ablation studies.
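For readers unfamiliar with the reported metric, the following is a minimal sketch of how a root mean square error and a relative reduction against a baseline can be computed. It is an illustration only: the function names `rmse` and `relative_reduction` and all height values are hypothetical, and the abstract does not specify how the average across domains is formed, so this is not the authors' evaluation code.

```python
import math


def rmse(pred, target):
    """Root mean square error over paired height values (in metres)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_rmse, new_rmse):
    """Percentage by which new_rmse is lower than baseline_rmse."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Hypothetical per-pixel building heights, made up for illustration.
target = [3.0, 6.0, 12.0, 30.0]
baseline_pred = [5.0, 8.0, 10.0, 22.0]
improved_pred = [4.0, 7.0, 11.0, 26.0]

baseline_rmse = rmse(baseline_pred, target)
improved_rmse = rmse(improved_pred, target)
reduction = relative_reduction(baseline_rmse, improved_rmse)
print(f"baseline={baseline_rmse:.3f} m, improved={improved_rmse:.3f} m, "
      f"reduction={reduction:.2f}%")
```

Under this reading, a figure such as 22.94% means the RMSE of the proposed pipeline is that fraction lower than the baseline RMSE on the same evaluation set.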
article CSZ26
BibTeXKey: CSZ26