Robust and reliable medical diagnosis using artificial intelligence is crucial, yet real-world clinical environments present significant challenges due to dynamic covariate shifts affecting multi-modal data (images, text, tabular records). Existing methods, including the single-modal robust classifier LaDiNE, often fail under these complex, multi-modal shifts, lacking mechanisms for cross-modal invariance, dynamic modality fusion, and fine-grained uncertainty attribution. To address this gap, we propose DyMoLaDiNE (Dynamic Multi-Modal Latent Diffusion Nested-Ensembles), a framework designed for reliable medical diagnosis under dynamic multi-modal covariate shifts. DyMoLaDiNE introduces four key innovations: (1) a Cross-Modal Invariant Feature Extractor leveraging multi-modal Vision Transformers and contrastive learning to derive robust latent representations, (2) a Dynamic Modality Weighting Mechanism that adaptively adjusts modality contributions based on instance-specific reliability scores, (3) a Robust Multi-Modal Diffusion Ensemble utilizing conditional diffusion models conditioned on multi-modal inputs and reliability scores for flexible, calibrated density estimation, and (4) Modality-Attributed Uncertainty Quantification to decompose predictive uncertainty by input source. Extensive evaluations on diverse datasets (MedMD&RadMD, MultiCaRe, PadChest, TCIA RE-MIND, BRaTS, Camelyon16, PANDA) demonstrate that DyMoLaDiNE significantly outperforms (p < 0.005) state-of-the-art methods (LDM, CMCL, CGMCL, CIIM, DTTL, FFL, ALDM, LaDiNE) in classification accuracy, robustness under dynamic perturbations, confidence calibration (ECE), and precise uncertainty quantification (CPIW, CNPV), while providing superior modality attribution fidelity. Ablation studies confirm the necessity of each component. DyMoLaDiNE represents a significant advancement in trustworthy, robust multi-modal medical AI. Code supporting this study (DyMoLaDiNE) is available.
Article — BibTeX key: IZK+25
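The Dynamic Modality Weighting Mechanism described in the abstract fuses modality-specific latent features using instance-specific reliability scores. The abstract does not give the weighting formula, so the sketch below is a minimal illustration under an assumed design: reliability scores are softmax-normalized into fusion weights, and the fused representation is their weighted sum. The function name `fuse_modalities` and the softmax choice are hypothetical, not taken from the paper.

```python
import numpy as np

def fuse_modalities(features, reliability):
    """Reliability-weighted fusion of per-modality latent features.

    features    : dict mapping modality name -> latent vector (all same dim)
    reliability : dict mapping modality name -> instance-specific reliability score
    Returns the fused latent vector and the per-modality weights.

    NOTE: softmax normalization of reliability scores is an assumption made
    for illustration; the paper only states that modality contributions are
    adaptively adjusted based on instance-specific reliability.
    """
    mods = list(features)
    scores = np.array([reliability[m] for m in mods], dtype=float)
    # Softmax turns raw reliability scores into weights that sum to 1,
    # so an unreliable modality (e.g. a corrupted image) is down-weighted.
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    weights = exp / exp.sum()
    fused = sum(w * features[m] for w, m in zip(weights, mods))
    return fused, dict(zip(mods, weights))
```

With equal reliability scores the mechanism reduces to uniform averaging; as one modality's score grows, its weight approaches 1 and the fusion degrades gracefully toward a single-modality prediction.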