He leads the Clinical Data Science group at the Department of Radiology at LMU Munich.
His team applies advanced statistics, machine learning, and computer vision to clinical radiology, enabling fast and precise AI-supported diagnosis and prognostication. Its research focuses on computer vision for radiological diagnosis and prognosis and on biostatistical methods for the rigorous analysis of clinical data. Further work leverages large language models for clinical text analysis and develops multimodal deep learning models that integrate diverse data types, such as imaging and text, to improve the accuracy and applicability of AI models.
Purpose: To compare contrast media opacification and diagnostic quality in lower-extremity runoff CT angiography (CTA) between bolus tracking with a conventional fixed trigger delay and with a patient-specific individualized post-trigger delay.
Methods: In this prospective study, lower-extremity runoff CTA was performed in two cohorts, using either a fixed or an individualized trigger delay. Both cohorts had identical CT protocols, contrast media applications, and image reconstructions. Objective image quality (mean contrast opacification in HU) and subjective image quality (5-point Likert scale) were assessed in six vessels: abdominal aorta (AA), common iliac artery (CIA), superficial femoral artery (SFA), popliteal artery (PA), posterior tibial artery (PTA), and dorsalis pedis artery (DPA), by one rater for objective and two raters for subjective image quality. Objective image quality was analyzed using Student's t-tests, while subjective ratings were compared with Fisher's exact test.
Results: Overall, 65 patients were included (mean age: 71 ± 14 years; 39 men), 35 in the individualized cohort and 30 in the fixed cohort. No differences were found between the groups regarding demographics or radiation exposure. The individualized trigger delay ranged from 2 to 23 s (mean: 8.7 ± 4.0 s), compared with the 10 s delay used in the fixed cohort. The individualized cohort showed higher opacification in the peripheral arteries (PTA: 479 ± 140 HU vs. 379 ± 106 HU; p = 0.009; DPA: 477 ± 191 HU vs. 346 ± 137 HU; p = 0.009). Overall subjective image quality was rated higher in the individualized group ("excellent" or "good" in Rater 1: 97% vs. 57%; p < 0.001; and Rater 2: 89% vs. 53%; p = 0.002).
Conclusion: Individualized post-trigger delay enhances diagnostic quality in lower-extremity runoff CTA by improving vessel opacification in peripheral arteries and increasing subjective image quality.
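As a hedged illustration of the statistical comparisons named in the methods (not the authors' analysis code), the two tests could be run with SciPy as follows; all values are placeholders loosely modeled on the reported summary statistics.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder opacification values (HU) for one vessel per cohort,
# loosely modeled on the reported PTA means and standard deviations.
hu_individualized = rng.normal(479, 140, size=35)
hu_fixed = rng.normal(379, 106, size=30)

# Two-sample Student's t-test on objective image quality.
t_stat, p_ttest = stats.ttest_ind(hu_individualized, hu_fixed)

# Fisher's exact test on dichotomized subjective ratings
# ("excellent"/"good" vs. worse); rows are the two cohorts.
table = [[34, 1],    # individualized cohort (hypothetical counts)
         [17, 13]]   # fixed cohort (hypothetical counts)
odds_ratio, p_fisher = stats.fisher_exact(table)

print(f"t-test p = {p_ttest:.4f}, Fisher p = {p_fisher:.4f}")
```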
This study investigates the capability of radiomics to predict programmed cell death ligand 1 (PD-L1) expression status (≥1%) in non-small cell lung cancer (NSCLC) patients using a newly collected [18F]FDG PET/CT dataset. We aimed to replicate and validate the radiomics-based machine learning (ML) model proposed by Zhao et al. [2] for predicting PD-L1 status from PET/CT imaging.
An independent cohort of 254 NSCLC patients underwent [18F]FDG PET/CT imaging, with primary tumor segmentation conducted using lung tissue window (LTW) and more conservative soft tissue window (STW) methods. The radiomics models ("Rad-score" and "complex model") and the clinical-stage model from Zhao et al. were evaluated via 10-fold cross-validation and AUC analysis, alongside a benchmark study comparing different ML model pipelines. Clinicopathological data were collected from medical records.
On our data, the Rad-score model yielded mean AUCs of 0.593 (STW) and 0.573 (LTW), below Zhao et al.'s 0.761. The complex model achieved mean AUCs of 0.505 (STW) and 0.519 (LTW), lower than Zhao et al.'s 0.769. The clinical model showed a mean AUC of 0.555, below Zhao et al.'s 0.64. All models performed significantly worse than reported by Zhao et al. Our benchmark study of four ML pipelines revealed consistently low performance across all configurations.
Our study failed to replicate the original findings, indicating poor model generalizability and calling into question the predictive value of radiomics features for classifying PD-L1 expression from PET/CT imaging. These results highlight the challenges of replicating radiomics-based ML models and stress the need for rigorous validation.
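For context, the evaluation protocol described above (10-fold cross-validation with AUC) can be reproduced in outline with scikit-learn; the feature matrix, labels, and logistic-regression stand-in below are assumptions for illustration, not the published "Rad-score" pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder radiomics features and binary PD-L1 labels (>=1% vs. <1%).
rng = np.random.default_rng(0)
X = rng.normal(size=(254, 100))
y = rng.integers(0, 2, size=254)

# Simple stand-in model; the replicated models differ in detail.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC = {aucs.mean():.3f} (SD {aucs.std():.3f})")
```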
Objectives: Adenomatous colorectal polyps require endoscopic resection, as opposed to non-adenomatous hyperplastic colorectal polyps. This study aims to evaluate the effect of artificial intelligence (AI)-assisted differentiation of adenomatous and non-adenomatous colorectal polyps at CT colonography on radiologists’ therapy management.
Materials and methods: Five board-certified radiologists retrospectively evaluated CT colonography images with colorectal polyps of all sizes and morphologies and decided whether the depicted polyps required endoscopic resection. After a primary unassisted reading based on current guidelines, a second reading was performed with access to the classification of a radiomics-based random forest AI model labelling each polyp as 'non-adenomatous' or 'adenomatous'. Performance was evaluated using polyp histopathology as the reference standard.
Results: 77 polyps in 59 patients, comprising 118 polyp image series (47% supine position, 53% prone position), were evaluated unassisted and AI-assisted by five independent board-certified radiologists, resulting in a total of 1180 readings (subsequent polypectomy: yes or no). AI-assisted readings had higher accuracy (unassisted vs. AI-assisted: 76% ± 1% vs. 84% ± 1%), sensitivity (78% ± 6% vs. 85% ± 1%), and specificity (73% ± 8% vs. 82% ± 2%) in selecting polyps eligible for polypectomy (p < 0.001). Inter-reader agreement was improved in the AI-assisted readings (Fleiss' kappa 0.69 vs. 0.92).
Conclusion: AI-based characterisation of colorectal polyps at CT colonography as a second reader might enable a more precise selection of polyps eligible for subsequent endoscopic resection. However, further studies are needed to confirm this finding and histopathologic polyp evaluation is still mandatory.
Age estimations are relevant for pre-trial detention, sentencing in criminal cases, and as part of the evaluation in asylum processes to protect the rights and privileges of minors. No current method can determine an exact chronological age due to individual variations in biological development. This study seeks to develop a validated statistical model for estimating age relative to key legal thresholds (15, 18, and 21 years) based on skeletal (CT clavicle, radiography hand/wrist, or MR knee) and dental (radiography third molar) developmental stages. The model is based on 34 scientific studies, divided into examinations of the hand/wrist (15 studies), clavicle (5 studies), distal femur (4 studies), and third molars (10 studies). In total, data from approximately 27,000 individuals have been incorporated, and the model has subsequently been validated with data from 5,000 individuals. The core framework is built upon transition analysis and is further developed with a combination of parametric bootstrapping and Bayesian theory. Validation on independent datasets of individuals with known ages shows high precision, with separate populations aligning closely with the model's predictions. Practical use of the complex statistical model requires a user-friendly tool that provides probabilities together with margins of error. The assessment based on the model forms the medical component of the overall evaluation of an individual's age.
We address the computational barrier of deploying advanced deep learning segmentation models in clinical settings by studying the efficacy of network compression through tensor decomposition. We propose a post-training Tucker factorization that enables the decomposition of pre-existing models to reduce computational requirements without impeding segmentation accuracy. We applied Tucker decomposition to the convolutional kernels of the TotalSegmentator (TS) model, an nnU-Net model trained on a comprehensive dataset for automatic segmentation of 117 anatomical structures. Our approach reduced the floating-point operations (FLOPs) and memory required during inference, offering an adjustable trade-off between computational efficiency and segmentation quality. This study utilized the publicly available TS dataset, employing various downsampling factors to explore the relationship between model size, inference speed, and segmentation performance. The application of Tucker decomposition to the TS model substantially reduced the model parameters and FLOPs across various compression rates, with limited loss in segmentation accuracy. We removed up to 88% of the model’s parameters with no significant performance changes in the majority of classes after fine-tuning. Practical benefits varied across different graphics processing unit (GPU) architectures, with more distinct speed-ups on less powerful hardware. Post-hoc network compression via Tucker decomposition presents a viable strategy for reducing the computational demand of medical image segmentation models without substantially sacrificing accuracy. This approach enables the broader adoption of advanced deep learning technologies in clinical practice, offering a way to navigate the constraints of hardware capabilities.
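To make the compression step concrete, the sketch below shows a Tucker-2 decomposition of a 2D convolutional kernel along its channel modes via truncated HOSVD, replacing one convolution with three smaller ones. It is a generic PyTorch illustration of the technique (assuming groups=1 and 2D kernels), not the TotalSegmentator/nnU-Net implementation, which uses 3D convolutions and a fine-tuning stage.

```python
import torch
import torch.nn as nn

def mode_unfold(t: torch.Tensor, mode: int) -> torch.Tensor:
    # Classical mode-n unfolding: move `mode` to the front, flatten the rest.
    return t.movedim(mode, 0).reshape(t.shape[mode], -1)

def tucker2_decompose_conv(conv: nn.Conv2d, r_out: int, r_in: int) -> nn.Sequential:
    # Tucker-2 via truncated HOSVD along the output/input channel modes.
    W = conv.weight.data                                                       # (C_out, C_in, k, k)
    U = torch.linalg.svd(mode_unfold(W, 0), full_matrices=False).U[:, :r_out]  # (C_out, r_out)
    V = torch.linalg.svd(mode_unfold(W, 1), full_matrices=False).U[:, :r_in]   # (C_in, r_in)
    core = torch.einsum("oikl,or,is->rskl", W, U, V)                           # (r_out, r_in, k, k)

    # Replace the layer by 1x1 -> kxk (core) -> 1x1 convolutions.
    first = nn.Conv2d(conv.in_channels, r_in, 1, bias=False)
    mid = nn.Conv2d(r_in, r_out, conv.kernel_size, stride=conv.stride,
                    padding=conv.padding, bias=False)
    last = nn.Conv2d(r_out, conv.out_channels, 1, bias=conv.bias is not None)
    first.weight.data = V.t().reshape(r_in, conv.in_channels, 1, 1)
    mid.weight.data = core
    last.weight.data = U.reshape(conv.out_channels, r_out, 1, 1)
    if conv.bias is not None:
        last.bias.data = conv.bias.data
    return nn.Sequential(first, mid, last)
```

The k×k convolution now maps r_in to r_out channels instead of C_in to C_out, which is where the parameter and FLOP savings come from; the ranks set the trade-off between compression and segmentation quality described above.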
This study evaluates the clinical value of a deep learning–based artificial intelligence (AI) system that performs rapid brain volumetry with automatic lobe segmentation and age- and sex-adjusted percentile comparisons.
Methods: Fifty-five patients—17 with Alzheimer’s disease (AD), 18 with frontotemporal dementia (FTD), and 20 healthy controls—underwent cranial magnetic resonance imaging scans. Two board-certified neuroradiologists (BCNR), two board-certified radiologists (BCR), and three radiology residents (RR) assessed the scans twice: first without AI support and then with AI assistance.
Results: AI significantly improved diagnostic accuracy for AD (area under the curve without AI: 0.800, with AI: 0.926, p < 0.05), with more correct diagnoses (p < 0.01) and fewer errors (p < 0.03). BCR and RR showed notable performance gains (BCR: p < 0.04; RR: p < 0.02). For the diagnosis of FTD, overall consensus (p < 0.01), BCNR (p < 0.02), and BCR (p < 0.05) recorded significantly more correct diagnoses.
Discussion: AI-assisted volumetry improves diagnostic performance in differentiating AD and FTD, benefiting all reader groups, including BCNR.
In this multi-center study, we proposed a structured reporting (SR) framework for non-small cell lung cancer (NSCLC) and developed a software-assisted tool to automatically translate image-based findings and annotations into TNM classifications. The aim of this study was to validate the software-assisted SR tool for NSCLC, assess its potential clinical impact in a proof-of-concept study, and evaluate current reporting standards in participating institutions.
Background: Medical coding of radiology reports is essential for good quality of care and correct billing, but at the same time it is a laborious and error-prone task.
Objective: To evaluate the applicability of natural language processing (NLP) for ICD-10 coding of German-language radiology reports by fine-tuning suitable language models.
Materials and methods: In this retrospective study, all magnetic resonance imaging (MRI) reports of our institution between 2010 and 2020 were considered. The ICD-10 codes at discharge were assigned to the respective reports to create a dataset for multiclass classification. Fine-tuning of GermanBERT and flanT5 was performed on the full dataset (dstotal) with 1,035 different ICD-10 codes and on two reduced datasets containing the 100 (ds100) and 50 (ds50) most frequent codes. Model performance was evaluated with top-k accuracy for k = 1, 3, 5. In an ablation study, both models were additionally trained on the associated metadata and on the report text alone.
Results: The full dataset consisted of 100,672 radiology reports, the reduced datasets ds100 of 68,103 and ds50 of 52,293 reports. Model performance increased when several of the model's top predictions were taken into account, when the number of target classes was reduced, and when the metadata were combined with the report text. FlanT5 outperformed GermanBERT across all datasets and metrics and is best suited as a medical coding assistant, achieving a top-3 accuracy of almost 70% on the realistic dataset dstotal.
Conclusion: Fine-tuning language models promises reliable prediction of ICD-10 codes from German radiology MRI reports in various scenarios. As a coding assistant, flanT5 can help medical coders make informed decisions and potentially reduce their workload.
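The top-k accuracy used for evaluation is straightforward to compute from classifier logits; the following PyTorch snippet is a generic illustration with placeholder shapes, not the GermanBERT/flanT5 evaluation code.

```python
import torch

def top_k_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    # logits: (n_samples, n_classes); targets: (n_samples,) class indices.
    topk = logits.topk(k, dim=1).indices                # k best predictions per sample
    hits = (topk == targets.unsqueeze(1)).any(dim=1)    # true if target is among top k
    return hits.float().mean().item()

# Placeholder: 8 reports, 1035 ICD-10 classes (as in dstotal).
logits = torch.randn(8, 1035)
targets = torch.randint(0, 1035, (8,))
for k in (1, 3, 5):
    print(f"top-{k} accuracy: {top_k_accuracy(logits, targets, k):.3f}")
```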
Radiation-induced pneumonitis (RP), diagnosed 6–12 weeks after treatment, is a complication of lung tumor radiotherapy. So far, clinical and dosimetric parameters have not been reliable in predicting RP. We propose using functional parameters from non-contrast-enhanced magnetic resonance imaging (MRI), acquired over the treatment course, to stratify patients for improved follow-up.
Medical domain applications require a detailed understanding of the decision making process, in particular when data-driven modeling via machine learning is involved, and quantifying uncertainty in the process adds trust and interpretability to predictive models. However, current uncertainty measures in medical imaging are mostly monolithic and do not distinguish between different sources and types of uncertainty. In this paper, we advocate the distinction between so-called aleatoric and epistemic uncertainty in the medical domain and illustrate its potential in clinical decision making for the case of PET/CT image classification.
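A common way to operationalize this distinction, shown here as a generic sketch rather than the paper's exact estimator, is the entropy decomposition over stochastic forward passes (e.g., MC dropout or deep ensembles): total predictive entropy splits into an expected aleatoric entropy plus an epistemic mutual-information term.

```python
import numpy as np

def entropy(p: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
    return -(p * np.log(p + eps)).sum(axis=axis)

def decompose_uncertainty(probs: np.ndarray):
    # probs: (n_mc_samples, n_inputs, n_classes), e.g., from MC dropout.
    mean_p = probs.mean(axis=0)
    total = entropy(mean_p)                   # total predictive uncertainty
    aleatoric = entropy(probs).mean(axis=0)   # expected data uncertainty
    epistemic = total - aleatoric             # mutual information (model uncertainty)
    return total, aleatoric, epistemic

# Placeholder: 20 stochastic forward passes, 4 inputs, 3 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(20, 4))
print(decompose_uncertainty(probs))
```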
Objectives: To assess the quality of simplified radiology reports generated with the large language model (LLM) ChatGPT and to discuss challenges and opportunities of ChatGPT-like LLMs for medical text simplification.
Methods: In this exploratory case study, a radiologist created three fictitious radiology reports, which we simplified by prompting ChatGPT with 'Explain this medical report to a child using simple language.' In a questionnaire, we asked 15 radiologists to rate the quality of the simplified radiology reports with respect to their factual correctness, completeness, and potential harm for patients. We used Likert scale analysis and inductive free-text categorization to assess the quality of the simplified reports.
Results: Most radiologists agreed that the simplified reports were factually correct, complete, and not potentially harmful to the patient. Nevertheless, instances of incorrect statements, missed relevant medical information, and potentially harmful passages were reported.
Conclusion: While we see a need for further adaptation to the medical field, the initial insights of this study indicate tremendous potential in using LLMs like ChatGPT to improve patient-centered care in radiology and other medical domains.
Clinical relevance statement: Patients have started to use ChatGPT to simplify and explain their medical reports, which is expected to affect patient-doctor interaction. This phenomenon raises several opportunities and challenges for clinical routine.
Machine learning can address limitations in radiology where traditional methods fall short, as shown by this work’s focus on two clinical problems: differentiating premalignant from benign colorectal polyps and continuous age prediction through clavicle ossification in CT scans. For colorectal polyps, a random forest classifier and CNN models enabled non-invasive differentiation between benign and premalignant types in CT colonography, potentially supporting more precise cancer prevention. For age assessment, a deep learning model trained on automatically detected clavicle regions achieved superior accuracy compared to human estimates, demonstrating machine learning’s potential to enhance radiological diagnostics in complex cases. (Shortened).
Undersampling is a common method in Magnetic Resonance Imaging (MRI) to subsample the number of data points in k-space, reducing acquisition times at the cost of decreased image quality. A popular approach is to employ undersampling patterns following various strategies, e.g., variable density sampling or radial trajectories. In this work, we propose a method that directly learns undersampling masks from data, thereby also providing task- and domain-specific patterns. To solve the resulting discrete optimization problem, we propose a general optimization routine called ProM: a fully probabilistic, differentiable, versatile, and model-free framework for mask optimization that enforces acceleration factors through a convex constraint. Analyzing knee, brain, and cardiac MRI datasets with our method, we discover that different anatomic regions reveal distinct optimal undersampling masks, demonstrating the benefits of custom masks tailored to a downstream task. For example, ProM can create undersampling masks that maximize performance in downstream tasks like segmentation with networks trained on fully sampled MRIs. Even with extreme acceleration factors, ProM yields reasonable performance while being more versatile than existing methods, paving the way for data-driven all-purpose mask generation.
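A rough, hypothetical sketch of the underlying idea (not the published ProM code): parameterize one inclusion probability per k-space row, draw differentiable relaxed-Bernoulli masks during training, and rescale the probabilities so their mean matches the sampling budget implied by the acceleration factor. ProM itself enforces the budget through a convex constraint rather than the simple rescaling used here.

```python
import torch

class LearnableMask(torch.nn.Module):
    """Hypothetical 1D Cartesian mask with one inclusion probability per k-space row."""

    def __init__(self, n_rows: int, acceleration: float, temperature: float = 0.5):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_rows))
        self.budget = 1.0 / acceleration   # target fraction of sampled rows
        self.temperature = temperature

    def probs(self) -> torch.Tensor:
        p = torch.sigmoid(self.logits)
        # Rescale so mean(p) matches the budget; ProM instead enforces the
        # acceleration factor exactly via a convex constraint.
        return (p * self.budget / p.mean()).clamp(1e-6, 1 - 1e-6)

    def forward(self) -> torch.Tensor:
        # Differentiable relaxed-Bernoulli (Concrete) sample of the mask.
        p = self.probs()
        u = torch.rand_like(p).clamp(1e-6, 1 - 1e-6)
        logistic_noise = torch.log(u) - torch.log1p(-u)
        return torch.sigmoid((torch.logit(p) + logistic_noise) / self.temperature)

mask = LearnableMask(n_rows=320, acceleration=8.0)
soft_mask = mask()  # values in (0, 1); multiply row-wise with k-space data
# A downstream loss (reconstruction or segmentation error on the undersampled
# data) is then backpropagated into mask.logits jointly with the network.
```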
Background: Radiological age assessment using reference studies is inherently limited in accuracy due to a finite number of assignable skeletal maturation stages. To overcome this limitation, we present a deep learning approach for continuous age assessment based on clavicle ossification in computed tomography (CT).
Methods: Thoracic CT scans were retrospectively collected from the picture archiving and communication system. Individuals aged 15.0 to 30.0 years examined in routine clinical practice were included. All scans were automatically cropped around the medial clavicular epiphyseal cartilages. A deep learning model was trained to predict a person’s chronological age based on these scans. Performance was evaluated using mean absolute error (MAE). Model performance was compared to an optimistic human reader performance estimate for an established reference study method.
Results: The deep learning model was trained on 4,400 scans of 1,935 patients (training set: mean age = 24.2 ± 4.0 years, 1,132 female) and evaluated on 300 scans of 300 patients with a balanced age and sex distribution (test set: mean age = 22.5 ± 4.4 years, 150 female). The model MAE was 1.65 years; the highest absolute error was 6.40 years for females and 7.32 years for males, although some of these outliers could be attributed to norm variants or pathologic disorders. The human reader estimate MAE was 1.84 years, and the highest absolute error was 3.40 years for females and 3.78 years for males.
Conclusions: We present a deep learning approach for continuous age prediction from CT volumes highlighting the medial clavicular epiphyseal cartilage, with performance comparable to the human reader estimate.
Optimizing a machine learning (ML) pipeline for radiomics analysis involves numerous choices in data set composition, preprocessing, and model selection. Objective identification of the optimal setup is complicated by correlated features, interdependency structures, and a multitude of available ML algorithms. Therefore, we present a radiomics-based benchmarking framework to optimize a comprehensive ML pipeline for the prediction of overall survival. This study is conducted on an image set of patients with hepatic metastases of colorectal cancer, for which radiomics features of the whole liver and of metastases from computed tomography images were calculated. A mixed model approach was used to find the optimal pipeline configuration and to identify the added prognostic value of radiomics features.
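The mixed-model comparison of pipeline configurations could be set up along the following lines with statsmodels; the column names and toy values are assumptions for illustration, not the study's data or exact model specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder: one row per (pipeline configuration, CV fold) with its score.
results = pd.DataFrame({
    "cindex":   [0.62, 0.65, 0.58, 0.61, 0.66, 0.63, 0.60, 0.64],
    "model":    ["rsf", "rsf", "cox", "cox", "rsf", "cox", "rsf", "cox"],
    "features": ["liver", "mets", "liver", "mets", "mets", "liver", "liver", "mets"],
    "fold":     [1, 1, 1, 1, 2, 2, 3, 3],
})

# Fixed effects for the pipeline choices, random intercept per CV fold to
# account for fold-to-fold correlation of the performance estimates.
fit = smf.mixedlm("cindex ~ model + features", results, groups="fold").fit()
print(fit.summary())
```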
Purpose: To analyze and remove protected feature effects in chest radiograph embeddings of deep learning models.
Methods: An orthogonalization is utilized to remove the influence of protected features (e.g., age, sex, race) in CXR embeddings, ensuring feature-independent results. To validate the efficacy of the approach, we retrospectively study the MIMIC and CheXpert datasets using three pre-trained models: a supervised contrastive model, a self-supervised contrastive model, and a baseline classifier. Our statistical analysis compares the original and orthogonalized embeddings by estimating protected feature influences and by evaluating the ability to predict race, age, or sex from the two types of embeddings.
Results: Our experiments reveal a significant influence of protected features on predictions of pathologies. Applying orthogonalization removes these feature effects while maintaining competitive predictive performance; in addition, orthogonalized embeddings make it infeasible to directly predict protected attributes and mitigate subgroup disparities.
Conclusion: This work demonstrates the successful application and evaluation of the orthogonalization technique in the domain of chest X-ray image classification.
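Linear orthogonalization of embeddings with respect to protected attributes amounts to residualizing each embedding dimension on those attributes. A minimal NumPy sketch of this generic technique (the design matrix, dimensions, and toy data are assumptions, not the study's setup):

```python
import numpy as np

def orthogonalize(E: np.ndarray, Z: np.ndarray) -> np.ndarray:
    """Remove the linear effect of protected features Z from embeddings E.

    E: (n_samples, d) embedding matrix.
    Z: (n_samples, q) protected features (e.g., age, sex indicators).
    """
    Z = np.column_stack([np.ones(len(Z)), Z])      # add intercept
    beta, *_ = np.linalg.lstsq(Z, E, rcond=None)   # least-squares fit per dimension
    return E - Z @ beta                            # residuals are orthogonal to Z

# Placeholder data: 100 samples, 16-dim embeddings, age + sex as protected.
rng = np.random.default_rng(0)
Z = np.column_stack([rng.normal(60, 15, 100), rng.integers(0, 2, 100)])
E = rng.normal(size=(100, 16)) + 0.05 * Z[:, :1]   # embeddings leak age
E_orth = orthogonalize(E, Z)
print(np.abs((Z - Z.mean(0)).T @ E_orth).max())    # ~0: no linear association left
```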
Radiomics, the analysis of quantitative features computed from medical images with machine learning tools, shares the instability challenge of other high-dimensional data analyses: results vary with the composition of the training set. This instability affects model interpretation and feature importance assessment. To enhance stability and interpretability, we introduce grouped feature importance, shedding light on the limitations of current tools and advocating for more reliable radiomics-based analysis methods.
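A simple instance of grouped feature importance, given here as a generic sketch rather than the paper's exact procedure, permutes all features of a group jointly and records the resulting performance drop:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def grouped_permutation_importance(model, X, y, groups, n_repeats=20, seed=0):
    # groups: dict mapping group name -> list of column indices.
    rng = np.random.default_rng(seed)
    base = roc_auc_score(y, model.predict_proba(X)[:, 1])
    importances = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(len(X))
            Xp[:, cols] = X[perm][:, cols]   # permute the whole group jointly
            drops.append(base - roc_auc_score(y, model.predict_proba(Xp)[:, 1]))
        importances[name] = float(np.mean(drops))
    return importances

# Placeholder: two correlated "shape" features and two "texture" features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)
print(grouped_permutation_importance(model, X, y,
                                     {"shape": [0, 1], "texture": [2, 3]}))
```

Permuting correlated features as a group avoids the misleading per-feature attributions that single-feature permutation produces on correlated radiomics data.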
While recent advances in large-scale foundation models show promising results, their application to the medical domain has not yet been explored in detail. In this paper, we advance large-scale modeling in medical image synthesis by proposing Cheff, a foundational cascaded latent diffusion model that generates highly realistic chest radiographs with state-of-the-art quality at a 1-megapixel scale. We further propose MaCheX, a unified interface to public chest datasets that forms the largest open collection of chest X-rays to date. By conditioning Cheff on radiological reports, we additionally guide the synthesis process with text prompts and open up the research area of report-to-chest-X-ray generation.
This thesis advances the quantification and prediction of hemodynamic parameters in dynamic contrast-enhanced (DCE) imaging through two innovative approaches. The Bayesian Tofts model (BTM) improves the reliability and uncertainty estimation of perfusion parameters, demonstrating its potential for enhanced treatment response assessment in cancer care. Additionally, the development of a deep learning model offers a promising alternative by directly predicting clinical endpoints from raw DCE-CT data, eliminating the need for traditional tracer-kinetic modeling and paving the way for more efficient and accurate clinical applications in stroke and other conditions. (Shortened.)
Generative models allow for the creation of highly realistic artificial samples, opening up promising applications in medical imaging. In this work, we propose a multi-stage encoder-based approach to invert the generator of a generative adversarial network (GAN) for high-resolution chest radiographs. This gives direct access to the GAN's implicitly formed latent space, makes generative models more accessible to researchers, and enables applying generative techniques to actual patients' images. We investigate various applications of this embedding, including image compression, disentanglement in the encoded dataset, guided image manipulation, and the creation of stylized samples. We find that this type of GAN inversion is a promising research direction in the domain of chest radiograph modeling and opens up new ways to combine realistic X-ray sample synthesis with radiological image analysis.
Deep learning excels in the analysis of unstructured data, and recent advancements allow these techniques to be extended to survival analysis. In the context of clinical radiology, this makes it possible, e.g., to relate unstructured volumetric images to a risk score or a prognosis of life expectancy and to support clinical decision making. Medical applications are, however, associated with high criticality, and consequently neither medical personnel nor patients usually accept black-box models as the basis for decisions. Apart from averseness to new technologies, this is due to the missing interpretability, transparency, and accountability of many machine learning methods. We propose a hazard-regularized variational autoencoder that supports straightforward interpretation of deep neural architectures in the context of survival analysis, a field highly relevant in healthcare. We apply the proposed approach to abdominal CT scans of patients with liver tumors and their corresponding survival times.
The application of deep learning in survival analysis (SA) allows utilizing unstructured and high-dimensional data types that are uncommon in traditional survival methods. This advances methods in fields such as digital health, predictive maintenance, and churn analysis, but often yields less interpretable and intuitively understandable models due to the black-box character of deep learning-based approaches. We close this gap by proposing (1) a multi-task variational autoencoder (VAE) with a survival objective, yielding survival-oriented embeddings, and (2) HazardWalk, a novel method for modeling hazard factors in the original data space. HazardWalk transforms the latent distribution of our autoencoder into areas of maximized/minimized hazard and then uses the decoder to project these changes to the original domain. Our procedure is evaluated on a simulated dataset as well as on CT imaging data of patients with liver metastases.
Background: Yttrium-90 radioembolization (RE) plays an important role in the treatment of liver malignancies, and optimal patient selection is crucial for effective and safe treatment. In this study, we aim to validate the prognostic performance of a previously established random survival forest (RSF) on an external validation cohort from a different national center. Furthermore, we compare outcome prediction models using different established metrics.
Methods: A previously established RSF model, trained on a consecutive cohort of 366 patients who had received RE for a primary or secondary liver tumor at one national center (center 1), was used to predict the outcome of an independent consecutive cohort of 202 patients from a different national center (center 2), and vice versa. Prognostic performance was evaluated using the concordance index (C-index) and the integrated Brier score (IBS). The prognostic importance of designated baseline parameters was measured with the minimal depth concept, and the influence on the predicted outcome was analyzed with accumulated local effects plots. RSF models were compared to conventional Cox proportional hazards (CPH) models in terms of C-index and IBS.
Results: The established RSF model achieved a C-index of 0.67 for center 2, comparable to the results obtained for center 1, on which it was trained (0.66). The RSF model trained on center 2 achieved a C-index of 0.68 on center 2 data and 0.66 on center 1 data. CPH models showed comparable results on both cohorts, with C-indices ranging from 0.68 to 0.72. IBS validation showed more differentiated results depending on which cohort was trained on and which cohort was predicted (range: 0.08 to 0.20). Baseline cholinesterase was the most important variable for survival prediction.
Conclusion: The previously developed predictive RSF model was successfully validated on an independent external cohort. The C-index and IBS are suitable metrics for comparing outcome prediction models, with the IBS showing more differentiated results. The findings corroborate that survival after RE is critically determined by functional hepatic reserve; thus, baseline liver function should play a key role in patient selection.
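External validation with the C-index and integrated Brier score as described above could be coded along the following lines with scikit-survival; the cohort data below are random placeholders, not the study's data.

```python
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored, integrated_brier_score
from sksurv.util import Surv

# Placeholder training (center 1) and external validation (center 2) data.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(366, 8)), rng.normal(size=(202, 8))
y1 = Surv.from_arrays(event=rng.integers(0, 2, 366).astype(bool),
                      time=rng.uniform(1, 60, 366))
y2 = Surv.from_arrays(event=rng.integers(0, 2, 202).astype(bool),
                      time=rng.uniform(1, 60, 202))

rsf = RandomSurvivalForest(n_estimators=200, random_state=0).fit(X1, y1)

# C-index on the external cohort (higher predicted risk = shorter survival).
cindex = concordance_index_censored(y2["event"], y2["time"], rsf.predict(X2))[0]

# Integrated Brier score over a time grid inside the follow-up range.
times = np.linspace(5, 50, 20)
surv_probs = np.vstack([fn(times) for fn in rsf.predict_survival_function(X2)])
ibs = integrated_brier_score(y1, y2, surv_probs, times)
print(f"C-index = {cindex:.2f}, IBS = {ibs:.2f}")
```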