Large language models (LLMs) are increasingly used in clinical settings, yet their effect on physicians' diagnostic accuracy has not been systematically quantified. We conducted a systematic review and meta-analysis of studies of LLM-assisted diagnosis published between January 2020 and June 2025. Across 15 studies (43 effect sizes; 498 physicians; 7,274 case evaluations), LLM assistance significantly improved diagnostic accuracy compared with physicians working without LLM support (Hedges’ g = 0.20, 95% CI 0.12–0.29; P < .001). Improvements were observed across multiple LLMs (e.g., GPT-4, AMIE, MedFound-DX-PA), medical fields (general medicine, radiology), and physician career stages (residents and attendings), but the magnitude of the benefit varied substantially. These findings show that LLMs can improve physicians' diagnostic accuracy, but the conditions under which LLM assistance succeeds remain unclear. Further clinical evidence is needed to guide safe and effective integration into practice.
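As a rough plausibility check on the pooled estimate, the reported confidence interval can be converted back into an approximate standard error and test statistic. The sketch below assumes a symmetric Wald-type interval based on the normal distribution. The abstract does not state the pooling model, so this is an illustrative assumption rather than the authors' procedure, and the function names `se_from_ci` and `two_sided_p` are hypothetical helpers. The published bounds are rounded (their midpoint is 0.205, not 0.20), so the result is only approximate.

```python
import math

def se_from_ci(lower, upper, z_crit=1.959964):
    """Approximate standard error implied by a symmetric 95% CI (normal assumption)."""
    return (upper - lower) / (2 * z_crit)

def two_sided_p(estimate, se):
    """Return the z statistic and two-sided normal p-value for estimate / se."""
    z = estimate / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail probability of a standard normal
    return z, math.erfc(abs(z) / math.sqrt(2))

# Values as reported in the abstract (rounded to two decimals)
g, ci_low, ci_high = 0.20, 0.12, 0.29

se = se_from_ci(ci_low, ci_high)
z, p = two_sided_p(g, se)
print(f"implied SE ~ {se:.3f}, z ~ {z:.2f}, two-sided p ~ {p:.1e}")
```

Under these assumptions the implied p-value falls well below .001, which is consistent with the reported significance level.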