Introduction: Leveraging AI to tackle rare hematologic diseases
Rare diseases pose substantial diagnostic challenges due to their low prevalence, diverse presentations, and often multisystem involvement. This study investigates how large language models (LLMs), especially new-generation transformers with chain-of-thought (CoT) capabilities, perform in diagnosing rare hematologic diseases and how their outputs shape physician decision-making in real-world settings.
Study design: Retrospective analysis and prospective evaluation
The study comprised two complementary components. First, a retrospective analysis used deidentified admission records from a single center, feeding the processed text to seven publicly available LLMs to generate the top 10 diagnostic candidates for each case. Second, a prospective phase presented physicians with the LLMs' top-10 lists and, in a stepwise fashion, the models' CoT reasoning. The aim was to determine whether AI-assisted outputs improve diagnostic accuracy across experience levels and to assess safety and practicality for integration into clinical workflows.
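The paper does not publish its prompts or code, so the following is only a minimal sketch of what such a retrospective pipeline could look like, assuming the OpenAI Python client; the prompt wording, model name, and plain-text output format are illustrative assumptions, not the study's actual setup.

```python
# Sketch of the retrospective pipeline: send one deidentified admission
# note to an LLM and ask for a ranked list of 10 candidate diagnoses.
# Prompt wording, model name, and output format are assumptions for
# illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def top10_differential(admission_text: str, model: str = "gpt-4o") -> list[str]:
    prompt = (
        "You are a hematology consultant. Based on the admission record "
        "below, list the 10 most likely diagnoses, ranked from most to "
        "least likely, one per line.\n\n" + admission_text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # One candidate diagnosis per line, in ranked order.
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()][:10]
```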
Retrospective phase: Diagnostic performance across models
The retrospective phase covered 158 admissions spanning nine rare hematologic diseases. The best-performing model, ChatGPT-o1-preview, achieved a top-10 accuracy of 0.703 and a mean reciprocal rank (MRR) of 0.577. While overall performance was strong, accuracy was notably lower for AL amyloidosis, Castleman disease, Erdheim-Chester disease, and POEMS syndrome, conditions characterized by multisystem involvement and nonspecific presentations. In contrast, diagnoses such as Waldenström macroglobulinemia, acquired hemophilia, Langerhans cell histiocytosis, cutaneous T-cell lymphoma, and thrombotic thrombocytopenic purpura fared better, highlighting data sparsity and literature gaps as limiting factors for certain diseases.
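For readers unfamiliar with the metrics: top-10 accuracy is the fraction of cases in which the true diagnosis appears anywhere in the model's 10-item list, and MRR averages 1/rank of the true diagnosis across cases (scoring 0 when it is absent). A minimal sketch using these conventional definitions, with exact-string matching as a simplifying assumption (the study's matching criteria may differ):

```python
# Conventional definitions of the two headline metrics. Exact-match
# comparison of diagnosis strings is a simplification for illustration.

def reciprocal_rank(ranked: list[str], truth: str) -> float:
    """1/rank of the correct diagnosis in the top-10 list, 0 if absent."""
    for rank, candidate in enumerate(ranked[:10], start=1):
        if candidate == truth:
            return 1.0 / rank
    return 0.0


def evaluate(cases: list[tuple[list[str], str]]) -> tuple[float, float]:
    """Return (top-10 accuracy, MRR) over (ranked list, true diagnosis) pairs."""
    rrs = [reciprocal_rank(ranked, truth) for ranked, truth in cases]
    top10_acc = sum(rr > 0 for rr in rrs) / len(rrs)
    mrr = sum(rrs) / len(rrs)
    return top10_acc, mrr
```

As a rough sanity check on the reported figures: with a top-10 accuracy of 0.703, the mean reciprocal rank among hits is 0.577 / 0.703 ≈ 0.82, meaning that when ChatGPT-o1-preview listed the correct diagnosis at all, it usually placed it first or second.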
Model stability varied: higher accuracy often came with greater variability in outputs, particularly for models like ChatGPT-4o. The study also found a moderate-to-strong correlation between physicians' initial diagnostic performance and the LLM metrics (top-10 accuracy and MRR), suggesting that the cases that challenged physicians tended to challenge the models as well.
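The paper's exact statistic is not restated here; a rank correlation such as Spearman's ρ is one common way to run this kind of per-disease check. The values below are placeholders, not the study's data:

```python
# Hypothetical per-disease check of whether physician accuracy tracks LLM
# metrics. The numbers are placeholders, and Spearman's rho is one common
# choice; the paper's exact statistic may differ.
from scipy.stats import spearmanr

physician_acc = [0.42, 0.55, 0.31, 0.68, 0.47, 0.60, 0.25, 0.72, 0.50]  # per disease
llm_mrr       = [0.40, 0.61, 0.28, 0.70, 0.52, 0.58, 0.30, 0.75, 0.48]

rho, p_value = spearmanr(physician_acc, llm_mrr)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```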
Prospective phase: Do LLMs help clinicians?
In the prospective study, 28 physicians across four experience levels each evaluated five cases. Across all participants, diagnostic accuracy improved from the first to the second and third attempts when LLM outputs and stepwise reasoning were provided, with statistically significant gains most pronounced among less-experienced clinicians (postresidency physicians and nonhematology attendings). Hematology attendings and consultant hematologists, who started with higher baseline accuracy, showed smaller gains, implying a ceiling effect.
Importantly, the study highlighted a double-edged sword: biased LLM outputs could mislead clinicians, undermining gains even when CoT reasoning was exposed. In those cases, the biased outputs led to worse second attempts and negative user ratings, underscoring the need for safeguards, uncertainty labeling, and selective presentation of AI conclusions.
Clinical implications: Safeguards and integration into practice
The findings support the potential of LLMs as practical clinical decision-support tools for rare hematologic diseases, especially to assist trainees and non-specialists. Unlike traditional rule-based decision support, LLMs can process unstructured text and offer multiple differential diagnoses and reasoning pathways, aiding both education and clinical reasoning. However, translating this into routine care requires several safeguards: confidence thresholds, uncertainty labeling, multi-model consensus strategies, and a hybrid human-AI workflow that preserves clinician judgment while leveraging AI strengths, as sketched below.
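The paper does not specify a consensus algorithm; one plausible instantiation, shown here as a hypothetical sketch, fuses the ranked lists from several models with reciprocal rank fusion (RRF) and attaches an uncertainty flag to diagnoses that only a minority of models proposed. The function name and thresholds are illustrative assumptions:

```python
# Hypothetical multi-model consensus safeguard: fuse ranked lists from
# several LLMs with reciprocal rank fusion (RRF) and flag diagnoses that
# only a minority of models proposed. This is one plausible design, not
# the study's method; k and min_support are illustrative values.
from collections import defaultdict


def fuse_with_uncertainty(model_lists: list[list[str]], k: int = 60,
                          min_support: int = 2) -> list[tuple[str, float, bool]]:
    scores = defaultdict(float)   # RRF score per diagnosis
    support = defaultdict(int)    # how many models proposed it
    for ranked in model_lists:
        for rank, dx in enumerate(ranked, start=1):
            scores[dx] += 1.0 / (k + rank)  # standard RRF contribution
            support[dx] += 1
    fused = sorted(scores, key=scores.get, reverse=True)
    # Low-consensus candidates carry an uncertainty flag for the clinician.
    return [(dx, scores[dx], support[dx] < min_support) for dx in fused]
```

A workflow built on this could surface high-consensus diagnoses by default and visibly label the rest as uncertain, keeping the final call with the clinician.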
Limitations and future directions
Limitations include single-center data, a relatively small and language-specific sample, and reliance on text-only admission records without multimodal data such as imaging or labs. Further multicenter, multilingual studies with multimodal inputs are needed to validate these results and refine integration strategies for rare diseases. The study also highlights the necessity of ongoing clinician education on critically appraising AI outputs and recognizing when to override model suggestions.
Conclusion: A cautious but promising path forward
Publicly available new-generation LLMs, without task-specific fine-tuning, can propose correct diagnoses for rare hematologic diseases at levels similar to human physicians when given unstructured admission texts. Prospective findings suggest LLMs can boost diagnostic performance for less-experienced clinicians, provided robust safeguards address bias and transparency. As AI tools become more embedded in clinical workflows, careful design, training, and governance will be key to maximizing patient benefit while protecting clinical autonomy.