Categories: Medicine / AI in Healthcare

AI in Rare Hematologic Diagnosis: How Large Language Models Perform and Shape Physician Decision-Making

Overview

Advances in large language models (LLMs) are reshaping how clinicians approach rare hematologic diseases. A combined retrospective and prospective study from a Chinese medical center evaluated the diagnostic performance of seven publicly available LLMs—some with chain-of-thought (CoT) capabilities—using deidentified admission records. The study also tested whether presenting the models’ outputs to physicians could improve diagnostic accuracy in real time, especially for less-experienced clinicians.

What was studied and why it matters

Rare hematologic diseases pose diagnostic challenges due to their low prevalence, multisystem manifestations, and non-specific symptoms. Traditional diagnostic aids often rely on structured data and predefined rules, which can miss atypical presentations. The research team asked: (1) Can modern LLMs generate clinically useful differential diagnoses from free-text admission records without task-specific fine-tuning? (2) Do LLM-provided outputs influence physician accuracy, and does CoT reasoning help or hinder clinical judgment?

Retrospective phase: LLMs versus expert clinicians

The study analyzed 158 inpatient records spanning nine rare hematologic diseases. Seven publicly available LLMs were prompted to produce the top 10 likely primary diagnoses for each case in five separate runs, using the original Chinese text of the admission records. Key findings include:

  • Top-10 accuracy: ChatGPT-o1-preview led with 0.703, meaning the correct diagnosis appeared in the top 10 in about 70% of cases. Other strong performers included DeepSeek-R1 and Gemini Experimental 1206.
  • Mean reciprocal rank (MRR): ChatGPT-o1-preview again showed the best placement of the correct diagnosis within the list (MRR around 0.58).
  • Difficult cases: LLMs struggled with AL amyloidosis, Castleman disease, Erdheim-Chester disease, and POEMS syndrome, which were often underrepresented in literature and had lower diagnostic keyword density in admission notes.
  • Stability and variability: Higher accuracy often came with more output variability. For example, ChatGPT-4o produced wider ranges of suggested diagnoses, signaling a trade-off between creativity and reliability.
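The two ranking metrics above can be sketched in a few lines; this is a minimal illustration of how top-10 accuracy and mean reciprocal rank (MRR) are conventionally computed, not the study's actual evaluation code, and the diagnosis names in the toy example are placeholders.

```python
# Minimal sketch of the two ranking metrics reported in the study:
# top-10 accuracy (correct diagnosis anywhere in the top 10) and
# mean reciprocal rank (MRR, averaging 1/rank of the correct diagnosis).

def reciprocal_rank(ranked_dx, correct_dx, cutoff=10):
    """Return 1/rank of the correct diagnosis within the top `cutoff`, else 0."""
    for rank, dx in enumerate(ranked_dx[:cutoff], start=1):
        if dx == correct_dx:
            return 1.0 / rank
    return 0.0

def evaluate(cases, cutoff=10):
    """cases: list of (ranked_diagnoses, correct_diagnosis) pairs."""
    rr = [reciprocal_rank(ranked, correct, cutoff) for ranked, correct in cases]
    top_k_accuracy = sum(1 for x in rr if x > 0) / len(cases)
    mrr = sum(rr) / len(cases)
    return top_k_accuracy, mrr

# Toy example: correct diagnosis ranked 1st, 2nd, and absent, respectively.
cases = [
    (["POEMS syndrome", "Castleman disease"], "POEMS syndrome"),
    (["AL amyloidosis", "Castleman disease"], "Castleman disease"),
    (["Erdheim-Chester disease"], "POEMS syndrome"),
]
acc, mrr = evaluate(cases)  # acc = 2/3, mrr = (1 + 0.5 + 0)/3 = 0.5
```

An MRR near 0.58, as reported for ChatGPT-o1-preview, roughly corresponds to the correct diagnosis sitting near the top of the list on average.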

Physician scores correlated with LLM performance: higher top-10 accuracy and higher MRR tended to align with higher physician diagnostic accuracy, though there were discordant cases in which a clinician correctly diagnosed a disease the model did not emphasize.

Prospective phase: can LLMs uplift physician performance?

In the prospective study, 28 clinicians across four experience levels evaluated five cases each in three steps: (1) initial diagnosis without AI input; (2) a second diagnosis after seeing the LLM’s top 10 list; (3) a final diagnosis after reviewing the LLM’s step-by-step reasoning and analysis (CoT). The main takeaways were:

  • Overall improvement: Across participants, diagnostic scores improved significantly from the first step to the second and remained elevated after the third. The largest gains were seen among less-experienced clinicians (post-residency physicians and non-hematology attendings).
  • Specialists benefit less: Hematology attendings and consultant hematologists, who started with higher baseline accuracy, showed limited additional gains.
  • Potential risks: When LLM outputs were biased, physician performance could decline, and CoT did not reliably mitigate this bias. Safeguards are essential to prevent over-reliance on AI or misinterpretation of stepwise reasoning.

The study concluded that non-fine-tuned, publicly available LLMs can aid diagnosis for rare hematologic diseases using text-only admission records. The integration of LLMs into clinical workflows appeared feasible, especially as an educational aid for less-experienced clinicians. However, the authors warn that the “double-edged sword” nature of LLMs requires careful governance, transparency about uncertainty, and human oversight to preserve clinician judgment.

Clinical and educational implications

For hospitals exploring AI-assisted diagnostics, the findings suggest a cautious path to deployment: use LLMs to generate candidate diagnoses and explainable reasoning while applying confidence thresholds and highlighting uncertainty. Training clinicians to critically appraise AI outputs and to preserve independent clinical judgment will be essential. In education, LLMs could become valuable tools in problem-based learning and resident teaching when used under supervision.
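A confidence threshold of the kind described above might be applied as follows. This is a hypothetical sketch, not the study's workflow: the `triage_candidates` function, the diagnosis names, and the 0.2 cutoff are all illustrative assumptions.

```python
# Hypothetical sketch: split model-suggested diagnoses into those shown to
# the clinician and those flagged as low-confidence for explicit review.
# All names and scores below are illustrative, not from the study.

CONFIDENCE_THRESHOLD = 0.2  # assumed deployment-specific cutoff

def triage_candidates(candidates, threshold=CONFIDENCE_THRESHOLD):
    """candidates: list of (diagnosis, confidence) pairs, highest first."""
    shown = [(dx, p) for dx, p in candidates if p >= threshold]
    uncertain = [(dx, p) for dx, p in candidates if p < threshold]
    return shown, uncertain

candidates = [("POEMS syndrome", 0.45),
              ("Castleman disease", 0.30),
              ("AL amyloidosis", 0.12)]
shown, uncertain = triage_candidates(candidates)
# Low-confidence suggestions are surfaced with an explicit uncertainty flag
# rather than silently mixed into the differential.
```

Surfacing the uncertain tier separately, rather than hiding it, is one way to support the critical appraisal the authors call for.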

Limitations and future directions

The study was single-center, relatively small, and used only Chinese-language, text-based admission notes. Multicenter trials, inclusion of imaging and laboratory data, and assessments in other languages will help determine generalizability. Ongoing work should also address methods to detect and mitigate biased outputs, improve output stability without sacrificing accuracy, and establish best-practice guidelines for AI-assisted diagnosis in rare diseases.