Introduction: The promise and challenge of LLMs in lung cancer care
Lung cancer (LC) remains a leading cause of cancer mortality worldwide. Advances in early detection, such as low‑dose CT screening, and integrated full‑cycle management offer the potential to improve survival and quality of life. At the same time, realizing these gains requires handling vast, heterogeneous data and coordinating complex workflows across prevention, screening, diagnosis, treatment, and supportive care. Large language models (LLMs) have emerged as a potential enabler, capable of processing unstructured text, extracting structured insights from clinical data, and facilitating patient–clinician communication. Yet biases, hallucinations, and safety concerns temper their adoption. This article presents a systematic review of the latest evidence on LLMs in LC, summarizing use cases, models, prompting strategies, limitations, and future directions to guide clinicians and researchers.
Methods in brief: How the review was conducted
The review followed established guidelines for systematic reviews, drawing evidence from six major databases up to January 1, 2025. Eligible studies explored LLMs in LC beyond pure diagnostics, including knowledge‑base questions, information extraction, and decision support. A mixed‑methods quality framework assessed risk of bias and applicability, with tools such as QUADAS‑2, PROBAST, and ROBINS‑I applied according to study design. Data were synthesized descriptively due to heterogeneity in aims and methodologies. In total, 28 studies published between 2023 and 2024 were included, with a mix of conference papers, preprints, and journal articles from the United States, Korea, Germany, China, and other countries. The analyses emphasize practical applications, not merely theoretical potential.
What LLMs are being used, and in which LC domains?
Across the included studies, a variety of LLMs were tested, including OpenAI’s GPT‑3.5 and GPT‑4 series, GPT‑4V for multimodal inputs, Meta AI’s LLaMA‑2, and Google’s Bard. Specialized medical models such as ClinicalBERT and other domain‑specific architectures were sometimes favored in task‑specific benchmarks. Seven major LC application domains emerged, often overlapping in practice: auxiliary diagnosis, information extraction, knowledge‑base question answering, scientific education and communication, nursing and patient support, treatment decision assistance, and follow‑up support. Across these domains, the strongest evidence exists for diagnostic and screening assistance, supported by information extraction capabilities that distill relevant features from radiology, pathology, and electronic health records.
Diagnostics, screening, and information extraction
LLMs have been used to extract features from radiology reports, pathology notes, and electronic medical record fields to assist in staging, histology typing, and metastasis mapping. In screening workflows, ChatGPT and similar models have been tested for generating or enriching lung‑nodule records and for translating imaging reports into actionable scores such as Lung‑RADS categories. While some studies report promising accuracy in generating structured outputs, robust prospective, multicenter validation remains limited. The potential lies in reducing clinician burden and enabling scalable triage, although safety concerns about misclassification or missed critical findings persist.
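To make this workflow concrete, the sketch below shows the kind of report‑to‑structured‑record step these studies describe. It assumes a generic chat‑style model behind a hypothetical call_llm helper; the field names and prompt wording are illustrative, not drawn from the reviewed studies, and any suggested Lung‑RADS category would require clinician verification.

```python
import json

# Hypothetical stand-in for any chat-style LLM client (GPT-4, LLaMA-2, etc.);
# wire in the actual API call for the model under evaluation.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect a model client here")

EXTRACTION_PROMPT = """\
You are a radiology information-extraction assistant.
From the lung screening CT report below, return ONLY a JSON object with keys:
  "nodule_present": true or false,
  "largest_nodule_mm": number or null,
  "nodule_type": "solid" | "part-solid" | "ground-glass" | null,
  "suggested_lung_rads": string or null

Report:
{report}
"""

def extract_nodule_record(report_text: str) -> dict:
    raw = call_llm(EXTRACTION_PROMPT.format(report=report_text))
    record = json.loads(raw)  # in practice, validate against a strict schema
    # The structured output is a draft for clinician review, never an
    # autonomous Lung-RADS assignment.
    record["needs_review"] = True
    return record
```

Constraining the model to a fixed JSON schema keeps outputs auditable and makes misclassifications easier to catch downstream.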
Knowledge answering, education, and patient communication
Several studies evaluated LLMs as knowledge‑base assistants for clinicians and as patient education tools. Multimodal capabilities (text plus images) show growing promise for interpreting imaging and the associated reports, enabling preliminary treatment considerations under careful governance. The results suggest LLMs can support general explanations of LC concepts, summarization of guidelines, and rapid literature digests for clinicians and students alike. Nevertheless, in high‑stakes decisions, outputs must be validated against expert consensus and guideline recommendations.
Research support, trial screening, and data synthesis
Information extraction and trial annotation tasks, such as classifying trial eligibility from pathology and radiology narratives or converting abstracts into computable data, are common themes. By accelerating screening and evidence synthesis, LLMs may improve the efficiency of research pipelines; however, ensuring reproducibility and data provenance requires rigorous benchmarking and transparent reporting of limitations.
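A minimal sketch of first‑pass eligibility screening in this spirit, reusing the hypothetical call_llm helper from the earlier sketch; the criteria, labels, and prompt are illustrative rather than taken from any included study.

```python
ELIGIBILITY_PROMPT = """\
Trial inclusion criteria:
{criteria}

Patient narrative (pathology/radiology excerpts):
{narrative}

Answer with exactly one label: ELIGIBLE, INELIGIBLE, or UNCERTAIN,
followed by one sentence citing the decisive criterion.
"""

def screen_candidate(criteria: str, narrative: str) -> str:
    # UNCERTAIN cases (and, during validation, every case) go to a human
    # abstractor; the model only accelerates first-pass triage.
    return call_llm(ELIGIBILITY_PROMPT.format(criteria=criteria,
                                              narrative=narrative))
```

Logging the model's cited criterion alongside each label is one way to support the provenance and benchmarking concerns noted above.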
Treatment planning and follow‑up support
Emerging studies explore how LLMs could align patient data with clinical guidelines to propose treatment scenarios, estimate survival indicators, or outline follow‑up plans. While these tools hold appeal for personalized medicine, they have not yet demonstrated consistent safety and effectiveness in routine practice and should operate within a human‑in‑the‑loop framework to avoid overreliance on automated recommendations.
Prompt engineering, training data, and modalities
Prompt design emerges as a crucial determinant of performance. Many studies employed templates, role descriptions, and task‑specific instructions, with some using zero‑ to few‑shot learning or fine‑tuning on domain data. Text‑based tasks dominated the literature, but a growing subset explored images and text in multimodal models. The balance between general‑purpose LLMs and specialized clinical models remains an active area of investigation, with retrieval‑augmented generation (RAG) offering a pathway to integrate external evidence and improve reliability, as sketched below.
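The following sketch illustrates the RAG pattern in its simplest form, again assuming the hypothetical call_llm helper; the toy lexical retriever stands in for an embedding index, and the guideline snippets are illustrative paraphrases, not verbatim guideline text.

```python
# Tiny in-memory "knowledge base" of guideline snippets (illustrative
# paraphrases, not verbatim guideline language).
GUIDELINE_SNIPPETS = [
    "Lung-RADS 2 (benign appearance): continue annual low-dose CT screening.",
    "Lung-RADS 4A: follow-up low-dose CT at 3 months; PET/CT may be considered.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Toy word-overlap retriever; a production system would use an
    # embedding index with source citations.
    q = set(query.lower().split())
    return sorted(corpus, key=lambda s: -len(q & set(s.lower().split())))[:k]

def answer_with_evidence(question: str) -> str:
    evidence = "\n".join(retrieve(question, GUIDELINE_SNIPPETS))
    prompt = (
        "Answer using ONLY the guideline excerpts below; reply 'not covered' "
        f"if they are insufficient.\n\nExcerpts:\n{evidence}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```

Grounding answers in retrieved excerpts, and instructing the model to refuse when the excerpts are insufficient, is one practical way to curb the hallucinations discussed below.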
Limitations, safety, and human oversight
Across the board, issues of bias, hallucinations, and data privacy loom large. The reviewed studies frequently relied on retrospective data and single‑center sources, underscoring the need for prospective, multicenter validation. Human oversight, regulatory consideration (e.g., HIPAA‑compliant data handling), and explicit attribution of responsibility are essential. A practical approach is a human‑in‑the‑loop paradigm that leverages LLMs for data processing and education while preserving clinician judgment for final decisions.
Future directions and practical recommendations
The review identifies three priorities: (1) rigorous prospective, multicenter validation of diagnostic/screening and treatment‑planning tools; (2) development of patient‑facing and follow‑up support applications to address long‑term management gaps; and (3) improvements in interpretability, bias reduction, and deployment strategies to ensure safe adoption across diverse health systems. Leveraging multimodal LLMs, robust data governance, and retrieval‑augmented frameworks will be central to realizing this vision while maintaining patient safety and privacy.
Conclusion: A measured path to integrating LLMs into LC care
LLMs hold meaningful potential to augment LC care: aiding in the interpretation of tests, proposing evidence‑based treatment considerations, and supporting education and research. Yet their maturity, evidence base, and real‑world readiness vary across applications. Progress will come from rigorous validation, responsible deployment with human oversight, and continued development of multimodal and domain‑specific models. Guided by ethics, transparency, and patient‑centered goals, LLMs can become a valuable component of full‑cycle LC management rather than a replacement for clinicians.