Categories: Clinical Research

A Human-LLM Collaborative Annotation Approach for Screening Precision Oncology RCT Articles

Introduction

Systematic reviews in precision oncology rely on identifying randomized controlled trials (RCTs) to compare therapies and guide clinical decisions. However, screening thousands of articles to find eligible RCTs is labor-intensive and prone to human error. While supervised learning can accelerate this process, it often demands large labeled datasets and careful tuning, especially for nuanced oncology trial designs. Large language models (LLMs) such as GPT-3.5 can rapidly triage abstracts, but their reliability varies with study terminology and endpoint definitions. A human-LLM collaborative annotation approach combines the speed of AI with rigorous human oversight, offering a scalable, transparent workflow for screening precision oncology RCT articles.

The Challenge in Screening Precision Oncology RCTs

Oncology randomized trials frequently use heterogeneous endpoints, complex inclusion criteria, and evolving terminology. Abstracts may misrepresent trial design or population, producing false positives or false negatives when screening relies on AI alone. In addition, explainable decisions are essential in evidence synthesis to ensure reproducibility and trust among reviewers, editors, and clinicians. The challenge is to balance automation that accelerates screening with human judgment that safeguards accuracy and context.

The Human-LLM Collaborative Annotation Framework

The proposed framework integrates a fast, AI-driven pre-screen with careful human validation and consensus-building. It comprises three core stages designed to screen precision oncology RCT articles efficiently while maintaining high-quality inclusions and exclusions.

Step 1: Pre-screening with LLMs

In this stage, an LLM processes title and abstract data to classify articles as include, exclude, or uncertain for potential eligibility. The model also outputs a confidence score and a brief rationale focused on study design (RCT vs non-RCT), population (e.g., cancer type and stage), intervention, comparator, and key outcomes relevant to precision oncology. The goal is to produce a compact, human-interpretable triage stream that flags ambiguous cases for closer human review and reduces the volume of articles requiring manual scrutiny.
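The triage routing described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: it assumes the LLM has been prompted to return JSON with hypothetical fields `decision`, `confidence`, and `rationale`, and that any uncertain or low-confidence call is flagged for human review.

```python
import json

def triage(raw_response: str, confidence_threshold: float = 0.8) -> str:
    """Route one LLM pre-screen response to a triage stream.

    Assumes the model was instructed to emit JSON with the (illustrative)
    fields: "decision" (include / exclude / uncertain), "confidence"
    (0 to 1), and "rationale" (free text for the audit trail).
    """
    record = json.loads(raw_response)
    decision = record["decision"]
    confidence = record["confidence"]
    # Uncertain labels and low-confidence calls go to human review.
    if decision == "uncertain" or confidence < confidence_threshold:
        return "human_review"
    return decision
```

For example, a response of `{"decision": "include", "confidence": 0.92, "rationale": "phase III RCT in NSCLC"}` would pass straight through as an inclusion, while the same decision at confidence 0.55 would be routed to human review. The threshold is a tunable parameter that Step 3 can recalibrate from disagreement data.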

Step 2: Human-in-the-Loop Validation

Two independent human annotators review the LLM’s decisions, examining PICO elements, trial design, and endpoint relevance. Annotators record include/exclude decisions and provide concise reasons for any disagreement. If there is no consensus, a third senior reviewer mediates, and final decisions are documented with justification. This step preserves domain expertise, ensures applicability to precision oncology, and creates a corrected, high-quality labeled set for future model refinement.
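The dual-annotator workflow above reduces to a simple adjudication rule, sketched below under the assumption that decisions are recorded as plain labels (e.g. "include"/"exclude") and that the second return value records how the final decision was reached, for the audit trail.

```python
from typing import Optional, Tuple

def adjudicate(annotator_1: str, annotator_2: str,
               senior_reviewer: Optional[str] = None) -> Tuple[str, str]:
    """Resolve two independent screening decisions for one article.

    Returns (final_decision, resolution_path), where the path is
    "consensus", "mediated", or "escalated" (awaiting a senior call).
    """
    # Agreement between the two annotators is accepted directly.
    if annotator_1 == annotator_2:
        return annotator_1, "consensus"
    # Disagreement with a senior decision available: record as mediated.
    if senior_reviewer is not None:
        return senior_reviewer, "mediated"
    # Disagreement not yet resolved: flag for escalation.
    return "pending", "escalated"
```

Logging the resolution path alongside each decision is what makes the corrected label set usable for later model refinement: consensus labels are high-confidence training data, while mediated cases highlight ambiguity worth encoding in prompts.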

Step 3: Consensus Scoring and Model Refinement

Disagreement data are analyzed to measure inter-annotator agreement and model reliability. Metrics such as Cohen’s kappa guide adjustments to prompts, instruction clarity, and confidence thresholds. The labeled decisions feed back into iterative model fine-tuning or prompt engineering, improving future pre-screen accuracy. Over time, the system evolves from a static tool to a learning-assisted workflow that adapts to emerging oncology trial designs and terminology.
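Cohen's kappa, the agreement metric named above, compares observed agreement against the agreement expected if the two annotators labelled independently. A self-contained sketch (libraries such as scikit-learn offer an equivalent `cohen_kappa_score`):

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators over the same set of articles."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of articles given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labelling, from each
    # annotator's marginal label frequencies.
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    if p_e == 1.0:  # both annotators used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

For instance, two annotators who agree on 3 of 4 articles, with balanced versus skewed label distributions, yield a kappa of 0.5: substantial raw agreement discounted by what chance alone would produce. Tracking this value over screening batches signals when prompts or confidence thresholds need adjustment.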

Benefits and Limitations

Adopting a human-LLM collaborative approach brings several advantages. It dramatically speeds up article screening, reduces manual workload, and enhances consistency across reviewers. The audit trail of LLM rationales and human decisions improves transparency and reproducibility in evidence synthesis. However, limitations include potential AI biases, dependence on the quality of the initial labeled data, and the need for ongoing human oversight to correct edge cases and evolving medical terminology. Regular performance evaluations and carefully designed prompts are essential to mitigate these risks.

Practical Considerations and Future Directions

Practical deployment should start with a well-curated training set of precision oncology RCTs and a clearly defined screening protocol. Interfaces should support easy extraction of PICO terms, confidence scores, and rationale, along with straightforward disagreement resolution. Looking ahead, adaptive prompting, few-shot learning with curated examples, and occasional model recalibration can sustain performance as the oncology literature evolves. Expanding to other domains of clinical research is feasible with domain-specific prompts and validation workflows.
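Few-shot learning with curated examples, mentioned above, amounts to prepending labelled screening decisions to the prompt sent to the LLM. A minimal sketch, assuming illustrative record fields `title`, `abstract`, and `label` drawn from the validated set produced in Step 2:

```python
def build_screening_prompt(examples: list, candidate: dict) -> str:
    """Assemble a few-shot screening prompt from curated labelled examples.

    `examples` is a list of dicts with illustrative keys "title",
    "abstract", and "label"; `candidate` is the unlabelled article.
    """
    parts = [
        "Classify each article as include, exclude, or uncertain for a "
        "precision oncology RCT systematic review. Give a one-sentence "
        "rationale covering design, population, intervention, and outcomes."
    ]
    # Curated, human-validated examples anchor the label definitions.
    for ex in examples:
        parts.append(f"Title: {ex['title']}\nAbstract: {ex['abstract']}\n"
                     f"Label: {ex['label']}")
    # The candidate article ends with an open "Label:" for the model.
    parts.append(f"Title: {candidate['title']}\n"
                 f"Abstract: {candidate['abstract']}\nLabel:")
    return "\n\n".join(parts)
```

Rotating which validated examples appear in the prompt, prioritising recently mediated edge cases, is one lightweight way to implement the adaptive prompting and periodic recalibration described above.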

Conclusion

A human-LLM collaborative annotation framework offers a pragmatic path to faster, more reliable screening of precision oncology RCT articles. By combining AI-powered triage with structured human validation and consensus-building, researchers can achieve rigorous inclusion criteria, maintain transparency, and shorten the time from search to evidence synthesis.