Introduction
Large language models (LLMs) promise to transform exercise and health coaching by delivering personalized training plans, real-time feedback, and motivational support. Yet translating this potential into safe, effective practice requires rigorous evaluation that can handle multimodal inputs—text reports, video-based posture analysis, and physiological sensor data—while ensuring safety and personalization. This scoping review maps the current landscape of evaluation strategies for LLM-based health and exercise coaching, identifies gaps, and proposes a framework to guide future validation efforts.
Rationale for a Scoping Review
Standard benchmarks such as MMLU or domain-specific medical tests do not capture the prescriptive, safety-critical nature of coaching AI. Sports benchmarks (eg, SCBench, SPORTU) and knowledge bases (eg, SportQA) do exist, but they assess descriptive tasks and fall short when it comes to evaluating safe, real-time corrective feedback and dynamic plan adaptation. Evaluation practice is therefore fragmented, ranging from movement classification metrics to user surveys and expert panels, which complicates cross-study comparison and evidence-based refinement.
Objectives
The review seeks to:
- Identify evaluation methods used for AI health and exercise coaches;
- Summarize strengths, limitations, and validation approaches (eg, user feedback, expert ratings, real-world testing);
- Develop a conceptual framework to guide future evaluations;
- Highlight gaps and directions for research to support rigorous, standardized validation.
Methods at a Glance
Following Arksey and O’Malley’s framework and the PRISMA-ScR extension, we searched PubMed, Web of Science, Google Scholar, arXiv, medRxiv, and bioRxiv for studies evaluating LLM-based coaching systems in exercise, fitness, sport, and rehabilitation. A pilot screening round clarified the inclusion criterion for performance evaluation, ensuring that only articles explicitly reporting an evaluation strategy (accuracy, expert scoring, usability, or benchmarks) were included. Data extraction captured model type, application, inputs/outputs, datasets, and evaluation methods. An interrater agreement test indicated high reliability (Cohen’s kappa = 0.88).
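For readers less familiar with the statistic, the agreement value can be reproduced from two reviewers’ include/exclude decisions, as in the minimal sketch below. The decision vectors are hypothetical; the review’s actual screening records are not reproduced here.

```python
# Minimal sketch: Cohen's kappa from two reviewers' include/exclude decisions.
# The decision vectors below are hypothetical, not the review's screening data.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
# p_e is the agreement expected by chance from each reviewer's marginals.
from sklearn.metrics import cohen_kappa_score

reviewer_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 1 = include, 0 = exclude
reviewer_b = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

print(f"Cohen's kappa: {cohen_kappa_score(reviewer_a, reviewer_b):.2f}")
```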
Key Findings: A Multidimensional but Fragmented Landscape
We identified 20 eligible studies, published mainly between 2023 and 2025. The evidence shows two dominant clusters:
- Application-focused work emphasizing human-centered metrics, such as usability and perceived personalization, often via expert panels or user surveys.
- Technique-focused work emphasizing automated, objective metrics, such as movement analysis accuracy or plan generation quality.
Across studies, methodological depth varied considerably. Central findings are the absence of real-world validation contexts in many evaluations and the limited reporting of interrater reliability for subjective judgments. Real-user data were used in fewer than half of the studies, and only a minority used validated human-rating instruments. Consequently, a gap persists between technical capability and clinically meaningful validation.
A Conceptual Framework for Future Evaluations
To address this fragmentation, we propose a multidimensional evaluation framework that links AI coach capabilities to three core metric pillars:
- Automated performance—objective benchmarks assessing task-specific accuracy and safety.
- Human evaluation—validated scales and expert ratings capturing personalization, empathy, and usability.
- Study design metadata—real-world or simulated contexts, data provenance, and reliability reporting.
Bringing these together requires a shift toward multidimensional studies that preserve ecological validity while maintaining rigor. Framework-driven studies can enable better cross-study comparability and more robust conclusions about safety and effectiveness.
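To make the three pillars concrete, the sketch below shows one possible way to record a study’s evaluation as a structured object. The schema and field names are illustrative assumptions, not a published reporting standard.

```python
# Hypothetical schema for recording one study's evaluation along the three
# pillars of the proposed framework. Field names are illustrative only.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AutomatedPerformance:
    benchmark_name: str                          # eg, a movement-classification test set
    accuracy: Optional[float] = None
    safety_violation_rate: Optional[float] = None

@dataclass
class HumanEvaluation:
    instrument: str                              # validated scale or expert rubric
    n_raters: int = 0
    interrater_kappa: Optional[float] = None
    dimensions: list = field(default_factory=list)  # eg, personalization, empathy, usability

@dataclass
class StudyDesignMetadata:
    context: str                                 # "real-world" or "simulated"
    data_provenance: str                         # real users, synthetic vignettes, public dataset
    reliability_reported: bool = False

@dataclass
class CoachEvaluationRecord:
    system_name: str
    automated: AutomatedPerformance
    human: HumanEvaluation
    design: StudyDesignMetadata

# Hypothetical example record (all values invented for illustration)
record = CoachEvaluationRecord(
    system_name="example-coach",
    automated=AutomatedPerformance("squat-form-benchmark", accuracy=0.91),
    human=HumanEvaluation("expert rubric", n_raters=3, interrater_kappa=0.72,
                          dimensions=["personalization", "usability"]),
    design=StudyDesignMetadata("simulated", "synthetic vignettes", True),
)
```

Recording studies in a common structure of this kind is one way framework-driven work could support the cross-study comparability argued for above.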
Risk Mitigation and Ethical Considerations
Privacy, explainability, and bias remain central concerns for AI coaching. Multimodal data (including video and sensor streams) heighten privacy risks, underscoring the need for secure deployment (eg, open-source models run on premises) and privacy-preserving training approaches such as federated learning. Explainability through retrieval-augmented generation (RAG) can ground advice in evidence and improve trust, while continuous bias auditing and diverse co-design processes help ensure equity across populations.
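As a simple illustration of how retrieval can keep advice traceable to evidence, the sketch below assembles a citation-bearing prompt from retrieved passages. The retriever, corpus, and prompt wording are assumptions for illustration and do not describe any specific system from the reviewed studies.

```python
# Minimal sketch of retrieval-augmented prompting for explainable coaching
# advice. The passages, sources, and prompt wording are hypothetical.
def build_grounded_prompt(question: str, retrieved: list) -> str:
    """Assemble a prompt that asks the model to cite retrieved evidence by number."""
    evidence_block = "\n".join(
        f"[{i + 1}] {doc['text']} (source: {doc['source']})"
        for i, doc in enumerate(retrieved)
    )
    return (
        "Answer the coaching question using ONLY the evidence below, "
        "citing passages by number.\n\n"
        f"Evidence:\n{evidence_block}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical retrieved passages
passages = [
    {"text": "Increase training load gradually from week to week.", "source": "guideline-A"},
    {"text": "Avoid loaded spinal flexion during acute low-back pain.", "source": "guideline-B"},
]
print(build_grounded_prompt("How quickly should I add weight to my deadlift?", passages))
```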
Future Directions
We highlight two near-term priorities:
- Integrate RAG to improve factual grounding and develop evaluation metrics that assess alignment with retrieved sources.
- Adopt scalable evaluation frameworks such as the Adaptive Precise Boolean Framework to replace time-consuming Likert-style measures, enabling faster, more reliable assessments with higher interrater agreement.
In practice, progress should combine automated benchmarks, expert-driven assessments, and longitudinal real-world studies, supported by shared datasets and standardized protocols, to accelerate rigorous and comparable validation of AI coaching tools.
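As a minimal illustration of the first priority above, the sketch below scores how much of a coach’s advice is lexically covered by the retrieved sources. Token overlap is a crude stand-in for the entailment-based or expert checks a real evaluation would use, and the per-sentence supported/unsupported decisions echo the boolean-item style of assessment mentioned in the second priority; the function names and thresholds are assumptions for illustration.

```python
# Illustrative sketch: a crude alignment metric for RAG-based coaching advice,
# defined as the share of advice sentences whose tokens are mostly covered by
# at least one retrieved source passage.
import re

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))

def grounding_score(advice: str, sources: list, threshold: float = 0.5) -> float:
    """Fraction of advice sentences with >= `threshold` token coverage
    by some retrieved source passage."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", advice.strip()) if s]
    source_tokens = [_tokens(s) for s in sources]
    supported = 0
    for sentence in sentences:
        toks = _tokens(sentence)
        if not toks:
            continue
        coverage = max((len(toks & st) / len(toks) for st in source_tokens), default=0.0)
        if coverage >= threshold:
            supported += 1
    return supported / max(len(sentences), 1)

# Hypothetical advice and retrieved passages
advice = ("Increase load gradually from week to week. "
          "Avoid loaded spinal flexion while acute low-back pain persists.")
sources = ["Progressive overload: increase load gradually from week to week.",
           "During acute low-back pain, avoid loaded spinal flexion."]
print(f"Grounding score: {grounding_score(advice, sources):.2f}")
```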
Limitations
Limitations include the exclusion of commercial systems with limited public documentation, heterogeneity among the included studies, and the absence of long-term behavioral outcome data. The scoring tool used here was developed internally and has not been externally validated, so its results should be interpreted as indicators of methodological depth rather than of absolute quality.
Conclusions
Evaluating LLM-based exercise and health coaches demands a holistic approach that marries technical rigor with real-world validity. This scoping review offers a conceptual framework to bridge current gaps and calls for a layered, scalable evaluation paradigm to support safe, effective, and equitable AI coaching in health and fitness.