Categories: Health Tech / AI in Health

Evaluation Strategies for LLM-Based Exercise & Health Coaching: A Scoping Review

Introduction

Large language models (LLMs) hold promise for personalized exercise and health coaching, offering capabilities from adaptive training plan generation to real-time movement feedback and motivational support. Yet translating that potential into safe, effective practice requires robust evaluation across multimodal data streams (text reports, posture analysis, sensor data) and against personalized outcomes. This scoping review summarizes how researchers are currently evaluating LLM-based coaches in health and exercise, identifies methodological gaps, and proposes a framework to guide future validation efforts.

Why a Scoping Review?

The field is rapidly evolving and highly heterogeneous. Standard benchmarks used in general AI or medicine fail to capture the prescriptive, safety-critical, and user-centered nature of coaching. This review therefore aims to map the landscape, describe prevalent evaluation strategies (and their limits), and outline a multidimensional framework that can standardize future research while accommodating the field’s diversity.

Objectives

  • Identify evaluation methods used to assess LLM-based health and exercise coaches.
  • Summarize strengths, limitations, and validation approaches (e.g., user feedback, expert scoring, real-world testing).
  • Develop a conceptual framework to guide future evaluations and highlight gaps for research.

Methods Overview

Following Arksey and O’Malley’s scoping framework and the PRISMA-ScR extension, two reviewers identified studies from six databases (PubMed, Web of Science, Google Scholar, arXiv, medRxiv, bioRxiv) and used a three-domain search strategy (LLMs, exercise/health coaching, evaluation). A pilot screening ensured clarity in inclusion criteria, especially around what constitutes an evaluation rather than mere description of outputs. A 5-point Evaluation Rigor Score (ERS) was developed to gauge methodological depth, considering real-world validation, data sources, instrument validity, interrater reliability, and presence of a comparative baseline.
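
To make the scoring logic concrete, here is a minimal sketch of how a five-criterion rigor score of this kind could be computed, assuming one point per criterion; the review's exact operational definitions and weighting may differ.

```python
# Minimal sketch of a five-criterion Evaluation Rigor Score (ERS),
# assuming one point per satisfied criterion (an assumption, not the
# review's exact rubric).
from dataclasses import dataclass, fields

@dataclass
class ErsCriteria:
    real_world_validation: bool   # tested in a real-world or simulated user context
    real_user_data: bool          # evaluation drew on data from real users
    validated_instruments: bool   # human ratings used validated scales
    interrater_reliability: bool  # agreement between raters was reported
    comparative_baseline: bool    # compared against a baseline or control

def ers_score(criteria: ErsCriteria) -> int:
    """Total ERS (0-5): the count of satisfied criteria."""
    return sum(int(getattr(criteria, f.name)) for f in fields(criteria))

study = ErsCriteria(True, True, False, False, True)
print(ers_score(study))  # -> 3
```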

Key Findings

Across 20 included studies (2023–2025), the landscape shows two persistent clusters: (1) application-focused work emphasizing human-centered evaluation (personalization, engagement, empathy) and (2) technique-focused work emphasizing automated metrics (movement analysis, accuracy). Overall ERS scores revealed variability, with a median of 2.5 and only a minority reaching high rigor. Real-world or simulated user contexts and the use of real user data were not universal, and interrater reliability was inconsistently reported. Human-rating metrics dominated, while automated benchmarks were less common yet valuable for objective task assessment.

A Conceptual Framework for Future Evaluation

The review proposes a multidimensional framework that aligns AI coach capabilities with three evaluation pillars: automated performance metrics, human ratings, and study-design metrics. A truly robust validation would integrate objective benchmarks (e.g., movement analysis), validated human scales (e.g., usability, motivational interviewing integrity), and ecologically valid study designs (real-world settings with user data). This bridge between technical rigor and ecological validity is essential to establish trustworthy, effective AI coaching tools.
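
As a rough illustration only, the three pillars could be organized as a simple evaluation-plan structure; the metric names below are hypothetical examples, not instruments prescribed by the review.

```python
# Hypothetical grouping of candidate metrics under the three pillars;
# every entry is an illustrative example, not a required instrument.
from dataclasses import dataclass, field

@dataclass
class EvaluationPlan:
    automated_metrics: list = field(default_factory=lambda: [
        "joint-angle error vs. a motion-capture reference",
        "factual accuracy of generated training plans",
    ])
    human_ratings: list = field(default_factory=lambda: [
        "System Usability Scale (SUS)",
        "motivational interviewing integrity rating",
    ])
    study_design: list = field(default_factory=lambda: [
        "real-world or simulated deployment with real user data",
        "comparison against a human coach or rule-based baseline",
    ])

plan = EvaluationPlan()
covers_all_pillars = all([plan.automated_metrics, plan.human_ratings, plan.study_design])
```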

Risks, Ethical Considerations, and Recommendations

Privacy, explainability, and bias are critical barriers to real-world deployment. Multimodal inputs raise privacy concerns, particularly when data are processed by proprietary cloud services. Explainability is needed to build trust, especially when AI provides exercise prescriptions. Bias can propagate inequities across populations. Recommended directions include: (1) prioritizing open-source or privacy-preserving approaches (e.g., federated learning); (2) grounding recommendations with Retrieval-Augmented Generation (RAG) that cites evidence-based sources; (3) adopting bias auditing and co-design with diverse user groups.
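
The grounding idea in point (2) can be sketched in a few lines: retrieve guideline snippets, then ask the model to answer only from those snippets and cite them. The corpus, the keyword-overlap retriever, and the commented-out generate() call below are all placeholders (assumptions), not a specific library's API.

```python
# Minimal RAG sketch: ground a coaching answer in cited guideline snippets.
GUIDELINE_SNIPPETS = [
    {"id": "WHO-2020-PA", "text": "Adults should do 150-300 minutes of moderate-intensity aerobic activity per week."},
    {"id": "ACSM-RT", "text": "Resistance training of major muscle groups is recommended on 2 or more days per week."},
]

def retrieve(query: str, k: int = 2) -> list:
    """Rank snippets by naive keyword overlap (stand-in for a real retriever)."""
    q = set(query.lower().split())
    scored = sorted(GUIDELINE_SNIPPETS,
                    key=lambda s: len(q & set(s["text"].lower().split())),
                    reverse=True)
    return scored[:k]

def build_grounded_prompt(user_query: str) -> str:
    """Assemble a prompt that instructs the model to answer only from the cited sources."""
    sources = retrieve(user_query)
    context = "\n".join(f"[{s['id']}] {s['text']}" for s in sources)
    return (
        "Answer the question using only the sources below and cite their IDs.\n"
        f"Sources:\n{context}\n\nQuestion: {user_query}\nAnswer:"
    )

# response = generate(build_grounded_prompt("How much cardio should I do each week?"))
# `generate` is a stand-in for whichever LLM call a given system uses.
```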

Future Directions

Two main paths emerge:
– Implement structured, evidence-grounded RAG systems with standardized evaluation metrics that measure factual accuracy and source grounding, guided by SCORE-like principles (Safety, Clinical Consensus, Objectivity, Reproducibility, Explainability).
– Move toward scalable evaluation via Adaptive Precise Boolean Frameworks, replacing long Likert scales with Yes/No rubrics to enable faster, more reliable interrater agreement and broader deployment (a minimal sketch of Boolean-rubric agreement follows this list). Phased adoption, starting with automated benchmarks and expert reviews and then expanding to longitudinal, real-world studies, can balance feasibility with rigor.
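
The sketch below illustrates how agreement on a Yes/No rubric could be quantified for two expert raters. The rubric items and ratings are invented for illustration; Cohen's kappa is simply the standard chance-corrected agreement statistic and is not specific to any framework named above.

```python
# Interrater agreement on a Boolean (Yes/No) rubric via Cohen's kappa.
def cohens_kappa(rater_a: list, rater_b: list) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_yes_a, p_yes_b = sum(rater_a) / n, sum(rater_b) / n
    expected = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Two experts answer the same Boolean rubric items for one coaching transcript,
# e.g., "Is the set/rep prescription safe?", "Does the plan cite its evidence?", ...
rater_a = [True, True, False, True, False, True]
rater_b = [True, True, False, False, False, True]
print(round(cohens_kappa(rater_a, rater_b), 2))  # -> 0.67
```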

Limitations and Conclusions

The scoping review excludes several commercial systems due to limited publicly available documentation, which may bias findings toward academia. The customized ERS, while useful for this context, is not externally validated. Heterogeneity in data and methods limits cross-study comparability. Still, the synthesis highlights a clear need for multidimensional, standardized evaluation that blends technical performance with human-centered validation. Embracing RAG, scalable evaluation methods, and inclusive design can accelerate the development of safe, effective LLM-based exercise and health coaching.