Introduction
Basal cell carcinoma (BCC) is the most common nonmelanoma skin cancer worldwide; its incidence is rising, with significant implications for patients’ cosmetic outcomes and for healthcare systems. Dermatoscopy enhances lesion visualization, but diagnostic accuracy can vary among observers. In recent years, deep learning (DL) algorithms trained on dermatoscopic images have emerged as a promising, objective tool to assist clinicians in identifying BCC. This article synthesizes the latest evidence from a systematic review and meta-analysis assessing the diagnostic performance of DL models compared with human experts.
Aims
The review sought to quantify how well dermatoscopy-based DL models detect BCC and to compare their performance with that of dermatologists and, where available, pathologists. It also examined sources of variability and the generalizability of findings across internal and external validation settings.
Methods
The study followed PRISMA-DTA guidelines and registered its protocol in PROSPERO (CRD42025633947). A search of PubMed, Embase, and Web of Science up to November 2024 identified dermatoscopy-based DL studies focused on BCC. Eligibility required original work applying DL to dermatoscopic images with extractable 2×2 diagnostic data (TP, FP, FN, TN) or equivalent metrics. Fifteen studies met the inclusion criteria, most employing retrospective designs and convolutional neural network (CNN) architectures. Internal validation datasets dominated (about 32,000 images), with a single external validation cohort (~200 images) reported. The reference standard was mainly histopathology, sometimes supplemented by expert consensus or clinical follow-up.
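To make the extracted 2×2 data concrete, the following minimal Python sketch shows how per-study sensitivity and specificity are derived from TP, FP, FN, and TN counts. The counts here are hypothetical and are not taken from any included study.

```python
# Hypothetical 2x2 counts for a single study (illustrative only):
# prediction = DL model output, reference = histopathology result.
tp, fp, fn, tn = 96, 4, 8, 192

sensitivity = tp / (tp + fn)   # proportion of BCC lesions correctly flagged
specificity = tn / (tn + fp)   # proportion of non-BCC lesions correctly cleared

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
# sensitivity = 0.92, specificity = 0.98
```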
What Was Analyzed
Key outcomes included sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) for internal validation, external validation, and dermatologist performance. The authors pooled estimates with a bivariate random-effects model and evaluated heterogeneity with the I² statistic. They also explored potential moderators, including DL method, reference standard (RS), internal validation type, and image magnification.
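The bivariate model pools sensitivity and specificity jointly while accounting for between-study variance. As a rough illustration of the random-effects idea and of how I² summarizes heterogeneity, here is a univariate DerSimonian-Laird sketch in Python; the function name, the hypothetical counts, and the univariate simplification are illustrative assumptions, not the review’s actual analysis.

```python
import numpy as np

def pool_logit_proportions(events, totals):
    """DerSimonian-Laird random-effects pooling of logit-transformed
    proportions (e.g., per-study sensitivities), with the I^2 statistic.
    A univariate simplification of the bivariate model used in the review,
    which pools sensitivity and specificity jointly."""
    events = np.asarray(events, dtype=float)
    totals = np.asarray(totals, dtype=float)

    # 0.5 continuity correction guards against zero cells
    p = (events + 0.5) / (totals + 1.0)
    y = np.log(p / (1 - p))                                      # logit transform
    v = 1.0 / (events + 0.5) + 1.0 / (totals - events + 0.5)     # approximate variance

    w = 1.0 / v                                                  # fixed-effect weights
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)                           # Cochran's Q
    k = len(y)

    # Between-study variance (tau^2) and I^2 heterogeneity statistic
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0

    w_re = 1.0 / (v + tau2)                                      # random-effects weights
    y_re = np.sum(w_re * y) / np.sum(w_re)
    pooled = 1.0 / (1.0 + np.exp(-y_re))                         # back-transform to a proportion
    return pooled, i2

# Hypothetical per-study true positives and diseased-case counts:
tp = [96, 180, 45, 310]
n_diseased = [104, 200, 46, 330]
pooled_sens, i2 = pool_logit_proportions(tp, n_diseased)
print(f"pooled sensitivity = {pooled_sens:.2f}, I^2 = {i2:.0f}%")
```

An I² near 90%, as reported in the review, means most of the observed variation in study estimates reflects true between-study differences rather than sampling error.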
Results
In internal validation, DL models achieved a pooled sensitivity of 0.96 and specificity of 0.98, with an AUC of 0.99, indicating near-perfect discrimination in controlled datasets. By comparison, dermatologists reached a sensitivity of 0.75 and specificity of 0.97, with an AUC of 0.96. The difference in AUC between DL and clinicians on internal data was statistically significant (P = 0.008). Heterogeneity was substantial for both sensitivity (I² ≈ 92%) and specificity (I² ≈ 93%), driven largely by variations in the RS across studies.
External validation data were limited to a single study. In that cohort, DL models had a sensitivity of 0.88 and specificity of 0.99, suggesting strong performance while leaving generalizability beyond internal cohorts uncertain. Subgroup analyses showed similar performance for CNN versus non-CNN approaches and across RS definitions, with no consistent superiority of any single DL architecture or RS. Tests for publication bias did not indicate notable asymmetry.
Discussion
The meta-analysis indicates that dermatoscopy-based DL models can outperform dermatologists on internal validation datasets, largely owing to their capacity to process large, high-dimensional image data and extract subtle features. However, the clinical generalizability of these findings remains uncertain, given the limited external validation and the marked heterogeneity in study designs, reference standards, and image sources. The reliance on public datasets in many studies may also limit real-world applicability. The authors highlighted the need for prospective, externally validated studies and standardized reporting to ensure reliable performance across diverse clinical settings.
Clinical Implications and Future Directions
DL in dermatoscopy holds promise for supporting primary-care screening and improving diagnostic speed, especially in resource-limited or remote areas. Yet, before routine deployment, robust external validation, prospective study designs, and cost-effectiveness analyses are essential. Standardizing reference standards and reporting will help align future comparisons and guide integration into dermatology workflows.
Limitations
Most included studies were retrospective, with only one prospective study. Reference-standard definitions varied, and many analyses relied on public image datasets, potentially inflating performance estimates. Only a minority of studies compared DL performance against a broad panel of clinicians, complicating extrapolation to real-world practice.
Conclusion
Deep learning algorithms trained on dermatoscopic images show strong diagnostic potential for detecting BCC, especially on internal validation sets, and may augment clinical decision-making. External validation remains a critical gap, and future research should prioritize prospective designs, diverse populations, and real-world testing to determine whether DL tools can reliably improve patient outcomes in everyday dermatology practice.