Evaluating AI Triage Tools: Navigating the Ground Truth Gap

Written by Dr Annabelle Painter | Feb 10, 2026

AI-driven triage is transforming the entry point into healthcare—whether through symptom checkers at home, digital intake systems in urgent care, or clinical decision-support tools in emergency departments.

These tools typically generate both a triage urgency and a differential diagnosis, aiming to guide patients to the right level of care.

Yet evaluating their performance remains challenging because, unlike laboratory tests, triage decisions lack an objective ground truth. What “should” have happened at first presentation is rarely observable, often subjective, and shaped by the triage decision itself.

The Purpose of Triage

The primary goal of triage is risk stratification, not diagnostic accuracy. It ensures that patients who could suffer serious harm if not seen urgently are prioritised, while lower-risk patients are safely managed with less urgency. Triage occurs before investigations, a full history, or often a definitive diagnosis.

Many serious conditions initially present with common or non-specific symptoms. For example, a thunderclap headache may ultimately be a benign migraine, but urgent triage is appropriate because it could represent a life-threatening subarachnoid haemorrhage.

Effective triage prioritises sensitivity to serious risk over diagnostic precision. Some degree of over-triage is acceptable to minimise missed high-acuity cases. Evaluation frameworks should judge triage based on the reasonableness of urgency given the information available, not on hindsight or final diagnosis.

The Ground Truth Gap

Triage is inherently probabilistic. Many presentations—chest pain, fever, headache—are “rule-out” scenarios where the goal is safe, timely action rather than precise diagnosis. Outcome-based evaluation is flawed for two reasons:

  1. Downstream events are influenced by triage. A patient triaged to emergency care undergoes investigations that would not have occurred had they been directed to self-care, creating circularity.

  2. Benign outcomes do not imply triage error. A patient ultimately diagnosed with a simple migraine may initially present like a subarachnoid haemorrhage; safe triage requires accounting for serious but unlikely possibilities.

Differential Diagnosis Challenges

Many tools generate a differential diagnosis in addition to urgency, but benchmarking this output is difficult:

  • In lower-acuity settings, patients often never receive a definitive diagnosis.

  • Metrics that count whether the “true” diagnosis appears on the list—or its rank—can be misleading. Differential lists are probability-ordered, not single-answer predictions. For rare conditions, it may be clinically appropriate to rank more common alternatives higher. Penalising the tool for this reflects hindsight bias rather than clinical reasoning.

Evaluation of differential diagnosis should therefore consider probabilistic reasoning and clinical utility, not simple list-position metrics.
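
To make the contrast concrete, here is a minimal Python sketch for the subset of cases where a confirmed final diagnosis does exist. It assumes the tool exposes per-condition probability estimates (many tools only return a ranked list), and the condition names and numbers are invented for illustration:

```python
import math

# Hypothetical differential output from a triage tool: condition -> probability.
# These conditions and probabilities are illustrative only.
differential = {
    "migraine": 0.55,
    "tension headache": 0.30,
    "subarachnoid haemorrhage": 0.02,
    "other": 0.13,
}
final_diagnosis = "migraine"  # confirmed later, where such confirmation exists

# Naive list-position metric: was the confirmed diagnosis in the top 3?
ranked = sorted(differential, key=differential.get, reverse=True)
top3_hit = final_diagnosis in ranked[:3]

# Probability-aware alternative: log-loss of the confirmed diagnosis under the
# tool's stated probabilities. Ranking a common benign cause above a rare
# serious one is not penalised, provided the probabilities are honest.
log_loss = -math.log(differential.get(final_diagnosis, 1e-9))

print(f"top-3 hit: {top3_hit}, log-loss: {log_loss:.3f}")
```

The point is not that log-loss is the single right answer, but that any metric should engage with the probabilities the tool is actually expressing rather than treating the differential as a single-answer prediction.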

Clinician Judgement as a Reference Standard

Because outcomes and diagnoses cannot serve as definitive ground truths, many studies benchmark tools against clinician decisions. While useful, this approach has limitations:

  • Subjectivity: Different clinicians have varying risk thresholds; inter-rater agreement is modest.

  • Vignette distortion: Text-based vignettes summarise cases and omit dynamic history-taking. Performance on vignettes often overestimates real-world performance.

  • Ceiling effect: Benchmarking against clinicians caps measurable performance; an algorithm that genuinely outperforms clinicians will simply register as disagreement with the reference standard, so its advantage cannot be detected.

The Limitations of Vignette Testing

Vignettes are convenient but problematic for summative evaluation:

  • Low fidelity: Real patients provide incomplete, contradictory information; vignettes strip out this messiness and present a tidy summary.

  • Finite information: Vignettes are static; history-taking is iterative.

  • Training-data leakage: Some tools are tested on cases resembling their training data, inflating results.

  • Unrepresentative case mix: Vignettes often oversample rare or extreme presentations.

Vignette-based validation is insufficient to establish real-world safety or reliability.

Over-Triage and Under-Triage: Core Accuracy Dimensions

Triage decisions are multi-class and ordinal (emergency, urgent, same-day, routine, self-care). The most clinically meaningful metrics, sketched in code after this list, are:

  • Under-triage: Assigning lower urgency than appropriate; the primary safety concern.

  • Over-triage: Assigning higher urgency than necessary; affects operational efficiency.
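
As a minimal sketch of how both rates fall out of ordinal labels, assuming each disposition maps onto the five-level scale above and a clinician reference standard is available (the toy data is invented):

```python
# Ordinal urgency scale, from lowest to highest acuity.
LEVELS = ["self-care", "routine", "same-day", "urgent", "emergency"]
RANK = {level: i for i, level in enumerate(LEVELS)}

def triage_rates(predicted, reference):
    """Return (under_triage_rate, over_triage_rate) against a reference standard."""
    pairs = list(zip(predicted, reference))
    under = sum(RANK[p] < RANK[r] for p, r in pairs)  # assigned too low
    over = sum(RANK[p] > RANK[r] for p, r in pairs)   # assigned too high
    return under / len(pairs), over / len(pairs)

# Toy data: one case under-triaged, one over-triaged, one correct.
pred = ["routine", "emergency", "urgent"]
ref = ["same-day", "urgent", "urgent"]
print(triage_rates(pred, ref))  # (0.333..., 0.333...)
```

Reporting the two rates separately, rather than folding them into one accuracy figure, keeps the safety signal (under-triage) visible.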

Traditional metrics like sensitivity, specificity, or a single accuracy number are misleading:

  • Triage is multi-class, not binary.

  • Single numbers hide asymmetric risks—missing a life-threatening case is far worse than unnecessary over-triage.

  • Metrics are prevalence-dependent; rare high-acuity conditions can skew apparent accuracy.

Leading organisations and regulators increasingly prioritise under-triage rates, miss rates for high-acuity conditions, and clinically acceptable over-triage bounds.
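
One way to operationalise these priorities is an explicit cost matrix over the ordinal scale, penalising under-triage more steeply than over-triage. A minimal sketch follows; the 3:1 weighting is an arbitrary placeholder, not a clinically validated value, and a real deployment would set the weights under clinical governance:

```python
import numpy as np

LEVELS = ["self-care", "routine", "same-day", "urgent", "emergency"]
IDX = {level: i for i, level in enumerate(LEVELS)}
n = len(LEVELS)

# cost[r][p]: penalty for predicting level p when the reference level is r.
# Under-triage (p below r) is weighted more heavily than over-triage.
UNDER_W, OVER_W = 3.0, 1.0  # illustrative ratio only
cost = np.array([[UNDER_W * (r - p) if p < r else OVER_W * (p - r)
                  for p in range(n)] for r in range(n)])

def mean_cost(predicted, reference):
    """Average asymmetric cost of predictions against the reference standard."""
    return float(np.mean([cost[IDX[r], IDX[p]] for p, r in zip(predicted, reference)]))

pred = ["routine", "emergency", "urgent"]
ref = ["same-day", "urgent", "urgent"]
print(mean_cost(pred, ref))  # under-triage dominates: (3 + 1 + 0) / 3 ≈ 1.33
```

A miss rate restricted to high-acuity reference cases can be reported alongside the mean cost, keeping the cases that matter most in view.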

Context Matters

Evaluation must consider clinical context:

  • Primary care: Focuses on demand management and early detection.

  • Emergency care: Emphasises rapid escalation and resource flow.

  • Public-facing or virtual care: Prioritises safety, comprehension, and avoiding unnecessary attendance.

Comparisons across tools must account for prevalence, workflow, and operational environment; otherwise, results are misleading.

What Robust Evaluation Looks Like

High-quality evaluation combines multiple approaches:

  1. Real-world prospective case studies: Capture missing data, ambiguous histories, and variability.

  2. Multiple clinician judgements: Consensus or blinded majority voting provides a more stable reference; report inter-rater variability (see the sketch after this list).

  3. Outcome-based safety checks: Monitor downstream events (deterioration, ED attendance, hospitalisation) to identify under-triage.

  4. Appropriate metrics: Use ordinal, weighted, and risk-sensitive metrics rather than single-number accuracy scores.

  5. Prospective deployment studies: Assess real-world reliability and patient safety across populations and settings.
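
As a minimal sketch of point 2, assuming several clinicians independently triage the same cases (the panel data below is invented, and a weighted kappa statistic would be the more standard agreement measure):

```python
from itertools import combinations
from statistics import median

LEVELS = ["self-care", "routine", "same-day", "urgent", "emergency"]
RANK = {level: i for i, level in enumerate(LEVELS)}

# Hypothetical panel: each row is one case, each column one clinician's triage.
panel = [
    ["urgent", "urgent", "same-day"],
    ["routine", "same-day", "routine"],
    ["emergency", "emergency", "emergency"],
]

# Consensus reference: the median ordinal level across raters, which is more
# stable than any single clinician's call.
reference = [LEVELS[int(median(RANK[c] for c in case))] for case in panel]

def pairwise_disagreement(panel):
    """Mean absolute pairwise gap in ordinal levels (0 = perfect agreement)."""
    gaps = [abs(RANK[a] - RANK[b])
            for case in panel for a, b in combinations(case, 2)]
    return sum(gaps) / len(gaps)

print(reference)                     # ['urgent', 'routine', 'emergency']
print(pairwise_disagreement(panel))  # ~0.44 levels of disagreement
```

Reporting the disagreement figure alongside any benchmark result makes clear how much headroom the reference standard itself leaves.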

Conclusion

Evaluating AI triage tools is inherently complex due to the absence of ground truth. Differential diagnosis adds further complexity, and vignette- or single-clinician benchmarks often inflate performance.

Safe, clinically meaningful evaluation requires real-world prospective studies, multi-source clinician input, outcome-based safety checks, and metrics that respect the probabilistic and ordinal nature of triage and differential diagnosis.

Accuracy in AI triage is less about a single number and more about a safety philosophy, encompassing under-triage, over-triage, red flag detection, probabilistic differential diagnosis, input robustness, and context-aware evaluation. Embracing this multidimensional framework ensures AI triage tools are safe, reliable, and clinically valuable.