These tools typically generate both a triage urgency recommendation and a differential diagnosis, aiming to guide patients to the right level of care.
Yet evaluating their performance remains challenging because, unlike laboratory tests, triage decisions lack an objective ground truth. What “should” have happened at first presentation is rarely observable, often subjective, and shaped by the triage decision itself.
The primary goal of triage is risk stratification, not diagnostic accuracy. It ensures that patients who could suffer serious harm if not seen urgently are prioritised, while lower-risk patients are safely managed with less urgency. Triage occurs before investigations, a full history, or often a definitive diagnosis.
Many serious conditions initially present with common or non-specific symptoms. For example, a thunderclap headache may ultimately be a benign migraine, but urgent triage is appropriate because it could represent a life-threatening subarachnoid haemorrhage.
Effective triage prioritises sensitivity to serious risk over diagnostic precision. Some degree of over-triage is acceptable to minimise missed high-acuity cases. Evaluation frameworks should judge triage based on the reasonableness of urgency given the information available, not on hindsight or final diagnosis.
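To make this asymmetry concrete, here is a minimal sketch (in Python) of cost-weighted triage scoring. The five-level scale mirrors the one used later in this piece, and the penalty weights are purely illustrative assumptions, not validated clinical costs.

```python
# A minimal sketch of cost-weighted triage scoring. The penalty weights are
# illustrative assumptions: under-triage (less urgency than the reference)
# is penalised four times as heavily per level as over-triage, reflecting
# the safety asymmetry described above.

LEVELS = ["emergency", "urgent", "same-day", "routine", "self-care"]
UNDER_TRIAGE_PENALTY = 4.0  # hypothetical cost per level of insufficient urgency
OVER_TRIAGE_PENALTY = 1.0   # hypothetical cost per level of excess urgency

def triage_cost(reference: str, predicted: str) -> float:
    """Cost of one triage decision relative to a reference standard."""
    gap = LEVELS.index(predicted) - LEVELS.index(reference)
    # gap > 0 means the tool chose *less* urgency than the reference: under-triage.
    return gap * UNDER_TRIAGE_PENALTY if gap > 0 else -gap * OVER_TRIAGE_PENALTY

def mean_cost(pairs):
    """Average cost over (reference, predicted) pairs."""
    return sum(triage_cost(r, p) for r, p in pairs) / len(pairs)

# One two-level under-triage outweighs two one-level over-triages combined.
decisions = [("emergency", "same-day"), ("routine", "same-day"), ("urgent", "emergency")]
print(mean_cost(decisions))  # (2*4.0 + 1.0 + 1.0) / 3 ≈ 3.33
```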
Triage is inherently probabilistic. Many presentations—chest pain, fever, headache—are “rule-out” scenarios where the goal is safe, timely action rather than precise diagnosis. Outcome-based evaluation is flawed for two reasons: the eventual outcome is shaped by the care delivered after triage, so a good outcome can follow a poor decision, and the final diagnosis rests on information that was not available at first presentation.
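The “rule-out” framing can be expressed as a simple expected-harm comparison. The sketch below uses hypothetical harm and cost values to show why even a low-probability serious cause can justify escalation.

```python
# A minimal sketch of "rule-out" logic as an expected-harm comparison.
# All numbers are hypothetical; the point is that the escalation threshold
# depends on the probability of a serious cause, not the most likely one.

def should_escalate(p_serious: float,
                    harm_if_missed: float = 100.0,
                    cost_of_urgent_workup: float = 1.0) -> bool:
    """Escalate when the expected harm of not acting exceeds the cost of acting."""
    return p_serious * harm_if_missed > cost_of_urgent_workup

# A thunderclap headache: migraine is far more likely, but even a small
# probability of subarachnoid haemorrhage justifies urgent assessment.
print(should_escalate(p_serious=0.05))  # True: 0.05 * 100 = 5 > 1
```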
Many tools generate a differential diagnosis in addition to urgency, but benchmarking this output is difficult: at first presentation there is rarely a single correct diagnosis, plausible conditions differ widely in prior probability, and common list-position metrics (such as top-3 accuracy) reward rank order rather than clinical usefulness.
Evaluation of differential diagnosis should therefore consider probabilistic reasoning and clinical utility, not simple list-position metrics.
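As a sketch of the difference, the hypothetical scoring below contrasts a top-k hit metric with a probability-aware utility score. The differential format, the “must-not-miss” set, and the omission penalty are all illustrative assumptions rather than an established standard.

```python
# A minimal sketch contrasting a list-position metric (top-k hit) with a
# probability-aware score. The scoring scheme is an illustrative assumption.

def top_k_hit(differential, reference_dx, k=3):
    """Classic list-position metric: was the eventual diagnosis in the top k?"""
    ranked = sorted(differential, key=differential.get, reverse=True)
    return reference_dx in ranked[:k]

def utility_score(differential, plausible, must_not_miss):
    """Probability mass placed on clinically plausible diagnoses, with a
    penalty for omitting dangerous conditions entirely."""
    mass = sum(p for dx, p in differential.items() if dx in plausible)
    omitted = [dx for dx in must_not_miss if dx not in differential]
    return mass - 0.5 * len(omitted)  # 0.5 is an arbitrary illustrative penalty

ddx = {"migraine": 0.70, "tension headache": 0.20, "subarachnoid haemorrhage": 0.10}
print(top_k_hit(ddx, "subarachnoid haemorrhage"))                 # True
print(utility_score(ddx, plausible={"migraine", "subarachnoid haemorrhage"},
                    must_not_miss={"subarachnoid haemorrhage"}))  # 0.8
```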
Because outcomes and diagnoses cannot serve as definitive ground truths, many studies benchmark tools against clinician decisions. While useful, this approach has limitations: clinicians disagree with one another at meaningful rates, individual decisions reflect local practice patterns and personal risk tolerance, and a single clinician's judgment is a noisy reference rather than a gold standard.
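One mitigation is to benchmark against several clinicians at once, as in the minimal sketch below: the consensus (median) label serves as the reference, and inter-clinician agreement, rather than 1.0, is the realistic ceiling. All ratings here are hypothetical, with levels coded 0 (emergency) through 4 (self-care).

```python
# A minimal sketch of multi-clinician benchmarking with hypothetical ratings.

from statistics import median

def consensus(labels):
    """Median of ordinal labels; robust to one outlying rater."""
    return int(median(labels))

def exact_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Three clinicians rate the same five cases; the tool is scored against consensus.
clinicians = [[0, 2, 1, 3, 4],
              [0, 1, 1, 3, 3],
              [1, 2, 2, 3, 4]]
tool       =  [0, 2, 1, 2, 4]

ref = [consensus(case) for case in zip(*clinicians)]     # [0, 2, 1, 3, 4]
print("tool vs consensus:", exact_agreement(tool, ref))  # 0.8

# Pairwise inter-clinician agreement is the realistic ceiling, not 1.0.
pairs = [(0, 1), (0, 2), (1, 2)]
ceiling = sum(exact_agreement(clinicians[i], clinicians[j]) for i, j in pairs) / len(pairs)
print("clinician ceiling:", round(ceiling, 2))  # 0.47
```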
Vignettes are convenient but problematic for summative evaluation: they present clean, complete, and unambiguous information that real patients rarely provide, and strong performance on them says little about robustness to the incomplete, noisy input of real consultations.
Vignette-based validation is insufficient to establish real-world safety or reliability.
Triage decisions are multi-class and ordinal (emergency, urgent, same-day, routine, self-care). The most clinically meaningful metrics are: the under-triage rate (the safety-critical error), the over-triage rate (the efficiency cost), sensitivity for high-acuity presentations, and the magnitude of ordinal error, that is, how many levels a decision misses by.
Leading organisations and regulators increasingly prioritise under-triage rates, miss rates for high-acuity conditions, and clinically acceptable over-triage bounds.
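A minimal sketch of these metrics, assuming levels coded 0 (emergency) through 4 (self-care) and hypothetical reference labels:

```python
# A minimal sketch of the ordinal triage metrics named above.
# All labels are hypothetical; 0 = emergency .. 4 = self-care.

def triage_metrics(reference, predicted, high_acuity=(0, 1)):
    n = len(reference)
    under = sum(p > r for r, p in zip(reference, predicted))  # less urgent than needed
    over  = sum(p < r for r, p in zip(reference, predicted))  # more urgent than needed
    high = [(r, p) for r, p in zip(reference, predicted) if r in high_acuity]
    caught = sum(p in high_acuity for _, p in high)
    return {
        "under_triage_rate": under / n,
        "over_triage_rate": over / n,
        "high_acuity_sensitivity": caught / len(high) if high else None,
    }

reference = [0, 1, 2, 3, 4, 1, 2]
predicted = [0, 2, 1, 3, 3, 1, 2]
print(triage_metrics(reference, predicted))
# {'under_triage_rate': 0.143, 'over_triage_rate': 0.286, 'high_acuity_sensitivity': 0.667}
```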
Evaluation must consider clinical context: the prevalence of serious disease in the target population, where the tool sits in the care pathway, and whether its output advises a clinician or is acted on directly by patients.
Comparisons across tools must account for prevalence, workflow, and operational environment; otherwise, results are misleading.
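Prevalence alone can change headline numbers dramatically. The sketch below applies Bayes' theorem to show how the same sensitivity and specificity yield very different positive predictive values in a high-prevalence emergency department versus low-prevalence primary care; all figures are illustrative.

```python
# A minimal sketch of why prevalence matters when comparing tools across
# settings. Sensitivity, specificity, and prevalences are illustrative.

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for setting, prev in [("emergency department", 0.10), ("primary care", 0.01)]:
    print(setting, round(ppv(0.95, 0.80, prev), 3))
# emergency department 0.345
# primary care 0.046
```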
High-quality evaluation combines multiple approaches: prospective real-world studies, reference standards drawn from multiple clinicians, outcome-based safety checks, and robustness testing against incomplete or noisy input.
Evaluating AI triage tools is inherently complex due to the absence of ground truth. Differential diagnosis adds further complexity, and vignette- or single-clinician benchmarks often inflate performance.
Safe, clinically meaningful evaluation requires real-world prospective studies, multi-source clinician input, outcome-based safety checks, and metrics that respect the probabilistic and ordinal nature of triage and differential diagnosis.
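As one example of an outcome-based safety check, a prospective study might compare the upper confidence bound of the observed under-triage rate against a pre-specified safety threshold. The sketch below uses the Wilson score interval; the counts and the threshold are chosen purely for illustration.

```python
# A minimal sketch of an outcome-based safety check: is the upper 95%
# confidence bound of the under-triage rate below a pre-specified threshold?
# Counts and threshold are illustrative assumptions.

from math import sqrt

def wilson_upper(events: int, n: int, z: float = 1.96) -> float:
    """Upper bound of the Wilson score interval for a proportion."""
    p = events / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre + margin) / denom

under_triaged, cases, threshold = 12, 2000, 0.01
upper = wilson_upper(under_triaged, cases)
print(f"under-triage rate {under_triaged/cases:.4f}, 95% upper bound {upper:.4f}")
# Here the point estimate (0.0060) sits under the 1% threshold, but the
# upper bound (0.0105) does not, so the check fails.
print("within safety threshold" if upper < threshold else "safety threshold breached")
```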
Accuracy in AI triage is less about a single number and more about a safety philosophy, encompassing under-triage, over-triage, red flag detection, probabilistic differential diagnosis, input robustness, and context-aware evaluation. Embracing this multidimensional framework ensures AI triage tools are safe, reliable, and clinically valuable.