Evaluating the real-world impact & effectiveness of Health AI
AI is transforming healthcare, but are the bold promises backed by real-world evidence? This blog post explores the different study types used to evaluate the effectiveness and impact of AI in real-world clinical settings.
Health AI—Promise vs. Reality
In tech-focused media, AI is frequently portrayed as a silver bullet for many global healthcare challenges. It promises more accurate diagnoses, a lighter workload for healthcare professionals through the automation of time-consuming tasks, and higher patient throughput, all with the aim of enhancing healthcare delivery and ultimately improving well-being for all. But have these promises become a reality?
As AI tools become increasingly embedded in healthcare, from disease surveillance to clinical decision support, there is a growing need to validate these promises with robust, context-sensitive evidence that demonstrates real-world impact beyond technical performance.
This blog post series is split into two parts. In this first part, we explore different study types used to evaluate AI in healthcare. The next part delves into key research questions that can help build stronger evidence on AI’s real-world impact and effectiveness.
Study Types to Evaluate AI in Healthcare
AI systems mature from a development stage to a deployment stage, and each stage requires distinct types of evidence on the AI system’s capabilities and limitations, as shown in Figure 1.
The problem with data-driven algorithms, such as those used in the field of AI (e.g. convolutional neural networks for image analysis), is that they may exhibit excellent performance during the development stage but perform much worse when applied in a real-world healthcare setting, where the data can differ from the data they were trained on. Especially in the medical field, rigorous validation studies are essential to evaluate an AI system’s performance at each stage and to mitigate the risk of harm to patients due to unforeseen limitations.
Development Stage: Proving Technical Performance
During the development stage, developers typically want to know whether the AI system performs well enough to be deployed in a real-world healthcare setting. To answer this question, they usually validate the system on previously collected datasets from a relevant clinical application setting. If the dataset was held out from the data used to train the algorithm, this type of validation is referred to as internal validation.
External validation involves assessing an algorithm’s predictive performance using a separate, yet relevant, dataset that was not involved in the development process. External validation using existing datasets offers a convenient, low-cost way to assess an AI system’s technical performance. However, its value is limited if data quality is poor—such as missing predictors, non-representative measurement methods, or lack of event timing—and it provides little insight into real-world deployment performance or clinical impact.
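The difference between internal and external validation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the "model" is a single biomarker threshold, and the `shift` parameter is a hypothetical stand-in for a different site or measurement device. It is not a real validation pipeline.

```python
import random

def make_dataset(n, shift, seed):
    """Synthetic (biomarker, has_disease) pairs. `shift` is a hypothetical
    offset that mimics a different hospital or measurement device."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        has_disease = rng.random() < 0.3  # assumed 30% prevalence
        mean = 2.0 if has_disease else 0.0
        data.append((rng.gauss(mean + shift, 1.0), has_disease))
    return data

def fit_threshold(train):
    """'Train' a one-parameter model: the midpoint between the class means."""
    pos = [x for x, y in train if y]
    neg = [x for x, y in train if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def sensitivity_specificity(data, threshold):
    """Classify as diseased when the biomarker is at or above the threshold."""
    tp = sum(1 for x, y in data if y and x >= threshold)
    fn = sum(1 for x, y in data if y and x < threshold)
    tn = sum(1 for x, y in data if not y and x < threshold)
    fp = sum(1 for x, y in data if not y and x >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Development data: part is used for training, the rest is held out.
development = make_dataset(1000, shift=0.0, seed=1)
train, holdout = development[:800], development[800:]

# External data: never seen during development, measured under shifted conditions.
external = make_dataset(1000, shift=1.0, seed=2)

threshold = fit_threshold(train)
internal_sens, internal_spec = sensitivity_specificity(holdout, threshold)
external_sens, external_spec = sensitivity_specificity(external, threshold)

print(f"internal: sensitivity={internal_sens:.2f}, specificity={internal_spec:.2f}")
print(f"external: sensitivity={external_sens:.2f}, specificity={external_spec:.2f}")
```

In this toy setup the threshold learned on the development data no longer fits the shifted external data, so specificity drops. Real datasets differ in far more complex ways, but the pattern is the reason external validation is worth the effort.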
Deployment Stage: Testing Real-World Clinical Impact
After developers are satisfied with the algorithm’s performance on an external validation dataset, they may be excited to try translating the algorithm into a real-world clinical setting. Only studies applying the AI tool prospectively in a real-world clinical setting can verify an AI system’s functionality within live clinical workflows. Beyond technical performance, prospective studies can assess whether healthcare providers can use the AI reliably and efficiently, and capture how human-AI interactions affect medical decisions and health outcomes.
Unfortunately, many vendors base claims about the clinical effectiveness of their AI systems on evidence from retrospective validation studies, as shown in studies by van Leeuwen, Ciecierski-Holmes's 2022 research, and more recent analyses by Han and Macheka. Prospective validation studies, however, remain rare for most existing AI systems.
Prospective Study Designs
Multiple types of prospective studies exist to evaluate different aspects of the real-world impact and effectiveness of AI in healthcare.
Prospective cohort studies, for example, enrol study participants with certain characteristics, expose them to an AI tool and observe the resulting health outcomes over time. The disadvantage of this study type is that it is more time-consuming and more expensive than retrospective validation studies.
Another type of prospective validation study is the randomized controlled trial (RCT). RCTs provide the highest level of evidence for the functionality and clinical benefit of medical interventions, such as AI systems. In an RCT, participants are randomly assigned to either an intervention group (where AI is used) or a control group (where traditional methods are used) to investigate whether the AI intervention leads to greater improvement (e.g. better health outcomes, lower costs, faster diagnosis) than existing treatments or protocols. To date, only a few AI systems have been evaluated as an intervention in RCTs. AI-assisted colonoscopy is one of the few use cases for which several randomized controlled trials have been conducted to assess whether it improves the detection rate of colon lesions compared to procedures performed without AI support.
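The core logic of an RCT, random assignment followed by a comparison between the two groups, can be sketched in a few lines. This is a simplified illustration and not a trial analysis plan: the function names are made up for this example, the detection counts are hypothetical, and real trials use pre-registered protocols, stratified or blocked randomization, and more careful statistics.

```python
import math
import random

def randomize(participant_ids, seed):
    """Simple 1:1 randomization: shuffle the participants, then split in half."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]  # (AI arm, control arm)

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_a - p_b, z, p_value

ai_arm, control_arm = randomize(range(1000), seed=42)

# Hypothetical outcome counts, invented purely for illustration:
# lesions detected in 180 of 500 AI-assisted and 140 of 500 standard procedures.
difference, z, p_value = two_proportion_z_test(180, len(ai_arm), 140, len(control_arm))
print(f"difference in detection rate: {difference:.3f}, z={z:.2f}, p={p_value:.4f}")
```

Because assignment is random, the two arms should differ only by chance in everything except the use of AI, which is what allows a difference in outcomes to be attributed to the intervention.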
For other use cases, RCTs may not be feasible because they are expensive, time-consuming, or even unethical when assignment to either the AI group or the control group would put patients at a substantial disadvantage. When conducted as longitudinal studies over several years, prospective studies can also evaluate the long-term impact of AI interventions, including effects on patient well-being and cost savings. However, the high cost of longitudinal studies is often an obstacle to conducting them.
Meta-Analyses and Systematic Reviews
Once several studies address a specific question, systematic reviews and meta-analyses can synthesize their findings to provide a more robust and comprehensive answer. However, meta-research often faces the challenge that studies are hard to compare: they vary greatly in design, population, outcomes measured, and AI tools used, and they yield differing conclusions. To date, meta-studies that synthesize comprehensive evidence on the real-world effectiveness and impact of AI systems in healthcare are still lacking for most applications.
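To make the idea of synthesis concrete, here is a minimal sketch of one common pooling approach, fixed-effect inverse-variance weighting, together with the I² statistic that quantifies how much the studies disagree. The effect sizes and standard errors are hypothetical, and the function name is made up for this example. Published meta-analyses often use random-effects models and dedicated statistical software instead.

```python
import math

def pool_fixed_effect(effects, std_errors):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I-squared.

    effects: per-study effect estimates (e.g. log risk ratios)
    std_errors: the corresponding standard errors
    """
    weights = [1 / se ** 2 for se in std_errors]  # precise studies count more
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of variation attributed to heterogeneity rather than chance.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared

# Hypothetical study results, invented purely for illustration.
study_effects = [0.20, 0.40]
study_std_errors = [0.10, 0.10]

pooled, pooled_se, q, i_squared = pool_fixed_effect(study_effects, study_std_errors)
print(f"pooled effect={pooled:.2f} (SE {pooled_se:.3f}), Q={q:.2f}, I^2={i_squared:.0%}")
```

A high I² is a numerical expression of the comparability problem described above: when studies differ strongly in design, population, or AI tool, a single pooled number can be misleading.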
Real-World Impact: Case Examples
For a few AI applications, meta-analyses have pointed out an evidence gap on the impact of AI systems in healthcare. For example, while AI-assisted colonoscopy has been shown to improve adenoma detection rates, it also results in the increased removal of non-neoplastic polyps, and its overall impact on health outcomes remains uncertain. For AI-based detection of diabetic retinopathy (a common complication of diabetes that can lead to a loss of sight), evidence exists that it delivers better value for money than current manual grading of retina images, but evidence on social impacts is missing. For AI applications in global health, the limited evidence from prospective studies and on cost-effectiveness poses a barrier to the successful adoption of AI systems to strengthen healthcare systems in low- and middle-income countries.
Thanks for reading this far! I hope this blog post provided some insights into the various approaches for validating medical AI tools, particularly in generating evidence about their clinical impact and effectiveness.
It's also important to recognize that many questions remain unanswered regarding how AI performs in real-world healthcare settings. If you're interested in exploring the key research questions that could help close this evidence gap, check out part two of this article, coming out next week!
Have questions or thoughts on the topic? We welcome your insights—feel free to share your comments or start a conversation below.
Jana Fehr is an AI researcher with a passion for the transformative potential of artificial intelligence in healthcare. With a background in biotechnology and machine learning, she enjoys bridging the gap between technology and medicine to develop responsible practices for integrating AI into medical practice.







