New AI Monitoring Method Helps Convey When to Trust AI Predictions and When to Exercise Caution

Although AI tools are becoming increasingly integrated into medical imaging workflows, distrust in AI predictions often persists in the form of one key question: how can physicians know when to trust a prediction made by a black-box AI tool?

A new study from the Stanford Radiology AI Development and Evaluation (AIDE) Lab, published October 16 in npj Digital Medicine, illustrates how the Ensembled Monitoring Model (EMM) framework can act like a real-time second-opinion system for deployed AI tools. EMM evaluates how much confidence can be placed in the AI prediction, helping physicians decide whether to rely on the result or take a closer look.

Why This Matters

Radiology AI tools are often deployed without mechanisms to monitor their reliability in real-world use. Because it’s generally not possible to see how black-box AI tools make their decisions or what kinds of data they were trained on, the task of determining whether an AI prediction is trustworthy falls to the physician.

“This extra burden on physicians can lead to reduced efficiency, misdiagnoses, and ultimately hesitancy to use AI tools,” said Zhongnan Fang, PhD, Principal Machine Learning Scientist and lead author of the study. “EMM is like a real-time quality check that tells you if you should be confident about what the AI tool is telling you.”

How It Works

Inspired by how doctors often seek second opinions, EMM runs alongside the primary AI model as an ensemble of five independent AI submodels trained for the same diagnostic task. Each submodel processes the same input image, and its output is compared with the primary model’s prediction. If a majority of the submodels agree with the primary AI model, clinicians can have high confidence in the diagnosis. If there is little agreement, that signals a need for more scrutiny.
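The agreement check described above can be sketched in a few lines. This is an illustrative simplification, not the authors' code: the function name and binary labels are hypothetical, and the real EMM operates on full model outputs rather than single labels.

```python
# Illustrative sketch of EMM-style ensemble agreement (not the published
# implementation). Each submodel is assumed to emit a binary prediction
# (e.g., 1 = hemorrhage present, 0 = absent) for the same input image.

def emm_agreement(primary_pred: int, submodel_preds: list[int]) -> float:
    """Return the fraction of ensemble submodels agreeing with the primary model."""
    agree = sum(1 for p in submodel_preds if p == primary_pred)
    return agree / len(submodel_preds)

# Primary model predicts "hemorrhage" (1); four of five submodels concur,
# so agreement is high and the prediction can be treated as trustworthy.
score = emm_agreement(1, [1, 1, 0, 1, 1])  # -> 0.8
```

Because the comparison uses only the primary model's output, this style of monitoring needs no access to the monitored model's internals, which is what makes it compatible with black-box commercial systems.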

Crucially, EMM doesn’t need access to the inner workings of the commercial AI system it monitors, making it compatible with black-box systems already in use across hospitals.

How It Performed

The study authors tested EMM on over 2,900 CT scans, using it to quality-check both an FDA-cleared commercial AI model and an open-source model. They found that EMM reliably flagged cases where the primary model’s predictions were uncertain, especially for images containing subtle or ambiguous findings, while confirming high-confidence predictions for more obvious cases.

In more than half the cases, EMM increased the clinicians’ confidence in the AI’s output. In others, it helped flag cases that warranted extra attention. Importantly, EMM failed to detect an incorrect primary AI model output in only a small minority (4%) of cases.

Overview of EMM. Originally published in npj Digital Medicine.

Actionable Decision-Making to Improve Clinical Workflow

EMM doesn’t just assess confidence. It also suggests a next step. Using confidence thresholds, EMM categorizes the primary AI model’s predictions into increased, similar, or decreased confidence, helping clinicians know when to trust the AI and when to apply their own judgment.
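The thresholding step could look like the following sketch. The cut-off values here are hypothetical placeholders; as the study notes, the appropriate thresholds depend on disease prevalence and the clinical use case.

```python
# Hypothetical thresholds for illustration only; the paper tailors
# these cut-offs to prevalence rates and use cases.

def confidence_category(agreement: float,
                        high: float = 0.8,
                        low: float = 0.4) -> str:
    """Map an ensemble agreement score to an EMM-style action category."""
    if agreement >= high:
        return "increased confidence"  # safe to rely on the AI prediction
    if agreement <= low:
        return "decreased confidence"  # flag the case for clinician review
    return "similar confidence"        # no change from baseline trust

print(confidence_category(0.9))  # -> increased confidence
print(confidence_category(0.2))  # -> decreased confidence
```

Raising the `low` threshold catches more potential errors but generates more false alarms, which is exactly the trade-off the study's benefit analysis quantifies.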

The value of these suggested next steps was also evaluated by weighing the cognitive burden of reviewing “false alarms,” predictions from the primary AI model that EMM flagged as low confidence but that were actually correct. In this benefits-versus-drawbacks analysis, EMM delivered substantial accuracy gains by drawing radiologist attention to cases that truly needed a second look. For images containing an intracranial hemorrhage, EMM increased relative accuracy by up to 38% while maintaining false-alarm rates under 1%.

For images without intracranial hemorrhage, the benefits were more nuanced, with net gains at an intracranial hemorrhage prevalence rate typically seen in emergency departments. In contrast, at lower prevalence levels typically seen in inpatient or outpatient settings, the marginal accuracy improvements were outweighed by the burden of unnecessary reviews. These results highlight EMM’s effectiveness in high-stakes scenarios while also underscoring the importance of tailoring the confidence thresholds set for next steps to specific disease prevalence rates and use cases.

An example EMM use case is stratifying cases into categories of increased, similar, or decreased confidence in the primary model predictions based on the level of EMM agreement. Originally published in npj Digital Medicine.

Looking Ahead

EMM could become a key tool in the lifecycle monitoring of AI in medicine, which is a requirement recently emphasized by the FDA. The method could even help detect changes in AI behavior over time, such as performance drift caused by new scanners or shifting patient populations.

The AIDE Lab hopes EMM will be adapted for other diagnostic tasks beyond intracranial hemorrhage detection. To this end, the paper even includes data on how training dataset size, number of submodels, and submodel architecture affect EMM performance.

With the growing demand for transparency and accountability in AI-driven healthcare, EMM has the potential to make such systems both safer and more trustworthy.

For more information, contact the AI Development and Evaluation (AIDE) Lab at Stanford University.

This work has also been covered by AuntMinnie and in “The Wire” section of an ImagingWire newsletter.