The Utility of the Virtual Imaging Trials Methodology for Objective Characterization of AI Systems and Training Data

Fakrul Islam Tushar, Lavsen Dahal, Saman Sotoudeh-Paima, W. Paul Segars, Ehsan Abadi, Ehsan Samei, Joseph Y. Lo
Center for Virtual Imaging Trials, Carl E. Ravin Advanced Imaging Laboratories, Department of Radiology, Duke University School of Medicine, Durham, NC, 27708, USA

Virtual Imaging Trials Improved the Transparency and Reliability of AI Systems in COVID-19 Imaging. More resources are available at the Center for Virtual Imaging Trials.

Abstract

Purpose: The credibility of Artificial Intelligence (AI) models for medical imaging continues to be a challenge, affected by the diversity of models, the data used to train them, and the applicability of their combination to produce reproducible results on new data. In this work, we aimed to explore whether emerging Virtual Imaging Trials (VIT) methodologies can provide an objective resource to approach this challenge.

Approach: The study was conducted for the case example of COVID-19 diagnosis using clinical and virtual computed tomography (CT) and chest radiography (CXR) images processed with convolutional neural networks. Multiple AI models were developed and tested using 3D ResNet-like and 2D EfficientNetv2 architectures across diverse datasets.

Results: Model performance was evaluated using the area under the curve (AUC), with the DeLong method used for AUC confidence intervals. The models trained on the most diverse datasets showed the highest external testing performance, with AUC values ranging from 0.73 to 0.76 for CT and 0.70 to 0.73 for CXR. Internal testing yielded higher AUC values (0.77 to 0.85 for CT and 0.77 to 1.0 for CXR), highlighting a substantial drop in performance during external validation, which underscores the importance of diverse and comprehensive training and testing data. Most notably, the VIT approach provided an objective assessment of the utility of diverse models and datasets, while offering insight into the influence of dataset characteristics, patient factors, and imaging physics on AI efficacy.

Conclusions: The VIT approach enhances model transparency and reliability, offering nuanced insights into the factors driving AI performance and bridging the gap between experimental and clinical settings.

Study design overview. 12,844 CT and 25,219 CXR images for COVID-19 diagnosis were collected from 13 multi-center clinical datasets. Models were evaluated using internal, external, and virtually simulated testing cohorts.


XCAT phantom example. Four-dimensional extended cardiac-torso (XCAT) phantom developed at Duke University (Abadi et al., AJR 2020).


Phantom development workflow. Overview of COVID-19 computational phantom formation and simulated CT/CXR image generation pipeline.


COVID-19 abnormalities in XCAT phantom. Three distinct COVID-19 abnormality patterns with varying shapes embedded in a four-dimensional XCAT phantom (Abadi et al., AJR 2020).


Infection distribution. Spatial distribution of COVID-19 infections within the XCAT phantom model.


CT and CXR model performance. Confusion matrix showing case-level COVID-19 detection performance across training and testing datasets. AUC values are reported with 95% confidence intervals.
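As an illustration of how an AUC with a 95% confidence interval of the kind reported here can be computed, the following is a minimal sketch of the DeLong variance estimate for a single AUC. It is not the authors' evaluation code: the function name `delong_auc_ci`, its arguments, and the normal-approximation interval clipped to [0, 1] are assumptions made for this example.

```python
from statistics import NormalDist

import numpy as np


def delong_auc_ci(labels, scores, level=0.95):
    """Return (auc, lower, upper) using the DeLong variance estimate.

    Hypothetical helper for illustration; labels are 0/1, scores are
    model outputs where higher means more likely positive.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]    # scores of positive (e.g. COVID-19) cases
    neg = scores[~labels]   # scores of negative cases
    m, n = len(pos), len(neg)
    if m < 2 or n < 2:
        raise ValueError("need at least two cases in each class")

    # Pairwise kernel: 1 if positive outranks negative, 0.5 on ties, else 0.
    diff = pos[:, None] - neg[None, :]
    psi = (diff > 0).astype(float) + 0.5 * (diff == 0)

    auc = float(psi.mean())
    v10 = psi.mean(axis=1)  # structural components over positives
    v01 = psi.mean(axis=0)  # structural components over negatives

    # DeLong variance of the AUC estimate.
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n

    # Two-sided normal-approximation interval, clipped to the valid range.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * float(np.sqrt(var))
    return auc, max(0.0, auc - half_width), min(1.0, auc + half_width)


if __name__ == "__main__":
    auc, lo, hi = delong_auc_ci([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
    print(f"AUC = {auc:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The pairwise matrix costs O(m x n) memory, which is fine for a sketch; for large test cohorts a rank-based formulation of the same estimator would be preferable.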


Physics-based evaluation. Performance comparison across CT and CXR modalities under varying imaging dose conditions. Error bars represent 95% confidence intervals.


BibTeX


@misc{tushar2025utilityvirtualimagingtrials,
      title={The Utility of the Virtual Imaging Trials Methodology for Objective Characterization of AI Systems and Training Data}, 
      author={Fakrul Islam Tushar and Lavsen Dahal and Saman Sotoudeh-Paima and Ehsan Abadi and W. Paul Segars and Ehsan Samei and Joseph Y. Lo},
      year={2025},
      eprint={2308.09730},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2308.09730}, 
}