Purpose: The credibility of Artificial Intelligence (AI) models for medical imaging continues to be a challenge, affected by the diversity of models, the data used to train the models, and the applicability of their combination to produce reproducible results for new data. In this work, we aimed to explore whether emerging Virtual Imaging Trials (VIT) methodologies can provide an objective resource to approach this challenge.
Approach: The study was conducted for the case example of COVID-19 diagnosis using clinical and virtual computed tomography (CT) and chest radiography (CXR) processed with convolutional neural networks. Multiple AI models were developed and tested using 3D ResNet-like and 2D EfficientNetV2 architectures across diverse datasets.
Results: Model performance was evaluated using the area under the receiver operating characteristic curve (AUC) and the DeLong method for AUC confidence intervals. The models trained on the most diverse datasets showed the highest external testing performance, with AUC values ranging from 0.73 to 0.76 for CT and 0.70 to 0.73 for CXR. Internal testing yielded higher AUC values (0.77 to 0.85 for CT and 0.77 to 1.0 for CXR), highlighting a substantial drop in performance during external validation, which underscores the importance of diverse and comprehensive training and testing data. Most notably, the VIT approach provided an objective assessment of the utility of diverse models and datasets, while offering insight into the influence of dataset characteristics, patient factors, and imaging physics on AI efficacy.
Conclusions: The VIT approach enhances model transparency and reliability, offering nuanced insights into the factors driving AI performance and bridging the gap between experimental and clinical settings.
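The Results above report AUC values with DeLong confidence intervals. As a minimal sketch of how such an interval can be computed for a single model, the snippet below estimates the AUC and its DeLong variance from case-level labels and scores. It is not the authors' implementation: the function name `delong_auc_ci`, the normal-approximation interval, and the clipping of the bounds to [0, 1] are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist


def delong_auc_ci(y_true, y_score, level=0.95):
    """Return (auc, lower, upper) using the DeLong variance estimate.

    Hypothetical helper for illustration; assumes binary labels
    (1 = positive, 0 = negative) and one score per case.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]    # scores of positive cases
    neg = y_score[~y_true]   # scores of negative cases
    m, n = len(pos), len(neg)
    if m < 2 or n < 2:
        raise ValueError("need at least two positive and two negative cases")

    # Mann-Whitney kernel: 1 if positive > negative, 0.5 on ties, else 0.
    diff = pos[:, None] - neg[None, :]
    psi = (diff > 0).astype(float) + 0.5 * (diff == 0)

    auc = psi.mean()
    v10 = psi.mean(axis=1)   # structural component per positive case
    v01 = psi.mean(axis=0)   # structural component per negative case

    # DeLong variance of the AUC estimate.
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n

    # Two-sided normal-approximation interval (assumption), clipped to [0, 1].
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * np.sqrt(var)
    lower = max(0.0, auc - half_width)
    upper = min(1.0, auc + half_width)
    return float(auc), float(lower), float(upper)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(delong_auc_ci(labels, scores))
```

The pairwise kernel matrix uses memory proportional to the number of positive cases times the number of negative cases, which is acceptable for a sketch; rank-based formulations of the same estimator scale better for large cohorts.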
Study design overview. A total of 12,844 CT and 25,219 CXR images for COVID-19 diagnosis were collected from 13 multi-center clinical datasets. Models were evaluated using internal, external, and virtually simulated testing cohorts.
XCAT phantom example. Four-dimensional extended cardiac-torso (XCAT) phantom developed at Duke University (Abadi et al., AJR 2020).
Phantom development workflow. Overview of COVID-19 computational phantom formation and simulated CT/CXR image generation pipeline.
COVID-19 abnormalities in XCAT phantom. Three distinct COVID-19 abnormality patterns with varying shapes embedded in a four-dimensional XCAT phantom (Abadi et al., AJR 2020).
Infection distribution. Spatial distribution of COVID-19 infections within the XCAT phantom model.
CT and CXR model performance. Confusion matrix showing case-level COVID-19 detection performance across training and testing datasets. AUC values are reported with 95% confidence intervals.
Physics-based evaluation. Performance comparison across CT and CXR modalities under varying imaging dose conditions. Error bars represent 95% confidence intervals.
@misc{tushar2025utilityvirtualimagingtrials,
title={The Utility of the Virtual Imaging Trials Methodology for Objective Characterization of AI Systems and Training Data},
author={Fakrul Islam Tushar and Lavsen Dahal and Saman Sotoudeh-Paima and Ehsan Abadi and W. Paul Segars and Ehsan Samei and Joseph Y. Lo},
year={2025},
eprint={2308.09730},
archivePrefix={arXiv},
primaryClass={eess.IV},
url={https://arxiv.org/abs/2308.09730},
}