Health AI Data Resource

Curated, preprocessed, and research-ready medical imaging datasets for AI/ML development in radiology and healthcare

13 Datasets Available
30K+ CT Scans
25K+ Patients
100% Open Access

Why Choose HAID?

Accelerate your AI research with standardized, high-quality medical imaging data

📹 Introduction to HAID

Watch this video to learn about HAID's mission, dataset collection, and how to get started with our resources

📄 HAID Presentation

Download our comprehensive PDF presentation covering the complete HAID dataset collection, technical specifications, and research applications.

  • Complete dataset overview and statistics
  • Technical specifications and preprocessing details
  • Research applications and use cases
  • Citation guidelines and licensing information

📖 Preview HAID Presentation

Browse the presentation online or download the PDF for offline viewing

🤖 Interactive Learning with NotebookLM

Explore HAID resources through AI-powered conversations and get instant answers to your questions

✨ What you can do:

💬 Ask questions about specific datasets
🔍 Compare preprocessing pipelines
📊 Explore technical specifications
🎓 Learn about annotation methods
📚 Get citation information
🔬 Discover research applications
🚀 Launch NotebookLM
Powered by Google NotebookLM - Interactive AI assistant for HAID resources

📊 Standardized Preprocessing

All datasets follow consistent preprocessing pipelines with uniform spacing and formats

📚 Comprehensive Documentation

Detailed documentation, Jupyter notebooks, and reference implementations

Ready-to-Use Annotations

Expert-validated segmentations, bounding boxes, and clinical labels

🔬 Train/Val/Test Splits

Pre-defined splits for reproducible benchmarking and model evaluation
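
For data outside the collection, a reproducible split can be approximated by hashing patient IDs. This is a minimal sketch, not HAID's own procedure: the function name `assign_split`, the patient ID format, and the use of SHA-256 are assumptions, and the 80/10/10 ratio is the one listed under Preprocessing Pipeline on this page.

```python
import hashlib


def assign_split(patient_id, train=0.8, val=0.1):
    """Map a patient ID to 'train', 'val', or 'test' deterministically.

    The ID is hashed, so the assignment is stable across runs and machines,
    and every scan from one patient lands in the same split.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"


# Hypothetical patient IDs, used only to illustrate the proportions.
ids = [f"patient-{i:04d}" for i in range(1000)]
counts = {"train": 0, "val": 0, "test": 0}
for pid in ids:
    counts[assign_split(pid)] += 1
print(counts)
```

Splitting by patient rather than by scan avoids leaking the same anatomy into both training and test sets.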

Dataset Collection Overview

Comprehensive medical imaging datasets with standardized preprocessing and expert annotations for reproducible AI research

Research-Ready Collection

13 Curated Datasets
30K+ CT Volumes
25K+ Unique Patients
50K+ Annotated Lesions
10+ Countries
100% Open Access

Research Applications & Tasks

🎯 Detection & Localization

  • Lung nodule detection (6 datasets)
  • Universal lesion detection (8 organs)
  • COVID-19 pattern recognition
  • Multi-organ localization

🔬 Segmentation

  • Multi-reader nodule segmentation
  • Tumor boundary delineation
  • Organ segmentation (Vista3D)
  • AI-powered PiNS segmentation

📊 Classification & Diagnosis

  • Benign vs. malignant classification
  • Lung-RADS scoring
  • Histopathology-confirmed labels
  • Multi-label finding prediction

💡 Advanced Research

  • Radiomics feature extraction
  • Survival prediction modeling
  • Multi-institutional validation
  • Federated learning studies

Dataset Composition by Clinical Domain

🫁 Lung Cancer & Screening

10 Datasets
📍 Coverage
20,074 patients
22,000+ CT scans
60K+ nodules
Includes NSCLC-Radiomics, UniToChest, IMDCT, LNDb, LUNGx, LIDC-IDRI, LUNA25, NLST-3D, DLCS 2024, and DeepLesion (DeepLesion is also listed under Universal Lesion Detection)

🦠 COVID-19 Imaging

3 Datasets
📍 Coverage
17,567 patients
20,397+ CT scans
RT-PCR confirmed
Includes BIMCV-R, MIDRC-RICORD, and U-10, with multi-institutional validation and bilingual reports

🔬 Universal Lesion Detection

1 Dataset
📍 Coverage
10,594 patients
1,000 test volumes
8 body regions
DeepLesion with 4,927 annotated lesions spanning lung, liver, kidney, bone, abdomen, mediastinum, pelvis, and soft tissue

Technical Specifications & Data Quality

📊 Imaging Protocols

  • CT: 12 datasets (various vendors)
  • LDCT: 1 dataset (screening protocol)
  • Slice thickness: 1.0-5.0 mm
  • In-plane resolution: 0.3-1.0 mm
  • Resampled spacing: [0.7, 0.7, 1.25] mm
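
To see what the resampled spacing implies for volume size, the arithmetic can be sketched as follows. The function name `resampled_shape` and the example input volume are hypothetical, and reading `[0.7, 0.7, 1.25]` as (x, y, z) order is an assumption, since this page does not state the axis convention.

```python
def resampled_shape(shape, spacing, target=(0.7, 0.7, 1.25)):
    """Return the voxel grid size after resampling to `target` spacing (mm).

    The physical extent per axis (voxels * spacing) is preserved; the new
    voxel count is that extent divided by the target spacing, rounded.
    """
    return tuple(
        max(1, round(n * s / t)) for n, s, t in zip(shape, spacing, target)
    )


# Hypothetical input: 512 x 512 x 120 voxels at 0.8 x 0.8 x 2.5 mm.
new_shape = resampled_shape((512, 512, 120), (0.8, 0.8, 2.5))
print(new_shape)  # (585, 585, 240)
```

A thick-slice scan grows along z after resampling, which is worth knowing when budgeting memory for 3D models.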

Annotation Quality

  • Expert radiologists: Board-certified
  • Multi-reader: Up to 4 readers/case
  • Pathology-confirmed: 3 datasets
  • AI-augmented: PiNS + Vista3D
  • Clinical metadata: Age, staging, outcomes
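
When a case has several reader annotations, one common way to merge them is a voxel-wise agreement threshold. The sketch below assumes a 50% consensus rule and flat 0/1 masks; this page does not say how HAID combines readers, and `consensus_mask` is a hypothetical helper.

```python
def consensus_mask(reader_masks, min_agreement=0.5):
    """Combine binary masks from several readers into one consensus mask.

    A voxel is kept when at least `min_agreement` of the readers marked it.
    Masks are flat sequences of 0/1 with identical length.
    """
    if not reader_masks:
        raise ValueError("need at least one reader mask")
    n_readers = len(reader_masks)
    if any(len(m) != len(reader_masks[0]) for m in reader_masks):
        raise ValueError("all masks must have the same length")
    return [
        1 if sum(votes) / n_readers >= min_agreement else 0
        for votes in zip(*reader_masks)
    ]


# Four hypothetical readers annotating the same six voxels.
readers = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 0],
]
consensus = consensus_mask(readers)
strict = consensus_mask(readers, min_agreement=0.75)
print(consensus)  # [1, 1, 0, 0, 1, 0]
print(strict)     # [1, 0, 0, 0, 1, 0]
```

Raising `min_agreement` trades sensitivity for reader agreement, which matters when masks are used as training targets.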

🔧 Preprocessing Pipeline

  • Format: NIfTI standardization
  • Orientation: RAS coordinate system
  • Windowing: Lung & soft tissue
  • Normalization: HU standardization
  • Splits: 80/10/10 train/val/test
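
The windowing and normalization steps can be illustrated on individual Hounsfield-unit (HU) values. This is a sketch under stated assumptions: the window centers and widths below are commonly used display settings, not values confirmed by HAID, and `apply_window` is a hypothetical helper.

```python
def apply_window(hu, center, width):
    """Clip a Hounsfield-unit value to a window and scale it to [0, 1]."""
    low = center - width / 2
    high = center + width / 2
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


# Commonly used window settings (assumed here, not taken from HAID).
LUNG = {"center": -600, "width": 1500}
SOFT_TISSUE = {"center": 40, "width": 400}

# Example HU values ranging from air to dense tissue.
samples_hu = [-1000, -600, 0, 40, 400]
lung_view = [apply_window(v, **LUNG) for v in samples_hu]
soft_view = [apply_window(v, **SOFT_TISSUE) for v in samples_hu]
print(lung_view)
print(soft_view)
```

On full volumes the same clip-and-scale logic is usually applied with vectorized array operations rather than per value.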

📚 Documentation

  • Processing notebooks: Jupyter
  • Data dictionaries: Complete metadata
  • Citation guidelines: BibTeX ready
  • License info: CC-BY, Apache 2.0
  • DOI references: Zenodo, TCIA

Geographic & Institutional Distribution

🌎 North America

6 datasets from USA
NIH, Duke, NLST (33 centers), MIDRC consortium, TCIA archives

🌍 Europe

4 datasets from 4 countries
Netherlands, Italy, Portugal, Spain - multi-institutional collaborations

🌏 Asia

1 dataset from China
Multi-institutional (5 clinical centers) with pathology confirmation

🌐 Multi-National

2 datasets (aggregated)
U-10 collection spanning 10 public datasets globally

Data Discovery Dashboard

Complete catalog of curated medical imaging datasets with full details

📋 Available Datasets Checklist

🫁 Lung Cancer & Screening

9 datasets • 20K+ patients

🦠 COVID-19 Imaging

3 datasets • 17K+ patients

🔬 Universal Lesion Detection

1 dataset • 10K+ patients
• 1,000 test volumes
• 4,927 lesions
• 8 body regions