Model Card: NIH Chest X-ray Dataset
Overview
The chest X-ray is one of the most commonly accessible radiological examinations for screening and diagnosis of many lung diseases. However, computer-aided diagnosis (CAD) is still a work in progress. Toward that goal, the model analyzed in this card predicts whether an input X-ray image should be examined further for lung diseases.
Task Performed: Binary Classification
Input Type: Image
Output: A prediction score indicating whether the lungs are normal or abnormal.
Dataset
The NIH Chest X-ray Dataset comprises 112,120 X-ray images with 14 text-mined disease labels from 30,805 unique patients. The 14 disease labels are Atelectasis, Cardiomegaly, Consolidation, Edema, Effusion, Emphysema, Fibrosis, Hernia, Infiltration, Mass, Nodule, Pleural Thickening, Pneumonia, and Pneumothorax.
To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning.
License and Attribution
- There are no restrictions on the use of the NIH chest X-ray images.
- The data is provided by the NIH Clinical Center and is available through the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
- Wang et al. “ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases.” arXiv:1705.02315 [cs.CV], May 2017. arXiv.org, https://arxiv.org/abs/1705.02315.
(The data is also available as a Kaggle dataset.)
Model: v0
Problem Formulation
This model classifies a given X-ray image as either normal (no associated disease) or abnormal (one or more diseases); in other words, it performs binary classification.
Intended Use Cases
- Research: To further research on automated, deep-learning-based reading of chest X-rays for computer-aided diagnosis (CAD).
- Pretrained Weights: To provide pretrained weights for downstream tasks involving X-ray images.
- Promote: To promote the use of model cards for reporting models.
Uses to avoid
- This version of the model is not to be used in production to determine whether an X-ray image shows any lung disease.
- This model is not to be considered state of the art (SOTA).
Training Data
51,759 samples of the NIH Chest X-ray dataset are labeled with one or more diseases (multi-label). The label for these samples is converted to 1.
The remaining samples are labeled No Finding. The NLP-based labeling technique used by the dataset authors could not associate any disease with these samples. The label for these samples is converted to 0.
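The label conversion above can be sketched with the standard csv module. This is a minimal sketch, not the author's preprocessing code; it assumes the column names "Image Index", "Finding Labels", "Patient Age", and "Patient Gender" from Data_Entry_2017_v2020.csv, pipe-separated multi-labels, and integer ages. The function name load_binary_labels and the sample rows are hypothetical.

```python
import csv
import io

def load_binary_labels(f):
    """Read a Data_Entry-style CSV and map each image to 0 ("No Finding") or 1.

    Assumed columns: "Image Index", "Finding Labels", "Patient Age", "Patient Gender".
    """
    records = []
    for row in csv.DictReader(f):
        findings = row["Finding Labels"].split("|")  # multi-labels are pipe-separated
        label = 0 if findings == ["No Finding"] else 1
        records.append({
            "image": row["Image Index"],
            "label": label,
            "age": int(row["Patient Age"]),   # assumes integer ages
            "gender": row["Patient Gender"],  # "M" or "F"
        })
    return records

# Hypothetical rows for illustration only.
SAMPLE_CSV = """Image Index,Finding Labels,Patient Age,Patient Gender
img_a.png,Cardiomegaly,58,M
img_b.png,Cardiomegaly|Emphysema,58,M
img_c.png,No Finding,81,F
"""

records = load_binary_labels(io.StringIO(SAMPLE_CSV))
print([r["label"] for r in records])  # [1, 1, 0]
```

With the real file, the same function can be called as `with open("Data_Entry_2017_v2020.csv", newline="") as f: records = load_binary_labels(f)`.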
20,000 training images, 5,000 validation images, and 10,000 test images were used to train, validate, and test model:v0.
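The card does not say how the 20,000 / 5,000 / 10,000 images were drawn. One possibility, shown here purely as an assumption, is a seeded random sample of disjoint image indices; the function name split_indices and the seed are hypothetical.

```python
import random

def split_indices(n_total, n_train=20_000, n_val=5_000, n_test=10_000, seed=0):
    """Draw disjoint train/validation/test index lists of the given sizes."""
    if n_train + n_val + n_test > n_total:
        raise ValueError("requested split is larger than the dataset")
    rng = random.Random(seed)
    # Sample without replacement, then cut the sample into three parts.
    picked = rng.sample(range(n_total), n_train + n_val + n_test)
    return (picked[:n_train],
            picked[n_train:n_train + n_val],
            picked[n_train + n_val:])

train_idx, val_idx, test_idx = split_indices(112_120)
print(len(train_idx), len(val_idx), len(test_idx))  # 20000 5000 10000
```

This is an image-level split. Because the dataset has several images per patient (112,120 images from 30,805 patients), images of the same patient can end up in different splits; grouping by the "Patient ID" column before sampling avoids that leakage.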
Preprocessing: The original images are 1024 x 1024 pixels. They are resized to 256 x 256 pixels, and the pixel values of the resized images are then scaled down.
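The card does not name the resize method or the scaling factor. A minimal NumPy sketch, assuming 8-bit grayscale input, 4 x 4 block averaging as the resize (1024 / 256 = 4), and division by 255 as the scaling step; the function name preprocess is hypothetical.

```python
import numpy as np

def preprocess(image, out_size=256):
    """Downsample a square 8-bit grayscale image and rescale pixel values to [0, 1]."""
    h, w = image.shape
    if h != w or h % out_size:
        raise ValueError("expected a square image whose side is a multiple of out_size")
    factor = h // out_size  # 1024 -> 256 gives 4 x 4 blocks
    # Split rows and columns into (out_size, factor) blocks and average each block.
    blocks = image.reshape(out_size, factor, out_size, factor).astype(np.float32)
    resized = blocks.mean(axis=(1, 3))
    return resized / 255.0  # assumed scaling: 0-255 -> 0.0-1.0

x = np.random.randint(0, 256, size=(1024, 1024), dtype=np.uint8)
y = preprocess(x)
print(y.shape)  # (256, 256)
```

Block averaging is used only because it keeps the sketch dependency-free; an image library's bilinear or area resize would serve the same purpose.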
Figure 1: Samples from training set for binary classification
Model Architecture
model:v0 is trained from scratch with ResNet-50 as the backbone architecture.
The output of the global max pooling layer is passed through a ReLU-activated Dense layer with 512 units, followed by a dropout layer (drop rate of 0.2). The output layer is sigmoid-activated.
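The architecture described above maps onto a few lines of tf.keras. This is a sketch, not the author's code: the framework (suggested only by the .h5 file in the download section), the 256 x 256 x 3 input shape (grayscale replicated to three channels), and the function name build_model_v0 are assumptions.

```python
import tensorflow as tf

def build_model_v0(input_shape=(256, 256, 3)):
    """ResNet-50 (random init) -> global max pooling -> Dense(512, relu)
    -> Dropout(0.2) -> Dense(1, sigmoid)."""
    backbone = tf.keras.applications.ResNet50(
        include_top=False,       # drop the ImageNet classification head
        weights=None,            # trained from scratch: no pretrained weights
        input_shape=input_shape,
        pooling="max",           # global max pooling over the last feature map
    )
    x = tf.keras.layers.Dense(512, activation="relu")(backbone.output)
    x = tf.keras.layers.Dropout(0.2)(x)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # abnormal-probability score
    return tf.keras.Model(inputs=backbone.input, outputs=output, name="model_v0")

model = build_model_v0()
```

Using `pooling="max"` inside ResNet50 is equivalent to appending a GlobalMaxPooling2D layer after the backbone.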
Figure 2: Model architecture
Training related specifics
- An Adam optimizer with a learning rate of 0.001 is used.
- Binary cross-entropy loss is used.
- The model is trained with early stopping.
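The training settings listed above can be expressed in tf.keras as follows. This is a sketch under stated assumptions: the card gives the optimizer, learning rate, and loss, but not the early-stopping monitor or patience, so `monitor="val_loss"`, `patience=5`, and `restore_best_weights=True` are assumed values. The helper names are hypothetical.

```python
import tensorflow as tf

def compile_for_training(model, learning_rate=0.001):
    """Compile with Adam (lr=0.001) and binary cross-entropy for a sigmoid output."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=["accuracy"],
    )
    return model

def early_stopping(patience=5):
    # Assumed settings: the card does not state the monitored metric or the patience.
    return tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=patience, restore_best_weights=True
    )

# Usage with hypothetical training/validation arrays:
# compile_for_training(model)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stopping()])
```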
Evaluation
Evaluation is done on the held-out test set, with the ROC curve and the test error rate as evaluation metrics.
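Both metrics can be computed from the test labels and sigmoid scores. A minimal pure-Python sketch, assuming a 0.5 decision threshold for the error rate (the card does not state one); in practice scikit-learn's `roc_curve` and `roc_auc_score` do the same job more efficiently.

```python
def error_rate(labels, scores, threshold=0.5):
    """Fraction of images misclassified when the sigmoid score is thresholded (assumed 0.5)."""
    wrong = sum((s >= threshold) != bool(y) for y, s in zip(labels, scores))
    return wrong / len(labels)

def roc_curve(labels, scores):
    """(FPR, TPR) points obtained by sweeping the threshold over every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy values for illustration only (not results from model:v0).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(error_rate(labels, scores))      # 0.25
print(auc(roc_curve(labels, scores)))  # 0.75
```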
Model Bias
The Data_Entry_2017_v2020.csv file that comes with the NIH Chest X-ray dataset contains class labels as well as patient data. The patient data provided are:
- Gender: Male or Female
- Age: Continuous value
Neither age nor gender was provided to the model during training.
Bias towards Gender
The model is evaluated separately on the male-only (blue) and female-only (orange) subsets of the test data.
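Evaluating on gender subsets amounts to grouping test predictions by the "Patient Gender" column and computing the error rate per group. A minimal sketch, assuming a 0.5 decision threshold; the function name and toy values are hypothetical.

```python
from collections import defaultdict

def error_rate_by_group(labels, scores, groups, threshold=0.5):
    """Test error rate computed separately for each group value, e.g. "M" and "F"."""
    wrong = defaultdict(int)
    total = defaultdict(int)
    for y, s, g in zip(labels, scores, groups):
        wrong[g] += int((s >= threshold) != bool(y))  # misclassified at this threshold?
        total[g] += 1
    return {g: wrong[g] / total[g] for g in total}

# Toy values for illustration only (not results from model:v0).
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.3, 0.6, 0.7, 0.1]
genders = ["M", "M", "F", "F", "M", "F"]
print(error_rate_by_group(labels, scores, genders))
```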
Observations
- The model gives better predictions for X-rays of male patients.
- This points to a gender imbalance in the training dataset.
- The bias comes from the dataset.
Bias towards Age Groups
The continuous ages are bucketed using the edges [0, 10, 20, 30, 40, 50, 60, 70, 80, 90].
The model is evaluated on each bucket separately to measure its performance per age group.
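The bucketing and per-bucket evaluation can be sketched with `bisect`. Assumptions: buckets are left-closed (an age of 10 falls in "10-20"), ages of 90 and above go into a "90+" bucket, and the decision threshold is 0.5. Returning the sample count per bucket also makes it easy to check how few samples the oldest groups contain. All names are hypothetical.

```python
import bisect
from collections import defaultdict

AGE_EDGES = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

def age_bucket(age, edges=AGE_EDGES):
    """Map an age to a left-closed bucket label, e.g. 34 -> "30-40", 90 or older -> "90+"."""
    i = bisect.bisect_right(edges, age) - 1
    if i < 0:
        raise ValueError(f"age {age} is below the first bucket edge")
    if i == len(edges) - 1:
        return f"{edges[-1]}+"
    return f"{edges[i]}-{edges[i + 1]}"

def error_rate_by_age(labels, scores, ages, threshold=0.5):
    """Return {bucket: (test error rate, number of samples)} for each age bucket."""
    wrong = defaultdict(int)
    total = defaultdict(int)
    for y, s, a in zip(labels, scores, ages):
        b = age_bucket(a)
        wrong[b] += int((s >= threshold) != bool(y))
        total[b] += 1
    return {b: (wrong[b] / total[b], total[b]) for b in total}

print(age_bucket(34), age_bucket(10), age_bucket(93))  # 30-40 10-20 90+
```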
Observations
- Lung-related illness is expected to be more common in certain age groups (the mid-adult range).
- The test error rate is high for the 0-20 age group, which is acceptable.
- The 70-90 age groups likely have fewer data samples.
These observations could be quantified more rigorously with domain knowledge.
Downloads
Download Dataset-Subset
Includes .CSV files with image names, labels, and other patient-related information. The images themselves must be downloaded from the official source.
import wandb

# initialize a W&B run
run = wandb.init(entity='wandb',
                 project='model-card-NIH-Chest-X-ray-binary',
                 job_type='consumer')
# download the dataset subset as an artifact; download() returns the local directory
artifact = run.use_artifact('dataset-subset:latest')
artifact_dir = artifact.download()
# close the run (finish() replaces the deprecated join())
run.finish()
Download model:v0
import wandb

# initialize a W&B run
run = wandb.init(entity='wandb',
                 project='model-card-NIH-Chest-X-ray-binary',
                 job_type='consumer')
# download model_1.h5 as an artifact; download() returns the local directory
artifact = run.use_artifact('model/model:latest')
artifact_dir = artifact.download()
# close the run (finish() replaces the deprecated join())
run.finish()
Limitations
Dataset
- The dataset used for training is a small subset of the full dataset.
- The image labels are extracted with NLP, so some labels may be erroneous. The NLP labeling accuracy is estimated to be >90%.
Model
- For images such as X-rays, which are not natural images, vanilla convolutional neural network (CNN) based image classifiers are not sufficient.