Predicting Lung Disease with Binary Classification on the NIH Chest X-ray Dataset

In this report, we will perform binary classification on the NIH Chest X-ray dataset. Made by Ayush Thakur using Weights & Biases


Machine learning with medical imagery has been a promising domain for quite a while now. In fact, many in the field think ML-centric diagnoses are a matter of “when” not “if.” But since the consequences of false negatives and false positives are so detrimental for patients, the industry and researchers in this field are still fairly tentative.
Chest X-rays, like most medical images, are close to ideal from a data perspective. They’re fairly uniform in size and angle, and many are publicly available (with personally identifying information redacted, of course).
Today, we’re going to look at whether we can leverage an NIH dataset of those images to predict lung disease diagnoses. Specifically, our output is a prediction about whether we’re looking at a normal lung or an abnormal lung.
Task Performed: Binary Classification
Input Type: Image
Output: Prediction score denoting either normal or abnormal lung.
Let’s dig in:


The NIH Chest X-ray dataset comprises 112,120 X-ray images with 14 text-mined disease labels from 30,805 unique patients. The 14 disease labels are Atelectasis, Cardiomegaly, Consolidation, Edema, Effusion, Emphysema, Fibrosis, Hernia, Infiltration, Mass, Nodule, Pleural Thickening, Pneumonia, and Pneumothorax.
To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning.

License and Attribution

(The data is also available as a Kaggle dataset.)

Model: v1

Problem Formulation

With this model, the intent is to classify a given X-ray image as either normal (no associated disease) or abnormal (one or more diseases). The model thus performs binary classification.

Intended Use Case

Uses to avoid

Training Data

51,759 samples of the NIH Chest X-ray dataset are labeled with one or more diseases (multi-label). The label for such samples is converted to 1.
The remaining samples are labeled No Finding. The NLP-based labeling technique used by the authors of the dataset could not associate any disease with these samples. The label for such samples is converted to 0.
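The label conversion described above can be sketched as follows. This is a minimal illustration rather than the author's code: it assumes the labels sit in a `Finding Labels` column where multiple diseases are joined with `|` and disease-free images carry the string `No Finding`. The helper name `to_binary_label` and the sample rows are hypothetical.

```python
import csv
import io


def to_binary_label(finding_labels: str) -> int:
    """Return 0 for 'No Finding' and 1 for one or more disease labels."""
    return 0 if finding_labels.strip() == "No Finding" else 1


# Hypothetical rows standing in for the dataset's CSV file.
SAMPLE_CSV = """Image Index,Finding Labels
img_a.png,No Finding
img_b.png,Cardiomegaly|Effusion
img_c.png,Hernia
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
binary_labels = {
    row["Image Index"]: to_binary_label(row["Finding Labels"]) for row in rows
}
print(binary_labels)
```

Collapsing every multi-label row to 1 discards which disease is present, which is exactly what the binary formulation asks for.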
20,000 training images, 5,000 validation images, and 10,000 test images were used to train, validate, and test the model:v0.
Preprocessing: The original images are 1024 x 1024 pixels. They are resized to 256 x 256 pixels, and the resized images are then scaled down.
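One way to picture the preprocessing step is the sketch below. It rests on two assumptions that the report does not confirm: the resize is done here by simple block averaging (the interpolation actually used is not stated), and "scaled down" is taken to mean mapping 8-bit pixel intensities to the range [0, 1]. The function name `preprocess` and the synthetic image are hypothetical.

```python
import numpy as np


def preprocess(image: np.ndarray, target: int = 256) -> np.ndarray:
    """Downsample a square grayscale image by block averaging, then scale to [0, 1]."""
    side = image.shape[0]
    if image.shape != (side, side) or side % target != 0:
        raise ValueError("expected a square image whose side is a multiple of target")
    factor = side // target  # 1024 // 256 == 4
    # Average each factor x factor block of pixels into one output pixel.
    blocks = image.astype(np.float32).reshape(target, factor, target, factor)
    resized = blocks.mean(axis=(1, 3))
    # Assumption: "scaled down" means mapping 8-bit intensities to [0, 1].
    return resized / 255.0


# Synthetic stand-in for a 1024 x 1024 X-ray with 8-bit pixel values.
rng = np.random.default_rng(0)
xray = rng.integers(0, 256, size=(1024, 1024), dtype=np.uint8)
processed = preprocess(xray)
print(processed.shape)
```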

Model Architecture

model:v1 is trained from scratch with ResNet-50 as the backbone architecture.
The output of the global max pooling layer is passed through a ReLU-activated Dense layer with 512 units, followed by a dropout layer (drop rate of 0.2). The output layer is sigmoid-activated.
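The architecture described above could look roughly like this in Keras. This is a sketch under stated assumptions, not the author's training script: the input is assumed to be a 256 x 256 image with 3 channels, the output layer is assumed to be a single unit, and the function name `build_model` is hypothetical.

```python
import tensorflow as tf


def build_model(input_shape=(256, 256, 3)) -> tf.keras.Model:
    """ResNet-50 backbone with a small binary classification head."""
    # weights=None gives random initialization, i.e. training from scratch.
    # pooling="max" applies global max pooling to the backbone's feature maps.
    backbone = tf.keras.applications.ResNet50(
        include_top=False,
        weights=None,
        input_shape=input_shape,  # assumption: 3-channel input
        pooling="max",
    )
    x = tf.keras.layers.Dense(512, activation="relu")(backbone.output)
    x = tf.keras.layers.Dropout(0.2)(x)
    # Assumption: one sigmoid unit producing the abnormality score.
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs=backbone.input, outputs=output)


model = build_model()
print(model.output.shape)
```

A single sigmoid unit yields a score between 0 and 1, matching the prediction score described earlier in the report.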

Training related specifics


Evaluation is done on the held-out test set. The ROC curve and the test error rate are used as evaluation metrics.
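For readers who want to see how these two metrics are computed, here is a small NumPy sketch. It is illustrative only: the 0.5 decision threshold for the error rate is an assumption, the function names are hypothetical, and the toy labels and scores are made up.

```python
import numpy as np


def error_rate(y_true, y_score, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return float(np.mean(y_pred != y_true))


def roc_points(y_true, y_score):
    """Return (fpr, tpr) arrays obtained by sweeping the decision threshold."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Start above every score (nothing predicted positive), then go downwards.
    thresholds = np.concatenate(([np.inf], np.unique(y_score)[::-1]))
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)
    tpr = np.array([np.sum((y_score >= t) & (y_true == 1)) / pos for t in thresholds])
    fpr = np.array([np.sum((y_score >= t) & (y_true == 0)) / neg for t in thresholds])
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy data, not results from the model.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_score)
print(error_rate(y_true, y_score), auc(fpr, tpr))  # 0.25 0.75
```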

Model Bias

The Data_Entry_2017_v2020.csv file that comes with the NIH Chest X-ray dataset contains class labels as well as patient data. The patient data provided include the patient's age and gender.
No signal about age or gender was provided during training.

Bias Towards Gender

The model is evaluated on the male-only (blue) and the female-only (orange) subsets of the test data.
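Evaluating on a subset amounts to masking the test predictions before computing the metric. The sketch below shows the idea with the error rate. The `M`/`F` encoding, the 0.5 threshold, the helper name `subset_error_rate`, and the toy arrays are all assumptions for illustration, not values from the report.

```python
import numpy as np


def subset_error_rate(y_true, y_score, mask, threshold=0.5):
    """Error rate restricted to the samples selected by a boolean mask."""
    y_true = np.asarray(y_true)[mask]
    y_pred = (np.asarray(y_score)[mask] >= threshold).astype(int)
    return float(np.mean(y_pred != y_true))


# Toy test-set metadata and predictions (hypothetical values).
gender = np.array(["M", "F", "M", "F", "M", "F"])
y_true = np.array([1, 0, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.6, 0.8, 0.7, 0.1])

rates = {g: subset_error_rate(y_true, y_score, gender == g) for g in ("M", "F")}
print(rates)
```

A large gap between the two rates would suggest the model treats the groups differently, even though gender was never an input.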


Bias Towards Age Groups

The continuous ages are bucketed using the bin edges [0, 10, 20, 30, 40, 50, 60, 70, 80, 90].
The model is then evaluated on each bucket to measure its performance per age group.
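The bucketing step can be sketched with `np.digitize`. Several details here are assumptions: each bucket is treated as closed on the left and open on the right (so age 10 falls in the 10 to 20 bucket), ages of 90 and above share the last bucket, and the error rate uses a 0.5 threshold. The function names and toy data are hypothetical.

```python
import numpy as np

BIN_EDGES = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]


def bucket_index(ages):
    """Map each age to its bucket: 0 for [0, 10), 1 for [10, 20), ..., 9 for 90+."""
    # np.digitize returns i with BIN_EDGES[i-1] <= age < BIN_EDGES[i],
    # so subtracting 1 gives the index of the bucket's lower edge.
    return np.digitize(ages, BIN_EDGES) - 1


def per_bucket_error(ages, y_true, y_score, threshold=0.5):
    """Error rate for every age bucket that has at least one sample."""
    idx = bucket_index(ages)
    wrong = (np.asarray(y_score) >= threshold).astype(int) != np.asarray(y_true)
    return {int(b): float(wrong[idx == b].mean()) for b in np.unique(idx)}


# Toy data, not results from the model.
ages = [5, 15, 18, 47, 90]
y_true = [0, 1, 0, 1, 1]
y_score = [0.2, 0.3, 0.1, 0.9, 0.6]
errors = per_bucket_error(ages, y_true, y_score)
print(errors)
```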


This can be better quantified through domain knowledge adaptation.


Download model:v1

import wandb

# initialize wandb run
run = wandb.init()
# download model_nih_1.h5 as artifact
artifact = run.use_artifact('wandb/model-card-NIH-Chest-X-ray-binary/model:latest')
artifact_dir = artifact.download()
# close the run (run.finish() supersedes the deprecated run.join())
run.finish()