
Predicting Disaster Tweets

Using HuggingFace and Weights & Biases to predict whether or not a Tweet is about a disaster
Created on July 3 | Last edited on July 31



Intro

Given the widespread use of smartphones, individuals can promptly report emergencies they witness. As a result, disaster relief organizations and news agencies are interested in programmatically monitoring social media for immediate updates on disasters.
To address this need, this project uses BERT to classify whether a tweet is about a real disaster. By accurately identifying such tweets, it aims to provide timely and crucial information to aid organizations and news agencies, facilitating effective disaster response and reporting.

BERT is a transformer-based model that learns contextualized word representations by considering bidirectional context, enabling it to capture rich language understanding and achieve state-of-the-art performance on various natural language processing tasks.
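For illustration, here is a minimal sketch of how a pretrained BERT checkpoint can be loaded for binary classification with the transformers library; the bert-base-uncased checkpoint and the example tweet are assumptions, not necessarily what this project used:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint: bert-base-uncased with a 2-class head (disaster / not disaster)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenize an example tweet and run a forward pass
inputs = tokenizer('Forest fire spreading near the highway, stay safe everyone',
                   return_tensors='pt', truncation=True, max_length=250)
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]) -> one score per class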


Use of Weights & Biases

  • Hugging Face integration
  • Reports
  • Logging Plotly figures (see the EDA section below)
  • W&B Tables (see the sketch after this list)
  • Sweeps
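As a quick illustration of the W&B Tables item, here is a minimal sketch of logging a table of example predictions to a run; the column names and rows are made up for illustration, not the project's actual table:

import wandb

run = wandb.init(entity='uma-wandb', project='disaster')

# Hypothetical table of model predictions on a few validation tweets
table = wandb.Table(columns=['text', 'label', 'prediction'])
table.add_data('Forest fire near La Ronge Sask. Canada', 1, 1)
table.add_data('I love fruits', 0, 0)
table.add_data('Evacuation order issued for the coastal area', 1, 0)
run.log({'validation_examples': table})

run.finish()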


Exploratory Data Analysis


This set of panels contains runs from a private project, which cannot be shown in this report
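Since the EDA panels come from a private project, below is a rough, hypothetical sketch of the kind of Plotly-based analysis behind them, assuming the Kaggle competition's train.csv with 'text' and 'target' columns:

import pandas as pd
import plotly.express as px
import wandb

wandb.init(entity='uma-wandb', project='disaster')

# Assumed input: the Kaggle competition's train.csv with 'text' and 'target' columns
df = pd.read_csv('train.csv')

# Class balance: disaster (1) vs. non-disaster (0) tweets
class_counts = df['target'].value_counts().reset_index()
class_counts.columns = ['target', 'count']
fig_counts = px.bar(class_counts, x='target', y='count', title='Class distribution')

# Tweet length distribution, split by class
df['length'] = df['text'].str.len()
fig_len = px.histogram(df, x='length', color='target', title='Tweet length by class')

# Log both Plotly figures to the W&B run
wandb.log({'class_distribution': fig_counts, 'tweet_length': fig_len})
wandb.finish()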



A first attempt...

As an initial attempt at model training, I decided to use a modest batch size of 8, a learning rate of 0.001, and a weight decay of 0.0001.
Code to initialize the wandb run and train the model:
import wandb
from transformers import Trainer, TrainingArguments
from transformers.integrations import WandbCallback

# Initialize wandb run
wandb.init(entity='uma-wandb', project='disaster')

# Hyperparameters - adjustable parameters of a model that influence model training
BATCH_SIZE = 8
EPOCHS = 10
LEARNING_RATE = 0.001
WEIGHT_DECAY = 0.0001
MAXLEN = 250

# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    evaluation_strategy='epoch',
    remove_unused_columns=False,
    report_to='wandb',  # This line logs metrics to wandb
    logging_steps=10,
)

# Define training loop
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
    optimizers=(optim, None),
    callbacks=[WandbCallback()],  # Callback not strictly necessary since report_to='wandb' already logs
)

trainer.train()
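The snippet above references compute_metrics and optim without showing their definitions. A plausible sketch of both is shown below; the exact metrics and optimizer settings are assumptions:

import numpy as np
from torch.optim import AdamW
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Turn the Trainer's logits and labels into metrics that get logged to wandb."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, preds),
            'f1': f1_score(labels, preds)}

# Optimizer handed to the Trainer via optimizers=(optim, None);
# the None lets the Trainer create its default learning-rate scheduler.
optim = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)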

This set of panels contains runs from a private project, which cannot be shown in this report



Sweep

Problem: My initial run produced subpar results; I need to try different hyperparameters and don't know where to start.
Solution: Automate the process of trying a bunch of different hyperparameter combos by using W&B Sweeps!

Relevant Code

sweep_config = {
    "method": "random",
    "name": "disaster-sweep",
    "metric": {
        "goal": "minimize",
        "name": "train/loss"
    },
    "parameters": {
        "epochs": {
            "values": [5, 10]
        },
        "batch_size": {
            "values": [8, 16, 32, 64]
        },
        "learning_rate": {
            "values": [0.005, 0.0001, 0.00005]
        },
        "weight_decay": {
            "values": [0.0001, 0.1]
        }
    }
}


def train(config=None):
    with wandb.init(config=config):
        # Set sweep configuration
        config = wandb.config

        # Set training arguments
        training_args = TrainingArguments(
            output_dir='./results',
            report_to='wandb',  # Turn on Weights & Biases logging
            num_train_epochs=config.epochs,
            learning_rate=config.learning_rate,
            weight_decay=config.weight_decay,
            per_device_train_batch_size=config.batch_size,
            per_device_eval_batch_size=config.batch_size,
            save_strategy='epoch',
            evaluation_strategy='epoch',
            logging_strategy='epoch',
            load_best_model_at_end=True,
            remove_unused_columns=False,
        )

        # Define training loop
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=valid_dataset,
            compute_metrics=compute_metrics,
        )

        # Start training loop
        trainer.train()

sweep_id = wandb.sweep(sweep_config, project='disaster-sweep')
wandb.agent(sweep_id, train, count=10)

Sweep Results

After running a random search over the batch size, learning rate, number of epochs, and weight decay, I was able to find a combination of hyperparameters that actually worked.

This set of panels contains runs from a private project, which cannot be shown in this report


Results & Next Steps

We were able to achieve a maximum accuracy of 81% on our validation dataset, and our train/loss panel indicates that the model is now training properly, unlike before.

This set of panels contains runs from a private project, which cannot be shown in this report


Next Steps

Code Sources

I used the code from https://www.kaggle.com/code/datafan07/disaster-tweets-nlp-eda-bert-with-transformers as a basis for my EDA (but implemented the Plotly graphs myself).










