
Emotions

A quick exploration of the Reddit GoEmotions dataset
Created on July 22|Last edited on October 21

Log and explore your data

Total: 58K examples; train: 43.4K, test: 5.4K.
Different taxonomies: 27 emotions + neutral (28 labels total, so random chance is 1/28 ≈ 0.036); hierarchical sentiment grouping (positive, negative, ambiguous, neutral); Ekman (basic emotions, like Inside Out).
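To make the taxonomy arithmetic concrete, here is a minimal sketch of the sentiment grouping and the random-chance figure. The label-to-group assignment is written from memory of the dataset's published sentiment mapping, so treat it as an assumption and verify it against the official mapping file before relying on it. The names `SENTIMENT` and `LABEL_TO_SENTIMENT` are hypothetical.

```python
# Assumed grouping of the 27 fine-grained emotions into sentiment buckets
# (recalled from the dataset's published mapping; verify before use).
SENTIMENT = {
    "positive": [
        "admiration", "amusement", "approval", "caring", "desire", "excitement",
        "gratitude", "joy", "love", "optimism", "pride", "relief",
    ],
    "negative": [
        "anger", "annoyance", "disappointment", "disapproval", "disgust",
        "embarrassment", "fear", "grief", "nervousness", "remorse", "sadness",
    ],
    "ambiguous": ["confusion", "curiosity", "realization", "surprise"],
}

# Invert to a label -> group lookup and add the neutral class.
LABEL_TO_SENTIMENT = {
    label: group for group, labels in SENTIMENT.items() for label in labels
}
LABEL_TO_SENTIMENT["neutral"] = "neutral"

n_labels = len(LABEL_TO_SENTIMENT)  # 27 emotions + neutral
chance = 1 / n_labels               # uniform single-label guess
print(n_labels, round(chance, 3))
```

The chance figure assumes a uniform single-label guess; with "neutral" as frequent as it is, a majority-class baseline would score higher.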
Ideas: try fine-tuned models from Hugging Face, run a batch of inference, or save different models and compare their predictions on the test set and across taxonomies for the same phrases.
Challenge: pivot the dataset Tables to log emotion names instead of binary columns, then handle the multi-label case (keep the top N labels?).
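One possible way to approach the pivot challenge is sketched here on a toy DataFrame that mimics the raw layout (one row per comment and rater, one 0/1 column per emotion). The toy data, the three-emotion subset, and the `top_n_labels` helper are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the raw data: one row per (comment, rater),
# one binary column per emotion. Only three emotions are shown.
EMOTIONS = ["approval", "gratitude", "neutral"]
df = pd.DataFrame({
    "id":        ["a", "a", "a", "b", "b"],
    "text":      ["thanks, agreed"] * 3 + ["ok"] * 2,
    "approval":  [1, 1, 1, 0, 0],
    "gratitude": [1, 0, 1, 0, 0],
    "neutral":   [0, 0, 0, 1, 1],
})

# Pivot each row: binary columns -> list of emotion names.
df["emotions"] = [
    [name for name, flag in zip(EMOTIONS, flags) if flag == 1]
    for flags in df[EMOTIONS].to_numpy()
]

def top_n_labels(votes, n=2):
    """Return up to n labels with at least one vote, most-voted first."""
    votes = votes[votes > 0].sort_values(ascending=False)
    return list(votes.index[:n])

# Multi-label handling: sum rater votes per comment, then keep the top N.
votes_per_comment = df.groupby("id")[EMOTIONS].sum()
top_labels = {
    comment_id: top_n_labels(row)
    for comment_id, row in votes_per_comment.iterrows()
}
print(top_labels)
```

Joining each list into a comma-separated string before logging may make the resulting Table column easier to group and filter, though that is a design choice rather than a requirement.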

Sample code to subsample via Pandas

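from datasets import load_dataset
import wandb

# The "raw" config has a single "train" split with one row per rater annotation
dataset = load_dataset("go_emotions", "raw")
train = dataset["train"]

# Convert to pandas and draw a random 5% subsample
train_df = train.to_pandas()
train_sample = train_df.sample(frac=0.05)

# Log the sample as a W&B Table for interactive exploration
run = wandb.init(project="emotions", job_type="explore", name="test_0.05")
run.log({"sample_random_5%_train": wandb.Table(dataframe=train_sample)})
run.finish()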

Random dataset sample

Observations

  • "neutral" is the most frequent label
  • some text samples appear more than once; this is not because of threading, but because each comment is labeled by multiple raters
  • "[deleted]" is the most frequent author
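These observations could be checked directly in pandas. The sketch below uses a small made-up DataFrame with the same kind of columns as the raw data (`id`, `author`, `rater_id`, plus binary emotion columns); the values are invented for illustration and are not results from the real dataset.

```python
import pandas as pd

# Toy stand-in for the raw data (values are invented).
EMOTIONS = ["gratitude", "neutral", "surprise"]
df = pd.DataFrame({
    "id":        ["a", "a", "a", "b", "b", "c"],
    "text":      ["thanks!"] * 3 + ["ok"] * 2 + ["wow"],
    "author":    ["[deleted]"] * 3 + ["user1"] * 2 + ["[deleted]"],
    "rater_id":  [1, 2, 3, 1, 4, 2],
    "gratitude": [1, 1, 0, 0, 0, 0],
    "neutral":   [0, 0, 1, 1, 1, 0],
    "surprise":  [0, 0, 0, 0, 0, 1],
})

# 1. Label frequency: total votes per emotion, most frequent first.
label_counts = df[EMOTIONS].sum().sort_values(ascending=False)

# 2. Repeated texts: if every repeat comes from a different rater,
#    rows per comment equals distinct raters per comment.
rows_per_comment = df.groupby("id").size()
raters_per_comment = df.groupby("id")["rater_id"].nunique()

# 3. Most frequent author, counting each comment once rather than once per rater.
top_authors = df.drop_duplicates("id")["author"].value_counts()

print(label_counts.idxmax(), top_authors.idxmax())
```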

Queries to try

Group by subreddit

  • Which are the most positive subreddits? Add a column for the ratio of approval to disapproval
  • Which subreddits show the most of a particular emotion? (use .sum)
  • Which are the most diverse? Add a column that counts unique authors
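The subreddit queries above could be prototyped in pandas before building them in the Table UI. This is a sketch on invented data; the `approval_ratio` column name and the +1 smoothing (to avoid dividing by zero when a subreddit has no disapproval votes) are assumptions, not part of the original notes.

```python
import pandas as pd

# Toy stand-in for the raw data (values are invented).
df = pd.DataFrame({
    "subreddit":   ["aww", "aww", "aww", "news", "news"],
    "author":      ["u1", "u2", "u2", "u3", "u3"],
    "approval":    [1, 1, 0, 0, 1],
    "disapproval": [0, 0, 1, 1, 1],
})

# Per-subreddit totals for each emotion and a unique-author count.
by_sub = df.groupby("subreddit").agg(
    approval=("approval", "sum"),
    disapproval=("disapproval", "sum"),
    unique_authors=("author", "nunique"),
)

# Ratio of approval to disapproval; the +1 smoothing is an assumption
# that keeps the ratio finite when disapproval is zero.
by_sub["approval_ratio"] = (by_sub["approval"] + 1) / (by_sub["disapproval"] + 1)

ranked = by_sub.sort_values("approval_ratio", ascending=False)
print(ranked)
```

Swapping `"approval"` for any other emotion column in the `agg` call answers the "most of a particular emotion" question, and sorting by `unique_authors` gives one rough notion of diversity.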

Surprising combinations of emotions

  • approval & disapproval
  • fear & gratitude
  • caring & surprise
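To look for combinations like these systematically, one option is a label co-occurrence matrix built from the binary columns. The sketch below uses invented rows and a four-emotion subset; in the raw data each row is a single rater's annotation, so a nonzero count would mean one rater chose both labels for the same comment.

```python
import pandas as pd

# Toy binary label matrix: one row per annotation (values are invented).
EMOTIONS = ["approval", "disapproval", "fear", "gratitude"]
df = pd.DataFrame({
    "approval":    [1, 1, 0, 0],
    "disapproval": [1, 0, 0, 1],
    "fear":        [0, 0, 1, 0],
    "gratitude":   [0, 1, 1, 0],
})

labels = df[EMOTIONS]

# cooc.loc[a, b] = number of rows tagged with both a and b;
# the diagonal holds each label's total count.
cooc = labels.T @ labels

# List each unordered pair that co-occurs at least once.
pairs = [
    (a, b, int(cooc.loc[a, b]))
    for i, a in enumerate(EMOTIONS)
    for b in EMOTIONS[i + 1:]
    if cooc.loc[a, b] > 0
]
print(pairs)
```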



Sample 20%




Resources