Emotions
A quick exploration of the Reddit GoEmotions dataset
Created on July 22 | Last edited on October 21
Log and explore your data
Total: 58K examples; train: 43.4K, test: 5.4K
Different taxonomies: fine-grained (27 emotions + neutral = 28 labels, so random chance is 1/28 ≈ 0.036), hierarchical (positive, negative, ambiguous, neutral), and Ekman (like Inside Out).
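The chance figure follows directly from the label count. A minimal sketch of the same arithmetic for all three taxonomies, assuming a uniform single-label guess and counting the Ekman grouping as six emotions plus neutral (that count is an assumption, not stated in these notes):

```python
# Number of classes per taxonomy. The fine-grained and hierarchical counts
# come from the notes above; the Ekman count is an assumption.
taxonomy_sizes = {
    "fine_grained": 28,  # 27 emotions + neutral
    "hierarchical": 4,   # positive, negative, ambiguous, neutral
    "ekman": 7,          # assumed: 6 Ekman emotions + neutral
}

# Accuracy of a uniform random guess if each example had exactly one label.
chance = {name: round(1 / n, 3) for name, n in taxonomy_sizes.items()}

for name, p in chance.items():
    print(f"{name}: {p}")
```

Because comments can carry more than one label, this is only a rough baseline for comparing models across taxonomies.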
Ideas: try fine-tuned models from Hugging Face, run inference on a batch of examples, or save different models and compare their predictions on the test set and across taxonomies for the same phrases.
Challenge: pivot the dataset Tables to log emotion names instead of binary columns, then handle the multi-label case (top N?).
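One possible way to approach the pivot, sketched on a toy DataFrame. The rows are made up, only four of the label columns are shown, and `pivot_emotions` is a hypothetical helper name:

```python
import pandas as pd

# Toy stand-in for the logged table: one binary column per emotion.
# Only four label columns are shown, and the rows are made up.
emotion_cols = ["approval", "disapproval", "gratitude", "neutral"]
df = pd.DataFrame(
    {
        "text": ["thanks, that works", "no way", "ok"],
        "approval": [1, 0, 0],
        "disapproval": [0, 1, 0],
        "gratitude": [1, 0, 0],
        "neutral": [0, 0, 1],
    }
)

def pivot_emotions(frame, label_cols):
    """Replace binary label columns with one comma-separated 'emotions' column."""
    def names(row):
        # Keep the names of the columns that are set for this row.
        return ", ".join(c for c in label_cols if row[c] == 1)

    out = frame.drop(columns=label_cols)
    out["emotions"] = frame[label_cols].apply(names, axis=1)
    return out

pivoted = pivot_emotions(df, emotion_cols)
print(pivoted)
```

Joining the names into a single string keeps multi-label rows readable in a table; a list column would also work if the downstream tool handles it.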
Sample code to subsample via Pandas
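from datasets import load_dataset
import wandb

# Load the raw GoEmotions config and convert the train split to a DataFrame
dataset = load_dataset("go_emotions", "raw")
train = dataset['train']
train_df = train.to_pandas()

# Take a random 5% sample of the training rows
train_sample = train_df.sample(frac=0.05)

# Log the sample as a W&B Table for interactive exploration
wandb.init(project="emotions", job_type="explore", name="test_0.05")
wandb.run.log({"sample_random_5%_train" : wandb.Table(dataframe=train_sample)})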
Random dataset sample
Observations
- "neutral" is very popular
- some text samples appear more than once: not because of threading, but because each comment was labeled by multiple reviewers
- "[deleted]" is the most popular author
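Since the repeats come from multiple reviewers, one option is to sum their votes per comment and keep the top N emotions. A minimal sketch on made-up rows, assuming each row carries a comment `id`; `top_n_labels` is a hypothetical helper:

```python
import pandas as pd

# Toy rows with one row per (comment, reviewer).
# The ids, texts, and votes are made up; only three label columns are shown.
emotion_cols = ["approval", "gratitude", "neutral"]
raw = pd.DataFrame(
    {
        "id": ["a1", "a1", "a1", "b2", "b2"],
        "text": ["thanks!", "thanks!", "thanks!", "ok", "ok"],
        "approval": [1, 0, 1, 0, 0],
        "gratitude": [1, 1, 1, 0, 0],
        "neutral": [0, 0, 0, 1, 1],
    }
)

# Sum reviewer votes per comment so each text appears once.
votes = raw.groupby(["id", "text"])[emotion_cols].sum()

def top_n_labels(row, n=2):
    """Return the n most-voted emotions that received at least one vote."""
    ranked = row[row > 0].sort_values(ascending=False, kind="stable")
    return ", ".join(ranked.index[:n])

top_labels = votes.apply(top_n_labels, axis=1)
print(top_labels)
```

This also gives one answer to the "top N" question for the multi-label case: rank emotions by how many reviewers chose them.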
Queries to try
Group by subreddit
- Which are the most positive subreddits? Add column of ratio approval to disapproval
- Which subreddits show the most of a particular emotion? (use .sum)
- Which are the most diverse? Add column to count unique authors
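The subreddit queries above can be sketched in pandas with a single groupby. The sample rows are made up, and the output column names (`approval_ratio`, `unique_authors`) are illustrative:

```python
import pandas as pd

# Toy sample; subreddit names, authors, and labels are made up.
sample = pd.DataFrame(
    {
        "subreddit": ["aww", "aww", "aww", "politics", "politics"],
        "author": ["u1", "u2", "u1", "u3", "[deleted]"],
        "approval": [1, 1, 0, 0, 1],
        "disapproval": [0, 0, 1, 1, 1],
    }
)

# Per-subreddit emotion totals (.sum) and unique author counts.
by_sub = sample.groupby("subreddit").agg(
    approval=("approval", "sum"),
    disapproval=("disapproval", "sum"),
    unique_authors=("author", "nunique"),
)

# Ratio of approval to disapproval; a zero denominator becomes NaN, not inf.
by_sub["approval_ratio"] = by_sub["approval"] / by_sub["disapproval"].where(
    by_sub["disapproval"] > 0
)

print(by_sub.sort_values("approval_ratio", ascending=False))
```

Note that "[deleted]" counts as a single author here, so it may be worth dropping those rows before measuring diversity.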
Surprising combinations of emotions
- approval & disapproval
- fear & gratitude
- caring & surprise
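To look for pairs like these systematically, one option is a co-occurrence matrix over the binary label columns. A minimal sketch on made-up rows, with only four label columns shown; `rows_with` is a hypothetical helper:

```python
import pandas as pd

# Toy label matrix; rows are made up, and only four label columns are shown.
emotion_cols = ["approval", "disapproval", "fear", "gratitude"]
labels = pd.DataFrame(
    [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 1],
        [1, 1, 0, 0],
    ],
    columns=emotion_cols,
)

# Co-occurrence counts: entry (a, b) is the number of rows labeled with both.
cooc = labels.T.dot(labels)

def rows_with(frame, first, second):
    """Select rows where both emotion columns are set."""
    return frame[(frame[first] == 1) & (frame[second] == 1)]

print(cooc)
print(rows_with(labels, "approval", "disapproval"))
```

The diagonal of `cooc` holds the per-emotion totals, so off-diagonal counts can be compared against them to judge how unusual a pairing is.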
Sample 20%
Resources