
Swift or Shakespeare?

Text classification with spaCy by an ML newbie
Created on April 18 | Last edited on October 2

Context

Data

Gathered from Kaggle and processed with a simple Node script (a rough Python equivalent of the filtering is sketched after this list).

Shakespeare
    • 105155 lines
    • Includes only spoken player lines (e.g. lines like "SCENE I. London. The Palace" or "ENTER King Henry" are filtered out)
    • Many lines are cut awkwardly; for example, this sentence is split into 3 distinct "lines":
      • Well, Hal, well, and in some sort it jumps with my
      • humour as well as waiting in the court, I can tell
      • you.

Taylor Swift
    • 2174 lines
    • 8358 lines (but pop music is repetitive, so the number of unique lines is lower)
    • Does not include the newest album, Midnights (2022)
    • Much smaller data set compared to Shakespeare
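The preprocessing itself was a Node script; purely as an illustration, here is a minimal Python sketch of the same kind of filtering. The file names and the stage-direction heuristic (SCENE/ACT headers and ENTER/EXIT/EXEUNT cues) are assumptions, not the actual script.

```python
import re

# Hypothetical paths; the original preprocessing was a Node script.
RAW_PATH = "shakespeare_raw.txt"
OUT_PATH = "shakespeare_spoken_lines.txt"

# Assumed heuristic: stage directions start with scene/act headers or
# entrance/exit cues such as "ENTER King Henry".
STAGE_DIRECTION = re.compile(r"^(SCENE|ACT|PROLOGUE|EPILOGUE|ENTER|EXIT|EXEUNT)\b")

def is_spoken_line(line: str) -> bool:
    """Keep only non-empty lines that don't look like stage directions."""
    stripped = line.strip()
    return bool(stripped) and not STAGE_DIRECTION.match(stripped)

with open(RAW_PATH, encoding="utf-8") as src, open(OUT_PATH, "w", encoding="utf-8") as dst:
    for line in src:
        if is_spoken_line(line):
            dst.write(line)
```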

Validation Strategy

Two strategies for testing the performance of the model:
  1. Randomize the data and split into 90% for training and 10% for validation
  2. Use all data for training and create a separate validation set from the BuzzFeed quizzes
The "random split" approach resulted in high accuracy across the board regardless of other variables, while "quiz" mode produced lower accuracies and more noticeable performance differences between runs. That makes "quiz" mode far more interesting, so it's primarily what we'll focus on going forward.
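As a concrete sketch of strategy 1, a shuffled 90/10 split could look like the following (examples, a list of (text, label) pairs, is a hypothetical stand-in for the combined data set):

```python
import random

def train_dev_split(examples, dev_fraction=0.1, seed=42):
    """Shuffle (text, label) pairs and split into 90% train / 10% validation."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - dev_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]
```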

Architecture

I tried training the model with three architectures: bow (bag-of-words), ensemble, and simple_cnn. The metrics most affected by architecture were accuracy and run duration. The bar graphs below show that bow did fairly poorly at only 50-60% accuracy, but was significantly faster to train than the other two, while simple_cnn was the most accurate but also the slowest.
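Those three names match the built-in TextCategorizer architectures in spaCy v2, which is presumably the version in use here. A minimal pipeline setup might look like this sketch (the label names are my own placeholders):

```python
import spacy

def build_textcat(architecture: str):
    """Blank English pipeline with a text categorizer using one of the
    spaCy v2 architectures: "bow", "simple_cnn", or "ensemble"."""
    nlp = spacy.blank("en")
    textcat = nlp.create_pipe(
        "textcat",
        config={"exclusive_classes": True, "architecture": architecture},
    )
    textcat.add_label("SHAKESPEARE")  # hypothetical label names
    textcat.add_label("SWIFT")
    nlp.add_pipe(textcat)
    return nlp
```

In spaCy v2, bow scores texts from n-gram counts alone (hence its speed), simple_cnn runs token vectors through a convolutional network, and ensemble stacks a bag-of-words model with a neural network model, which helps explain the speed/accuracy trade-off above.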

Batch Size

The metrics that seem to correlate most strongly with batch size are loss and run duration: smaller batch sizes appear to result in higher loss values and longer run durations. I was expecting a more obvious pattern with accuracy, but there doesn't seem to be one.
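For reference, batch size enters a spaCy v2 training loop through minibatch; a stripped-down version of such a loop might look like this (the train_data format and epoch count are assumptions):

```python
import random
from spacy.util import minibatch

def train(nlp, train_data, batch_size=128, n_epochs=10):
    """train_data: list of (text, annotations) tuples, e.g.
    ("Shake it off", {"cats": {"SWIFT": 1.0, "SHAKESPEARE": 0.0}})."""
    optimizer = nlp.begin_training()
    for epoch in range(n_epochs):
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=batch_size):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print(f"epoch {epoch}: textcat loss {losses['textcat']:.3f}")
```

A smaller batch size means more nlp.update calls per epoch, which lines up with the longer run durations observed above.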

Most Accurate Run: CNN 128

The most accurate run was simple_cnn with a batch size of 128. The model got 27 out of 32 correct, which is roughly 84%! Not bad. How did you do in comparison?
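For completeness, here is roughly how that quiz score could be computed with a trained pipeline (quiz_data and the label names are hypothetical):

```python
def quiz_accuracy(nlp, quiz_data):
    """quiz_data: list of (text, label) pairs taken from the BuzzFeed quizzes."""
    correct = 0
    for text, label in quiz_data:
        doc = nlp(text)
        predicted = max(doc.cats, key=doc.cats.get)  # highest-scoring label
        correct += predicted == label
    return correct / len(quiz_data)

# 27 of 32 correct -> 27 / 32 = 0.84375, i.e. roughly 84%
```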