DeepForm: Track Political Advertising with Deep Learning
Fuzzy string search (or binary matching) on entity names from receipt PDFs. Made by Stacey Svetlichnaya using Weights & Biases
Overview: Automatically parse receipts for TV ads for political campaigns
In the United States, TV stations must publicize the receipts for any political advertisements they show: which organization paid—and how much—for a particular ad. However, these receipts are not required to be machine readable. So every election, the FCC Public File displays tens of thousands of PDF receipts
in hundreds of different formats. Can we use deep learning to make this information accessible to journalists and the public more efficiently?
In 2012, ProPublica ran Free The Files, a crowdsourced annotation project where volunteers hand-labeled over 17,000 receipts to generate this dataset. In this project, I am currently working with 9018 labeled examples from the 2012 election: 7218 for training and 1800 for validation.
This is a volunteer collaboration and very much a work in progress. We're trying to improve these modeling efforts and move quickly as the 2020 election approaches. Please reach out to email@example.com if you want to contribute.
Example receipt from training data
Structured information (who paid, on what date, when the ad aired, etc.) is easy for a human to read but hard for a computer to extract from the OCR text alone (we are exploring adding document-geometry signals via vision/segmentation models).
Goal: Which organization paid for a political ad on television?
There are several fields we need to extract from a receipt to understand it:
name of the organization that paid for the ad
total amount of money paid for the ad
invoice/contract id (uniquely identifies this ad deal)
dates the ad aired
potentially other fields (addresses, contact names, etc)
The goal of this specific model is to extract the name of the organization that paid for a particular TV ad. Given the receipt as an unstructured text string (parsed from a PDF document with OCR), can we learn whether a particular named entity (in most cases, a committee) is mentioned in this receipt?
Try binary classification (label match? y/n) because we don't know the full label set
We frame this as a fuzzy string matching problem. We can't train a standard classification model because we don't know the full list of possible committees/paying entities in advance (especially for future elections that haven't happened yet). However, we do know the full list of committees officially registered with the FCC in a given election, plus some metadata on those to further constrain the options (e.g. which committees support which presidential candidate). The current "deep pixel" model is a one-dimensional CNN:
- it reads a sliding window of kernel_size characters along the receipt document
- at each step/character, it also reads a window of comm_input_len characters of the candidate committee name (the first 10-25 characters), hopefully learning to match across the two
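A minimal sketch of this two-input architecture in tf.keras. All sizes here (document length, vocabulary size, filter counts, kernel sizes) are illustrative guesses, not the project's actual values:

```python
import tensorflow as tf
from tensorflow.keras import layers

doc_len = 4000        # OCR characters per receipt (illustrative)
comm_input_len = 25   # characters of the candidate committee name
vocab_size = 64       # character vocabulary size (illustrative)

# Two inputs: one-hot character sequences for the receipt and committee name
doc_in = layers.Input(shape=(doc_len, vocab_size), name="receipt_chars")
comm_in = layers.Input(shape=(comm_input_len, vocab_size), name="committee_chars")

# Slide a kernel_size window of characters along the receipt text
doc_feat = layers.GlobalMaxPooling1D()(
    layers.Conv1D(filters=64, kernel_size=8, activation="relu")(doc_in))

# Summarize the committee-name window the same way
comm_feat = layers.GlobalMaxPooling1D()(
    layers.Conv1D(filters=64, kernel_size=4, activation="relu")(comm_in))

# Binary output: does this committee name match this receipt?
out = layers.Dense(1, activation="sigmoid")(
    layers.concatenate([doc_feat, comm_feat]))

model = tf.keras.Model([doc_in, comm_in], out)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.0025),
              loss="binary_crossentropy", metrics=["accuracy"])
```

The SGD optimizer at lr=0.0025 reflects the manual-sweep findings; everything else is a placeholder to show the shape of the matching problem.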
I train the model with a variable number of distractors: training samples where the committee name is randomly chosen and does not match the document, to prevent overfitting. Here I describe my process for improving the "deep pixel" model. You can expand the tabs at the bottom of each section to see details on the individual runs.
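Distractor construction can be sketched roughly like this (the function name and the example committee list are hypothetical, not the project's actual code):

```python
import random

def make_training_pairs(receipt_text, true_committee, all_committees,
                        n_distractors=3):
    """Pair one receipt with its true committee (label 1) and
    n_distractors randomly chosen non-matching committees (label 0)."""
    pairs = [(receipt_text, true_committee, 1)]
    negatives = [c for c in all_committees if c != true_committee]
    for name in random.sample(negatives, n_distractors):
        pairs.append((receipt_text, name, 0))
    return pairs

pairs = make_training_pairs(
    "INVOICE ... PAID BY OBAMA FOR AMERICA ...",
    "OBAMA FOR AMERICA",
    ["OBAMA FOR AMERICA", "RESTORE OUR FUTURE", "AMERICAN CROSSROADS",
     "PRIORITIES USA ACTION", "CLUB FOR GROWTH ACTION"],
    n_distractors=3)
```

With 3 distractors, each labeled receipt yields one positive and three negative training samples.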
Random performance to a solid B: Just clean the code
Our original implementation of "deep pixel" languished at random performance, until I cleaned up the code to achieve the "clean code" baseline in black. I experimented with a few manual modifications. The rainbow color gradient matches the order in which I ran my experiments, red being the first and violet being the last experiment.
Observations from manual sweep
- model is clearly overfitting: training accuracy is almost maxed out and validation loss decreases very minimally
- SGD over Adam: Adam overfits quickly on this task (perhaps because this is a shallow network for binary classification), SGD helped
- higher learning rate: increasing to 0.0025 helped and stabilized the learning
- smaller batch size: initial tests suggest 64 is a good balance between speed and accuracy (may want to keep tuning this in later sweeps)
Running on GPU
When I finally switched from Keras to TF.Keras, training moved to the GPU and got 12-40X faster. I tried playing with dropout and adding layers (either a second Conv1D or a second dense layer; I think this is still worth exploring). Again, the rainbow color spectrum indicates the order of the experiments (red is first; violet into magenta, the "new baseline", is last).
- larger kernel doesn't help much
- more layers help a little bit but will require a lot more tuning
- model is really wide and might be memorizing docs
- new baseline (magenta) is pretty good (87.66%)
- still running into overfitting
The solution? Add more distractors. This is the black line, "distractors for real", with 3 false matches for every true match in training. This finally got me above 90%. I include it here for context, as it makes tuning the rest of these settings less relevant. In fact, an initial sweep was fun to play with here but didn't really improve on my manual tuning (except leaning toward more parameters/more memorization :)
- add more dropout
- are there other ways of regularization? what if we increased the number of distractors?
- kernel size: try smaller windows
- committee label length: try larger (we'll need to get above 30 to be realistic/unique)
Distractors are the magic
Finally above 95% (95.4!)
Increasing the number of distractors to 3 (red), 5 (orange), and 9 (yellow) continues to improve the validation accuracy, though there is still evidence of overfitting. Most alarmingly, the learning all happens in the first epoch (probably because we're still missing an embedding).
Also, as I tried to increase past 9 distractors, my runs started dying (maxing out GPU memory) because I wasn't loading training data efficiently. Note that training with 11 distractors on half the data is roughly equivalent (maybe slightly noisier) to training with 9 distractors on the full data, so refactoring the training data loading should let us scale to more distractors.
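The proposed data-loading refactor could look something like a Python generator that yields (receipt, committee, label) triples lazily, instead of materializing the full distractor-augmented array in memory. This is a sketch of the idea, not the project's actual loader:

```python
import random

def pair_generator(docs, committees, n_distractors):
    """Lazily yield one positive pair followed by n_distractors negatives,
    so memory use no longer grows with the number of distractors.
    docs: list of (receipt_text, true_committee_name) tuples."""
    while True:
        doc, true_name = random.choice(docs)
        yield doc, true_name, 1
        negatives = [c for c in committees if c != true_name]
        for name in random.sample(negatives, n_distractors):
            yield doc, name, 0

# Hypothetical usage with toy data
gen = pair_generator(
    [("OCR text of a receipt ...", "COMMITTEE A")],
    ["COMMITTEE A", "COMMITTEE B", "COMMITTEE C"],
    n_distractors=2)
```

Something like this could feed Keras directly (e.g. via tf.data.Dataset.from_generator), keeping GPU memory flat as the distractor count grows.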
- use an embedding, not raw characters, even if this requires reshaping
- refactor training code to load data differently
- increase committee name length to 30/35 or whatever we need to guarantee uniqueness
- reduce model width/weight (2GB is ridiculous)
- tell the story of the long sweep of 200+ runs, none of which were better than my manual tuning
How many distractors are best?
This plot averages accuracy by the number of distractors (incorrect matches of committee name/organization to receipt) per example (with correct match) used during training. The total number of runs is 243, with 2-29 runs falling into each group. The line colors follow a rainbow gradient in increasing number of distractors: red = 1 distractor, orange = 2 distractors, etc.
- generally, the more distractors per true label the better the model performs on the validation set
- the improvement in validation accuracy between the start and end of training is very small, so we may do well to simplify the model and look for more evidence of overfitting
- some of the runs are noisy and cannot be compared in a strict sense, because 1) the number of runs averaged per group, and their respective hyperparameters, vary widely, and 2) most of these runs come from a sweep configured to maximize validation accuracy in its hyperparameter choices, so we can expect the sweep sample to over-represent runs with more promising results
- diminishing returns: going from 2 to 3 distractors is a much bigger improvement than 8 to 9
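The grouping behind this plot amounts to a simple aggregate over the run logs. With hypothetical numbers (these accuracies are made up for illustration, not the sweep's actual results), it might look like:

```python
import pandas as pd

# Hypothetical run logs: final validation accuracy per run,
# keyed by the number of distractors used in training
runs = pd.DataFrame({
    "n_distractors": [1, 1, 2, 2, 3, 3, 9, 9],
    "val_acc":       [0.83, 0.84, 0.87, 0.88, 0.91, 0.92, 0.95, 0.954],
})

# Average validation accuracy (and run count) per distractor group,
# as plotted above
by_group = runs.groupby("n_distractors")["val_acc"].agg(["mean", "count"])
```

Averaging per group like this is what makes the small group sizes (2-29 runs each) worth flagging: a mean over two runs is far noisier than one over 29.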