DeepForm: Track Political Advertising with Deep Learning

Fuzzy string search (or binary matching) on entity names from receipt PDFs. Made by Stacey Svetlichnaya using Weights & Biases

Overview: Automatically parse receipts for TV ads for political campaigns

In the United States, TV stations must publicize the receipts for any political advertisements they show: which organization paid—and how much—for a particular ad. However, these receipts are not required to be machine readable. So every election, the FCC Public File displays tens of thousands of PDF receipts in hundreds of different formats. Can we use deep learning to make this information accessible to journalists and the public more efficiently?

Read more about this project →

Raw data →

In 2012, ProPublica ran Free The Files, a crowdsourced annotation project where volunteers hand-labeled over 17,000 receipts to generate this dataset. In this project, I am currently working with 9018 labeled examples from the 2012 election, 7218 for training and 1800 in validation.

Project code →

This is a volunteer collaboration and very much a work in progress. We're trying to improve these modeling efforts and move quickly as the 2020 election approaches. Please reach out to stacey@wandb.com if you want to contribute.

Example receipt from training data

Structured information (who paid, how much, and when the ad aired, etc.) is easy for a human to read but hard for a computer to extract from the OCR text alone (we are exploring adding document geometry signals via vision/segmentation models).

Goal: Which organization paid for a political ad on television?

There are several fields we need to extract from a receipt to understand it: who paid, how much, and when the ad aired. The goal of this specific model is to extract the name of the organization that paid for a particular TV ad. Given the receipt as an unstructured text string (parsed from a PDF document with OCR), can we learn whether a particular named entity (in most cases, a political committee) is mentioned in this receipt?

Try binary classification (label match? y/n) because we don't know the full label set

We frame this as a fuzzy string matching problem. We can't train a standard multi-class classifier because we don't know the full list of possible committees/paying entities in advance (especially for future elections that haven't happened yet). However, we do know the full list of committees officially registered with the FEC in a given election, along with some metadata on those committees that further constrains the options (e.g. which committees support which presidential candidate). The current "deep pixel" model is a one-dimensional CNN over the receipt text.
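
As a rough illustration only, a minimal tf.keras sketch of such a 1D-CNN matcher might look like the following. The input encoding (one-hot characters of the receipt text plus the candidate committee name), the layer sizes, and the function name are all assumptions, not the project's actual architecture:

```python
import tensorflow as tf

MAX_LEN = 4000   # assumed max character length of receipt text + candidate name
VOCAB = 128      # assumed character vocabulary size (e.g. printable ASCII, one-hot)

def build_deep_pixel_sketch():
    """Binary matcher: does this candidate committee name match this receipt?"""
    inputs = tf.keras.Input(shape=(MAX_LEN, VOCAB))
    x = tf.keras.layers.Conv1D(64, kernel_size=7, activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling1D(pool_size=4)(x)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # P(match)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```
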
I train the model with a variable number of distractors: training samples where the committee name is randomly chosen and does not match the document. These negative examples help prevent overfitting. Below, I describe my process for improving the "deep pixel" model; you can expand the tabs at the bottom of each section to see details on the individual runs.
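
A minimal sketch of how such distractor pairs might be generated (the function and variable names are hypothetical, and the real pipeline may differ):

```python
import random

def make_training_pairs(receipts, committees, num_distractors=3):
    """For each labeled receipt, emit one positive pair plus `num_distractors`
    negative pairs with randomly chosen non-matching committee names."""
    pairs = []
    for text, true_committee in receipts:          # receipts: list of (ocr_text, committee)
        pairs.append((text, true_committee, 1))    # true match
        negatives = [c for c in committees if c != true_committee]
        for fake in random.sample(negatives, num_distractors):
            pairs.append((text, fake, 0))          # distractor
    random.shuffle(pairs)
    return pairs
```
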

Random performance to a solid B: Just clean the code

Our original implementation of "deep pixel" languished at random performance until I cleaned up the code, which produced the "clean code" baseline shown in black. From there, I experimented with a few manual modifications. The rainbow color gradient matches the order in which I ran the experiments: red is the first and violet is the last.

Observations from manual sweep

Running on GPU

When I finally switched from Keras to tf.keras, training moved to the GPU and got 12-40X faster. I tried playing with dropout and adding layers (either a second Conv1D or a second dense layer; I think this may still be worth exploring). Again, the rainbow color spectrum indicates the order of the experiments (red is first; violet into magenta, the "new baseline", is last).
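
As a hedged sketch of the kind of variant explored here (a second Conv1D block plus dropout), building on the hypothetical model above; the specific layer sizes, dropout rate, and function name are assumptions, not the project's actual settings:

```python
# Standalone Keras would import from `keras`; in this project, switching to
# tf.keras is what moved training onto the GPU.
from tensorflow.keras import Input, Model, layers

def build_variant(max_len=4000, vocab=128, second_conv=True, dropout_rate=0.3):
    """One hypothetical variant: optional second Conv1D block plus dropout."""
    inputs = Input(shape=(max_len, vocab))
    x = layers.Conv1D(64, 7, activation="relu")(inputs)
    x = layers.MaxPooling1D(4)(x)
    if second_conv:
        x = layers.Conv1D(64, 5, activation="relu")(x)
        x = layers.MaxPooling1D(4)(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```
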

Observations

The solution? Add more distractors. This is the black line, "distractors for real", with 3 false matches for every true match in training, which finally got me above 90%. I include it here for context, since it makes tuning the rest of these hyperparameters largely irrelevant. In fact, an initial sweep was fun to play with here but didn't really improve on my manual tuning (except leaning towards more parameters and more memorization :). A hypothetical sweep configuration is sketched under "Sweep explorations" below.

Sweep explorations
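
A W&B sweep over these knobs could be configured roughly as follows; the parameter names, value ranges, and project name are assumptions, not the sweep actually run here:

```python
import wandb

# Hypothetical sweep configuration: tune the number of distractors alongside
# a couple of other model knobs, maximizing validation accuracy.
sweep_config = {
    "method": "random",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "num_distractors": {"values": [1, 3, 5, 9]},
        "dropout": {"values": [0.0, 0.2, 0.4]},
        "dense_units": {"values": [32, 64, 128]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="deepform")
# wandb.agent(sweep_id, function=train)  # `train` would build, fit, and log a model
```
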

Distractors are the magic

Finally above 95% (95.4%!)

Increasing the number of distractors to 3 (red), 5 (orange), and 9 (yellow) continues to improve the validation accuracy, though there is still evidence of overfitting. Most alarmingly, the learning all happens in the first epoch (probably because we're still missing an embedding). Also, as I tried to increase past 9 distractors, my runs started dying (maxing out GPU memory) because I wasn't loading training data efficiently. Note that training with 11 distractors on half the data is roughly equivalent to (maybe slightly noisier than) training with 9 distractors on the full data, so refactoring the training data load should let us scale to more distractors.
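
One possible shape for that refactor, sketched under assumptions (names and data shapes are hypothetical): stream (receipt, candidate, label) pairs from a generator instead of materializing the full distractor-expanded dataset in memory.

```python
import random

def streaming_pairs(receipts, committees, num_distractors=9):
    """Lazily yield one positive and `num_distractors` negative pairs per receipt,
    so memory usage stays flat as the distractor count grows."""
    while True:
        text, true_committee = random.choice(receipts)
        yield (text, true_committee, 1)                      # true match
        negatives = [c for c in committees if c != true_committee]
        for fake in random.sample(negatives, num_distractors):
            yield (text, fake, 0)                            # distractor
```

A generator like this could be wrapped with tf.data.Dataset.from_generator (or fed to model.fit through a batching wrapper) so that only the current batch is encoded and held in memory.
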

Next steps

How many distractors are best?

This plot averages accuracy by the number of distractors (incorrect matches of committee name/organization to receipt) used alongside each correctly matched example during training. There are 243 runs in total, with 2-29 runs falling into each group. The line colors follow a rainbow gradient in increasing number of distractors: red = 1 distractor, orange = 2 distractors, and so on.
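
For reference, an aggregation like this plot's could be reproduced with the W&B public API roughly as follows; the project path and the config/summary key names are assumptions:

```python
from collections import defaultdict
import wandb

api = wandb.Api()
runs = api.runs("stacey/deepform")  # hypothetical entity/project path

acc_by_distractors = defaultdict(list)
for run in runs:
    n_distractors = run.config.get("num_distractors")   # assumed config key
    acc = run.summary.get("val_accuracy")                # assumed summary key
    if n_distractors is not None and acc is not None:
        acc_by_distractors[n_distractors].append(acc)

for n, accs in sorted(acc_by_distractors.items()):
    print(f"{n} distractors: mean accuracy {sum(accs) / len(accs):.3f} over {len(accs)} runs")
```
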

Observations