Overview: Automatically parse receipts for TV ads for political campaigns

In the United States, TV stations must publicize the receipts for any political advertisements they show: which organization paid—and how much—for a particular ad. However, these receipts are not required to be machine readable. So every election, the FCC Public File displays tens of thousands of PDF receipts in hundreds of different formats. Can we use deep learning to make this information accessible to journalists and the public more efficiently?

Read more about this project →

Raw data →

In 2012, ProPublica ran Free The Files, a crowdsourced annotation project in which volunteers hand-labeled over 17,000 receipts to generate this dataset. In this project, I am currently working with 9018 labeled examples from the 2012 election: 7218 for training and 1800 for validation.

Project code →

This is a volunteer collaboration and very much a work in progress. We're trying to improve these modeling efforts and move quickly as the 2020 election approaches. Please reach out to stacey@wandb.com if you want to contribute.

Example receipt from training data

Structured information (who paid, on what date, when the ad aired, etc.) is easy for a human to read but hard for a computer to extract from the OCR text alone. We are exploring adding document geometry signals via vision/segmentation models.

Goal: Which organization paid for a political ad on television?

There are several fields we need to extract from a receipt to understand it:

The goal of this specific model is to extract the name of the organization that paid for a particular TV ad. Given the receipt as an unstructured text string (parsed from a PDF document with OCR), can we learn if a particular named entity (in most cases, a committee) is mentioned in this receipt?

Try binary classification (label match? y/n) because we don't know the full label set

We frame this as a fuzzy string matching problem: given a receipt and a candidate committee name, does the name appear in the receipt (yes or no)? We can't train a multi-class classifier over committee names because we don't know the full list of possible committees/paying entities in advance (especially for elections that haven't happened yet). However, we do know the full list of committees officially registered with the FEC in a given election, and some metadata on those to further constrain the options (e.g. which committees support which presidential candidate). The current "deep pixel" model is a one-dimensional CNN.
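To make the "label match? y/n" framing concrete, here is a minimal baseline sketch that scores whether a candidate committee name appears in noisy OCR text. It is not the "deep pixel" CNN: it swaps in the standard library's `difflib.SequenceMatcher` purely for illustration. The function names, the 0.8 threshold, the sample receipt text, and the committee names are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR line breaks don't matter."""
    return " ".join(text.lower().split())


def best_match_score(document: str, candidate: str) -> float:
    """Best similarity in [0, 1] between the candidate name and any
    window of the document with the same number of words."""
    doc_words = normalize(document).split()
    cand = normalize(candidate)
    width = len(cand.split())
    if width == 0 or not doc_words:
        return 0.0
    best = 0.0
    # Slide a word window across the document; short documents get one window.
    for start in range(max(1, len(doc_words) - width + 1)):
        window = " ".join(doc_words[start:start + width])
        best = max(best, SequenceMatcher(None, window, cand).ratio())
    return best


def mentions(document: str, candidate: str, threshold: float = 0.8) -> bool:
    """Binary decision: is the candidate name (approximately) in the document?"""
    return best_match_score(document, candidate) >= threshold


# Hypothetical OCR output with a typical error ("0F" instead of "OF").
receipt = "INVOICE  Advertiser: FRIENDS 0F JANE DOE  Gross Total $12,500.00"
for name in ["Friends of Jane Doe", "Citizens for Better Roads"]:
    print(name, mentions(receipt, name))
```

Because the question is asked per (receipt, candidate name) pair, a committee that first appears in a future election can be scored without retraining over a fixed label set.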

To prevent overfitting, I train the model with a variable number of distractors: training samples where the committee name is chosen at random and does not match the document. Here I describe my process for improving the "deep pixel" model. You can expand the tabs at the bottom of each section to see details on the individual runs.
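The distractor idea can be sketched as a small data-preparation step. This is a hedged illustration rather than the project's actual pipeline: the helper `make_pairs`, its parameters, and the choice to sample uniformly without replacement from the known committee list are all assumptions.

```python
import random
from typing import List, Tuple


def make_pairs(
    documents: List[Tuple[str, str]],
    committees: List[str],
    n_distractors: int,
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Return shuffled (document_text, candidate_name, label) triples.

    Each document yields one positive pair (label 1) with its true committee
    and up to `n_distractors` negative pairs (label 0) whose committee name
    is drawn at random from the rest of the list.
    """
    rng = random.Random(seed)
    pairs: List[Tuple[str, str, int]] = []
    for text, true_name in documents:
        pairs.append((text, true_name, 1))
        # Exclude the true committee so a distractor never matches the document.
        others = [c for c in committees if c != true_name]
        k = min(n_distractors, len(others))
        for name in rng.sample(others, k):
            pairs.append((text, name, 0))
    rng.shuffle(pairs)
    return pairs


# Hypothetical data: two receipts and four registered committees.
committees = ["Committee A", "Committee B", "Committee C", "Committee D"]
documents = [("ocr text of receipt 1", "Committee A"),
             ("ocr text of receipt 2", "Committee C")]
pairs = make_pairs(documents, committees, n_distractors=2)
print(len(pairs), sum(label for _, _, label in pairs))
```

Varying `n_distractors` changes the ratio of negative to positive examples, which is the knob explored in the sections below.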

Random performance to a solid B: Just clean the code

Running on GPU

Distractors are the magic

How many distractors are best?
