In the United States, TV stations must publicize the receipts for any political advertisements they show: which organization paid—and how much—for a particular ad. However, these receipts are not required to be machine readable. So every election, the FCC Public File displays tens of thousands of PDF receipts in hundreds of different formats. Can we use deep learning to make this information accessible to journalists and the public more efficiently?
In 2012, ProPublica ran Free The Files, a crowdsourced annotation project where volunteers hand-labeled over 17,000 receipts to generate this dataset. In this project, I am currently working with 9018 labeled examples from the 2012 election, 7218 for training and 1800 in validation.
This is volunteer collaboration and very much a work in progress. We're trying to improve these modeling efforts and move quickly as the 2020 election approaches. Please reach out to firstname.lastname@example.org if you want to contribute.
Structured information—who paid, on what date, when was the ad aired, etc—is easy for a human to read but hard for a computer to extract from the OCR text alone (we are exploring adding document geometry signals via vision/segmentation models).
There are several fields we need to extract from a receipt to understand it:
The goal of this specific model is to extract the name of the organization that paid for a particular TV ad. Given the receipt as an unstructured text string (parsed from a PDF document with OCR), can we learn if a particular named entity (in most cases, a committee) is mentioned in this receipt?
We frame this as a fuzzy string matching problem. We can't train a classification model because we don't know the full list of possible committees/paying entities in advance (especially for future elections that haven't happened yet). However, we do know the full list of committees officially registered with the FCC in a given election, and some metadata on those to further constrain the options (e.g. which committees support which presidential candidate). The current "deep pixel" model is a one-dimensional CNN:
I train the model with a variable number of distractors: training samples where the committee name is randomly chosen and does not match the document, to prevent overfitting. Here I describe my process for improving the "deep pixel" model. You can expand the tabs at the bottom of each section to see details on the individual runs.