DeepForm: Track Political Advertising with Deep Learning
Fuzzy string search (or binary matching) on entity names from receipt PDFs. Made by Stacey Svetlichnaya using Weights & Biases
Overview: Automatically parse receipts for TV ads for political campaigns
In the United States, TV stations must publicize the receipts for any political advertisements they show: which organization paid for a particular ad, and how much. However, these receipts are not required to be machine readable. So every election, the FCC Public File displays tens of thousands of PDF receipts in hundreds of different formats. Can we use deep learning to make this information accessible to journalists and the public more efficiently?
In 2012, ProPublica ran Free The Files, a crowdsourced annotation project where volunteers hand-labeled over 17,000 receipts to generate this dataset. In this project, I am currently working with 9,018 labeled examples from the 2012 election: 7,218 for training and 1,800 for validation.
This is a volunteer collaboration and very much a work in progress. We're trying to improve these modeling efforts and move quickly as the 2020 election approaches. Please reach out to firstname.lastname@example.org if you want to contribute.
Example receipt from training data
Structured information (who paid, on what date, when the ad aired, etc.) is easy for a human to read but hard for a computer to extract from the OCR text alone. We are exploring adding document geometry signals via vision/segmentation models.
Goal: Which organization paid for a political ad on television?
There are several fields we need to extract from a receipt to understand it:
name of the organization that paid for the ad
total amount of money paid for the ad
invoice/contract id (uniquely identifies this ad deal)
dates the ad aired
potentially other fields (addresses, contact names, etc.)
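As a rough illustration, the fields above could be collected into a single record per receipt. This is only a sketch: the class name `ReceiptRecord`, the field names, and the example values are all hypothetical and not taken from the project code.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Dict, List, Optional


@dataclass
class ReceiptRecord:
    """Hypothetical container for the fields we want from one receipt."""
    payer_name: str                       # organization that paid for the ad
    total_amount: Optional[Decimal]       # total paid; None if not extracted yet
    contract_id: Optional[str]            # invoice/contract id for this ad deal
    air_dates: List[date] = field(default_factory=list)   # dates the ad aired
    other: Dict[str, str] = field(default_factory=dict)   # addresses, contacts, etc.


# Made-up values, purely for illustration.
example = ReceiptRecord(
    payer_name="EXAMPLE COMMITTEE FOR CONGRESS",
    total_amount=Decimal("1250.00"),
    contract_id="INV-0001",
    air_dates=[date(2012, 10, 1), date(2012, 10, 2)],
)
```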
The goal of this specific model is to extract the name of the organization that paid for a particular TV ad. Given the receipt as an unstructured text string (parsed from a PDF document with OCR), can we learn if a particular named entity (in most cases, a committee) is mentioned in this receipt?
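To make the question concrete, here is a naive, non-learned baseline for "is this name mentioned in the receipt?" using the standard library's `difflib`. It is not the model described in this report. The function names, the 0.8 threshold, and the example strings are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def best_window_score(document: str, name: str) -> float:
    """Best similarity between `name` and any same-length window of `document`."""
    doc, target = document.upper(), name.upper()
    width = len(target)
    if width == 0:
        return 0.0
    if len(doc) < width:
        return SequenceMatcher(None, doc, target).ratio()
    # Slide a window of len(name) characters over the document text.
    return max(
        SequenceMatcher(None, doc[i:i + width], target).ratio()
        for i in range(len(doc) - width + 1)
    )


def mentions(document: str, name: str, threshold: float = 0.8) -> bool:
    """Binary answer: does the document (approximately) contain the name?"""
    return best_window_score(document, name) >= threshold


# Made-up OCR text with a typical error: the letter O read as the digit 0.
ocr_text = "INVOICE 0001 ADVERTISER: EXAMPLE C0MMITTEE FOR CONGRESS TOTAL $500.00"
found = mentions(ocr_text, "Example Committee for Congress")
not_found = mentions(ocr_text, "Sample PAC for Senate")
```

A baseline like this tolerates small OCR errors but compares every window against every candidate name, which is slow on long documents, and the threshold has to be tuned by hand.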
Try binary classification (label match? y/n) because we don't know the full label set
We frame this as a fuzzy string matching problem. We can't train a multi-class classifier over committee names because we don't know the full list of possible committees/paying entities in advance (especially for future elections that haven't happened yet). However, we do know the full list of committees officially registered with the FEC in a given election, and we have some metadata on them to further constrain the options (e.g. which committees support which presidential candidate).
The current "deep pixel" model is a one-dimensional CNN:
it reads in a sliding window of kernel_size characters along the receipt document
at each step/character, it also reads in a window of comm_input_len characters of the candidate committee name (the first 10-25 characters), hopefully learning to match across the two.
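The pairing of inputs described above can be sketched in plain Python. This only shows which characters are visible at each step, not the learned convolution itself. Only `kernel_size` and `comm_input_len` come from the description; the function name `paired_windows`, the default values, and the right-padding with spaces are assumptions.

```python
from typing import Iterator, Tuple


def paired_windows(document: str, committee: str,
                   kernel_size: int = 5,
                   comm_input_len: int = 20) -> Iterator[Tuple[str, str]]:
    """Yield (document window, committee prefix) for each step along the text."""
    # Truncate or right-pad the committee name to a fixed length (assumed padding).
    prefix = committee[:comm_input_len].ljust(comm_input_len)
    # One window of kernel_size characters per starting position in the document.
    for start in range(len(document) - kernel_size + 1):
        yield document[start:start + kernel_size], prefix


windows = list(paired_windows("ABCDEFG", "EXAMPLE COMMITTEE FOR CONGRESS",
                              kernel_size=3, comm_input_len=10))
```

In a real model each character would first be encoded numerically (for example as an index or one-hot vector) before being passed to the convolutional layers.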
I train the model with a variable number of distractors: training samples where the committee name is randomly chosen and does not match the document, to prevent overfitting. Here I describe my process for improving the "deep pixel" model. You can expand the tabs at the bottom of each section to see details on the individual runs.
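A minimal sketch of how distractor samples might be generated for one receipt, assuming we have the receipt text, its true committee name, and a pool of other committee names. The function name `make_pairs`, the label convention (1 = match, 0 = distractor), and the example names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def make_pairs(document: str, true_name: str, all_names: Sequence[str],
               n_distractors: int, rng: random.Random) -> List[Tuple[str, str, int]]:
    """Return one positive pair plus up to `n_distractors` negative pairs."""
    pairs = [(document, true_name, 1)]  # the matching committee, label 1
    # Distractors are sampled from every name except the true one.
    candidates = [name for name in all_names if name != true_name]
    for name in rng.sample(candidates, min(n_distractors, len(candidates))):
        pairs.append((document, name, 0))  # randomly chosen non-match, label 0
    return pairs


# Made-up committee names, purely for illustration.
names = ["EXAMPLE COMMITTEE FOR CONGRESS", "SAMPLE PAC FOR SENATE",
         "CITIZENS FOR EXAMPLES", "FRIENDS OF SAMPLE"]
pairs = make_pairs("...receipt text...", names[0], names,
                   n_distractors=2, rng=random.Random(0))
```

Varying `n_distractors` changes the ratio of negative to positive samples the model sees during training.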
Random performance to a solid B: Just clean the code
Running on GPU
Distractors are the magic
How many distractors are best?