DeepForm: Understand Structured Documents at Scale

A benchmark to extract text from visually-structured forms, starting with political ad receipts. Made by Stacey Svetlichnaya using Weights & Biases

TL;DR: Our latest benchmark, DeepForm, extracts political campaign finance data from scanned receipts. Join this collaboration on an impactful AI/ML solution for fields ranging from investigative journalism and democracy to climate science, medicine, and beyond.

Introducing the DeepForm Benchmark

DeepForm is a project to extract visually-structured information from forms, starting with a large dataset of receipts for political campaign ads bought around US elections. By releasing this dataset and baseline code as an open Weights & Biases benchmark, we hope to support and accelerate wider collaboration on machine learning approaches to the general and surprisingly nontrivial problem of automatically extracting data from documents.

→ Join the benchmark   → Read more context   → See the code

Why this matters

Concretely, improving accuracy on this task would make political campaign finance information for the 2020 US election (and beyond) much more accessible. US TV stations are legally required to publicly disclose these ad sales, but not to make them machine readable or easy to aggregate. Every election, tens of thousands of PDFs are posted to the FCC Public Files in hundreds of different formats. Sifting through them to understand the larger picture or find interesting trends is an incredibly time-consuming and/or expensive process for investigative journalists and regular folks alike. Building on earlier manual efforts (see below), we apply modern deep learning techniques to improve and massively scale our ability to parse and work with this data, which is a crucial part of the US information ecosystem and democratic process.

We provide the dataset and code as a public open-source benchmark to encourage and facilitate collaboration on both the immediate task—automate parsing these receipts so we can more easily aggregate and understand campaign spending data from US elections—and the longer-term vision of scalable information extraction. Many proprietary and domain-specific solutions exist, but we haven't yet found a single approach that is freely available and sufficiently powerful/tunable. If we can collaborate more broadly to train a robust model to understand hundreds of types of TV ad receipts, we can generalize it to other form types in important document contexts like medical records, climate logs, human rights archives, and much more.

The Dataset

The dataset contains ~20,000 labeled receipts for political television ads bought in the 2012, 2014, and 2020 US elections. The original PDFs are available on the FCC website. We have combined labeled data from several efforts, including the crowdsourced Free the Files project, hand-coded templatized extraction by Alex Byrnes, and 1,000 documents we labeled ourselves. For each document, we provide the PDF page images, the OCRed text, the pixel-space bounding box coordinates of each word token, and the ground truth text and locations for several key types of semantic fields.
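To make this concrete, here is a minimal sketch of what one labeled record might look like once loaded. The field names below are illustrative only; the exact schema is documented in the DeepForm repository.

```python
# Hypothetical layout of one labeled receipt (field names are illustrative,
# not the exact schema shipped with the dataset).
example_document = {
    "slug": "some-fcc-document-id",            # document identifier
    "pages": ["page_0.png", "page_1.png"],     # rendered PDF page images
    "tokens": [                                # OCRed words with pixel-space boxes
        {"page": 0, "text": "INVOICE",   "x0": 412, "y0": 58,   "x1": 498,  "y1": 76},
        {"page": 0, "text": "$3,750.00", "x0": 980, "y0": 1142, "x1": 1071, "y1": 1160},
    ],
    "labels": {                                # ground-truth text and location per field
        "advertiser":   {"text": "Committee To Elect Jane Doe", "page": 0, "bbox": [112, 204, 395, 222]},
        "gross_amount": {"text": "3750.00",                     "page": 0, "bbox": [980, 1142, 1071, 1160]},
    },
}
```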

Visually-structured semantic data

These fields, like "advertiser" (who paid for the ad) and "gross amount" (total amount paid), contain the crucial information on the receipt that can be hard to extract from the text alone. For example, receipts list many names of individuals and organizations involved (including the names of TV shows during which the ads run), but only one is the correct advertiser (typically a political committee, but not always). In a long itemized receipt, the total paid might appear at the end of the list or highlighted in the top right of the first page, and it may not even be the largest number on the page if the receipt records bidirectional transactions.

Some examples are below. Each panel shows a receipt page with the ground truth labels (orange) and model predictions (colored by field type); an adjustable minimum score threshold controls how many predicted bounding boxes are displayed.

Note the diversity of receipt layouts. Like all real-world datasets, this one has some noise and many missing fields (e.g. different subsets of fields are available for different election years), and some documents are challenging even for a human to label correctly. Please see the DeepForm GitHub repository for dataset access and further details.
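If you would rather inspect documents locally than in the report panels, a few lines of Pillow are enough to overlay label or prediction boxes on a page image. This is a minimal sketch assuming the hypothetical record layout illustrated above.

```python
from PIL import Image, ImageDraw

def draw_boxes(page_image_path, boxes, color="orange", width=3):
    """Overlay pixel-space [x0, y0, x1, y1] bounding boxes on a page image."""
    image = Image.open(page_image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for box in boxes:
        draw.rectangle(box, outline=color, width=width)
    return image

# e.g. draw the ground-truth boxes from the hypothetical record above:
# boxes = [label["bbox"] for label in example_document["labels"].values()]
# draw_boxes("page_0.png", boxes).save("page_0_labels.png")
```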


The Challenge

Given the noise and scale of this problem—there is a very long tail of possible form layouts and over 100,000 unlabeled PDFs from 2020 election ads alone—how can we apply deep learning to train the most general form-parsing model with the fewest hand-labeled examples? To start, we've created a benchmark around this dataset. We train on a fixed split of 15,000 documents and report average token prediction accuracy across five fields: advertiser, gross amount (or total paid), contract number (a unique identifier for the transaction), and the start and end dates of the ad's run (flight_from and flight_to).
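As a rough illustration of how the score is computed, you can think of the benchmark number as per-field extraction accuracy averaged over the five fields. The exact matching rules (token-level scoring, normalization of amounts and dates) live in the benchmark code, so the sketch below is a simplified, document-level stand-in.

```python
FIELDS = ["advertiser", "contract_num", "gross_amount", "flight_from", "flight_to"]

def field_accuracy(predictions, ground_truth, field):
    """Fraction of labeled documents whose predicted value matches the label
    for one field. Real scoring normalizes values (amounts, dates) first."""
    labeled = [slug for slug, labels in ground_truth.items() if labels.get(field)]
    correct = sum(
        1 for slug in labeled
        if predictions.get(slug, {}).get(field) == ground_truth[slug][field]
    )
    return correct / len(labeled) if labeled else 0.0

def benchmark_score(predictions, ground_truth):
    """Average extraction accuracy across the five benchmark fields."""
    return sum(field_accuracy(predictions, ground_truth, f) for f in FIELDS) / len(FIELDS)
```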

Some of these fields are more templatized/standard and easier to extract than others. Our current baseline achieves around 70% accuracy averaged across the five fields—see the learning curves and accuracy progression for the different fields below. An initial application of deep learning to this problem (by Jonathan Stray, summer 2019) reached 90% accuracy on extracting total amounts alone from a held-out test set, showing both that deep learning can generalize well to unseen form types and that we have considerable room for improvement. We're exploring several other architectures from the latest literature, as well as more traditional approaches like named entity recognition for the advertiser field. We welcome any contributions, and especially comparisons to existing approaches—join the benchmark to get started!
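For the named-entity-recognition angle, one quick way to generate advertiser candidates is to run an off-the-shelf NER model over the OCRed text and keep organization-like entities. This is a baseline sketch using spaCy, not the benchmark's own pipeline; a downstream model (or simple position and frequency heuristics) would still need to pick the correct candidate.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def advertiser_candidates(ocr_text):
    """Return organization entities from the OCRed page text as candidate advertisers."""
    doc = nlp(ocr_text)
    return [ent.text for ent in doc.ents if ent.label_ == "ORG"]

# Example:
# candidates = advertiser_candidates("Agreement between WXYZ-TV and Committee To Elect Jane Doe ...")
```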

Next steps

Below are some extensions we're pursuing and ideas for what to try next. We are actively looking for collaborators—if you're interested, please reach out via the comments at the bottom of this report, the benchmark discussion page, or directly to stacey@wandb.com.
