
Project DeepForm: Extract Information from Documents
Objective
DeepForm aims to extract information from TV and cable political advertising disclosure forms using deep learning and provide a challenging journalism-relevant dataset for NLP/ML researchers. This public data is valuable to journalists but locked in PDFs. Through this benchmark, we hope to accelerate collaboration on the concrete task of making this data accessible and longer-term solutions for general information extraction from visually-structured documents in fields like medicine, climate science, social science, and beyond.
- Example W&B Report with ideas for next steps
- More context & why this benchmark matters
- Dataset details and access
Dataset and metrics
This first iteration of the benchmark trains on a fixed split of 15,000 documents. The full dataset contains ~20,000 labeled receipts from political television ads bought in the 2012, 2014, and 2020 US elections. The original PDFs are available on the FCC website. We have combined labeled data from several efforts, including the crowdsourced Free the Files project, hand-coded templatized extraction by Alex Byrnes, and 1,000 documents we labeled ourselves. For each document, we provide the PDF page images, the OCRed text, the pixel-space bounding box coordinates of each word token, and the ground truth text and bounding box locations for five key semantic fields (a sketch of one possible record layout follows the list):
- amount: gross_amount, or the total amount paid for the ad
- flight_from and flight_to: the start and end air dates (often known as "flight dates")
- contractid: contract_num, or the unique identifier of the contract (multiple documents can have the same number, as a contract covering future air dates is revised)
- advertiser: who paid for the ad (often the name of a political committee, but not always)
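To make the per-document data concrete, here is a minimal sketch of how one such record might be represented in Python. The class and field names below are illustrative assumptions, not the actual deepform schema:

```python
# Illustrative sketch only: the actual deepform data format may differ.
from dataclasses import dataclass


@dataclass
class Token:
    text: str   # one OCRed word
    x0: float   # bounding box in pixel space
    y0: float
    x1: float
    y1: float


@dataclass
class Document:
    slug: str                      # document identifier
    tokens: list[Token]            # every OCRed word with its box
    labels: dict[str, str]         # e.g. {"gross_amount": "1,500.00", ...}
    label_boxes: dict[str, Token]  # ground-truth location of each field
```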
We track and report the percent accuracy for each field type and rank submissions by avg_acc, the average accuracy across all five fields. You can see the current standings on the leaderboard. We encourage you to clone our repo, try the code yourself, and submit any extensions to the leaderboard!
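As a concrete illustration, here is a minimal sketch of how that ranking metric could be computed. The exact-match comparison and field names are our assumptions, not necessarily the benchmark's official scoring code:

```python
# Minimal sketch; assumes exact string match per field, which may
# differ from the benchmark's actual scoring implementation.
FIELDS = ["gross_amount", "flight_from", "flight_to", "contract_num", "advertiser"]


def field_accuracy(predictions, ground_truth, field):
    """Fraction of documents where the predicted field matches exactly."""
    correct = sum(
        1 for pred, truth in zip(predictions, ground_truth)
        if pred[field] == truth[field]
    )
    return correct / len(ground_truth)


def avg_acc(predictions, ground_truth):
    """Average of the per-field accuracies across all five fields."""
    return sum(
        field_accuracy(predictions, ground_truth, f) for f in FIELDS
    ) / len(FIELDS)
```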
Some sample receipts appear below. In the first detailed view, Weights & Biases bounding boxes help interactively visualize annotations for ground truth (orange) vs predictions (amount = blue, flight_from = green, flight_to = red, advertiser = pink, contractid = purple).
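If you want to log annotated receipts like these yourself, the sketch below shows one way to do it with the wandb bounding-box API. The image, coordinates, and class IDs are made up for illustration; real boxes would come from the OCR tokens and model predictions:

```python
import numpy as np
import wandb

run = wandb.init(project="deepform")  # hypothetical project name

# Stand-in page image; in practice, load the PDF page render.
page = np.zeros((1100, 850, 3), dtype=np.uint8)

run.log({
    "receipt": wandb.Image(
        page,
        boxes={
            "predictions": {
                "box_data": [{
                    # Illustrative coordinates only.
                    "position": {"minX": 420, "maxX": 560, "minY": 90, "maxY": 120},
                    "domain": "pixel",  # coordinates are in pixels, not fractions
                    "class_id": 0,
                    "box_caption": "amount",
                }],
                "class_labels": {0: "amount", 1: "flight_from", 2: "flight_to",
                                 3: "advertiser", 4: "contractid"},
            }
        },
    )
})
```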
Quickstart
Check out the project-deepform repo, and specifically its Running section, to get started. Once you set up a Docker and/or Poetry developer environment, training a model is as simple as, in Docker:
make train
or in Poetry:
python -m deepform.train
By default, the code will train for 50 epochs on all 15,000 documents. Our baseline trains for 200 epochs. You can add the following flags for a super quick test on 100 documents for a single epoch:
python -m deepform.train --epochs 1 --len-train 100
Submission instructions
To get started, clone the project-deepform repo and follow the README: train the baseline model, try some extensions, and submit your work through the W&B UI as follows (a short logging sketch appears after these steps):
- Identify the training run logged to W&B which you'd like to submit to the benchmark. It will have a URL of the form "wandb.ai/USER_NAME/PROJECT_NAME/runs/RUN_ID", where RUN_ID is a short alphanumeric string.
- Click on the "Submit a run" button at the top right of this page.
- Paste the run path (USER_NAME/PROJECT_NAME/runs/RUN_ID) into the submission field.
- Fill in the "Submission Notes": we'd love to hear some details on your approach and/or see a link to a W&B report for your work. Sharing the code for your submissions is the default and strongly encouraged for this open collaborative benchmark.
- Press "Confirm Submission" and you're done! We will review submissions as they come in.
More examples
Note the diversity of formats:
Initial attempts to annotate predictions (for amount in magenta) vs ground truth (other colors):