Named Entity Recognition with W&B and spaCy

Visualize and explore named entities with spaCy in W&B. Made by Stacey Svetlichnaya using Weights & Biases

Overview

Named Entity Recognition (NER) is a common task in information extraction: given some text, such as a news article, can we identify certain words or phrases (tokens) as important proper nouns, locations, dates, etc.? NER classifies tokens into specific predetermined entity types: people's names ("John", "Mr. Smith", "Dr. Jane Smith"), organization names ("New York Times", "the Sinclair Broadcast Group", "ABC"), times ("7:30am", "23:00", "tonight"), and more. In this example, I use the DeepForm dataset of political ad receipts to illustrate NER with spaCy in W&B.
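To make the task concrete, here is the kind of output an NER system produces for a single sentence. This is a hand-written sketch, not actual model output; the labels follow spaCy's entity scheme:

```python
# Hand-labeled sketch of NER output for one sentence — not real model
# predictions. Labels follow spaCy's entity scheme (PERSON, ORG, TIME, ...).
sentence = "Dr. Jane Smith joined the New York Times at 7:30am."

entities = [
    ("Jane Smith", "PERSON"),     # a person's name
    ("New York Times", "ORG"),    # an organization
    ("7:30am", "TIME"),           # a time expression
]

# every token outside these spans gets no entity label
for text, label in entities:
    assert text in sentence
```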

Visualizing named entities with spaCy

To view the raw text with spaCy's NER annotations, log a wandb spaCy plot directly to a wandb.Table: use wandb.plots.NER(docs=document), where document is the spaCy-parsed result of the raw text. Below is a snippet of an annotated document and a Table with 5 samples. Hover over a "spacy_plot" row and click on the gray box in the top right corner to scroll through a full-screen view. These sample documents are the unstructured streams of text extracted from the scanned receipts via optical character recognition. You'll notice very high recall but unfortunately many irrelevant false positives. See below for samples of the original receipts, which are more coherent and diverse when the visual/geometric information is included (instead of just the raw characters).
Almost any of these words could be an entity.

Sample receipts for TV ads

These examples from 2020 show the diversity of receipt formats and the level of noise in the ground truth data. Click on the box in the top right of each image to "View full screen" so you can see the details and the correct answer tokens annotated in orange.

Find the cost, buyer, and dates on a receipt

The concrete task in DeepForm is to extract certain meaningful fields from PDF receipts. Each receipt corresponds to a political television advertisement aired around a US election. We'd like to know which organization ("advertiser") paid how much ("amount") for which ad ("contractid") during which time period ("flight_from" and "flight_to" for the start and end of the ad's airtime). Several of the entity types that the spaCy library extracts automatically are relevant to these fields.
Below, I pull in the full OCR-ed text from a few sample PDF receipts from 2020 and show how W&B Tables can help us understand the data and develop a simple baseline model with spaCy's NER functionality.
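One way to make the connection between spaCy's entities and DeepForm's fields concrete is a lookup table. The alignment below is a hypothetical starting point I'm sketching for illustration, not a ground-truth mapping from the dataset:

```python
# Rough, hypothetical alignment between spaCy's built-in entity labels
# and the DeepForm fields — a starting point, not a ground-truth mapping.
SPACY_TO_DEEPFORM = {
    "ORG": "advertiser",              # the paying organization
    "MONEY": "amount",                # the ad's cost
    "DATE": "flight_from/flight_to",  # start/end of the ad's airtime
    "CARDINAL": "contractid",         # contract ids often parse as bare numbers
}

def candidate_field(ent_label):
    """Return the DeepForm field a spaCy entity might correspond to, if any."""
    return SPACY_TO_DEEPFORM.get(ent_label)
```

A mapping like this lets us filter the parsed entities down to just the types worth inspecting in the Table.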

Note: first OCR the PDF image to extract raw text

As a first pass, I use spaCy's NER model to extract entities from the raw text—just the strings detected in the image of the receipt using pdfplumber's optical character recognition (OCR). Focusing on the parsed characters and not the image pixels throws away important signal—the visual layout of the receipt, vertical proximity across lines (e.g. for the address), the use of bold font for emphasis—but this lets us quickly evaluate performance with an off-the-shelf model. Perhaps the text alone is sufficient.

Log NER results to a wandb.Table

I log each extracted entity to a wandb.Table: the predicted entity type, the corresponding text tokens, the text length, and the start and end positions of the entity in the document. This produces a Table like the following:

Sample code to produce this table

```python
import wandb
import spacy

wandb.init(project="ner_spacy")
nlp = spacy.load("en_core_web_sm")
docs = load_docs()

data = []
for d_key, d in docs.items():
    # extract entities using spaCy
    parsed = nlp(d)
    for ent in parsed.ents:
        token_length = int(ent.end_char) - int(ent.start_char)
        data.append([d_key, ent.text, token_length, ent.label_,
                     ent.start_char, ent.end_char])

# create a Table with the accumulated data
table = wandb.Table(data=data,
                    columns=["doc_id", "text", "token_len", "type", "start", "end"])
wandb.run.log({"token_full_parse": table})
```

Exploring tokenized entities

Now we can look for patterns in the entities using the Table's interactive operations: group by entity type, sort by frequency, and filter down to the types of interest.
Applying these operations yields a table like the following screenshot:
Entity types with examples and length distribution, sorted by most frequent
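The same group-and-sort logic can be sketched in plain Python over rows shaped like the logged Table. The rows below are made up for illustration, not drawn from the actual receipts:

```python
from collections import defaultdict

# Hypothetical rows in the same shape as the logged wandb.Table:
# [doc_id, text, token_len, type, start, end]
rows = [
    ["doc1", "WXYZ-TV", 7, "ORG", 0, 7],
    ["doc1", "$2,500.00", 9, "MONEY", 40, 49],
    ["doc2", "March 15, 2020", 14, "DATE", 10, 24],
    ["doc2", "$480.00", 7, "MONEY", 60, 67],
]

# group by entity type: count occurrences and average token length,
# mirroring the group-by + sort operations in the Table UI
stats = defaultdict(lambda: {"count": 0, "total_len": 0})
for _, text, token_len, ent_type, _, _ in rows:
    stats[ent_type]["count"] += 1
    stats[ent_type]["total_len"] += token_len

# (type, count, mean token length), most frequent type first
summary = sorted(
    ((t, s["count"], s["total_len"] / s["count"]) for t, s in stats.items()),
    key=lambda x: x[1],
    reverse=True,
)
```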

Interactive version of the Table

You can try these operations and explore for yourself in this interactive instance of the Table. If you refresh the page, the Table will reset.

Observations

Next steps

To improve performance without including visual layout/geometry information, we might need to add regexes or manual rules to the default NER model or finetune a custom version on this unusual dataset. Still, this baseline lets us quickly explore patterns in named entity recognition across the receipts and better scope the problem: which entity types might be easier or harder to extract, where regexes might be sufficient, and just how helpful the visual signal is likely to be :)
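For instance, regex rules for dollar amounts and flight dates might look like the sketch below. The patterns and sample text are illustrative guesses, not tuned to the real receipts:

```python
import re

# hypothetical OCR text from a receipt — not real data
text = "INVOICE 12345  Gross Amount: $2,500.00  Flight: 03/02/20 - 03/08/20"

# a simple rule for dollar amounts, as one possible regex augmentation
amount_re = re.compile(r"\$[\d,]+(?:\.\d{2})?")
# a simple rule for MM/DD/YY flight dates
date_re = re.compile(r"\d{2}/\d{2}/\d{2}\b")

amounts = amount_re.findall(text)
dates = date_re.findall(text)
```

Rules like these could run alongside the NER model, with the regex matches taking precedence for the "amount" and flight-date fields where the off-the-shelf entities are too noisy.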