Named Entity Recognition with W&B and spaCy
Visualize and explore named entities with spaCy in W&B
Named Entity Recognition (NER) is a common task in information extraction: given some text, such as a news article, can we identify certain words or phrases (tokens) as the important proper nouns, locations, dates, etc.?
NER classifies tokens into specific predetermined entity types: people's names ("John", "Mr. Smith", "Dr. Jane Smith"), organization names ("New York Times", "the Sinclair Broadcast Group", "ABC"), times ("7:30am", "23:00", "tonight"), and more. In this article, I use the DeepForm dataset of political ad receipts to illustrate NER using spaCy in W&B.
Table of Contents
- Visualizing Named Entities with spaCy
- Sample Receipts for TV Ads
- Find the Cost, Buyer, and Dates on a Receipt
- Note: First OCR the PDF Image To Extract Raw Text
- Log NER Results to a W&B Table
- Sample code to produce this table
- Exploring Tokenized Entities
- Observations
- Next steps
Visualizing Named Entities with spaCy
To view the raw text with spaCy annotations for NER, log a W&B spaCy plot directly to a wandb.Table: use wandb.plots.NER(docs=document), where document is the spaCy-parsed result of the raw text. Below is a snippet of an annotated document and a Table with 5 samples. Hover over a "spacy_plot" row below and click on the gray box in the top right corner to scroll through a full-screen view.
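Here is a minimal sketch of that logging pattern. load_docs() is a hypothetical helper returning a dict of document ids to OCR'd text, and wandb.plots.NER comes from the older wandb.plots module (since superseded by wandb.plot), so its exact return and logging behavior may vary by version:

import wandb
import spacy

wandb.init(project="ner_spacy")
nlp = spacy.load("en_core_web_sm")
docs = load_docs()  # hypothetical helper: {doc_id: raw OCR'd text}

rows = []
for doc_id, raw_text in docs.items():
    parsed = nlp(raw_text)  # spaCy-parsed document with .ents populated
    # render the entity visualization for this document
    rows.append([doc_id, wandb.plots.NER(docs=parsed)])

table = wandb.Table(data=rows, columns=["doc_id", "spacy_plot"])
wandb.run.log({"ner_plot_table": table})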
These sample documents are the unstructured streams of text extracted from the scanned receipts using optical character recognition. You'll notice very high recall and, unfortunately, many irrelevant false positives. See below for samples of the original receipts, which are more coherent and diverse when the visual/geometric information is included (instead of just the raw characters).

Almost any of these words could be an entity.
Sample 2020 docs, N=5
Sample Receipts for TV Ads
These examples from 2020 show the diversity of receipt formats and the level of noise in the ground truth data. Click on the box in the top right of each image to "View full screen" so you can see the details and the correct answer tokens annotated in orange.
Ground truth from 2020
Find the Cost, Buyer, and Dates on a Receipt
The concrete task in DeepForm is to extract certain meaningful fields from PDF receipts. Each receipt corresponds to a political television advertisement aired around a US election. We'd like to know which organization ("advertiser") paid how much ("amount") for which ad ("contractid") during which time period ("flight_from" and "flight_to" for the start and end of the ad's airtime). Several of the entity types automatically extracted by the spaCy library are relevant (see the short snippet after this list for a quick way to inspect them):
- PERSON for advertisers like "Michael Bloomberg"
- ORG for advertisers like "Future Forward USA"
- MONEY for the total amounts paid
- CARDINAL for contract ids
- EVENT/TIME/DATE for the air dates
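To see how spaCy assigns these labels in practice, here is a short, self-contained check on a made-up receipt line (the sentence is illustrative, not from the dataset); spacy.explain expands each label name:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Future Forward USA paid $525.00 for spots airing 04/07/20 through 04/13/20.")
for ent in doc.ents:
    # print each detected entity with its label and a human-readable description
    print(ent.text, ent.label_, spacy.explain(ent.label_))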
Below, I pull in the full OCR-ed text from a few sample PDF receipts from 2020 and show how W&B Tables can help us understand the data and develop a simple baseline model with spaCy's NER functionality.
Note: First OCR the PDF Image To Extract Raw Text
As a first pass, I use spaCy's NER model to extract entities from the raw text—just the strings detected in the image of the receipt using pdfplumber's optical character recognition (OCR). Focusing on the parsed characters and not the image pixels throws away important signals—the visual layout of the receipt, vertical proximity across lines (e.g. for the address), and the use of bold font for emphasis—but it lets us quickly evaluate performance with an off-the-shelf model. Perhaps the text alone is sufficient.
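For reference, here is a minimal sketch of pulling the raw text out of one receipt with pdfplumber (the filename is a hypothetical placeholder):

import pdfplumber

# hypothetical path to one scanned receipt from the 2020 sample
with pdfplumber.open("receipt_2020.pdf") as pdf:
    # concatenate the text extracted from every page into one raw string
    raw_text = "\n".join(page.extract_text() or "" for page in pdf.pages)
print(raw_text[:500])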
Log NER Results to a W&B Table
I log each extracted entity to a wandb.Table: the predicted entity type, the corresponding text tokens, the text length, and the start and end positions of the entity in the document. This produces a Table like the following:
Sample 2020 docs, N=5
Sample code to produce this table
import wandb
import spacy

wandb.init(project="ner_spacy")
nlp = spacy.load("en_core_web_sm")
docs = load_docs()

data = []
for d_key, d in docs.items():
    # extract entities using spaCy
    parsed = nlp(d)
    for ent in parsed.ents:
        token_length = int(ent.end_char) - int(ent.start_char)
        data.append([d_key, ent.text, token_length, ent.label_,
                     ent.start_char, ent.end_char])

# create a Table with the accumulated data
table = wandb.Table(data=data,
                    columns=["doc_id", "text", "token_len", "type", "start", "end"])
wandb.run.log({"token_full_parse": table})
Exploring Tokenized Entities
Now we can look for patterns in the entities:
- Group by the "type" column: this aggregates all the examples of the same predicted entity type, so you can scroll through many examples quickly in the "text" column.
- Sort by most frequent type: insert a column to the right of "text" and append "count" to see the number of times each entity type appears. Now you can sort by this column to find the most frequent entity type (in this case, money :)
- Average the token length: insert a column and add "avg" of token_len to see the average length of the entities by type.
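For reference, the same grouping, counting, and averaging can be reproduced offline with pandas, assuming the data list accumulated in the sample code above:

import pandas as pd

# df mirrors the logged Table schema
df = pd.DataFrame(data, columns=["doc_id", "text", "token_len", "type", "start", "end"])

# group by entity type, count occurrences, and average the token length
summary = (df.groupby("type")
             .agg(count=("text", "size"), avg_token_len=("token_len", "mean"))
             .sort_values("count", ascending=False))
print(summary)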
Applying these operations in the Table UI yields a view like the following screenshot:

Entity types with examples and length distribution, sorted by most frequent
Interactive version of the Table
You can try these operations and explore the data yourself in this interactive instance of the Table. If you refresh the page, the Table will reset.
Sample 2020 docs, N=5
Observations
- "Money" is the most frequent entity type and has high recall, but it has a long tail of false positives: phone numbers and extraneous tokens, like the string "30 1 $525.00 NM" instead of simply "$525.00"
- "Cardinal" is even noisier, including plenty of dates with slashes and some prices.
- There is some understandable confusion between "Organization" and "Person", but "Organization" is slightly longer on average.
- "Dates" don't seem to accept the most frequent format of "MM/DD/YY"—most of these fall under the "Cardinal" type instead.
Next steps
To improve performance without including visual layout/geometry information, we might need to add regexes or manual rules to the default NER model or fine-tune a custom version on this unusual dataset. Still, this baseline lets us quickly explore patterns in named entity recognition across the receipts and better scope the problem: which entity types might be easier or harder to extract, where regexes might be sufficient, and just how helpful the visual signal is likely to be :)
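As one possible sketch of the rule-based route (assuming spaCy v3's entity_ruler component; in v2 the ruler is constructed as EntityRuler(nlp) and added with nlp.add_pipe(ruler, before="ner")), we could catch the MM/DD/YY dates the default model routes to CARDINAL:

import spacy

nlp = spacy.load("en_core_web_sm")

# insert a rule-based pass before the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    # label any MM/DD/YY-style token as a DATE (assumes the tokenizer keeps it whole)
    {"label": "DATE", "pattern": [{"TEXT": {"REGEX": r"^\d{1,2}/\d{1,2}/\d{2,4}$"}}]},
])

doc = nlp("Flight 04/07/20 through 04/13/20, total $525.00")
print([(ent.text, ent.label_) for ent in doc.ents])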