Information Extraction from Scanned Receipts: Fine-tuning LayoutLM on SROIE

An OCR demo with LayoutLM fine-tuned for information extraction on receipts data. Made by Eric Bunch using Weights & Biases

Check out the Github repo →

Introduction

Many companies still process paper documents, either physically or after they've been scanned as images and stored in a document storage system. For most of these companies, that processing is today either fully or partly manual, with rules-based workflows combined with manual review.
Advancements in deep learning have fueled a surge in demand for intelligent document processing. The insurance, health care, finance, and government sectors are perhaps most closely associated with this type of work, but many companies interacting with these sectors also need to automate information extraction from scanned documents. In fact, big companies like Google, Microsoft, AWS, and IBM sell products aimed at meeting these needs.
Microsoft is making a particularly large effort in this domain (Microsoft Document AI).
In addition, there are many startup companies aiming to solve this problem.
Which is all to say: intelligent document processing isn't exactly a niche discipline.
Today, we're going to look at using W&B in this context. Specifically, we'll be fine-tuning a LayoutLM (Layout Language Model) on the SROIE (Scanned Receipts OCR and Information Extraction) dataset.
Let's get going:

General pipeline

The pipeline for training an information extraction model on scanned documents is similar to many other machine learning pipelines; however, more emphasis is placed on certain stages, notably preprocessing, than in other applications.
Machine learning pipeline diagram for information extraction from scanned documents.
  1. Raw data. The raw data here is images of scanned documents, typically as PDFs, JPGs, or PNGs. (Note: Although modern PDF formats have text stored in an accessible layer, many scanned or older PDF documents do not.)
  2. Labeled data. These are the results of human annotators marking the document images with bounding boxes, each containing the relevant fields present in the document.
  3. Preprocessed data. This step is considerably more involved than in many typical machine learning training pipelines. The preprocessed data is the result of:
    1. Running each document through an OCR model to extract text and bounding boxes
    2. Matching the bounding boxes resulting from OCR to those obtained in step 2 (labeled data), to obtain a label for each pair of the form (token, [x1, y1, x2, y2]). The token is obtained from a BERT-style tokenizer.
  4. Train-Test split. This is a typical train-test split of the preprocessed data, typically done at the document level (as opposed to the token level).
  5. Test model. This is typical model evaluation. Particular attention should be paid to precision, recall, and F1, since the dataset is typically highly imbalanced.
(Note: For this demo, we have preprocessed the documents in a slightly nonstandard way in order to avoid re-running OCR on them. SROIE gives the OCR output per line, with the coordinates of a bounding box around the entire line; it also gives the text value of each of the four fields (but no bounding box for them!). In the preprocessing for this project, the text of each OCR line is split into words on whitespace, and each word is assigned bounding box coordinates by splitting the line's bounding box horizontally into equal parts, one per word. To obtain a label for each word, we check whether or not the word's text is contained in any of the labeled field values.)
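To make that heuristic concrete, here's a minimal sketch of the word-splitting and labeling logic; the function names and the sample line are illustrative, not taken from the repo:

```python
def split_line_box(line_text, line_box):
    """Assign an approximate bounding box to each word in an OCR line.

    line_box is [x1, y1, x2, y2] for the whole line; each word gets an equal
    horizontal slice of that box, per the heuristic described above.
    """
    words = line_text.split()
    x1, y1, x2, y2 = line_box
    width = (x2 - x1) / len(words)
    word_boxes = []
    for i, word in enumerate(words):
        box = [x1 + i * width, y1, x1 + (i + 1) * width, y2]
        word_boxes.append((word, [int(round(c)) for c in box]))
    return word_boxes


def label_word(word, field_values):
    """Label a word with the first field whose ground-truth text contains it."""
    for field, value in field_values.items():
        if word.upper() in value.upper():
            return field
    return "O"  # not part of any field


# One OCR line from a receipt, plus the labeled field values for that receipt
boxes = split_line_box("TOTAL 9.00", [50, 300, 250, 320])
labels = [label_word(word, {"total": "9.00"}) for word, _ in boxes]
print(boxes)   # [('TOTAL', [50, 300, 150, 320]), ('9.00', [150, 300, 250, 320])]
print(labels)  # ['O', 'total']
```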

Dataset

The ML task here is to extract fields from scanned documents. The dataset used here is a standard one in this domain: the SROIE dataset (Scanned Receipts OCR and Information Extraction), consisting of 1000 scanned receipt images, labeled with text and bounding box information, as well as ground-truth values for four fields: company, date, address, and total.
Sample images from SROIE dataset (source)
Example of bounding box and field content ground truth.
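Concretely, each receipt comes with a line-level OCR file (four corner points per line, followed by that line's text) and a small file of ground-truth field values. The snippet below sketches what those look like; the values shown are illustrative rather than copied from the dataset:

```python
import json

# Illustrative contents of one line-level OCR file: each line lists the four
# corner points of the text box (8 comma-separated coordinates) and then the text.
ocr_lines = """72,25,326,25,326,64,72,64,TAN WOON YANN
50,82,440,82,440,121,50,121,BOOK TA .K (TAMAN DAYA) SDN BHD""".splitlines()

# Illustrative contents of the matching field ground-truth file.
field_values = json.loads("""
{
    "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
    "date": "25/12/2018",
    "address": "NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
    "total": "9.00"
}
""")

# Parse one OCR line into a box and its text (the text itself may contain commas).
parts = ocr_lines[0].split(",", 8)
line_box, line_text = [int(c) for c in parts[:8]], parts[8]
```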

Model

The model used in this demo is LayoutLM (paper, github, huggingface), a transformer-based model introduced by Microsoft that takes into account the position of text on the page. Optionally, the model can also include a visual feature representation of each word's bounding box.
The core of the model architecture is identical to BERT; however, the preprocessing of the tokens is slightly different to accommodate position information.
LayoutLM is open source, and the weights of a pretrained version are available (e.g. through huggingface). The pretraining task is in the style of BERT's masked token prediction: a masked visual-language modeling objective in which the position information of masked tokens is kept. Microsoft pre-trained LayoutLM on a document dataset consisting of ~6 million documents, amounting to ~11 million pages.
LayoutLM architecture.
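As a quick sketch of that preprocessing (assuming the Hugging Face transformers implementation, with word-level boxes already extracted as above; the helper name is illustrative), each word's box is normalized to a 0-1000 grid and repeated for every wordpiece the word is split into:

```python
from transformers import LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")


def encode_words(words, word_boxes, image_width, image_height):
    """Tokenize words and build the bbox input LayoutLM expects.

    Each box is normalized to a 0-1000 grid and duplicated for every wordpiece
    the tokenizer produces, so token and box sequences stay aligned.
    """
    input_ids, bboxes = [], []
    for word, (x1, y1, x2, y2) in zip(words, word_boxes):
        normalized = [
            int(1000 * x1 / image_width),
            int(1000 * y1 / image_height),
            int(1000 * x2 / image_width),
            int(1000 * y2 / image_height),
        ]
        word_ids = tokenizer.encode(word, add_special_tokens=False)
        input_ids.extend(word_ids)
        bboxes.extend([normalized] * len(word_ids))
    return input_ids, bboxes


ids, boxes = encode_words(
    ["TOTAL", "9.00"], [(50, 300, 150, 320), (150, 300, 250, 320)],
    image_width=463, image_height=1013,
)
```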

A few related approaches before we dig in further:

Representation Learning for Information Extraction from Form-like Documents (Neural Scoring Model), by Google (paper, Google blog post, W&B blog post). This uses embeddings of the tokens in a local neighborhood to represent a token.
PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks, by Ping An (paper). This uses transformers, R-CNNs, graphs, and BiLSTMs.
BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding, by SAP (paper). Fast-RCNN-like model architecture.
Graph Convolution for Multimodal Information Extraction from Visually Rich Documents, by Alibaba (paper). Graph "convolution" with edge features, followed by BiLSTM.

The reasons why LayoutLM is the BEST!

It's simple

At its core, LayoutLM has the same architecture as BERT. Much less preprocessing is needed relative to other models. No need to build a graph or local neighborhood, or use multiple model architectures or steps.
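In code (again assuming the Hugging Face transformers implementation; the label set and tensors below are illustrative), fine-tuning amounts to a standard token-classification setup with one extra bbox input:

```python
import torch
from transformers import LayoutLMForTokenClassification

# One B-/I-/O-style tag per field plus "O"; the exact label scheme is up to you.
labels = ["O", "B-COMPANY", "I-COMPANY", "B-DATE", "B-ADDRESS", "I-ADDRESS", "B-TOTAL"]
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=len(labels)
)

# Dummy batch: illustrative token ids ([CLS], one word piece, [SEP]) and one
# normalized (0-1000) box per token; [CLS]/[SEP] conventionally get corner boxes.
input_ids = torch.tensor([[101, 3231, 102]])
bbox = torch.tensor([[[0, 0, 0, 0], [108, 296, 324, 316], [1000, 1000, 1000, 1000]]])
attention_mask = torch.ones_like(input_ids)
token_labels = torch.tensor([[-100, 6, -100]])  # -100 is ignored by the loss; 6 = "B-TOTAL"

outputs = model(
    input_ids=input_ids,
    bbox=bbox,
    attention_mask=attention_mask,
    labels=token_labels,
)
print(outputs.loss, outputs.logits.shape)  # scalar loss, (1, 3, num_labels)
```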

It's pre-trained

Due to the structure of the model, as well as how the data is consumed by the model, a pre-training task that jointly learns text and position information can be employed. Pre-training in the style of BERT has been shown to be immensely successful in other domains, especially as a starting point for fine-tuning on a particular task.

It performs

Check out LayoutLM's performance vs. other options:

Metrics per field

The pre-trained LayoutLM model was fine-tuned on SROIE for 100 epochs. The total loss was logged each epoch, and metrics were calculated and logged every 2 epochs. The metrics calculated per field were precision, recall, and F1 score. As a refresher:
precision = \frac{tp}{tp + fp}
recall = \frac{tp}{tp + fn}
F1 = \frac{tp}{tp + 0.5(fp + fn)}
(where tp, fp, fn stand for true positive, false positive, and false negative respectively.)
This grid of plots was extremely easy to put together using W&B, and it's very useful for debugging model training.
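For reference, here's a minimal sketch of how per-field metrics like these can be computed and logged during evaluation, assuming IOB-style tags and the seqeval library (the project name and tag scheme are illustrative):

```python
import wandb
from seqeval.metrics import classification_report

wandb.init(project="layoutlm-sroie")  # illustrative project name


def log_field_metrics(true_tags, pred_tags, epoch):
    """true_tags / pred_tags: lists of per-document tag sequences,
    e.g. [["O", "B-TOTAL", ...], ...]. Logs precision/recall/F1 per field."""
    report = classification_report(true_tags, pred_tags, output_dict=True)
    metrics = {"epoch": epoch}
    for field, scores in report.items():
        if field in ("micro avg", "macro avg", "weighted avg"):
            continue  # keep only the per-field rows
        metrics[f"{field}/precision"] = scores["precision"]
        metrics[f"{field}/recall"] = scores["recall"]
        metrics[f"{field}/f1"] = scores["f1-score"]
    wandb.log(metrics)
```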

Annotated images

This table is populated with images of the SROIE receipts, overlaid with bounding boxes colored by field class. Generally, this type of tool can be extremely useful for understanding collections of documents: metadata like document class, tags, and OCR information can be attached in other columns, then grouped and filtered on for inspection.
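Here's a sketch of how such a table can be built with the W&B SDK, overlaying per-word boxes on each receipt image; the class labels, file name, and pixel-coordinate assumption are illustrative:

```python
import wandb
from PIL import Image

class_labels = {0: "other", 1: "company", 2: "date", 3: "address", 4: "total"}


def annotated_image(image_path, word_boxes, word_classes):
    """word_boxes: [x1, y1, x2, y2] in pixels; word_classes: one class id per box."""
    box_data = [
        {
            "position": {"minX": x1, "minY": y1, "maxX": x2, "maxY": y2},
            "class_id": cls,
            "domain": "pixel",
        }
        for (x1, y1, x2, y2), cls in zip(word_boxes, word_classes)
    ]
    return wandb.Image(
        Image.open(image_path),
        boxes={"fields": {"box_data": box_data, "class_labels": class_labels}},
    )


table = wandb.Table(columns=["receipt", "annotated image"])
# table.add_data("receipt_001", annotated_image("receipt_001.jpg", boxes, classes))
# wandb.log({"annotated_receipts": table})
```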

Conclusion

While a bit less flashy than the latest GAN application, document extraction and OCR generally remain vital machine learning projects for companies of all sizes. We saw in the intro how many players there are in the space, from the Googles of the world to smaller, sprier startups.
We hope this walkthrough gives you a few ideas for how you could incorporate W&B into your document extraction workflows. If you'd like to see any additional experiments with LayoutLM or any other OCR reports, please let us know!
But there's really only one way to sign off here: