Information Extraction from Scanned Receipts: Fine-tuning LayoutLM on SROIE
An OCR demo with LayoutLM fine-tuned for information extraction on receipts data.
Check out the GitHub repo →

Introduction
Many companies still process paper documents, either physically or after they've been scanned and stored in a document storage system. For many of these companies, document processing today is fully or semi-manual, with rules-based workflows combined with manual review.
Advancements in deep learning have fueled a surge in demand for intelligent document processing. The insurance, health care, finance, and government sectors are perhaps most closely associated with this type of work, but many companies interacting with these sectors also see the need to automate information extraction from scanned documents. In fact, big companies like Google, Microsoft, AWS, and IBM sell products aimed at exactly these needs.
In addition, there are many startup companies aiming to solve this problem.
Which is all to say: intelligent document processing isn't exactly a niche discipline.
Today, we're going to look at using W&B in this context. Specifically, we'll be fine-tuning a LayoutLM (Layout Language Model) on the SROIE (Scanned Receipts OCR and Information Extraction) dataset.
Let's get going:
General pipeline
The pipeline for training an information extraction model on scanned documents is similar to many other machine learning pipelines; however, certain stages receive more emphasis here than in other applications, notably preprocessing.

Machine learning pipeline diagram for information extraction from scanned documents.
- Raw data. The raw data here is images of scanned documents, typically as PDFs, JPGs, or PNGs. (Note: Although modern PDF formats have text stored in an accessible layer, many scanned or older PDF documents do not.)
- Labeled data. These are the results of human annotators marking the document images with bounding boxes, each containing the relevant fields present in the document.
- Preprocessed data. This step is considerably more intensive than in many typical machine learning training pipelines. The preprocessed data is the result of:
- Running each document through an OCR model to extract text and bounding boxes
- Matching the bounding boxes resulting from OCR to those obtained in step 2 (labeled data), to obtain a label for each pair of the form (token, [x1, y1, x2, y2]). The token is obtained from a BERT-style tokenizer.
- Train-Test split. A standard train-test split of the preprocessed data, typically done at the document level (as opposed to the token level).
- Test model. This is typical model evaluation. Particular attention should be paid to precision, recall, and F1, since the dataset is typically highly imbalanced.
(Note: For this demo, we preprocessed the documents in a slightly nonstandard way to avoid re-running OCR. SROIE provides OCR output per line, with the coordinates of a bounding box around the entire line, along with the text value of each of the four fields (but no bounding box!). In preprocessing, we take the OCR output of each line, split the text into words on whitespace, and assign each word bounding box coordinates derived by splitting the line's bounding box horizontally into equal parts, one per word. To obtain a label for each word, we check whether the word's text is contained in any of the label text fields.)
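To make that heuristic concrete, here's a minimal sketch in Python. The function names and the exact matching rule are illustrative, not this project's verbatim code:

```python
def words_with_boxes(line_box, line_text):
    """Split a line's text on whitespace and divide its bounding box
    horizontally into equal parts, one per word."""
    x1, y1, x2, y2 = line_box
    words = line_text.split()
    step = (x2 - x1) / len(words)
    return [
        (word, [round(x1 + i * step), y1, round(x1 + (i + 1) * step), y2])
        for i, word in enumerate(words)
    ]

def label_word(word, field_values):
    """Label a word with the first field whose ground-truth text contains it;
    everything else gets the background label 'other'."""
    for field, value in field_values.items():
        if word.lower() in value.lower():
            return field
    return "other"

# Example: one OCR line from a receipt, plus the receipt's field values.
pairs = words_with_boxes([38, 230, 320, 260], "TOTAL 9.00")
labels = [label_word(w, {"total": "9.00", "date": "15/03/2019"}) for w, _ in pairs]
```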
Dataset
The ML task here is to extract fields from scanned documents. The dataset used is a standard one in this domain: the SROIE (Scanned Receipts OCR and Information Extraction) dataset, consisting of 1,000 scanned receipt images, labeled with text and bounding box information, as well as values for four fields:
- total
- date
- company
- address

Example of bounding box and field content ground truth.
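For reference, the per-receipt field annotations are simple key-value pairs, roughly like this (the values here are made up for illustration):

```python
# Illustrative SROIE-style ground truth for a single receipt.
ground_truth = {
    "company": "EXAMPLE TRADING SDN BHD",
    "date": "15/03/2019",
    "address": "12 JALAN EXAMPLE, 50000 KUALA LUMPUR",
    "total": "9.00",
}
```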
Model
The model used in this demo is LayoutLM (paper, github, huggingface), a transformer-based model introduced by Microsoft that takes into account the position of text on the page. Optionally, the model can also include a visual feature representation of each word's bounding box.
The core of the model architecture is identical to BERT's; however, the preprocessing of the tokens differs slightly to accommodate position information.
LayoutLM is open source, and the weights of a pre-trained version are available (e.g. through huggingface). The pre-training tasks are the same as those of BERT: masked token prediction and next sentence prediction. Microsoft pre-trained LayoutLM on a document dataset consisting of ~6 million documents, amounting to ~11 million pages.

LayoutLM architecture.
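Since the pre-trained weights are on the Hugging Face hub, standing up a fine-tunable model takes just a few lines. Here's a minimal sketch of a single forward pass; the 9-label head (a BIO scheme over the four SROIE fields plus "other") and the toy words and boxes are assumptions for illustration, not this project's exact setup:

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=9
)

# One word per box; LayoutLM expects [x1, y1, x2, y2] normalized to 0-1000.
words = ["TOTAL", "9.00"]
boxes = [[500, 900, 560, 925], [570, 900, 620, 925]]

tokens, token_boxes = [], []
for word, box in zip(words, boxes):
    word_tokens = tokenizer.tokenize(word)  # sub-word tokens share the word's box
    tokens.extend(word_tokens)
    token_boxes.extend([box] * len(word_tokens))

input_ids = tokenizer.convert_tokens_to_ids(
    [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
)
# Conventional boxes for the special [CLS] and [SEP] tokens.
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

outputs = model(
    input_ids=torch.tensor([input_ids]),
    bbox=torch.tensor([token_boxes]),
)
pred_label_ids = outputs.logits.argmax(-1)  # one predicted label per token
```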
A few related approaches before we dig in further:
- Representation Learning for Information Extraction from Form-like Documents (Neural Scoring Model), by Google (paper, Google blog post, W&B blog post). This uses embeddings of the tokens in a local neighborhood to represent a token.
- PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks, by Ping An (paper). This uses transformers, R-CNNs, graphs, and BiLSTMs.
- BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding, by SAP (paper). A Fast-RCNN-like model architecture.
- Graph Convolution for Multimodal Information Extraction from Visually Rich Documents, by Alibaba (paper). Graph "convolution" with edge features, followed by a BiLSTM.
The reasons why LayoutLM is the BEST!
It's simple
At its core, LayoutLM has the same architecture as BERT. Much less preprocessing is needed relative to other models. No need to build a graph or local neighborhood, or use multiple model architectures or steps.
It's pre-trained
The structure of the model, and the way it consumes data, allow for a pre-training task that jointly learns text and position information. BERT-style pre-training has proven immensely successful in other domains, especially as a basis for fine-tuning on a particular task.
It performs
Check out LayoutLM's performance vs. other options:
- LayoutLM: .95 avg F1
- Neural Scoring Model: .83 avg F1
- PICK: .96 avg F1 (much more complicated model for very little uplift)
- (BERTgrid and the Graph Convolution models were not evaluated on the SROIE dataset.)
Metrics per field
The pre-trained LayoutLM model was fine-tuned on SROIE for 100 epochs. The total loss was logged each epoch, and metrics were calculated and logged every 2 epochs. The metrics calculated per field were precision, recall, and F1 score. As a refresher:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
(where $TP$, $FP$, and $FN$ stand for true positives, false positives, and false negatives, respectively.)
This grid of plots was straightforward to put together in W&B, and it's genuinely useful for debugging model training.
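For a sense of how such a grid comes together, here's a hedged sketch of per-field metric computation and logging; the project name, metric keys, and toy labels are illustrative assumptions, not this project's exact code:

```python
import wandb

FIELDS = ["total", "date", "company", "address"]

def per_field_metrics(true_labels, pred_labels):
    """Token-level precision, recall, and F1 for each field label."""
    metrics = {}
    for field in FIELDS:
        tp = sum(t == field and p == field for t, p in zip(true_labels, pred_labels))
        fp = sum(t != field and p == field for t, p in zip(true_labels, pred_labels))
        fn = sum(t == field and p != field for t, p in zip(true_labels, pred_labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[f"{field}/precision"] = precision
        metrics[f"{field}/recall"] = recall
        metrics[f"{field}/f1"] = f1
    return metrics

run = wandb.init(project="layoutlm-sroie")  # hypothetical project name
# Toy token labels standing in for a real evaluation pass.
true_labels = ["total", "other", "date", "other"]
pred_labels = ["total", "date", "date", "other"]
wandb.log(per_field_metrics(true_labels, pred_labels), step=0)
```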
Annotated images
This table is populated with images of the SROIE receipts, overlaid with bounding boxes colored by field class. Generally, this type of tool is extremely useful for understanding collections of documents: metadata like document class, tags, and OCR information can be attached in other columns, then grouped and filtered on for inspection.
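A table like this can be logged with W&B's bounding-box overlay for images. Below is a minimal sketch; the file name, coordinates, and project name are made up for illustration:

```python
import wandb
from PIL import Image

CLASS_LABELS = {0: "total", 1: "date", 2: "company", 3: "address"}

run = wandb.init(project="layoutlm-sroie")  # hypothetical project name
table = wandb.Table(columns=["file", "receipt"])

# One annotated receipt; coordinates are pixel-space and made up.
boxes = [
    {"position": {"minX": 120, "maxX": 260, "minY": 880, "maxY": 910},
     "domain": "pixel", "class_id": 0, "box_caption": "total"},
]
img = wandb.Image(
    Image.open("receipt_000.jpg"),  # illustrative path
    boxes={"ground_truth": {"box_data": boxes, "class_labels": CLASS_LABELS}},
)
table.add_data("receipt_000.jpg", img)
run.log({"annotations": table})
```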
Conclusion
While a bit less flashy than the latest GAN application, document extraction and OCR generally remain vital machine learning projects for companies of all sizes. We saw in the intro how many players there are in the space, from the Googles of the world to smaller, sprier startups.
We hope this walkthrough gives you a few ideas for how you could incorporate W&B into your document extraction workflows. If you'd like to see any additional experiments with LayoutLM or other OCR reports, please let us know!
But there's really only one way to sign off here:
