
Fine-tuning LayoutLM on SROIE: Information Extraction from Scanned Receipts (Internal)

LayoutLM fine-tuned for information extraction on receipts data.
Created on October 29 | Last edited on November 19

Introduction

This demo fine-tunes LayoutLM (Layout Language Model) on the SROIE (Scanned Receipts OCR and Information Extraction) dataset.
Github repo for this demo.

Problem setup

Many companies still process paper documents either physically or after being scanned as an image and stored in a document storage system. The current state of processing these documents for many companies is either completely or semi-manual, with rules-based workflows combined with manual processing.
The combination of easier access to deep learning and its strong performance on many tasks is giving rise to a surge in demand for intelligent document processing. In particular, companies in domains such as
  • Insurance
  • Health care
  • Finance
  • Government
as well as many companies interacting with these sectors, are seeing the need to automate information extraction from scanned documents. Big companies like Google, Microsoft, AWS, and IBM are selling products to try to meet these needs.
Microsoft is making a particularly large effort in this domain (Microsoft Document AI).
In addition, there are many startup companies aiming to solve this problem.
There are also conferences, workshops, and competitions around document intelligence.

General pipeline

The pipeline for training an information extraction model on scanned documents is similar to many other machine learning pipelines; however, certain stages receive more emphasis than in a typical pipeline.
Machine learning pipeline diagram for information extraction from scanned documents.
  1. Raw data. The raw data here is images of scanned documents, typically in PDF, JPG, or PNG format. Note: although modern PDF files can store text in an accessible layer, PDFs produced by many scanners do not have this layer.
  2. Labeled data. These are the results of human annotators marking the document images with bounding boxes, each containing the relevant fields present in the document.
  3. Preprocessed data. This step is more intensive than in many typical machine learning training pipelines. The preprocessed data is the result of
    1. Running each document through an OCR model to extract text and bounding boxes
    2. Matching the bounding boxes resulting from OCR to those obtained in step 2 (Labeled data), to obtain a label for each pair of the form (token, [x1, y1, x2, y2]). The token is obtained from a BERT-style tokenizer.
  4. Train-Test split. This is a typical train-test split of the preprocessed data, typically done at the document level (as opposed to the token level).
  5. Test model. This is typical model evaluation. Particular attention should be paid to precision, recall, and F1, since the dataset is typically highly imbalanced.
Note: For this demo, the documents were preprocessed in a slightly nonstandard way, in order to avoid running OCR on the documents again. SROIE gives the OCR output per line, with the coordinates of a bounding box that surrounds the entire line. Also given is the text value of each of the four fields (but no bounding box!). In the preprocessing for this project, the OCR output of each line is taken, the text is split into words on whitespace, and each word is assigned bounding box coordinates derived by splitting the line's bounding box horizontally into equal parts, one part per word. To obtain a label for each word, it is checked whether the word's text is contained in any of the label text fields. A minimal sketch of this heuristic is shown below.
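The sketch below is illustrative only; the function names and the fallback "other" class are assumptions rather than the exact code from the repo:

```python
def split_line_into_word_boxes(line_text, line_box):
    """Split one OCR line into words, assigning each word an equal horizontal
    slice of the line's bounding box (a heuristic, not per-word OCR)."""
    x1, y1, x2, y2 = line_box
    words = line_text.split()
    if not words:
        return []
    step = (x2 - x1) / len(words)
    return [
        (word, [round(x1 + i * step), y1, round(x1 + (i + 1) * step), y2])
        for i, word in enumerate(words)
    ]


def label_word(word, field_values):
    """Return the first field whose ground-truth text contains this word,
    or 'other' if none do. field_values maps field name -> labeled text."""
    for field, value in field_values.items():
        if word.lower() in value.lower():
            return field
    return "other"
```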

Dataset

The ML task here is to extract fields from scanned documents. The dataset used here is a standard one in this domain: the SROIE (Scanned Receipts OCR and Information Extraction) dataset, consisting of 1000 scanned receipt images, labeled with text and bounding box information, as well as field values for four fields:
  • total
  • date
  • company
  • address
This dataset was featured as a competition task at ICDAR 2019 (15th International Conference on Document Analysis and Recognition).
Sample images from SROIE dataset. Source: https://arxiv.org/pdf/2103.10213.pdf

Example of bounding box and field content ground truth.
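For reference, each SROIE receipt comes with two ground truth files: an OCR annotation file where each line holds eight quadrilateral box coordinates followed by the transcript, and a JSON key file with the four field values. A minimal parsing sketch under that assumption (function names are illustrative):

```python
import json

def parse_sroie_ocr_file(path):
    """Parse one SROIE OCR annotation file: each line is
    x1,y1,x2,y2,x3,y3,x4,y4,text (the text itself may contain commas)."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue
            parts = raw.split(",", 8)              # keep commas inside the text
            coords = list(map(int, parts[:8]))
            xs, ys = coords[0::2], coords[1::2]
            # Reduce the quadrilateral to an axis-aligned [x1, y1, x2, y2] box.
            entries.append({"text": parts[8],
                            "box": [min(xs), min(ys), max(xs), max(ys)]})
    return entries

def parse_sroie_key_file(path):
    """Parse the field ground truth: a JSON object with the keys
    'company', 'date', 'address', and 'total' (text only, no boxes)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```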

Model

The model used in this demo is LayoutLM (paper, github, huggingface), a transformer-based model introduced by Microsoft that takes into account the position of text on the page. Optionally, the model can also incorporate a visual feature representation of each word's bounding box.
The core of the model architecture is identical to BERT; however, the preprocessing of the tokens is slightly different to accommodate position information.
LayoutLM is open source, and the weights of a pretrained version have been made available (e.g. through huggingface). The pretraining tasks are BERT-style: masked token prediction (here carried out with the 2-D position information retained), along with a multi-label document classification task. Microsoft pre-trained LayoutLM on a document dataset consisting of ~6 million documents, amounting to ~11 million scanned pages.
LayoutLM architecture.
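A pretrained LayoutLM checkpoint can be loaded through the huggingface transformers library. The sketch below is a minimal illustration of a forward pass for token classification, assuming words and boxes come from the preprocessing described earlier and have already been normalized to the 0-1000 scale LayoutLM expects; the label set (four fields plus "other") is an assumption:

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

labels = ["other", "company", "date", "address", "total"]

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=len(labels)
)

# One preprocessed example: words with boxes already normalized to 0-1000.
words = ["TOTAL", "9.50"]
boxes = [[100, 800, 250, 830], [260, 800, 340, 830]]

# Each subword token gets the box of the word it came from, plus
# conventional boxes for the [CLS] and [SEP] special tokens.
token_boxes = [[0, 0, 0, 0]]
for word, box in zip(words, boxes):
    token_boxes.extend([box] * len(tokenizer.tokenize(word)))
token_boxes.append([1000, 1000, 1000, 1000])

encoding = tokenizer(" ".join(words), return_tensors="pt")
outputs = model(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    token_type_ids=encoding["token_type_ids"],
    bbox=torch.tensor([token_boxes]),
)
predicted = [labels[i] for i in outputs.logits.argmax(-1).squeeze().tolist()]
```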
Several other recent approaches to this task, for comparison:

Representation Learning for Information Extraction from Form-like Documents (Neural Scoring Model), by Google (paper, Google blog post, W&B blog post). This uses embeddings of the tokens in a local neighborhood to represent a token.

PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks, by Ping An (paper). This uses transformers, R-CNNs, graphs, and BiLSTMs.

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding, by SAP (paper). Fast-RCNN-like model architecture.


Graph Convolution for Multimodal Information Extraction from Visually Rich Documents, by Alibaba (paper). Graph "convolution" with edge features, followed by BiLSTM.



Why LayoutLM is the BEST!

  • Simple. At its core, LayoutLM has the same architecture as BERT. Much less preprocessing needed relative to other models. No need to build a graph or local neighborhood, or use multiple model architectures or steps.
  • Pretrained. Due to the structure of the model, as well as how the data is consumed by the model, a pretraining task that jointly learns text and position information can be employed. These pretraining tasks in the style of BERT have been shown to be immensely successful in other domains, especially for the use case of fine tuning on a particular task.
  • Performant.
    • Neural Scoring Model: 0.83 avg F1
    • LayoutLM: 0.95 avg F1
    • PICK: 0.96 avg F1 (much more complicated model for very little uplift)
    • BERTgrid and the Graph Convolution model were not evaluated on the SROIE dataset.

Metrics per field

The pre-trained LayoutLM model was fine-tuned on SROIE for 100 epochs. The total loss was logged every epoch, and metrics were calculated and logged every 2 epochs. The metrics calculated per field were precision, recall, and F1 score, which have the following formulas:
$$\text{precision} = \frac{tp}{tp + fp}$$

$$\text{recall} = \frac{tp}{tp + fn}$$

$$F1 = \frac{tp}{tp + 0.5\,(fp + fn)}$$

where $tp$, $fp$, and $fn$ stand for true positives, false positives, and false negatives respectively.
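For illustration, these per-field metrics can be computed directly from flat lists of per-token labels and logged to W&B every other epoch. A minimal sketch, assuming string labels per token and an "other" background class (names are illustrative, not the exact training code):

```python
import wandb

FIELDS = ["company", "date", "address", "total"]

def per_field_metrics(true_labels, pred_labels):
    """Compute precision, recall, and F1 for each field from flat lists of
    per-token string labels (e.g. 'total', 'date', or 'other')."""
    metrics = {}
    for field in FIELDS:
        tp = sum(t == field and p == field for t, p in zip(true_labels, pred_labels))
        fp = sum(t != field and p == field for t, p in zip(true_labels, pred_labels))
        fn = sum(t == field and p != field for t, p in zip(true_labels, pred_labels))
        metrics[f"{field}/precision"] = tp / (tp + fp) if tp + fp else 0.0
        metrics[f"{field}/recall"] = tp / (tp + fn) if tp + fn else 0.0
        metrics[f"{field}/f1"] = tp / (tp + 0.5 * (fp + fn)) if tp + fp + fn else 0.0
    return metrics

# Inside the training loop, logging metrics every 2 epochs:
# if epoch % 2 == 0:
#     wandb.log(per_field_metrics(y_true, y_pred), step=epoch)
```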
This grid of plots was extremely easy to put together using W&B, and it was very useful for debugging model training.

[W&B panel grid: per-field precision, recall, and F1 over training (Run set 2)]


Annotated images

This table is populated with images of the SROIE receipts, overlaid with bounding boxes colored by field class. In general, this type of tool can be extremely useful for understanding collections of documents: metadata like document class, tags, and OCR information can be attached in other columns and then grouped and filtered on for inspection.
Note: Image annotations seem to be off. Could be related to the nonstandard method used to parse OCR results. Further investigation needed.
TODO:
  • Debug bounding boxes
  • Add bounding boxes via W&B Image property (see the sketch after this list)
  • Add OCR field
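One possible way to add the bounding boxes via the W&B Image property (the second TODO above) is to wrap each receipt image in wandb.Image with box_data in pixel coordinates, keyed by field class. This is a sketch with illustrative names and coordinates, not the exact code used here:

```python
import wandb

class_labels = {0: "other", 1: "company", 2: "date", 3: "address", 4: "total"}

def annotated_image(path, word_boxes):
    """Wrap a receipt image with per-word boxes colored by field class.
    word_boxes: list of ([x1, y1, x2, y2], class_id) in pixel coordinates."""
    box_data = [
        {
            "position": {"minX": x1, "minY": y1, "maxX": x2, "maxY": y2},
            "domain": "pixel",                      # coordinates are in pixels
            "class_id": class_id,
            "box_caption": class_labels[class_id],
        }
        for (x1, y1, x2, y2), class_id in word_boxes
    ]
    return wandb.Image(path, boxes={"ground_truth": {"box_data": box_data,
                                                     "class_labels": class_labels}})

# Building the table of annotated receipts:
# table = wandb.Table(columns=["receipt_id", "image"])
# table.add_data("receipt_001", annotated_image("receipt_001.jpg", word_boxes))
# wandb.log({"annotated_receipts": table})
```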

[W&B table panel: SROIE receipts with field bounding box overlays (Run set 2)]



Takeaways

  • Good
    • Getting up and running was easy, visualizations were smooth.
    • Visualizations are nice/high quality, UI makes sense.
    • Tables can be extremely useful for easy document exploration.
    • Reports would be extremely helpful for collaboration and sharing.
  • Bad
    • Sometimes uploading a dataset artifact was pretty slow.
    • Occasionally I deleted a dataset artifact that had the latest tag, and subsequently tried to pull the latest dataset, and got an error. Would be nice if the latest tag automatically updated.
    • Tables seem to duplicate data: I am storing data as a dataset artifact, and then storing a slightly modified version of that data (e.g. with bounding boxes added) again as a Table artifact.
    • Documentation seems lacking or sparse.
  • Missing
    • Deployment product! This is a common, real-world problem, and a model like this is only of use as part of a production system.
    • Human in the loop/model prediction correction. This can speed up development in a deploy-first, improve-later paradigm.