Information Extraction From Documents

Extract information from templatic documents like invoices, receipts, loan documents, bills, and purchase orders.
Tulasi Ram Laghumavarapu

Image Source: Nanonets blog


In this report, we will discuss how to extract information from structured or unstructured documents. Specifically, we will look at the paper Representation Learning for Information Extraction from Form-like Documents by Google, which was accepted at ACL 2020.

Imagine adding all the contact details on a business card (contact number, email ID, address, etc.) directly by scanning the card. Interesting, isn't it? A small feature like this saves a lot of time.

Extracting information from documents by hand is cumbersome for humans and, of course, expensive.

Let us discuss some of the deep learning approaches to information extraction.

Check out the code on GitHub →

Note: The code implementation is not working as expected, but it should be a good starting point for implementing this paper.

Various Approaches

  1. Template-based Information Extraction
  2. Deep Visual Template-Free Form Parsing
  3. Attend, Copy, Parse
  4. Graph Convolutional Networks
  5. Representation Learning for Information Extraction from Form-like Documents


Image Source : ACL Demo Slides

Traditional approaches are template-based: they match the OCR'ed text against a template and extract the information. But because invoice layouts vary so widely, it is not possible to apply this approach at scale.

Fig 1 (invoices.png)

Source : paper

Invoices from different vendors present information using different layouts. Template-based approaches are therefore error-prone and do not scale.

In this blog post, we will take a high-level view of the Representation Learning for Information Extraction from Form-like Documents paper. To learn more about the other techniques, you can visit this blog post by Nanonets.

Dataset Details

In this paper, the authors used two different datasets.

  1. Invoices: There are two corpora of invoices. The first corpus contains 14,237 invoices, while the second contains 595 invoices. The invoices do not share a common template; each invoice template is different from the others.

The invoice corpora are private datasets.

  2. Receipts: This dataset is a publicly available corpus of scanned receipts published as part of the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE).

This dataset contains 626 images along with ground truth for four fields: address, company name, total amount, and date. We will use only the total amount and date fields as our targets. The dataset also contains OCR'ed CSV files with the coordinates and the corresponding text.

Sample image: imgonline-com-ua-resize-KPtbs1BaAn3Mb.jpg

Sample ground truth:

    {
        "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
        "date": "25/12/2018",
        "address": "NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
        "total": "9.00"
    }


Before going into the model, let's discuss a few observations.

  1. Each field often corresponds to a well-understood type. For example, likely candidates for the total amount are numeric values. A text like "Weights and Biases" is clearly incorrect for the total amount field.


Image Source : ACL Demo Slides

Limiting the search space by type reduces the complexity of the problem drastically, so we use libraries to generate candidates for each field. For example, a date parser library is a natural candidate generator for the date field. We can also use cloud services like Google NLP to generate candidates more effectively.
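To make the idea of a candidate generator concrete, here is a minimal sketch using only the Python standard library. The regular expressions and function names are assumptions made for illustration (they are not from the paper or from the linked code), and they only cover the DD/MM/YYYY dates and two-decimal amounts seen in the sample receipt.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration; real generators cover many more formats.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
AMOUNT_PATTERN = re.compile(r"(?<![\d.,/])\d+(?:,\d{3})*\.\d{2}(?!\d)")


def date_candidates(text):
    """Return substrings that parse as valid DD/MM/YYYY dates."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        try:
            datetime.strptime(match.group(), "%d/%m/%Y")
        except ValueError:
            continue  # looks like a date but is not a valid calendar date
        found.append(match.group())
    return found


def amount_candidates(text):
    """Return substrings that look like amounts with two decimal places."""
    return [match.group() for match in AMOUNT_PATTERN.finditer(text)]


ocr_text = "DATE: 25/12/2018 ITEM 4.50 ITEM 4.50 TOTAL 9.00"
print(date_candidates(ocr_text))
print(amount_candidates(ocr_text))
```

Note that the amount generator returns every price on the receipt, not just the total. That is intended: the generator only needs high recall, and picking the right candidate is left to the scoring model.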

  2. Each field instance is associated with a key phrase. For example, in Fig 1 we can infer that date instances are typically surrounded by key phrases such as Date or Dated. However, key phrases do not always occur on the same line as the field value. An effective solution is to include spatial information along with the text information.


Image Source : ACL Demo Slides

How to include spatial information?

Spatial information is included by considering the neighbors around each candidate. To select the neighbors, the authors of this paper defined a neighborhood zone.

Neighborhood Zone: For each candidate, the neighborhood zone extends all the way to the left of the page and 10% of the page height above the candidate.

Any text token whose bounding box overlaps the neighborhood zone by more than 50% is considered a neighbor.
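The neighbor selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: boxes are (x_min, y_min, x_max, y_max) with the origin at the top-left of the page, the zone is assumed to end at the candidate's right and bottom edges, and "overlap" is taken to mean the fraction of the token's own area that lies inside the zone.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def neighborhood_zone(candidate, page_height, fraction_above=0.10):
    """Zone from the left page edge to the candidate, plus 10% of page height above."""
    # Assumption: the zone ends at the candidate's right and bottom edges.
    top = max(0.0, candidate.y_min - fraction_above * page_height)
    return Box(0.0, top, candidate.x_max, candidate.y_max)


def overlap_fraction(token, zone):
    """Fraction of the token's area that falls inside the zone."""
    width = min(token.x_max, zone.x_max) - max(token.x_min, zone.x_min)
    height = min(token.y_max, zone.y_max) - max(token.y_min, zone.y_min)
    if width <= 0 or height <= 0:
        return 0.0
    token_area = (token.x_max - token.x_min) * (token.y_max - token.y_min)
    return (width * height) / token_area if token_area > 0 else 0.0


def find_neighbors(candidate, tokens, page_height, min_overlap=0.5):
    """Return the text of every token that overlaps the zone by more than min_overlap."""
    zone = neighborhood_zone(candidate, page_height)
    return [
        text
        for text, box in tokens
        if box != candidate and overlap_fraction(box, zone) > min_overlap
    ]


# Toy page that is 1000 units high; the candidate is a date value.
candidate = Box(600, 500, 700, 520)
tokens = [
    ("Date", Box(450, 500, 520, 520)),     # same line, to the left
    ("Invoice", Box(100, 420, 200, 440)),  # a few lines above
    ("Total", Box(450, 800, 520, 820)),    # far below the candidate
    ("Ltd", Box(680, 300, 760, 320)),      # more than 10% of the page above
]
print(find_neighbors(candidate, tokens, page_height=1000))
```

In this toy layout only the tokens to the left of and slightly above the candidate are kept, which is where key phrases such as "Date" tend to appear.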

  3. Key phrases for a field are largely drawn from a limited vocabulary. About 93% of the date instances in the invoices are associated with key phrases like "date" or "dated" (from the paper). This suggests that the problem can be solved with a modest amount of training data.


Image Source : ACL Demo Slides

All the above observations are applicable to many fields across various documents.

In this problem, data pre-processing is crucial, arguably more important than the model itself. Let's look at the data processing pipeline built on the above observations.


Image Source : ACL Demo Slides

  1. OCR: To extract the text from the image we need to apply OCR. You can use open source tools like EasyOCR, PyTesseract or Cloud Services like Google OCR. In our case, ICDAR has already provided OCR'ed text.

  2. Candidate Generators (Entity Tagging): As discussed in the observations above, we generate candidates using multiple candidate generators. Since the recall of the overall system cannot exceed the recall of the candidate generators, it is important that their recall is high.

  3. Scoring and Assignment Module: This module uses a neural model to compute a score between 0 and 1 for each candidate independently, and then assigns to each field the scored candidate that is most likely to be the true extraction for it.

This separation of scoring and assignment allows us to learn a representation for each candidate based only on its neighborhood, independently of other candidates and fields. It also frees us to encode arbitrarily complex business rules into the assigner if required, for example, that the due date for an invoice cannot (chronologically) precede its invoice date (from the paper).

For brevity, the authors omitted the details of the assignment module and reported results using a simple assigner that chooses the highest-scoring candidate for each field independently of other fields.
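The simple assignment described above amounts to an arg-max per field. A minimal sketch, assuming the scorer has already produced (field, candidate text, score) triples; the function name and the example scores are made up for illustration:

```python
def assign_fields(scored_candidates):
    """Pick the highest-scoring candidate for each field, independently of other fields."""
    best = {}
    for field, text, score in scored_candidates:
        if field not in best or score > best[field][1]:
            best[field] = (text, score)
    return {field: text for field, (text, _score) in best.items()}


# Made-up scores for the sample receipt.
scored = [
    ("date", "25/12/2018", 0.94),
    ("total", "4.50", 0.21),
    ("total", "9.00", 0.88),
]
print(assign_fields(scored))
```

Business rules such as "the due date cannot precede the invoice date" would go into this function, which is why keeping it separate from the scorer is convenient.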

Neural Scoring Model

To make sure the model generalizes across multiple document templates, it learns separate embeddings for the candidate and for the field it belongs to. The similarity between the candidate embedding and the field embedding determines the score.

Another important design choice the authors made is to leave the candidate text out of the model, in order to avoid accidental overfitting. For instance, if the dataset contains only invoices issued before 2020, the model could learn that an invoice date must occur before 2020.

candidate encoding.jpg

Neighborhood Embeddings: The neighboring text tokens are embedded using a word embedding table. Each neighbor relative position is embedded through a nonlinear positional embedding consisting of two ReLU-activated layers with dropout. This nonlinear embedding allows the model to learn to resolve fine-grained differences in position, say between neighbors sharing the same line as the candidate and those on the line above. (from the paper)
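As a rough illustration of the positional part, the sketch below computes a neighbor's position relative to the candidate and passes it through two ReLU layers. Everything here is an assumption made for the example: positions are taken as box centers normalized by the page size, the weights are hand-picked placeholders rather than learned values, and dropout is omitted.

```python
def relative_position(candidate_center, neighbor_center, page_width, page_height):
    """Neighbor offset from the candidate, normalized by the page dimensions."""
    dx = (neighbor_center[0] - candidate_center[0]) / page_width
    dy = (neighbor_center[1] - candidate_center[1]) / page_height
    return [dx, dy]


def relu_layer(inputs, weights, biases):
    """One dense layer followed by ReLU; weights has one row per output unit."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def position_embedding(rel_pos, layer1, layer2):
    """Two stacked ReLU layers (dropout omitted in this sketch)."""
    hidden = relu_layer(rel_pos, *layer1)
    return relu_layer(hidden, *layer2)


# Placeholder weights; in a real model these are learned during training.
layer1 = ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])
layer2 = ([[1.0, 1.0]], [0.0])

rel = relative_position((500, 500), (250, 400), page_width=1000, page_height=1000)
print(rel)
print(position_embedding(rel, layer1, layer2))
```

A neighbor up and to the left of the candidate gets negative offsets on both axes, and the nonlinear layers let the model treat "same line" and "line above" differently.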

Neighborhood Encoding: The neighborhood embeddings are computed independently of each other. To capture the relationships between neighbors, a self-attention mechanism is used. This lets the model down-weight neighbors that are not relevant to the prediction and produces a contextualized representation of the neighbors.


Candidate Position Embedding: The candidate position is embedded using a simple linear layer.

Information about the relative positions of the neighbors with respect to the candidate is already captured in the embeddings themselves. To make the neighborhood encoding invariant to the (arbitrary) order in which the neighbors are included in the features, a max-pooling mechanism is applied over the neighbors.


Candidate Encoding: A candidate encoding is obtained by concatenating the neighborhood encoding with the candidate position embedding.
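The pooling and concatenation steps can be sketched with plain Python lists. The vectors below are toy values, and a real model would likely apply further learned layers; the point is only to show why max-pooling makes the result independent of neighbor order.

```python
def max_pool(neighbor_vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(values) for values in zip(*neighbor_vectors)]


def candidate_encoding(neighbor_vectors, position_embedding):
    """Concatenate the pooled neighborhood encoding with the candidate position embedding."""
    return max_pool(neighbor_vectors) + list(position_embedding)


# Toy contextualized neighbor vectors (for example, the output of self-attention).
neighbors = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]
position = [0.5, 0.25]

encoding = candidate_encoding(neighbors, position)
shuffled = candidate_encoding(list(reversed(neighbors)), position)
print(encoding)
print(encoding == shuffled)
```

Reversing the neighbor list leaves the encoding unchanged, which is exactly the order invariance that max-pooling is meant to provide.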


Field Embedding: The field ID is embedded using a field embedding layer to produce a representation of the field.


Candidate Score: Candidate Encoding is expected to contain all the information about the candidate position along with neighborhood details. It is independent of the field to which the candidate belongs.

Next, we compute the cosine similarity between the candidate encoding and the field embedding. Since the similarity lies in the range -1 to 1, we rescale it to lie between 0 and 1 and choose the candidate with the highest score.
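A small sketch of this scoring step, assuming the rescaling is the linear map (similarity + 1) / 2. That mapping is the simplest one consistent with the description; the function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_score(candidate_encoding, field_embedding):
    """Rescale cosine similarity from [-1, 1] to [0, 1] (assumed linear rescaling)."""
    return (cosine_similarity(candidate_encoding, field_embedding) + 1.0) / 2.0


field = [1.0, 0.0]
print(candidate_score([1.0, 0.0], field))   # same direction as the field embedding
print(candidate_score([0.0, 1.0], field))   # orthogonal to the field embedding
print(candidate_score([-1.0, 0.0], field))  # opposite direction
```

A candidate whose encoding points in the same direction as the field embedding scores highest, and one pointing the opposite way scores lowest.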


To demonstrate the benefits of this model, the authors of this paper proposed two baselines and compared the results.

The bag-of-words (BoW) baseline incorporates only the neighboring tokens of a candidate, but not their positions. The MLP baseline uses the same input features as the proposed model, including the relative positions of the candidate's neighbors, and encodes the candidate using 3 hidden layers.

Both baselines follow the same approach, encoding the candidate and the field separately.

Screenshot from 2021-01-26 13-08-14.png

The results show that the proposed model outperforms both baselines. Because it uses neighbor positions, the MLP baseline outperforms the BoW baseline.

The relative order of feature importance:

neighbor text > candidate position > neighbor position

It is also observed that removing the self-attention layer leads to a 1-point deterioration in scorer ROC AUC and a 1.7 point deterioration in end-to-end max F1.
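For readers unfamiliar with the max F1 metric, the sketch below shows one common way to compute it: sweep a threshold over the scores and keep the best F1. This is a generic illustration with made-up scores and labels, not the paper's exact evaluation protocol.

```python
def max_f1(scores, labels):
    """Best F1 over all score thresholds; labels are 1 (correct) or 0 (incorrect)."""
    best = 0.0
    for threshold in sorted(set(scores)):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
        if tp == 0:
            continue  # precision and F1 are undefined or zero at this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Made-up candidate scores and correctness labels.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(round(max_f1(scores, labels), 3))
```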

Model Representations

The exciting part of this paper is that the authors investigated the internal representations of the model and visualized them using t-SNE, a dimensionality reduction technique.

Are you excited to see how the representations look? Let's go ahead and see.

tsne.png

Points to Note:

  1. From Fig 4(b) it is evident that positive points are clustered together nicely, while negative points show a sparse spatial distribution.

  2. It is important to note that the field embedding lies at the edge of the cluster, far away from the points, rather than at the center of the cluster. This pattern is explained by the fact that the loss function is essentially trying to minimize the cosine distance between the field embedding and its positives while maximizing its distance from its negatives, most importantly the positives for the other fields.

  3. In Fig 4(c) an invoice date example lies far away from the invoice date cluster. This turns out to be an annotation error: the annotator labeled the purchase date as the invoice date.

  4. Candidate Encoding of the sample in Fig 4(d) lies between invoice date and due date. This is explained by the fact that this candidate is surrounded by both the terms due date and date of invoice.

  5. Candidate Encoding of the sample in Fig 4(e) lies far away from the invoice date cluster. Careful examination shows that this is caused by an OCR error resulting from scanning noise.

I believe visualizing model representations helps a lot in general for any model.


I hope you enjoyed reading this blog post. Information extraction from documents is a tough and challenging task. Increasing the accuracy of candidate generators is still an open research problem and requires a lot of domain expertise. Feel free to share your feedback through the comments or my Twitter handle.