Optical Character Recognition: Then and Now
In this article, we explore optical character recognition and leverage pre-trained text localization and recognition models to find and extract text from images.
Today, we'll explore optical character recognition (OCR), the process of using computer vision models to locate and identify text in an image, and gain an in-depth understanding of some of the common deep-learning-based OCR libraries and their model architectures.
We'll also look at one of the more well-known 'historical' OCR tools (Tesseract) and see how it compares to more recent OCR models. In fact, let's start with that little bit of history.
Here's what we'll be covering:
Table of Contents
- Brief History of OCR
- Today's OCR Tools and OCR Tools of the Past
- Comparing Off-the-Shelf OCR Tools
- About the Frameworks and Models
- Tesseract
- Notebooks
- Wrapping Up
- Additional Reading
Brief History of OCR
Even back in the 1980s, during a period of decreased research funding and interest in artificial intelligence (you may know this as the AI winter), there was still progress being made on practical applications of OCR: barcodes were being used to sort pieces of mail automatically, and facsimile (fax) machines were used to transmit information across long distances quickly.
OCR technology as we know it today did not exist during the AI winter. But in the late 1980s, Hewlett-Packard had a breakthrough: their OCR engine, designed for the very narrow use case of OCRing printed text scanned on HP's proprietary flat-bed scanners, began to show promising results. Indeed, by the mid-1990s, at the UNLV Annual Test of OCR Accuracy, the tool, known as Tesseract, was performing far in excess of its creators' expectations.
Today's OCR Tools and OCR Tools of the Past
Today's OCR tools rely upon deep-learning-based architectures (we'll explore these later in more detail). Tesseract, on the other hand, operated on binarized (black-and-white) text, looking for connected components, and used a two-step process: first identify each word, then pass correctly identified words into an adaptive classifier.
As Tesseract 'read' a page, it could get better at recognizing words further down the page. Because the adaptive classifier may have learned something 'too late' on the page, Tesseract would re-read the page, performing a second pass on the words that were not recognized well enough the first time. The final phase handled things like 'fuzzy spaces' and small capitalized text.
Other early OCR tools had a similar process:
- Preprocessing: the image of the text was cleaned up first; think skew correction, contrast adjustment (Otsu's method, etc.), 'snow' removal, and more (we sketch this step in code after the list)
- Object localization: in our case, text localization. Basically, "draw a bounding box around the region of interest" in which we think text occurs.
- Character segmentation: also known as glyph segmentation, this is the process whereby individual characters are separated from their neighboring characters.
- Character or glyph recognition: given my unknown letter and my collection of all the known letters/glyphs for English, which known letter does this unknown letter most closely match?
- Post-processing: tools like dictionaries or word lists can be used to clean up the text that has been pulled off of a page image.
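To make the preprocessing step concrete, here's a minimal sketch using OpenCV: Otsu binarization plus a simple rotation-based deskew. The file names and the angle heuristic are illustrative assumptions, not code from any particular OCR tool.
```python
import cv2
import numpy as np

# Load as grayscale; 'page.png' is a placeholder file name.
image = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Contrast adjustment via Otsu's method: picks a global threshold
# automatically, leaving ink at 0 and the page background at 255.
_, binarized = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Skew correction: fit a rotated rectangle around the ink pixels and rotate
# the page back to horizontal. This angle heuristic is rough and its sign
# convention depends on your OpenCV version; treat it as a starting point.
coords = np.column_stack(np.where(binarized == 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
    angle = -(90 + angle)
else:
    angle = -angle
h, w = binarized.shape
matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(binarized, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("page_clean.png", deskewed)
```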
Modern OCR tools rely upon deep neural networks, which first perform text detection ("Is there text in this image? If so, where is it located?"), followed by text recognition: identifying which letters or characters appear in the region where the model has already detected text.
Text detection can also be thought of as object localization; if you're familiar with YOLO (You Only Look Once), you're already aware of the object localization process. Simply stated, object localization is drawing a tightly cropped bounding box around the region where an object occurs in an image. Because text recognition is so much more robust with these deep learning models, it is often possible to skip the post-processing steps, which can be expensive operations involving dictionary lookups. Thanks to the rapid increases in neural network performance over the last decade, the 'old' multi-step OCR methods of the past are no longer needed, nor do they perform comparably to these new OCR techniques.

Cat with object localization box
Comparing Off-the-Shelf OCR Tools
For a comparison of off-the-shelf OCR tools, we've assembled three commonly used libraries. We start with this Notebook: Fine-tuning keras-ocr with Weights & Biases. Next, we cover fine-tuning a PaddleOCR model, which is a task-specific model in the PaddlePaddle ecosystem. And for the third OCR fine-tuning task, we'll walk you through fine-tuning a text recognition model using the EasyOCR library.
About the Frameworks and Models
Now, before we jump into exploring the tools in our Notebooks, let us briefly learn about each library:
keras-ocr is an OCR library built on top of the popular deep learning framework Keras. It utilizes the CRAFT (Character-Region Awareness For Text detection) algorithm with a VGG model as the backbone.
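To give you a feel for the library, here's a minimal sketch of running the pretrained keras-ocr pipeline; the image URL is a placeholder.
```python
import keras_ocr

# Builds the pretrained CRAFT detector + CRNN recognizer pipeline;
# weights are downloaded on first use.
pipeline = keras_ocr.pipeline.Pipeline()

# The image URL is a placeholder; local paths work as well.
images = [keras_ocr.tools.read("https://example.com/street_sign.jpg")]

# For each image, returns a list of (word, box) tuples, where box holds
# the four corner points of the detected text region.
predictions = pipeline.recognize(images)
for word, box in predictions[0]:
    print(word, box)
```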
PaddleOCR is a tool built by Baidu Research that supports many languages and, in contrast to keras-ocr's English-only default pipeline, is able to OCR Chinese characters. The PaddlePaddle (PArallel Distributed Deep LEarning) ecosystem consists of the PaddlePaddle framework along with hundreds of production-ready end-to-end models for common deep learning tasks, available through PaddleHub.
Whereas some other OCR libraries are not very performant on non-Latin scripts, PaddleOCR performs well across English, Chinese, French, German, Arabic, and many additional languages. Out of the box, PaddleOCR performs comparably to some cloud providers' computer vision APIs. For engineers who want the performance of a cloud provider's computer vision API without paying cloud costs, PaddleOCR is an excellent alternative.
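As a rough sketch of that out-of-the-box usage (the file name is a placeholder, and the exact result nesting varies slightly across PaddleOCR versions):
```python
from paddleocr import PaddleOCR

# 'lang' selects a language pack, e.g. "en", "ch", "fr", "german", "arabic".
ocr = PaddleOCR(lang="en")

# Runs detection + recognition in one call; "receipt.jpg" is a placeholder.
result = ocr.ocr("receipt.jpg")
for box, (text, confidence) in result[0]:
    print(text, confidence)
```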
EasyOCR uses a CNN + BiLSTM + CTC (Connectionist Temporal Classification loss) deep neural network by default, although you can experiment with alternate architectures when fine-tuning an EasyOCR model: attention mechanisms, VGG backbones, and more can be swapped in when you're trialing architectural decisions during training. The decoder options offered by the tool are greedy, beam search, and word beam search.
EasyOCR is able to handle 'messier' data, including scenes containing text, which often trips up Tesseract. EasyOCR supports 80+ languages, including non-Latin scripts such as Arabic, Cyrillic, Chinese, Korean, Japanese, Telugu, Kannada, and more. The CRAFT model used in the keras-ocr library makes a second appearance in EasyOCR, where it handles text detection (a step we don't cover in our tutorial).
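Here's a minimal sketch of EasyOCR inference; the image path is a placeholder, and the decoder argument selects among the decoders listed above.
```python
import easyocr

# Language packs load once per Reader; compatible languages can be combined,
# e.g. easyocr.Reader(["en", "ar"]) for English plus Arabic.
reader = easyocr.Reader(["en"])
results = reader.readtext("street_scene.jpg",     # placeholder image path
                          decoder="beamsearch")   # or "greedy" / "wordbeamsearch"

# Each result is a (bounding box, recognized text, confidence) triple.
for box, text, confidence in results:
    print(text, confidence)
```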
Tesseract
Since Tesseract is C/C++-based, albeit with a Python wrapper of arguably limited functionality, fine-tuning it is outside the scope of this article: we don't expect most people to have C/C++ backgrounds, nor is the CLI version of Tesseract trainable in an 'observable' fashion. PyTesseract can be thought of as a Python 'flavor' of Tesseract (which is written in C/C++), but PyTesseract shouldn't be considered true Python bindings, because all it does is provide an interface to the Tesseract binary.
For a more feature-rich Python Tesseract implementation, check out tesserocr. PyTesseract has to perform its OCR work on temp files, so if this I/O is a concern for you, you may want to choose a different OCR library or make use of the C/C++ version of Tesseract.
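To make that wrapper relationship concrete, here's a minimal PyTesseract call; "scan.png" is a placeholder file name.
```python
import pytesseract
from PIL import Image

# Under the hood, this writes a temp file and shells out to the
# tesseract binary, as discussed above.
text = pytesseract.image_to_string(Image.open("scan.png"))
print(text)
```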
In 2018, Tesseract began to ship with an LSTM-based OCR 'engine'; this was Tesseract version 4.0. The new engine boosted performance, yet Tesseract still struggled somewhat with visually occluded text, blurry text, curved text, and the like. In a future write-up, we'll compare the performance of Tesseract 5.0, which has improved performance on 'messy' text data. In our experience, Tesseract's performance can be dramatically increased without fine-tuning the model, in contrast to the other tools we've covered on this page. You can improve Tesseract's performance by doing the following (we pull a few of these tips together in a sketch after the list):
- properly preprocessing the input data: de-skew, de-noise, thresholding/contrast adjustment
- running tesseract with the correct language pack installed if you have non-English text to OCR
- having tesseract operate on TIFF files rather than PNGs or other compressed-format images; if you must use PNGs, make sure not to have the Alpha (transparency) channel set as this will further degrade performance
- selecting the proper page segmentation method (psm parameter)
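Here's a sketch that combines a few of these tips using OpenCV and PyTesseract; the psm value, language code, and file name are illustrative assumptions, so pick the values that match your documents.
```python
import cv2
import pytesseract

# Preprocess: grayscale, de-noise, then Otsu thresholding for contrast.
image = cv2.imread("letter.png", cv2.IMREAD_GRAYSCALE)  # placeholder file
image = cv2.fastNlMeansDenoising(image)
_, image = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# --psm 6 assumes a single uniform block of text; 'lang' must match an
# installed traineddata pack (e.g. "deu" for German).
text = pytesseract.image_to_string(image, lang="eng", config="--psm 6")
print(text)
```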
Notebooks
Fine-tune several OCR libraries' text detection or text recognition models with the help of Weights and Biases:
- keras-ocr Colab Notebook - Take the CRAFT model and fine-tune it on ICDAR 2013 data: images of straight lines of text data that are noticeably 'easier' - less blur, no occlusion, etc. - compared to the ICDAR 2015 data, used below.
- PaddleOCR Fine-tuning Colab Notebook - Take the PaddleOCR framework and, using MobileNetV3 as the backbone model, fine-tune on the ICDAR 2015 dataset: scenes containing text data. MobileNet is often preferred over larger, slower ResNet backbones, especially when a small, fast model is needed.
- EasyOCR Fine-tuning Colab Notebook - We explore how to fine-tune one of the recognition model architectures that EasyOCR can use for optical character recognition and showcase some gradient logging as well. Once you've fine-tuned your model, you can BYOM (Bring Your Own Model) into the EasyOCR tooling and make OCR predictions with your own model: https://github.com/JaidedAI/EasyOCR#trainuse-your-own-model
Wrapping Up
After you've experimented with the frameworks in the three Notebooks above and perhaps have given Tesseract a try as well, you may have come to some of the following conclusions:
- For occluded, motion-blurred, curved or distorted, or otherwise 'hard to read' text, you'll probably notice that any one of the three libraries outperforms Tesseract
- For 'clean' text such as scanned or even photographed images of printed media, Tesseract is still a winner and can be made even more performant by adjusting the page segmentation method, telling Tesseract which language the text is in, and so on. Newer versions of Tesseract, such as 5.0, use 'fast floats' instead of doubles for a performance increase and run a second version of the LSTM engine (which first appeared in Tesseract 4.0) that is more performant than the original.
Depending on the variability of the text images your machine learning system needs to handle, you may find off-the-shelf models very performant without any fine-tuning. Some of the multi-step architectures, such as those showcased in the PaddleOCR tool or the CRAFT model with the VGG backbone, are leaps and bounds beyond the convolutional neural networks of ten years ago, which were so commonplace in computer vision and OCR tasks.
Experiment with several OCR libraries – and within those libraries, experiment with several model architectures, too – to see which are well-suited to your text recognition needs. Don't be afraid to augment your images while testing to make sure your OCR choice is performant under a wide range of 'text conditions'; a tool like albumentations makes this easy.
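For instance, here's a minimal sketch of stress-testing an OCR pipeline with albumentations; the particular transform mix is an illustrative assumption.
```python
import albumentations as A
import cv2

# A small augmentation suite to mimic 'hard to read' text conditions.
transform = A.Compose([
    A.MotionBlur(blur_limit=7, p=0.5),        # simulate camera shake
    A.RandomBrightnessContrast(p=0.5),        # lighting variation
    A.Rotate(limit=10, p=0.5),                # mild skew
])

image = cv2.imread("sign.jpg")                # placeholder image
augmented = transform(image=image)["image"]   # feed this to your OCR model
```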
And finally, the OCR tools we showcased above all have active development communities on GitHub; ask questions in the respective communities, and you can learn a lot from engineers and researchers who have been working in OCR for decades as well as from newcomers to OCR and deep learning. Many of the skills you'll pick up setting up and optimizing deep-learning-based OCR pipelines transfer easily to other computer vision tasks, so check out some of our other Reports on computer vision models!
Addendum
For an interesting analysis of Tesseract versus cloud providers' OCR tools from Amazon and Google, check out this paper hosted on the Springer website: OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment.
While cloud providers' OCR models aren't always fine-tunable, they can sometimes provide performance boosts over fine-tunable deep-learning-based OCR libraries because the cloud-based OCR tools were trained on such large volumes of data from varied sources. Cloud OCR tools have additional utility when it would be cost-prohibitive for you to come up with labeled training data to feed into your neural net OCR models during the fine-tuning process.
Additional Reading
Information Extraction from Scanned Receipts: Fine-tuning LayoutLM on SROIE
An OCR demo with LayoutLM fine-tuned for information extraction on receipts data.
Train and Debug Your OCR Models With PaddleOCR and W&B
This article provides a quick tutorial on using the Weights & Biases integration in PaddleOCR to track training and evaluation metrics along with model checkpoints.
Information Extraction From Documents Using Machine Learning
In this article, we'll extract information from templated documents like invoices, receipts, loan documents, bills, and purchase orders, using machine learning.
YOLOv5 Object Detection on Windows (Step-By-Step Tutorial)
This tutorial guides you through installing and running YOLOv5 on Windows with PyTorch GPU support. Includes an easy-to-follow video and Google Colab.