EMNIST to Wikisplit: Sentence Composition

Build a CNN+RNN model to read the sentence from an image. Made by Ayush Thakur using Weights & Biases

Introduction

Image classification is the task of assigning a single class label to a whole image. But what if we want to read a sentence written in an image?

One naive approach would be to crop multiple patches from the image and pass each patch through a character classification model. We can train such a model on the EMNIST dataset. However, this approach has several shortcomings.

• It won't generalize to all kinds of sentences in an image.
• The crops won't capture the presence of "space" or other special characters.
• A sentence is a sequence, and the naive approach does not treat it as one, so the sequential relationship between characters is ignored.

Let's build a fun application that leverages Convolutional Neural Networks (CNNs) for extracting features from the image and Recurrent Neural Networks (RNNs) for sequence modeling. This way, we will read entire sentences at a time.

One place where this may be useful is a Google Translate-style application that reads text through a mobile phone camera and then tries to translate it. Another example is an assistive app for visually impaired people that reads road signs aloud for them.

At a high level, this approach will:

• Process the input image
• Find places where there is text
• Convert them into words and sentences using RNNs

This report will not focus on the theoretical underpinnings of this application; instead, it will briefly cover the implementation details. We will start with data preparation, move on to building our architecture, and finally use Weights and Biases to track our model training and log predictions.

Data Preparation

We will be preparing our dataset by using two separate datasets: EMNIST/bymerge and Wiki-split.

The former is a handwritten character dataset, while the latter contains one million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.

We start by downloading both datasets. To download the EMNIST/bymerge dataset, we can use TensorFlow Datasets. We could also use the EMNIST/byclass dataset (this would require minor changes in the code).

import tensorflow_datasets as tfds

# Gather EMNIST/bymerge dataset: 85% of the train split for training, the rest for validation
(train_ds, validation_ds) = tfds.load(
    "emnist/bymerge",
    split=["train[:85%]", "train[85%:]"],
    as_supervised=True
)


To download the Wiki-split dataset, simply git clone the official repo.

!git clone https://github.com/google-research-datasets/wiki-split


Preprocess Data

The next few steps involve:

• Loading the Wikisplit dataset into memory and removing punctuation, since the EMNIST dataset has no punctuation characters. Note that we are using test.tsv for quicker experimentation.

• Building a tf.data dataloader pipeline that applies image preprocessing, batching and shuffling to the EMNIST dataset (train_ds). The preprocessing involves scaling the images from 0-255 to 0-1 and transposing them to restore their upright orientation (the raw EMNIST images appear rotated by 90 degrees).

• We then iterate over the dataloader and collect all the images and labels into two arrays. Our implementation relies on the training data being indexable.

import numpy as np

IMAGES = []
LABELS = []

# Assumes the dataloader yields one (image, label) pair at a time
for img, label in train_ds:
    IMAGES.append(tf.reshape(img, (28,28)).numpy())
    LABELS.append(label.numpy())

IMAGES = np.array(IMAGES)
LABELS = np.array(LABELS)

• We create a dictionary image_index where the key is the character (unique label) and the value is a list of IDs (the indices of the images corresponding to that character).
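As a rough illustration of the last step, the lookup table can be built in a single pass over the labels. This is a minimal sketch rather than the notebook's code: build_image_index is a hypothetical helper name, and the toy labels list stands in for the LABELS array.

```python
from collections import defaultdict

def build_image_index(labels):
    """Map each unique label to the list of positions where it occurs."""
    image_index = defaultdict(list)
    for idx, label in enumerate(labels):
        image_index[int(label)].append(idx)
    return dict(image_index)

# Toy stand-in for LABELS: six images covering three classes
labels = [3, 7, 3, 1, 7, 3]
image_index = build_image_index(labels)
print(image_index)  # {3: [0, 2, 5], 7: [1, 4], 1: [3]}
```

With such a table, picking a random handwritten sample for a given character is just a random choice from one list.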

Generate Dataset - Images

Our training images look like this:

Fig 1: A view of our training images

Each training image is thus built by concatenating randomly chosen EMNIST character images so that they spell out a Wikisplit sentence. We will use two utility functions, get_sample_sentences and get_generated_image. Check them out in the Colab notebook under the "Utils to generate images from the text" section.
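To make the idea concrete, here is a minimal NumPy sketch of how one sentence image could be assembled. It is not the notebook's get_generated_image: the function name compose_sentence_image, the char_to_label dictionary and the choice to draw a space as an empty tile are all assumptions made for illustration.

```python
import numpy as np

def compose_sentence_image(text, images, image_index, char_to_label, rng):
    """Concatenate one randomly chosen glyph per character, left to right."""
    height, width = images[0].shape
    tiles = []
    for ch in text:
        if ch == " ":
            # Assumption: a space is drawn as an empty (all-zero) tile
            tiles.append(np.zeros((height, width), dtype=images.dtype))
        else:
            candidates = image_index[char_to_label[ch]]
            tiles.append(images[rng.choice(candidates)])
    return np.concatenate(tiles, axis=1)

# Toy data: four 28x28 "glyphs", two per class
rng = np.random.default_rng(0)
images = rng.random((4, 28, 28))
image_index = {0: [0, 1], 1: [2, 3]}
char_to_label = {"a": 0, "b": 1}

sentence_image = compose_sentence_image("ab a", images, image_index, char_to_label, rng)
print(sentence_image.shape)  # (28, 112)
```

Because a different sample is drawn for every character, calling the function twice on the same sentence gives two different images, which is how several variants per sentence can be produced.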

Finally, we will build our training dataset using the generate_sentences function. It takes text lines and generates images. It can generate multiple images per sentence, as controlled by the num_variants parameter. The max_length parameter ensures that all sentences have the same length.

![image.png](https://api.wandb.ai/files/wandb/images/projects/110758/3cd647cc.png)

train_sentences = sentences[:2000]
test_sentences = sentences[2000:2500]

# Let's assume that for each training sample, 2 variants will be generated

def generate_sentences(texts, chars,
                       index, num_variants=2, max_length=32):
    # this method takes input text lines, character samples and labels
    # and generates images. It can generate multiple images per sentence
    # as controlled by num_variants parameter. max_length parameter
    # ensures that all sentences are the same length

    # total number of samples to generate
    num_samples = len(texts) * num_variants
    height, width = chars[0].shape  # shape of image

    # setup empty array of the images
    images = np.zeros((num_samples, height, width * max_length), np.float64)
    labels = []

    for i, item in tqdm(enumerate(texts)):
        # truncate long sentences and right-pad short ones with spaces
        padded_item = item[0:max_length] if (len(item) > max_length) else item.ljust(max_length, ' ')

        for v in range(num_variants):
            # get_generated_image is defined in the Colab notebook; the call
            # signature and return values used here are assumed
            img, lbl = get_generated_image(padded_item, chars, index)
            images[i*num_variants+v, :, :] += img
            labels.append(lbl)

    return images, labels


Generate Dataset - Labels

We need to integer-encode our sentences, which are made of characters. We will use an inverse_mapping dictionary where the key is the character and the value is its corresponding integer encoding.

Fig 2: An example of integer encoding.
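The encoding step can be sketched in a few lines of plain Python. The vocabulary below (digits, letters and the space character, in that order) is a hypothetical one chosen for illustration; the real inverse_mapping depends on the label order of the EMNIST variant in use, and encode_sentence is an assumed helper name.

```python
import string

# Hypothetical vocabulary: digits, uppercase, lowercase and the space character
vocabulary = string.digits + string.ascii_uppercase + string.ascii_lowercase + " "
inverse_mapping = {ch: i for i, ch in enumerate(vocabulary)}

def encode_sentence(sentence, mapping, max_length=32):
    """Pad or truncate to max_length, then map each character to its integer id."""
    padded = sentence[:max_length].ljust(max_length, " ")
    return [mapping[ch] for ch in padded]

encoded = encode_sentence("Hi 42", inverse_mapping, max_length=8)
print(encoded)  # [17, 44, 62, 4, 2, 62, 62, 62]
```

Note that the padding uses the same space character as the images do, so every label has exactly max_length entries.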

We will build an input pipeline using tf.data.
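A minimal sketch of such a pipeline is shown below. It assumes the images and integer-encoded labels are already NumPy arrays, and it yields dictionaries keyed by cnn_input and label, the input names used by the model in the next section. The function name make_dataset, the batch size and the shuffle buffer are illustrative choices, not the notebook's settings.

```python
import numpy as np
import tensorflow as tf

def make_dataset(images, labels, batch_size=32, shuffle=True):
    """Yield dict batches keyed by the model's input names."""
    features = {
        "cnn_input": images.astype("float32"),
        "label": np.asarray(labels, dtype="float32"),
    }
    ds = tf.data.Dataset.from_tensor_slices(features)
    if shuffle:
        ds = ds.shuffle(buffer_size=len(images))
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Toy arrays: 6 images of 28 x (28 * 32) pixels with 32 label ids each
toy_images = np.zeros((6, 28, 28 * 32))
toy_labels = np.zeros((6, 32))
toy_ds = make_dataset(toy_images, toy_labels, batch_size=4, shuffle=False)
first_batch = next(iter(toy_ds))
print(first_batch["cnn_input"].shape, first_batch["label"].shape)  # (4, 28, 896) (4, 32)
```

Since the loss is computed inside the model, the dataset only needs to yield inputs; there is no separate target tensor.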

Now on to a more exciting part of this application.

The Model

Before we get into the nitty-gritty of our model, take a good look at the model's Keras implementation. Once you have a mental picture of how the layers stack on top of each other, the architecture will be easier to understand.

from tensorflow.keras.layers import Input, Reshape, Lambda, TimeDistributed, LSTM, Dense
from tensorflow.keras.models import Model

def ImagetoSequence():
    # IMAGE ENCODER #
    labels = Input(name="label", shape=(None,), dtype="float32")
    image_input = Input(shape=(IMG_H, IMG_W), name="cnn_input")
    image_reshaped = Reshape((IMG_H, IMG_W, 1))(image_input)
    # extract patches of images
    image_patches = Lambda(extract_patches)(image_reshaped)
    # get CNN backbone architecture to get embedding for each patch
    image_patch_encoder = ImagePatchEncoder()
    # Wrapper allows to apply a layer to every temporal slice of an input.
    time_wrapper = TimeDistributed(image_patch_encoder)(image_patches)

    # RECURRENT NETWORK #
    lstm_out = LSTM(128, return_sequences=True, name="lstm")(time_wrapper)
    softmax_output = Dense(NUM_CLASSES, activation='softmax', name="lstm_softmax")(lstm_out)

    output = CTCLayer(name="ctc_loss")(labels, softmax_output)

    return Model([image_input, labels], output)


Now let us deconstruct our ImagetoSequence model.

• We have two Input layers, so we are providing two inputs to our model. One is the training image; the other is the training label itself. No, we are not cheating here. If you look closely, the labels input layer is connected only to the CTCLayer, which uses it to compute the loss. To learn more about Connectionist Temporal Classification (CTC), check out the report on Text Recognition with CRNN-CTC Network by Rajesh Shreedhar Bhat.

• The input image is reshaped so that it has a single channel. We might not need this step for a color image with 3 channels, or if our input data pipeline already handles the reshaping.

• We then extract patches from the input image. Notice the Lambda layer, which wraps a custom function. We will get to it soon.

• We call another function, ImagePatchEncoder, inside our model. As you might guess, it returns a Keras model.

• We pass the returned model to a TimeDistributed layer. This wrapper applies a layer (a model can act as a layer in Keras) to every temporal slice of an input. Here, it passes each patch produced by the Lambda layer, in sequence, through the model returned by ImagePatchEncoder. Check out this blog post for a better understanding of how the TimeDistributed layer works.

• We then apply an LSTM layer followed by a Dense layer with softmax activation.

• We pass this softmax output to the CTCLayer. Check out the report linked above to learn more about the CTCLayer.

Extract Patches - Wrapping with Lambda Layer

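This layer collects patches from the input image as if applying a convolution. All extracted patches are stacked in the depth (last) dimension of the output. The extract_patches function is wrapped using the Lambda layer. Learn more about tf.image.extract_patches here and the Lambda layer here.

def extract_patches(image):
    # each extracted strip is 1 pixel high and PATCH_WIDTH pixels wide,
    # and the window slides PATCH_STRIDE columns at a time
    kernel = [1, 1, PATCH_WIDTH, 1]
    strides = [1, 1, PATCH_STRIDE, 1]
    # shape: (batch, IMG_H, num_patches, PATCH_WIDTH)
    patches = tf.image.extract_patches(image, kernel, strides, [1, 1, 1, 1], 'VALID')
    # shape: (batch, num_patches, IMG_H, PATCH_WIDTH)
    patches = tf.transpose(patches, (0, 2, 1, 3))
    # shape: (batch, num_patches, IMG_H, PATCH_WIDTH, 1)
    patches = tf.expand_dims(patches, -1)
    return patches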
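
To build intuition for what extract_patches returns, the same slicing can be mimicked in plain NumPy for a single image. This is only a sketch of the sliding-window logic, not the TensorFlow op itself, and the values chosen for PATCH_WIDTH and PATCH_STRIDE are hypothetical; the notebook's constants may differ.

```python
import numpy as np

def sliding_patches(image, patch_width, patch_stride):
    """Cut full-height vertical strips, mimicking 'VALID' padding."""
    height, width = image.shape
    num_patches = (width - patch_width) // patch_stride + 1
    starts = [i * patch_stride for i in range(num_patches)]
    return np.stack([image[:, s:s + patch_width] for s in starts])

# Hypothetical sizes for illustration only
IMG_H, IMG_W = 28, 28 * 32
PATCH_WIDTH, PATCH_STRIDE = 20, 12

image = np.arange(IMG_H * IMG_W, dtype=np.float32).reshape(IMG_H, IMG_W)
patches = sliding_patches(image, PATCH_WIDTH, PATCH_STRIDE)
print(patches.shape)  # (74, 28, 20)
```

The first dimension is the number of time steps the recurrent network will see. With these example values, a 32-character sentence is covered by 74 overlapping strips, which leaves CTC more time steps than label characters to align against.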


ImagePatchEncoder

This is the CNN backbone of our model architecture. Notice the input shape of the Input layer: it takes in the extracted patches and produces a flattened embedding for each patch. This embedding is then used by the Long Short-Term Memory (LSTM) recurrent network to map to a sequence. WOW!

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten

def ImagePatchEncoder():
    patched_inputs = Input(shape=(IMG_H, PATCH_WIDTH, 1))
    x = Conv2D(32, kernel_size=(3, 3), activation='relu')(patched_inputs)
    x = Conv2D(64, kernel_size=(3, 3), activation='relu')(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    x = Dropout(0.2)(x)
    flattened_outputs = Flatten()(x)

    return Model(inputs=patched_inputs, outputs=flattened_outputs, name='backbone')
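
The size of the embedding that each patch produces can be worked out by hand from the layers above. The short sketch below traces the shape step by step; the patch size of 28 x 20 is a hypothetical example, and backbone_embedding_size is an illustrative helper, not part of the notebook.

```python
def backbone_embedding_size(img_h, patch_width):
    """Trace the feature-map shape through the layers of ImagePatchEncoder."""
    h, w = img_h, patch_width
    h, w = h - 2, w - 2      # Conv2D(32, 3x3) with default 'valid' padding
    h, w = h - 2, w - 2      # Conv2D(64, 3x3) with default 'valid' padding
    h, w = h // 2, w // 2    # MaxPooling2D(2x2)
    channels = 64            # filters of the last convolution
    return h * w * channels  # Dropout keeps the shape, Flatten multiplies it out

print(backbone_embedding_size(28, 20))  # 12 * 8 * 64 = 6144
```

This is the length of the feature vector the LSTM receives at every time step, so widening the patches or deepening the backbone directly changes the LSTM's input size.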


Let's end this section with model.summary()

Fig 3: Model Summary

Training

Now that we have our data and the model, it's time to train our model.

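We will use Weights and Biases to track our training metrics and log our predictions. We will use the WandbCallback API to automatically log our training metrics. Learn more about the Keras integration in this Colab notebook.

import wandb
from wandb.keras import WandbCallback

wandb.init(project='my-cool-project')

# trainloader and validloader are assumed names for the tf.data pipelines built earlier
model.fit(trainloader, validation_data=validloader,
          epochs=100,
          callbacks=[WandbCallback(),
                     early_stopper])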


Let's look at the training metrics. The model looks to have trained well. Let's look at the predictions.

Predictions

The first thing that is required is to build a model for inference.

# notice we are removing CTCLayer
inference_model = Model(model.get_layer('cnn_input').input, model.get_layer('lstm_softmax').output)
inference_model.summary()


Notice that we are discarding the labels layer (one of the two Input layers) as well as the CTCLayer. The inference_model will thus output the softmax prediction.

Let's pass in some test images and read some sentences.
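The softmax output still has one distribution per time step, so it has to be decoded into a character sequence. One simple option is greedy (best-path) CTC decoding, sketched below in plain Python on a toy output. This is an illustration under assumptions: the notebook may instead rely on a library decoder such as tf.keras.backend.ctc_decode, and the blank is assumed here to be the last class index.

```python
def greedy_ctc_decode(probabilities, blank_index):
    """Best-path decoding: argmax per step, merge repeats, drop blanks."""
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probabilities]
    decoded, previous = [], None
    for idx in best_path:
        if idx != previous and idx != blank_index:
            decoded.append(idx)
        previous = idx
    return decoded

# Toy softmax output: 6 time steps over 3 classes, class 2 acting as the blank
toy_probs = [
    [0.8, 0.1, 0.1],  # class 0
    [0.7, 0.2, 0.1],  # class 0 again (repeat, merged)
    [0.1, 0.1, 0.8],  # blank
    [0.6, 0.3, 0.1],  # class 0 (a new symbol, because a blank came in between)
    [0.2, 0.7, 0.1],  # class 1
    [0.1, 0.2, 0.7],  # blank
]
print(greedy_ctc_decode(toy_probs, blank_index=2))  # [0, 0, 1]
```

The decoded integers can then be mapped back to characters with the reverse of the inverse_mapping dictionary.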

Final words

We can see that our model is performing well. Like, surprisingly well. Nevertheless, here are a few things you can try:

• You can see that the predicted text mixes up lowercase and capital letters. This is because we used the EMNIST/bymerge dataset, which merges the uppercase and lowercase classes of letters that look alike in both cases. A better variant for this task would be the EMNIST/byclass dataset. You should check out the EMNIST paper.

• Play with the depth of the backbone CNN model.

• Replace the LSTM with a bidirectional LSTM (BiLSTM).

• Use a learning rate scheduler.
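
As a starting point for the last suggestion, here is a minimal sketch of a step-decay schedule. The function name step_decay and its default values are illustrative assumptions, not tuned settings from this report.

```python
def step_decay(epoch, lr=None, initial_lr=1e-3, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs.

    The unused `lr` argument keeps the (epoch, lr) signature that Keras'
    LearningRateScheduler callback passes to its schedule function.
    """
    return initial_lr * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(epoch))
```

A function like this can be wrapped in tf.keras.callbacks.LearningRateScheduler(step_decay) and added to the callbacks list passed to model.fit.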