From EMNIST to Wikisplit: Sentence Composition Using W&B
This article explains how to build a CNN+RNN model to read a sentence from an image, using Weights & Biases to track our experiments.
Image classification is the task of assigning a whole image to a single class. But what if we want to read a sentence written in an image?
One naive approach would be to crop multiple patches of an image and pass those patches through a character classification model. We can train such a model with the EMNIST dataset. However, this approach has several shortcomings.
- It won't generalize to all kinds of sentences in an image.
- The crops won't capture the presence of "space" or other special characters.
- A sentence is a sequence, but the naive approach does not treat it like one, so the sequential relationships between characters are ignored.
Let's build a fun application that leverages Convolutional Neural Networks for extracting features from the image and Recurrent Neural Networks for sequence modeling. This way, we can read entire sentences at a time.
An example of where this may be useful is a Google Translate type application that reads text from a mobile phone camera and then tries to translate it. Another example is an assistive app for visually impaired people that reads road signs aloud.
At a high level, this approach will:
- Process the input image
- Find the places where there is text
- Read the text as a sequence of characters
This report will not focus on the theoretical underpinnings of this application, but will briefly cover the implementation details. We will start from data preparation, move on to building our architecture, and finally use Weights & Biases to track our model training and log predictions.
Data Preparation
We will be preparing our dataset by using two separate datasets: EMNIST/bymerge and Wiki-split.
The former is a handwritten character dataset, while the latter contains one million English sentences extracted from Wikipedia edits, each split into two shorter sentences that together preserve the original meaning.
Download Data
We start by downloading both datasets. To download the EMNIST/bymerge dataset, we can use TensorFlow Datasets (tfds). We can also use the EMNIST/byclass dataset (this will require minor changes to the code).
import tensorflow_datasets as tfds

# Gather EMNIST/bymerge dataset
train_ds, validation_ds = tfds.load(
    "emnist/bymerge",
    split=["train[:85%]", "train[85%:]"],
    as_supervised=True,
)
To download the Wiki-split dataset, simply git clone the official repo.
!git clone https://github.com/google-research-datasets/wiki-split
Preprocess Data
The next few steps involve:
- Loading the WikiSplit dataset into memory and removing punctuation, since the EMNIST dataset does not contain punctuation characters. Note that we are using test.tsv for quicker experimentation.
- Building a tf.data dataloader pipeline that applies image preprocessing, shuffling, and batching to the EMNIST dataset (train_ds). The preprocessing steps are: scaling pixel values from 0-255 to 0-1, and transposing the images to undo their 90-degree rotation (a rough sketch follows after this list).
- We then iterate over the dataloader and collect all the images and labels into two arrays, since our implementation relies on the training data being indexable.
IMAGES = []
LABELS = []

for img, label in tqdm(trainloader):
    IMAGES.append(tf.reshape(img, (28, 28)).numpy())
    LABELS.append(label.numpy())

IMAGES = np.array(IMAGES)
LABELS = np.array(LABELS)
- We will create a dictionary image_index where the key is a character (unique label) and the value is a list of IDs (indices of the images corresponding to that character).
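Below is a minimal sketch of what the dataloader iterated over above and the image_index lookup might look like. It assumes the train_ds object from earlier and a hypothetical label_to_char mapping from EMNIST class IDs to characters; the exact helpers in the Colab notebook may differ.

from collections import defaultdict

import numpy as np
import tensorflow as tf
from tqdm import tqdm

def preprocess(image, label):
    # scale pixel values from 0-255 to 0-1 and transpose the image
    # to undo the 90-degree rotation that EMNIST images ship with
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.transpose(image, perm=[1, 0, 2])
    return image, label

trainloader = (
    train_ds
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1024)
    .prefetch(tf.data.AUTOTUNE)
)

# after unpacking the dataloader into IMAGES and LABELS, build the
# character -> image indices lookup; label_to_char is a hypothetical
# mapping from EMNIST class IDs to characters
image_index = defaultdict(list)
for i, label in enumerate(LABELS):
    image_index[label_to_char[label]].append(i)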
Generate Dataset - Images
Our training image looks like this:

Fig 1: A view of our training images
Each training image is thus a concatenation of randomly chosen EMNIST character images, assembled from a WikiSplit sentence. We will use two utility functions, get_sample_sentences and get_generated_image. Check them out in the Colab notebook under the "Utils to generate images from the text" section; a rough sketch of get_generated_image follows below.
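The exact implementation lives in the notebook; as a rough, hypothetical sketch, get_generated_image picks a random EMNIST sample for every character (using the image_index lookup from earlier) and stitches the crops together horizontally:

import random

import numpy as np

def get_generated_image(text, chars, index):
    # hypothetical sketch: build one sentence image by sampling a random
    # EMNIST crop for each character and placing the crops side by side
    height, width = chars[0].shape
    canvas = np.zeros((height, width * len(text)), np.float64)
    for pos, ch in enumerate(text):
        if ch == ' ' or ch not in index:
            continue  # leave blank columns for spaces and unknown characters
        sample_id = random.choice(index[ch])
        canvas[:, pos * width:(pos + 1) * width] = chars[sample_id]
    return canvas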
Finally, we will build our training dataset using the generate_sentences function. It takes text lines and generates images, and it can generate multiple images per sentence, as controlled by the num_variants parameter. The max_length parameter ensures that all sentences are the same length.
train_sentences = sentences[:2000]
test_sentences = sentences[2000:2500]

# Let's assume that for each training sample, 2 variants will be generated
def generate_sentences(texts, chars, index, num_variants=2, max_length=32):
    # this method takes input text lines, character samples and labels
    # and generates images. It can generate multiple images per sentence
    # as controlled by the num_variants parameter. The max_length parameter
    # ensures that all sentences are the same length.
    # total number of samples to generate
    num_samples = len(texts) * num_variants
    height, width = chars[0].shape  # shape of a single character image
    # set up an empty array for the images
    images = np.zeros((num_samples, height, width * max_length), np.float64)
    labels = []
    for i, item in tqdm(enumerate(texts)):
        padded_item = item[0:max_length] if (len(item) > max_length) else item.ljust(max_length, ' ')
        for v in range(num_variants):
            img = get_generated_image(padded_item, chars, index)
            images[i * num_variants + v, :, :] += img
            labels.append(padded_item)
    return images, labels
Generate Dataset - Labels
We need to integer-encode each sentence (a sequence of characters). We will use an inverse_mapping dictionary where the key is a character and the value is its corresponding integer encoding.

Fig 2: An example of integer encoding.
We will build an input pipeline using tf.data.
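As a hedged sketch of both steps, assuming the images and labels arrays returned by generate_sentences and an illustrative (not the notebook's actual) character vocabulary, and reusing the trainloader name that model.fit consumes later:

import numpy as np
import tensorflow as tf

# illustrative vocabulary; the notebook derives the real one from the EMNIST mapping
VOCAB = " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
inverse_mapping = {ch: i for i, ch in enumerate(VOCAB)}

def encode_label(sentence):
    # integer-encode each character of the (padded) sentence
    return np.array([inverse_mapping[ch] for ch in sentence], dtype=np.int32)

encoded_labels = np.stack([encode_label(s) for s in labels])

# the model defined below takes two inputs, so the dataset yields a dict
# keyed by the Input layer names ("cnn_input" and "label")
trainloader = (
    tf.data.Dataset.from_tensor_slices(
        {"cnn_input": np.array(images, np.float32), "label": encoded_labels}
    )
    .shuffle(256)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)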
Now on to a more exciting part of this application.
The Model
Before we get into the nitty-gritty of our model, take a good look at the model's Keras implementation. Once you have a mental picture of how the layers stack together, understanding the architecture will be easier.
def ImagetoSequence():
    # IMAGE ENCODER #
    labels = Input(name="label", shape=(None,), dtype="float32")
    image_input = Input(shape=(IMG_H, IMG_W), name="cnn_input")
    # reshape to add a channel dimension
    image_reshaped = Reshape((IMG_H, IMG_W, 1))(image_input)
    # extract patches of the image
    image_patches = Lambda(extract_patches)(image_reshaped)
    # CNN backbone that produces an embedding for each patch
    image_patch_encoder = ImagePatchEncoder()
    # the wrapper applies a layer to every temporal slice of an input
    time_wrapper = TimeDistributed(image_patch_encoder)(image_patches)

    # RECURRENT NETWORK #
    lstm_out = LSTM(128, return_sequences=True, name="lstm")(time_wrapper)
    softmax_output = Dense(NUM_CLASSES, activation='softmax', name="lstm_softmax")(lstm_out)
    output = CTCLayer(name="ctc_loss")(labels, softmax_output)

    return Model([image_input, labels], output)
Now let us deconstruct our ImagetoSequence model.
- We have two Input layers, so we are providing two inputs to our model: the training image and the training label itself. No, we are not cheating here; if you look closely, the labels input layer is only connected to the CTCLayer. To learn more about Connectionist Temporal Classification (CTC), check out the report on Text Recognition with CRNN-CTC Network by Rajesh Shreedhar Bhat.
- The input image is reshaped so that it has a channel dimension of 1. This step would not be needed for a colored image with 3 channels, or if our input data pipeline handled the reshaping.
- We then extract patches from the input image. Notice the Lambda layer, which wraps a custom function; we will get to it soon.
- We are calling another function, ImagePatchEncoder, inside our model. As you might guess, it returns a Keras model.
- We pass the returned model to a TimeDistributed layer. This wraps a layer (a model can act as a layer in Keras) so that it is applied to every temporal slice of an input; in other words, the patches produced by the Lambda layer are passed sequentially to the model returned by ImagePatchEncoder. Check out this blog post for a better understanding of how the TimeDistributed layer works.
- We then apply the LSTM layer followed by a Dense layer with softmax activation.
- We pass this softmax output to the CTCLayer. Check out the report linked above to learn more about it; a minimal sketch of such a layer is shown after this list.
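The linked report has the full details; as a point of reference, here is a minimal sketch of such a CTC loss layer in the spirit of the standard Keras OCR examples (the notebook's implementation may differ):

import tensorflow as tf
from tensorflow import keras

class CTCLayer(keras.layers.Layer):
    # minimal CTC loss layer: computes the CTC loss with
    # keras.backend.ctc_batch_cost, adds it to the model via add_loss,
    # and passes the softmax predictions through unchanged

    def __init__(self, name=None):
        super().__init__(name=name)
        self.loss_fn = keras.backend.ctc_batch_cost

    def call(self, y_true, y_pred):
        batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
        input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
        label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

        input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
        label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

        loss = self.loss_fn(y_true, y_pred, input_length, label_length)
        self.add_loss(loss)

        # at inference time we only need the predictions
        return y_pred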
Extract Patches - Wrapping With Lambda Layer
This layer collects patches from the input image as if applying a convolution. All extracted patches are stacked in the depth (last) dimension of the output. The extract_patches function is wrapped using the Lambda layer. Learn more about tf.image.extract_patches here and the Lambda layer here.
def extract_patches(image):
    kernel = [1, 1, PATCH_WIDTH, 1]
    strides = [1, 1, PATCH_STRIDE, 1]
    patches = tf.image.extract_patches(image, kernel, strides, [1, 1, 1, 1], 'VALID')
    patches = tf.transpose(patches, (0, 2, 1, 3))
    patches = tf.expand_dims(patches, -1)
    return patches
ImagePatchEncoder
This is the CNN backbone of our model architecture. Notice the input shape of the Input layer: it takes in the extracted patches and produces a flattened embedding for each patch. This embedding is then used by the Long Short-Term Memory (LSTM) recurrent network to map to a sequence. WOW!
def ImagePatchEncoder():
    patched_inputs = Input(shape=(IMG_H, PATCH_WIDTH, 1))
    x = Conv2D(32, kernel_size=(3, 3), activation='relu')(patched_inputs)
    x = Conv2D(64, kernel_size=(3, 3), activation='relu')(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    x = Dropout(0.2)(x)
    flattened_outputs = Flatten()(x)
    return Model(inputs=patched_inputs, outputs=flattened_outputs, name='backbone')
Let's end this section with model.summary()

Fig 3: Model Summary
Training
Now that we have our data and the model, it's time to train.
We will use Weights & Biases to track our training metrics and log our predictions. The WandbCallback API automatically logs the training metrics. Learn more about the Keras integration in this Colab notebook.
import wandb
from wandb.keras import WandbCallback

wandb.init(project='my-cool-project')

_ = model.fit(
    trainloader,
    validation_data=testloader,
    epochs=100,
    callbacks=[WandbCallback(), early_stopper],
)
Let's look at the training metrics. 👇 The model seems to have trained well. Next, let's look at the predictions.
Predictions
The first thing we need is a model for inference.
# notice that we are removing the CTCLayer
inference_model = Model(
    model.get_layer('cnn_input').input,
    model.get_layer('lstm_softmax').output
)
inference_model.summary()
Notice that we are discarding the labels layer (one of the two Input layers) as well as the CTCLayer. The inference_model will thus output the softmax predictions.
Let's pass in some test images and read some sentences.
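A common way to read sentences out of the softmax output is a greedy CTC decode followed by mapping the integers back to characters, and the results can be logged to a W&B table. Here is a hedged sketch, assuming a batch of preprocessed test_images and the inverse_mapping from earlier (the decode utility in the notebook may differ):

import numpy as np
import wandb
from tensorflow import keras

# invert the character -> integer mapping used for the labels
mapping = {i: ch for ch, i in inverse_mapping.items()}

def decode_predictions(pred):
    # greedy CTC decode: collapse repeated characters and drop the blank token
    input_len = np.ones(pred.shape[0]) * pred.shape[1]
    decoded, _ = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)
    texts = []
    for seq in decoded[0].numpy():
        texts.append("".join(mapping.get(int(i), "") for i in seq if i >= 0))
    return texts

preds = inference_model.predict(test_images)
decoded_texts = decode_predictions(preds)

# log a few predictions next to their inputs in a W&B table
table = wandb.Table(columns=["image", "prediction"])
for img, text in zip(test_images[:16], decoded_texts[:16]):
    table.add_data(wandb.Image(img), text)
wandb.log({"predictions": table})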
Final words
We can see that our model is performing well, surprisingly well in fact. Nevertheless, here are a few things you can try:
- You can see that the model confuses lowercase and uppercase letters in the text. This is because we used the EMNIST/bymerge dataset, which merges several lowercase and uppercase classes. A variant that keeps all cases separate is the EMNIST/byclass dataset. You should check out the EMNIST paper.
- Play with the backbone CNN model depth.
- Replace the LSTM with a BiLSTM.
- Use a learning rate scheduler.
- Use the SGD optimizer instead of the Adam optimizer (a sketch of these two tweaks follows below).
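For instance, the last two suggestions could look roughly like this, sketched against the objects defined earlier in the report rather than the notebook's exact code:

from tensorflow import keras
from wandb.keras import WandbCallback

# reduce the learning rate when the validation loss stops improving
lr_scheduler = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-5
)

# try SGD with momentum instead of Adam before re-training
# (the loss is added inside CTCLayer, so compile only sets the optimizer)
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9))

model.fit(
    trainloader,
    validation_data=testloader,
    epochs=100,
    callbacks=[WandbCallback(), lr_scheduler, early_stopper],
)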
I hope this report, along with the Colab notebook, will be useful in structuring your deep learning workflow. Try using Weights & Biases to track your experiments and write reports like this one to share your results.