EMNIST to Wikisplit: Sentence Composition

Build a CNN+RNN model to read sentences from images. Made by Ayush Thakur using Weights & Biases

Introduction

Image classification is the task of assigning a whole image to a single class. But what if we want to read a sentence written in an image?

One naive approach would be to crop multiple patches of the image and pass those patches through a character classification model trained on the EMNIST dataset. However, such an approach has many shortcomings.

Let's build a fun application that leverages Convolutional Neural Networks for extracting features from the image and Recurrent Neural Networks for sequence modeling. This way, we can read entire sentences at a time.

An example of where this may be useful is a Google Translate-style application that reads text from a mobile phone camera and then tries to translate it. Another example is an assistive app for visually impaired people that reads road signs aloud for them.

At a high level, this approach will extract fixed-width patches from the input image, encode each patch with a CNN, and feed the resulting sequence of embeddings to a recurrent network trained with a CTC loss.

This report will not focus on the theoretical underpinnings of this application; rather, it will briefly cover the implementation details. We will start from data preparation and work all the way up to building our architecture, and finally use Weights & Biases to track our model training and log predictions.

Try it yourself in Google Colab $\rightarrow$

Data Preparation

We will be preparing our dataset by using two separate datasets: EMNIST/bymerge and Wiki-split.

The former is a handwritten character dataset, while the latter contains one million English sentences extracted from Wikipedia edits, each split into two sentences that together preserve the original meaning.

Download Data

We start by downloading both datasets. To download the EMNIST/bymerge dataset, we can use TensorFlow Datasets. We can also use the EMNIST/byclass dataset (this will require minor changes in the code).

import tensorflow_datasets as tfds

# Gather EMNIST/bymerge dataset
train_ds, validation_ds = tfds.load(
    "emnist/bymerge",
    split=["train[:85%]", "train[85%:]"],
    as_supervised=True
)

To download the Wiki-split dataset, simply git clone the official repo.

!git clone https://github.com/google-research-datasets/wiki-split

Learn more about the EMNIST dataset here and about Wiki-split here.
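Purely for illustration, here is one way the sentences list used later could be built from the cloned repository. This is a hedged sketch: the exact file names and parsing logic in the Colab notebook may differ, and the lowercasing here is only an assumption to keep the characters close to what EMNIST covers.

# Hypothetical sketch: read sentences from the cloned wiki-split repo.
# Assumes a tab-separated file whose first column holds the original sentence;
# check the repo and the notebook for the exact file names and format.
import csv

sentences = []
with open("wiki-split/validation.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        if row:
            sentences.append(row[0].lower())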

Preprocess Data

The next few steps involve converting the EMNIST examples into NumPy arrays of images and labels:

import numpy as np
import tensorflow as tf
from tqdm import tqdm

IMAGES = []
LABELS = []

# trainloader iterates over the single-character EMNIST examples (e.g. the train_ds split loaded above)
for img, label in tqdm(trainloader):
    IMAGES.append(tf.reshape(img, (28,28)).numpy())
    LABELS.append(label.numpy())

IMAGES = np.array(IMAGES)
LABELS = np.array(LABELS)

Generate Dataset - Images

Our training images look like this:

Fig 1: A view of our training images

Thus, each training image is built by concatenating randomly chosen EMNIST character images according to a Wiki-split sentence. We will use two utility functions, get_sample_sentences and get_generated_image. Check them out in the Colab notebook under the "Utils to generate images from the text" section.

Finally, we will build our training dataset using the generate_sentences function shown below. It takes text lines and generates images. It can generate multiple images per sentence, as controlled by the num_variants parameter, while the max_length parameter ensures that all sentences are the same length.

train_sentences = sentences[:2000]
test_sentences = sentences[2000:2500]

# Let's assume that for each training sample, 2 variants will be generated

def generate_sentences(texts, chars, index, num_variants=2, max_length=32):
    # this method takes input text lines, character samples and labels
    # and generates images. It can generate multiple images per sentence
    # as controlled by num_variants parameter. max_length parameter
    # ensures that all sentences are the same length
    
    # total number of samples to generate
    num_samples = len(texts) * num_variants
    height, width = chars[0].shape  # shape of image
    
    # setup empty array of the images
    images = np.zeros((num_samples, height, width * max_length), np.float64)
    labels = []
    
    for i, item in tqdm(enumerate(texts)):
        padded_item = item[0:max_length] if (len(item) > max_length) else item.ljust(max_length, ' ')
        
        for v in range(num_variants):
            img = get_generated_image(padded_item, chars, index)
            images[i*num_variants+v, :, :] += img
            labels.append(padded_item)
    
    return images, labels
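Both utility functions live in the Colab notebook rather than in this report. Purely for illustration, a rough sketch of what get_generated_image might look like, assuming index maps each character to the indices of matching EMNIST samples in chars, is shown below.

import numpy as np

# Hypothetical sketch only; see the notebook for the actual implementation.
def get_generated_image(sentence, chars, index):
    height, width = chars[0].shape
    canvas = np.zeros((height, width * len(sentence)), np.float64)
    for pos, ch in enumerate(sentence):
        if ch == ' ':
            continue  # leave an empty slot for spaces
        sample_idx = np.random.choice(index[ch])  # pick a random sample of this character
        canvas[:, pos * width:(pos + 1) * width] = chars[sample_idx]
    return canvas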

Generate Dataset - Labels

We need to integer-encode our sentences (made of characters). We will use an inverse_mapping dictionary where the key is a character and the value is its corresponding integer encoding.

Fig 2: An example of integer encoding.
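As a small, hypothetical illustration of this encoding (the actual character set and ids come from the EMNIST label mapping in the notebook):

import numpy as np

inverse_mapping = {" ": 0, "T": 1, "h": 2, "e": 3}  # assumed toy mapping

def encode_label(sentence):
    # map every character of the padded sentence to its integer id
    return np.array([inverse_mapping[ch] for ch in sentence], dtype=np.int32)

print(encode_label("The "))  # -> [1 2 3 0]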

We will build an input pipeline using tf.data.
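A minimal sketch of what such a pipeline could look like, assuming train_images/train_labels and test_images/test_labels are the arrays produced above (these names are assumptions) and feeding dictionaries keyed by the model's input names, "cnn_input" and "label":

import tensorflow as tf

BATCH_SIZE = 64  # assumed

trainloader = (
    tf.data.Dataset.from_tensor_slices(
        {"cnn_input": train_images, "label": train_labels}
    )
    .shuffle(1024)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

testloader = (
    tf.data.Dataset.from_tensor_slices(
        {"cnn_input": test_images, "label": test_labels}
    )
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)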

Now on to a more exciting part of this application.

The Model

Before we get into the nitty-gritty of our model, take a good look at the model's Keras implementation. Once you have a mental model of how the layers stack together, understanding the architecture will be easier.

from tensorflow.keras.layers import Input, Reshape, Lambda, TimeDistributed, LSTM, Dense
from tensorflow.keras.models import Model

def ImagetoSequence():
  # IMAGE ENCODER #
  labels = Input(name="label", shape=(None,), dtype="float32")
  image_input = Input(shape=(IMG_H, IMG_W), name="cnn_input")
  # reshape to add dimensions
  image_reshaped = Reshape((IMG_H, IMG_W, 1))(image_input)
  # extract patches of images
  image_patches = Lambda(extract_patches)(image_reshaped)
  # get CNN backbone architecture to get embedding for each patch
  image_patch_encoder = ImagePatchEncoder()
  # Wrapper allows to apply a layer to every temporal slice of an input.
  time_wrapper = TimeDistributed(image_patch_encoder)(image_patches)

  # RECURRENT NETWORK #
  lstm_out = LSTM(128, return_sequences=True, name="lstm")(time_wrapper)
  softmax_output = Dense(NUM_CLASSES, activation='softmax', name="lstm_softmax")(lstm_out)

  output = CTCLayer(name="ctc_loss")(labels, softmax_output)

  return Model([image_input, labels], output)

Now let us deconstruct our ImagetoSequence model.

Extract Patches - Wrapping with Lambda Layer

This layer collects patches from the input image, as if applying a convolution. All extracted patches are stacked in the depth (last) dimension of the output. The extract_patches function is wrapped using the Lambda layer. Learn more about tf.image.extract_patches here and the Lambda layer here.

def extract_patches(image):
    kernel = [1, 1, PATCH_WIDTH, 1]
    strides = [1, 1, PATCH_STRIDE, 1]
    patches = tf.image.extract_patches(image, kernel, strides, [1, 1, 1, 1], 'VALID')
    patches = tf.transpose(patches, (0, 2, 1, 3))
    patches = tf.expand_dims(patches, -1)
    return patches
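To get a feel for what comes out of this function, here is a quick shape check with assumed constants (28x28 EMNIST glyphs, 32-character sentences, patches one glyph wide, and a stride of half a glyph; the notebook's PATCH_WIDTH and PATCH_STRIDE may differ):

import tensorflow as tf

IMG_H, IMG_W = 28, 28 * 32          # assumed image size
PATCH_WIDTH, PATCH_STRIDE = 28, 14  # assumed patch settings

dummy = tf.zeros((1, IMG_H, IMG_W, 1))
patches = extract_patches(dummy)
# shape: (batch, num_patches, IMG_H, PATCH_WIDTH, 1),
# where num_patches = (IMG_W - PATCH_WIDTH) // PATCH_STRIDE + 1
print(patches.shape)  # (1, 63, 28, 28, 1) with the assumed constants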

ImagePatchEncoder

This is the CNN backbone of our model architecture. Notice the input shape of the Input layer: it takes in the extracted patches and produces a flattened embedding for each patch. This embedding is then used by the Long Short-Term Memory (LSTM) recurrent network to map to a sequence. WOW!

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten

def ImagePatchEncoder():
  patched_inputs = Input(shape=(IMG_H, PATCH_WIDTH, 1))
  x = Conv2D(32, kernel_size=(3, 3), activation='relu')(patched_inputs)
  x = Conv2D(64, kernel_size=(3, 3), activation='relu')(x)
  x = MaxPooling2D(pool_size=(2, 2))(x)
  x = Dropout(0.2)(x)
  flattened_outputs = Flatten()(x)

  return Model(inputs=patched_inputs, outputs=flattened_outputs, name='backbone')
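One layer we have not unpacked is CTCLayer, which computes the Connectionist Temporal Classification loss between the labels and the softmax output and is only needed during training. Its implementation is in the Colab notebook; a minimal sketch following the standard Keras pattern (a custom layer that calls keras.backend.ctc_batch_cost and adds the result via add_loss) would look roughly like this, though the notebook's version may differ:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class CTCLayer(layers.Layer):
    """Computes the CTC loss, adds it to the model, and passes predictions through."""
    def __init__(self, name=None):
        super().__init__(name=name)
        self.loss_fn = keras.backend.ctc_batch_cost

    def call(self, y_true, y_pred):
        batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
        input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
        label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

        # ctc_batch_cost expects per-sample sequence lengths
        input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
        label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

        self.add_loss(self.loss_fn(y_true, y_pred, input_length, label_length))
        # at inference time only y_pred is used
        return y_pred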

Let's end this section with model.summary()

Fig 3: Model Summary

Training

Now that we have our data and the model, it's time to train our model.

We will use Weights & Biases to track our training metrics and log our predictions. The WandbCallback API automatically logs our training metrics. Learn more about the Keras integration in this Colab notebook.

import wandb
from wandb.keras import WandbCallback

wandb.init(project='my-cool-project')

# early_stopper is an EarlyStopping callback (tf.keras.callbacks.EarlyStopping) configured in the notebook
_ = model.fit(trainloader,
             validation_data=testloader,
             epochs=100,
             callbacks=[WandbCallback(),
                        early_stopper])

Let's look at the training metrics. :point_down: The model seems to have trained well. Let's look at the predictions.


Predictions

The first thing that is required is to build a model for inference.

# notice we are removing CTCLayer
inference_model = Model(model.get_layer('cnn_input').input, model.get_layer('lstm_softmax').output)
inference_model.summary()

Notice that we are discarding the labels layer (one of the two Input layers) as well as the CTCLayer. The inference_model will thus output the softmax predictions.

Let's pass in some test images and read some sentences.
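Since the network was trained with a CTC loss, the raw softmax output needs to be decoded back into characters. Here is a minimal sketch using greedy CTC decoding; it assumes mapping is the integer-to-character inverse of inverse_mapping and that testloader yields dictionaries keyed by the model's input names (both assumptions, matching the earlier sketches rather than the notebook).

import numpy as np
import tensorflow as tf

def decode_predictions(pred, max_length=32):
    # greedy CTC decoding; input_length is the number of timesteps per sample
    input_len = np.ones(pred.shape[0]) * pred.shape[1]
    decoded = tf.keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0]
    decoded = decoded[:, :max_length].numpy()
    texts = []
    for seq in decoded:
        texts.append("".join(mapping.get(int(i), "") for i in seq if i != -1))
    return texts

for batch in testloader.take(1):
    preds = inference_model.predict(batch["cnn_input"])
    print(decode_predictions(preds))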


Final words

We can see that our model performs well, surprisingly well in fact. Nevertheless, here are a few things you can try:

I hope this report, along with the Colab notebook, will be useful in structuring your deep learning workflow. Try using Weights & Biases to track your experiments and write reports like this to share your results.