How to Teach Your Computer Japanese

Ajay Arasanipalai, Student at the University of Illinois at Urbana-Champaign


Hi. Hello. こんにちは.

Those squiggly characters you just saw are from a language called Japanese. You’ve probably heard of it if you’ve ever watched Dragon Ball Z.


Here’s the problem though: you know those ancient Japanese scrolls that make you look like you’re going to unleash an ultimate samurai ninja overlord super combo move.

Yeah, those. I can’t exactly read them, and it turns out that very few people can.

Luckily, a bunch of smart people understands how important it is that I master the Bijudama-Rasenshuriken, so they invented this thing called deep learning.

So pack your ramen and get ready. In this article, I’ll show you how to train a neural network that can accurately predict Japanese characters from their images.

To ensure that we get good results, I’m going to use of an incredible deep learning library called fastai, which is a wrapper around PyTorch that makes it easy to implement best practices from modern research. You can read more about it on their docs.

With that said, let’s get started.


OK, so before we can create anime subtitles, we’re going to need a dataset. Today we’re going to focus on KMNIST.

This dataset takes of examples of characters from the Japanese Kuzushiji script, and organizes them into 10 labeled classes. The images measure 28x28 pixels, and there are 70,000 images in total, mirroring the structure of MNIST.

But why KMNIST? Well firstly, it has “MNIST” in its name, and we all know how much people in machine learning love MNIST.


So in theory, you could just change a few lines of that Keras code that you copy-pasted from Stack Overflow and BOOM! You now have computer code that can revive an ancient Japanese script.

Of course, in practice, it isn’t that simple. For starters, the cute little model that you trained on MNIST probably won’t do that well. Because, you know, figuring out whether a number is a 2 or a 5 is just a tad easier than deciphering a forgotten cursive script that only a handful of people on earth know how to read.

Animated GIF showing

Apart from that, I guess I should point out that Kuzushiji, which is what the “K” in KMNIST stands for, is not just 10 characters long. Unfortunately, I’m NOT one of the handfuls of experts that can read the language, so I can’t describe in intricate detail how it works.

But here’s what I do know: There are actually three variants of these Kuzushiji character datasets — KMNIST, Kuzushiji-49, and Kuzushiji-Kanji.

Kuzushiji-49 is variant with 49 classes instead of 10. Kuzushiji-Kanji is even more insane, with a whopping 3832 classes.

Yep, you read that right. It’s three times as many classes as ImageNet.

How to Not Mess Up Your Dataset

To keep things as MNIST-y as possible, it looks like the researchers who put out the KMNIST dataset kept it in the original format (man, they really took that whole MNIST thing to heart, didn’t they).

If you take a look at the KMNIST GitHub repo, you’ll see that the dataset is served in two formats: the original MNIST thing, and as a bunch of Numpy arrays.

Of course, I know you were probably too lazy to click that link. So here you go. You can thank me later.

GitHub screenshot showing the various download formats for the KMNIST dataset

Personally, I found the NumPy array format easier to work with when using fastai, but the choice is yours. If you’re using PyTorch, KMNIST comes for free as a part of torchvision.datasets.

The next challenge is actually getting those 10,000-year-old brush strokes onto your notebook (or IDE, who am I to judge). Luckily, the GitHub repo mentions that there’s this handy script called that’ll do all the work for us. Yay!

party popper

From here, it’ll probably start getting awkward if I continue talking about how to pre-process your data without actual code. So check out the notebook if you want to dive deeper.

Moving on…

Should I use a hyper ultra Inception ResNet XXXL?

Short Answer

Probably not. A regular ResNet should be fine.

A Little Less Short Answer

Ok, look. By now, you’re probably thinking, “KMNIST big. KMNIST hard. Me need to use very new, very fancy model.”

Did I overdo the Bizzaro voice?

The point is, you DON’T need a shiny new model to do well on these image classification tasks. At best, you’ll probably get a marginal accuracy improvement at the cost of a whole lot of time and money.

Most of the time, you’ll just waste a whole lot of time and money.

So heed my advice — just stick to good ol’ fashion ResNets. They work really well, they're relatively fast and lightweight (compared to some of the other memory hogs like Inception and DenseNet), and best of all, people have been using them for a while, so it shouldn’t be too hard to fine-tune.

If the dataset you’re working with is simple like MNIST, use ResNet18. If it’s medium-difficulty, like CIFAR10, use ResNet34. If it’s really hard, like ImageNet, use ResNet50. If it’s harder than that, you can probably afford to use something better than a ResNet.

Don’t believe me? Check out my leading entry for the Stanford DAWNBench competition from April 2019:

What do you see? ResNets everywhere! Now come on, there’s got to be a reason for that.

Hyperparameters Galore

A few months ago, I wrote an article on how to pick the right hyperparameters. If you’re interested in a more general solution to this herculean task, go check that out. Here, I’m going to walk you through my process of picking good-enough hyperparameters to get good-enough results on KMNIST.

To start off, let’s go over what hyperparameters we need to tune.

We’ve already decided to use a ResNet34, so that’s that. We don’t need to figure out the number of layers, filter size, number of filters, etc. since that comes baked into our model.

See, I told you it would save time.

So what’s remaining is the big three: learning rate, batch size, and the number of epochs (plus stuff like dropout probability for which we can just use the default values).

Let’s go over them one by one.

Number of Epochs

Let’s start with the number of epochs. As you’ll come to see when you play around with the model in the notebook, our training is pretty efficient. We can easily cross 90% accuracy within a few minutes.

So given that our training is so fast in the first place, it seems extremely unlikely that we would use too many epochs and overfit. I’ve seen other KMNIST models train for over 50 epochs without any issues, so staying in the 0-30 range should be absolutely fine.

That means within the scope of the restrictions we’ve put on the model when it comes to epochs, the more, the merrier. In my experiments, I found that 10 epochs strike a good balance between model accuracy and training time.

Learning Rate

What I’m about to say is going to piss a lot of people off. But I’ll say it anyway — We don’t need to pay too much attention to the learning rate.

Yep, you heard me right. But give me a chance to explain.

Instead of going “Hmm… that doesn’t seem to work, let’s try again with lr=3e-3 ,” we’re going to use a much more systematic and disciplined approach to finding a good learning rate.

We’re going to use the learning rate finder, a revolutionary idea proposed by Leslie Smith in his paper on cyclical learning rates.

Here’s how it works:

When all is said and done, and you plot the loss against the learning rate, you should see something like this:

Now, before you get all giddy and pick 1e-01 as the learning rate, I’ll have you know that it’s NOT the best choice.

That’s because fastai implements a smoothening technique called exponentially weighted averages, which is the deep learning researcher version of an Instagram filter. It prevents our plots from looking like the result of giving your neighbors’ kid too much time with a blue crayon.

Since we’re using a form of averaging to make the plot look smooth, the “minimum” point that you’re looking at on the learning rate finder isn’t actually a minimum. It’s an average.

Instead, to actually find the learning rate, a good rule of thumb is to pick the learning rate that’s an order of magnitude lower than the minimum point on the smoothened plot. That tends to work really well in practice.

I understand that all this plotting and averaging might seem weird if all you’ve been brute-forcing learning rate values all your life. So I’d advise you to check out Sylvain Gugger’s explanation of the learning rate finder to learn more.

Batch Size

OK, you caught me red-handed here. My initial experiments used a batch size of 128 since that’s what the top submission used.

I know, I know. Not very creative. But it’s what I did. Afterward, I experimented with a few other batch sizes, and I couldn’t get better results. So 128 it is!

In general, batch sizes can be a weird thing to optimize, since it partially depends on the computer you’re using. If you have a GPU with more VRAM, you can train on larger batch sizes.

So if I tell you to use a batch size of 2048, for example, instead of getting that coveted top spot on Kaggle and eternal fame and glory for life, you might just end up with a CUDA: out of memory error.

So it’s hard to recommend a perfect batch size because, in practice, there are clearly computational limits. The best way to pick it is to try out values that work for you.

But how would you pick a random number from the vast sea of positive integers?

Well, you actually don’t. Since GPU memory comes is organized in bits, it’s a good idea to choose a batch size that’s a power of 2 so that your mini-batches fit snugly in memory.

Here’s what I would do: start off with a moderately large batch size like 512. Then, if you find that your model starts acting weird and the loss is not on a clear downward trend, half it. Next, repeat the training process with a batch size of 256, and see if it behaves this time.

If it doesn’t, wash, rinse, and repeat.

A Few Pretty Pictures

With the optimizations going on here, it’s going to be pretty challenging to keep track of this giant mess of models, metrics, and hyperparameters that we’ve created.

To ensure that we all remain sane human beings while climbing the accuracy mountain, we’re going to use the wandb + fastai integration.

So what does wandb actually do?

It keeps track of a whole lot of statistics about your model and how it’s performing automatically. But what’s really cool is that it also provides instant charts and visualizations to keep track of critical metrics like accuracy and loss, all in real-time!

If that wasn’t enough, it also stores all of those charts, visualizations, and statistics in the cloud, so you can access them anytime anywhere.

Your days of starting at a black terminal screen and fiddling around with matplotlib are over.

The notebook tutorial for this article has a straightforward introduction to how it works seamlessly with fastai. You can also check out the wandb workspace, where you can take a look at all the stuff I mentioned without writing any code.



That means “this is the end.”

But you didn't need me to tell you that, did you? Not after you went through the trouble of getting a Japanese character dataset, using the learning rate finder, training a ResNet using modern best practices, and watching your model rise to glory using real-time monitoring in the cloud.

Yep, in about 20 minutes, you actually did all of that! Give yourself a pat on the back.

And please, go watch some Dragonball.

Join our mailing list to get the latest machine learning updates.