If you’ve taken a look at the state of the art benchmarks/leaderboards for ImageNet sometime in the recent past, you’ve probably seen a whole lot of this thing called “EfficientNet.”
Now, considering that we’re talking about a dataset of 14 million images, which is probably a bit more than you took on your last family vacation, take the prefix “Efficient” with a fat pinch of salt. But what makes the EfficientNet family special is that its models easily outperform other architectures with a similar computational cost.
In this article, we’ll discuss the core principles that govern the EfficientNet family. Primarily, we’ll explore an idea called compound scaling, a technique for efficiently scaling neural networks to make the best use of whatever additional computational resources you have.
In this report, I’ll present the results I got from trying the various EfficientNet scales on a dataset much smaller than ImageNet and much more representative of the real world. You’ll also be able to interactively visualize the results and answer the question that the title of this post asks: how efficient is EfficientNet?
Let’s begin.
The very first thing that the EfficientNet paper points out is that while scaling neural nets is something a lot of people do a lot of the time, traditionally, the methods we use to decide how to scale networks are not much more principled than flipping a coin, rolling a die, and sacrificing a frog’s foot to see a vision in the flames.
Specifically, they note that there are three primary ways you can beef up your network: depth, width, and resolution.
Depth is simply the number of layers in your network. It’s the thing that makes deep learning, you know, deep…
Width is the number of channels (or filters, or kernels, if that’s what you like to call them) that each convolution layer uses. In PyTorch, it’s the out_channels argument of the Conv2d layer.

torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')

Resolution is the size of the input images. In torchvision, it’s the size argument you supply to the Resize transform.

torchvision.transforms.Resize(size, interpolation=2)
Note that to a degree, this is pretty much specific to computer vision using ConvNets with a focus on classification i.e. ImageNet. However, there’s no reason the general idea can’t be adapted to other architectures/domains.
One thing you’ll notice here is that all of these things are super easy to modify. They’re literally just one argument in some cases. However, they all can, in fact, have a pretty big impact on your model’s performance.
And of course, these hyperparameters for model scaling have already been quite thoroughly explored individually.
ResNets are a really good example of this. A big chunk of the idea of ResNets was to see how far you could push the number of layers in a network. And in general, it seems to be the case that more layers lead to better performance (big asterisk here, since there have been plenty of studies that investigate this claim).
There’s another slightly less popular architecture called the WideResNet, which is, as you probably guessed, a ResNet scaled along the width dimension (that is, a ResNet with more channels in the Conv layers). This works well not only because of the obvious increase in parameter count, but also because the underlying convolution CUDA kernels are written to parallelize across multiple filters.
So you get better performance at nearly the same speed as you would have gotten if you used fewer channels.
Image size is such a simple idea that there really isn’t a particular model or paper to point to. Intuitively, increasing the image size or resolution provides more information to the network. Using more input information, the model can make better predictions – at the cost of more computation, of course.
More recently, progressive resizing is becoming a fairly popular technique across computer vision tasks. To the best of my knowledge, this was first used effectively in Nvidia’s ProGAN model, where they gradually increase the image size as training progresses. To be clear, EfficientNet and compound scaling have nothing to do with progressive resizing, but I thought this was something worth pointing out anyway.
Now, individually, increasing depth, width, and resolution certainly will make your network better. However, as the ancient proverb says, all good things must eventually come to an end.
At some point, the performance gains you get by increasing depth, width, and resolution start plateauing. And considering the added computational cost of scaling along these dimensions, it’s not really worth it once you cross a certain threshold.
Here, take a look at a plot from the EfficientNet paper that highlights this idea:
The three plots show how the performance of the model, measured as top-1 accuracy on ImageNet, changes as you scale width (left), depth (middle), and resolution (right). They all initially improve performance, but the gains start leveling off.
But if you stare at those plots side-by-side for a little while, you’ll notice that there’s a much better way to scale networks: instead of increasing just one of the three key scaling parameters by a large amount, scale width, depth, and resolution together, in a balanced way!
This works because if you scale each hyperparameter by a bit, but not too much, you stay in the region where each dimension still yields a large increase in performance, so overall, it takes longer to reach the plateau.
This initially seems like an “Eh, ok. Whatever, I guess it kind of works” sort of thing, but if you really think about it, this idea of balanced scaling is quite profound.
Let’s put it this way. A lot of deep learning today more or less boils down to where you want to allocate compute resources in your model. Specifically, I’m going to measure “compute resources” in terms of FLOPs (floating-point operations) and the number of parameters in the network.
The number of FLOPs is a direct measure of computation that can be calculated using standard tools. And as you’re probably aware – increasing the number of parameters, however you choose to do that, isn’t exactly free – which is why we don’t just set the number of layers to the largest number that our brains can conjure.
So the question the authors of the EfficientNet paper are really trying to answer is this — given that I have the budget to add x amount of FLOPs/parameters to my network, how do I do so in the way that gives me the best possible performance?
The answer they came up with is an algorithm for compound scaling.
But first, we need to rigorously define how exactly we’re going to scale the depth, width, and resolution. The authors describe this using three numbers: d, w, and r.
You start with a baseline network, say a ResNet18 that takes 224x224 images as input. For the default network, all three numbers are initially 1. Now suppose we increase the number of layers to 36 (i.e. double the number of layers). This would correspond to d=2. Similarly, setting w=3 would mean using 3 times as many filters in each convolution layer. Resolution works the same way: setting r=1.5 would correspond to changing the input image size to 336x336 (since 224*1.5=336).
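To make the convention concrete, here’s a tiny sketch of how the three numbers map onto an actual network. The baseline figures are just the hypothetical ResNet18 example from above (with 64 channels assumed for illustration), not anything from the paper:

```python
def scale_network(d=1.0, w=1.0, r=1.0,
                  base_depth=18, base_width=64, base_res=224):
    """Apply the (d, w, r) scaling coefficients to a baseline network
    described by its layer count, channel count, and input size."""
    depth = round(base_depth * d)   # d=2   -> 36 layers
    width = round(base_width * w)   # w=3   -> 192 channels
    res = round(base_res * r)       # r=1.5 -> 336x336 input
    return depth, width, res

print(scale_network(d=2))    # (36, 64, 224)
print(scale_network(w=3))    # (18, 192, 224)
print(scale_network(r=1.5))  # (18, 64, 336)
```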
This simple convention will be super useful as we get into the exact compound scaling technique that the EfficientNet paper introduces, which we will discuss exactly three lines from now.
Oh, by the way, did I tell you that there’s a version of EfficientNet for object detection?
It’s called (hilariously, in my opinion) EfficientDet. Get it? “Det” for detection.
Ok, at this point I’m just stalling and filling space to keep good on my word. Onto compound scaling!
What we need now is a way to set those three numbers that we talked about above (d, w, and r, in case, like me, you have the attention span of a goldfish).
Here’s the method that EfficientNet says we should use:
$d = \alpha^\phi$
$w= \beta^\phi$
$r= \gamma^\phi$
$\alpha> 1, \beta>1, \gamma>1$
$\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$
In the equations above, $\alpha, \beta, \gamma$ are constants that represent how much to scale the individual dimensions by, and $\phi$ is a variable that represents how much additional computational resources you have.
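In code, the compound scaling rule is just three exponentials. A minimal sketch (the constants passed in here are only example values; any constants greater than 1 work the same way):

```python
def compound_scale(phi, alpha, beta, gamma):
    """Return the (d, w, r) multipliers for compute budget phi,
    given per-dimension scaling constants alpha, beta, gamma (all > 1)."""
    return alpha ** phi, beta ** phi, gamma ** phi

# phi = 0 leaves the baseline untouched: all multipliers are 1...
print(compound_scale(0, 1.2, 1.1, 1.15))  # (1.0, 1.0, 1.0)

# ...and each +1 in phi multiplies every dimension by its constant once more.
d, w, r = compound_scale(2, 1.2, 1.1, 1.15)
print(d, w, r)  # roughly (1.44, 1.21, 1.32)
```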
Geez! That seems unnecessarily complicated. Why all the business with the powers and exponents? Wasn’t this supposed to be a simple idea?
Well, pay attention, because this is about to blow 2% of the minds reading this article.
What’s the most common operation in a ConvNet? You guessed it. It’s convolution. In terms of FLOPs, the cost of convolution scales linearly with d, but quadratically with w and r, which is fancy speak for saying that if you double the depth, you double the computational cost, but if you double the width or resolution, you quadruple the computational cost.
Putting the math together, this means that scaling a convolutional network with the parameter $\phi$ using these rules results in a new network whose computational cost is approximately $2^\phi$ times that of the baseline (because $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$), which gives us a clean way of measuring cost and scaling networks accordingly.
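Here’s a back-of-the-envelope FLOP count that shows where the linear/quadratic behavior comes from. This is a deliberate simplification (a stack of identical conv layers, no strides or pooling), not the paper’s exact accounting:

```python
def conv_net_flops(depth, channels, resolution, kernel=3):
    """Approximate multiply-adds for `depth` conv layers, each mapping
    `channels` input channels to `channels` output channels on a
    resolution x resolution feature map (strides/pooling ignored)."""
    per_layer = resolution**2 * channels**2 * kernel**2
    return depth * per_layer

base = conv_net_flops(depth=10, channels=64, resolution=224)

print(conv_net_flops(20, 64, 224) / base)   # 2.0 -> doubling depth doubles cost
print(conv_net_flops(10, 128, 224) / base)  # 4.0 -> doubling width quadruples it
print(conv_net_flops(10, 64, 448) / base)   # 4.0 -> doubling resolution quadruples it
```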
If all that math flew over your head, intuitively: setting $\phi$ to 0 gives you the baseline model. Setting $\phi$ to 1 results in a model with roughly twice the computational cost of the baseline. Setting $\phi$ to 2 doubles the cost again, relative to the $\phi=1$ model. And so on…
Yet another way of looking at this form of compound scaling is to think of it as multiplying the depth, width, and resolution of a network each by some constant value (like 1.5) at each iteration of scaling. In this picture, $\phi$ represents the iteration index for the scaling procedure. That is, setting $\phi=5$ would be the result of scaling the network 5 times.
Nice. But you might have noticed one small issue: how do we set those scaling constants (called $\alpha$, $\beta$, and $\gamma$ in the paper)?
In the paper, the authors suggest using a small grid search (trying out a bunch of values and seeing what works).
If you’re curious, the specific values they get from running experiments on the EfficientNet architecture (which we’ll talk about shortly) are $\alpha=1.2, \beta=1.1, \gamma=1.15$.
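A quick sanity check that these constants behave as advertised: since FLOPs grow linearly in depth and quadratically in width and resolution, the per-step cost multiplier is $\alpha \cdot \beta^2 \cdot \gamma^2$, which should land close to 2:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # values reported in the paper

# FLOPs scale linearly in depth and quadratically in width/resolution,
# so each increment of phi multiplies cost by alpha * beta^2 * gamma^2.
cost_per_step = alpha * beta**2 * gamma**2
print(round(cost_per_step, 3))  # 1.92, close enough to 2

# Multipliers for the first few scaling steps:
for phi in range(4):
    d, w, r = alpha**phi, beta**phi, gamma**phi
    print(phi, round(d, 3), round(w, 3), round(r, 3))
```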
Perfect! Now we know exactly how to scale networks, in precise terms. But there’s another really important part about the EfficientNet paper that we’re missing — EfficientNet!
The core of what we’ve talked about so far is model scaling. But this assumes that we have a good starter model that we can take advantage of to begin with.
A “good model” in this context is one that gets good performance while minimizing the total FLOPs, since that is what we’ll be scaling with respect to.
Since they can, the authors use neural architecture search (which I guess is not too bad in this case, since they’re only trying to find the smallest network in the scaling series) and come up with something that looks like this:
MBConv, if you’re unfamiliar with it, is the inverted-residual block introduced in MobileNetV2. Other than that, what can I really say? It’s a model made by architecture search, and it works.
What’s next? Scaling!
The baseline network is called EfficientNet-B0; the first scaled version, which has roughly twice the computational cost, is called EfficientNet-B1; and so on, up to EfficientNet-B7.
How well does this work? The paper has a really cool table that answers exactly this. I’ll let the results, which include state of the art on ImageNet and order-of-magnitude improvements over similarly performing networks, speak for themselves:
Congratulations, you now have two powerful tools in your arsenal: a principled way to scale your network and a set of state-of-the-art models that implement these best practices by default.
We also explored my results from trying the EfficientNet models on smaller datasets, and saw whether the reduction in FLOPs that EfficientNet claims to bring actually translates to reduced wall-clock time.
But even if you decide not to ever use EfficientNet in your deep learning career, hopefully you’ve learnt a far more valuable lesson — that in a field where top results are typically dominated by hulking, memory hungry Goliaths with their armor of massive GPU arrays, maybe there is still a place for a young shepherd boy named David with his slick leather sling and bag full of small yet effective training techniques to achieve what was thought to be impossible.
Maybe, just maybe, it pays to occasionally be a tad more… efficient.