Understanding PolyNLoss for Image Classification
Understand the idea behind PolyLoss; build an image classifier using CE Loss and Poly1Loss and compare them
Introduction
In this post, we will understand how PolyLoss works, based on the paper PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions, and implement it on an image classification task. We will proceed as follows:
- Understanding PolyNLoss
  - Quick overview of CrossEntropy Loss
  - CrossEntropy (CE) Loss as an infinite series
  - PolyNLoss by perturbing coefficients in CE Loss
- Implementation in PyTorch
  - Building an image classification pipeline in fastai
  - Writing the PolyN loss function in PyTorch
  - Compare classifiers trained using CE loss vs Poly1 loss
Understanding PolyNLoss
In essence, the PolyNLoss function is a generalised form of CrossEntropy Loss. To motivate its formulation, it helps to first review the vanilla cross-entropy loss function.
Quick overview of CrossEntropy Loss
Given two distributions $y$ and $P$, which can be represented as $k$-dimensional vectors, the cross-entropy loss is defined as

$$L_{CE} = -\sum_{i=1}^{k} y_i \log P_i$$

In any classification problem, $y$ and $P$ are the target distribution (i.e. one-hot encoded) and the distribution output by the neural network respectively.

For most single-label classification problems, since $y$ is one-hot encoded, all of its components are 0 except for the single component corresponding to the true class of that data point, which is 1. With this fact, the above equation can be simplified and rewritten as

$$L_{CE} = -\log(P_t)$$

where $P_t$ is the probability the network assigns to the true class $t$.
In minibatch gradient descent, this per-instance loss is aggregated with an appropriate reduction (usually the mean) and then backpropagated for the gradient update. For now, let us stay at the single-instance level as above and expand the CE loss.
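To make the reduced form concrete, here is a small PyTorch check with hypothetical logits, showing that the framework's cross entropy is exactly the negative log-probability of the true class:

import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of 2 samples and 4 classes
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1],
                       [0.3, 1.2, 0.7, -0.5]])
targets = torch.tensor([0, 1])

# Probability assigned to the true class of each sample
pt = logits.softmax(dim = -1)[torch.arange(len(targets)), targets]

# -log(Pt) per sample matches cross entropy with no reduction applied
print(-pt.log())
print(F.cross_entropy(logits, targets, reduction = 'none'))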
CE Loss as an infinite series
A sufficiently smooth function can be expressed using a Taylor series as an infinite sum of terms which involve the derivatives of the function at a point $a$. In general, the Taylor series can be written as

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n$$
Using this, we can expand the reduced cross-entropy loss $f(P_t) = -\log(P_t)$ around the point $a = 1$ (chosen for convenience, since a perfect prediction corresponds to $P_t = 1$):

$$-\log(P_t) = f(1) + f'(1)(P_t - 1) + \frac{f''(1)}{2!}(P_t - 1)^2 + \frac{f'''(1)}{3!}(P_t - 1)^3 + \cdots$$

Substituting $f(1) = 0$, $f'(1) = -1$, $f''(1) = 1$, $f'''(1) = -2$ and so on for all the terms involving derivatives at $1$ (since $a = 1$), we finally get

$$L_{CE} = -\log(P_t) = \sum_{j=1}^{\infty} \frac{1}{j}(1 - P_t)^j = (1 - P_t) + \frac{1}{2}(1 - P_t)^2 + \frac{1}{3}(1 - P_t)^3 + \cdots$$

Now, if we substitute $x = 1 - P_t$ and denote the coefficients by $\alpha_j = 1/j$, we can view the above equation as

$$L_{CE} = \sum_{j=1}^{\infty} \alpha_j x^j$$
We can interpret the above function as a combination of several powers of $x$, each weighted with a fixed coefficient $\alpha_j$. Wouldn't it be great if we could tweak the $\alpha_j$ for each term to suit the downstream task at hand (in our case classification, but it could be any other task as well)? This is the central idea behind PolyNLoss.
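Before perturbing anything, it is worth convincing ourselves numerically that the expansion really does recover the CE loss. A quick check in Python, using a hypothetical value for $P_t$:

import math

# Hypothetical probability assigned to the true class
pt = 0.7

# Truncated Taylor series of -log(pt) around pt = 1
approx = sum((1 - pt) ** j / j for j in range(1, 51))

print(-math.log(pt))  # 0.3566...
print(approx)         # converges to the same value as more terms are included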
PolyNLoss by perturbing coefficients in CE Loss
As shown above, if we could modify all the alphas, or at least the leading ones (since very high powers of $(1 - P_t)$ quickly tend to zero), based on the task at hand, it might benefit the backpropagation process. Computationally, however, this would mean tuning a lot of hyperparameters. Suppose, then, that we decide to adjust only the first $N$ coefficients of the infinite series by perturbations $\epsilon_1, \ldots, \epsilon_N$ as follows:

$$L_{Poly\text{-}N} = \sum_{j=1}^{N} \left(\frac{1}{j} + \epsilon_j\right)(1 - P_t)^j + \sum_{j=N+1}^{\infty} \frac{1}{j}(1 - P_t)^j$$
Now, if we separate out the epsilon terms, we end up with

$$L_{Poly\text{-}N} = \sum_{j=1}^{N} \epsilon_j (1 - P_t)^j + \sum_{j=1}^{\infty} \frac{1}{j}(1 - P_t)^j$$
The second sum here is the Taylor series expansion of the CE loss, and the leading terms are a weighted combination of the first $N$ terms that occur in that expansion, so we can finally write our loss function as

$$L_{Poly\text{-}N} = L_{CE} + \sum_{j=1}^{N} \epsilon_j (1 - P_t)^j$$

In particular, the simplest member of this family, Poly-1, is $L_{Poly\text{-}1} = L_{CE} + \epsilon_1 (1 - P_t)$.
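As a quick worked example with hypothetical numbers: if the model assigns $P_t = 0.8$ to the true class and we pick $\epsilon_1 = 2$, the plain CE loss is $-\log(0.8) \approx 0.223$, while the Poly-1 loss is $0.223 + 2 \times (1 - 0.8) \approx 0.623$. The extra term adds a larger penalty the less confident the model is about the true class.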
In section 4 of the paper, the authors discuss the effects of these perturbations and claim that adjusting the first polynomial coefficient leads to maximal gain while requiring minimal code change and hyperparameter tuning. In the subsequent sections, we shall therefore implement our own version of Poly1Loss from scratch in pytorch, with the help of the fastai library, on an open-source dataset to bolster our practical understanding of this concept.
Implementation in PyTorch
The dataset we will be looking at for this classification demonstration is the Oxford 102 Flowers dataset. It consists of images of flowers classified into 102 different types. Here is what a small slice of images from the dataset looks like.

Oxford 102 Flowers Dataset
The split of these datapoints across different sets is as follows

We can see that most of the dataset is contained in the test set and only a fraction of it is in the train and validation sets. Both train and validation have around 1k images each, which means roughly 10 images per class on average in each of these sets, whereas the test set is substantially larger. Let us split this chart further at the class level and look at the counts per split.
As seen above, we can conclude that there is some class imbalance, but only in the test set and not in the training/validation sets. This means we can safely assume that we will not have to do anything special to tackle label imbalance during training, because there is none.
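If you want to verify this yourself, here is a minimal sketch with pandas, assuming the metadata dataframe df (built in the next section) with the SetType and ImgLabel columns used in the pipeline code below:

import pandas as pd

# Count images per class within each split
counts = df.groupby(["SetType", "ImgLabel"]).size().unstack(fill_value = 0)
print(counts.loc["train"].describe())  # per-class counts in the training set
print(counts.loc["valid"].describe())  # per-class counts in the validation set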
Building an image classification pipeline in fastai
Once we create a csv containing the basic information about the dataset, i.e. the input image path, the label and the split type, we can very easily define our dataloaders in fastai.
from fastai.vision.all import *

# Define getter for the input ImageBlock
def get_x(row):
    return f'../data/oxford-102-flowers/{row["ImgPath"]}'

# Define getter for the output CategoryBlock
def get_y(row):
    return row["ImgLabel"]

# Define train/validation splitter
def splitter(df):
    train_idxs = df[df.SetType == "train"].index.tolist()
    valid_idxs = df[df.SetType == "valid"].index.tolist()
    return (train_idxs, valid_idxs)

# Define CPU based item transforms here
def get_item_tfms(size):
    return Resize(size, pad_mode = PadMode.Zeros, method = ResizeMethod.Pad)

# Define GPU based augmentation transforms here
def get_aug_tfms():
    proba = 0.3
    h = Hue(max_hue = 0.3, p = proba, draw = None, batch = False)
    s = Saturation(max_lighting = 0.3, p = proba, draw = None, batch = False)
    ag_tfms = aug_transforms(mult = 1.00, do_flip = True, flip_vert = False, max_rotate = 5,
                             min_zoom = 0.9, max_zoom = 1.1, max_lighting = 0.5, max_warp = 0.05,
                             p_affine = proba, p_lighting = proba, xtra_tfms = [h, s],
                             size = 224, mode = 'bilinear', pad_mode = "zeros",
                             align_corners = True, batch = False, min_scale = 0.75)
    return ag_tfms

# Define a function to retrieve the dataloaders
# Use the subordinate functions defined above for the same
def get_dls(df, BATCH_SIZE = 16):
    datablock = DataBlock(blocks = (ImageBlock, CategoryBlock),
                          get_x = get_x,
                          get_y = get_y,
                          splitter = splitter,
                          item_tfms = Resize(size = 460),
                          batch_tfms = get_aug_tfms())
    dls = datablock.dataloaders(source = df, bs = BATCH_SIZE, drop_last = True)
    return dls
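A minimal usage sketch, assuming df is the metadata dataframe described above (with ImgPath, ImgLabel and SetType columns):

# Build the dataloaders from the metadata dataframe and inspect a batch
dls = get_dls(df, BATCH_SIZE = 16)
dls.show_batch(max_n = 9)
print(dls.c)  # number of classes, 102 for this dataset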
All we need is an ImageBlock as the input, a CategoryBlock as the output, and a few helper functions that extract the input and output in the required format from the dataframe. Next, we can get into the meat of today's topic, which is the implementation of Poly1Loss.
Writing the PolyN loss function in PyTorch
import torch
import torch.nn as nn

class PolyLoss(nn.Module):
    def __init__(self, epsilon = [2], N = 1):
        # By default use Poly-1 loss with epsilon1 = 2
        super().__init__()
        self.epsilon = epsilon
        self.N = N

    def forward(self, pred_logits, target):
        # Get probabilities from the logits
        probas = pred_logits.softmax(dim = -1)
        # Pick out the probability of the actual class for each sample
        pt = probas[range(pred_logits.shape[0]), target]
        # Compute the plain cross entropy
        ce_loss = -1 * pt.log()
        # Add the first N perturbation terms epsilon_j * (1 - Pt)^j
        poly_loss = 0
        for j in range(1, self.N + 1):
            poly_loss += self.epsilon[j - 1] * ((1 - pt) ** j)
        loss = ce_loss + poly_loss
        # Aggregate the per-sample losses for the minibatch
        return loss.mean()
Above is a simple implementation of the PolyN loss, with Poly-1 as the default.
- We compute the softmax activations of the prediction logits
- With the help of target, we identify the probability corresponding to the true class label
- CE Loss is simply the negative log of the probability corresponding to these true labels
- Next, we loop over the epsilon list and incrementally add these N perturbation terms to the CE loss to obtain the final loss
- Ultimately, we aggregate these per-datapoint losses using a simple average, and that becomes the polyloss for our minibatch (a quick sanity check follows below)
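As that sanity check, here is a small sketch with hypothetical random logits: with epsilon set to zero the perturbation vanishes, so PolyLoss should match plain cross entropy up to floating point error, while with the default epsilon1 = 2 the loss is strictly larger.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical batch of 8 logit vectors over the 102 flower classes
logits = torch.randn(8, 102)
targets = torch.randint(0, 102, (8,))

# With epsilon = 0 we recover plain cross entropy
print(PolyLoss(epsilon = [0], N = 1)(logits, targets))
print(F.cross_entropy(logits, targets))

# With the default epsilon1 = 2 the loss is strictly larger
print(PolyLoss()(logits, targets))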
We can then instantiate a learner object in fastai and train a simple resnet50 classifier, as sketched below. The results obtained by training using CE Loss and using Poly1Loss for this problem are compared in the next section.
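A minimal sketch of the training setup, assuming a recent fastai version (vision_learner) and the dls and PolyLoss objects defined above; the epoch counts and learning rates here are illustrative rather than the exact settings used for the reported runs:

# Learner trained with plain cross entropy
learn_ce = vision_learner(dls, resnet50, metrics = accuracy,
                          loss_func = CrossEntropyLossFlat())

# Learner trained with our Poly-1 loss (epsilon1 = 2)
learn_poly = vision_learner(dls, resnet50, metrics = accuracy,
                            loss_func = PolyLoss(epsilon = [2], N = 1))

# Train the head with the body frozen for 8 epochs, then unfreeze
# and fine-tune the whole network with discriminative learning rates
for learn in (learn_ce, learn_poly):
    learn.fine_tune(10, base_lr = 1e-3, freeze_epochs = 8)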
Compare classifiers trained using CE loss vs Poly1 loss
First, we train with the body of the classifier frozen for 8 epochs; subsequently, we unfreeze the body and use discriminative learning rates for the different layers of the network. The comparison of the runs is as follows.
We can observe that
- The magnitude of Poly1Loss is always higher than that of CE Loss, as expected since the perturbation term is positive.
- The metrics obtained using Poly1Loss start off at a much better position than those obtained using CE Loss.
- The accuracy of the model trained using Poly1Loss is consistently higher than that of the model trained using CE Loss for most of the run.
- For this dataset, the model trained using Poly1Loss has a substantial head start over the model trained using CE Loss; however, when trained for many epochs, the CE-trained model catches up and performs about as well as the Poly1Loss-trained model.
Hope you enjoyed reading this post and learned something new today!
References
- PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions (Leng et al., ICLR 2022)