Create your First Neural Net in PyTorch - Line by Line Explanation
Learn how to create your first NN in PyTorch
Introduction
So you read about what neural nets are, some of their applications, along with the mathy details, and now want to train your very first network? Or are you one of those folks interested in understanding every line of the PyTorch code used to create and train a basic neural net? I've got you covered. Let's dive right in!
Data & Setup
We will use FashionMNIST - a dataset of Zalando's article images consisting of 60,000 training examples and a test set of 10,000 examples. Each example image is 28x28 grayscale, associated with a label from 10 classes. We will train a neural classifier to classify each image into one of the 10 classes.
Note: The goal of this post is not to achieve the highest accuracy. Rather, it is to demonstrate and understand, line by line, how PyTorch is a powerful framework for coding neural nets. Along the same lines, it assumes basic familiarity with deep learning and associated terms like loss, optimization and so on. After every block of code, the necessary explanations are provided (and, obviously, they are not meant to be exhaustive).
First, let's import our dependencies:
import torch
from torch import nn, optim
from torch.nn import functional as F
from torchvision import datasets, transforms
from torch.nn.modules.loss import NLLLoss
from torch.utils.data import DataLoader
A note about the above:
torch.nn
Among other functionalities, torch.nn provides the various layers that are the building blocks of large neural nets - for example, linear layers, pooling layers, convolutional layers, dropout layers to curb overfitting (more on this later), and so on.
torch.optim
torch.optim contains various optimization algorithms like Stochastic Gradient Descent, the Adam optimizer, etc. (It also has learning rate scheduling functionality - torch.optim.lr_scheduler provides various methods to adjust the learning rate based on the number of epochs the model has been trained for.)
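As a quick illustration (not used in the rest of this post), here is how an optimizer and a learning-rate scheduler are typically wired together; the tiny placeholder model and the step_size/gamma values below are made up purely for the sketch:

import torch
from torch import nn, optim

# a tiny placeholder model just to have some parameters to optimize
model = nn.Linear(10, 2)

optimizer = optim.SGD(model.parameters(), lr=0.01)
# halve the learning rate every 2 epochs (values chosen arbitrarily for illustration)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

for epoch in range(4):
    # ... forward pass, loss computation and loss.backward() would go here ...
    optimizer.step()    # update parameters (a no-op here since no gradients were computed)
    scheduler.step()    # adjust the learning rate once per epoch
    print(epoch, scheduler.get_last_lr())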
torch.nn.functional
torch.nn.functional contains various functions - you guessed it - activation functions (relu, tanh, softmax, sigmoid, etc.), loss functions (binary cross-entropy loss, negative log likelihood loss - the NLLLoss above, etc.), dropout functions, pooling functions, and so on.
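To make the functional style concrete, here is a tiny, self-contained sketch (the input tensor is made up purely for illustration):

import torch
from torch.nn import functional as F

x = torch.tensor([[-1.0, 0.5, 2.0]])

print(F.relu(x))                      # negative values are zeroed out
print(F.softmax(x, dim=1))            # each row sums to 1
print(F.log_softmax(x, dim=1).exp())  # exponentiating log-softmax recovers softmax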
torchvision
The torchvision package consists of popular datasets, model architectures, and image transformations and augmentations for the computer vision domain. We will see one use of torchvision.transforms further on in the post.
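For example, transforms can be chained together with transforms.Compose. The pipeline below is a hypothetical one for illustration only (this post uses just ToTensor() on its own):

from torchvision import transforms

# an illustrative augmentation + preprocessing pipeline (not used in this post)
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),  # PIL image / ndarray in [0, 255] -> FloatTensor in [0.0, 1.0]
])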
[For anyone that's wondering why we used NLLLoss from torch.nn.modules.loss and not from torch.nn.functional, I found this.]
torch.utils.data
Finally, DataLoader from torch.utils.data allows us to easily access data samples for training and testing by wrapping an iterable around 'our data'.
Digging Deeper
Note that 'our data' can be of two types - torch.utils.data.Dataset and torch.utils.data.IterableDataset. Dataset and IterableDataset are two other classes in torch.utils.data that allow us to use our own data as well as the data already available in PyTorch's pre-loaded datasets.
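To get a feel for how Dataset works, here is a minimal sketch of a map-style dataset wrapping two in-memory tensors; the class name and the dummy tensors are made up just for illustration:

import torch
from torch.utils.data import Dataset, DataLoader

class TensorPairDataset(Dataset):
    """A minimal map-style dataset over in-memory features and labels."""
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# dummy data just to show the interface
dummy = TensorPairDataset(torch.randn(100, 784), torch.randint(0, 10, (100,)))
loader = DataLoader(dummy, batch_size=32, shuffle=True)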
Time for some code - let's get our FashionMNIST data from torchvision.datasets and make it easily accessible in batches using DataLoader. (And I hope you didn't confuse datasets with Dataset)
train_set = datasets.FashionMNIST(root='./data/FashionMNIST', train=True, download=True, transform=transforms.ToTensor())
test_set = datasets.FashionMNIST(root='./data/FashionMNIST', train=False, download=True, transform=transforms.ToTensor())

train_loader = DataLoader(dataset=train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=test_set, batch_size=64)
As must be clear, the first two lines download the train and test data respectively. root is the path where the data shall be stored, and download=True downloads the data if it isn't already present in root. This is also where torchvision.transforms comes in - ToTensor() transforms an image or a NumPy ndarray (in our case) into a torch.FloatTensor in the range [0.0, 1.0]. The original image or ndarray is in the range [0, 255]. Tensors are basically the data structures PyTorch works with, hence the need for this transformation.
As far as the conversion from [0, 255] to [0.0, 1.0] is concerned - it is a normalization step that keeps the input values in a small range, which helps training converge faster and more stably.
The next two lines use the DataLoader class to load data in batches. When batch_size is specified (and so is not None), the DataLoader fetches batched samples instead of individual samples. Remember, our train_set and test_set already contain the data as tensors. Batching merges the tensors of separate data points into one single tensor so that multiple points can be processed simultaneously, yielding faster computation. Hence, this single tensor now has one dimension equal to 64 - the batch size - and this is usually the first dimension (mark this, as the fact will be used later in the post).
shuffle=True constructs an automatic shuffled sampler (roughly speaking, it randomizes the order in which data points are drawn as the data is loaded).
[For those who are still curious or remain unclear: there is a Sampler class (torch.utils.data.Sampler) that can be used to create the sequence of indices with which data is loaded each time. One can either create a custom sampler by creating an instance of this class and passing it to the sampler argument of DataLoader, or use the automatic shuffled sampler via shuffle=True.]
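A quick way to convince yourself of the batch shape and value range described above is to pull one batch from the loader and inspect it:

# grab one batch and inspect its shape and value range
images, labels = next(iter(train_loader))
print(images.shape)                # torch.Size([64, 1, 28, 28]) - batch size comes first
print(images.min(), images.max())  # values lie within [0.0, 1.0] thanks to ToTensor()
print(labels.shape)                # torch.Size([64])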
Now, let's create our own class where we will specify the different layers of our neural net along with the non-linear activations. Basically, this is where we completely specify how the input is processed via various hidden layers to finally produce the output layer. In short - our custom architecture.
class ClothesClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(ClothesClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, num_classes)
        #self.Dropout = nn.Dropout(p=0.1)

    def forward(self, x):
        x = x.reshape(x.shape[0], -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.log_softmax(self.fc4(x), dim=1)
        return x
All models that we create in PyTorch inherit from nn.Module, which is a class in the torch.nn module. Why so? Because nn.Module provides various useful methods like parameters() and __call__() [we will see shortly how both of these are useful], and so on.
__init__() is where we specify the layers of our architecture, and forward() is where the input data (in batches) is actually processed through the architecture via non-linear activations - so forward returns the final output. As should be clear, the architecture is constructed using four linear layers - fc1 through fc4. Each time a layer is created, the first argument is the size of the input coming into it and the second argument is the size of the output going out of it. That's why the last layer, fc4, has num_classes as its output size.
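As a quick sanity check on those layer sizes (and a first use of the parameters() method mentioned above), you can instantiate a throwaway copy of the class and count its trainable parameters - the instance name model below is just for illustration:

# instantiate a throwaway model and count its trainable parameters
model = ClothesClassifier(input_size=784, num_classes=10)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(total_params)  # 784*256 + 256 + 256*128 + 128 + 128*64 + 64 + 64*10 + 10 = 242,762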
[The commented-out part self.Dropout = nn.Dropout(p=0.1) defines a dropout layer that randomly zeroes out some of the outputs of the layer it is applied to - this is done to curb overfitting. I have not used dropout in this post, but you should definitely go ahead and use it; just make sure not to apply dropout to the last layer of your network - the one that gives the final output - and it should be pretty clear why.]
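If you do want to try dropout, here is a hedged sketch of how the forward pass might look with that commented-out layer enabled - dropout goes after the hidden activations, never after fc4:

# inside ClothesClassifier, with the self.Dropout = nn.Dropout(p=0.1) line uncommented
def forward(self, x):
    x = x.reshape(x.shape[0], -1)
    x = self.Dropout(F.relu(self.fc1(x)))   # drop ~10% of activations after each hidden layer
    x = self.Dropout(F.relu(self.fc2(x)))
    x = self.Dropout(F.relu(self.fc3(x)))
    x = F.log_softmax(self.fc4(x), dim=1)   # no dropout on the output layer
    return x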
In forward(), x is the incoming input batch. Now, if you go and inspect a bit, you'll find that for this dataset each image (each data point) has the dimension 1*28*28 (a single grayscale channel), and consequently one batch of data points has the dimension batch_size*1*28*28; so, to match our network, we reshape it so that one batch is now batch_size*784.
The next three lines after reshaping apply the ReLU activation function to the outputs coming out of each layer fc1 through fc3 - so the input to fc2 is not directly what is output from fc1, but what is output after applying ReLU to the output of fc1. Of course, non-linear activations are at the heart of any NN.
Notice that we apply log softmax (not ReLU) to the final output from fc4, as this is a classification problem and we want an output for each of the 10 labels that can be interpreted as the probability of that class being the label - softmax squashes the 10 scores into values between 0 and 1 that sum to 1, and log softmax takes their log (which is what the NLLLoss used later expects). dim=1 is very important here - the output from fc4 has the shape batch_size*num_classes (64*10), and we want the softmax to be computed across the columns, i.e. separately for each row (each data point).
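A tiny check of what dim=1 does, with a made-up 2x3 score tensor standing in for the 64x10 one:

import torch
from torch.nn import functional as F

fake_scores = torch.randn(2, 3)            # pretend batch of 2 samples, 3 classes
log_probs = F.log_softmax(fake_scores, dim=1)
print(log_probs.exp().sum(dim=1))          # each row sums to 1 -> per-sample probabilities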
Alright!
Data loading ✅ , Defining the network architecture ✅ , Defining how input shall be processed ✅
Time for training our network!
# some standard hyperparameters/variables
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # we will use a GPU if it is available
input_size = 784  # 28*28
num_classes = 10
learning_rate = .001
epochs = 4

classifier = ClothesClassifier(input_size, num_classes).to(device)
criterion = NLLLoss()
optimizer = optim.Adam(classifier.parameters(), lr=learning_rate)
We first define some standard variables (or hyperparameters) to be used in model training and then create an instance of our ClothesClassifier class, passing suitable parameters. .to(device) puts the instance (our model) on the specified device, which will be cuda if a GPU is available, else cpu.
We are using the negative log likelihood loss as the loss function (the criterion) to be optimized, and the optimization mechanism is the Adam optimizer. classifier.parameters() specifies which parameters need to be updated, while the learning rate is another hyperparameter that roughly translates to controlling the size of the steps, or the pace, with which the parameter updates are made. Let's define the training loop.
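Before we do, one quick detail: NLLLoss expects log-probabilities as its input, which is exactly why our network ends with log_softmax. In fact, the combination of log_softmax + NLLLoss is equivalent to feeding the raw fc4 scores into nn.CrossEntropyLoss - here is a quick sketch of that equivalence with made-up tensors (not part of the training code):

import torch
from torch import nn
from torch.nn import functional as F

raw_scores = torch.randn(4, 10)            # pretend fc4 outputs for 4 samples
labels = torch.randint(0, 10, (4,))

loss_a = nn.NLLLoss()(F.log_softmax(raw_scores, dim=1), labels)
loss_b = nn.CrossEntropyLoss()(raw_scores, labels)
print(torch.allclose(loss_a, loss_b))      # True - the two formulations match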
for epoch in range(epochs):
    for batch, (images, targets) in enumerate(train_loader):
        images, targets = images.to(device), targets.to(device)

        scores = classifier(images)
        #scores = classifier.forward(images)
        loss = criterion(scores, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
As we already know, our train_loader produces data in batches; in every epoch, all the batches are processed by the model to update the parameters before moving on to the next epoch. Using .to(device), we move our features (images) and targets to the specified device, and then pass the images to our model, which outputs the scores. Using our criterion, we then calculate the loss between the scores output by the model and the actual targets.
[For anyone who found scores = classifier(images) unintuitive and is wondering why we didn't pass the images to the forward method (like the commented line just below it), since forward is the one that actually processes the inputs to output the scores - both work. It is the internal functioning of PyTorch that causes the call classifier(images) to invoke the __call__() method of nn.Module, which internally calls the forward method to return the scores. Calling the model directly is the recommended style, since __call__() also takes care of any hooks registered on the module.]
Before using this loss to compute the gradients to be backpropagated, optimizer.zero_grad() is a very important step to keep in mind. It clears the gradients from the previous step so that we do not end up 'accumulating' gradients from previous loss.backward() calls. After this, loss.backward() computes the derivatives of the loss w.r.t. the parameters using backpropagation, and finally, optimizer.step() makes the optimizer take a step and update the parameters based on their gradients.
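To see why zero_grad() matters, here is a tiny demonstration on a single made-up parameter - without clearing, gradients from successive backward() calls simply add up:

import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3).backward()
print(w.grad)        # tensor(3.)

(w * 3).backward()   # gradients accumulate without zeroing
print(w.grad)        # tensor(6.)

w.grad.zero_()       # what optimizer.zero_grad() does for every parameter
print(w.grad)        # tensor(0.)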
We are almost finished. The network has now been trained for the specified number of epochs and is ready to be tested. Let's define a function called check_accuracy to check the accuracies on our train and test sets.
def check_accuracy(loader, model):
    if loader.dataset.train:
        print("Train accuracy")
    else:
        print('Test accuracy')
    num_correct = 0
    num_samples = 0
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            scores = model.forward(x)       # scores is 64*10 in dimension
            _, predictions = scores.max(1)  # the label corresponding to the maximum score
                                            # is taken as the output
            num_correct += (predictions == y).sum()
            num_samples += predictions.size(0)
    print(f'Accuracy {float(num_correct)/float(num_samples)*100:.2f}')
    model.train()

check_accuracy(train_loader, classifier)
check_accuracy(test_loader, classifier)
A few things need to be noted here:
model.eval() is used to tell PyTorch that we want to run inference on our model, so layers like batch norm, dropout, etc. that behave differently during training and inference are switched to evaluation mode. Consequently, model.train() is used to switch them back to training mode once the evaluation phase is finished.
Another good practice while evaluating models is to turn gradient computation off, since we will not be using those gradients anywhere - we are just evaluating. This is done using with torch.no_grad() - it also ensures that as soon as the with block finishes executing, gradient computation is turned back on. This practice speeds up computation and saves memory.
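Here is a small illustration of what torch.no_grad() changes, using made-up tensors independent of the model above:

import torch

x = torch.randn(3, requires_grad=True)

y = (x * 2).sum()
print(y.requires_grad)     # True - this op is tracked and could be backpropagated

with torch.no_grad():
    z = (x * 2).sum()
print(z.requires_grad)     # False - no graph was built inside the block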
Nice! You made it this far, so you are good to go with learning to train more complex networks like RNNs, CNNs, etc., and trying more advanced PyTorch functionality.
As for this model, experiment with it - here is a notebook on my GitHub if you are curious to see some results. I'll attach one for you here :)
