
MLOps2025_G24AIT007

This report summarizes the key observations from the MLOps assignment, focusing on hyperparameter exploration and artifact management using Weights & Biases (wandb).

I. Project Overview
This project involved training a simple neural network (SimpleNN) for image classification on the Fashion-MNIST dataset. The goal was to integrate W&B for experiment tracking, hyperparameter optimization, and model artifact management. My roll number is G24AIT007, so I followed the assignment requirements corresponding to 7 as the last digit.

Detailed Explanation of Steps Performed in the Colab Notebook:
Install wandb:
!pip install wandb -q: This command installs the wandb library in the Colab environment. The -q option makes the installation quiet.

Import Libraries:
This section imports all the necessary libraries for the project:
torch: PyTorch library for tensor operations and neural networks.
torch.nn as nn: PyTorch's neural network module.
torch.optim as optim: PyTorch's optimization module.
torchvision: PyTorch's library for computer vision tasks, including datasets and transforms.
torchvision.transforms as transforms: For data preprocessing.
torch.utils.data.DataLoader: For creating data loaders.
torch.utils.data.random_split: For splitting the dataset.
wandb: The Weights & Biases library.
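
A minimal import cell corresponding to this list (a sketch of what the notebook imports) would look like:

```python
# Core libraries used throughout the notebook
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
import wandb
```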

Q1: Dataset and Model Preparation:
a) Load Fashion-MNIST Dataset:
transform = transforms.Compose(...): Defines a sequence of transformations to apply to the images:
transforms.ToTensor(): Converts the images to PyTorch tensors.
transforms.Normalize((0.5,), (0.5,)): Normalizes the pixel values to the range [-1, 1]. This helps with training stability.
fashion_mnist_dataset = torchvision.datasets.FashionMNIST(...): Loads the Fashion-MNIST dataset from the ./data directory. If the dataset is not already downloaded, it will be downloaded. The transform is applied to each image.

b) Configure Train-Validation Split:
train_size = int(0.8 * len(fashion_mnist_dataset)): Calculates the size of the training set (80% of the total dataset).
val_size = len(fashion_mnist_dataset) - train_size: Calculates the size of the validation set (20% of the total dataset).
train_dataset, val_dataset = random_split(fashion_mnist_dataset, [train_size, val_size]): Splits the dataset into training and validation sets using random_split.
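
A sketch of the dataset preparation described in (a) and (b), assuming the standard Fashion-MNIST training split (train=True):

```python
# Normalize pixel values from [0, 1] to [-1, 1] for training stability
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

# Download (if needed) and load Fashion-MNIST from ./data with the transform applied
fashion_mnist_dataset = torchvision.datasets.FashionMNIST(
    root="./data", train=True, download=True, transform=transform
)

# 80/20 train-validation split
train_size = int(0.8 * len(fashion_mnist_dataset))
val_size = len(fashion_mnist_dataset) - train_size
train_dataset, val_dataset = random_split(fashion_mnist_dataset, [train_size, val_size])
```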

c) Model Architecture:
class SimpleNN(nn.Module): Defines a simple neural network with two fully connected layers and ReLU activation.
__init__(self, num_neurons=128): The constructor initializes the layers:
self.flatten = nn.Flatten(): Flattens the 28x28 images into a 784-dimensional vector.
self.fc1 = nn.Linear(28 * 28, num_neurons): The first fully connected layer, mapping the input to num_neurons (default 128).
self.relu = nn.ReLU(): ReLU activation function.
self.fc2 = nn.Linear(num_neurons, 10): The second fully connected layer, mapping num_neurons to 10 (the number of classes in Fashion-MNIST).
forward(self, x): Defines the forward pass of the model.
x = self.flatten(x): Flattens the input.
x = self.fc1(x): Applies the first fully connected layer.
x = self.relu(x): Applies the ReLU activation function.
x = self.fc2(x): Applies the second fully connected layer.
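
A sketch of the SimpleNN class as described above:

```python
class SimpleNN(nn.Module):
    def __init__(self, num_neurons=128):
        super().__init__()
        self.flatten = nn.Flatten()                 # 28x28 image -> 784-dim vector
        self.fc1 = nn.Linear(28 * 28, num_neurons)  # input layer -> hidden layer
        self.relu = nn.ReLU()                       # non-linearity
        self.fc2 = nn.Linear(num_neurons, 10)       # hidden layer -> 10 classes

    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
```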

Q2: Setting Up the Project & Logging Hyperparameters:
a) Set up wandb project:
wandb.login(): Logs in to the wandb service. You'll need to authenticate with your wandb account.
PROJECT_NAME = "MLOps2025_G24AIT007": Defines the name of the wandb project.
run = wandb.init(project=PROJECT_NAME, job_type="training"): Initializes a new wandb run, connecting your code to the wandb service. The project argument sets the project name, and job_type sets the type of run.

b) Define configuration parameters:
config = wandb.config: Accesses the wandb configuration object.
config.learning_rate = 0.001: Sets the learning rate.
config.batch_size = 64: Sets the batch size.
config.epochs = 5: Sets the number of epochs.
config.num_neurons = 128: Sets the number of neurons in the hidden layer.
config.model_architecture = "SimpleNN": Sets the model architecture.

c) Ensure all configuration parameters are logged:
wandb.config.update(config): Updates the wandb configuration with the defined parameters. This ensures that all parameters are logged in wandb under the "Config" section.
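
Taken together, (a), (b), and (c) correspond roughly to the following sketch; attributes set on wandb.config are what populate the run's Config section:

```python
wandb.login()  # authenticate with your W&B account

PROJECT_NAME = "MLOps2025_G24AIT007"
run = wandb.init(project=PROJECT_NAME, job_type="training")

# Hyperparameters stored on wandb.config appear under the run's Config tab
config = wandb.config
config.learning_rate = 0.001
config.batch_size = 64
config.epochs = 5
config.num_neurons = 128
config.model_architecture = "SimpleNN"
```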

Q3: Training and Validation:
Initialize model, loss function, and optimizer:
model = SimpleNN(num_neurons=config.num_neurons).to(device): Creates an instance of the SimpleNN model and moves it to the specified device (GPU if available, otherwise CPU).
criterion = nn.CrossEntropyLoss(): Defines the loss function (cross-entropy loss for multi-class classification).
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate): Defines the optimizer (Adam) and sets the learning rate.

a) Prepare data loaders:
train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True): Creates a data loader for the training set. The batch_size determines the number of samples in each batch, and shuffle=True shuffles the data in each epoch.
val_loader = DataLoader(val_dataset, batch_size=config.batch_size, shuffle=False): Creates a data loader for the validation set. shuffle=False because we don't need to shuffle the validation data.
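
A sketch consolidating the model/optimizer initialization and loader setup described above (the device variable selects the GPU when one is available):

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model, loss function, and optimizer
model = SimpleNN(num_neurons=config.num_neurons).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)

# Data loaders: shuffle the training data each epoch, keep validation order fixed
train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=config.batch_size, shuffle=False)
```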

b) Train and validate the model:
The code iterates through the specified number of epochs.

Training Loop:
model.train(): Sets the model to training mode.
The code iterates through the training data loader.
inputs, labels = inputs.to(device), labels.to(device): Moves the input data and labels to the specified device.
outputs = model(inputs): Performs the forward pass.
loss = criterion(outputs, labels): Calculates the loss.
optimizer.zero_grad(): Resets the gradients.
loss.backward(): Performs the backward pass (calculates the gradients).
optimizer.step(): Updates the model parameters.
The code calculates the training accuracy and loss for the epoch.

Validation Loop:
model.eval(): Sets the model to evaluation mode (this changes the behavior of layers such as dropout and batch normalization; gradient calculation itself is disabled separately with torch.no_grad()).
The code iterates through the validation data loader.
with torch.no_grad(): Disables gradient calculation.
The code performs the forward pass and calculates the validation loss and accuracy.

c) Track and log metrics:
wandb.log(...): Logs the training loss, training accuracy, validation loss, and validation accuracy to wandb for each epoch.
The code prints the training and validation metrics for each epoch.
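
A condensed sketch of the per-epoch training and validation loop described above; the metric key names passed to wandb.log are illustrative and should match whatever the notebook actually logs:

```python
for epoch in range(config.epochs):
    # ----- Training -----
    model.train()
    train_loss, correct, total = 0.0, 0, 0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)               # forward pass
        loss = criterion(outputs, labels)     # compute loss
        optimizer.zero_grad()                 # reset gradients
        loss.backward()                       # backward pass
        optimizer.step()                      # update parameters
        train_loss += loss.item()
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    train_acc = correct / total
    train_loss /= len(train_loader)

    # ----- Validation -----
    model.eval()
    val_loss, val_correct, val_total = 0.0, 0, 0
    with torch.no_grad():                     # no gradients needed for evaluation
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            val_loss += criterion(outputs, labels).item()
            val_correct += (outputs.argmax(dim=1) == labels).sum().item()
            val_total += labels.size(0)
    val_acc = val_correct / val_total
    val_loss /= len(val_loader)

    # Log one row of metrics per epoch
    wandb.log({"train_loss": train_loss, "train_accuracy": train_acc,
               "val_loss": val_loss, "val_accuracy": val_acc, "epoch": epoch})
    print(f"Epoch {epoch + 1}: train_acc={train_acc:.4f}, val_acc={val_acc:.4f}")
```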

Q4: Hyperparameter Exploration (Sweeps):

a) Implement wandb sweeps:
sweep_config = {...}: Defines the sweep configuration.
method: 'random': Specifies the sweep method (random search).
metric: ...: Defines the metric to optimize (validation accuracy).
parameters: ...: Defines the hyperparameters to sweep (batch size).
batch_size: ...: Specifies the range of values for the batch size (32, 64, 128).
sweep_id = wandb.sweep(sweep_config, project=PROJECT_NAME): Creates a new sweep in wandb.
train_sweep() function:
This function is called by the wandb agent for each sweep run.
It initializes a new wandb run and reads the batch size chosen by the sweep. It then rebuilds the data loaders with that batch size, initializes the model, loss function, and optimizer, and trains and validates the model for 5 epochs.
It logs the training and validation metrics to wandb for each epoch.
It finishes the wandb run.
wandb.agent(sweep_id, train_sweep, count=3): Starts the wandb agent, which runs the train_sweep() function 3 times, each time with a batch size sampled by the sweep from the specified values.
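
A sketch of this sweep setup; the metric name "val_accuracy" is an assumption and must match the key logged inside the training function:

```python
sweep_config = {
    "method": "random",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "batch_size": {"values": [32, 64, 128]},
    },
}
sweep_id = wandb.sweep(sweep_config, project=PROJECT_NAME)

def train_sweep():
    # The agent starts a new run and injects the sampled batch_size into wandb.config
    with wandb.init():
        batch_size = wandb.config.batch_size
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
        model = SimpleNN().to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=0.001)
        # ... run the same 5-epoch train/validate loop as above,
        #     logging "val_accuracy" with wandb.log each epoch ...

wandb.agent(sweep_id, train_sweep, count=3)
```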

Q5: Artifact Management and Model Saving:

a) Save the trained model as a wandb artifact:
artifact = wandb.Artifact("model", type="model"): Creates a new wandb artifact of type "model".
MODEL_PATH = "model.pth": Defines the path to save the model.
torch.save(model.state_dict(), MODEL_PATH): Saves the model's state dictionary to a file.
artifact.add_file(MODEL_PATH): Adds the saved model file to the artifact.

b) Log the artifact:
run.log_artifact(artifact): Logs the artifact to wandb.
artifact.wait(): Waits for the artifact to be uploaded.
Finish the wandb run:
wandb.run.finish(): Finishes the wandb run and uploads all the data to the wandb service.
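
Collecting the artifact-handling steps into one sketch:

```python
MODEL_PATH = "model.pth"
torch.save(model.state_dict(), MODEL_PATH)   # save the weights to disk

artifact = wandb.Artifact("model", type="model")
artifact.add_file(MODEL_PATH)                # attach the saved weights file

run.log_artifact(artifact)                   # upload as a versioned artifact
artifact.wait()                              # block until the upload completes

run.finish()                                 # close the run and flush remaining data
```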

Q6: Observations
I. Hyperparameter Settings (Initial Run)
Before running any hyperparameter sweeps, I trained the model with the following initial hyperparameter settings:
Learning Rate: 0.001.
I chose this value as it is a commonly used starting point for the Adam optimizer and often provides a good balance between convergence speed and stability.
Batch Size: 64.
A batch size of 64 was selected as it is a commonly used default value and often provides a good trade-off between computational efficiency and gradient accuracy.
Epochs: 5.
I selected 5 epochs initially to allow the model to begin learning without excessive training time, intending to increase this if needed.
Number of Neurons (Hidden Layer): 128.
I used 128 neurons as a starting point for the hidden layer size, providing sufficient capacity for the model to learn the underlying patterns in the data.
Model Architecture:
SimpleNN (two fully connected layers with ReLU activation).

II. Sweep Results (Batch Size)
As my roll number ends in 7, I performed a hyperparameter sweep over the batch size. The sweep configuration was as follows:
Hyperparameter Swept: Batch Size
Range of Values Tested: 32, 64, 128
Sweep Method: Random


The sweep results indicated that a batch size of 64 achieved the highest validation accuracy, 0.88908.
Analyzing the results, I observed that a batch size of 64 likely provided a good balance between gradient accuracy and computational efficiency. Smaller batch sizes might have introduced more noise into the gradient estimates, while larger batch sizes might have led to slower convergence or poorer generalization.

III. Trade-offs
I observed the following trade-offs between the different batch sizes:
"Smaller batch sizes (e.g., 32) generally resulted in longer training times per epoch but sometimes led to slightly higher validation accuracy."
"Larger batch sizes (e.g., 128) trained faster but might have resulted in lower validation accuracy compared to the smaller batch sizes."

IV. Importance of Artifact Management
Saving the trained model as a wandb artifact is crucial for:
Reproducibility: Wandb artifacts ensure that experiments can be reliably reproduced. The artifact stores the trained model weights, while the associated wandb run records the configuration parameters (hyperparameters) and the code used during training. This eliminates any ambiguity and allows anyone with access to the wandb project to recreate the experiment and obtain the same results, even if the original environment or code has changed.
Version Control: Wandb's artifact versioning provides a robust mechanism for tracking different versions of the model as experiments evolve. Each time the model is saved as an artifact (e.g., after a hyperparameter sweep), a new version is created. This allows easy comparison of the performance of different models and the ability to revert to previous versions if needed. Furthermore, descriptions can be added to each artifact version, providing valuable context and facilitating the understanding of changes made during the experiment.


V. Conclusion
In summary, this project demonstrated the effectiveness of using wandb for tracking machine learning experiments, performing hyperparameter optimization, and managing model artifacts. The sweep results allowed for the identification of an effective batch size for the Fashion-MNIST dataset, and the artifact management features of wandb ensured the reproducibility and traceability of the experiments.