ML Experiment tracking and model management using WandB
As part of MLOps Assignment 3, I developed a simple NN model for handwritten digit recognition over the MNIST dataset.
One of the hyperparameters for the model was the learning rate (lr). I swept this over the values 0.01, 0.001, and 0.0001 to identify the optimal value.
I saved the best model as a versioned artifact.
This is a report that summarizes this exercise.
Training the NN model
The model is a simple neural network with two fully connected layers, in the configuration: Input -> FC1 -> ReLU -> FC2 -> Output. It was trained for 5 epochs.
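A minimal sketch of such a two-layer network in PyTorch is shown below. The class name, hidden layer size, and variable names are illustrative assumptions; only the FC1 -> ReLU -> FC2 structure and the MNIST input/output dimensions come from the description above.

```python
import torch.nn as nn

# Sketch of the two-layer architecture described above:
# Input (flattened 28x28 image) -> FC1 -> ReLU -> FC2 -> Output (10 classes).
# The hidden size of 128 is an assumption, not the assignment's actual value.
class SimpleNN(nn.Module):
    def __init__(self, hidden_size=128):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten the 28x28 image
        return self.fc2(self.relu(self.fc1(x)))
```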
Hyperparameter Exploration
As advised, the WandB Sweep capability was used to sweep the learning rate (lr) parameter through the values 0.01, 0.001, and 0.0001.
The value lr = 0.001 demonstrated the best performance in terms of training and test accuracy.
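A sketch of what the sweep configuration might look like is given below. The sweep method, metric name, project name, and the train() entry point are assumptions; only the three lr values are taken from the actual experiment.

```python
import wandb

# Sketch of a W&B sweep over the three learning rates described above.
# "grid", "val_accuracy", and the project name are assumptions.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "lr": {"values": [0.01, 0.001, 0.0001]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="mnist-mlops-assignment3")
# wandb.agent(sweep_id, function=train)  # train() would run one configuration per lr value
```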
Best Model - Artifact
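The best model was saved as a versioned artifact. A minimal sketch of how this can be done with the W&B Artifacts API follows; the artifact name, checkpoint file path, and metadata keys are assumptions, while the lr, epoch count, and validation accuracy values are taken from the results below.

```python
import wandb

# Sketch: log the best checkpoint as a versioned W&B artifact.
# The artifact name and file path are assumptions.
run = wandb.init(project="mnist-mlops-assignment3", job_type="upload-model")
artifact = wandb.Artifact(
    "mnist-best-model",
    type="model",
    metadata={"lr": 0.001, "epochs": 5, "val_accuracy": 0.9642},
)
artifact.add_file("best_model.pth")
run.log_artifact(artifact)
run.finish()
```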
Observations
- Holding the batch size and number of epochs constant (64 and 5 respectively), a learning rate of 0.001 gave the best model performance, with a validation accuracy of 96.42%.
- A larger learning rate (0.01) led to a marginally lower validation accuracy (93.65%); validation loss plateaued around 0.23.
- A smaller learning rate (0.0001) led to a lower validation accuracy (92.49%). Validation loss was higher than in the best run, but the trend indicated it would continue to decrease if the number of epochs were increased.
- As evident from the following trend charts, which compare a run with (lr 0.01, epochs 5) against one with (lr 0.001, epochs 10), a smaller learning rate combined with an appropriately larger number of epochs yields more stable training and better performance; however, this comes at the cost of training time (CPU). If the model is coded efficiently, the impact on memory is negligible.
Artifact Management Motivations
As part of the MLOps coursework and assignments, I developed useful insights into why artifact management is critical for ensuring reproducibility and version control in ML development and deployment environments.
Reproducibility
In Machine Learning (ML) research and applications, it is fundamentally important to be able to run a given experiment with specified inputs, model code, and environment, and obtain the same results. This reproducibility ensures that ML models and research findings can be consistently verified. It is even more important in a rapidly evolving field like ML/AI, where new innovations are quickly built on top of past, trusted work.
With artifact management, the key components of a researched model (datasets, model code and tuned weights, hyperparameters, and dependencies) can be tagged and tracked. This enables seamless collaboration between the participating development, testing, and integration teams, since everyone works from a well-defined source. It also eases deployment into new environments.
Version Control
ML workflows typically involve multiple artifacts: datasets, foundation models layered with application-specific code, hyperparameters, and dependencies. Keeping track of the versions of these artifacts (i.e., version control) ensures reproducibility, eases collaboration, accelerates debugging, and helps audit models for compliance.
ML teams typically work in parallel across different functions (say, data processing, model training, and deployment integration), and often different geographies. Version control ensures that they can communicate and collaborate effectively.
Also, as ML models evolve and get refined, version control allows teams to track the changes between model versions, and develop insights on factors which helped a model get better or get degraded.
In production environments, models are exposed to unseen data and sometimes unexpected usage patterns. If a newly deployed model version runs into issues in production, version control provides an efficient mechanism to roll back to a previous stable version.
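For illustration, restoring a known-good earlier version of a logged model through the W&B Artifacts API might look like the following sketch. The project, artifact name, and version alias ("v2") are assumptions, not values from the actual assignment.

```python
import wandb

# Sketch: roll back by downloading a previous, known-stable artifact version.
# "mnist-best-model:v2" is an assumed name and version alias.
run = wandb.init(project="mnist-mlops-assignment3", job_type="rollback")
artifact = run.use_artifact("mnist-best-model:v2", type="model")
model_dir = artifact.download()  # local directory containing the stored checkpoint
run.finish()
```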