
Evaluation

Report created by W&B to highlight capabilities as part of your evaluation.
Created on August 2 | Last edited on August 2


Weights and Biases (W&B) 💫

Weights and Biases is an MLOps platform built to facilitate collaboration and reproducibility across the machine learning development lifecycle. Machine learning projects can quickly become a mess without best practices in place to aid developers and scientists as they iterate on models and move them to production.
W&B is lightweight enough to work with whatever framework or platform teams are currently using, but enables teams to quickly start logging their important results to a central system of record. On top of this system of record, W&B has built visualization, automation, and documentation capabilities for better debugging, model tuning, and project management.


Pilot Plan



Key Deliverable - Environment

  1. Vendor to provide a full-featured product with which Bank of America (BAC) will conduct the POC
  2. Vendor to provide access to product SME resources to assist with the POC
  3. Vendor to provide product documentation covering all aspects of the product, including but not limited to:
       • Architecture
       • Data flows
       • Component interactions
       • Security controls, both for the product itself and for the devices that are identified by the product
       • Software Bill of Materials with version numbers for any open-source components used in the product
  4. Vendor product will run completely in the Bank of America (BAC) lab environment

Key Deliverable - Demonstrate Product Performance and Effectiveness

Experiment Tracking 🍽

The entry point for W&B usage is Experiments. The SDK is built around a few core primitives that make up its experiment tracking and logging system, letting you log pretty much anything to W&B: scalar metrics, images, video, custom plots, and more. Once the logging is complete, we then need to contextualize our experiments, and W&B provides Reports as the means to accomplish this.
Even though our experiment tracking is easy to instrument by hand, there are also great integrations available that make it even easier for popular frameworks and libraries!


Minimal code setup needed to use the product

Getting started with W&B is a simple procedure:
!pip install wandb
import wandb

with wandb.init(project="my_project") as run:
    run.log({"metric": 0.1})
This is the common pattern employed, and it becomes even simpler when an available integration is used.
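As a hedged illustration, here is a minimal sketch of one such integration (Keras). The model and data are toy placeholders, and the import path may differ across wandb versions (older releases expose wandb.keras.WandbCallback):

import numpy as np
import wandb
from tensorflow import keras
from wandb.integration.keras import WandbCallback

# Toy data and model, purely for illustration
x, y = np.random.rand(256, 10), np.random.randint(0, 2, 256)
model = keras.Sequential([keras.layers.Dense(16, activation="relu", input_shape=(10,)),
                          keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

with wandb.init(project="my_project") as run:
    # The callback streams per-epoch metrics to W&B without manual run.log calls
    model.fit(x, y, epochs=3, validation_split=0.2,
              callbacks=[WandbCallback(save_model=False)])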

Ability to log model stats and results per training run

  • Default stats for models: Loss, Epochs, Accuracy, other common stats
  • Custom / Ad Hoc Stats
  • User Specified
Through the W&B Python client, you can log anything to W&B. We support logging data as key-value pairs (think Python dictionary), where values can take on many different data types. Below, we show how logging model configuration, loss, accuracy, and other common stats manifests. The plots below were created within a project and imported directly into this report.
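Here is a minimal sketch (the hyperparameter names and metric values are illustrative placeholders) of logging a run's configuration along with per-epoch loss and accuracy:

import random
import wandb

config = {"learning_rate": 0.01, "batch_size": 64, "epochs": 10}

with wandb.init(project="my_project", config=config) as run:
    for epoch in range(config["epochs"]):
        # Placeholder numbers standing in for real training / validation metrics
        run.log({"epoch": epoch,
                 "loss": 1.0 / (epoch + 1),
                 "accuracy": 0.5 + 0.04 * epoch,
                 "val_accuracy": 0.48 + 0.04 * epoch + random.uniform(-0.02, 0.02)})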


In some cases you might be exploring more than one modeling methodology, and you can leverage W&B to aggregate the detail logged for each model. The first example below shows feature importance for all available features across all trained models (error bars represent the 25th and 75th percentiles respectively). The second example presents similar detail as a heat map, where the y-axis represents the model, the x-axis represents the feature, and the cell value represents the importance of that feature for that model. This is considered advanced usage of W&B Custom Charting.
You can also log plots made in your Python environment. We provide methods to log both Matplotlib and Plotly figures, using the wandb.Image and wandb.Plotly data types to log your figures / plots. In fact, there is a large set of data types which W&B can log; please see the details in the docs.
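As a minimal sketch (the plotted values are placeholders), a Matplotlib figure can be logged directly, in which case W&B converts it to an interactive chart where possible, or wrapped in wandb.Image to log it as a static image:

import matplotlib.pyplot as plt
import wandb

with wandb.init(project="my_project") as run:
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3, 4], [0.9, 0.6, 0.45, 0.4])   # placeholder loss curve
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    run.log({"loss_curve": fig})                    # rendered as an interactive panel
    run.log({"loss_curve_png": wandb.Image(fig)})   # rendered as a static image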

[Embedded panel grid: run set of 60 runs]

Ability to log model code per training run

In our Python client, when an experiment is initialized, if the user passes the save_code=True flag, code will be captured. Moreover, if the user is in a notebook, the entire notebook can be logged automatically, and we do this on a per-experiment basis. We also permit comparisons of code between runs, as seen below.
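A minimal sketch of code capture: save_code=True records the launching script or notebook, and log_code() can additionally snapshot the .py files under a directory.

import wandb

with wandb.init(project="my_project", save_code=True) as run:
    run.log_code(".")          # snapshot source files under the current directory
    run.log({"metric": 0.1})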




Ability to log system performance metrics per training run

System performance is logged automatically when a user begins their experiment.


W&B supports rendering PyTorch traces using the Chrome Trace Viewer. There is an excellent W&B report available if you would like to dive deeper on the topic.
The setup can be particularly simple if you are already using PyTorch Lightning for your model development:
import glob

import wandb
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Log all new checkpoints during training and capture the training code
wandb_logger = WandbLogger(project='MNIST', log_model='all', save_code=True)

# Using raw DataLoaders, rather than a LightningDataModule, for greater transparency;
# training_set and validation_set are torch Datasets defined elsewhere
training_loader = DataLoader(training_set, batch_size=64, shuffle=True, pin_memory=True)
validation_loader = DataLoader(validation_set, batch_size=64, pin_memory=True)

# Set up the model (a LightningModule defined elsewhere in the notebook)
model = MNIST_LitModule(n_layer_1=128, n_layer_2=128)

# log_predictions_callback and checkpoint_callback are defined elsewhere in the notebook
trainer = Trainer(gpus=None, max_epochs=5, profiler="pytorch", logger=wandb_logger,
                  callbacks=[log_predictions_callback, checkpoint_callback],
                  precision=32)
trainer.profiler.dirpath = "/content/wandb/latest-run/tbprofile"
trainer.fit(model, training_loader, validation_loader)

# Collect the PyTorch profiler traces and log each one as a W&B Artifact
trace_files = glob.glob("/content/wandb/latest-run/tbprofile/*.pt.trace.json")
for i, trace_file in enumerate(trace_files):
    if "training_step" in trace_file:
        profile_art = wandb.Artifact(f"train-trace{i}-{wandb.run.id}", type="profile")
        profile_art.add_file(trace_file, "train_trace.pt.trace.json")
    else:
        profile_art = wandb.Artifact(f"validation-trace{i}-{wandb.run.id}", type="profile")
        profile_art.add_file(trace_file, "validation_trace.pt.trace.json")
    wandb.log_artifact(profile_art)
wandb.finish()
The logged trace can then be rendered in the UI; from there, you can share it via the UI itself or incorporate the trace into your reports as needed, and you can surface system usage alongside it.



Ability to log to different projects

It is simple to log to different projects. Consider a Python script which needs to log info to two separate projects, project1 and project2; it is as simple as:
with wandb.init(project="project1") as run:
    run.log({"metric": 0.2})

with wandb.init(project="project2") as run:
    run.log({"metric": 0.3})
Then, once this data has been logged, you can create reports that span multiple projects using the /Panel grid command within a report.

[Embedded panel grid spanning multiple run sets: Run set 1, Run set 2, Run set 3]

Ability to import offline run data

If you are in a situation with no connectivity but still want to run experiments locally and upload your data at another time, you can always do so via the wandb sync command.
In the example below, we consider the case where wandb is being run in offline mode:
with wandb.init(project="my_project", mode="offline") as run:
    run.log({"metric": 0.02})
This requires you to perform a sync with W&B after the fact. All of these offline runs get placed in ./wandb/offline*. From there, the sync could look like the following:
import glob
import subprocess

offline_runs = glob.glob("./wandb/offline*")
for run in offline_runs:
    output = subprocess.run(["wandb", "sync", run], stdout=subprocess.PIPE)
    print(output.stdout)
You can also use the resume functionality of wandb.init to have W&B automatically resume runs that have crashed or exited unsuccessfully. Please see the docs.
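As a hedged sketch (the run id below is illustrative), a run can also be resumed explicitly; resume="allow" continues the run if that id already exists and starts a fresh one otherwise:

import wandb

with wandb.init(project="my_project", id="abc123xy", resume="allow") as run:
    run.log({"metric": 0.03})   # appends to the earlier run's history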

Ability to perform analysis on model runs and model run data

  • Visual Analytics
  • Numerical Analytics

Interactive Tables

Through our Tables product, you can easily log sample predictions and interact with the data. Moreover, supporting evidence could include Artifacts (models and datasets), custom charts, and other data.
Below is an example of a sample of predictions logged to W&B. Predictions were made on the image column, and we have provided the actual label as well as the prediction across all runs where predictions have been logged. The table supports interactive analysis.
W&B Tables enable a granular analysis of predictions and results through tabular data manipulation. Oftentimes, understanding a model's behavior during or after training requires more than seeing a clean loss curve go down and to the right. We need to understand where specifically the model fails, what examples are giving it trouble, where we might need to collect more training data/re-label, or maybe even uncover more nuanced errors like numerical instability.
Tables can be used as a model evaluation store, which stores consolidated results on golden validation datasets across different trained models in your project. They can also be used as model leaderboards, where each row is a model class or architecture with embedded explainability or custom performance charts alongside them. These are both best practices which you can start incorporating with a few lines of code.
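A minimal sketch of logging such a predictions table follows (the column names, labels, and image data are illustrative placeholders):

import numpy as np
import wandb

with wandb.init(project="my_project") as run:
    table = wandb.Table(columns=["image", "label", "prediction", "confidence"])
    for _ in range(8):
        img = np.random.randint(0, 255, (28, 28), dtype=np.uint8)   # placeholder image
        table.add_data(wandb.Image(img), 3, 5, 0.72)                # placeholder values
    run.log({"sample_predictions": table})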



Ability to log supporting evidence per training run, including prediction samples






Artifact Tracking and Versioning

Artifacts are the inputs and outputs of each part of your machine learning pipeline, namely datasets and models. Training datasets change over time as new data is collected, removed, or re-labeled, and models change as new architectures are implemented and continuous retraining takes place. With these changes, all downstream tasks utilizing the changed datasets and models will be affected, and understanding this dependency chain is critical for debugging effectively. W&B can log this dependency graph easily with a few lines of code.
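A minimal sketch of that dependency graph (the file names are illustrative and assumed to exist locally): one run logs a dataset artifact, and a second run declares it as an input before logging a model artifact as its output.

import wandb

with wandb.init(project="my_project", job_type="dataset-build") as run:
    dataset = wandb.Artifact("training-data", type="dataset")
    dataset.add_file("train.csv")               # assumed local file
    run.log_artifact(dataset)

with wandb.init(project="my_project", job_type="training") as run:
    run.use_artifact("training-data:latest")    # records the input dependency
    model = wandb.Artifact("my-model", type="model")
    model.add_file("model.pt")                  # assumed local file
    run.log_artifact(model)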


Ability to log model file itself

Through the usage of Artifacts, model files can easily be logged to W&B, and we've extended Artifacts with a Model Registry (shown below). Please see the docs for more detail on Model Management.
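As a hedged sketch (the file name and registry path are illustrative, and link_artifact assumes a recent SDK version), a model file can be logged as an artifact and linked into the registry:

import wandb

with wandb.init(project="my_project", job_type="training") as run:
    model_art = wandb.Artifact("fraud-model", type="model")
    model_art.add_file("model.pojo.java")                            # assumed local model file
    run.log_artifact(model_art)
    run.link_artifact(model_art, "model-registry/Fraud Detection")   # registered model entry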

[Embedded Model Registry panel]

Fraud Detection LID 62a88bde90d0f2003c6a7bf9
Full name: wandb-smle/model-registry/Fraud Detection LID 62a88bde90d0f2003c6a7bf9
Type: model | Created: July 28th, 2022 | Automations: 1 (test-trigger: on New version, launch a job; created Thu Jan 26 2023, at 08:38 PM)

Model Description
Please see the report https://wandb.ai/wandb-smle/h2o-autoML-classification/reports/AutoML-W-B--VmlldzoyMzkxNDIz#data-overview
The goal of this model is to "predict the potentially fraudulent providers" based on the claims filed by them.

Model Usage
Simple usage of the model to perform batch scoring with the H2O POJO file in the artifact:

!pip install wandb datarobot-drum
import os

import wandb

api = wandb.Api()
model_artifact = api.artifact('wandb-smle/model-registry/Fraud Detection LID 62a88bde90d0f2003c6a7bf9:production', type='model')
model_dir = model_artifact.download("MODEL")
dataset_artifact = api.artifact("wandb-smle/h2o-autoML-classification/test-data:v0", type='data')
dataset_dir = dataset_artifact.download("DATA")
# Keep only the POJO (.java) file for scoring
[os.remove(os.path.join("./MODEL", f)) for f in os.listdir("./MODEL") if ".java" not in f]

os.environ["CODE_DIR"] = "./MODEL"
os.environ["TARGET_TYPE"] = "binary"
os.environ["POSITIVE_CLASS_LABEL"] = "1"
os.environ["NEGATIVE_CLASS_LABEL"] = "0"

!drum score --input "./DATA/test_data.csv" --output "./DATA/test_predictions.csv"

Versions
v0 (aliases: latest, fraud-detection): logged Thu Jul 28 2022 | 14.0MB | 1 consuming run | TTL: Inactive. The version metadata captures the full set of H2O GLM hyperparameters (m.family, m.link, m.solver, m.lambda, m.nfolds, m.model_id, and so on).


Ability to log info about the data used to train the model

Not only are you permitted to log datasets via Artifacts, but you are also able to log supplementary info concerning your data. Imagine the case where a training dataset is curated from three different datasets. We can log the schema to W&B as well as surface it in a report alongside the actual dataset details logged to W&B.
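A minimal sketch (the column names and source dataset names are hypothetical) of attaching schema information to a dataset artifact via its metadata, so it can be browsed alongside the data itself:

import wandb

with wandb.init(project="my_project", job_type="data-curation") as run:
    schema = {"claim_amount": "float", "provider_id": "string", "is_fraud": "int"}
    artifact = wandb.Artifact("processed-data", type="data",
                              metadata={"schema": schema,
                                        "source_datasets": ["claims", "providers", "beneficiaries"]})
    artifact.add_file("processed_data.csv")   # assumed local file
    run.log_artifact(artifact)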

[Embedded Artifact panel]

processed-data
Type: data | Created: July 28th, 2022
Versions: v0 (alias: latest): logged Thu Jul 28 2022 | 3.4MB | 0 consuming runs



Below is a dashboard created with Evidently AI and used to explore the data. Please excuse the messiness; it is a result of very long feature names.

[Embedded panel: run set of 1 run (Evidently AI dashboard)]


Ability to retrieve / export all data logged

You may use either the W&B client or the W&B API to retrieve such data. Click into the Overview tab below and find the detail under Model Usage to see usage via the W&B API. Alternatively, click into the Usage tab below to see usage by way of the W&B client. These differ in that the latter couples the usage / download of the artifact with an experiment, whereas the former does not.
You may also retrieve other details via the API, such as experiment summaries, configurations, and the like. Please see the docs.
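As a hedged sketch (the entity/project path and metric name are illustrative), the public API can export run metadata, configurations, and full metric histories:

import wandb

api = wandb.Api()
for run in api.runs("my_entity/my_project"):
    print(run.name, run.summary.get("metric"))
    history_df = run.history()    # logged metrics as a pandas DataFrame
    config = dict(run.config)     # hyperparameters logged for the run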

[Embedded Model Registry panel: the same Fraud Detection LID 62a88bde90d0f2003c6a7bf9 model card shown above, including the Model Usage code (Overview tab) and version history.]

Reports

W&B reports help contextualize and document the system of record built through logging diagnostics and results from different pieces of your pipeline. Reports are interactive and dynamic, reflecting filtered run sets logged in W&B. You can add all sorts of assets to a report; the one you are reading now includes plots, tables, images, code, and nested reports.

Ability to present model development effort progress

This is exactly the purpose of Reports. You can contextualize experiments and projects, and share reports with colleagues and stakeholders.
