
Federated Learning with Weights & Biases

This report details some ways W&B can be used to support Federated Learning.

What is Federated Learning and When is it Useful?

Federated learning is a machine learning technique in which multiple clients collaboratively train a model: each client trains a local model on its own local dataset and then shares the resulting weights to enable collaboration, typically over many rounds.
Federated learning is useful in scenarios where data may be stored in different locations and where it may be impractical (e.g., huge constantly changing datasets) or illegal (e.g., due to GDPR) to centralize the dataset for traditional machine learning.
For example, consider a bank with operations across the world. Due to the GDPR, it may not be possible for the bank to transfer its EU data to the USA for training models, and the inverse may also be true. As we consider more and more countries the bank operates in, it becomes clear that there are many siloed datasets that cannot be centralized, because no dataset can leave the silo in which it lives. And yet deep learning models benefit from larger datasets.
This is precisely where federated learning can be useful. In a federated learning setup, a model would be trained within each silo, and the weights of each model would be aggregated together. By doing so, each model is trained on data which never leaves the silo, but by aggregating the weights of the local model with other local models, a form of collaborative learning takes place.
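To make the aggregation step concrete, here is a minimal sketch of a federated-averaging style update over NumPy weight arrays. The client weights and sample counts are hypothetical placeholders, not part of any particular framework.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate per-client model weights into a single global model.

    client_weights: list of per-client weight lists (one np.ndarray per layer)
    client_sizes:   number of local training examples per client,
                    used to weight the average
    """
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    global_weights = []
    for layer in range(num_layers):
        # Weighted sum of this layer's weights across all clients
        layer_avg = sum(
            (size / total) * weights[layer]
            for weights, size in zip(client_weights, client_sizes)
        )
        global_weights.append(layer_avg)
    return global_weights

# Hypothetical round with three clients holding different amounts of data
clients = [[np.random.randn(4, 2), np.random.randn(2)] for _ in range(3)]
sizes = [1200, 450, 3300]
global_model = federated_average(clients, sizes)
```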
TL;DR: Federated learning permits learning from large, distributed datasets in a privacy-preserving manner when technical or legal constraints make it impossible to centralize the data for non-federated machine/deep learning.

An example of a federated learning scenario. Green lines represent bi-directional communication.

How W&B Helps Federated Learning

While federated learning offers a way to solve important problems where data is siloed but collaborative learning would produce better models, it combines the challenges of distributed computing with those of machine learning.

Challenge 1: Observability

Federated learning suffers from an observability problem. Models run on distributed, remote hardware against local datasets, where access may be difficult if not impossible, and a single federated learning experiment may involve anywhere from two to thousands, if not hundreds of thousands, of models.
How can the performance of thousands of models be effectively tracked, organized and analyzed? W&B provides a centralized system of record where each experiment running on each federated client is able to log relevant performance metrics back to W&B, where these can be interrogated individually or as part of the larger experiment.
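As a rough sketch of what this can look like in code (the project, group, and metric names below are hypothetical), each federated client can open its own W&B run and use the group argument so that all runs from one federated experiment roll up together in the UI:

```python
import wandb

def train_client(client_id: int, num_rounds: int = 5):
    # One W&B run per federated client; `group` ties all clients of this
    # experiment together so they can be compared side by side.
    run = wandb.init(
        project="federated-demo",        # hypothetical project name
        group="fed-experiment-001",      # one group per federated experiment
        job_type="client",
        name=f"client-{client_id}",
        config={"client_id": client_id, "local_epochs": 1, "lr": 0.01},
    )
    for rnd in range(num_rounds):
        local_loss = 1.0 / (rnd + 1)     # placeholder for real local training
        run.log({"round": rnd, "local_loss": local_loss})
    run.finish()

train_client(client_id=0)
```

The federated server can open its own run in the same group with job_type="server", so global metrics sit alongside the per-client runs.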



Hardware heterogeneity brings another challenge for observability. In federated learning experiments, clients may be using different hardware. W&B system metric logging can help diagnose, debug, and improve federated learning experiments to make better use of the heterogeneous hardware in your federation.



Challenge 2: Huge Number of Artifacts per Experiment

W&B Artifacts keeps track of your datasets, models, dependencies, and results through each step of your machine learning pipeline, providing a complete and auditable history of changes to your files.
A regular machine learning experiment produces a number of artifacts that are helpful to track with W&B. For example, a single non-federated experiment may have a single dataset used to train a single model, producing a single set of performance metrics. In a federated learning setting, however, these three artifacts multiply by the number of clients in the experiment: a simplified non-federated experiment may have 3 artifacts being tracked, while the equivalent federated experiment with 1,000 clients would have 3,000.
W&B provides different ways of storing your artifacts. If privacy and security matter, each client can connect to its own object store (e.g., S3, GCS) for tracking artifacts so as to remain in control of the data, or reference artifacts can be used. Reference artifacts prevent data from leaving your systems entirely and work with local filesystems too, while metadata about the artifacts, such as URIs, sizes, and checksums, can still be tracked. This provides many of the benefits of artifact tracking while maintaining privacy.
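For instance, a client can track its local dataset as a reference artifact so that only metadata (URI, size, checksums) is sent to W&B while the data itself stays in the client's own bucket or filesystem. The project name, artifact name, and bucket path below are hypothetical:

```python
import wandb

run = wandb.init(project="federated-demo", job_type="client")  # hypothetical project

# Reference artifact: W&B records the URI, size, and checksums,
# but the underlying files never leave the client's own storage.
dataset = wandb.Artifact("client_0_dataset", type="dataset")
dataset.add_reference("s3://my-bank-eu-bucket/client_0/train/")  # hypothetical bucket

run.log_artifact(dataset)
run.finish()
```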
In the artifact lineage view of the global federated model below (specifically version 21 of it), change the style in the dropdown menu to "Complete". From this view we can see how it fits into the wider federated learning pipeline: the federated learning server run (zany-dew-122) that produced it, the client runs (comic-bee-125, fresh-breeze-124) that trained with that server (connecting via the "connectivity" artifact conn_details:v97), each of the datasets they trained on (e.g., client_dataset:v267), and their resulting local models (e.g., client_model:v205)!

Direct lineage view: the server run (skilled-sweep-3), its connectivity artifact (conn_details:v24), three client runs (smart-star-21, autumn-resonance-20, soft-hill-19), the datasets they trained on (client_dataset:v73, v74, v75), and their resulting local models (client_model:v42, v43, v44).



Challenge 3: Hyperparameter Optimisation

A regular machine learning experiment typically has a single model that is tuned via hyperparameter optimization. Following the trends above, it should be unsurprising that this is more challenging in federated learning, given a single experiment will have many models to tune. Thankfully, however, W&B makes it possible to conduct hyperparameter sweeps easily in federated learning, where the W&B Sweep controller distributes the hyperparameters for each trial to the federated server, which in turn communicates them to each client involved in training.
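One hedged sketch of how this could be wired up: define a sweep over the hyperparameters you want tuned, let the W&B agent start the federated server run for each trial, and have the server read the chosen values from its run config before broadcasting them to the clients. The parameter names, project name, and the start_federated_round stub below are illustrative assumptions, not a real federated API.

```python
import wandb

def start_federated_round(lr, local_epochs):
    # Placeholder for the actual federated training loop: in a real setup the
    # server would send lr/local_epochs to each client, aggregate their models,
    # and return the global validation accuracy.
    return 0.5

sweep_config = {
    "method": "bayes",
    "metric": {"name": "global_val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-4, "max": 1e-1},
        "local_epochs": {"values": [1, 2, 5]},
    },
}

def run_federated_server():
    run = wandb.init(job_type="server")
    # Hyperparameters chosen by the sweep controller for this trial
    lr = run.config.learning_rate
    local_epochs = run.config.local_epochs

    accuracy = start_federated_round(lr=lr, local_epochs=local_epochs)

    run.log({"global_val_accuracy": accuracy})
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="federated-demo")  # hypothetical project
wandb.agent(sweep_id, function=run_federated_server, project="federated-demo", count=20)
```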
With W&B Sweeps we can gain insight into parameter importance both globally and locally, and see how it varies across clients, a key value-add in federated learning where data is non-IID.





Challenge 4: Increased Manual Processes

As mentioned earlier in this report, federated learning inherits the challenges of two notoriously difficult fields, distributed computing and machine learning.
If we were to consider running a single federated learning experiment, this may involve starting a federated server on one node and then spinning up many clients to take part in training, each of which could be running on its own node. As the number of clients increases, so does the burden of coordinating training runs. While there are many ad hoc ways of approaching this problem, they tend to be hacky, unstable, and difficult to maintain and scale.
W&B provides two tools to solve this problem. With W&B Launch we are able to spin up jobs on different infrastructure, automatically building the required environment and executing the training runs, all controllable from the W&B UI or command line interface. Further, with W&B Automations, we can, for example, automatically spin up each client across different infrastructure whenever a federated learning server is launched. This reduces the manual burden of coordinating the launch of servers and clients across potentially heterogeneous hardware.
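One way to picture the trigger side of this: the server run publishes a connectivity artifact (as in the lineage view above), and an Automation configured in the W&B UI to fire on new versions of that artifact can enqueue the client Launch jobs. The sketch below shows only the server side; the Automation and Launch queue themselves are set up in the UI, and the project name, file name, and server address are hypothetical.

```python
import json
import wandb

run = wandb.init(project="federated-demo", job_type="server")  # hypothetical project

# Publish the server's connection details as an artifact. An Automation
# configured in the W&B UI to trigger on new versions of "conn_details"
# can then launch the client jobs on their respective infrastructure.
with open("conn_details.json", "w") as f:
    json.dump({"server_address": "0.0.0.0:8080"}, f)

conn = wandb.Artifact("conn_details", type="connectivity")
conn.add_file("conn_details.json")
run.log_artifact(conn)

# ... federated training rounds happen here ...
run.finish()
```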

This overview provides an example setup where W&B Launch can be used to deploy federated learning across different infrastructure effortlessly and, via W&B Automations, spin up clients to begin training (across infrastructure) automatically.

Conclusion

Thanks for reading our report about federated learning and how W&B can help make it a whole lot easier. If you're interested in checking out W&B, it's always free to try. And feel free to peruse our docs if you're looking for a little more information about anything in this post.






