Guide for Managing W&B NextGen Highly Scalable Architecture
In this document, we'll cover the new architecture of W&B, the recommended system provisioning, what each service does, and how to handle issues that may arise in the future.
Weights & Biases has developed a new highly scalable architecture for Foundational Model Builders that allows for logging large numbers of runs, steps, and metrics in a single project. To make this possible, W&B has created multiple microservices that handle the day-to-day needs of this data-intensive system. We will first go over these services and what each one does, both visually and through descriptions.
Weights & Biases Architecture

Weights & Biases Services
wandb-app
This pod houses the front-end of W&B and cannot run in HA mode. It commonly requires a large memory allocation to accommodate retrieving large amounts of data from the back-end.
wandb-api
This pod houses the W&B API that the SDK interacts with. It can run in HA mode, and we recommend at least 3 replicas running at any given time, scheduled on different machines.
wandb-console
This pod houses the W&B Operator console, and aids in managing the W&B deployment and feature-gating certain aspects of the platform. This pod normally does not require much memory or CPU.
wandb-executor
This pod takes data sent to W&B and stored in MySQL/BigTable and converts it into parquet files to be stored in Blob Storage. This is an asynchronous process, and the front end does not read from these files until the metrics are flagged as successfully exported. This pod can run in HA mode, and we commonly recommend at least 3 replicas with large allocations of CPU and memory.
wandb-filestream
This pod processes incoming metrics and objects from the SDK and stores them in MySQL/BigTable. It runs as a single service and mostly consumes memory.
wandb-flat-run-fields-updater
This pod is part of the RunsV2 architecture for W&B, which allows for asynchronous processing of incoming metrics. It reads from the queue housing metrics sent by the SDK and writes them to their "hot" storage (MySQL/BigTable). It can scale horizontally, and we recommend at least 3 replicas running at any time.
wandb-glue
This pod coordinates and runs scheduled tasks. A task can take a number of forms, but mainly involves bulk actions such as data migrations and other housekeeping work.
wandb-parquet
This pod tracks the status of the parquet exports for the W&B installation. CPU/memory usage is commonly low.
wandb-parquet-backfill
This pod backfills older metrics into parquet format. CPU/memory usage is commonly low.
wandb-prometheus-server
This pod houses a Prometheus server that collects metrics for monitoring through the console. It can be disk- and memory-intensive depending on logging configuration.
wandb-stackdriver
This pod handles metric reporting for the console.
wandb-weave
This pod processes query panel queries and powers W&B Weave. It is typically a low-usage service in terms of CPU/memory.
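As a quick sanity check on a running deployment, the sketch below lists these services and their replica counts with kubectl. The `wandb` namespace name is an assumption; substitute the namespace used by your installation.

```bash
# List the W&B service deployments and their replica counts.
# Assumes the installation lives in a namespace called "wandb".
kubectl -n wandb get deployments

# Show which node each pod landed on, to confirm replicas of
# HA services (e.g. wandb-api) are spread across machines.
kubectl -n wandb get pods -o wide
```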
Recommendations
All of these services work together to bring you a highly performant, data-intensive platform for model training. We have some experience running it, since we built it :) so here are some tips and tricks for what to do as the amount of data scales along the way:
Prior to setting any of these, ensure your deployment is configured with proper sizing of its dependencies, infrastructure, and services.
💡
- Enable BigTable Autoscaling
- The more data sent to W&B, the more writes BigTable has to be able to handle, so autoscaling in BigTable is a good idea. We recommend an absolute minimum of 3 nodes, but 10 nodes may be a good steady state if data intensity is high (see the gcloud sketch after this list).
- Enable Autoscaling for the following W&B Services (see the example after this list):
- flat-run-fields-updater
- filestream
- Enable GCS FUSE
- Make sure the `enable_gcs_fuse_csi_driver` variable is set to `true` in the Terraform configuration
- Upgrade the versions in the channel config and release the image
- `"operator-wandb"` to `"0.22.3"
- server image to `0.65.0` or later
- Once the app has started, add the following to the `parquet` config:
"fuse": {"enabled": true,"resources": {"limits": {"cpu": "0","memory": "0","ephemeral-storage": "10Gi"},"requests": {"cpu": "0","memory": "0"}}},
4. Ensure the MySQL DB flags are set as follows

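The exact flag values should be confirmed with W&B. As a mechanism-only sketch, assuming the MySQL database runs on Cloud SQL, flags can be applied like this; the instance name and flag names below are placeholders, not the actual values.

```bash
# Apply MySQL flags on a Cloud SQL instance (placeholder instance/flag names;
# use the flag list provided by W&B). Note: patching flags may restart the DB.
gcloud sql instances patch wandb-mysql \
  --database-flags=<FLAG_1>=<VALUE_1>,<FLAG_2>=<VALUE_2>
```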
6. Set the following in Redis:
- `set bigtablev3.filestream.metric_points_chunk_size 20000`
- `set bigtablev3.filestream.metric_points_parallelization 4`
- `set bigtablev3.filestream.write_cache_size 10000000`
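These keys can be applied and verified with redis-cli against the Redis instance used by W&B; the host, port, and any auth details below are placeholders for your environment.

```bash
# Apply the filestream tuning keys on the W&B Redis instance.
# Host/port are placeholders; add authentication as needed.
redis-cli -h <redis-host> -p 6379 SET bigtablev3.filestream.metric_points_chunk_size 20000
redis-cli -h <redis-host> -p 6379 SET bigtablev3.filestream.metric_points_parallelization 4
redis-cli -h <redis-host> -p 6379 SET bigtablev3.filestream.write_cache_size 10000000

# Verify a value was picked up.
redis-cli -h <redis-host> -p 6379 GET bigtablev3.filestream.write_cache_size
```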
7. Contact W&B to set up proper monitoring and alerting on the many metrics the platform exposes by configuring OTel tracing. W&B can provide pre-made Datadog dashboards to help monitor the service.
Runbook
- If the number of metrics being sent to W&B increases dramatically, this may cause a backup in the queue; you can verify this through monitoring.
- If this happens, wait for BigTable autoscaling to kick in (or scale BigTable manually) and monitor the situation.
- If there is a large queue backup from metrics being received, scale (or autoscale) the flat-run-fields-updater service to handle more metric processing in parallel (see the sketch after this list).
- If a single run sends an overwhelming amount of load and slows down metrics processing for other runs, it can be ignored by the metrics processing service by setting the following in Redis: `set gorilla.DiscardV2UpdatesForRun <id-for-run>`
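A hedged sketch of the remediation steps above; the namespace, deployment name, replica count, and Redis host are placeholders for your environment.

```bash
# Scale out the flat-run-fields-updater to drain a queue backup faster.
kubectl -n wandb scale deployment wandb-flat-run-fields-updater --replicas=6

# Tell the metrics-processing service to ignore a single overwhelming run.
redis-cli -h <redis-host> SET gorilla.DiscardV2UpdatesForRun <id-for-run>
```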