Monitoring GPU cluster performance with NVIDIA DCGM-Exporter and Weights & Biases
[DRAFT] A guide to consuming GPU cluster system metrics exposed by NVIDIA DCGM-Exporter with Weights & Biases.
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. DCGM lets users gather GPU metrics, understand workload behavior, and monitor GPU performance across a cluster. The DCGM-Exporter tool exposes GPU metrics at an HTTP endpoint (/metrics) in the OpenMetrics format, the de facto standard for transmitting cloud-native metrics at scale. DCGM-Exporter can be deployed in a variety of environments, for example in Kubernetes.
The Weights & Biases (W&B) SDK ships with an OpenMetrics feature that lets users capture and log metrics from external endpoints exposing OpenMetrics / Prometheus-compatible data, with custom regex-based filters applied to the consumed metrics.
Configuring W&B SDK to consume DCGM-Exporter-exposed metrics
- The user can configure the integration either with environment variables or with settings passed to the wandb.init function (a minimal environment-variable sketch follows this list).
- The user can define OM/P endpoints to scrape (x_stats_open_metrics_endpoints) in the following format: {"open-metrics-endpoint-name": "<url>"}
- Optionally, regex-based filters (x_stats_open_metrics_filters) can be applied to the consumed metrics, in either of the following formats:
- {"metric-regex-pattern-including-endpoint-name-as-prefix": {"label": "label value regex pattern", ...}, ...}
- ("metric-regex-pattern-including-endpoint-name-as-prefix", ...)
- The sampling interval (x_stats_sampling_interval) controls how often wandb's system monitor scrapes the user-defined OM/P endpoints. Older SDK versions instead exposed _stats_sample_rate_seconds (defaulting to 2 seconds) together with _stats_samples_to_average (defaulting to 15), which by default yields a data point every 30 seconds.
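As an illustration of the environment-variable route mentioned above, here is a minimal, hypothetical Python sketch that sets the variables programmatically before calling wandb.init (the run.sh example later in this guide does the same from a shell script). The endpoint address is an assumption and should be replaced with your DCGM-Exporter instance:

import json
import os

import wandb

# Hypothetical single-node endpoint; replace with the address of your DCGM-Exporter.
os.environ["WANDB_X_STATS_OPEN_METRICS_ENDPOINTS"] = json.dumps(
    {"node1": "http://localhost:9400/metrics"}
)
# Keep only GPU temperature and power draw, for any GPU index.
os.environ["WANDB_X_STATS_OPEN_METRICS_FILTERS"] = json.dumps(
    {"node1.DCGM_FI_DEV_(GPU_TEMP|POWER_USAGE)": {"gpu": ".*"}}
)

# The variables must be set before wandb.init is called.
run = wandb.init(project="dcgm")
# ... training code ...
run.finish()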
Simple Example
Let's look at an example Python script:
import time

import tqdm
import wandb

run = wandb.init(
    project="dcgm",
    settings=wandb.Settings(
        x_stats_open_metrics_endpoints={
            "node1": "http://192.168.0.1:9400/metrics",  # ensure this is the metrics endpoint
            "node2": "http://192.168.0.2:9400/metrics",
        },
        x_stats_open_metrics_filters={
            "node1.DCGM_FI_DEV_(POWER_USAGE|MEM_COPY_UTIL|TOTAL_ENERGY_CONSUMPTION|GPU_TEMP|MEMORY_TEMP)": {
                "gpu": "[0,1]",
            },
            "node2.DCGM_FI_DEV_(POWER_USAGE|MEM_COPY_UTIL|TOTAL_ENERGY_CONSUMPTION|GPU_TEMP|MEMORY_TEMP)": {
                "gpu": ".*",
            },
        },
        # optional headers in case, for example, the endpoints sit behind a proxy requiring authentication
        # x_stats_open_metrics_http_headers={"Authorization": "Bearer MEDVED"},
    ),
)

for i in tqdm.tqdm(range(300)):
    time.sleep(1)
    run.log({"loss": 1.0 / (i + 1)})

run.finish()
- The wandb SDK will consume the two OpenMetrics endpoints defined by the x_stats_open_metrics_endpoints setting.
- The keys of the dictionary provided by the user will be used to namespace the scraped metrics in the app (see below for example screenshots).
- In the System section on the Run page in the app, the user will see 5 plots (the filtered metrics, see below) per endpoint (10 in total).
- The x_stats_open_metrics_filters setting defines the filters applied to the consumed data: only five metrics per endpoint will be saved to wandb (DCGM_FI_DEV_POWER_USAGE, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, DCGM_FI_DEV_GPU_TEMP, and DCGM_FI_DEV_MEMORY_TEMP). For the first endpoint, only the metrics for GPUs 0 and 1 will be streamed, while for the second endpoint, data for all GPUs will be streamed.
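As noted in the configuration section, the filters can also be passed as a plain sequence of regex patterns when no per-label filtering is needed. A minimal sketch of that form, reusing the hypothetical node1 endpoint from the example above:

import wandb

run = wandb.init(
    project="dcgm",
    settings=wandb.Settings(
        x_stats_open_metrics_endpoints={
            "node1": "http://192.168.0.1:9400/metrics",
        },
        # Sequence form: keep every metric matching any of these patterns,
        # with no additional filtering on labels such as "gpu".
        x_stats_open_metrics_filters=(
            "node1.DCGM_FI_DEV_GPU_TEMP",
            "node1.DCGM_FI_DEV_POWER_USAGE",
        ),
    ),
)
run.finish()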
Notes
- On certain clusters, if the user wants to scrape data from the endpoint running on the same node, they can't use the node name or its IP address (due to the networking setup); they must use http://localhost:9400/metrics as the endpoint URL instead.
- The DCGM-Exporter can output a lot of data. The following five metrics are some of the most useful for tracking NVIDIA GPU performance in a cluster (a quick way to inspect the raw endpoint output is sketched after this list):
- DCGM_FI_DEV_POWER_USAGE: Power usage for the device in Watts
- DCGM_FI_DEV_MEM_COPY_UTIL: Memory utilization (in %)
- DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION: Total energy consumption for the GPU in mJ since the driver was last reloaded
- DCGM_FI_DEV_GPU_TEMP: Current temperature readings for the device, in degrees C
- DCGM_FI_DEV_MEMORY_TEMP: Memory temperature for the device, in degrees C
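To get a feel for the raw data before wiring it into wandb, it can help to fetch the endpoint directly and look at these metrics. A minimal sketch using only the Python standard library, assuming the exporter is reachable at localhost:9400 (adjust the URL to your setup):

import re
import urllib.request

URL = "http://localhost:9400/metrics"  # hypothetical endpoint address

with urllib.request.urlopen(URL, timeout=5) as response:
    text = response.read().decode()

# Keep only the five metrics discussed above; each exposed line looks roughly like
# DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="...",...} 43
pattern = re.compile(
    r"^DCGM_FI_DEV_(?:POWER_USAGE|MEM_COPY_UTIL|TOTAL_ENERGY_CONSUMPTION|GPU_TEMP|MEMORY_TEMP)\{.*",
    re.MULTILINE,
)
for line in pattern.findall(text):
    print(line)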
SLURM + NVIDIA DCGM-Exporter Multi-Node Example
In this example, we will show how to configure wandb to consume DCGM-Exporter-exposed metrics in a GPU cluster managed with SLURM.
We will parse several SLURM-related environment variables inside a Python script and use them to construct the endpoints to be scraped. This example emulates a situation where the user wants to initialize a wandb run only on the RANK=0 node, but still capture GPU-related metrics for all the relevant nodes.
srun --nodes=2 --gpus=4 --cpus-per-gpu=12 --job-name=wandb python3 /<path>/<to>/srun-dcgm-multinode.py
srun-dcgm-multinode.py
import os
import re
import time
from typing import List

import tqdm
import wandb


def unpack_node_list(regex_string: str) -> List[str]:
    """Unpack a SLURM node list string (e.g. "node[1-3,5]") into a list of node names."""
    pattern = re.compile(r"\[([\d,-]+)\]")
    match = pattern.search(regex_string)
    if match:
        segments = match.group(1).split(",")
        unpacked_strings = []
        for segment in segments:
            if "-" in segment:
                start_number, end_number = [int(x) for x in segment.split("-")]
                for number in range(start_number, end_number + 1):
                    unpacked_strings.append(
                        regex_string[: match.start()] + str(number) + regex_string[match.end():]
                    )
            else:
                unpacked_strings.append(
                    regex_string[: match.start()] + segment + regex_string[match.end():]
                )
        return unpacked_strings
    return [regex_string]


def main() -> None:
    # simulate creating a wandb run only on the rank 0 node
    if os.environ.get("SLURM_NODEID") != "0":
        return

    node_list = unpack_node_list(os.environ["SLURM_NODELIST"])
    # drop this node: it is scraped via localhost (see note above)
    node_list = [node for node in node_list if node != os.environ["SLURMD_NODENAME"]]

    run = wandb.init(
        project="dcgm",
        settings=wandb.Settings(
            x_stats_sampling_interval=1,
            x_stats_open_metrics_endpoints={
                **{
                    "node1": "http://localhost:9400/metrics",  # see note above
                },
                **{
                    f"node{n + 2}": f"http://{node}:9400/metrics"
                    for n, node in enumerate(node_list)
                },
            },
            x_stats_open_metrics_filters={
                ".*DCGM_FI_DEV_(POWER_USAGE|MEM_COPY_UTIL|TOTAL_ENERGY_CONSUMPTION|GPU_TEMP|MEMORY_TEMP)": {
                    "gpu": ".*",
                },
            },
        ),
    )

    for i in tqdm.tqdm(range(300)):
        time.sleep(1)
        run.log({"loss": 1 / (i + 1)})

    run.finish()


if __name__ == "__main__":
    main()
Here, we've configured the wandb SDK's system monitor to sample system metrics every second (x_stats_sampling_interval=1; the default is 10 seconds).
Collected metrics
The collected metrics will look like the panel shown below. The user can aggregate them to track, for example, resource utilization at a glance, and dig into the details when necessary.
[W&B panel: system metrics from run dry-meadow-24]
SLURM + NVIDIA DCGM-Exporter Single Node Example
In this example, we will set up environment variables in a shell script (propagating the CUDA_VISIBLE_DEVICES variable into the filters so that only the metrics for the visible GPUs are captured) and execute a simple Python script. The wandb SDK will consume the metrics reported by the NVIDIA DCGM-Exporter running on the same node.
srun --nodes=1 --gpus=2 --cpus-per-gpu=12 --job-name=wandb /admin/home-dimaduev/run.sh
run.sh
#!/usr/bin/bash

export WANDB_X_STATS_OPEN_METRICS_ENDPOINTS='{"node1": "http://localhost:9400/metrics"}'

WANDB_X_STATS_OPEN_METRICS_FILTERS='{"node1.DCGM_FI_DEV_(POWER_USAGE|MEM_COPY_UTIL|TOTAL_ENERGY_CONSUMPTION|GPU_TEMP|MEMORY_TEMP)": {"gpu": "[__CUDA_VISIBLE_DEVICES__]"}}'
# substitute the visible GPU indices into the label filter
export WANDB_X_STATS_OPEN_METRICS_FILTERS=$(echo "$WANDB_X_STATS_OPEN_METRICS_FILTERS" | sed "s/__CUDA_VISIBLE_DEVICES__/$CUDA_VISIBLE_DEVICES/")

export WANDB_X_STATS_SAMPLING_INTERVAL=15

/usr/bin/python3 srun-dcgm.py
Again, the collected metrics will look like the panel shown below, and can be aggregated or inspected in detail as needed.
[W&B panel: system metrics from run dry-meadow-24]
srun-dcgm.py
import time

import tqdm
import wandb


def main() -> None:
    run = wandb.init(project="dcgm")

    for i in tqdm.tqdm(range(300)):
        time.sleep(1)
        run.log({"x": i})

    run.finish()


if __name__ == "__main__":
    main()
Deploying DCGM-Exporter in Google Kubernetes Engine (GKE) GPU cluster
💡 TL;DR:
- Create a GKE cluster with two nodes, each with two NVIDIA Tesla T4 GPUs.
- Deploy a GPU monitoring system that uses NVIDIA DCGM; the collected metrics are exposed by the NVIDIA DCGM-Exporter.
- Deploy a pod running a GPU load test.
- Create a ClusterIP service exposing the DCGM-Exporter's metrics endpoint.
- Create a pod to scrape the metrics.
Ensure gcloud is configured and that you have the permissions required to spin up a GKE cluster with GPU resources.
export ZONE="us-central1-f"
export CLUSTER_NAME="gke-dcgm"
Create the cluster and get its credentials:
gcloud beta container clusters create $CLUSTER_NAME \
    --zone $ZONE \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=2 \
    --num-nodes=2 \
    --enable-managed-prometheus

gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE
Install GPU drivers and wait for the cluster to be up and running:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
kubectl get pods -n kube-system | grep nvidia-gpu-device-plugin
Configure the cluster and deploy resources to it, including a GPU monitoring system that uses NVIDIA DCGM and the DCGM-Exporter deployed as a DaemonSet:
kubectl create namespace gpu-monitoring-system
kubectl apply -f dcgm.yml  # see below for the file content
Run a GPU load test:
kubectl apply -f dcgm_loadtest.yml # see below for the file content
Create a ClusterIP service that exposes the DCGM exporter pod's endpoint:
kubectl apply -f clusterip_service.yml # see below for the file content
From another pod in the cluster (in the same namespace), use the service name and port to reach the exposed endpoint:
kubectl apply -f ubuntu.yml  # see below for the file content
kubectl get pods --all-namespaces
kubectl exec -it --namespace=gpu-monitoring-system ubuntu -- /bin/bash

# inside the ubuntu pod:
apt update && apt install curl
curl http://dcgm-exporter-svc:9400/metrics
Config files
dcgm.yml (modified from https://github.com/suffiank/dcgm-on-gke)
dcgm_loadtest.yml (from https://github.com/suffiank/dcgm-on-gke)
clusterip_service.yml
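The original clusterip_service.yml is not reproduced above; below is a minimal sketch of what such a service could look like. The selector label (app: dcgm-exporter) is an assumption and must match whatever labels the DCGM-Exporter pods from dcgm.yml actually carry; the service name and port match the curl command used earlier:

apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter-svc           # hostname used in the curl command above
  namespace: gpu-monitoring-system
spec:
  type: ClusterIP
  selector:
    app: dcgm-exporter              # assumption: must match the DCGM-Exporter pod labels
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400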
ubuntu.yml
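Again, only a minimal sketch: a throwaway Ubuntu pod in the same namespace that sleeps indefinitely so you can exec into it and curl the service:

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  namespace: gpu-monitoring-system
spec:
  containers:
    - name: ubuntu
      image: ubuntu:22.04
      # keep the container alive so that `kubectl exec` into it is possible
      command: ["sleep", "infinity"]
  restartPolicy: Never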