GPU Utilization Reporting

Comprehensive guide for reporting and visualizing GPU utilization for jobs on the HPC system

This guide shows how to configure, collect, and visualize GPU utilization metrics for jobs running on your HPC cluster. Once enabled, you will be able to see each job's GPU utilization over the full length of the job in the Admin panel.

Prerequisites

  • Prometheus installed and enabled
  • NVIDIA DCGM Exporter
  • NVIDIA GPUs
  • Slurm workload manager
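
A quick way to confirm these prerequisites on a GPU node is sketched below; the service names and the Prometheus address are assumptions, so adjust them to your site:

# List the GPUs that DCGM will monitor
nvidia-smi -L

# Check that the exporter and Slurm daemons are active (names may differ)
systemctl is-active dcgm-exporter slurmd

# Check that Prometheus is up and ready to scrape (address is an assumption)
curl -s http://prometheus:9090/-/ready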

Configuration Steps

1. Enable HPC Job Details in DCGM-Exporter

First, update the DCGM Exporter service to include job mapping details:

# Edit the systemd service file
sudo vim /etc/systemd/system/dcgm-exporter.service

Use the following configuration:

[Unit]
Description=NVIDIA Data Center GPU Manager Exporter
Wants=network-online.target
After=network-online.target

[Service]
Environment="DCGM_HPC_JOB_MAPPING_DIR=/var/run/dcgm_job_maps"
User=node_exporter
Group=node_exporter
Type=simple
ExecStartPre=/bin/bash -c 'mkdir -p "$DCGM_HPC_JOB_MAPPING_DIR" && chmod 775 "$DCGM_HPC_JOB_MAPPING_DIR"; for i in $(seq 0 $(( $(nvidia-smi -L | wc -l) - 1 ))); do FILE="$DCGM_HPC_JOB_MAPPING_DIR/$i"; touch "$FILE" && chmod 666 "$FILE"; [ -s "$FILE" ] || echo 0 > "$FILE"; done'
ExecStart=/usr/local/bin/dcgm_exporter -d f

[Install]
WantedBy=multi-user.target
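
For reference, the dense ExecStartPre one-liner above expands to roughly the following steps (shown here only for readability):

# Create the job mapping directory the exporter reads job IDs from
mkdir -p "$DCGM_HPC_JOB_MAPPING_DIR"
chmod 775 "$DCGM_HPC_JOB_MAPPING_DIR"

# Create one world-writable mapping file per GPU, seeded with "0" (no job)
GPU_COUNT=$(nvidia-smi -L | wc -l)
for i in $(seq 0 $(( GPU_COUNT - 1 ))); do
    FILE="$DCGM_HPC_JOB_MAPPING_DIR/$i"
    touch "$FILE" && chmod 666 "$FILE"
    [ -s "$FILE" ] || echo 0 > "$FILE"
done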

Reload and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart dcgm-exporter

Verify Your Environment:

Ensure the directories and executables referenced in the service file exist with the proper permissions before proceeding.
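
For example, a minimal check might look like this (port 9400 is dcgm-exporter's default and an assumption if you changed it):

# The binary and mapping directory referenced in the unit file should exist
ls -ld /usr/local/bin/dcgm_exporter /var/run/dcgm_job_maps

# The exporter should be serving DCGM metrics on its default port
curl -s http://localhost:9400/metrics | grep -c '^DCGM_'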

2. Create Required Directories and Scripts

Create the job mapping directory on each node:

sudo mkdir -p /var/run/dcgm_job_maps
sudo chmod 775 /var/run/dcgm_job_maps

Create Slurm prolog and epilog scripts to track GPU allocation: the prolog writes the job ID into each allocated GPU's mapping file when a job starts, and the epilog resets the file to 0 when the job ends.

Epilog Script (/usr/local/bin/_dev_epilog):

#!/bin/bash
# Reset the mapping file of each GPU released by the job back to "0" (no job).
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "0" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi

Prolog Script (/usr/local/bin/_dev_prolog):

#!/bin/bash
# Record the Slurm job ID in the mapping file of each GPU allocated to the job.
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "$SLURM_JOB_ID" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi

Set proper permissions:

sudo chmod +x /usr/local/bin/_dev_epilog
sudo chmod +x /usr/local/bin/_dev_prolog
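
Before wiring the scripts into Slurm, you can exercise them by hand; the GPU index and job ID below are made-up values:

# Simulate a job start that was allocated GPU 0
SLURM_JOB_GPUS=0 SLURM_JOB_ID=12345 /usr/local/bin/_dev_prolog
cat /var/run/dcgm_job_maps/0    # should print 12345

# Simulate the same job finishing
SLURM_JOB_GPUS=0 /usr/local/bin/_dev_epilog
cat /var/run/dcgm_job_maps/0    # should print 0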

3. Configure Slurm to Use the Scripts

Add the following to your slurm.conf:

TaskEpilog=/usr/local/bin/_dev_epilog
TaskProlog=/usr/local/bin/_dev_prolog

Restart Slurm services:

sudo systemctl restart slurmctld
sudo systemctl restart slurmd
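
You can confirm that Slurm picked up the scripts with scontrol:

# Both paths should appear in the running configuration
scontrol show config | grep -Ei 'task(prolog|epilog)'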

Verification

Once configured correctly, you should see the hpc_job label in your Prometheus metrics:

DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.154.05", Hostname="g002", UUID="GPU-4d5e36c6-c835-c5a2-65dc-833267ebf851", device="nvidia0", gpu="0", hpc_job="482", instance="192.168.1.2:9400", job="dcgm_exporter", modelName="NVIDIA GeForce RTX 2080 Ti", pci_bus_id="00000000:01:00.0"}
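
With the label in place, per-job utilization can be queried directly, for example through the Prometheus HTTP API (the Prometheus address below is an assumption):

# Average GPU utilization per Slurm job, across all GPUs mapped to it
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=avg by (hpc_job) (DCGM_FI_DEV_GPU_UTIL)'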

Troubleshooting

Common Issues

  1. Missing job labels: Ensure your epilog/prolog scripts have proper permissions and are being executed

  2. Zero utilization reported: Check that the GPU is properly allocated to the job in Slurm (see the check after this list)

  3. Service fails to start: Verify the paths in your service file and that the user has appropriate permissions
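
For the second issue, you can inspect the allocation Slurm recorded for the job (482 is the example job ID from the metric above):

# Show the detailed allocation, including GRES/GPU indices
scontrol show job -d 482 | grep -i gres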

Debug Commands

Check if DCGM is running:

sudo systemctl status dcgm-exporter

View DCGM logs:

sudo journalctl -u dcgm-exporter

Verify job mapping files:

ls -la /var/run/dcgm_job_maps/
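
To see which job is currently mapped to a particular GPU (GPU 0 shown here), inspect the file's contents; a value of 0 means no job is using it:

cat /var/run/dcgm_job_maps/0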