GPU Utilization Reporting
Comprehensive guide for reporting and visualizing GPU utilization for jobs on the HPC system
This guide shows how to configure, collect, and visualize GPU utilization metrics for jobs running on your HPC cluster. Once this is enabled, you will be able to see the utilization of GPUs over the entire length of the job in the Admin panel.
Prerequisites
- Prometheus installed and enabled
- NVIDIA DCGM Exporter
- NVIDIA GPUs
- Slurm workload manager
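A quick way to confirm these on a GPU node before starting (service and binary names can vary between systems):
# GPUs visible to the NVIDIA driver
nvidia-smi -L
# Slurm version in use
scontrol --version
# Prometheus running (the service name may differ on your system)
systemctl status prometheus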
Configuration Steps
1. Enable HPC Job Details in DCGM-Exporter
First, update the DCGM Exporter service to include job mapping details:
# Edit the systemd service file
sudo vim /etc/systemd/system/dcgm-exporter.service
Use the following configuration:
[Unit]
Description=Nvidia Data Center GPU Manager Exporter
Wants=network-online.target
After=network-online.target
[Service]
Environment="DCGM_HPC_JOB_MAPPING_DIR=/var/run/dcgm_job_maps"
User=node_exporter
Group=node_exporter
Type=simple
ExecStartPre=/bin/bash -c 'mkdir -p "$DCGM_HPC_JOB_MAPPING_DIR" && chmod 775 "$DCGM_HPC_JOB_MAPPING_DIR"; for i in $(seq 0 $(( $(nvidia-smi -L | wc -l) - 1 ))); do FILE="$DCGM_HPC_JOB_MAPPING_DIR/$i"; touch "$FILE" && chmod 666 "$FILE"; [ -s "$FILE" ] || echo 0 > "$FILE"; done'
ExecStart=/usr/local/bin/dcgm_exporter -d f
[Install]
WantedBy=multi-user.target
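The ExecStartPre one-liner is dense, so here is the same logic expanded into a readable, commented form purely for illustration (MAPDIR and NUM_GPUS are local names used only in this sketch; you do not need to create this as a separate file):
#!/bin/bash
MAPDIR="${DCGM_HPC_JOB_MAPPING_DIR:-/var/run/dcgm_job_maps}"

# Create the mapping directory and make it group-writable
mkdir -p "$MAPDIR" && chmod 775 "$MAPDIR"

# Count the GPUs on this node and create one mapping file per GPU index
NUM_GPUS=$(nvidia-smi -L | wc -l)
for i in $(seq 0 $(( NUM_GPUS - 1 ))); do
    FILE="$MAPDIR/$i"
    touch "$FILE" && chmod 666 "$FILE"
    # Initialize empty files with "0", meaning "no job on this GPU"
    [ -s "$FILE" ] || echo 0 > "$FILE"
done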
Reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart dcgm-exporter
Verify Your Environment:
Ensure your directories and executables exist with proper permissions before proceeding.
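For example, on a GPU node you can confirm that the pieces the service relies on are in place (9400 is the exporter's default port; adjust if yours differs):
ls -ld /var/run/dcgm_job_maps
ls -l /usr/local/bin/dcgm_exporter
curl -s localhost:9400/metrics | grep -m1 DCGM_FI_DEV_GPU_UTIL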
2. Create Required Directories and Scripts
Create the job mapping directory on each node:
sudo mkdir -p /var/run/dcgm_job_maps
sudo chmod 775 /var/run/dcgm_job_maps
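Note that /var/run is normally a tmpfs on systemd-based systems, so a manually created directory does not survive a reboot; the ExecStartPre step above recreates it whenever the exporter starts. Optionally, you can also have systemd-tmpfiles recreate it at boot. A minimal sketch, not part of the original setup (adjust the owner to whatever user the exporter runs as):
# /etc/tmpfiles.d/dcgm_job_maps.conf
d /var/run/dcgm_job_maps 0775 node_exporter node_exporter -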
Create Slurm epilog and prolog scripts to track GPU allocation:
Epilog Script (/usr/local/bin/_dev_epilog):
#!/bin/bash
# Epilog: when a job ends, reset the mapping file of each GPU it held back to "0" (no job).
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    # SLURM_JOB_GPUS is a comma-separated list of GPU indices; iterate over them
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "0" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi
Prolog Script (/usr/local/bin/_dev_prolog):
#!/bin/bash
# Prolog: when a job starts, write its job ID into the mapping file of each GPU it was allocated.
if [[ -n "${SLURM_JOB_GPUS}" ]]; then
    JOB_MAPPING_DIR="/var/run/dcgm_job_maps"
    # SLURM_JOB_GPUS is a comma-separated list of GPU indices; iterate over them
    for GPU_ID in ${SLURM_JOB_GPUS//,/ }; do
        truncate -s 0 "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
        echo "$SLURM_JOB_ID" >> "$JOB_MAPPING_DIR/$GPU_ID" 2>/dev/null
    done
fi
Make both scripts executable:
sudo chmod +x /usr/local/bin/_dev_epilog
sudo chmod +x /usr/local/bin/_dev_prolog
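Before wiring the scripts into Slurm, you can dry-run them by supplying the variables they read; the job ID and GPU list below are placeholder values:
# Simulate a job that was allocated GPUs 0 and 1 starting up
sudo env SLURM_JOB_ID=12345 SLURM_JOB_GPUS=0,1 /usr/local/bin/_dev_prolog
cat /var/run/dcgm_job_maps/0 /var/run/dcgm_job_maps/1   # should print 12345 twice

# Simulate the same job finishing
sudo env SLURM_JOB_ID=12345 SLURM_JOB_GPUS=0,1 /usr/local/bin/_dev_epilog
cat /var/run/dcgm_job_maps/0 /var/run/dcgm_job_maps/1   # should print 0 twice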
3. Configure Slurm to Use the Scripts
Add the following to your slurm.conf:
TaskEpilog=/usr/local/bin/_dev_epilog
TaskProlog=/usr/local/bin/_dev_prolog
Restart Slurm services:
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
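To confirm that Slurm picked up the new settings after the restart:
scontrol show config | grep -E 'TaskProlog|TaskEpilog'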
Verification
Once configured correctly, you should see the hpc_job label in your Prometheus metrics:
DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="535.154.05", Hostname="g002", UUID="GPU-4d5e36c6-c835-c5a2-65dc-833267ebf851", device="nvidia0", gpu="0", hpc_job="482", instance="192.168.1.2:9400", job="dcgm_exporter", modelName="NVIDIA GeForce RTX 2080 Ti", pci_bus_id="00000000:01:00.0"}
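With the hpc_job label in place, per-job utilization can be queried through the Prometheus HTTP API and used in dashboards. A minimal sketch, assuming Prometheus is reachable at prometheus.example:9090 (replace the host and the job ID with your own; the scripts write 0 for idle GPUs, so hpc_job!="0" keeps only GPUs currently assigned to a job):
# Average utilization of all GPUs allocated to job 482
curl -s 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL{hpc_job="482"})'

# Per-job average across the cluster, e.g. for an Admin panel view
curl -s 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=avg by (hpc_job) (DCGM_FI_DEV_GPU_UTIL{hpc_job!="0"})'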
Troubleshooting
Common Issues
- Missing job labels: Ensure your epilog/prolog scripts have proper permissions and are being executed.
- Zero utilization reported: Check that the GPU is properly allocated to the job in Slurm.
- Service fails to start: Verify the paths in your service file and that the user has appropriate permissions.
Debug Commands
Check if DCGM is running:
sudo systemctl status dcgm-exporter
View DCGM logs:
sudo journalctl -u dcgm-exporter
Verify job mapping files:
ls -la /var/run/dcgm_job_maps/
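For an end-to-end check, submit a small GPU job and inspect the mapping file and the exported metric while it runs (a sketch; adjust the GRES request, GPU index, and port to your setup):
# From a login node: start a short job holding one GPU
srun --gres=gpu:1 sleep 120 &

# On the compute node running the job: the mapping file for the allocated GPU
# (index 0 here as an example) should now contain the Slurm job ID
cat /var/run/dcgm_job_maps/0

# The exporter should attach that ID as the hpc_job label
curl -s localhost:9400/metrics | grep 'hpc_job='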