DCGM Exporter Setup
Installing and configuring NVIDIA's DCGM exporter for GPU monitoring
Prerequisites
Before installing the DCGM exporter, ensure that:
- You have NVIDIA GPUs installed in your system
- NVIDIA drivers are properly installed and functioning
- You have the following components installed:
  - NVIDIA CUDA toolkit (recommended version 11.0+)
  - NVIDIA Container Toolkit (for Docker deployments)
Driver Compatibility:
Ensure your NVIDIA driver version is compatible with the DCGM version. Typically, you need driver version 450.80.02 or newer for recent DCGM versions.
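The driver check above can be scripted. The following is a minimal sketch, assuming GNU `sort` (for `-V` version ordering) and that `nvidia-smi` is on the PATH; `version_ge` and `MIN_DRIVER` are illustrative names, not part of any NVIDIA tool.

```shell
# Minimum driver version quoted in the text above.
MIN_DRIVER="450.80.02"

# version_ge A B: succeeds when version A >= version B.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query the driver when nvidia-smi is actually present.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$installed" "$MIN_DRIVER"; then
        echo "Driver $installed meets the $MIN_DRIVER minimum"
    else
        echo "Driver $installed is older than $MIN_DRIVER"
    fi
fi
```

The minimum is only the typical figure given above; check the release notes of the DCGM version you install for its exact requirement.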
Installation
Basic Installation
For systems where Docker is not available:
- Install NVIDIA DCGM from the NVIDIA Developer Downloads page:
# For Ubuntu/Debian
# First, add the CUDA repository to your system
# Visit https://developer.nvidia.com/cuda-downloads to get the appropriate commands
# for your specific distribution and version
# Then install DCGM
sudo apt-get update
sudo apt-get install -y datacenter-gpu-manager-4

# For RHEL/CentOS
# First, add the CUDA repository to your system
# Visit https://developer.nvidia.com/cuda-downloads to get the appropriate commands
# for your specific distribution and version
# Then install DCGM
sudo dnf -y install datacenter-gpu-manager-4
- Start the DCGM service:
sudo systemctl start nvidia-dcgm
sudo systemctl enable nvidia-dcgm
- Clone the DCGM exporter repository:
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
- Build the exporter (the exporter is written in Go, so a Go toolchain and make must be installed):
make binary
- Create a systemd service file:
sudo nano /etc/systemd/system/dcgm-exporter.service
Add the following content:
[Unit]
Description=NVIDIA DCGM Exporter
After=nvidia-dcgm.service
Requires=nvidia-dcgm.service

[Service]
Type=simple
ExecStart=/path/to/dcgm-exporter/bin/dcgm-exporter

[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable dcgm-exporter
sudo systemctl start dcgm-exporter
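Once the service is running, it can help to confirm that metrics are actually being served. The sketch below assumes the exporter listens on its default port 9400 at `/metrics` and that `curl` is installed; `count_dcgm_samples` and `check_exporter` are hypothetical helper names.

```shell
# Hypothetical helper: count DCGM sample lines in Prometheus text read from stdin.
# Comment lines (# HELP / # TYPE) are skipped because they do not start with DCGM_.
count_dcgm_samples() {
    grep -c '^DCGM_' || true
}

# check_exporter [URL]: fetch the metrics page and report how many samples it has.
check_exporter() {
    url="${1:-http://localhost:9400/metrics}"
    samples="$(curl -fsS --max-time 5 "$url" | count_dcgm_samples)"
    if [ "${samples:-0}" -gt 0 ]; then
        echo "OK: $samples DCGM samples at $url"
    else
        echo "FAIL: no DCGM samples at $url" >&2
        return 1
    fi
}
```

Run `check_exporter` on the GPU host after starting the service. Pass a different URL as the first argument if the exporter was started with a non-default listen address.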
Collection Interval Configuration
You can configure the collection interval by setting the --collect-interval parameter:
# For systemd service
ExecStart=/path/to/dcgm-exporter/bin/dcgm-exporter --collect-interval=30000
The interval is specified in milliseconds (the default is 30000 ms, that is, 30 seconds).
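The exporter's `--help` output describes the unit of `--collect-interval` as milliseconds, so an interval chosen in seconds has to be multiplied by 1000 before it goes into the unit file. A small sketch of that conversion follows; `seconds_to_ms` is an illustrative helper and the 10-second value is only an example.

```shell
# seconds_to_ms N: convert whole seconds to the millisecond value the flag expects.
seconds_to_ms() {
    echo "$(( $1 * 1000 ))"
}

# Print an ExecStart line for a hypothetical 10-second interval.
# /path/to/dcgm-exporter is the same placeholder used in the unit file.
interval_s=10
echo "ExecStart=/path/to/dcgm-exporter/bin/dcgm-exporter --collect-interval=$(seconds_to_ms "$interval_s")"
```

After changing the ExecStart line, reload systemd and restart the service for the new interval to take effect.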
Troubleshooting
- No metrics are reported:
  - Verify NVIDIA drivers are loaded:
    nvidia-smi
  - Check DCGM service status:
    sudo systemctl status nvidia-dcgm
  - Confirm that DCGM itself responds and lists your GPUs:
    dcgmi discovery -l
- Missing GPU metrics:
  - Verify GPU health:
    nvidia-smi -q
  - Check that the missing metric is enabled in your metrics configuration file (the counters CSV passed to the exporter with -f)
  - Ensure the user running the exporter has access to GPU devices
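The checks above can be bundled into one pass. This is a rough sketch rather than an official diagnostic: `run_check` and `diagnose` are hypothetical names, and the test on `/dev/nvidiactl` is only one assumed way to approximate "has access to GPU devices".

```shell
# run_check LABEL CMD [ARGS...]: run a command quietly and print OK or FAIL.
run_check() {
    label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

# diagnose: run each check from the list above and report how many failed.
diagnose() {
    failures=0
    run_check "NVIDIA driver loaded (nvidia-smi)" nvidia-smi || failures=$((failures + 1))
    run_check "nvidia-dcgm service active" systemctl is-active --quiet nvidia-dcgm || failures=$((failures + 1))
    run_check "DCGM lists GPUs (dcgmi discovery -l)" dcgmi discovery -l || failures=$((failures + 1))
    run_check "GPU control device readable by $(id -un)" test -r /dev/nvidiactl || failures=$((failures + 1))
    echo "$failures check(s) failed"
    [ "$failures" -eq 0 ]
}
```

Call `diagnose` as the same user that runs the exporter, so that the device permission check reflects what the service actually sees.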
Performance Impact
DCGM exporter has been designed to have minimal impact on GPU performance:
- The default collection interval (30s) is appropriate for most workloads
- Consider increasing the interval for production systems with heavy GPU utilization
- Collecting fewer metrics can reduce the overhead
Resource Usage:
DCGM typically uses less than 1% of CPU time and minimal memory. The exporter itself has negligible impact on GPU performance.
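To see which metrics are currently collected before trimming the list, the counters file can be inspected. The sketch below assumes the exporter's CSV format (field name, Prometheus type, help text, with `#` marking comments) and the default path `/etc/dcgm-exporter/default-counters.csv`; `list_enabled_fields` and the `COUNTERS_FILE` variable are illustrative names.

```shell
# list_enabled_fields FILE: print the DCGM field names enabled in a counters CSV.
# Lines starting with '#' and blank lines are treated as disabled or empty.
list_enabled_fields() {
    awk -F',' '
        /^[[:space:]]*#/ { next }      # commented-out metric or plain comment
        /^[[:space:]]*$/ { next }      # blank line
        { gsub(/^[[:space:]]+|[[:space:]]+$/, "", $1); print $1 }
    ' "$1"
}

# Assumed default path; set COUNTERS_FILE if the exporter was started with -f.
counters_file="${COUNTERS_FILE:-/etc/dcgm-exporter/default-counters.csv}"
if [ -r "$counters_file" ]; then
    echo "Enabled fields: $(list_enabled_fields "$counters_file" | wc -l)"
    list_enabled_fields "$counters_file"
fi
```

Commenting out a line in that file with `#` and restarting the exporter is one way to collect fewer metrics.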
Security Considerations
- Restrict access to the metrics endpoint with firewall rules
- Consider using a reverse proxy with TLS for secure communication
- Run the exporter with minimal permissions required to access GPU devices