DCGM Exporter Setup

Installing and configuring NVIDIA's DCGM exporter for GPU monitoring

Prerequisites

Before installing the DCGM exporter, ensure that:

  1. You have NVIDIA GPUs installed in your system
  2. NVIDIA drivers are properly installed and functioning
  3. You have the following components installed:
    • NVIDIA CUDA toolkit (recommended version 11.0+)
    • NVIDIA Container Toolkit (for Docker deployments)

Driver Compatibility:

Ensure your NVIDIA driver version is supported by the DCGM version you install. Driver version 450.80.02 or newer is a common baseline for recent DCGM versions; the DCGM release notes list the exact minimum for each release.
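
The driver check can be scripted. The following is a minimal sketch, assuming GNU `sort` (for `-V` version ordering) and that `nvidia-smi` is on the `PATH`; the `version_ge` helper and the `MIN_DRIVER` variable are illustrative names, not part of DCGM.

```shell
#!/usr/bin/env bash
# Minimum driver version quoted in this guide; adjust for your DCGM release.
MIN_DRIVER="450.80.02"

# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Query the installed driver version as reported for the first GPU.
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
    if version_ge "$driver" "$MIN_DRIVER"; then
        echo "driver $driver meets the $MIN_DRIVER minimum"
    else
        echo "driver $driver is older than $MIN_DRIVER" >&2
    fi
else
    echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
fi
```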

Installation

Basic Installation

For systems where Docker is not available:

  1. Install NVIDIA DCGM from NVIDIA's CUDA package repository:

    # For Ubuntu/Debian
    # First, add the CUDA repository to your system
    # Visit https://developer.nvidia.com/cuda-downloads to get the appropriate commands
    # for your specific distribution and version
    
    # Then install DCGM
    sudo apt-get update
    sudo apt-get install -y datacenter-gpu-manager-4
    
    # For RHEL/CentOS
    # First, add the CUDA repository to your system
    # Visit https://developer.nvidia.com/cuda-downloads to get the appropriate commands
    # for your specific distribution and version
    
    # Then install DCGM
    sudo dnf -y install datacenter-gpu-manager-4
    
  2. Start the DCGM service:

    sudo systemctl start nvidia-dcgm
    sudo systemctl enable nvidia-dcgm
    
  3. Clone the DCGM exporter repository:

    git clone https://github.com/NVIDIA/dcgm-exporter.git
    cd dcgm-exporter
    
  4. Build the exporter (this requires make and a Go toolchain):

    make binary
    
  5. Create a systemd service file:

    sudo nano /etc/systemd/system/dcgm-exporter.service
    

    Add the following content, replacing the ExecStart path with the location of the binary you built:

    [Unit]
    Description=NVIDIA DCGM Exporter
    After=nvidia-dcgm.service
    Requires=nvidia-dcgm.service
    
    [Service]
    Type=simple
    ExecStart=/path/to/dcgm-exporter/bin/dcgm-exporter
    
    [Install]
    WantedBy=multi-user.target
    
  6. Enable and start the service:

    sudo systemctl daemon-reload
    sudo systemctl enable dcgm-exporter
    sudo systemctl start dcgm-exporter
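
Once the service is running, it is worth confirming that metrics are actually being served. The sketch below assumes the exporter listens on its default address (port 9400 on localhost) and that `curl` is installed; `METRICS_URL`, `count_dcgm_samples`, and `check_exporter` are illustrative names.

```shell
#!/usr/bin/env bash
# Assumption: the exporter listens on its default address, localhost:9400.
METRICS_URL="${METRICS_URL:-http://localhost:9400/metrics}"

# count_dcgm_samples: reads Prometheus text from stdin and prints the number
# of sample lines whose metric name starts with DCGM_ (comment lines are skipped).
count_dcgm_samples() {
    grep -c '^DCGM_' || true
}

# check_exporter: fetches the endpoint and reports how many samples it returned.
check_exporter() {
    local body
    if ! body="$(curl -fsS --max-time 5 "$METRICS_URL")"; then
        echo "could not reach $METRICS_URL" >&2
        return 1
    fi
    echo "exporter returned $(printf '%s\n' "$body" | count_dcgm_samples) DCGM samples"
}
```

After sourcing the file, run `check_exporter`. A count of zero suggests the exporter started but is not receiving data from DCGM.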
    

Collection Interval Configuration

You can configure the collection interval by setting the --collect-interval flag:

    # For systemd service
    ExecStart=/path/to/dcgm-exporter/bin/dcgm-exporter --collect-interval=30000

The interval is specified in milliseconds (the default is 30000, which is 30 seconds).
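
Rather than editing the unit file directly, the flag can be applied through a systemd drop-in. This is a sketch only: `render_override` and `EXPORTER_BIN` are hypothetical names, the binary path is the same placeholder used above, and the interval value is passed to the flag unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical binary location; matches the placeholder used in this guide.
EXPORTER_BIN="${EXPORTER_BIN:-/path/to/dcgm-exporter/bin/dcgm-exporter}"

# render_override VALUE: prints a drop-in that replaces the unit's ExecStart.
# The empty "ExecStart=" line clears the command inherited from the main unit,
# which systemd requires before a new ExecStart can be set for a simple service.
render_override() {
    printf '[Service]\nExecStart=\nExecStart=%s --collect-interval=%s\n' \
        "$EXPORTER_BIN" "$1"
}

# Example usage (not executed here):
#   sudo mkdir -p /etc/systemd/system/dcgm-exporter.service.d
#   render_override "$INTERVAL" | sudo tee /etc/systemd/system/dcgm-exporter.service.d/override.conf
#   sudo systemctl daemon-reload && sudo systemctl restart dcgm-exporter
```

A drop-in keeps the original unit file untouched, so the change is easy to revert by deleting the override file.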

Troubleshooting

  1. No metrics are reported:

    • Verify NVIDIA drivers are loaded: nvidia-smi
    • Check DCGM service status: sudo systemctl status nvidia-dcgm
    • Confirm that DCGM itself can enumerate the GPUs: dcgmi discovery -l
  2. Missing GPU metrics:

    • Verify GPU health: nvidia-smi -q
    • Check that the missing metric is listed in your metrics configuration file
    • Ensure the user running the exporter has access to GPU devices
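
The checks above can be bundled into a small diagnostic script. This is a sketch under the assumption that the service names match the ones used in this guide (`nvidia-dcgm` and `dcgm-exporter`); `run_check` and `diagnose` are illustrative helper names.

```shell
#!/usr/bin/env bash
# run_check LABEL CMD...: runs CMD quietly and prints PASS or FAIL for LABEL.
# A missing command counts as a failure.
run_check() {
    local label="$1"; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $label"
    else
        echo "FAIL  $label"
        return 1
    fi
}

# diagnose: runs each check in order and returns non-zero if any of them failed.
diagnose() {
    local failed=0
    run_check "NVIDIA driver loaded (nvidia-smi)" nvidia-smi || failed=1
    run_check "DCGM service active" systemctl is-active --quiet nvidia-dcgm || failed=1
    run_check "DCGM can list GPUs" dcgmi discovery -l || failed=1
    run_check "exporter service active" systemctl is-active --quiet dcgm-exporter || failed=1
    return "$failed"
}
```

Running `diagnose` prints one line per check, so the first FAIL line points at the layer to investigate.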

Performance Impact

The DCGM exporter is designed to have minimal impact on GPU performance:

  • The default collection interval (30s) is appropriate for most workloads
  • Consider increasing the interval for production systems with heavy GPU utilization
  • Collecting fewer metrics can reduce the overhead
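
To illustrate collecting fewer metrics, the sketch below writes a reduced metrics configuration file. It assumes the exporter's CSV format of one DCGM field per row (field name, Prometheus type, help text); the output path and the choice of three fields are examples only.

```shell
#!/usr/bin/env bash
# Hypothetical output path; any location readable by the exporter works.
COUNTERS_FILE="${COUNTERS_FILE:-./minimal-counters.csv}"

# Each row is: DCGM field name, Prometheus metric type, help text.
# Lines starting with '#' are treated as comments.
cat > "$COUNTERS_FILE" <<'EOF'
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_FB_USED,  gauge, Framebuffer memory used (in MiB).
EOF

# Print the number of metric rows (non-comment, non-blank lines) as a sanity check.
grep -vc -e '^#' -e '^[[:space:]]*$' "$COUNTERS_FILE"
```

The file can then be passed to the exporter with its `-f` (`--collectors`) flag, for example by appending `-f /path/to/minimal-counters.csv` to the ExecStart line.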

Resource Usage:

DCGM typically uses less than 1% of CPU time and minimal memory. The exporter itself has negligible impact on GPU performance.

Security Considerations

  • Restrict access to the metrics endpoint with firewall rules
  • Consider using a reverse proxy with TLS for secure communication
  • Run the exporter with minimal permissions required to access GPU devices
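
As one way to restrict access to the metrics endpoint, the sketch below prints iptables rules that accept scrapes from a single monitoring host and drop everything else on the metrics port. It is a dry run that only echoes the commands; `firewall_rules` is an illustrative name, port 9400 is assumed to be the exporter's default, and the address is a documentation-range placeholder.

```shell
#!/usr/bin/env bash
# firewall_rules SCRAPER_IP [PORT]: prints iptables commands that accept
# scrapes from one address and drop all other traffic to the metrics port.
# Nothing is applied; review the output, then run the commands as root.
firewall_rules() {
    local scraper_ip="$1" port="${2:-9400}"
    echo "iptables -A INPUT -p tcp --dport $port -s $scraper_ip -j ACCEPT"
    echo "iptables -A INPUT -p tcp --dport $port -j DROP"
}

# Example with a placeholder address standing in for the Prometheus host.
firewall_rules 192.0.2.10
```

Rule order matters here: the ACCEPT rule must be appended before the DROP rule, otherwise the scraper is blocked too.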