Adding Exporters to Prometheus
Integrating node_exporter, ipmi_exporter, and dcgm_exporter with Prometheus
Prerequisites
Before proceeding, ensure that:
- Prometheus is installed and running
- The exporters (node_exporter, ipmi_exporter, and/or dcgm_exporter) are installed and operational
- You have access to the Prometheus configuration file
Basic Prometheus Configuration
Updating prometheus.yml
1. Locate your Prometheus configuration file:

   ```shell
   # Default location varies by installation method; common paths include:
   #   /etc/prometheus/prometheus.yml
   #   /opt/prometheus/prometheus.yml
   ```

2. Edit the configuration file to add scrape configurations for your exporters:

   ```shell
   sudo nano /etc/prometheus/prometheus.yml
   ```

3. Add the following scrape configurations to the file:

   ```yaml
   scrape_configs:
     # Existing scrape configs...

     # Node Exporter
     - job_name: "node"
       static_configs:
         - targets: ["localhost:9100"]

     # IPMI Exporter
     - job_name: "ipmi"
       scrape_interval: 1m
       scrape_timeout: 30s
       static_configs:
         - targets: ["localhost:9290"]

     # DCGM Exporter
     - job_name: "dcgm"
       scrape_interval: 30s
       static_configs:
         - targets: ["localhost:9400"]
   ```

4. Restart Prometheus to apply the changes:

   ```shell
   sudo systemctl restart prometheus
   ```
Configuration Validation:
Always validate your Prometheus configuration before applying it:

```shell
promtool check config /etc/prometheus/prometheus.yml
```
Advanced Configuration
Monitoring Multiple Hosts
To monitor multiple hosts, update your configuration with multiple targets:
```yaml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - "server1:9100"
          - "server2:9100"
          - "server3:9100"
        labels:
          environment: "production"
```
Using Service Discovery
For dynamic environments, consider using service discovery:
```yaml
scrape_configs:
  - job_name: "node"
    consul_sd_configs:
      - server: "localhost:8500"
        services: ["node-exporter"]
```
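If Consul isn't part of your stack, Prometheus also ships with file-based service discovery, which re-reads a targets file whenever it changes. A minimal sketch (the file path is illustrative):

```yaml
scrape_configs:
  - job_name: "node"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/node.json"
```

The referenced file holds a JSON array of target groups, e.g. `[{"targets": ["server1:9100"], "labels": {"environment": "production"}}]`, which Prometheus picks up without a restart.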
IPMI Exporter with Remote Targets
For monitoring multiple BMCs with a single IPMI exporter:
```yaml
scrape_configs:
  - job_name: 'ipmi'
    scrape_interval: 1m
    params:
      module: [default]
    static_configs:
      - targets:
          - '192.168.1.101'  # BMC IP addresses
          - '192.168.1.102'
          - '192.168.1.103'
        labels:
          exporter: 'ipmi'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'localhost:9290'  # IPMI exporter address
```
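The three relabel rules above redirect each scrape to the exporter while keeping the BMC's address as the instance label. A toy Python model of that pipeline (Prometheus applies these rules internally; this is only a sketch to show where each label value ends up):

```python
def relabel(labels):
    # Rule 1: copy the scrape address (the BMC IP) into the "target" URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: use that same value as the instance label shown in the UI.
    labels["instance"] = labels["__param_target"]
    # Rule 3: rewrite the scrape address to point at the IPMI exporter itself.
    labels["__address__"] = "localhost:9290"
    return labels

labels = relabel({"__address__": "192.168.1.101"})
print(labels["instance"])     # 192.168.1.101
print(labels["__address__"])  # localhost:9290
```

The net effect: Prometheus requests `http://localhost:9290/ipmi?target=192.168.1.101`, while the resulting series are still labeled with the BMC's IP.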
Metric Relabeling
Customizing Labels
You can use metric relabeling to add or modify labels:
```yaml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "(.*):.+"
        replacement: "${1}"
      - target_label: datacenter
        replacement: "dc1"
```
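The first rule strips the port from the instance label. Prometheus anchors relabel regexes at both ends, so `(.*):.+` must match the entire `host:port` string and `${1}` keeps only the host part; a quick Python check of the same pattern:

```python
import re

# Prometheus anchors relabel regexes, so emulate with fullmatch.
match = re.fullmatch(r"(.*):.+", "localhost:9100")
print(match.group(1))  # localhost -> becomes the instance label
```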
Filtering Metrics
To reduce storage requirements, you can filter out unnecessary metrics:
```yaml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "node_disk_io_time_seconds.*"
        action: drop
```
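Because the regex is anchored, the `drop` rule removes only metrics whose full name matches the pattern. A small Python sketch of which names the rule above would catch:

```python
import re

pattern = re.compile(r"node_disk_io_time_seconds.*")

# Dropped: the full metric name matches the anchored pattern.
print(bool(pattern.fullmatch("node_disk_io_time_seconds_total")))  # True
# Kept: unrelated metric names do not match.
print(bool(pattern.fullmatch("node_cpu_seconds_total")))           # False
```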
Security Considerations
TLS Configuration
For secure communication with exporters:
```yaml
scrape_configs:
  - job_name: "node"
    scheme: https
    tls_config:
      cert_file: /path/to/cert
      key_file: /path/to/key
      ca_file: /path/to/ca
    static_configs:
      - targets: ["secure-node:9100"]
```
Basic Authentication
If your exporters are protected with basic authentication:
```yaml
scrape_configs:
  - job_name: "node"
    basic_auth:
      username: prometheus
      password: secret
    static_configs:
      - targets: ["localhost:9100"]
```
Secure Storage:
Store sensitive information like passwords using Prometheus' file-based secret management or a secret management tool like HashiCorp Vault.
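One option Prometheus supports natively is `password_file`, which keeps the secret out of prometheus.yml itself (the file path below is illustrative):

```yaml
scrape_configs:
  - job_name: "node"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/node_password
    static_configs:
      - targets: ["localhost:9100"]
```

Restrict the secret file's permissions so only the Prometheus user can read it.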
Verifying Configuration
After configuring Prometheus, verify that it's correctly scraping metrics:
- Access the Prometheus web interface (usually at http://localhost:9090)
- Navigate to Status > Targets to see all configured targets
- Check that all exporters are in the "UP" state
- Use the Graph interface to query for metrics from each exporter:
  - Node Exporter: `node_cpu_seconds_total`
  - IPMI Exporter: `ipmi_temperature_celsius`
  - DCGM Exporter: `DCGM_FI_DEV_GPU_UTIL`
Labels Inspection:
Inspect the available labels for each metric to understand how to query and filter your data effectively. This is especially useful for creating dashboards and alerts.
Maintenance and Best Practices
Scrape Interval Optimization
Balance between data granularity and storage requirements:
- Node Exporter: 15-30s for most environments
- IPMI Exporter: 60s or longer (hardware metrics change slowly)
- DCGM Exporter: 15-30s for active GPU monitoring
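A rough ingestion estimate shows the trade-off. The per-host series count below is an assumption (on the order of 1,000 active series for a node_exporter instance; the real number depends on enabled collectors), so treat this as a sketch:

```python
# Back-of-the-envelope sample ingestion per host per day.
SERIES_PER_HOST = 1_000   # assumed active series; varies by collectors enabled
SECONDS_PER_DAY = 86_400

def samples_per_day(scrape_interval_s, series=SERIES_PER_HOST):
    return series * SECONDS_PER_DAY // scrape_interval_s

print(samples_per_day(15))  # 5,760,000 samples/day at a 15s interval
print(samples_per_day(60))  # 1,440,000 samples/day at a 60s interval
```

Quadrupling the IPMI interval to 60s cuts its ingestion to a quarter, which is why slow-changing hardware metrics rarely justify aggressive scrape rates.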
Storage Retention
Configure appropriate storage retention based on your needs. Note that retention is set with command-line flags passed to the Prometheus binary (for example in the systemd unit's ExecStart line), not in prometheus.yml:

```shell
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=30GB
```
Troubleshooting Common Issues
Target Scrape Failures
If a target is showing as "DOWN":
1. Check connectivity:

   ```shell
   curl http://target-host:port/metrics
   ```

2. Verify firewall rules to ensure the Prometheus server can access the exporter.

3. Check exporter logs for any errors:

   ```shell
   sudo journalctl -u node_exporter
   sudo journalctl -u ipmi_exporter
   sudo docker logs dcgm-exporter
   ```
Missing Metrics
If expected metrics are not appearing:
1. Check if the metric exists in the raw exporter output:

   ```shell
   curl http://localhost:9100/metrics | grep node_cpu
   curl http://localhost:9290/metrics | grep ipmi_temperature
   curl http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
   ```

2. Verify metric relabeling isn't dropping your metrics.

3. Check collector configuration for specialized exporters.
Conclusion
By following this guide, you've successfully integrated multiple exporters with Prometheus to monitor your infrastructure comprehensively. The combination of node_exporter, ipmi_exporter, and dcgm_exporter provides visibility into system metrics, hardware health, and GPU performance, enabling proactive monitoring and troubleshooting of your entire stack.
For further assistance or advanced configurations, refer to the official documentation for each component: Prometheus, node_exporter, ipmi_exporter, and dcgm_exporter.