HPC Dashboard Tutorial G.O.O.D. 2025
Integrating the HPC Dashboard with Open OnDemand
HPC Dashboard Getting Started Guide
Welcome to the HPC Dashboard Tutorial! This guide will walk you through deploying and running the HPC Dashboard on your provided virtual machine (VM).
Table of Contents
- Introduction
- Prerequisites
- Environment Setup
- Repository Setup
- Configuration
- Dashboard Deployment
- Theme Customization
- Production Setup
- SSL Certificate Setup
- Open OnDemand Integration
- Usage
- Best Practices
- Additional Resources
Introduction
The HPC Dashboard is a Next.js application designed for real-time monitoring of SLURM nodes. It provides detailed insights into CPU/GPU utilization, node status, job histories, and more. This tutorial will guide you through:
- Deploying the dashboard on a VM with a public IP
- Connecting to a remote SLURM API and Prometheus instance
- Setting up the Open OnDemand integration
Note:
Next.js is chosen for its server-side rendering capabilities and ease of deployment, making it an excellent option for creating dynamic and interactive dashboards. SLURM is widely used in high-performance computing environments to manage jobs and resources.
Prerequisites
Ensure you have the following:
- VM Access
- A Linux virtual machine
- Open OnDemand running either on the same VM or on another system
- Prometheus running with different exporters
- SLURM API key
- SLURM and Prometheus IP addresses
Note:
Make sure your VM is securely configured and that you keep your SLURM API key confidential. Unauthorized access could lead to misuse of your computing resources.
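Before deploying, it can help to confirm that the SLURM API key and address from the list above actually work. The sketch below assumes the API is served by slurmrestd over plain HTTP on its default port 6820, and that the API version is v0.0.40 (the value used later in this guide); adjust both if your site differs. The function names are illustrative and not part of the dashboard.

```shell
# Build the slurmrestd ping URL.
# Assumption: plain HTTP and port 6820 unless a port is given.
slurm_ping_url() {
    # $1 = server address, $2 = API version (e.g. v0.0.40), $3 = port (optional)
    printf 'http://%s:%s/slurm/%s/ping\n' "$1" "${3:-6820}" "$2"
}

# Call the ping endpoint with token authentication headers.
# Returns non-zero if the server is unreachable or rejects the request.
slurm_ping() {
    # $1 = server, $2 = API version, $3 = user name, $4 = token
    curl -fsS --max-time 5 \
        -H "X-SLURM-USER-NAME: $3" \
        -H "X-SLURM-USER-TOKEN: $4" \
        "$(slurm_ping_url "$1" "$2")"
}

# Example (not run automatically; the address is a placeholder):
#   slurm_ping 10.0.0.5 v0.0.40 slurm "$SLURM_API_TOKEN"
```

A JSON reply indicates that both the address and the token are usable; a failure here is easier to diagnose than a blank dashboard later.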
Environment Setup
SSH Connection
To get started, you will need to SSH to the system.
ssh rocky@YOUR_VM
Note:
SSH (Secure Shell) is used to securely access remote systems. Always ensure that your SSH credentials are kept safe and never shared publicly.
Repository Setup
After logging in, switch to root and set up the dashboard:
sudo su -
cd /var/www/
npx create-slurm-dashboard slurm-node-dashboard
Note:
Explanation:
sudo su - switches to the root user, ensuring you have the necessary permissions for installation.
cd /var/www/ changes the directory to the web server's root folder.
npx create-slurm-dashboard slurm-node-dashboard uses npx to scaffold the HPC Dashboard project in a directory called slurm-node-dashboard.
Note:
If you accidentally select the wrong version during the setup prompts, simply delete the slurm-node-dashboard directory and run the command again.
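Since npx ships with Node.js, the scaffolding step only works when a reasonably recent Node.js is installed. The following is a minimal sketch of a version check; the minimum of 18 used in the example is an assumption, so check the project's own requirements for the real value. Both function names are hypothetical helpers.

```shell
# Extract the major version from a string such as "v20.11.1".
node_major() {
    v=${1#v}                    # drop the leading "v"
    printf '%s\n' "${v%%.*}"    # keep everything before the first dot
}

# Succeeds when the installed Node.js major version is at least $1.
node_at_least() {
    command -v node >/dev/null 2>&1 || return 1
    [ "$(node_major "$(node --version)")" -ge "$1" ]
}

# Example (18 is an assumed minimum, not taken from the project):
#   node_at_least 18 || echo "Node.js 18 or newer not found"
```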
Configuration
Install Dependencies
Change into the dashboard directory, and install required Node.js packages:
cd slurm-node-dashboard
npm install
Environment Configuration
Set up the production environment by renaming the environment file. Move the file rather than copying it, so that a leftover .env.production does not cause conflicts:
mv .env.production .env
Then, update the .env file with your configuration:
COMPANY_NAME="Tutorial"
VERSION=1.1.2
CLUSTER_NAME="Tutorial"
CLUSTER_LOGO="/logo.png"
PROMETHEUS_URL=""
OPENAI_API_KEY=""
NODE_ENV="production"
REACT_EDITOR="code"
SLURM_API_VERSION="v0.0.40"
SLURM_SERVER="your_slurm_server_ip_address"
SLURM_API_TOKEN="your_slurm_api_token"
SLURM_API_ACCOUNT="slurm"
Note:
Do not expose your .env file or its contents in a public repository, as it contains sensitive information like API keys and tokens.
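A missing or empty SLURM value in .env is an easy mistake to make at this step. As a minimal sketch, the hypothetical helper below lists which of the given keys are absent or empty in an env file; treating the three SLURM_* keys as required is an assumption based on the listing above, not a rule stated by the dashboard.

```shell
# Print the names of keys that are missing or empty in an env file.
# $1 = path to the env file; remaining arguments = key names to check.
missing_env_keys() {
    file=$1
    shift
    for key in "$@"; do
        # Accept KEY=value or KEY="value"; reject KEY= and KEY="".
        if ! grep -Eq "^${key}=\"?[^\"[:space:]]" "$file"; then
            printf '%s\n' "$key"
        fi
    done
}

# Example (not run automatically):
#   missing_env_keys .env SLURM_SERVER SLURM_API_TOKEN SLURM_API_VERSION
```

No output means every listed key has a value. The check looks only at the first character of each value, so it will not catch an unterminated quote.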
Dashboard Deployment
Development Mode
Start the development server with the following command:
npm run dev
Access the dashboard at http://your_vm_ip:3020
Note:
Running npm run dev starts a development server that automatically reloads changes. This mode is ideal for testing and development but not for production.
The dashboard displays all tutorial systems, which are running as SLURM compute nodes. Hover over nodes to view system details including hostname, load, and core usage. Click on nodes to view running jobs.
The interface includes:
- Color scheme options (top right)
- Menu access to historical data and modules pages
- GitHub link for bug reports and forking
Adding Prometheus Integration
Currently, the dashboard supports node_exporter, ipmi_exporter, and dcgm_exporter.
To enable Prometheus:
- Stop the development server (Ctrl+C).
- Update the .env file:
# Change this line
PROMETHEUS_URL="http://prometheus_ip_address:9090"
- Restart the development server:
npm run dev
Note:
Prometheus is a powerful monitoring system and time series database. Integrating it allows you to gather and display metrics like power data and node performance. On the tutorial VMs this data is simulated; in production, it reflects real-time hardware metrics.
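If the dashboard shows no Prometheus data after the restart, it is worth checking the Prometheus side directly. The sketch below uses two standard Prometheus HTTP endpoints, /-/healthy and /api/v1/query; the helper names are hypothetical, and the base URL is whatever you placed in PROMETHEUS_URL.

```shell
# Join a base URL and a path without producing a double slash.
prom_url() {
    printf '%s/%s\n' "${1%/}" "${2#/}"
}

# Succeeds when Prometheus reports itself healthy.
prom_healthy() {
    # $1 = base URL, e.g. http://prometheus_ip_address:9090
    curl -fsS --max-time 5 -o /dev/null "$(prom_url "$1" /-/healthy)"
}

# Print the JSON result of the built-in "up" query, which lists scrape targets.
prom_targets_up() {
    curl -fsS --max-time 5 --get --data-urlencode 'query=up' \
        "$(prom_url "$1" /api/v1/query)"
}

# Example (not run automatically):
#   prom_healthy "$PROMETHEUS_URL" && prom_targets_up "$PROMETHEUS_URL"
```

In the "up" result, a value of 1 for a target means Prometheus scraped that exporter successfully.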
Production Setup
For production deployment, use PM2 to manage the dashboard:
# Install PM2
npm install -g pm2
# Build and start the application
npm run build
pm2 start npm --name "hpc-dashboard" -- start -- --port 3020
pm2 save
Note:
PM2 is a process manager that ensures your application stays up and can restart automatically after crashes.
npm run build compiles the application for production.
pm2 save stores the current process list so that PM2 can resurrect these processes after a server restart. Restoring them automatically at boot also requires the startup script generated by pm2 startup.
Note:
Always test your production build in a staging environment before deploying to live users.
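After starting the application under PM2, a quick way to confirm that it is actually serving requests is to poll the local port. This is a minimal sketch with a hypothetical helper name; it assumes the dashboard answers plain HTTP on 127.0.0.1:3020 as configured above.

```shell
# Poll a URL until it answers successfully or the attempts run out.
# $1 = URL, $2 = attempts (default 10), $3 = delay in seconds (default 1)
wait_for_http() {
    url=$1
    tries=${2:-10}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if curl -fs --max-time 2 -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example, after `pm2 start ...` (not run automatically):
#   wait_for_http http://127.0.0.1:3020 \
#       || pm2 logs hpc-dashboard --lines 50 --nostream
```

If the check fails, the PM2 logs for the hpc-dashboard process are usually the fastest route to the cause.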
SSL Certificate Setup
Let's set up SSL certificates for both the dashboard and Open OnDemand. Secure communication is critical to protect data and ensure that connections between users and your services are encrypted. Some of these steps are only required if you are running both OOD and the dashboard on the same system. If you are not, you do not need to change the Nginx port.
1. Install Required Packages
dnf install -y nginx certbot
2. Configure Nginx to Use a Different Port
Update Nginx to listen on port 8080 instead of the default port 80:
sed -i 's/ 80 default_server;/ 8080 default_server;/g' /etc/nginx/nginx.conf
sed -i 's/]:80 default_server;/]:8080 default_server;/g' /etc/nginx/nginx.conf
systemctl restart nginx
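To see what the two sed expressions above change before editing the live file, the same substitutions can be run as a filter. This is a sketch with a hypothetical function name; it reads from standard input and writes to standard output, so nothing under /etc is modified.

```shell
# Apply the two port substitutions to standard input.
# The first handles IPv4 listen lines, the second IPv6 ([::]:80) lines.
rewrite_default_port() {
    sed -e 's/ 80 default_server;/ 8080 default_server;/g' \
        -e 's/]:80 default_server;/]:8080 default_server;/g'
}

# Example: preview the effect on the real file (not run automatically).
#   rewrite_default_port < /etc/nginx/nginx.conf | diff /etc/nginx/nginx.conf -
```

Lines already set to 8080 are left alone, because the first pattern requires a space directly before 80.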
For the next step to work, port 80 must be free, so stop HTTPD (where Open OnDemand is running) and Nginx, in case Nginx is still listening on port 80:
systemctl stop httpd nginx
3. Generate SSL Certificate
Generate a certificate using your hostname:
certbot certonly --standalone --register-unsafely-without-email -d $(hostname -f) --agree-tos
Note:
The --register-unsafely-without-email flag means you won't receive renewal reminders. Consider registering with a valid email for production environments.
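Without email reminders, expiry has to be monitored some other way. One option, sketched below with a hypothetical helper, is to ask openssl whether the certificate will still be valid a given number of days from now. The path in the example assumes the default certbot layout used in this guide.

```shell
# Succeeds when the certificate in $1 is still valid $2 days from now.
cert_valid_for_days() {
    openssl x509 -noout -checkend "$(( $2 * 86400 ))" -in "$1" >/dev/null
}

# Example (not run automatically):
#   cert_valid_for_days "/etc/letsencrypt/live/$(hostname -f)/fullchain.pem" 14 \
#       || echo "certificate expires within 14 days"
#
# certbot renew --dry-run exercises the renewal path without issuing a new
# certificate. With the standalone method, port 80 must be free for it to pass.
```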
4. Create Dashboard Nginx Configuration
Create a configuration file for the dashboard:
cat > /etc/nginx/conf.d/slurmdash.conf << EOF
server {
    listen 8443 ssl;
    server_name $(hostname -f);
    ssl_certificate /etc/letsencrypt/live/$(hostname -f)/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$(hostname -f)/privkey.pem;
    location / {
        proxy_pass http://127.0.0.1:3020;
        proxy_set_header Host \$host;
        proxy_set_header X-Real-IP \$remote_addr;
        proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
EOF
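The command above relies on an unquoted heredoc delimiter: the shell expands command substitutions such as $(hostname -f) while writing the file, and the backslash in \$host keeps Nginx variables literal. A small sketch of that behavior follows, using a hypothetical function that prints to standard output instead of writing under /etc; the host name is a placeholder.

```shell
# Print a fragment the same way the config is written: with an unquoted EOF,
# "$1" is expanded now, while \$host survives as a literal "$host" for Nginx.
render_proxy_snippet() {
    # $1 = host name to embed (the guide uses the output of hostname -f)
    cat << EOF
server_name $1;
proxy_set_header Host \$host;
EOF
}

render_proxy_snippet dash.example.org
```

Running nginx -t after writing the real file checks its syntax before any restart, which catches problems such as a missing semicolon.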
5. Configure Open OnDemand SSL
Add SSL configuration to Open OnDemand:
cat >> /etc/ood/config/ood_portal.yml << EOF
ssl:
- 'SSLCertificateFile "/etc/letsencrypt/live/$(hostname -f)/cert.pem"'
- 'SSLCertificateKeyFile "/etc/letsencrypt/live/$(hostname -f)/privkey.pem"'
EOF
6. Restart Services
Restart the necessary services to apply the new configurations:
systemctl restart httpd nginx
Note:
Restarting services may temporarily disrupt active connections. Ensure this is done during a maintenance window if deployed in a production environment.
Now both services should be accessible via HTTPS:
https://YOUR_OOD_HOSTNAME # Open OnDemand
https://YOUR_DASH_HOSTNAME:8443 # Dashboard
Open OnDemand Integration
Open OnDemand is pre-installed on the VM. Your rocky credentials will work here as well.
Note:
Open OnDemand provides a web-based interface to HPC resources, making it easier for users to submit jobs, monitor progress, and manage files without needing to use the command line directly.
Installation Steps
- Clone the integration repository:
cd /var/www/ood/apps/sys/
git clone https://github.com/thediymaker/ood-status-iframe.git && cd ood-status-iframe
- Set up the Python environment:
python3 -m venv ood-status-iframe
source ood-status-iframe/bin/activate
python3 -m pip install -r requirements.txt
chmod +x bin/python
deactivate
Configuration Steps
- Update the iframe URL in templates/layout.html:
sudo sed -i "s|<iframe src=\".*\"|<iframe src=\"https://$(hostname -f):8443\"|" templates/layout.html
- Configure manifest.yml. You can update this to fit your needs:
name: System Status
description: HPC Status Page
category: System
subcategory: System Information
icon: fa://bar-chart
show_in_menu: true
Note:
If you update the name of the environment, ensure that the path in the bin/python file is updated accordingly. Also, make sure the bin/python file remains executable.
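The sed command in the configuration steps uses a greedy .* pattern, which matches up to the last double quote on the line and would therefore drop any other quoted attributes that share a line with src. The sketch below is a narrower variant that replaces only the src value, written as a filter so it can be previewed safely. The function name is hypothetical, and the contents of templates/layout.html are not shown in this guide, so test it against your copy.

```shell
# Rewrite only the src value of <iframe> tags read from standard input.
# $1 = new URL; it must not contain "|" or "&", which are special to sed here.
set_iframe_src() {
    sed "s|<iframe src=\"[^\"]*\"|<iframe src=\"$1\"|"
}

# Example: preview the result without modifying the template (not run here).
#   set_iframe_src "https://$(hostname -f):8443" < templates/layout.html | grep iframe
```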
Access Open OnDemand at https://YOUR_OOD_HOSTNAME/
The status page is available under System → System Status.
Note:
The first time you access this, you will need to initialize the app by clicking the red "Initialize App" box.
Usage
To submit a test job:
- Switch to the rocky user and prepare the batch script:
su - rocky
cd /scratch
cp /packages/slurm/submit.sbatch ./$(hostname -s).sbatch
- Edit the script to specify your node. The script will be named something like good-c1.sbatch:
#SBATCH -w YOUR_HOSTNAME
For example:
#SBATCH -w good-c1
- Submit and monitor jobs:
sbatch $(hostname -s).sbatch
scontrol show jobs
Note:
This section demonstrates a simple workflow to submit a job to the SLURM scheduler. Make sure the node specified in your batch script corresponds to an available compute node.
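The contents of /packages/slurm/submit.sbatch are not shown in this guide. If that file is unavailable, a stand-in can be generated along the following lines. This is a hypothetical minimal script, not the tutorial's own; the job name, time limit, and sleep duration are arbitrary choices.

```shell
# Write a minimal batch script pinned to one node.
# $1 = node name, $2 = output path
write_test_job() {
    cat > "$2" << EOF
#!/bin/bash
#SBATCH --job-name=dash-test
#SBATCH -w $1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# Keep the job alive long enough to appear on the dashboard.
hostname
sleep 120
EOF
}

# Example (not run automatically):
#   write_test_job "$(hostname -s)" "./$(hostname -s).sbatch"
#   sbatch "./$(hostname -s).sbatch"
#   squeue -u "$USER"
```

While the job sleeps, the chosen node should show it as a running job when clicked in the dashboard.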
Best Practices
- Implement SSL certificates (as covered in this guide)
- Deploy behind NGINX or HTTPD with SSL (as covered in this guide)
- Enable authentication for Open OnDemand
- Restrict dashboard access
- Maintain regular security updates
Note:
Following best practices is crucial for maintaining a secure and stable HPC environment.
Additional Resources
- Slurm Dashboard Wiki - More details and troubleshooting tips.
- HPC Dashboard Repository - Source code and contributions.
- Next.js Documentation - Learn more about Next.js.
- Tailwind CSS Documentation - For styling and design customization.
- SLURM Documentation - Detailed SLURM configuration and usage.
- Prometheus Documentation - Learn how to effectively use Prometheus for monitoring.
- shadcn/ui Documentation - Component library documentation.
Video Tutorials
- Quick Start Guide - A fast-paced walkthrough of the setup.
- Open OnDemand Integration - Detailed explanation of integrating Open OnDemand.