HPC Dashboard Tutorial G.O.O.D. 2025

Integrating the HPC Dashboard with Open OnDemand

HPC Dashboard Getting Started Guide

Welcome to the HPC Dashboard Tutorial! This guide will walk you through deploying and running the HPC Dashboard on your provided virtual machine (VM).

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Environment Setup
  4. Repository Setup
  5. Configuration
  6. Dashboard Deployment
  7. Theme Customization
  8. Production Setup
  9. SSL Certificate Setup
  10. Open OnDemand Integration
  11. Usage
  12. Best Practices
  13. Additional Resources

Introduction

The HPC Dashboard is a Next.js application designed for real-time monitoring of SLURM nodes. It provides detailed insights into CPU/GPU utilization, node status, job histories, and more. This tutorial will guide you through:

  • Deploying the dashboard on a VM with a public IP
  • Connecting to a remote SLURM API and Prometheus instance
  • Setting up the Open OnDemand integration

Note:

Next.js is chosen for its server-side rendering capabilities and ease of deployment, making it an excellent option for creating dynamic and interactive dashboards. SLURM is widely used in high-performance computing environments to manage jobs and resources.


Prerequisites

Ensure you have the following:

  • VM Access
    • A Linux virtual machine
    • Open OnDemand running either on the same VM or on another system
    • Prometheus running with the relevant exporters
    • A SLURM API key
    • The SLURM and Prometheus IP addresses
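The SLURM API key is a JWT. If your cluster uses SLURM's auth/jwt plugin, scontrol can mint one for you; this is a sketch only (the username and 30-day lifespan are placeholder choices, and the scontrol line is shown as a comment because it must be run on the SLURM controller):

```shell
# Example only: mint a JWT for the dashboard on the SLURM controller
# (assumes auth/jwt is configured in slurm.conf):
#   scontrol token username=slurm lifespan=$LIFESPAN
# Lifespans are given in seconds; for example, 30 days:
LIFESPAN=$((30*24*3600))
echo "$LIFESPAN"
```

The resulting token is what goes into SLURM_API_TOKEN later in this guide.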

Note:

Make sure your VM is securely configured and that you keep your SLURM API key confidential. Unauthorized access could lead to misuse of your computing resources.


Environment Setup

SSH Connection

To get started, you will need to SSH to the system.

ssh rocky@YOUR_VM

Note:

SSH (Secure Shell) is used to securely access remote systems. Always ensure that your SSH credentials are kept safe and never shared publicly.


Repository Setup

After logging in, switch to root and set up the dashboard:

sudo su -
cd /var/www/
npx create-slurm-dashboard slurm-node-dashboard

Note:

Explanation:
sudo su - switches to the root user, ensuring you have the necessary permissions for installation.
cd /var/www/ changes the directory to the web server's root folder.
npx create-slurm-dashboard slurm-node-dashboard uses NPX to scaffold the HPC Dashboard project in a directory called slurm-node-dashboard.

Note:

If you accidentally select the wrong version during the setup prompts, simply delete the slurm-node-dashboard directory and run the command again.


Configuration

Install Dependencies

Change into the dashboard directory and install the required Node.js packages:

cd slurm-node-dashboard
npm install

Environment Configuration

Set up the production environment by renaming the environment file. Move the file (rather than copying it) so that a second environment file cannot cause conflicts:

mv .env.production .env

Then, update the .env file with your configuration:

COMPANY_NAME="Tutorial"
VERSION=1.1.2
CLUSTER_NAME="Tutorial"
CLUSTER_LOGO="/logo.png"

PROMETHEUS_URL=""
OPENAI_API_KEY=""

NODE_ENV="production"
REACT_EDITOR="code"

SLURM_API_VERSION="v0.0.40"
SLURM_SERVER="your_slurm_server_ip_address"
SLURM_API_TOKEN="your_slurm_api_token" 
SLURM_API_ACCOUNT="slurm"
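Before moving on, it can help to sanity-check that the keys the dashboard needs are all present. A minimal sketch (the list of required keys is an assumption based on the example above):

```shell
# check_env: fail if any of the required SLURM keys is missing from an env file.
check_env() {
    env_file="$1"
    for key in SLURM_API_VERSION SLURM_SERVER SLURM_API_TOKEN SLURM_API_ACCOUNT; do
        grep -q "^${key}=" "$env_file" || { echo "missing: $key"; return 1; }
    done
    echo "ok"
}

# Usage (from the project root):
#   check_env .env
```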

Note:

Do not expose your .env file or its contents in a public repository as it contains sensitive information like API keys and tokens.


Dashboard Deployment

Development Mode

Start the development server with the following command:

npm run dev

Access the dashboard at http://your_vm_ip:3020

Note:

Running npm run dev starts a development server that automatically reloads changes. This mode is ideal for testing and development but not for production.

The dashboard displays all tutorial systems, which are running as SLURM compute nodes. Hover over nodes to view system details including hostname, load, and core usage. Click on nodes to view running jobs.

The interface includes:

  • Color scheme options (top right)
  • Menu access to historical data and modules pages
  • GitHub link for bug reports and forking

Adding Prometheus Integration

Currently, the dashboard supports node_exporter, ipmi_exporter, and dcgm_exporter.

To enable Prometheus:

  1. Stop the development server (Ctrl+C)

  2. Update the .env file:

    # Change this line
    PROMETHEUS_URL="http://prometheus_ip_address:9090"
    
  3. Restart the development server:

    npm run dev
    

Note:

Prometheus is a powerful monitoring system and time series database. Integrating it allows you to gather and display metrics like power data and node performance. Even though the data is simulated for VMs, in production, this will display real-time hardware metrics.
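Assuming the .env edit above, you can read the value back to confirm the integration is enabled (and, from the VM, spot-check the endpoint itself with curl -s "$url/-/ready"):

```shell
# prometheus_target: print the PROMETHEUS_URL value from an env file,
# with surrounding quotes stripped; prints nothing if the key is unset.
prometheus_target() {
    grep '^PROMETHEUS_URL=' "$1" | cut -d= -f2- | tr -d '"'
}

# Usage (from the project root):
#   prometheus_target .env
```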


Production Setup

For production deployment, use PM2 to manage the dashboard:

# Install PM2
npm install -g pm2

# Build and start the application
npm run build
pm2 start npm --name "hpc-dashboard" -- start -- --port 3020
pm2 save

Note:

PM2 is a process manager that keeps your application running and can restart it automatically after crashes.
npm run build compiles the application for production.
pm2 save stores the current process list so that PM2 can resurrect these processes on restart. Note that pm2 save alone does not enable boot persistence; run pm2 startup once and follow its instructions so that PM2 itself starts with the server.

Note:

Always test your production build in a staging environment before deploying to live users.


SSL Certificate Setup

Let's set up SSL certificates for both the dashboard and Open OnDemand. Secure communication is critical to protect data and ensure that connections between users and your services are encrypted. Some of these steps are only required if you are running both Open OnDemand and the dashboard on the same system; if you are not, you will not need to change the Nginx port.

1. Install Required Packages

dnf install -y nginx certbot

2. Configure Nginx to Use a Different Port

Update Nginx to listen on port 8080 instead of the default port 80:

sed -i 's/ 80 default_server;/ 8080 default_server;/g' /etc/nginx/nginx.conf
sed -i 's/]:80 default_server;/]:8080 default_server;/g' /etc/nginx/nginx.conf
systemctl restart nginx
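To see what those sed expressions actually rewrite, here is the same substitution applied to sample listen directives (the exact lines in your nginx.conf may differ slightly):

```shell
# The stock nginx.conf contains listen directives like these; the sed commands
# above move both the IPv4 and IPv6 variants from port 80 to 8080.
printf 'listen 80 default_server;\nlisten [::]:80 default_server;\n' \
  | sed -e 's/ 80 default_server;/ 8080 default_server;/g' \
        -e 's/]:80 default_server;/]:8080 default_server;/g'
```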

For the next step to work, stop httpd (where Open OnDemand is running), as well as nginx if it is still listening on port 80:

systemctl stop httpd nginx

3. Generate SSL Certificate

Generate a certificate using your hostname:

certbot certonly --standalone --register-unsafely-without-email -d $(hostname -f) --agree-tos

Note:

The --register-unsafely-without-email flag means you won't receive renewal reminders. Consider registering with a valid email for production environments.
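Let's Encrypt certificates expire after 90 days, so renewal should be automated. One way is a cron entry; this is a sketch, and the hooks assume the standalone method used above (port 80 must be free while certbot renews):

```
# Attempt renewal daily at 03:00; stop the port-80 services while certbot's
# standalone server runs, then start them again.
0 3 * * * certbot renew --quiet --pre-hook "systemctl stop httpd nginx" --post-hook "systemctl start httpd nginx"
```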

4. Create Dashboard Nginx Configuration

Create a configuration file for the dashboard:

cat > /etc/nginx/conf.d/slurmdash.conf << EOF
server {
    listen 8443 ssl;
    server_name $(hostname -f);

    ssl_certificate /etc/letsencrypt/live/$(hostname -f)/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$(hostname -f)/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3020;
        proxy_set_header Host \$host;
        proxy_set_header X-Real-IP \$remote_addr;
        proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
EOF

5. Configure Open OnDemand SSL

Add SSL configuration to Open OnDemand:

cat >> /etc/ood/config/ood_portal.yml << EOF
ssl:
  - 'SSLCertificateFile "/etc/letsencrypt/live/$(hostname -f)/cert.pem"'
  - 'SSLCertificateKeyFile "/etc/letsencrypt/live/$(hostname -f)/privkey.pem"'
EOF

6. Restart Services

Restart the necessary services to apply the new configurations:

systemctl restart httpd nginx

Note:

Restarting services may temporarily disrupt active connections. Ensure this is done during a maintenance window if deployed in a production environment.

Now both services should be accessible via HTTPS:

https://YOUR_OOD_HOSTNAME        # Open OnDemand
https://YOUR_DASH_HOSTNAME:8443    # Dashboard

Open OnDemand Integration

Open OnDemand is pre-installed on the VM. Your rocky credentials will work here as well.

Note:

Open OnDemand provides a web-based interface to HPC resources, making it easier for users to submit jobs, monitor progress, and manage files without needing to use the command line directly.

Installation Steps

  1. Clone the integration repository:
cd /var/www/ood/apps/sys/
git clone https://github.com/thediymaker/ood-status-iframe.git && cd ood-status-iframe
  2. Set up the Python environment:
python3 -m venv ood-status-iframe
source ood-status-iframe/bin/activate
python3 -m pip install -r requirements.txt
chmod +x bin/python
deactivate

Configuration Steps

  1. Update the iframe URL in templates/layout.html:
sudo sed -i "s|<iframe src=\".*\"|<iframe src=\"https://$(hostname -f):8443\"|" templates/layout.html
  2. Configure manifest.yml. You can update this to fit your needs:
name: System Status
description: HPC Status Page
category: System
subcategory: System Information
icon: fa://bar-chart
show_in_menu: true

Note:

If you update the name of the environment, ensure that the path in the bin/python file is updated accordingly. Also, make sure the bin/python file remains executable.

Access Open OnDemand at https://YOUR_OOD_HOSTNAME/

The status page is available under System → System Status.

Note:

The first time you access this page, you will need to initialize the app by clicking the red "Initialize App" box.


Usage

To submit a test job:

  1. Switch to the rocky user and prepare the batch script:
su - rocky
cd /scratch
cp /packages/slurm/submit.sbatch ./$(hostname -s).sbatch
  2. Edit the script to specify your node; the script will be named something like good-c1.sbatch:
#SBATCH -w YOUR_HOSTNAME      (for example: #SBATCH -w good-c1)
  3. Submit and monitor jobs:
sbatch $(hostname -s).sbatch
scontrol show jobs
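The exact contents of the provided submit.sbatch may differ; a minimal batch script of the same shape (all names and values here are illustrative) looks like:

```
#!/bin/bash
#SBATCH --job-name=tutorial-test
#SBATCH -w good-c1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=%x-%j.out

hostname
sleep 60
```

The sleep keeps the job alive long enough to watch it appear on the dashboard and in Open OnDemand.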

Note:

This section demonstrates a simple workflow to submit a job to the SLURM scheduler. Make sure the node specified in your batch script corresponds to an available compute node.


Best Practices

  • Implement SSL certificates (as covered in this guide)
  • Deploy behind NGINX or HTTPD with SSL (as covered in this guide)
  • Enable authentication for Open OnDemand
  • Restrict dashboard access
  • Maintain regular security updates

Note:

Following best practices is crucial for maintaining a secure and stable HPC environment.


Additional Resources


Video Tutorials


Made with ❤️ for HPC