HPC Dashboard Tutorial G.O.O.D. 2025

Integrating the HPC Dashboard with Open OnDemand

HPC Dashboard Getting Started Guide

Welcome to the HPC Dashboard Tutorial! This guide will walk you through deploying and running the HPC Dashboard on your provided virtual machine (VM).

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Environment Setup
  4. Repository Setup
  5. Configuration
  6. Dashboard Deployment
  7. Theme Customization
  8. Production Setup
  9. SSL Certificate Setup
  10. Open OnDemand Integration
  11. Usage
  12. Best Practices
  13. Additional Resources

Introduction

The HPC Dashboard is a Next.js application designed for real-time monitoring of SLURM nodes. It provides detailed insights into CPU/GPU utilization, node status, job histories, and more. This tutorial will guide you through:

  • Deploying the dashboard on a VM with a public IP
  • Connecting to a remote SLURM API and Prometheus instance
  • Setting up the Open OnDemand integration

Note:

Next.js is chosen for its server-side rendering capabilities and ease of deployment, making it an excellent option for creating dynamic and interactive dashboards. SLURM is widely used in high-performance computing environments to manage jobs and resources.


Prerequisites

Ensure you have the following:

  • VM Access
    • A Linux virtual machine
    • Open OnDemand running either on the same VM or on another system
    • Prometheus running with the relevant exporters
    • A SLURM API key
    • The SLURM and Prometheus IP addresses
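The SLURM API key is a JWT. If your cluster uses SLURM's auth/jwt plugin, scontrol can mint one for you; this is a sketch only (the username and 30-day lifespan are placeholder choices, and the scontrol line is shown as a comment because it must be run on the SLURM controller):

```shell
# Example only: mint a JWT for the dashboard on the SLURM controller
# (assumes auth/jwt is configured in slurm.conf):
#   scontrol token username=slurm lifespan=$LIFESPAN
# Lifespans are given in seconds; for example, 30 days:
LIFESPAN=$((30*24*3600))
echo "$LIFESPAN"
```

The resulting token is what goes into SLURM_API_TOKEN later in this guide.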

Note:

Make sure your VM is securely configured and that you keep your SLURM API key confidential. Unauthorized access could lead to misuse of your computing resources.


Environment Setup

SSH Connection

To get started, you will need to SSH to the system.

ssh rocky@YOUR_VM

Note:

SSH (Secure Shell) is used to securely access remote systems. Always ensure that your SSH credentials are kept safe and never shared publicly.


Repository Setup

After logging in, switch to root and set up the dashboard:

sudo su -
cd /var/www/
npx create-slurm-dashboard slurm-node-dashboard

Note:

Explanation:
sudo su - switches to the root user, ensuring you have the necessary permissions for installation.
cd /var/www/ changes the directory to the web server's root folder.
npx create-slurm-dashboard slurm-node-dashboard uses NPX to scaffold the HPC Dashboard project in a directory called slurm-node-dashboard.

Note:

If you accidentally select the wrong version during the setup prompts, simply delete the slurm-node-dashboard directory and run the command again.


Configuration

Install Dependencies

Change into the dashboard directory and install the required Node.js packages:

cd slurm-node-dashboard
npm install

Environment Configuration

Set up the production environment by renaming the environment file. Move the file (rather than copying it) so that a second environment file cannot cause conflicts:

mv .env.production .env

Then, update the .env file with your configuration:

COMPANY_NAME="Tutorial"
VERSION=1.1.2
CLUSTER_NAME="Tutorial"
CLUSTER_LOGO="/logo.png"

PROMETHEUS_URL=""
OPENAI_API_KEY=""

NODE_ENV="production"
REACT_EDITOR="code"

SLURM_API_VERSION="v0.0.40"
SLURM_SERVER="your_slurm_server_ip_address"
SLURM_API_TOKEN="your_slurm_api_token" 
SLURM_API_ACCOUNT="slurm"
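Before moving on, it can help to sanity-check that the keys the dashboard needs are all present. A minimal sketch (the list of required keys is an assumption based on the example above):

```shell
# check_env: fail if any of the required SLURM keys is missing from an env file.
check_env() {
    env_file="$1"
    for key in SLURM_API_VERSION SLURM_SERVER SLURM_API_TOKEN SLURM_API_ACCOUNT; do
        grep -q "^${key}=" "$env_file" || { echo "missing: $key"; return 1; }
    done
    echo "ok"
}

# Usage (from the project root):
#   check_env .env
```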

Note:

Do not expose your .env file or its contents in a public repository as it contains sensitive information like API keys and tokens.


Dashboard Deployment

Development Mode

Start the development server with the following command:

npm run dev

Access the dashboard at http://your_vm_ip:3020

Note:

Running npm run dev starts a development server that automatically reloads changes. This mode is ideal for testing and development but not for production.

The dashboard displays all tutorial systems, which are running as SLURM compute nodes. Hover over nodes to view system details including hostname, load, and core usage. Click on nodes to view running jobs.

The interface includes:

  • Color scheme options (top right)
  • Menu access to historical data and modules pages
  • GitHub link for bug reports and forking

Adding Prometheus Integration

Currently, the dashboard supports node_exporter, ipmi_exporter, and dcgm_exporter.

To enable Prometheus:

  1. Stop the development server (Ctrl+C)

  2. Update the .env file:

    # Change this line
    PROMETHEUS_URL="http://prometheus_ip_address:9090"
    
  3. Restart the development server:

    npm run dev
    

Note:

Prometheus is a powerful monitoring system and time series database. Integrating it allows you to gather and display metrics like power data and node performance. Even though the data is simulated for VMs, in production, this will display real-time hardware metrics.
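Assuming the .env edit above, you can read the value back to confirm the integration is enabled (and, from the VM, spot-check the endpoint itself with curl -s "$url/-/ready"):

```shell
# prometheus_target: print the PROMETHEUS_URL value from an env file,
# with surrounding quotes stripped; prints nothing if the key is unset.
prometheus_target() {
    grep '^PROMETHEUS_URL=' "$1" | cut -d= -f2- | tr -d '"'
}

# Usage (from the project root):
#   prometheus_target .env
```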


Production Setup

For production deployment, use PM2 to manage the dashboard:

# Install PM2
npm install -g pm2

# Build and start the application
npm run build
pm2 start npm --name "hpc-dashboard" -- start -- --port 3020
pm2 save

Note:

PM2 is a process manager that keeps your application running and can restart it automatically after crashes.
npm run build compiles the application for production.
pm2 save stores the current process list so that PM2 can resurrect these processes on restart. Note that pm2 save alone does not enable boot persistence; run pm2 startup once and follow its instructions so that PM2 itself starts with the server.

Note:

Always test your production build in a staging environment before deploying to live users.


SSL Certificate Setup

Let's set up SSL certificates for both the dashboard and Open OnDemand. Secure communication is critical to protect data and ensure that connections between users and your services are encrypted. Some of these steps are only required if you are running both Open OnDemand and the dashboard on the same system; if you are not, you will not need to change the Nginx port.

1. Install Required Packages

dnf install -y nginx certbot

2. Configure Nginx to Use a Different Port

Update Nginx to listen on port 8080 instead of the default port 80:

sed -i 's/ 80 default_server;/ 8080 default_server;/g' /etc/nginx/nginx.conf
sed -i 's/]:80 default_server;/]:8080 default_server;/g' /etc/nginx/nginx.conf
systemctl restart nginx
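To see what those sed expressions actually rewrite, here is the same substitution applied to sample listen directives (the exact lines in your nginx.conf may differ slightly):

```shell
# The stock nginx.conf contains listen directives like these; the sed commands
# above move both the IPv4 and IPv6 variants from port 80 to 8080.
printf 'listen 80 default_server;\nlisten [::]:80 default_server;\n' \
  | sed -e 's/ 80 default_server;/ 8080 default_server;/g' \
        -e 's/]:80 default_server;/]:8080 default_server;/g'
```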

For the next step to work, stop httpd (where Open OnDemand is running), as well as nginx if it is still listening on port 80:

systemctl stop httpd nginx

3. Generate SSL Certificate

Generate a certificate using your hostname:

certbot certonly --standalone --register-unsafely-without-email -d $(hostname -f) --agree-tos

Note:

The --register-unsafely-without-email flag means you won't receive renewal reminders. Consider registering with a valid email for production environments.
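Let's Encrypt certificates expire after 90 days, so renewal should be automated. One way is a cron entry; this is a sketch, and the hooks assume the standalone method used above (port 80 must be free while certbot renews):

```
# Attempt renewal daily at 03:00; stop the port-80 services while certbot's
# standalone server runs, then start them again.
0 3 * * * certbot renew --quiet --pre-hook "systemctl stop httpd nginx" --post-hook "systemctl start httpd nginx"
```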

4. Create Dashboard Nginx Configuration

Create a configuration file for the dashboard:

cat > /etc/nginx/conf.d/slurmdash.conf << EOF
server {
    listen 8443 ssl;
    server_name $(hostname -f);

    ssl_certificate /etc/letsencrypt/live/$(hostname -f)/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$(hostname -f)/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3020;
        proxy_set_header Host \$host;
        proxy_set_header X-Real-IP \$remote_addr;
        proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
EOF

5. Configure Open OnDemand SSL

Add SSL configuration to Open OnDemand:

cat >> /etc/ood/config/ood_portal.yml << EOF
ssl:
  - 'SSLCertificateFile "/etc/letsencrypt/live/$(hostname -f)/cert.pem"'
  - 'SSLCertificateKeyFile "/etc/letsencrypt/live/$(hostname -f)/privkey.pem"'
EOF

6. Restart Services

Restart the necessary services to apply the new configurations:

systemctl restart httpd nginx

Note:

Restarting services may temporarily disrupt active connections. Ensure this is done during a maintenance window if deployed in a production environment.

Now both services should be accessible via HTTPS:

https://YOUR_OOD_HOSTNAME        # Open OnDemand
https://YOUR_DASH_HOSTNAME:8443    # Dashboard

Open OnDemand Integration

Open OnDemand is pre-installed on the VM. Your rocky credentials will work here as well.

Note:

Open OnDemand provides a web-based interface to HPC resources, making it easier for users to submit jobs, monitor progress, and manage files without needing to use the command line directly.

Installation Steps

  1. Clone the integration repository:
cd /var/www/ood/apps/sys/
git clone https://github.com/thediymaker/ood-status-iframe.git && cd ood-status-iframe
  2. Set up the Python environment:
python3 -m venv ood-status-iframe
source ood-status-iframe/bin/activate
python3 -m pip install -r requirements.txt
chmod +x bin/python
deactivate

Configuration Steps

  1. Update the iframe URL in templates/layout.html:
sudo sed -i "s|<iframe src=\".*\"|<iframe src=\"https://$(hostname -f):8443\"|" templates/layout.html
  2. Configure manifest.yml. You can update this to fit your needs:
name: System Status
description: HPC Status Page
category: System
subcategory: System Information
icon: fa://bar-chart
show_in_menu: true

Note:

If you update the name of the environment, ensure that the path in the bin/python file is updated accordingly. Also, make sure the bin/python file remains executable.

Access Open OnDemand at https://YOUR_OOD_HOSTNAME/

The status page is available under System → System Status.

Note:

The first time you access this page, you will need to initialize the app by clicking the red "Initialize App" box.


Usage

To submit a test job:

  1. Switch to the rocky user and prepare the batch script:
su - rocky
cd /scratch
cp /packages/slurm/submit.sbatch ./$(hostname -s).sbatch
  2. Edit the script to specify your node; the script will be named something like good-c1.sbatch:
#SBATCH -w YOUR_HOSTNAME      (for example: #SBATCH -w good-c1)
  3. Submit and monitor jobs:
sbatch $(hostname -s).sbatch
scontrol show jobs
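The exact contents of the provided submit.sbatch may differ; a minimal batch script of the same shape (all names and values here are illustrative) looks like:

```
#!/bin/bash
#SBATCH --job-name=tutorial-test
#SBATCH -w good-c1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=%x-%j.out

hostname
sleep 60
```

The sleep keeps the job alive long enough to watch it appear on the dashboard and in Open OnDemand.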

Note:

This section demonstrates a simple workflow to submit a job to the SLURM scheduler. Make sure the node specified in your batch script corresponds to an available compute node.


Best Practices

  • Implement SSL certificates (as covered in this guide)
  • Deploy behind NGINX or HTTPD with SSL (as covered in this guide)
  • Enable authentication for Open OnDemand
  • Restrict dashboard access
  • Maintain regular security updates

Note:

Following best practices is crucial for maintaining a secure and stable HPC environment.


Additional Resources


Video Tutorials


Made with ❤️ for HPC