Basic Features

Overview of core HPC Dashboard capabilities

Basic Dashboard Features

Introduction

The HPC Dashboard provides a comprehensive monitoring and management interface for High-Performance Computing environments. Designed with both system administrators and end users in mind, it offers real-time insights into cluster status, job information, and resource utilization.

History

This dashboard project replaces our legacy system, which was built with PHP and relied heavily on complex SQL queries. The previous implementation often required nearly a minute for page reloads and sometimes exceeded browser memory limits. While effective for its time, the original dashboard eventually became difficult to maintain as its complexity grew.

Figure 1. PHP-based Agave cluster status

When we brought the new Sol cluster online, we needed an interim solution. We couldn't support the new cluster with our legacy system, as it required custom queries and databases we were no longer using. This led us to develop the Python-based status page shown here, which was only available through Open OnDemand.

Figure 2. Python-based Sol cluster status

After evaluating our requirements and community needs, we built the current dashboard. The goal was to use modern tools and established frameworks so that the result could be easily shared, maintained, and extended by the HPC community.

Figure 3. Next.js-based Sol cluster status

Overview

The primary purpose of this dashboard is to provide real-time visibility into a site's HPC system. It enables end users to monitor the cluster's operational status, stay informed about upcoming maintenance events, access detailed job information, and view individual node statuses. The dashboard serves as a central information hub that promotes transparency and efficient resource utilization.

Real-time Node Status

The most recognizable component of this dashboard is the landing page, featuring color-coded squares that provide an immediate visual indication of system status.

This interface is designed to function effectively in Network Operations Center (NOC) environments. It allows operators and users to assess system health and availability at a glance. The intuitive color scheme helps quickly identify operational nodes, nodes with issues, and those undergoing maintenance.

Figure 4. Node status
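
As a rough illustration of how such a color-coded view might be driven, the sketch below reduces a node's state flags to a single display category and color. The flag names are modeled on Slurm node states, but their exact spelling, the category names, the severity ordering, the colors, and the functions `categorize` and `statusColor` are all assumptions made for this example, not the dashboard's actual implementation.

```typescript
// Hypothetical display categories for a node square.
type Category = "unknown" | "available" | "busy" | "problem" | "maintenance";

// Ordered from least to most severe; the ordering is an assumption.
const SEVERITY: Category[] = ["unknown", "available", "busy", "problem", "maintenance"];

// State flags modeled on Slurm node states (spelling assumed).
const FLAG_CATEGORY: Record<string, Category> = {
  IDLE: "available",
  MIXED: "busy",
  ALLOCATED: "busy",
  DOWN: "problem",
  DRAIN: "problem",
  MAINT: "maintenance",
  MAINTENANCE: "maintenance",
};

// Example palette only; the real dashboard may use different colors.
const CATEGORY_COLOR: Record<Category, string> = {
  unknown: "gray",
  available: "green",
  busy: "blue",
  problem: "red",
  maintenance: "orange",
};

// A node can carry several flags at once (for example IDLE plus DRAIN),
// so the most severe matching category wins.
function categorize(flags: string[]): Category {
  let worst: Category = "unknown";
  for (const flag of flags) {
    const category = FLAG_CATEGORY[flag.toUpperCase()] ?? "unknown";
    if (SEVERITY.indexOf(category) > SEVERITY.indexOf(worst)) worst = category;
  }
  return worst;
}

function statusColor(flags: string[]): string {
  return CATEGORY_COLOR[categorize(flags)];
}
```

With this mapping, a node reported as idle but draining is shown as a problem rather than as available, which matches the at-a-glance purpose of the view.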

Maintenance Notifications

A highly valued feature is the maintenance notification system, which provides timely reminders about scheduled system downtime.

These notifications appear as prominent warning messages. Once dismissed, a notification does not reappear for 24 hours, so users stay informed without being repeatedly interrupted. The maintenance information is pulled directly from the Slurm workload manager (a job scheduling system), which keeps it consistent across all system interfaces.

Figure 5. Maintenance alerts
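
The 24-hour dismissal behavior can be sketched as a small pure function. This is a minimal sketch, assuming the time of the last dismissal is kept somewhere on the client (for example in browser storage). The names `SNOOZE_MS`, `shouldShowNotice`, `Reservation`, and `nextMaintenance` are hypothetical, and the reservation shape is a simplification of what Slurm reports.

```typescript
// 24 hours in milliseconds, matching the dismissal window described above.
const SNOOZE_MS = 24 * 60 * 60 * 1000;

// dismissedAt is the epoch-millisecond time of the last dismissal,
// or null if the user has never dismissed the notice.
function shouldShowNotice(dismissedAt: number | null, now: number): boolean {
  if (dismissedAt === null) return true;
  return now - dismissedAt >= SNOOZE_MS;
}

// Hypothetical, simplified view of a Slurm maintenance reservation.
interface Reservation {
  name: string;
  startTime: number; // epoch milliseconds
  endTime: number; // epoch milliseconds
}

// Returns the earliest-starting reservation that has not ended yet,
// or null if no such reservation exists.
function nextMaintenance(reservations: Reservation[], now: number): Reservation | null {
  let next: Reservation | null = null;
  for (const r of reservations) {
    if (r.endTime <= now) continue; // already finished
    if (next === null || r.startTime < next.startTime) next = r;
  }
  return next;
}
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps both helpers easy to test.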

Rack View

The rack view shows the physical location of each system in detail. Administrators can also use it to group nodes by any other classification scheme they find useful.

The rack view offers an intuitive representation of the cluster's physical architecture, helping both administrators and users better understand the system's organization.

Figure 6. Rack view
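
Grouping nodes by rack, or by any other label, comes down to bucketing them by a key. The following is a minimal sketch of that idea. The `RackNode` shape, its `rack` and `owner` fields, the sample data, and the `groupNodes` helper are hypothetical and only illustrate the concept.

```typescript
// Hypothetical node record for this example.
interface RackNode {
  name: string;
  rack: string; // physical location
  owner: string; // an example of a non-physical classification
}

// Buckets items by whatever key the caller derives from each one.
function groupNodes<T>(nodes: T[], keyOf: (node: T) => string): Record<string, T[]> {
  const groups: Record<string, T[]> = {};
  for (const node of nodes) {
    const key = keyOf(node);
    const bucket = groups[key];
    if (bucket === undefined) groups[key] = [node];
    else bucket.push(node);
  }
  return groups;
}

// Made-up sample data.
const exampleNodes: RackNode[] = [
  { name: "c001", rack: "r1", owner: "physics" },
  { name: "c002", rack: "r1", owner: "biology" },
  { name: "g001", rack: "r2", owner: "physics" },
];

// The same nodes, organized two different ways.
const byRack = groupNodes(exampleNodes, (n) => n.rack);
const byOwner = groupNodes(exampleNodes, (n) => n.owner);
```

Because the key function is supplied by the caller, the same helper covers both the physical layout and any custom scheme.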

Note: We plan to add a Netbox plugin in the future to automatically pull rack and node details from the Netbox API, further streamlining system management.

User Job Details

The user job details feature enables end users to search for specific job IDs and retrieve comprehensive information about those jobs.

For completed jobs, the system provides valuable performance metrics, including resource utilization efficiency. This information helps users optimize their workflows and make more effective use of cluster resources.

Figure 7. Job details
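
One common way to express resource utilization efficiency, similar in spirit to the seff script from Slurm's contribs, is to compare what a job used with what it was allocated. The sketch below assumes that definition. The `JobUsage` fields and the function names are hypothetical, and the dashboard may compute its metrics differently.

```typescript
// Hypothetical summary of a completed job's accounting data.
interface JobUsage {
  elapsedSeconds: number; // wall-clock run time
  allocatedCpus: number; // CPU cores allocated to the job
  totalCpuSeconds: number; // CPU time actually consumed across all cores
  requestedMemBytes: number; // memory requested
  maxRssBytes: number; // peak resident memory observed
}

// Fraction of the allocated core-seconds that did useful work.
function cpuEfficiency(job: JobUsage): number {
  const coreSeconds = job.elapsedSeconds * job.allocatedCpus;
  return coreSeconds > 0 ? job.totalCpuSeconds / coreSeconds : 0;
}

// Fraction of the requested memory that was actually used at peak.
function memoryEfficiency(job: JobUsage): number {
  return job.requestedMemBytes > 0 ? job.maxRssBytes / job.requestedMemBytes : 0;
}

// Made-up job: 4 cores for 1 hour, 2 core-hours of CPU time,
// 2 GiB peak memory out of 8 GiB requested.
const exampleJob: JobUsage = {
  elapsedSeconds: 3600,
  allocatedCpus: 4,
  totalCpuSeconds: 7200,
  requestedMemBytes: 8 * 1024 ** 3,
  maxRssBytes: 2 * 1024 ** 3,
};

const cpu = cpuEfficiency(exampleJob); // 0.5
const mem = memoryEfficiency(exampleJob); // 0.25
```

In this example the job would have run just as well with half the cores and a quarter of the memory, which is the kind of signal that helps users right-size future requests.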

Note: We are working to add GPU efficiency metrics to the job details panel in the next release, enhancing visibility into accelerator utilization.

Historical Cluster Status

The historical cluster status page stores periodic snapshots of your system, so administrators and users can review its status at previous points in time.

Snapshots are captured at a configurable interval, one hour by default. The resulting timeline of system performance and utilization patterns can inform future resource planning.

Figure 8. Historical view
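
Looking up the state of the cluster at a past moment amounts to finding the most recent snapshot taken at or before that moment. The following is a minimal sketch, assuming snapshots are identified by their capture time in epoch milliseconds. `DEFAULT_INTERVAL_MS` mirrors the one-hour default, while `snapshotAtOrBefore` is a hypothetical helper, not part of the dashboard.

```typescript
// One hour in milliseconds, the default snapshot interval.
const DEFAULT_INTERVAL_MS = 60 * 60 * 1000;

// Returns the latest capture time that is not after the target,
// or null if every snapshot is newer than the target.
// The input does not need to be sorted.
function snapshotAtOrBefore(captureTimes: number[], target: number): number | null {
  let best: number | null = null;
  for (const t of captureTimes) {
    if (t <= target && (best === null || t > best)) best = t;
  }
  return best;
}

// Three hourly snapshots starting at time zero.
const captureTimes = [0, 1, 2].map((i) => i * DEFAULT_INTERVAL_MS);

// A request for "half past one" resolves to the snapshot taken at hour one.
const halfPastOne = 1.5 * DEFAULT_INTERVAL_MS;
const resolved = snapshotAtOrBefore(captureTimes, halfPastOne); // 3600000
```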

Partition, Feature, and QoS Information

Users can filter and sort nodes by a range of attributes, including:

  • Node types
  • Features
  • States
  • Partitions

This functionality helps users identify the most appropriate resources for their specific computational needs, improving overall system utilization and job performance.
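
The filtering described above can be sketched as a function that applies only the criteria the user has set and then sorts the matches. This is an illustrative sketch. The `NodeInfo` and `NodeFilter` shapes and the `filterNodes` helper are hypothetical, and it covers state, partition, and feature filters only.

```typescript
// Hypothetical node record for this example.
interface NodeInfo {
  name: string;
  state: string;
  partitions: string[];
  features: string[];
}

// Every criterion is optional; an omitted one matches all nodes.
interface NodeFilter {
  state?: string;
  partition?: string;
  feature?: string;
}

// Keeps nodes that satisfy every supplied criterion, sorted by name.
function filterNodes(nodes: NodeInfo[], filter: NodeFilter): NodeInfo[] {
  return nodes
    .filter(
      (node) =>
        (filter.state === undefined || node.state === filter.state) &&
        (filter.partition === undefined ||
          node.partitions.some((p) => p === filter.partition)) &&
        (filter.feature === undefined ||
          node.features.some((f) => f === filter.feature))
    )
    .sort((a, b) => a.name.localeCompare(b.name));
}

// Made-up sample data; the partition and feature names are examples only.
const sampleNodes: NodeInfo[] = [
  { name: "c002", state: "idle", partitions: ["general"], features: ["avx512"] },
  { name: "g001", state: "idle", partitions: ["gpu"], features: ["a100"] },
  { name: "c001", state: "mixed", partitions: ["general"], features: ["avx512"] },
];

// Idle nodes that advertise the "a100" feature.
const idleGpuNodes = filterNodes(sampleNodes, { state: "idle", feature: "a100" });
```

Because `filter` returns a new array, the in-place `sort` does not reorder the caller's original list.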

Modules

The dashboard offers an easy way for users to browse the available software modules on the system. Users can quickly find which applications and libraries are installed and ready for use.

We are currently developing additional functionality to show usage statistics, such as the number of module loads over a specified period. This will allow users to browse modules by popularity, helping them discover commonly used tools in their research domain.

Figure 9. Module view