Basic Features
Overview of core HPC Dashboard capabilities
Basic Dashboard Features
Introduction
The HPC Dashboard provides a comprehensive monitoring and management interface for High-Performance Computing environments. Designed with both system administrators and end users in mind, it offers real-time insights into cluster status, job information, and resource utilization.
History
This dashboard project replaces our legacy system, which was built with PHP and relied heavily on complex SQL queries. The previous implementation often required nearly a minute for page reloads and sometimes exceeded browser memory limits. While effective for its time, the original dashboard eventually became difficult to maintain as its complexity grew.
When we brought the new Sol cluster online, we needed an interim solution. We couldn't support the new cluster with our legacy system, as it required custom queries and databases we were no longer using. This led us to develop the Python-based status page shown here, which was only available through Open OnDemand.
After evaluating our requirements and community needs, we built the current dashboard. The goal was to use modern tools and established frameworks so that the result could be easily shared, maintained, and extended by the HPC community.
Overview
The primary purpose of this dashboard is to provide real-time visibility into a site's HPC system. It enables end users to monitor the cluster's operational status, stay informed about upcoming maintenance events, access detailed job information, and view individual node statuses. The dashboard serves as a central information hub that promotes transparency and efficient resource utilization.
Real-time Node Status
The most recognizable component of this dashboard is the landing page, featuring color-coded squares that provide an immediate visual indication of system status.
This interface is designed to function effectively in Network Operations Center (NOC) environments. It allows operators and users to assess system health and availability at a glance. The intuitive color scheme helps quickly identify operational nodes, nodes with issues, and those undergoing maintenance.
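As a rough illustration of how a color-coded square might be derived from a node's state, the sketch below maps Slurm-style state flags to a color. The palette, the priority order, and the `node_color` helper are assumptions made for this example, not the dashboard's actual implementation.

```python
# Priority order: the first state found among a node's flags decides the color.
# Both the ordering and the colors are illustrative assumptions.
COLOR_BY_STATE = [
    ("DOWN", "red"),
    ("DRAIN", "orange"),
    ("MAINT", "blue"),
    ("ALLOCATED", "purple"),
    ("MIXED", "yellow"),
    ("IDLE", "green"),
]


def node_color(state_flags):
    """Return a display color for a node given its list of state flags."""
    flags = {flag.upper() for flag in state_flags}
    for state, color in COLOR_BY_STATE:
        if state in flags:
            return color
    # Anything unrecognized falls back to a neutral color.
    return "gray"


if __name__ == "__main__":
    print(node_color(["IDLE"]))
    print(node_color(["IDLE", "DRAIN"]))
```

Checking problem states before healthy ones means a node that is both idle and draining is shown as needing attention rather than as available.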
Maintenance Notifications
A highly valued feature is the maintenance notification system, which provides timely reminders about scheduled system downtime.
These notifications appear as prominent warning messages that will not reappear for 24 hours after being dismissed. This ensures users are informed without being repeatedly interrupted. The maintenance information is pulled directly from the Slurm workload manager (a job scheduling system), ensuring consistency across all system interfaces.
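The 24-hour dismissal behavior can be sketched as a simple time comparison. The function name `should_show_notice` and the idea of storing a single dismissal timestamp are hypothetical; the dashboard may track dismissals differently.

```python
from datetime import datetime, timedelta, timezone

# Suppression window described above: 24 hours after dismissal.
SUPPRESS_FOR = timedelta(hours=24)


def should_show_notice(dismissed_at, now):
    """Return True if the maintenance notice should be displayed.

    dismissed_at is the time the user last dismissed the notice,
    or None if it has never been dismissed.
    """
    if dismissed_at is None:
        return True
    return now - dismissed_at >= SUPPRESS_FOR


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(should_show_notice(None, now))
    print(should_show_notice(now - timedelta(hours=2), now))
```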
Rack View
This view shows where each system is physically located. Administrators can also use it to group nodes by any other classification scheme they find useful.
The rack view offers an intuitive representation of the cluster's physical architecture, helping both administrators and users better understand the system's organization.
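To make the grouping idea concrete, here is a minimal sketch that arranges nodes by rack or by any other attribute. The node dictionaries, the `rack` key, and the `group_nodes` helper are invented for illustration and do not reflect the dashboard's configuration format.

```python
from collections import defaultdict


def group_nodes(nodes, key="rack"):
    """Group node names by the given attribute.

    Nodes that lack the attribute are collected under "unassigned".
    """
    groups = defaultdict(list)
    for node in nodes:
        groups[node.get(key, "unassigned")].append(node["name"])
    # Sort group labels and node names so the layout is stable between reloads.
    return {label: sorted(names) for label, names in sorted(groups.items())}


if __name__ == "__main__":
    example = [
        {"name": "n02", "rack": "R1"},
        {"name": "n01", "rack": "R1"},
        {"name": "n03"},
    ]
    print(group_nodes(example))
```

Passing a different `key` (for example a hypothetical `owner` or `project` attribute) gives the alternative, non-physical grouping described above.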
Note: We plan to add a Netbox plugin in the future to automatically pull rack and node details from the Netbox API, further streamlining system management.
User Job Details
The user job details feature enables end users to search for specific job IDs and retrieve comprehensive information about those jobs.
For completed jobs, the system provides valuable performance metrics, including resource utilization efficiency. This information helps users optimize their workflows and make more effective use of cluster resources.
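Resource utilization efficiency is commonly expressed as used resources divided by allocated resources. The sketch below shows one such calculation; the formulas and function names are assumptions for illustration, and the dashboard may compute its metrics differently.

```python
def cpu_efficiency(total_cpu_seconds, elapsed_seconds, allocated_cpus):
    """Percentage of allocated core time that the job actually used."""
    core_seconds = elapsed_seconds * allocated_cpus
    if core_seconds <= 0:
        return 0.0
    return 100.0 * total_cpu_seconds / core_seconds


def memory_efficiency(max_rss_bytes, requested_bytes):
    """Percentage of requested memory that the job used at its peak."""
    if requested_bytes <= 0:
        return 0.0
    return 100.0 * max_rss_bytes / requested_bytes


if __name__ == "__main__":
    # A 1-hour job on 4 cores that consumed 2 core-hours of CPU time.
    print(cpu_efficiency(7200, 3600, 4))
    # A job that requested 8 GiB and peaked at 2 GiB.
    print(memory_efficiency(2 * 1024**3, 8 * 1024**3))
```

A low CPU percentage suggests the job requested more cores than it could keep busy, and a low memory percentage suggests the memory request could be reduced.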
Note: We are working to add GPU efficiency metrics to the job details panel in the next release, enhancing visibility into accelerator utilization.
Historical Cluster Status
The historical cluster status page keeps periodic snapshots of your system, captured at a configurable interval. This allows administrators and users to review the status of the system at previous points in time.
By default, these snapshots are captured at one-hour intervals. This provides a useful timeline of system performance and utilization patterns that can inform future resource planning.
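Reviewing the system at a past moment amounts to finding the most recent snapshot taken at or before that moment. The sketch below assumes snapshots are identified by a sorted list of timestamps; the `snapshot_at` helper and this storage layout are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def snapshot_at(timestamps, when):
    """Return the latest snapshot time at or before `when`, or None.

    `timestamps` must be sorted in ascending order.
    """
    index = bisect_right(timestamps, when)
    return timestamps[index - 1] if index else None


if __name__ == "__main__":
    # Three hourly snapshots, matching the default interval.
    hourly = [datetime(2024, 1, 1, hour) for hour in range(3)]
    print(snapshot_at(hourly, datetime(2024, 1, 1, 1, 30)))
```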
Partition, Feature, and QoS Information
The dashboard lets users filter and sort nodes by various attributes, including:
- Node types
- Features
- States
- Partitions
This functionality helps users identify the most appropriate resources for their specific computational needs, improving overall system utilization and job performance.
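The filtering and sorting described above can be sketched as follows. The node fields (`partitions`, `features`, `state`) and the `filter_nodes` helper are assumptions chosen for this example rather than the dashboard's real data model.

```python
def filter_nodes(nodes, partition=None, feature=None, state=None):
    """Return nodes matching every filter that is set, sorted by name."""
    matches = []
    for node in nodes:
        if partition is not None and partition not in node["partitions"]:
            continue
        if feature is not None and feature not in node["features"]:
            continue
        if state is not None and node["state"].upper() != state.upper():
            continue
        matches.append(node)
    return sorted(matches, key=lambda node: node["name"])


if __name__ == "__main__":
    example = [
        {"name": "c002", "partitions": ["general"], "features": ["intel"], "state": "IDLE"},
        {"name": "g001", "partitions": ["gpu", "general"], "features": ["a100"], "state": "MIXED"},
        {"name": "c001", "partitions": ["general"], "features": ["intel"], "state": "ALLOCATED"},
    ]
    for node in filter_nodes(example, partition="general", feature="intel"):
        print(node["name"], node["state"])
```

Leaving a filter as `None` skips it, so the same function covers a single-attribute search as well as a combined one.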
Modules
The dashboard offers an easy way for users to browse the available software modules on the system. Users can quickly find which applications and libraries are installed and ready for use.
We are currently developing additional functionality to show usage statistics, such as the number of module loads over a specified period. This will allow users to browse modules by popularity, helping them discover commonly used tools in their research domain.