High-Availability HPC Cluster Design for Mission-Critical Public Infrastructure: Lessons from Energy Grid and Government Deployments

Rakesh Challa

Abstract

This paper addresses the design of fault-tolerant, high-availability HPC clusters for mission-critical public-sector systems, such as those run by energy grid operators and government data centers. It focuses on four architectural measures that support zero-downtime operation: redundant pathing, multi-layer failover, firmware homogeneity, and risk-managed cutovers. System performance was evaluated quantitatively in a simulation-based experiment using Monte Carlo fault injection. With multi-layer failover and firmware homogeneity, availability was 99.97%, MTTR was 38 seconds, and the recovery success rate was 99.9%; controlled cutovers raised availability further, to 99.98%. Workload completion remained above 99.5% even under several faults. These findings indicate that properly designed HPC systems can remain maintainable, handle large volumes of data, and stay cyber-resilient while serving millions of people. The paper provides a systematic method for designing and assessing high-availability HPC systems that can be applied to activities of national interest.
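To make the Monte Carlo fault-injection idea concrete, the following is a minimal sketch and not the paper's simulator. It models a single repairable system rather than the multi-layer architecture evaluated in the paper, and it assumes exponentially distributed times to failure and to recovery. The function name `simulate_availability`, the 30-day horizon, and the mean time between failures of 126,000 seconds are hypothetical values chosen for illustration; only the 38-second MTTR comes from the abstract.

```python
import random


def simulate_availability(mtbf_s, mttr_s, horizon_s, runs, seed=0):
    """Estimate availability by injecting random faults over a time horizon.

    Assumption (not from the paper): failures and recoveries are
    exponentially distributed, and the system is a single repairable unit.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    total_down = 0.0
    for _ in range(runs):
        t = 0.0
        down = 0.0
        while True:
            # Draw the time until the next injected fault.
            t += rng.expovariate(1.0 / mtbf_s)
            if t >= horizon_s:
                break
            # Draw the recovery time, truncated at the end of the horizon.
            repair = min(rng.expovariate(1.0 / mttr_s), horizon_s - t)
            down += repair
            t += repair
        total_down += down
    return 1.0 - total_down / (runs * horizon_s)


if __name__ == "__main__":
    mtbf_s = 126_000.0  # hypothetical mean time between failures, in seconds
    mttr_s = 38.0       # MTTR reported in the abstract, in seconds
    estimate = simulate_availability(mtbf_s, mttr_s, horizon_s=30 * 86_400, runs=200)
    steady_state = mtbf_s / (mtbf_s + mttr_s)  # analytic availability for comparison
    print(f"simulated: {estimate:.5%}  analytic: {steady_state:.5%}")
```

Under these assumptions the simulated estimate can be checked against the steady-state formula MTBF / (MTBF + MTTR). A fuller model would add correlated faults, layered failover paths, and failed recoveries.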
