The Fabriscale Monitoring System
The Fabriscale Monitoring System (FMS) is cluster interconnect monitoring software that provides visual insight into the status of your InfiniBand cluster. It gives you a quick overview of performance, helps you visualize your topology, and lets you drill down into statistics, alerts and key metrics. With the Fabriscale Monitoring System, monitoring of the cluster is automated and the system raises alarms only when the operator’s attention is required. When an alarm is raised, the operator is quickly pointed to where the problem has occurred, supported by relevant metrics, statistics and analytics. This saves the operator time and leads to faster recovery from error situations, less strain on key operator resources and, ultimately, reduced downtime for your cluster.
Figure 1: The Fabriscale Monitoring System dashboard gives the operator a quick overview of the state and performance of the cluster, and is an entry point to dive into alerts and statistics when required.
Workload manager integration
Fabriscale Monitoring System interfaces seamlessly with the MOAB HPC suite from Adaptive Computing. MOAB intelligently places workloads and adapts resources to optimize application performance, increase system utilization and achieve organizational objectives. Fabriscale Monitoring System leverages job scheduling information from MOAB to present cluster performance information as a function of workload. Potential network bottlenecks can be identified per job, and utilization for a specific job can be broken down per port. Fabriscale Monitoring System also interfaces with the SLURM open source workload manager.
Figure 2: In the port view you can inspect the details for a port, including the jobs using this port.
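To make the idea of presenting port utilization as a function of workload concrete, the following is a minimal Python sketch. It is not the Fabriscale Monitoring System's implementation or API. The input shapes (`job_nodes`, `port_samples`), the function names and the 0.9 threshold are all assumptions chosen for illustration, and the sketch assumes each node is allocated to a single job at a time.

```python
from collections import defaultdict


def utilization_by_job(job_nodes, port_samples, link_capacity_gbps):
    """Return {job_id: {(node, port): utilization fraction}}.

    job_nodes:    {job_id: set of node names}, a hypothetical scheduler export.
    port_samples: list of (node, port, gbps) tuples, hypothetical port counters.
    Assumes nodes are not shared between jobs; on a shared node the full
    port traffic would be attributed to every job running there.
    """
    node_to_jobs = defaultdict(set)
    for job_id, nodes in job_nodes.items():
        for node in nodes:
            node_to_jobs[node].add(job_id)

    result = {job_id: {} for job_id in job_nodes}
    for node, port, gbps in port_samples:
        for job_id in node_to_jobs.get(node, ()):
            result[job_id][(node, port)] = gbps / link_capacity_gbps
    return result


def bottlenecks(per_job, threshold=0.9):
    """List, per job, the ports whose utilization is at or above threshold."""
    return {
        job_id: sorted(p for p, u in ports.items() if u >= threshold)
        for job_id, ports in per_job.items()
    }


if __name__ == "__main__":
    jobs = {"job1": {"n1", "n2"}, "job2": {"n3"}}
    samples = [("n1", 1, 95.0), ("n2", 1, 20.0), ("n3", 1, 50.0)]
    per_job = utilization_by_job(jobs, samples, link_capacity_gbps=100.0)
    print(bottlenecks(per_job))
```

In this sketch the scheduler only contributes the job-to-node mapping; the join with port counters is what turns raw fabric statistics into a per-job view.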
The Fabriscale Fabric Manager
The Fabriscale Fabric Manager (FFM) is new fabric management software that ensures more efficient and reliable operation of data centres and high performance computing clusters based on InfiniBand technology. The FFM provides efficient routing and optimized fault-tolerance capabilities, which ensure improved performance, fast failover and graceful degradation in the case of faults.
Fast and dynamic fault-tolerance
Whenever a fault occurs in the network (e.g. a link failure), the Fabriscale Fabric Manager automatically detects the problem and quickly reconfigures the network to use a pre-computed backup path. Compared to existing solutions, this reduces the time it takes to handle network faults from several minutes to less than a second, which leads to reduced downtime and improved utilisation of the cluster.
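The general idea of pre-computed backup paths can be sketched in a few lines of Python. This is an illustrative sketch only, not the Fabriscale Fabric Manager's routing algorithm: it uses a plain breadth-first search on a hypothetical adjacency map, and the function names are assumptions. For each link on the primary path it computes an alternative path ahead of time, so that reacting to a failure becomes a dictionary lookup rather than a full recomputation.

```python
from collections import deque


def shortest_path(adj, src, dst, banned=None):
    """BFS shortest path from src to dst, skipping the undirected link `banned`."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj[node]:
            if banned is not None and frozenset((node, nxt)) == banned:
                continue
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # dst is unreachable


def precompute_backups(adj, src, dst):
    """Return the primary path and one backup path per link on that path."""
    primary = shortest_path(adj, src, dst)
    backups = {}
    if primary is None:
        return None, backups
    for a, b in zip(primary, primary[1:]):
        link = frozenset((a, b))
        backups[link] = shortest_path(adj, src, dst, banned=link)
    return primary, backups


def on_link_failure(primary, backups, failed_link):
    """Pick the path to use after a failure; no search is needed at this point."""
    return backups.get(frozenset(failed_link), primary)


if __name__ == "__main__":
    # Hypothetical topology: hosts A and B connected through switches S1 and S2.
    adj = {"A": ["S1", "S2"], "S1": ["A", "B"], "S2": ["A", "B"], "B": ["S1", "S2"]}
    primary, backups = precompute_backups(adj, "A", "B")
    print(primary, on_link_failure(primary, backups, ("S1", "B")))
```

A failure on a link that the primary path does not use leaves the primary path in place; a backup entry of `None` means no alternative exists for that link in the given topology.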
Graceful degradation in presence of faults
The Fabriscale Fabric Manager optimizes the balancing of the network paths and utilises InfiniBand virtual lanes to improve network performance. Should a fault occur, running applications are minimally affected because the fault is handled automatically and the Fabriscale Fabric Manager ensures that network traffic remains evenly distributed across the network.
Figure 3: The FFM provides efficient routing and optimized fault-tolerance capabilities, which ensure improved performance, fast failover and graceful degradation in the case of faults.