Fabriscale Hawk-eye

Fabriscale Hawk-eye is InfiniBand interconnect monitoring software that provides visual insight into the status of your InfiniBand cluster. Hawk-eye gives you a quick overview of performance, helps you visualize your topology, and lets you drill down into statistics, alerts and key metrics. With Hawk-eye, monitoring of your InfiniBand network is automated, and the system raises alarms (link failures, port error rates, congestion, etc.) only when the operator’s attention is required. The operator is then pointed to where the problem has occurred, supported by relevant metrics, statistics and analytics. This saves the operator time and leads to faster error recovery, less strain on key operator resources and, ultimately, reduced downtime for your cluster.

Figure 1: The Fabriscale Monitoring System dashboard gives the operator a quick overview of the state and performance of the cluster, and is an entry point to dive into alerts and statistics when required.
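To illustrate the idea of raising alarms only when attention is required, here is a minimal sketch of threshold-based alerting on port counters. It is not Hawk-eye's implementation: the threshold values, the `Alert` class and the `check_port` function are all hypothetical, and the counter names are used only as illustrative InfiniBand port counter fields.

```python
from dataclasses import dataclass

# Hypothetical per-interval thresholds (assumptions, not Hawk-eye defaults).
THRESHOLDS = {
    "LinkDownedCounter": 1,       # any link-down event is worth an alert
    "SymbolErrorCounter": 100,    # tolerate a small number of symbol errors
    "PortXmitWait": 1_000_000,    # sustained waiting hints at congestion
}


@dataclass
class Alert:
    port: str
    counter: str
    delta: int


def check_port(port, previous, current, thresholds=THRESHOLDS):
    """Compare two counter samples and return alerts for counters whose
    increase over the interval reaches the configured threshold."""
    alerts = []
    for counter, limit in thresholds.items():
        delta = current.get(counter, 0) - previous.get(counter, 0)
        if delta >= limit:
            alerts.append(Alert(port, counter, delta))
    return alerts


# Example: two samples of the same (made-up) port taken one interval apart.
before = {"LinkDownedCounter": 0, "SymbolErrorCounter": 10, "PortXmitWait": 500}
after = {"LinkDownedCounter": 1, "SymbolErrorCounter": 12, "PortXmitWait": 900}
for alert in check_port("switch-1/port-7", before, after):
    print(f"{alert.port}: {alert.counter} increased by {alert.delta}")
```

Alerting on the change between samples rather than on absolute values means that a port with old, already-handled errors does not keep demanding the operator's attention.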

Hawk-eye integrates seamlessly with Torque, Slurm and other workload managers, using job scheduling information to visualise jobs in the cluster, identify potential job-specific network bottlenecks and support simple job management.

Figure 2: In the port view you can inspect the details of a port, including the jobs using that port.

Fabriscale Wingman

Fabriscale Wingman for InfiniBand is a fabric manager that ensures more efficient and reliable operation of HPC clusters based on InfiniBand technology. Wingman provides optimized routing, fast fault-tolerance capabilities and dynamic partitioning. Wingman outperforms any equivalent software solution on the market today.

Figure 3: The Fabriscale Wingman software optimises routing and can increase the performance and reliability of your InfiniBand network.

Wingman's support for optimized routing for fat-tree topologies and deadlock-free routing for random topologies ensures high performance and reliability in the case of network faults or non-standard topology configurations. Wingman also supports routing over multiple virtual lanes and custom configuration of how traffic is distributed across those lanes.

Wingman's support for fast and dynamic fault tolerance reduces the time it takes to handle link failures by eliminating the need for a full reconfiguration of the routing tables when a network fault occurs. This is achieved by precomputing redundant paths in advance and automatically activating these paths when faults occur. This leads to less network downtime and a reduced chance of abnormal application termination.

Wingman's support for dynamic reconfiguration makes it possible to set up and tear down network partitions on the fly. This feature can be integrated with job scheduling or virtual machine provisioning to provide on-demand network isolation per job, user or virtual cluster.
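The idea of precomputing redundant paths and activating them when a fault occurs can be sketched in a few lines. This is only an illustration under simplifying assumptions, not Wingman's algorithm: it uses a breadth-first search on a tiny made-up two-spine topology, and every name in it (`shortest_path`, `precompute`, `active_path`, the node labels) is hypothetical.

```python
from collections import deque


def shortest_path(adj, src, dst, banned=frozenset()):
    """Breadth-first search from src to dst that skips links in `banned`."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in prev and frozenset((node, nxt)) not in banned:
                prev[nxt] = node
                queue.append(nxt)
    return None


def links_of(path):
    """Return the undirected links a path traverses."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def precompute(adj, src, dst):
    """Compute a primary path and a link-disjoint backup ahead of time."""
    primary = shortest_path(adj, src, dst)
    backup = None
    if primary is not None:
        backup = shortest_path(adj, src, dst, banned=links_of(primary))
    return primary, backup


def active_path(primary, backup, failed_links):
    """Pick the first precomputed path untouched by the failed links.
    No search runs here; that is the point of precomputing."""
    for path in (primary, backup):
        if path is not None and not (links_of(path) & failed_links):
            return path
    return None


# Made-up topology: two leaf switches connected through two spine switches.
adj = {
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
}
primary, backup = precompute(adj, "leaf1", "leaf2")
failed = {frozenset(("leaf1", "spine1"))}
print(active_path(primary, backup, failed))
```

When a link fails, `active_path` only checks the stored paths against the set of failed links, so switching to the backup requires no new path computation. A real fabric manager would additionally have to update switch forwarding tables and keep the resulting routes deadlock-free, which this sketch does not attempt.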