Join Fabriscale at HPC Asia 2019 in Guangzhou, China, January 14-16, to see our new Hawk-Eye and Wingman fabric management and monitoring tools for InfiniBand. Also join our talk, “Next generation Fabric Manager and Advanced Monitoring System”, given by Dr. Haakon Bryhni, Fabriscale Product Manager, during the Vendor Vision session on January 15, 13:00-17:10, in Function Room #2.
Abstract of the talk: “Next generation Fabric Manager and Advanced Monitoring System”
As HPC systems are built with more powerful compute nodes, GPU and FPGA accelerators, and higher-bandwidth network links, there is an increasing need for optimal management of network resources. We will show how many applications in InfiniBand-based HPC clusters can benefit greatly from improved network management. For communication-intensive workloads, Fabriscale has demonstrated up to 40% better performance on a standard HPC system with an InfiniBand network by using our “Wingman” software fabric manager instead of the traditional OpenSM. The presentation will explain how improved InfiniBand routing by smart software can reduce runtime, increase job throughput and increase system reliability. Examples of how high-end InfiniBand routing can improve the performance of typical HPC applications will be presented, using benchmarks from leading HPC systems in China and the USA.
Another challenge facing HPC owners and operators is understanding the interplay between jobs and how they affect the performance of the overall HPC system. We will present the network intelligence and analytics platform “Hawk-Eye” and show how HPC owners use it to simplify management and to obtain analytics for troubleshooting and performance improvement. Hawk-Eye collects and analyses performance statistics gathered from the entire HPC system, and the data is visualised directly in the HPC topology. The monitoring system is closely integrated with Slurm, Torque and other job management systems, and uses their job scheduling information to visualise jobs in the cluster, identify potential job-specific network bottlenecks and conduct job management. All performance data is stored in a scalable database, so the operator can compare job performance as a function of time, including information on how each job is using the network. Hawk-Eye automatically monitors the HPC system and raises alarms (link failures, port error rates, congestion notifications, etc.) only when the operator’s attention is required.
The presentation will demonstrate how improved fabric management software and advanced network analytics can improve performance and simplify operation of state-of-the-art HPC systems.