Targeting the Cray/HPE Slingshot interconnect
High-speed network interconnects are a key component of supercomputers. The challenge, from a software packaging viewpoint, is to provide an MPI stack able to extract the full performance of that specialized networking hardware. As packagers, our approach has been to provide MPI packages that get the best performance out of the underlying interconnect, be it Omni-Path, InfiniBand, or any other type of interconnect.
Any interconnect? Not exactly: the Slingshot interconnect by HPE (formerly Cray) was missing. Slingshot powers major supercomputers such as Frontier in the US, LUMI, the Tier-0 EuroHPC cluster hosted in Finland, and Adastra, a Tier-1 cluster here in France; lack of support was a problem. With HPE officially releasing libcxi and related software for Slingshot just a few days ago, we’re happy to report this gap has been closed.
Figures
Patches adding libcxi and related packages to Guix and extending libfabric and Open MPI with Slingshot support are pending review and will soon be merged. The dependency graph in all its glory (stripped at distance 2) looks like this:
This dependency graph is big enough that users and system administrators would rather rely on (and collaborate with) a distribution than build it all by themselves…
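Once those patches are merged, a similar graph can be regenerated locally with guix graph; a sketch, where the package name and the depth are merely illustrative:

# Render the package dependency graph, pruned at depth 2, as an SVG file.
$ guix graph --max-depth=2 intel-mpi-benchmarks | dot -Tsvg -o graph.svg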
Since the Adastra supercomputer does not run Guix at this time, we tested our software stack by packing it with guix pack, together with the Intel MPI benchmarks, and copying it over to the supercomputer:
$ guix pack -RR -S /etc=etc -S /bin=bin intel-mpi-benchmarks
…
/gnu/store/…-intel-mpi-benchmarks-tarball-pack.tar.gz
$ scp /gnu/store/…-intel-mpi-benchmarks-tarball-pack.tar.gz adastra:pack.tar.gz
On Adastra, we extract the archive and run it right from there:
$ tar xf pack.tar.gz
$ srun --time=00:19:00 -N2 --account=cad15174 --exclusive -C GENOA \
--mpi=pmi2 ./bin/IMB-MPI1 PingPong
srun: job 2536202 queued and waiting for resources
srun: job 2536202 has been allocated resources
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.7, MPI-1 part
#----------------------------------------------------------------
# Date : Mon Nov 18 14:07:02 2024
# Machine : x86_64
# System : Linux
[…]
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
[…]
65536 640 8.50 7710.13
131072 320 12.74 10290.51
262144 160 17.45 15021.07
524288 80 26.91 19485.63
1048576 40 48.53 21608.90
2097152 20 92.22 22741.06
4194304 10 179.23 23401.37
# All processes entering MPI_Finalize
That’s it! With the whole stack built on our machines, we get the expected peak bandwidth: the PingPong benchmark tops out at 23.4 GB/s for 4 MiB messages, close to the 25 GB/s nominal rate of the 200 Gb/s Slingshot link.
We can similarly build intel-mpi-benchmarks against mpich-ofi, the MPICH package in Guix that is linked against libfabric, using a package transformation to replace openmpi by mpich-ofi:
$ guix pack -RR -S /etc=etc -S /bin=bin \
--with-input=openmpi=mpich-ofi intel-mpi-benchmarks
…
/gnu/store/…-intel-mpi-benchmarks-tarball-pack.tar.gz
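Once copied over to Adastra (here as pack-mpich.tar.gz, an illustrative name), the new pack can be extracted and run the same way; a sketch, assuming the same srun options, including --mpi=pmi2, also suit this MPICH build:

$ mkdir mpich && tar xf pack-mpich.tar.gz -C mpich
$ srun --time=00:19:00 -N2 --account=cad15174 --exclusive -C GENOA \
    --mpi=pmi2 ./mpich/bin/IMB-MPI1 PingPong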
Performance is about the same, slightly worse on very large messages.
Bring your own MPI—fearlessly
It is worth stressing that the runs above do not use software pre-installed by HPE on the machine; everything down to the C library used by the benchmark above is defined as Guix packages and built in a reproducible fashion. Our default MPI stack is based on Open MPI while the one installed by HPE is a variant of MPICH (“Cray MPICH”). Intel’s MPI Benchmarks built on the machine against Cray MPICH and all the vendor-provided modules achieve the same performance as the Guix-provided Open MPI and MPICH stacks:
From an engineering viewpoint, this is not surprising: they use the same low-level network support libraries, libfabric and libcxi. But it does go against common wisdom in HPC suggesting that, somehow, vendor-provided MPI libraries perform better than everything else—as if they had some secret sauce. In a 2020 paper about MPI performance for container images, the author writes:
The primary disadvantage of a purely BYO-MPI [“Bring-Your-Own-MPI”] scenario is that you might not get the best performance from the system compared to the system provided MPI. This is often the result of system architecture and network tuning in the system provided MPI. If the system provided MPI is open source then it is likely that you could match its configuration. However, if the system provided MPI is proprietary then it is not likely that you will be able to do so.
The figures above are further evidence that indeed, with open source support libraries, there is no reason to believe in a vendor secret sauce.
For open science, for reliability
By providing a fully transparent, auditable software stack, we hope to encourage rational analysis of its behavior. If performance isn’t there, there must be an explanation in the way packages are built. Transparency and rational thought are pillars of open science; HPC software deployment should not be exempt from these principles. In fact, MPI and related high-speed communication layers are a research topic for some, such as the TADaaM team at Inria. The ability to use, study, and modify software like libcxi and libfabric is a precondition for their work.
Carrying the entire software stack to our supercomputer, thanks to guix pack, has another benefit that is often overlooked: reliability. Grounded in the belief that vendor-provided MPI implementations perform better, or because the software was proprietary (as was the case for libcxi), packagers have come up with fragile shenanigans to link against those secret-sauce binaries. Techniques used with container images are not very different: replacing the MPI implementation within the container image, or just its low-level support libraries.
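To make the point concrete, such injection typically looks like the following sketch with a container runtime: the host's vendor MPI is bind-mounted into the image at run time and forced onto the dynamic linker (the image name, paths, and mount points here are hypothetical and vary from site to site):

# Hypothetical injection of the host's vendor MPI stack into a container image.
$ apptainer exec \
    --bind /opt/cray:/opt/cray \
    --env LD_LIBRARY_PATH=/opt/cray/pe/lib64:$LD_LIBRARY_PATH \
    benchmark.sif ./IMB-MPI1 PingPong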
What makes the inclusion of these external binaries “fragile”? The paper mentioned above summarizes it:
For the replacement [of MPI libraries] to be possible, the MPI implementations in the container image and on the host must belong to the same family (e.g. OpenMPI or MPICH-based) and be ABI compatible.
Replacing a lower-level library, such as libfabric or a libfabric “provider”, might reduce the risk, but again, one can never be sure that what the image provides is compatible with what is being injected. Injection also leads to software artifacts that are opaque, non-portable, and that behave differently depending on the machine you run them on.
By bringing our own MPI, we know by construction which interconnects it supports. More importantly, it guarantees performance portability: as a user, one can deploy the same stack on different supercomputers with the same interconnect, such as Adastra, LUMI, and Frontier, or with other supported interconnects. There is no reason to choose between performance and reproducibility.
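One way to convince oneself is to ship libfabric's diagnostic tools in the pack and list the providers the library was built with; a sketch, assuming the libfabric package in Guix installs the fi_info utility in its bin/ directory:

$ guix pack -RR -S /etc=etc -S /bin=bin libfabric intel-mpi-benchmarks
…
# On the target machine, list the libfabric providers that were built in:
$ ./bin/fi_info -l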
Acknowledgments
Thanks to Alexandre Denis at Inria for feedback on MPI benchmarks. Huge thanks to the engineers at HPE who supported us and sped up the release process of libcxi, and to colleagues at CINES for their help.