Targeting the Cray/HPE Slingshot interconnect
High-speed network interconnects are a key component of supercomputers. The challenge, from a software packaging viewpoint, is to provide an MPI stack able to extract the full performance of that specialized networking hardware. As packagers, our approach has been to provide MPI packages that get the best performance out of the underlying interconnect, be it Omni-Path, InfiniBand, or any other type of interconnect.
Any interconnect? Not exactly: the Slingshot interconnect by HPE (formerly Cray) was missing. Slingshot powers major supercomputers such as Frontier in the US, LUMI, the Tier-0 EuroHPC cluster hosted in Finland, and Adastra, a Tier-1 cluster here in France; lack of support was a problem. With HPE officially releasing libcxi and related software for Slingshot just a few days ago, we’re happy to report this gap has been closed.
Figures
Patches adding libcxi and related packages to Guix and extending libfabric and Open MPI with Slingshot support are pending review and will soon be merged. The dependency graph in all its glory (stripped at distance 2) looks like this:
This dependency graph is big enough that users and system administrators would rather rely on (and collaborate with) a distribution than build it all by themselves…
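Once those patches are merged, a similar graph can be regenerated locally with guix graph; a sketch, where the package name and the depth are merely illustrative:

# Render the package dependency graph, pruned at depth 2, as an SVG file.
$ guix graph --max-depth=2 intel-mpi-benchmarks | dot -Tsvg -o graph.svg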
Since the Adastra supercomputer does not run Guix at this time, we tested our software stack by packing it with guix pack, together with the Intel MPI benchmarks, and copying it over to the supercomputer:
$ guix pack -RR -S /etc=etc -S /bin=bin intel-mpi-benchmarks
…
/gnu/store/…-intel-mpi-benchmarks-tarball-pack.tar.gz
$ scp /gnu/store/…-intel-mpi-benchmarks-tarball-pack.tar.gz adastra:pack.tar.gz
On Adastra, we extract the archive and run it right from there:
$ tar xf pack.tar.gz
$ srun --time=00:19:00 -N2 --account=cad15174 --exclusive -C GENOA \
--mpi=pmi2 ./bin/IMB-MPI1 PingPong
srun: job 2536202 queued and waiting for resources
srun: job 2536202 has been allocated resources
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.7, MPI-1 part
#----------------------------------------------------------------
# Date : Mon Nov 18 14:07:02 2024
# Machine : x86_64
# System : Linux
[…]
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
[…]
65536 640 8.50 7710.13
131072 320 12.74 10290.51
262144 160 17.45 15021.07
524288 80 26.91 19485.63
1048576 40 48.53 21608.90
2097152 20 92.22 22741.06
4194304 10 179.23 23401.37
# All processes entering MPI_Finalize
That’s it! With the whole stack built on our machines, we get the expected peak bandwidth: the PingPong benchmark tops out at 23.4 GB/s for 4 MiB messages, close to the 25 GB/s nominal rate of the 200 Gb/s Slingshot link.
We can similarly build intel-mpi-benchmarks against mpich-ofi, the MPICH package in Guix that is linked against libfabric, using a package transformation to replace openmpi by mpich-ofi:
$ guix pack -RR -S /etc=etc -S /bin=bin \
--with-input=openmpi=mpich-ofi intel-mpi-benchmarks
…
/gnu/store/…-intel-mpi-benchmarks-tarball-pack.tar.gz
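Once copied over to Adastra (here as pack-mpich.tar.gz, an illustrative name), the new pack can be extracted and run the same way; a sketch, assuming the same srun options, including --mpi=pmi2, also suit this MPICH build:

$ mkdir mpich && tar xf pack-mpich.tar.gz -C mpich
$ srun --time=00:19:00 -N2 --account=cad15174 --exclusive -C GENOA \
    --mpi=pmi2 ./mpich/bin/IMB-MPI1 PingPong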
Performance is about the same, slightly worse on very large messages.
Bring your own MPI—fearlessly
It is worth stressing that the runs above do not use software pre-installed by HPE on the machine; everything down to the C library used by the benchmark above is defined as Guix packages and built in a reproducible fashion. Our default MPI stack is based on Open MPI while the one installed by HPE is a variant of MPICH (“Cray MPICH”). Intel’s MPI Benchmarks built on the machine against Cray MPICH and all the vendor-provided modules achieve the same performance as the Guix-provided Open MPI and MPICH stacks:
From an engineering viewpoint, this is not surprising: they use the same low-level network support libraries, libfabric and libcxi. But it does go against common wisdom in HPC suggesting that, somehow, vendor-provided MPI libraries perform better than everything else—as if they had some secret sauce. In a 2020 paper about MPI performance for container images, the author writes:
The primary disadvantage of a purely BYO-MPI [“Bring-Your-Own-MPI”] scenario is that you might not get the best performance from the system compared to the system provided MPI. This is often the result of system architecture and network tuning in the system provided MPI. If the system provided MPI is open source then it is likely that you could match its configuration. However, if the system provided MPI is proprietary then it is not likely that you will be able to do so.
The figures above are further evidence that indeed, with open source support libraries, there is no reason to believe in a vendor secret sauce.
For open science, for reliability
By providing a fully transparent, auditable software stack, we hope to encourage rational analysis of its behavior. If performance isn’t there, there must be an explanation in the way packages are built. Transparency and rational thought are pillars of open science; HPC software deployment should not be exempt from these principles. In fact, MPI and related high-speed communication layers are a research topic for some, such as the TADaaM team at Inria. The ability to use, study, and modify software like libcxi and libfabric is a precondition for their work.
Carrying the entire software stack to our supercomputer, thanks to guix pack, has another benefit that is often overlooked: reliability. Grounded in the belief that vendor-provided MPI implementations perform better, or because the software was proprietary (as was the case for libcxi), packagers have come up with fragile shenanigans to link against those secret-sauce binaries. Techniques used with container images are not very different: replacing the MPI implementation within the container image, or just its low-level support libraries.
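To make the point concrete, such injection typically looks like the following sketch with a container runtime: the host's vendor MPI is bind-mounted into the image at run time and forced onto the dynamic linker (the image name, paths, and mount points here are hypothetical and vary from site to site):

# Hypothetical injection of the host's vendor MPI stack into a container image.
$ apptainer exec \
    --bind /opt/cray:/opt/cray \
    --env LD_LIBRARY_PATH=/opt/cray/pe/lib64:$LD_LIBRARY_PATH \
    benchmark.sif ./IMB-MPI1 PingPong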
What makes the inclusion of these external binaries “fragile”? The paper mentioned above summarizes it:
For the replacement [of MPI libraries] to be possible, the MPI implementations in the container image and on the host must belong to the same family (e.g. OpenMPI or MPICH-based) and be ABI compatible.
Replacing a lower-level library, such as libfabric or a libfabric “provider”, might reduce the risk, but again, one can never be sure that what the image provides is compatible with what is being injected. Injection also leads to software artifacts that are opaque, non-portable, and that behave differently depending on the machine you run them on.
By bringing our own MPI, we know by construction which interconnects it supports. More importantly, it guarantees performance portability: as a user, one can deploy the same stack on different supercomputers with the same interconnect, such as Adastra, LUMI, and Frontier, or with other supported interconnects. There is no reason to choose between performance and reproducibility.
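One way to convince oneself is to ship libfabric's diagnostic tools in the pack and list the providers the library was built with; a sketch, assuming the libfabric package in Guix installs the fi_info utility in its bin/ directory:

$ guix pack -RR -S /etc=etc -S /bin=bin libfabric intel-mpi-benchmarks
…
# On the target machine, list the libfabric providers that were built in:
$ ./bin/fi_info -l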
Acknowledgments
Thanks to Alexandre Denis at Inria for feedback on MPI benchmarks. Huge thanks to the engineers at HPE who supported us and sped up the release process of libcxi, and to colleagues at CINES for their help.