What’s in a package
There is no shortage of package managers. Each tool makes its own set of tradeoffs regarding speed, ease of use, customizability, and reproducibility. Guix occupies a sweet spot, providing reproducibility by design as pioneered by Nix, package customization à la Spack from the command line, the ability to create container images without hassle, and more.
Beyond the “feature matrix” of the tools themselves, a topic that is often overlooked is packages—or rather, what’s inside of them. Chances are that a given package may be installed using any of the many tools at your disposal. But are you really getting the same thing regardless of the tool you are using? The answer is “no”, contrary to what one might think. The author realized this very acutely while fearlessly attempting to package the PyTorch machine learning framework for Guix.
This post is about the journey of packaging PyTorch the Guix way: the rationale, a glimpse at what other PyTorch packages out there look like, and the conclusions we can draw for high-performance computing and scientific workflows.
Getting PyTorch in Guix
One can install PyTorch in literally seconds with pip:
$ time pip install torch
Collecting torch
Downloading https://files.pythonhosted.org/packages/69/f2/2c0114a3ba44445de3e6a45c4a2bf33c7f6711774adece8627746380780c/torch-1.9.0-cp38-cp38-manylinux1_x86_64.whl (831.4MB)
|████████████████████████████████| 831.4MB 91kB/s
Collecting typing-extensions (from torch)
Downloading https://files.pythonhosted.org/packages/74/60/18783336cc7fcdd95dae91d73477830aa53f5d3181ae4fe20491d7fc3199/typing_extensions-3.10.0.2-py3-none-any.whl
Installing collected packages: typing-extensions, torch
real 0m24.502s
user 0m19.711s
sys 0m3.811s
Since it’s on PyPI, the Python Package Index, one might think it’s a simple Python package that can be imported in Guix the easy way. That’s unfortunately not the case:
$ guix import pypi torch
guix import: error: no source release for pypi package torch 1.9.0
The reason guix import bails out is that the only thing PyPI provides is a binary-only “wheel” package: the .whl file downloaded above contains pre-built binaries only, not source.
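For a pure Python package that does publish a source release, the importer typically works in one go. For instance, one could try it on a package such as requests (used here purely as an illustration; any pure Python package with a source tarball on PyPI would do):
$ guix import pypi requests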
In Guix we insist on building software from source: it’s a matter of transparency, auditability, and provenance tracking. We want to make sure our users can see the source code that corresponds to the code they run; we want to make sure they can build it locally, should they choose not to trust the project’s pre-built binaries; or, when they do use pre-built binaries, we want to make sure they can verify that those binaries correspond to the source code they claim to match.
Transparency, provenance tracking, verifiability: it’s about extending the scientific method to the whole computational experiment, including software that powers it.
Bundling
The first surprise when starting to package PyTorch is that, despite being on PyPI, PyTorch is first and foremost a large C++ code base. It does have a setup.py, as commonly found in pure Python packages, but that file delegates the bulk of the work to CMake.
The second surprise is that PyTorch bundles (or “vendors”, as some would say) source code for no less than 41 dependencies, ranging from small Python and C++ helper libraries to large C++ neural network tools. Like other distributions such as Debian, Guix avoids bundling: we would rather have one Guix package for each of these dependencies. The rationale is manifold, but it boils down to keeping things auditable, reducing resource usage, and making security updates practical.
Long story short: “unbundling” is often tedious, all the more so in this case. We ended up packaging about ten dependencies that were not already available or were otherwise outdated or incomplete, including big C++ libraries like the XNNPACK and onnx neural network helper libraries. Each of these typically bundles code for yet another bunch of dependencies. Often, the CMake-based build system of these packages would need patching so we could use our own copies of the dependencies. Curious readers can take a look at the commits leading to XNNPACK and those leading to onnx. Another interesting thing is the use of derivatives: PyTorch depends on both QNNPACK and XNNPACK, even though the latter is a derivative of the former, and of course, it bundles both.
Icing on the cake: most of these machine learning software packages do not have proper releases—no Git tag, nothing—so we were left to pick the commit du jour or the one explicitly referred to by Git submodules.
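To get a feel for the scale of the bundling, one can list the Git submodules that PyTorch records under its third_party/ directory. This is only a rough sketch; the exact set of submodules varies from one release to the next:
$ git clone https://github.com/pytorch/pytorch
$ git -C pytorch checkout v1.9.0
$ git -C pytorch submodule status | wc -l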
Most PyTorch dependencies were unbundled. The end result is a PyTorch package in its full glory, actually built from source. Phew! Its dependency graph is sizeable even when showing only dependencies at distance 2 or less.
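One can draw that graph locally with guix graph; a sketch, assuming a Guix recent enough to support the --max-depth option and with Graphviz providing dot:
$ guix graph python-pytorch --max-depth=2 | dot -Tsvg > pytorch-graph.svg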
With this many dependencies bundled, these projects resemble the JavaScript dystopia Christine Lemmer-Webber described. Anyway, PyTorch is now also installable with Guix in seconds when enabling pre-built binaries:
$ time guix install python-pytorch
The following package will be installed:
python-pytorch 1.9.0
52.3 MB will be downloaded
python-pytorch-1.9.0 49.9MiB 6.2MiB/s 00:08 [##################] 100.0%
The following derivation will be built:
/gnu/store/yvygv6nlichbzyynvg4w04xa7xarx3rp-profile.drv
applying 16 grafts for /gnu/store/6qgcb3a7x1wg4havsryjh6zsy3za7h3b-python-pytorch-1.9.0.drv ...
building profile with 2 packages...
real 0m20.697s
user 0m3.604s
sys 0m0.118s
This time though, one can view the self-contained package definition by running guix edit python-pytorch and, say, rebuild it locally to verify the source/binary correspondence:
guix build python-pytorch --no-grafts --check
… or at least it will be possible once NNPACK’s build system generates code in a deterministic order.
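Relatedly, guix challenge compares binaries served by substitute servers against locally-built store items, which gives another handle on the source/binary correspondence; a sketch, assuming the package has already been built locally:
$ guix challenge python-pytorch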
pip & CONDA
Having done all this work, the author entered a soul-searching phase: sure, the rationale is well documented, but is it worth it? It looks as though everyone (everyone?) is installing PyTorch using pip anyway and considering it good enough. Also, why was it so much work to package PyTorch for Guix? Could it be that we’re missing packaging tricks that make it so easy for others to provide PyTorch & co.?
To answer these questions, let’s first take a look at what pip provides. The pip install command above completed in under thirty seconds, and most of that time went into downloading an 831 MB archive—no less. What’s in there? Those .whl files are actually zip archives, which one can easily inspect:
$ wget -qO /tmp/pytorch.zip https://files.pythonhosted.org/packages/69/f2/2c0114a3ba44445de3e6a45c4a2bf33c7f6711774adece8627746380780c/torch-1.9.0-cp38-cp38-manylinux1_x86_64.whl
$ unzip -l /tmp/pytorch.zip | grep '\.so'
29832 06-12-2021 00:37 torch/_dl.cpython-38-x86_64-linux-gnu.so
29296 06-12-2021 00:37 torch/_C.cpython-38-x86_64-linux-gnu.so
372539384 06-12-2021 00:37 torch/lib/libtorch_cpu.so
43520 06-12-2021 00:37 torch/lib/libnvToolsExt-3965bdd0.so.1
28964064 06-12-2021 00:37 torch/lib/libtorch_python.so
46351784 06-12-2021 00:37 torch/lib/libcaffe2_detectron_ops_gpu.so
1159370040 06-12-2021 00:37 torch/lib/libtorch_cuda.so
4862944 06-12-2021 00:37 torch/lib/libnvrtc-builtins.so
168720 06-12-2021 00:37 torch/lib/libgomp-a34b3233.so.1
116240 06-12-2021 00:37 torch/lib/libtorch.so
523816 06-12-2021 00:37 torch/lib/libcudart-80664282.so.10.2
222224 06-12-2021 00:37 torch/lib/libc10_cuda.so
36360 06-12-2021 00:37 torch/lib/libshm.so
47944 06-12-2021 00:37 torch/lib/libcaffe2_module_test_dynamic.so
22045456 06-12-2021 00:37 torch/lib/libnvrtc-08c4863f.so.10.2
12616 06-12-2021 00:37 torch/lib/libtorch_global_deps.so
21352 06-12-2021 00:37 torch/lib/libcaffe2_nvrtc.so
842376 06-12-2021 00:37 torch/lib/libc10.so
552808 06-12-2021 00:37 torch/lib/libcaffe2_observers.so
46651272 06-12-2021 00:37 caffe2/python/caffe2_pybind11_state.cpython-38-x86_64-linux-gnu.so
47391432 06-12-2021 00:37 caffe2/python/caffe2_pybind11_state_gpu.cpython-38-x86_64-linux-gnu.so
$ unzip -l /tmp/pytorch.zip | grep '\.so' | wc -l
21
Twenty-one pre-compiled shared libraries in there! Most are part of PyTorch, but some are external dependencies. First there’s libgomp, GCC’s OpenMP and OpenACC run-time support library; we can guess it’s shipped to avoid incompatibilities with the user-installed libgomp, but it could also be a fork of the official libgomp—hard to tell. Then there’s libcudart and libnvToolsExt, both of which are proprietary NVIDIA GPU support libraries—a bit of a surprise, and a bad one, as nothing indicated that pip fetched proprietary software alongside PyTorch. What’s also interesting is the dependencies that are not there, such as onnx and XNNPACK; we can only guess that they’re statically linked within libtorch_cpu.so.
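One rough way to test that guess is to extract libtorch_cpu.so from the wheel and look for telltale strings; this is only a heuristic, not proof of static linking:
$ unzip -o /tmp/pytorch.zip 'torch/lib/libtorch_cpu.so' -d /tmp/pt
$ strings /tmp/pt/torch/lib/libtorch_cpu.so | grep -ci xnnpack
$ strings /tmp/pt/torch/lib/libtorch_cpu.so | grep -ci onnx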
Will these binaries work? On my system, they won’t work without tweaks, such as setting LD_LIBRARY_PATH so that these libraries can find those they depend on. Using ldd shows the “system libraries” that are assumed to be available; this includes GNU libstdc++ and GCC’s run-time support library:
$ ldd torch/lib/libtorch_cpu.so
linux-vdso.so.1 (0x00007ffca6d31000)
libgomp-a34b3233.so.1 => /tmp/pt/torch/lib/libgomp-a34b3233.so.1 (0x00007ff435723000)
…
libstdc++.so.6 => not found
libgcc_s.so.1 => not found
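The tweak in question boils down to pointing the loader at a directory that contains a compatible libstdc++ and libgcc_s; a sketch, with a deliberately hypothetical path:
$ export LD_LIBRARY_PATH=/path/to/compatible/gcc-libs
$ ldd /tmp/pt/torch/lib/libtorch_cpu.so | grep 'not found'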
Not providing those libraries, or providing a variant that is not binary-compatible with what libtorch_cpu.so expects, is the end of the game. Fortunately these two libraries rarely change, so the assumption made here is that “most” users will have them. It’s interesting that the authors deemed it necessary to ship libgomp.so and not libstdc++.so—maybe a mixture of insider knowledge and a dice roll.
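One can also see which toolchain versions these binaries assume by listing the versioned symbols they reference; a sketch using standard binutils, with the wheel extracted to /tmp/pt as before (a system whose libstdc++ predates the reported GLIBCXX version is unlikely to load the library):
$ objdump -T /tmp/pt/torch/lib/libtorch_cpu.so | grep -o 'GLIBCXX_[0-9.]*' | sort -Vu | tail -n 1
$ objdump -T /tmp/pt/torch/lib/libtorch_cpu.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -n 1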
How were these binaries built in the first place? Essentially, by running python setup.py bdist_wheel “on some system” which, as we saw, invokes cmake to build PyTorch and all its bundled dependencies. But the PyTorch project does a little bit more than this to build and publish binaries for pip and CONDA. The entry point for both is binary_linux_build.sh, which in turn delegates to scripts living in another repo, build_pytorch.sh for CONDA or one of the wheels scripts; it’s one of these scripts that’s in charge of embedding libgomp.so, libcudart.so, and other libraries present on the system.
And where do these libraries come from? They come from the GNU/Linux distribution beneath it which, going back to the initial repository, may typically be some version of Ubuntu or CentOS running on the machines of CircleCI or Microsoft Azure.
At the end of the process is a bunch of wheel or CONDA archives ready to be uploaded as-is to Anaconda or to PyPI.
Looking at these scripts gives useful hints. But going back to the code pip and CONDA users are actually running: is libgomp-a34b3233.so.1 the official libgomp, or is it a modified version? Is libtorch_cpu.so really obtained by building the source at the 1.9.0 Git tag?
Let’s make it clear: verifying the source/binary correspondence for all the bits in the pip and CONDA packages is practically infeasible. Merely rebuilding them locally is hard. Reasoning about the build process is hard because of all the layers involved and because of the ball of spaghetti that these scripts are. Such a setup rightfully raises red flags for any security-minded person—we’ll get to that below—or freedom-conscious user: it’s also about user freedom. Is PyPI conveying the Corresponding Source of libgomp, as per Section 6 of its license? Probably not. PyTorch’s own license doesn’t have this requirement, but there’s certainly a tacit agreement that pip install torch provides the genuine PyTorch, and it’s unpleasant at best that this claim is unverifiable in practice. This should be a red flag for anyone doing reproducible science—in other words, science.
Source-based distros
CONDA and pip (at least the “wheels” part of it) are essentially “binary distros”: they focus on distributing pre-built binaries without concern for how they were built, nor for whether they can actually be built from source. Without a conscious effort to require reproducible builds, so that anyone can independently verify binaries, these tools are doomed to be not only unsafe but also opaque—and there are to date no signs of CONDA and PyPI/pip moving in that direction.
Update (2021-09-21): Bovy on Twitter mentions conda-forge as a possible answer. Public build recipes (here’s that of PyTorch) and automated builds improve transparency compared to binaries uploaded straight from developer machines, but build reproducibility remains to be addressed.
Like Guix, Spack and Nix are source-based: their primary job is to build software from source, and the use of pre-built binaries is “an optimization”. The Spack package and the Nixpkgs package are all about building it all from source. The Spack package avoids using some of the bundled dependencies, though it does rely on bundled copies of large ones such as XNNPACK and onnx; the Nixpkgs package makes no such effort and builds it all as-is.
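For comparison, the source-based route with these tools looks roughly like this; package and attribute names are those in use at the time of writing and may since have changed:
$ spack install py-torch
$ nix-build '<nixpkgs>' -A python3Packages.pytorch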
Unlike Nix or Guix, Spack assumes core packages—for some definition of “core”, but that includes at least a C/C++ compiler, a C library, and a Python interpreter—are already available. Thus, by definition, the Spack package is not self-contained and may fail to build, plain and simple, if some of the implicit assumptions are not met. When fetching pre-built binaries from a “binary cache”, the problems are similar to those of CONDA and pip: binaries might not work if assumptions about system libraries are not met (though Spack mitigates this risk by tying binaries to the underlying GNU/Linux distro), and it may be hard to verify them through rebuilding, again because these implicit assumptions have an impact on the bits in the resulting binaries.
On convenience, security, and reproducible science
The convenience and ease of use of pip and CONDA have undeniable appeal. That one can, in a matter of minutes, install the tool and use it to deploy a complex software stack like that of PyTorch has certainly contributed to their success. Our view though, as Guix packagers, is that we should take a step back and open the package—look at what’s inside and the impact it has.
What we see when we look inside PyPI wheels and CONDA packages is opaque binaries built on a developer’s machine and later uploaded to the central repository. They are opaque because, lacking reproducible build methodology and tooling, one cannot independently verify that they correspond to the presumed source code. They may also be deceptive: you get not just PyTorch but also the binary of a proprietary piece of software.
In their ESEC/FSE 2021 paper on LastPyMile, Duc-Ly Vu et al. empirically show that “the last mile from source to package” on PyPI is indeed the weakest link in the software supply chain, and that actual differences between packaged source code and upstream source code are observed in the wild. And this is only source code—for binaries as found in the torch wheel, there is just no practical way to verify that they genuinely correspond to that source code.
Machine-learning software is fast-moving. The desire to move fast already shows in upstream development practices: lack of releases for important dependencies, careless dependency bundling. Coupled with users’ legitimate demand for “easy installation”, this turned PyPI, in the footsteps of CONDA, into a huge software supply chain vulnerability waiting to be exploited. It takes us back several years, to a time when Debian hadn’t yet put an end to its “dirtiest secret”—that Debian packages were non-reproducible, built on developer machines, and uploaded to its servers. Reproducible builds should be the norm; building from source, too, should be the norm.
It is surprising that such a blatant weakness goes unnoticed, especially on high-performance computing clusters that are usually subject to strict security policies. It is even more surprising at a time when awareness of software supply chain security is growing, and when the US Government’s Executive Order on cybersecurity, for example, explicitly calls for work on subjects as concrete as “using administratively separate build environments” and “employing automated tools (…) to maintain trusted source code supply chains”.
Beyond security, what are the implications for scientific workflows? Can we build reproducible computational workflows using software that is itself non-reproducible, non-verifiable? The answer is “yes”, one can do that. However, just like one wouldn’t build a house on a quagmire, building scientific workflows on shaky foundations is inadvisable. Far from being an abstract principle, it has concrete implications: scientists and their peers need to be able to reproduce the software environment, all of it; they need the ability to customize it and experiment with it, as opposed to merely running code from an “inert” binary.
It is time to stop running opaque binaries and to value transparency and verifiability for our foundational software, as much as we value transparency and verifiability for scientific work.
Acknowledgments
The author thanks Ricardo Wurmus and Simon Tournier for insightful feedback and suggestions on an earlier draft of this post.