On Building 30K Debian Packages
As part of my ongoing attempts to create some nice datasets for training large code models for C/C++, I've recently been attempting to build every package in Debian Unstable from source using bear to log the compilation and generate a compile_commands.json database for each build. Since it's not possible, in general, to parse C/C++ code without knowing what flags were used (e.g., so you can find header files, know what preprocessor defines are in use, etc.), this will open up some nice possibilities like:
- Getting ASTs for each source file
- Rebuilding each file and generating its LLVM IR (-emit-llvm) or assembly (-S)
- Extracting comments associated with individual functions
- Isolation: Run the build for each package in some kind of isolated environment. You know how packages sometimes have install-time conflicts? It's 100x worse for build-time conflicts.
- Use an SSD: Make sure to build things somewhere with fast storage. A huge amount of compiling stuff is just reading it off disk and writing it back. Because my main Docker stores its images on spinning rust, I ran a separate Docker daemon for the SSD with a minimal config file. Then you can just set DOCKER_HOST=unix:///var/run/docker-nvme.sock and build/run your images.
- Log everything, especially exit codes. I got through a whole pass before realizing I didn't have a reliable way to tell which packages had built successfully (dpkg-buildpackage emits an exciting array of inconsistent messages), and had to re-run everything.
- Turn off stuff you don't want. I don't care about running tests or building documentation, so I set DEB_BUILD_OPTIONS="nodoc notest nocheck". Unfortunately, not every package respects the build options, but it's worth a try.
- Don't build as root. A number of packages detect if you're trying to build stuff as root and will die (coreutils is one example). This is an easy mistake to make in Docker, where running as root is the default. Run as a normal user, and use "dpkg-buildpackage -rfakeroot" so that it can pretend to be root for packages that do want to be built as root.
- Run non-interactively. There are a few packages that, when installed, try to ask the user some questions and will hang forever unless DEBIAN_FRONTEND=noninteractive is set. So set it, and make sure it gets passed on to child processes (a particularly annoying example is sudo, where you have to add -E to make it preserve the environment).
- Use timeouts. Particularly in an isolated environment like Docker, sometimes stuff will just hang during build (or maybe in some cases it's bear's fault, IDK). Some common culprits I've found so far are xvfb-run and erl_child_setup, and (maybe) things that expect dbus to be present. Aside from setting a timeout, I also ran a script in the background to find and kill any of those processes that were hanging around longer than a few minutes. [Actually, rather than killing them, which will make them exit with a non-zero status and cause the build to error out, I used this nice trick from Kyle Huey to attach to them with gdb and inject a call to exit(0)]
- Clean up. Since you're using a nice fast SSD, it's probably not enormous (mine is a measly 2TB). Builds are big. You may want to remember to move your build artifacts to somewhere roomier so that you don't run out of space (this tends to make build systems very unhappy).
- Stay up to date. Initially I just parsed Sources.gz, grabbed all the source packages, and then tried to fetch their build-deps. But it turns out Debian moves too fast for this; by the time I got around to building some package a few days later, its build-deps had in some cases been updated and the old versions weren't available in apt any more. Now I instead start each build with an apt-get -y update, and then fetch the most recent source package info and build dependencies right before attempting the build.
- Avoid shell hackery. This is probably controversial, and I'm sure someone better and more careful at bash could do it, but trying to automate everything in a language where failures are silent and can do exciting things like call "rm -rf /" when you meant "rm -rf ${foo}/${bar}" is painful. Python has its own issues, but it was nice to at least get noisy errors as soon as things went wrong (example script: this one which uses python-apt to get source package info, rather than "parsing" Sources.gz with grep/awk/sed).
- Expect to be disappointed. Even after all of this, a lot of stuff is going to fail to build. Other things will be weird in ways you never dreamed software could be weird (hello, packages that spend 12 hours generating documentation using xsltproc!). You'll find fun stuff like packages that have clear security vulnerabilities, as revealed by compiler diagnostics like -Wformat-security (presumably these packages built fine under older, dumber compilers). Some of this can probably be mitigated by targeting Debian stable; unstable is, well, unstable, and brokenness is expected.
- Reproducible Debian uses rebuilderd to measure how much of Debian can be rebuilt reproducibly, ideally producing bit-for-bit identical binaries. This is run by NYU CCS alum Santiago Torres-Arias, now an assistant prof at Purdue!
- Debian also has its own reproducible builds project (at least, I think the two projects are separate); you can see the status of their efforts on the reproducible builds dashboard.
- Debian's QA team runs periodic build attempts. I think these use sbuild and some scripts to run things on AWS.
- These folks are attempting to build all of Debian with clang. They're taking a slightly more invasive approach than I'd like, though, by deleting gcc/g++ and replacing it with a symlink to clang/clang++.
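
To make a few of the build tips above concrete (run non-interactively, turn off stuff you don't want, use timeouts, log exit codes), here is a rough Python sketch of a per-package build wrapper. It is an illustration under assumptions, not the actual scripts used for these builds: the function names, the results.jsonl file, and the four-hour default timeout are all made up for the example, and the "bear --" invocation assumes Bear 3.x (older versions take the command without the "--").

```python
import json
import os
import subprocess
import time


def build_env(base=None):
    """Copy the environment and add the settings from the tips above."""
    env = dict(os.environ if base is None else base)
    env["DEBIAN_FRONTEND"] = "noninteractive"          # never prompt
    env["DEB_BUILD_OPTIONS"] = "nodoc notest nocheck"  # skip docs/tests
    return env


def run_logged(cmd, log_path, timeout_s, env=None, cwd=None):
    """Run cmd with a timeout, append all output to log_path, and return
    a record that says unambiguously how the command ended."""
    start = time.monotonic()
    with open(log_path, "ab") as log:
        try:
            proc = subprocess.run(
                cmd,
                stdin=subprocess.DEVNULL,  # nothing to answer prompts with
                stdout=log,
                stderr=subprocess.STDOUT,
                env=env,
                cwd=cwd,
                timeout=timeout_s,
            )
            code = proc.returncode
            status = "ok" if code == 0 else "failed"
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child on timeout; stray
            # grandchildren may survive and need separate cleanup.
            code, status = None, "timeout"
    return {
        "cmd": list(cmd),
        "status": status,
        "exit_code": code,
        "seconds": round(time.monotonic() - start, 1),
    }


def record_result(results_path, package, result):
    """Append one JSON line per build so success is easy to query later."""
    with open(results_path, "a") as f:
        f.write(json.dumps({"package": package, **result}) + "\n")


def build_package(package, src_dir, log_dir, timeout_s=4 * 3600):
    """Hypothetical driver: build one unpacked source tree under bear."""
    cmd = ["bear", "--", "dpkg-buildpackage", "-rfakeroot"]
    log_path = os.path.join(log_dir, package + ".log")
    result = run_logged(cmd, log_path, timeout_s, env=build_env(), cwd=src_dir)
    record_result(os.path.join(log_dir, "results.jsonl"), package, result)
    return result
```

Calling build_package("hello", "/build/hello-2.10", "/logs") (hypothetical paths) would leave a per-package log plus one line in results.jsonl, so a later pass can filter on "status" instead of grepping dpkg-buildpackage output. The timeout here only covers the top-level process, which is why a background reaper for hung helpers is still useful.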
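
Going back to the compile_commands.json databases mentioned at the start: each entry records the working directory, the source file, and the compiler invocation (either as an "arguments" list or a single "command" string), which is what makes it possible to re-run a compile with different output flags. Below is a minimal sketch of turning one entry into a command that emits LLVM IR. The helper names are hypothetical, and swapping the recorded compiler for clang is an assumption of this sketch: -emit-llvm is a clang flag, and packages built with gcc may carry flags clang rejects.

```python
import json
import shlex
from pathlib import Path


def load_compile_commands(path):
    """Load a compile_commands.json file as a list of entry dicts."""
    return json.loads(Path(path).read_text())


def entry_argv(entry):
    """Return the recorded compiler invocation as an argument list."""
    if "arguments" in entry:
        return list(entry["arguments"])
    return shlex.split(entry["command"])


def llvm_ir_argv(entry, out_path, compiler="clang"):
    """Rewrite one entry's command line to emit textual LLVM IR.

    Drops the original output file and -c/-S/-E, keeps everything else
    (includes, defines, warnings), and replaces the compiler.
    """
    new_argv = [compiler]
    skip_next = False
    for arg in entry_argv(entry)[1:]:
        if skip_next:          # this is the filename after a bare -o
            skip_next = False
            continue
        if arg == "-o":
            skip_next = True
            continue
        if arg.startswith("-o") and len(arg) > 2:  # joined form: -ofoo.o
            continue
        if arg in ("-c", "-S", "-E"):
            continue
        new_argv.append(arg)
    return new_argv + ["-S", "-emit-llvm", "-o", str(out_path)]
```

The resulting list can be handed to subprocess.run with cwd set to the entry's "directory", since relative include paths in the recorded flags are resolved from there. Dropping -emit-llvm from the final list would give plain assembly instead.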