On Building 30K Debian Packages

As part of my ongoing attempts to create some nice datasets for training large code models for C/C++, I've recently been attempting to build every package in Debian Unstable from source using bear to log the compilation and generate a compile_commands.json database for each build. Since it's not possible, in general, to parse C/C++ code without knowing what flags were used (e.g., so you can find header files, know what preprocessor defines are in use, etc.), this will open up some nice possibilities like:

  • Getting ASTs for each source file
  • Rebuilding each file and generating its LLVM IR (-emit-llvm) or assembly (-S)
  • Extracting comments associated with individual functions
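
As a rough illustration of the second item, here is a minimal sketch of how one compile_commands.json entry might be turned into a command that emits LLVM IR. It assumes each entry carries either an "arguments" list or a "command" string (the format allows both), swaps in clang because -emit-llvm is a Clang flag, and only strips the separate "-o file" form. The names ir_command and load_entries are hypothetical helpers, not part of bear, and GCC-only flags in the logged command may still need filtering.

```python
import json
import shlex


def load_entries(path):
    """Read a compile_commands.json file into a list of dicts."""
    with open(path) as f:
        return json.load(f)


def ir_command(entry, out_path, compiler="clang"):
    """Derive an LLVM IR command from one logged compile command (sketch)."""
    # The format allows either an "arguments" list or a "command" string.
    if "arguments" in entry:
        args = list(entry["arguments"])
    else:
        args = shlex.split(entry["command"])

    kept = [compiler]  # assumption: replace the logged compiler with clang
    skip_next = False
    for arg in args[1:]:
        if skip_next:
            skip_next = False        # this was the old output path; drop it
        elif arg == "-o":
            skip_next = True         # drop "-o" and the path that follows
        elif arg not in ("-c", "-S"):
            kept.append(arg)         # keep -I, -D, the source file, etc.
    return kept + ["-S", "-emit-llvm", "-o", out_path]
```

Each resulting command would need to be run with its working directory set to the entry's "directory" field, since relative paths in the logged command are relative to it.
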
I'll probably have more to say about this dataset once I actually get around to doing something fun with it, but for now I wanted to just jot down some notes on stuff I wish I had known before trying to do this:

  • Isolation: Run the build for each package in some kind of isolated environment. You know how packages sometimes have install-time conflicts? It's 100x worse for build-time conflicts.
  • Use an SSD: Make sure to build things somewhere with fast storage. A huge amount of compiling stuff is just reading it off disk and writing it back. Because my main Docker daemon stores its images on spinning rust, I ran a separate Docker daemon for the SSD with a minimal config file. Then you can just set DOCKER_HOST=unix:///var/run/docker-nvme.sock and build/run your images.
  • Log everything, especially exit codes. I got through a whole pass before realizing I didn't have a reliable way to tell which packages had built successfully (dpkg-buildpackage emits an exciting array of inconsistent messages), and had to re-run everything.
  • Turn off stuff you don't want. I don't care about running tests or building documentation, so I set DEB_BUILD_OPTIONS="nodoc notest nocheck". Unfortunately, not every package respects the build options, but it's worth a try.
  • Don't build as root. A number of packages detect if you're trying to build stuff as root and will die (coreutils is one example). This is an easy mistake to make in Docker, where running as root is the default. Run as a normal user, and use "dpkg-buildpackage -rfakeroot" so that it can pretend to be root for packages that do want to be built as root.
  • Run non-interactively. There are a few packages that, when installed, try to ask the user some questions and will hang forever unless DEBIAN_FRONTEND=noninteractive is set. So set it, and make sure it gets passed on to child processes (a particularly annoying example is sudo, where you have to add -E to make it inherit the environment).
  • Use timeouts. Particularly in an isolated environment like Docker, sometimes stuff will just hang during build (or maybe in some cases it's bear's fault, IDK). Some common culprits I've found so far are xvfb-run and erl_child_setup, and (maybe) things that expect dbus to be present. Aside from setting a timeout, I also ran a script in the background to find and kill any of those processes that were hanging around longer than a few minutes. [Actually, rather than killing them, which will make them exit with a non-zero status and cause the build to error out, I used this nice trick from Kyle Huey to attach to them with gdb and inject a call to exit(0)]
  • Clean up. Since you're using a nice fast SSD, it's probably not enormous (mine is a measly 2TB). Builds are big. You may want to remember to move your build artifacts to somewhere roomier so that you don't run out of space (this tends to make build systems very unhappy).
  • Stay up to date. Initially I just parsed Sources.gz, grabbed all the source packages, and then tried to fetch their build-deps. But it turns out Debian moves too fast for this; by the time I got around to building some package a few days later, its build-deps had in some cases been updated and the old versions weren't available in apt any more. Now I instead start each build with an apt-get -y update, and then fetch the most recent source package info and build dependencies right before attempting the build.
  • Avoid shell hackery. This is probably controversial, and I'm sure someone better and more careful at bash could do it, but trying to automate everything in a language where failures are silent and can do exciting things like call "rm -rf /" when you meant "rm -rf ${foo}/${bar}" is painful. Python has its own issues, but it was nice to at least get noisy errors as soon as things went wrong (example script: this one which uses python-apt to get source package info, rather than "parsing" Sources.gz with grep/awk/sed).
  • Expect to be disappointed. Even after all of this a lot of stuff is going to fail to build. Other things will be weird in ways you never dreamed software could be weird (hello, packages that spend 12 hours generating documentation using xsltproc!). You'll find fun stuff like packages that have clear security vulnerabilities, as revealed by compiler diagnostics like -Wformat-security (presumably these packages built fine under older, dumber compilers). Some of this can probably be mitigated by targeting Debian stable; unstable is, well, unstable, and brokenness is expected.
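
To make a few of these notes concrete (log exit codes, run non-interactively, use timeouts), here is a minimal sketch of a build wrapper. It is not the script I actually used; run_logged, its parameters, and the JSON-lines status file are hypothetical choices for illustration, and it assumes a POSIX system so that the build can be put in its own process group and killed as a whole on timeout.

```python
import json
import os
import signal
import subprocess
import time


def run_logged(cmd, status_path, output_path, timeout_s, extra_env=None):
    """Run one build command, enforce a timeout, and record its exit code."""
    env = dict(os.environ)
    env["DEBIAN_FRONTEND"] = "noninteractive"           # never prompt
    env["DEB_BUILD_OPTIONS"] = "nodoc notest nocheck"   # skip docs and tests
    env.update(extra_env or {})

    start = time.time()
    with open(output_path, "ab") as out:
        # start_new_session gives the build its own process group, so a
        # timeout can kill children (make, compilers, ...) along with it.
        proc = subprocess.Popen(cmd, env=env, stdin=subprocess.DEVNULL,
                                stdout=out, stderr=subprocess.STDOUT,
                                start_new_session=True)
        try:
            status = proc.wait(timeout=timeout_s)
            timed_out = False
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)
            status = proc.wait()     # negative value means killed by signal
            timed_out = True

    record = {"cmd": cmd, "exit_code": status, "timed_out": timed_out,
              "seconds": round(time.time() - start, 1)}
    # One JSON object per line: easy to append to and to grep afterwards.
    with open(status_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

A call might look like run_logged(["dpkg-buildpackage", "-rfakeroot"], "status.jsonl", "build.log", 4 * 3600), with the command wrapped in bear as appropriate for your bear version; the four-hour limit is just a placeholder. Note that a hard kill makes the build fail, which is what the gdb trick mentioned above is meant to avoid.
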
No doubt I've missed lots of things that make this a more pleasant and reliable experience! There are a number of other projects that are also attempting to build all (or large portions) of Debian, which I probably should have looked at in more detail before attempting to roll my own (my only excuse is that I wanted something I knew how to extend and modify to do weird stuff like tracing build commands and recompiling individual files with other flags).
I'm hoping to dig into these more established efforts and see what tips and tricks I can steal for my own infrastructure. And if you know of other helpful hints, please let me know!

Comments

Gleber said…
Or just use nixpkgs and modify stdenv a bit to produce the additional files. It'll take care of the rest
Erico said…
You should check out SUSE's Open Build Service. We've been doing reproducible builds for the entire distro for more than a decade now, and I think it matches most of your criteria. It's also free to use and has a command-line client (osc) that could be used for extracting the data you require: http://build.opensuse.org
