Showing posts from February, 2022

On Building 30K Debian Packages

As part of my ongoing attempts to create some nice datasets for training large code models for C/C++, I've recently been attempting to build every package in Debian Unstable from source, using bear to log the compilation and generate a compile_commands.json database for each build. Since it's not possible, in general, to parse C/C++ code without knowing what flags were used (e.g., so you can find header files, know what preprocessor defines are in use, etc.), this will open up some nice possibilities like:

- Getting ASTs for each source file
- Rebuilding each file and generating its LLVM IR (-emit-llvm) or assembly (-S)
- Extracting comments associated with individual functions

I'll probably have more to say about this dataset once I actually get around to doing something fun with it, but for now I wanted to just jot down some notes on stuff I wish I had known before trying to do this:

- Isolation: Run the build for each package in some kind of isolated environment. You know how...
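To make the "rebuild each file as LLVM IR" idea above concrete, here is a minimal Python sketch of how one entry from a compile_commands.json database could be rewritten into an -emit-llvm invocation. It is not the pipeline used for this dataset. The function names are hypothetical, swapping the recorded compiler for clang is an assumption (GCC-only flags may need extra filtering), and only the separate "-o FILE" form of the output flag is handled.

```python
import json
import shlex
from pathlib import Path


def load_database(path):
    """Load a compile_commands.json file (a JSON list of entries)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def entry_args(entry):
    """Return the recorded compiler invocation of one entry as a list.

    The compilation database format allows either an "arguments" list
    or a single shell-quoted "command" string.
    """
    if "arguments" in entry:
        return list(entry["arguments"])
    return shlex.split(entry["command"])


def emit_llvm_command(entry, compiler="clang"):
    """Rewrite one entry so that it emits textual LLVM IR.

    Assumption: the original flags are acceptable to `compiler`.
    Only the separate "-o FILE" form is stripped; "-c" is dropped
    because "-S -emit-llvm" replaces it.
    """
    rewritten = [compiler]
    skip_next = False
    for arg in entry_args(entry)[1:]:  # [0] is the original compiler
        if skip_next:
            skip_next = False
            continue
        if arg == "-o":
            skip_next = True  # also skip the output path that follows
            continue
        if arg == "-c":
            continue
        rewritten.append(arg)
    ir_name = Path(entry["file"]).with_suffix(".ll").name
    return rewritten + ["-S", "-emit-llvm", "-o", ir_name]
```

Each resulting list could then be passed to subprocess.run with cwd set to the entry's "directory" field, since the paths recorded in the database are relative to that directory.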