Someone’s Been Messing With My Subnormals!

TL;DR: After noticing an annoying warning, I went on an absurd yak shave, and discovered that because of a tiny handful of Python packages built with an appealing-sounding but dangerous compiler option, more than 2,500 Python packages—some with more than a million downloads per month—could end up causing any program that uses them to compute incorrect numerical results.

Once Upon a Time in My Terminal

Recently, whenever I tried to import certain Python packages (notably, some models from Huggingface Transformers), I would see this weird and unsightly warning:

Python 3.8.10 (default, Jun 22 2022, 20:18:18)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from transformers import CodeGenForCausalLM
/home/moyix/.virtualenvs/sfcodegen/lib/python3.8/site-packages/numpy/core/getlimits.py:498: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/moyix/.virtualenvs/sfcodegen/lib/python3.8/site-packages/numpy/core/getlimits.py:88: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/moyix/.virtualenvs/sfcodegen/lib/python3.8/site-packages/numpy/core/getlimits.py:498: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/moyix/.virtualenvs/sfcodegen/lib/python3.8/site-packages/numpy/core/getlimits.py:88: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)

Someone was messing with my floating point subnormals! Aside from the warnings being really annoying, it also made me a little worried. Floating point math is notoriously tricky, and if something is changing the behavior of the floating point unit (FPU) on the CPU, it can cause all sorts of weird problems. For example, some numerical algorithms depend on the standard FPU behavior and will fail to converge if the FPU is set to treat subnormal/denormal numbers as zero (on x86, by setting the FTZ/DAZ flags in the MXCSR register).
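
To make the stakes a bit more concrete, here is a tiny demonstration (mine, not part of numpy's warning machinery) of the behavior that FTZ/DAZ changes; the commented values are what an unaffected interpreter should print:

import numpy as np

# The smallest normal float32 is about 1.18e-38; halving it should give a
# subnormal (about 5.88e-39), not zero.
tiny = np.float32(1.1754944e-38)
print(tiny / 2)          # 5.877472e-39 with the default FPU state
print(tiny / 2 == 0)     # False with the default FPU state

# With FTZ/DAZ set (e.g., after an -ffast-math library has been loaded),
# the division flushes to zero and the two lines become 0.0 and True.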

Some Googling led me to this issue, which pointed toward some shared library compiled with the gcc/clang option -ffast-math that was being loaded as the culprit. It turns out (somewhat insanely) that when -ffast-math is enabled, the compiler will link in a constructor that sets the FTZ/DAZ flags whenever the library is loaded — even on shared libraries, which means that any application that loads that library will have its floating point behavior changed for the whole process. And -Ofast, which sounds appealingly like a "make my program go fast" flag, automatically enables -ffast-math, so some projects may unwittingly turn it on without realizing the implications.

But what shared libraries were even being loaded? We can find out by taking the PID of our Python interpreter and then looking at /proc/[PID]/maps; this will show all the mapped memory regions in the process and show the names for the ones that are file-backed (which shared libraries are). We can then filter that by grepping for r-xp (readable + executable + private/copy-on-write) to only list code sections, and then look for shared objects:

moyix@isabella:~$ cat /proc/2902749/maps | grep 'r-xp' | grep -F '.so' | wc -l
158

Oof, 158 shared objects? In my Python process? It's more common than you think.
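
If you'd rather do the same count from inside the interpreter, a rough Python equivalent of that shell pipeline (Linux-only, and a sketch rather than anything polished) looks like this:

def loaded_shared_objects():
    # Each line of /proc/self/maps looks like:
    #   address           perms offset  dev   inode   pathname
    # Keep executable, file-backed mappings whose path mentions ".so".
    libs = set()
    with open('/proc/self/maps') as maps:
        for line in maps:
            fields = line.split()
            if len(fields) >= 6 and 'x' in fields[1] and '.so' in fields[5]:
                libs.add(fields[5])
    return sorted(libs)

print(len(loaded_shared_objects()))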

So now I wanted to narrow down which library (or libraries) was actually setting FTZ/DAZ. After going down a somewhat painful blind alley where I tried to modify QEMU's user-mode emulation so that all updates to MXCSR would get logged along with the name of the library containing the current program counter (NB: it turns out to be really annoying to map the program counter back to a library name), I hit on a simpler strategy.

Python lets you load arbitrary shared objects using ctypes.CDLL. So if I could just load each of those shared libraries one at a time and then trigger the numpy code that prints the warning, I could identify which library was messing with my floating point behavior. By reading through the getlimits.py file mentioned in the warning, I figured out that the check could be triggered by printing out numpy.finfo(numpy.float32), and ended up with this script:

import sys
from ctypes import CDLL

CDLL(sys.argv[1])

import numpy as np
print(np.finfo(np.float32))

We can check that it works by building an empty C file as a shared library with and without -Ofast. Without -Ofast:

(sfcodegen) moyix@isabella:~$ pygmentize foo.c
void empty() { }
(sfcodegen) moyix@isabella:~$ gcc -fpic -shared foo.c -o foo.so
(sfcodegen) moyix@isabella:~$ python fptest.py ./foo.so
Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
smallest_normal = 1.1754944e-38   smallest_subnormal = 1.4012985e-45
---------------------------------------------------------------

But with -Ofast enabled:

/home/moyix/.virtualenvs/sfcodegen/lib/python3.8/site-packages/numpy/core/getlimits.py:498: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/moyix/.virtualenvs/sfcodegen/lib/python3.8/site-packages/numpy/core/getlimits.py:88: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
smallest_normal = 1.1754944e-38   smallest_subnormal = 0.0000000e+00
---------------------------------------------------------------

Great! So now we have a detector for shared libraries that set FTZ/DAZ. It's kind of annoying that we have to load the library to check, but we'll fix that later. Running it in a loop over all the libraries that we found were loaded in the Python process earlier, I found that the culprit was gevent, of all things (why is an event-based networking library messing with floating point behavior??).
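
For completeness, the loop itself can be automated roughly like this (a sketch rather than exactly what I ran; it assumes the candidate .so paths from /proc/[PID]/maps are saved one per line in libs.txt, and runs each test in a fresh interpreter so one bad library can't contaminate the next):

import subprocess
import sys

for lib in open('libs.txt').read().splitlines():
    result = subprocess.run(
        [sys.executable, 'fptest.py', lib],
        capture_output=True, text=True,
    )
    # numpy emits its UserWarning about the smallest subnormal on stderr.
    if 'smallest subnormal' in result.stderr:
        print(f'{lib} sets FTZ/DAZ')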

Of course, now that I knew it was gevent, some more Googling located the relevant bug report. It seems that it was a known bug with an attempted fix, but the fix didn't quite work (it turns out that when you use -Ofast, adding -fno-fast-math does not, in fact, disable fast math. lol. lmao.) and so the most recent version of gevent on PyPI still messes with floating point behavior for no good reason.

What else is out there?

With the immediate mystery solved, I wanted to figure out how many other projects might have (intentionally or inadvertently) enabled -ffast-math in their shared libraries uploaded to PyPI. So I decided to take the top 25% of projects on PyPI by number of downloads and scan their binary wheels (Python jargon for precompiled binaries) to see if they, too, messed with floating point behavior.

Let's start by finding the top 25% of projects. We can get a list of the number of downloads for each project on PyPI by querying a handy BigQuery table that PyPI publishes. Skipping over all the warnings about how inaccurate the download numbers are, I was able to write this little bit of BigQuery SQL that gives the total downloads for each project over the past 30 days:

#standardSQL
SELECT file.project, COUNT(*) AS num_downloads
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
-- Only query the last 30 days of history
DATE(timestamp)
BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
AND CURRENT_DATE()
GROUP BY
file.project

And then save that as a CSV file for further analysis. Once we have the CSV, we can drop it into pandas (or your favorite data analysis toolbox) and extract out a list of the names of the top 25% of projects by download count pretty easily:

(sfcodegen) moyix@isabella:~$ ipython
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.read_csv('pypi_downloads_20220901_30d.csv')

In [4]: top_25p = df[df['num_downloads'] > df['num_downloads'].quantile(0.75)]

In [5]: top_25p.head()
Out[5]:
        project  num_downloads
0          lima           2384
1    kiwisolver       25721252
2   quill-delta           6128
3       aiorpcx          12998
4  flake8-flask           6843

In [6]: len(top_25p)
Out[6]: 102864

In [7]: np.savetxt('top_25p_projects.txt', top_25p['project'].values, fmt='%s')

Okay, around 100K projects, that's manageable. Now it was time to try to download the wheels for those projects, extract them, and check for any floating point funny business. But wait, how are we going to do that check? I don't really want to load a bunch of .so files downloaded off the internet, given that any of them could execute arbitrary code. [My worries here about executing arbitrary code will seem richly ironic later in the post.]

Interlude: A Static Checker for crtfastmath

How can we check if a library will mess with FTZ/DAZ without actually running any of its code? Let's go back to how the code that sets these bits actually gets executed in the first place. Shared libraries on Linux get loaded by the dynamic loader, which loops over the contents of the .init_array section looking for constructors that should be called when the library is loaded and then calling each one. We can print out the contents of this section using objdump:

(sfcodegen) moyix@isabella:~$ objdump -s -j .init_array foo.so

foo.so:     file format elf64-x86-64

Contents of section .init_array:
 3e78 10110000 00000000 40100000 00000000  ........@.......

Each entry in this array is a pointer stored in little-endian form; on a 64-bit system a pointer is 8 bytes, so in this example we have two constructors, and we can print their addresses with this awful incantation (my specialty!):

(sfcodegen) moyix@isabella:~$ objdump -s -j .init_array foo.so | sed -e '1,/Contents/ d' | cut -c 7-40 | xxd -r -p | od -An -t x8 -w8
 0000000000001110
 0000000000001040

Now we can disassemble them with objdump. The first one is not very exciting:

(sfcodegen) moyix@isabella:~$ objdump -d --start-address=0x0000000000001110 foo.so | head -20

foo.so:     file format elf64-x86-64

Disassembly of section .text:

0000000000001110 <frame_dummy>:
    1110:       f3 0f 1e fa             endbr64
    1114:       e9 77 ff ff ff          jmpq   1090 <register_tm_clones>
    1119:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

But the second one is what we want to look for (note: in stripped binaries, you won't see the nice function name, so we'll have to detect it by looking at the actual instructions):

(sfcodegen) moyix@isabella:~$ objdump -d --start-address=0x0000000000001040 foo.so | head -20

foo.so:     file format elf64-x86-64

Disassembly of section .text:

0000000000001040 <set_fast_math>:
    1040:       f3 0f 1e fa             endbr64
    1044:       0f ae 5c 24 fc          stmxcsr -0x4(%rsp)
    1049:       81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
    1050:       00
    1051:       0f ae 54 24 fc          ldmxcsr -0x4(%rsp)
    1056:       c3                      retq
    1057:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
    105e:       00 00

This snippet uses stmxcsr to store the current value of the MXCSR register to the stack, ORs it with 0x8040 (FTZ is bit 15, 0x8000, and DAZ is bit 6, 0x0040), and then uses ldmxcsr to load the result back into MXCSR. Simple enough!

I wrote a hacky Python script that uses basically these same objdump commands to disassemble each constructor up until the first return (retq) instruction, and report if any of them contain all three of stmxcsr, ldmxcsr, and the constant 0x8040. This will miss any code pattern that doesn't exactly match the code found in gcc's crtfastmath library, but it's good enough and shouldn't have false positives.

Unfortunately, it also turns out to be quite slow, since it invokes objdump once to get the .init_array entries, and then N times to disassemble (once for each constructor). This really adds up when you want to scan thousands of files, especially because (as I now am doomed to know forever) shared libraries built in C++ can have thousands of constructors (the highest I saw was in scine-serenity-wrapper's serenity.module.so, which has a whopping 11,841 different constructors (!!!)).

Instead I switched to searching for the exact byte sequence that encodes the stmxcsr, orl, ldmxcsr, and retq instructions. This is even more brittle than the disassembly-based technique, but since none of the instructions involved use any memory addresses (which would change depending on exactly where crtfastmath was linked into the binary), the sequence is pretty consistent over the range of compilers I could check (gcc-{5,7,8,9,11} and clang-{10,11,12,14}). The only difference I saw is that newer compilers start the function with endbr64 (an instruction that tells Intel's Control-flow Enforcement Technology, an exploit mitigation technique, that this is a valid jump target).

The final script can be found here. It uses pyelftools to parse the binary and extract the list of constructors, then tests for a string match at each of the constructors listed. This is much faster than multiple calls to objdump and can check thousands of binaries per second, which is the kind of scale we need.
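
To give a flavor of the approach, here is a stripped-down sketch of that check (not the actual script linked above): it reads the .init_array entries with pyelftools, maps each constructor's address back to a file offset via the loadable program headers, and looks for the crtfastmath byte pattern shown in the disassembly, allowing for an optional endbr64 in front. It assumes a 64-bit, little-endian ELF.

import struct
import sys

from elftools.elf.elffile import ELFFile

# stmxcsr -0x4(%rsp); orl $0x8040,-0x4(%rsp); ldmxcsr -0x4(%rsp); retq
SIGNATURE = bytes.fromhex('0fae5c24fc814c24fc408000000fae5424fcc3')

def has_fast_math_ctor(path):
    with open(path, 'rb') as f:
        raw = f.read()
        f.seek(0)
        elf = ELFFile(f)
        init_array = elf.get_section_by_name('.init_array')
        if init_array is None:
            return False
        ctor_data = init_array.data()
        ctors = struct.unpack('<%dQ' % (len(ctor_data) // 8), ctor_data)
        for addr in ctors:
            # Translate the constructor's virtual address into a file offset
            # using the PT_LOAD program headers.
            for seg in elf.iter_segments():
                if (seg['p_type'] == 'PT_LOAD'
                        and seg['p_vaddr'] <= addr < seg['p_vaddr'] + seg['p_filesz']):
                    off = addr - seg['p_vaddr'] + seg['p_offset']
                    # Allow a few leading bytes (e.g. endbr64) before the pattern.
                    if SIGNATURE in raw[off:off + 16 + len(SIGNATURE)]:
                        return True
    return False

if __name__ == '__main__':
    print(has_fast_math_ctor(sys.argv[1]))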

Obtaining 100GB (er, 11.6TB) of Shared Libraries from PyPI

In the first draft of this post, I was going to wimp out and just do the top 25%, and only one binary wheel for the most recent version of each project. But my ambition and curiosity (and a strong desire to procrastinate on preparing for my first lecture on Tuesday) got the better of me, so I decided to go ahead and get all of the x86-64 binary wheels for every version of all packages on PyPI.

We start by getting a list of all Python package names using the PyPI Simple Index, which is basically just a giant HTML file with one link per package.

Next we want to get the metadata for each package, including the list of releases and all the URLs to the actual binary packages. Each package has a JSON description at https://pypi.org/pypi/NAME/json; for example, here's the metadata for gevent, the package that kicked off this quixotic quest. We can download all of them using wget; I'll also request gzip compression so that I can save PyPI some bandwidth:

(sfcodegen) moyix@isabella:/fastdata/pypi_wheels$ cat all_packages.txt | sed 's|^|https://pypi.org/pypi/|;s|$|/json|' > pypi_all_urls.txt
(sfcodegen) moyix@isabella:/fastdata/pypi_wheels$ wget --no-verbose --header="Accept-Encoding: gzip" -O pypi_all.json.gz -i pypi_all_urls.txt 2> wget_errors.txt

When you use -O with multiple files like this, wget will concatenate all of them together into one big file, but luckily it turns out that the concatenation of multiple gzip files is itself a valid gzip file, so we end up with all package metadata (well, almost all – about 10,723 packages returned 404, but that's not too bad compared to the 386,544 packages we did manage to get info for).
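
If the multi-member gzip trick sounds too good to be true, it's easy to convince yourself with Python's gzip module, which transparently decompresses concatenated members (the JSON snippets here are just placeholders):

import gzip

blob = gzip.compress(b'{"name": "first"}\n') + gzip.compress(b'{"name": "second"}\n')
print(gzip.decompress(blob).decode(), end='')
# {"name": "first"}
# {"name": "second"}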

After about an hour and an embarrassing amount of searching StackOverflow, I came up with a terrifying jq one-liner to extract out all the URLs of the x86-64 Linux wheels from the 1.5GB of compressed JSON data:

(sfcodegen) moyix@isabella:/fastdata/pypi_wheels$ zcat pypi_all.json.gz | jq -cr '.releases[] | .[] | [select(.packagetype | contains("bdist_wheel"))] | .[] | select(.url | test("manylinux.*x86_64")) | [.size, .url] | join(" ")' > pypi_all_wheels.txt

This gives us 269,752 wheels to download, along with their sizes. Adding up all the sizes, we get a grand total of 4 TB, which I managed to download in about 12 hours using aria2 (while I love wget, for a job this big I wanted something that could save/resume sessions and make many connections at once).

Now we have a small snag. While I do have quite a lot (10 TB) of NVME storage, it's still not enough to unpack all the shared objects at once for scanning. Instead I put together a small script to unpack just one package at a time, scan it, and then remove the extracted data. Finally, I am legally obligated to inform you that I used GNU Parallel to scan all the wheels:

(sfcodegen) moyix@isabella:/fastdata/pypi_wheels$ find wheels -type f -print0 | parallel --progress -0 --xargs python extract_and_scan.py {#} {}

[This didn't go quite as smoothly as I'm making out — when you're scanning ~2.4 million shared libraries across ~270K wheels, you find all sorts of edge cases in your code; I discovered .so files that were in fact text files, ARM and 32-bit x86 libraries bundled into an x86-64 package, and some partially mangled zip files.]
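
For reference, the per-wheel step looks roughly like this (a sketch, not the actual extract_and_scan.py; it assumes the has_fast_math_ctor checker sketched earlier lives in a module called ffast_check):

import sys
import tempfile
import zipfile

from ffast_check import has_fast_math_ctor  # the checker sketched earlier (assumed module name)

def scan_wheel(wheel_path):
    # A wheel is just a zip archive: pull out each .so member, scan it, and
    # let the temporary copy be deleted on close.
    hits = []
    try:
        with zipfile.ZipFile(wheel_path) as wheel:
            for name in wheel.namelist():
                if not name.endswith('.so'):
                    continue
                with tempfile.NamedTemporaryFile(suffix='.so') as tmp:
                    tmp.write(wheel.read(name))
                    tmp.flush()
                    if has_fast_math_ctor(tmp.name):
                        hits.append(name)
    except zipfile.BadZipFile:
        print(f'{wheel_path}: not a valid zip file', file=sys.stderr)
    return hits

if __name__ == '__main__':
    for wheel_path in sys.argv[1:]:
        for so_name in scan_wheel(wheel_path):
            print(f'{wheel_path}: {so_name} contains ffast-math constructor')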

And so now we have everything we need to count how many x86-64 shared libraries there are in all of PyPI that were linked with -ffast-math:

(sfcodegen) moyix@isabella:/fastdata/pypi_wheels$ grep -F 'contains ffast-math constructor' ffast_results_all.txt | wc -l
5830

A mere 5,830 out of the 2.4 million we scanned!

On the Origin of Bad Floating Point Math

To find out a little more about the packages that they're associated with, we can extract the dist-info metadata from each wheel. Unfortunately the first library I found to do this was extremely strict, and barfed on the many, many wheels in my dataset with slightly malformed metadata. So I went on yet another giant yak shaving expedition and wrote my own robust parser.

Along the way I learned a lot of fun facts about Python's packaging metadata. Did you know that the format of the METADATA file is actually based on email? And that because email is notoriously difficult to specify, the standard says that the format is "[...] what the standard library email.parser module can parse using the compat32 policy"? Or that the various files that can appear in the dist-info directory are an exciting menagerie of CSV, JSON, and Windows INI formats? So much knowledge that I now wish I could unlearn!
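
For the curious, the email-message heritage is easy to see in practice; parsing a (made-up) METADATA snippet with the standard library looks like this:

from email.parser import Parser
from email.policy import compat32

metadata_text = """\
Metadata-Version: 2.1
Name: example-package
Version: 1.0.0
Summary: An entirely made-up example
"""

msg = Parser(policy=compat32).parsestr(metadata_text)
print(msg['Name'], msg['Version'], '-', msg['Summary'])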

Anyway, let's get some stats. I dumped all shared library info and package metadata into a big pandas DataFrame to play around with for this.

First, how many distinct packages have at least one binary that was built with -ffast-math?

In [28]: df['Package'].unique()
Out[28]:
array(['archive-pdf-tools', 'bgfx-python',
       'bicleaner-ai-glove', 'BTrees', 'cadbiom',
       'ctranslate2', 'dyNET', 'dyNET38', 'gevent',
       'glove-python-binary', 'higra', 'hybridq', 'ikomia',
       'ioh', 'jij-cimod', 'lavavu', 'lavavu-osmesa',
       'MulticoreTSNE', 'neural-compressor', 'nwhy',
       'openjij', 'openturns', 'perfmetrics', 'pHashPy',
       'pyace-lite', 'pyapr', 'pycompadre',
       'pycompadre-serial', 'PyKEP', 'pykep',
       'pylimer-tools', 'pyqubo', 'pyscf', 'PyTAT',
       'python-prtree', 'qiskit-aer', 'qiskit-aer-gpu',
       'RelStorage', 'sail-ml', 'segmentation', 'sente',
       'sinr', 'snapml', 'superman', 'symengine',
       'systran-align', 'texture-tool', 'tsne-mp', 'xcsf'],
      dtype=object)

In [29]: len(df['Package'].unique())
Out[29]: 49

Not nearly as bad as I thought! There are only 49 packages that, at some point in their history, released a binary wheel that had a shared library linked with -ffast-math. We can find out more about them by having pandas make an HTML table with the package summary for us:

In [54]: counts = df.groupby(['Package','Summary']).size()

In [55]: counts = counts.reset_index().rename(columns={0:'Count'})

In [56]: counts = counts.sort_values(by='Count',ascending=False)

In [57]: counts.to_html('count.html',justify='center',index=False)

And we get a (pretty bare-bones) table that you can look at here. Here are the first few rows:

Package | Summary | Count
BTrees | Scalable persistent object containers | 1166
gevent | Coroutine-based network library | 1054
qiskit-aer | Qiskit Aer - High performance simulators for Qiskit | 589
qiskit-aer-gpu | Qiskit Aer - High performance simulators for Qiskit | 448
ctranslate2 | Optimized inference engine for OpenNMT models | 335
snapml | Snap Machine Learning | 258
neural-compressor | Repository of Intel® Neural Compressor | 234
RelStorage | A backend for ZODB that stores pickles in a relational database. | 222
ikomia | Ikomia Python API for Computer Vision workflow and plugin integration in Ikomia Studio | 165

Unsurprisingly, a lot of these are various kinds of scientific software. I have never met a scientist who can resist the lure of fast-but-dangerous math when doing numerical simulations. But others, I think, are much more likely to simply be mistakes:

  • BTrees - Scalable persistent object containers. I don't think a generic data structure library should be changing the floating point behavior. And indeed if we look at their GitHub repo, there's an open pull request to disable the use of -Ofast.
  • gevent - Coroutine-based network library. I covered this one in the intro; it's definitely way out of line for a networking library to be messing with the FPU; we found the pull request that fixes it (also still un-merged, sadly) earlier.
  • RelStorage - A backend for ZODB that stores pickles in a relational database. I can't imagine how storing pickles in a database would need floating point math, so this seems like a mistake. This time I don't see any issues or PRs asking about -Ofast, but we can see in the script used to generate the manylinux builds that it is indeed enabled.
  • perfmetrics - Send performance metrics about Python code to Statsd. This has nothing to do with floating point math. Once again no PR or issue, but I notice that this, RelStorage, and BTrees are all Zope Foundation projects. The BTrees issue mentioned the Zope Foundation meta project as the source of the build configurations, and look what we find there (in config/c-code/tests.yml.j2):

# Initially copied from
# https://github.com/actions/starter-workflows/blob/main/ci/python-package.yml
# And later based on the version jamadden updated at
# gevent/gevent, and then at zodb/relstorage and zodb/perfmetrics

So in fact all the packages we've seen so far can trace their use of -Ofast to gevent!

How Bad is This? It Depends

Well who cares about all this, you might ask; I can just avoid using those libraries until the problem is fixed, right? Well, maybe. But I had no idea I was using gevent until numpy started yelling at me, and I still don't know the exact chain of dependencies that caused it to get installed in the first place. So can we get a count of how many packages might (directly or indirectly) depend on a package that has an -ffast-math library?

This is not as easy as it seems. Python packaging metadata is in a pretty sorry state, and you can't easily get a simple list of the dependencies of each package in machine-readable form. It turns out that the recommended way to get the dependencies of a particular package is to... just install it with pip and see what happens. And since we're looking for reverse dependencies, it's even worse -- we would have to install every package on PyPI and see if any of them pulled in one of the libraries we discovered!

I actually started down this path and set about running pip install --dry-run --ignore-installed --report on all 397,267 packages. This turned out to be a terrible idea. Unbeknownst to me, even with --dry-run pip will execute arbitrary code found in the package's setup.py. In fact, merely asking pip to download a package can execute arbitrary code (see pip issues 7325 and 1884 for more details)! So when I tried to dry-run install almost 400K Python packages, hilarity ensued. I spent a long time cleaning up the mess, and discovered some pretty poor setup.py practices along the way. But hey, at least I got two free pictures of anime catgirls, deposited directly into my home directory. Convenient!

Once I had managed to clean up the mess (or so I hope, anyway; I never did find out what package tried to execute sudo), I decided I needed a different approach. Eduardo Vela helpfully pointed me to Google's Open Source Insights project, deps.dev, which catalogues the dependencies and dependents (reverse dependencies) for PyPI, npm, Go, Maven, and Cargo. They don't have an official API, but a bit of poking around in Chrome Devtools' Network tab turned up a simple JSON endpoint that I could query. And I figured since I was only querying 422 package+version combinations, I could probably get away with it.

That, finally, let me produce this table, which you can see in its full form here (I've omitted all packages with less than 4 dependents below since this post is already pretty long):


Package | Version | Count
gevent | 21.12.0 | 1592
BTrees | 4.10.0 | 618
gevent | 20.9.0 | 51
gevent | 21.1.2 | 36
gevent | 20.6.2 | 24
gevent | 21.8.0 | 13
perfmetrics | 3.2.0 | 12
ctranslate2 | 2.17.0 | 9
gevent | 20.12.1 | 9
qiskit-aer | 0.8.2 | 9
qiskit-aer | 0.10.3 | 9
dyNET | 2.1.2 | 8
qiskit-aer | 0.9.1 | 7
dyNET38 | 2.1 | 6
gevent | 20.6.1 | 6
qiskit-aer | 0.7.5 | 6
qiskit-aer | 0.1.1 | 5
gevent | 20.6.0 | 4
glove-python-binary | 0.2.0 | 4
pyqubo | 1.0.13 | 4
qiskit-aer | 0.6.1 | 4
qiskit-aer | 0.7.3 | 4
qiskit-aer | 0.9.0 | 4

A total of 2,514 packages eventually depend on a package that uses -ffast-math. (This number is still approximate; deps.dev doesn't tell you the actual names and versions of all of the reverse dependencies (just a random sample), so the counts here may include some duplicates. It may also underestimate the extent of the problem, since there are some fairly popular packages like PyTorch that aren't on PyPI).

Also, even with only the random sample of dependents provided by deps.dev, we can use the download count CSV we generated at the beginning of this post to get a sense of how popular the affected projects are. Assuming we have a list of the names of the reverse dependencies in a file named "rdeps.txt", we can just do:
(sfcodegen) moyix@isabella:/fastdata/pypi_wheels$ ipython
Python 3.8.10 (default, Jun 22 2022, 20:18:18) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: df = pd.read_csv('pypi_downloads_20220901_30d.csv')

In [3]: rdeps = set(open('rdeps.txt').read().splitlines())

In [4]: sorted_df = df.sort_values(by='num_downloads', ascending=False, ignore_index=True)

In [5]: sorted_df['rank'] = sorted_df.index+1

In [6]: rdep_ranks = sorted_df[sorted_df['project'].isin(rdeps)]

In [7]: top20 = rdep_ranks.nsmallest(20, 'rank')

In [8]: from tabulate import tabulate

In [9]: print(tabulate(top20, headers='keys', tablefmt='psql', showindex=False))
+--------------------------+-----------------+--------+
| project                  |   num_downloads |   rank |
|--------------------------+-----------------+--------|
| geventhttpclient         |         1116639 |   1296 |
| locust                   |         1045789 |   1345 |
| flask-socketio           |          914846 |   1424 |
| dagster                  |          497982 |   1974 |
| grequests                |          365292 |   2328 |
| dedupe                   |          358737 |   2352 |
| websocket                |          346515 |   2397 |
| gevent-websocket         |          322249 |   2466 |
| dagster-graphql          |          310382 |   2525 |
| locust-plugins           |          273452 |   2669 |
| interpret-community      |          268129 |   2692 |
| zope-index               |          240577 |   2836 |
| dedupe-variable-datetime |          223080 |   2932 |
| parallel-ssh             |          220882 |   2947 |
| azureml-interpret        |          211233 |   3009 |
| locustio                 |          102322 |   4203 |
| allennlp                 |           92268 |   4498 |
| interpret                |           91404 |   4521 |
| rasa-sdk                 |           87875 |   4609 |
| pykafka                  |           78679 |   4837 |
+--------------------------+-----------------+--------+

So there are some very popular packages that will pull in one of the 49 packages that was built with -ffast-math! With a little more work we can include the name of the fast-math-enabled package as well:

+--------------------------+----------+-----------------+--------+
| project                  | source   |   num_downloads |   rank |
|--------------------------+----------+-----------------+--------|
| geventhttpclient         | gevent   |         1116639 |   1296 |
| locust                   | gevent   |         1045789 |   1345 |
| flask-socketio           | gevent   |          914846 |   1424 |
| dagster                  | gevent   |          497982 |   1974 |
| grequests                | gevent   |          365292 |   2328 |
| dedupe                   | btrees   |          358737 |   2352 |
| websocket                | gevent   |          346515 |   2397 |
| gevent-websocket         | gevent   |          322249 |   2466 |
| dagster-graphql          | gevent   |          310382 |   2525 |
| locust-plugins           | gevent   |          273452 |   2669 |
| interpret-community      | gevent   |          268129 |   2692 |
| zope-index               | btrees   |          240577 |   2836 |
| dedupe-variable-datetime | btrees   |          223080 |   2932 |
| parallel-ssh             | gevent   |          220882 |   2947 |
| azureml-interpret        | gevent   |          211233 |   3009 |
| locustio                 | gevent   |          102322 |   4203 |
| allennlp                 | gevent   |           92268 |   4498 |
| interpret                | gevent   |           91404 |   4521 |
| rasa-sdk                 | gevent   |           87875 |   4609 |
| pykafka                  | gevent   |           78679 |   4837 |
+--------------------------+----------+-----------------+--------+
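
Continuing the IPython session above, the extra step is just a lookup from each dependent back to the flagged package that pulls it in; roughly like this, where dependent_to_source is an assumed dict built while collecting the reverse dependencies from deps.dev:

# Assumed mapping collected during the deps.dev queries, e.g.:
dependent_to_source = {'locust': 'gevent', 'dedupe': 'btrees'}  # ...and so on

top20 = top20.copy()
top20.insert(1, 'source', top20['project'].map(dependent_to_source))
print(tabulate(top20, headers='keys', tablefmt='psql', showindex=False))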

Conclusion

So after all this work, what did we learn?

  • Turning on -Ofast will end up turning on -ffast-math, and a shared library built with -ffast-math will silently set FTZ/DAZ for any program unlucky enough to load it.
  • Even if you explicitly ask for no fast math, you will still get fast math as long as -Ofast is enabled.
  • It is surprisingly feasible (though perhaps not wise) for a single individual with a good internet connection to download 4 TB of Python packages and scan 11 TB of shared libraries in a single day.
  • It is definitely not wise to try to run pip download or pip install --dry-run on every package listed in PyPI, at least not without some good sandboxing, because it will execute tons of random code from setup.py files and leave you with a giant mess to clean up.
  • Because of the highly connected nature of the modern software supply chain, even though a mere 49 packages were actually built with -ffast-math, thousands of other packages, with a total of at least 9.7 million downloads over the past 30 days, are affected.

What can we actually do about it?

Well, for now you can just try to be careful about what libraries you use, perhaps with the help of the tables I generated in this post. If you have a numerical function in Python that you really don't want to be affected by this, I wrote a somewhat alarming script, ensure_fp.py, that provides a decorator named @ensure_clean_fpu_state. The decorator resets the value of MXCSR to its power-on state for the duration of the function. (I say "alarming" because it runs assembly code from Python by mapping an executable memory region and then copying in the raw code bytes; hopefully it goes without saying that it needs some work before it's production-ready.)

Some absolutely 31337 ASCII-art I put together for ensure_fp.py. Thanks to Ben Gras, Peter Goodman, and David R. MacIver on Twitter for helping me rephrase the caption so that it was fully justified to the width of that box in the top-left.
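
In the same spirit (but much simplified), here is a sketch of how raw MXCSR access from Python can work. This is not the contents of ensure_fp.py, just an illustration of the mmap-plus-ctypes trick; the byte strings are hand-assembled x86-64 stubs that read and write MXCSR, and the names read_mxcsr, write_mxcsr, and clean_fpu_state are mine. It only works on x86-64 systems that allow writable+executable anonymous mappings.

import ctypes
import mmap
from contextlib import contextmanager

# Hand-assembled stubs (x86-64, System V ABI):
#   read_mxcsr():      sub rsp,8; stmxcsr [rsp]; mov eax,[rsp]; add rsp,8; ret
#   write_mxcsr(edi):  sub rsp,8; mov [rsp],edi; ldmxcsr [rsp]; add rsp,8; ret
READ_MXCSR = bytes.fromhex('4883ec080fae1c248b04244883c408c3')
WRITE_MXCSR = bytes.fromhex('4883ec08893c240fae14244883c408c3')

_buf = mmap.mmap(-1, mmap.PAGESIZE,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
_buf.write(READ_MXCSR + WRITE_MXCSR)
_base = ctypes.addressof(ctypes.c_char.from_buffer(_buf))
read_mxcsr = ctypes.CFUNCTYPE(ctypes.c_uint32)(_base)
write_mxcsr = ctypes.CFUNCTYPE(None, ctypes.c_uint32)(_base + len(READ_MXCSR))

@contextmanager
def clean_fpu_state():
    # Run the enclosed block with MXCSR at its power-on default (0x1F80):
    # all exceptions masked, FTZ/DAZ clear.
    saved = read_mxcsr()
    write_mxcsr(0x1F80)
    try:
        yield
    finally:
        write_mxcsr(saved)

# Usage:
#   with clean_fpu_state():
#       run_numerics_that_care_about_subnormals()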


Longer-term, gcc and clang should provide more sane defaults. Ideally, -ffast-math should simply not enable FTZ/DAZ; that functionality (if anyone wants it) could be split out into its own option that -ffast-math does not imply. A less radical compromise would be to at least avoid linking in crtfastmath when building a shared library. I'm not all that optimistic about this, though, given that the relevant gcc bug report will turn 10 years old at the end of this November and is still marked NEW. Still, maybe if more people complain about it (ideally with a pull request or patch attached) it will get fixed.

Oh, and as for gevent? I decided to just replace the code that messes with MXCSR with no-ops until they get around to making a release that doesn't mess with my FPU.

Comments

Unknown said…
I just wanted to say - this is an absolutely brilliant write up. Thanks for all your hard work.
Crazy awesome work, breh! You should be knighted for this.
luciano said…
Great research. For reference, qiskit-aer and qiskit-aer-gpu fixed the issue in 0.10.4, a few months ago: https://github.com/Qiskit/qiskit-aer/pull/1469
Thanks! I've been meaning to update the post with a note indicating which packages in the list still have it enabled in their latest version.
ljbo3 said…
Great research! Note that if you need the standard FP behaviour, you can use the _MM_SET_FLUSH_ZERO_MODE and _MM_SET_DENORMALS_ZERO_MODE macros. It is as simple as writing a tiny Python extension and loading it after everything else that messes with those FTZ and DAZ flags.
hanfried said…
Very enlightening and a good, even though scary, read.

I think the arbitrary code execution whenever you touch pip is much more scary than this subtle floating point behaviour change. It's a big reminder to really always work in a sandbox (and the sandbox shouldn't just be virtualenv).

Also, I wonder how many data science/ML projects will be affected. gevent is probably very often pulled in as a side effect of using microservices. Maybe not that big a deal if the models are correctly tested in the production pipeline, but let's be honest: usually we check our models via some Jupyter-like code (without the microservice), and while in the production pipeline we check that the endpoints yield valid responses, I don't think we very often check whether the training in the production microservice is exactly the same as in the Jupyter environment. Maybe even this is not so wild, as somehow this might act more like regularisation/augmentation and shouldn't matter that much, but of course, everything untested is potentially broken.
But for anything really depending on exact numerical solutions (like a PDE solver), rather than on some statistical learning approach that includes some fault tolerance by default, this is a very good thing to know to check for.
And as usual, a really good story about why you should always take warnings seriously (very often, the only consequence is to shut off the warning, not to fix the reason for it).
