OpenSSL hangs CPU with Python <= 2.7.3 on Windows

by Ben on April 25, 2013

If you run Python on Windows and your programs or servers allocate a lot of items on the heap (we do both), you should upgrade to Python 2.7.4, especially if you do anything with HTTPS/SSL connections.

The Windows builds of Python 2.7.3 and below use an older version of OpenSSL, which has a serious bug that can cause minutes-long, CPU-bound hangs in your Python process. Apart from the process taking over your CPU, the symptom we saw was a socket.error with the message “[Errno 10054] An existing connection was forcibly closed by the remote host”. This happens because the HTTPS request is opened before the OpenSSL hang kicks in, and the hang lasts so long that the remote server times out and closes the connection.

The cause of the bug is actually quite arcane: the Windows version of OpenSSL uses a Win32 function called Heap32Next to walk the heap and generate random data for cryptographic purposes.

However, a call to Heap32Next is O(N) if there are N items in the heap, so walking the whole heap is an O(N²) operation! If you’ve got 10 million items on the heap, this takes about 5 minutes. The first connection to an HTTPS server (which uses OpenSSL) essentially brings Python to a grinding halt for this time.
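To see where the quadratic cost comes from, here is a toy model in Python. It is only a sketch: the names next_call_cost and heap_walk_cost are made up for illustration, and the assumption that reaching each entry means rescanning from the start of the heap is a simplification, not a description of the real Win32 code.

```python
def next_call_cost(position):
    # Toy assumption: finding the entry after `position` rescans the heap
    # from the start, so the work grows with the position.
    return position + 1


def heap_walk_cost(n_items):
    # Total work to visit every entry once: 1 + 2 + ... + N = N * (N + 1) / 2.
    return sum(next_call_cost(p) for p in range(n_items))


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print("%d items -> %d steps" % (n, heap_walk_cost(n)))
```

In this model, doubling the number of items roughly quadruples the total work, which is why a heap with millions of entries turns a quick walk into minutes.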

There’s a workaround: call the ssl.RAND_status() function on startup, before you’ve allocated the big data on your heap. That seemed to fix it, though we didn’t dig too deep to guarantee the fix. We were still running on Python 2.6, and given that the just-released 2.7.4 addressed this issue by using a newer version of OpenSSL, we fixed this by simply upgrading to Python 2.7.4. Note that even Python 2.7.3 has the older version of OpenSSL, so be careful.
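A minimal sketch of that workaround might look like the following. The load_big_cache function is a hypothetical stand-in for whatever fills your heap, and the idea that calling ssl.RAND_status() early gets OpenSSL's seeding done while the heap is still small is our assumption, not something we verified in depth.

```python
import ssl

# Do this first, before any large allocations. The return value says whether
# OpenSSL considers its random number generator seeded.
seeded = ssl.RAND_status()
print("OpenSSL PRNG seeded: %s" % bool(seeded))


def load_big_cache():
    # Hypothetical placeholder for the application's big in-memory data.
    return [{i: i} for i in range(1000)]


# Only now build the large data structures.
cache = load_big_cache()
```

The ordering is the whole point here: the call has to come before the big allocations, so it belongs at the very top of your startup code.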

Other interesting things we found while hunting down this bug:

  • At first we thought this was a bug in Python’s SSL handling, and it turns out there’s a strangely similar bug in Python 2.6’s SSL module. This was interesting, but it wasn’t our problem.
  • Microsoft’s Raymond Chen has a very good historical explanation of why walking the heap with Heap32Next is O(N²), and why OpenSSL shouldn’t really be using this function.
  • You can reproduce the Heap32Next hang just by allocating a ton of Python objects (e.g. x = [{i: i} for i in range(500000)]) and then watching the first HTTPS request take ages, with the CPU sitting at around 100%.
  • A blog post with graphs showing Heap32Next’s O(N) behaviour, as well as the connection to OpenSSL.
  • What’s new in Python 2.7.4 notes the update to the bug-fixed OpenSSL version 0.9.8y on Windows.
  • This is the second bug we’ve found due to running something of an eccentric architecture (6GB of website data cached in Python dicts). The other one was related to garbage collection, and incidentally the handling of that was improved in Python 2.7 too. Yes, I know, somebody will leave a comment about how we should be using memcached for this, and they’d probably be right, except for this. :-)
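The reproduction mentioned in the list above can be sketched as a small script. This is an illustration under assumptions: the helper names (allocate, timed, fetch) are made up, the URL is whatever HTTPS address you pass on the command line, and the script is written to run on either Python 2 or Python 3.

```python
import sys
import time

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def allocate(count):
    # Same shape of data as in the post: many small dicts on the heap.
    return [{i: i} for i in range(count)]


def timed(func, *args):
    # Return (result, elapsed seconds) for a single call.
    start = time.time()
    result = func(*args)
    return result, time.time() - start


def fetch(url):
    # Fetch the URL and return the number of bytes read.
    response = urlopen(url)
    try:
        return len(response.read())
    finally:
        response.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    url = sys.argv[1]  # an HTTPS URL of your choice
    data = allocate(500000)  # keep a reference so the objects stay alive
    for attempt in (1, 2):
        size, seconds = timed(fetch, url)
        print("request %d: %d bytes in %.1f s" % (attempt, size, seconds))
```

On an affected Windows build we would expect the first request to be dramatically slower than the second. On other platforms or newer Python versions the two timings should be similar.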
