10x faster python test iteration via fork(2)

raylu
Benchling Engineering
8 min read · Jul 20, 2023


It’s ideal to get feedback on your code faster — to make a code change and see the result instantly. But, as projects get larger, reload times get longer. Each incremental dependency or bootstrap code block that adds 200ms feels worth it, but 50 of them later and it takes 10 seconds to see the result of a code change.

On the Build team at Benchling, that’s where we found ourselves one day. We depended directly on 146 packages, which pulled in 128 transitive dependencies, for a total of 274 packages. We also spent a lot of time waiting for SQLAlchemy models to initialize. The result: our test harness took 10 seconds to set up. After making a code change, you’d start the test runner, wait a few seconds, alt+tab to your browser, get distracted for a few minutes, and then find out you had a typo in your code.

This is a common challenge for a growing codebase, but it’s something we knew we needed to fix. Here’s the process we arrived at which allowed the second run of tests to start 10x faster — 90% less waiting. While it’ll work a little differently for your codebase depending on the language, dependencies, etc. you’re using, hopefully this can inspire you on your journey to faster feedback and testing.

importlib.reload()

Since the problem is that we spend so long setting up a bunch of modules just right and then want to see the change in a single file we’re editing, the most obvious solution is to use importlib.reload from the standard library.

import importlib
import sys

import test_harness_stuff  # takes 10 seconds
import tests

def rerun_tests(changed_path):
    for mod in sys.modules.values():
        # built-in modules have no __file__, hence the getattr
        if getattr(mod, '__file__', None) == changed_path:
            importlib.reload(mod)
            tests.run_tests()
            break

if __name__ == '__main__':
    setup_file_watcher(rerun_tests)
    tests.run_tests()

This (with some special handling for built-in modules, relative path resolution, and batching to handle editors that perform multiple filesystem operations per save) works alright when the file being changed is a test file (or any other leaf node in the dependency tree).

However, as you’ve probably guessed from the very long documentation for reload(), this doesn’t work in many other cases. A very common one is if you have animal.py:

cow = "woof"

and then cow_say.py:

from animal import cow

If you change cow = "woof" to cow = "moo", reloading animal.py is not enough because cow_say.py has its own global bound to the old str. After reloading animal.py, you must then reload all of its reverse dependencies in topological order. You must also ensure that if a class definition changed, all instances of that class are reinitialized. For projects of almost any complexity, this is not feasible.
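The stale-binding problem is easy to reproduce. This self-contained sketch writes the two modules from above into a temporary directory, imports them, edits animal.py, and reloads it:

```python
import importlib
import pathlib
import sys
import tempfile

# Materialize the article's two example modules in a temp directory.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "animal.py").write_text('cow = "woof"\n')
(tmp / "cow_say.py").write_text("from animal import cow\n")
sys.path.insert(0, str(tmp))

import animal
import cow_say

assert animal.cow == "woof"
assert cow_say.cow == "woof"

# Edit animal.py on disk, then reload only that module.
(tmp / "animal.py").write_text('cow = "moo"\n')
importlib.reload(animal)

assert animal.cow == "moo"    # the reloaded module sees the new value...
assert cow_say.cow == "woof"  # ...but cow_say's global still points at the old str
```

Fixing this would mean also reloading cow_say.py (and everything that imports it), which is exactly the topological-order problem described above.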

Not importing

Despite reload() not solving our problems, thinking about its issues is helpful in building a more useful solution. The giant list of caveats with reload() means you need to do surgery on the already-loaded modules.

What if we just didn’t load the code you were going to change until after you changed it? Then we wouldn’t need to do surgery! It’s not too hard to guess what code might be changed. Roughly speaking, our codebase has 3 kinds of modules: 3rd-party dependencies, SQLAlchemy models, and actual app code/tests. More than 90% of the time, we’re working in that last category, so we can just import the 3rd-party dependencies and SQLAlchemy models and not load the app/tests until we’re ready to run a test.

zeus, fork()

That leaves one problem: after we run a test, the test is loaded. How do we reset back to the state where dependencies and models were loaded but not app/tests? zeus actually solved this for Rails: load Rails, fork(), then load app code.

fork() creates a new process by duplicating the calling process. […] The child process and the parent process run in separate memory spaces. At the time of fork() both memory spaces have the same content. Memory writes […] performed by one of the processes do not affect the other.

So we can use fork() to snapshot the parent, import some code that is going to change (app/tests), and then rewind back to the snapshot later. Rather than doing surgery on in-memory modules, we can just let the child process exit, re-fork, and re-import any changed code.

import os
import sys

import test_harness_stuff  # takes 10 seconds

def run_tests():
    pid = os.fork()
    if pid == 0:  # child
        import tests
        tests.run_tests()
        sys.exit()
    else:  # parent
        os.waitpid(pid, 0)

if __name__ == '__main__':
    setup_file_watcher(run_tests)
    run_tests()

Something like this sped up our test iteration time from 10 seconds to 1 second, which is a workflow-altering speed improvement (someone told me “I wouldn’t have bothered writing this tricky test if it weren’t for the fast reloader”).

zeus actually has a multi-level process tree and, when a file changes, it identifies which level imported it and terminates that process and all its ancestors. We do this too at Benchling: we divide up our modules into tiers based on how often developers work on them and where they fall in our dependency tree and then import each tier after forking. This allows us to discard as little import work as possible when a file closer to the root of our dependency tree changes.

Our process tree
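A simplified sketch of that tiering is below. The tier contents are made up, and it cuts one big corner: in the real system the intermediate processes stay alive so they can re-fork when a file in their tier changes, whereas here each one simply exits once its child finishes.

```python
import os

# Hypothetical tiers, ordered from least to most frequently edited.
TIERS = [
    ["third_party"],    # imported at the root; almost never discarded
    ["models"],         # discarded only when a model file changes
    ["app", "tests"],   # imported last; cheap to throw away and re-import
]

def load_tier(module_names):
    for name in module_names:
        pass  # importlib.import_module(name) in a real harness

def run_in_tree(tiers, run_tests):
    # Each fork() snapshots everything imported so far, so a change in
    # tier k only discards the import work of tiers k and deeper.
    load_tier(tiers[0])
    for depth, tier in enumerate(tiers[1:]):
        pid = os.fork()
        if pid != 0:
            os.waitpid(pid, 0)
            if depth == 0:
                return       # the root process keeps its snapshot
            os._exit(0)      # intermediate snapshots just exit in this sketch
        load_tier(tier)      # child: import the next, more volatile tier
    run_tests()              # the leaf child runs the tests
    os._exit(0)
```

Calling `run_in_tree(TIERS, my_test_runner)` forks once per tier, runs the tests in the deepest child, and returns in the root once the whole subtree has exited.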

We actually ended up with some other components for ergonomics (a terminal forwarder that uses libreadline) and performance (a file watcher that can’t fork because it’s threaded).

Bonus: memory savings by not garbage collecting

Once you start running python code after os.fork(), you start running into the same memory usage problems Instagram faced. They run a Django web server and load up all their dependencies before forking the web workers. At first, they tried to solve their runaway memory usage by disabling garbage collection entirely. Later, they came up with a more elegant solution and upstreamed it into CPython 3.7.

But what caused the memory usage? In short, copy-on-write pages and reference counting.

Copy-on-write

The fork() docs say “At the time of fork() both memory spaces have the same content. Memory writes […] performed by one of the processes do not affect the other”. The simplest way to implement this is to copy all the memory from the parent into the child.

The Linux kernel doesn’t do that. Instead, it makes new page tables for the child process that point back at the parent’s memory and marks them both as read-only. When the child tries to write to any memory, it triggers a page fault. The kernel’s page fault handler looks at the page, sees that it was a copy-on-write page, makes an actual copy of the page, and lets the child retry the write operation.

Parent and child processes sharing the same physical memory
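The man-page behavior is easy to observe directly. In this sketch the child reports back over a pipe, since its memory writes are invisible to the parent:

```python
import os

data = ["shared"]               # allocated before fork: lives in a shared page
r, w = os.pipe()
pid = os.fork()
if pid == 0:                    # child
    data.append("child-only")   # first write: the kernel copies the page
    os.write(w, str(len(data)).encode())
    os._exit(0)
child_len = int(os.read(r, 16))  # parent: read the child's view over the pipe
os.waitpid(pid, 0)

assert child_len == 2   # the child saw its own modified copy
assert len(data) == 1   # the parent's memory is untouched
```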

As you can imagine, this saves a lot of memory (and makes fork quite a bit faster). So what’s the problem? The child rarely writes to any modules imported by the parent (the app/tests code rarely makes any changes to SQLAlchemy models or 3rd-party dependencies); it only reads them and calls functions defined in them.

gc_refs

Python’s garbage collector needs to know which objects are safe to free. To do this, every object has a gc_refs field stored in its header that is incremented whenever it is referred to (for example, added to a list).

This means that if a module imported by our parent process defines a str and we later read that str in the child (which we do all the time), we will modify its object header to increment the ref count and trigger the kernel’s copy-on-write behavior.

Child process with its own memory after incrementing gc_refs
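You can watch this header write happen with sys.getrefcount (CPython-specific; the reported count includes the temporary reference created by the call itself):

```python
import sys

obj = ["module-level", "data"]  # stands in for an object imported before fork()
before = sys.getrefcount(obj)
alias = obj                     # a plain read/bind bumps the refcount...
after = sys.getrefcount(obj)

# ...and that increment is a write to obj's header, which in a forked
# child would dirty (and therefore copy) the page the object lives on.
assert after == before + 1
```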

gc.freeze()

Instagram’s solution to this problem is to (rewrite all the CPython code that looks at gc_refs, introduce a new API, and then) call gc.freeze(). This tells the interpreter that all existing objects should be considered ineligible for garbage collection and future accesses shouldn’t increment the ref counter. (The new object header layout, after Instagram’s changes in 3.7 and after another change in 3.12, is documented here.)

Adopting this is very easy: just call gc.freeze() right before you fork()! Running a typical test, we saw a 160 MiB reduction in unique set size.
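A minimal sketch of where the call goes (gc.get_freeze_count() and gc.unfreeze() are part of the same API that landed in 3.7):

```python
import gc
import os

# ...all the expensive imports happen here, in the parent...

gc.freeze()  # move every object allocated so far into the permanent
             # generation: the collector will never examine it again
assert gc.get_freeze_count() > 0

pid = os.fork()
if pid == 0:
    # child: run the tests; frozen objects stay untouched in shared pages
    os._exit(0)
os.waitpid(pid, 0)
```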

Don’t gc.collect()!

Now that you’re thinking about the garbage collector, you might be tempted to call gc.collect() right before freezing and forking. It sounds like it would save memory — otherwise, objects with no refs in the parent will stick around forever in both the parent and the child. Unfortunately, that’s a bad idea.

When the garbage collector actually “collects” something, the object allocator “frees” that object’s memory. This doesn’t return any memory back to the system; it simply marks that memory as unused. It also creates a “hole” in the memory. A later allocation can fill that hole by using that freed memory.

If we think about what happens in the child after GC has created “holes” in the memory, we realize that the child will fill those holes in copy-on-write pages. In your development environment, your pages are likely 4 KiB. If you free a 1 KiB object, in the absolute best case it sits entirely within a single page and you replace it with another 1 KiB worth of objects. When the child tries to allocate that 1 KiB, the kernel copies the entire 4 KiB page: you spent 3 KiB to save 1 KiB.

3 KiB used to save 1 KiB

This is why Instagram actually disables GC entirely in the parent. In their words, “we’re wasting a bit of memory in the shared pages to save a lot of memory later (that would otherwise be wasted on copying entire pages after forking).”
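Putting the two pieces together, the parent-side setup looks roughly like this (a sketch of the pattern, not our exact harness):

```python
import gc
import os

gc.disable()  # parent: stop freeing objects, so no holes appear in shared pages
# ...imports and test-harness setup go here...
gc.freeze()   # everything loaded so far is now permanent

pid = os.fork()
if pid == 0:
    gc.enable()  # child: its new garbage lands in its own pages anyway
    # ...import app/tests and run them...
    os._exit(0)
os.waitpid(pid, 0)

# the parent keeps GC disabled so its shared pages stay pristine
assert not gc.isenabled()
```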

General applicability

The approach we’ve described here solves a problem that we think a lot of others face — if you rack up enough dependencies, you probably have slow startup/reload times. It works on any system with fork (everything except Windows, unless you’re under WSL). There are a few caveats, though:

  • You need to be able to fork and then continue executing your code. Some runtimes, such as Node.js, don’t offer this out of the box, so you may need platform-specific C extensions.
  • Your language needs to be able to dynamically load modules at runtime. This is pretty tricky for most compiled languages.
  • If you want this to work for your webserver (like zeus), it’s a bit more work. You need to integrate with your WSGI/rack/etc. server to handle requests in a properly set up child process. Each server is different, so we don’t have any general advice for how to do this.

Also, the benefits are only realized after you separate out modules based on their position in the dependency tree and frequency of edit. Because this is going to be different for everyone, we don’t have much code to share. We undertook this project because we noticed that SQLAlchemy models were close to the root of our dependency tree and took up the majority of startup time, but your mileage may vary.

We’re hiring!

If you’re interested in working with us to solve complex engineering problems, check out our careers page or contact us!
