NinetyFive Blog

Engineering insights and research from the NinetyFive team

How Python's Garbage Collection Works

November 2025

While debugging long pauses in our inference server, we found a bug in the garbage collector's allocation counting. Tracking it down meant understanding how Python's garbage collection works, and it turns out the source is more readable than you'd expect.

The two-part system

Python uses two complementary mechanisms for memory management.

  1. Reference counting: frees objects immediately when no references remain
  2. Cyclic garbage collection: detects and frees reference cycles

Reference counting handles most objects. The cyclic collector only runs periodically to catch cycles that reference counting can't detect.

Reference counting

Every Python object has a reference count stored in ob_refcnt. When you assign an object to a variable, pass it to a function, or store it in a container, the count increases. When a reference goes away, it decreases.
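You can watch this from Python itself with sys.getrefcount (a quick sketch; the exact numbers vary, since the call itself temporarily adds one reference):

import sys

x = object()
print(sys.getrefcount(x))    # often 2: the variable x plus the temporary argument

container = [x]              # storing it in a list adds a reference
print(sys.getrefcount(x))    # one higher than before

del container                # dropping the list removes that reference
print(sys.getrefcount(x))    # back to the original value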

The core operations are defined in Include/refcount.h (v3.14.0). Simplified, they look like this:

// Increment reference count
static inline void Py_INCREF(PyObject *op)
{
    op->ob_refcnt++;
}

// Decrement reference count, free if zero
static inline void Py_DECREF(PyObject *op)
{
    if (--op->ob_refcnt == 0) {
        _Py_Dealloc(op);
    }
}

When the count hits zero, _Py_Dealloc is called immediately to free the object, without garbage collection.
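A quick way to see that immediacy (a sketch; the Finalized class is just for illustration):

import gc

class Finalized:
    def __del__(self):
        print("freed")

gc.disable()          # the cyclic collector is out of the picture
obj = Finalized()
del obj               # prints "freed" right away: the refcount hit zero
gc.enable()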

The problem with cycles

Reference counting fails when objects reference each other in a cycle.

a = []
b = []
a.append(b)  # a -> b
b.append(a)  # b -> a
del a, b     # Both have refcount=1, never freed!

After del a, b, both lists still have a reference count of 1 (from each other), so reference counting won't free them. This is where the cyclic garbage collector comes in.
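You can confirm the cyclic collector picks these up (a sketch using gc.collect's return value, which is the number of unreachable objects it found):

import gc

gc.disable()                 # make the collection moment explicit
a, b = [], []
a.append(b)
b.append(a)
del a, b                     # refcounts stay at 1, nothing is freed yet

print(gc.collect())          # >= 2: both lists were unreachable and get reclaimed
gc.enable()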

Cyclic garbage collection

The cyclic collector periodically scans container objects (lists, dicts, classes, etc.) to catch unreachable cycles. The core algorithm is in Python/gc.c#L1694-L1697 (v3.14.0), and its comment even tells you to start reading there.

/* This is the main function. Read this to understand how the
 * collection process works. */
static void
gc_collect_region(PyThreadState *tstate,
                  PyGC_Head *from,
                  PyGC_Head *to,
                  struct gc_collection_stats *stats)

In short: copy each object's reference count into a scratch field, then traverse the references between candidate objects and decrement the scratch counts. Objects left with a positive count are referenced from outside the candidate set; everything still reachable from them is kept, and whatever remains is only reachable from within the cycle, aka garbage.
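Here is a toy re-creation of that idea in pure Python (my own sketch, not CPython's code): subtract references that come from inside the candidate set, then keep anything still reachable from a node with a positive count.

# Toy model of the cycle-detection pass (illustrative only, not CPython's code).
# graph maps each object name to the names it references; refcounts are the
# real reference counts before the pass starts.
def find_garbage(graph, refcounts):
    # Step 1: copy refcounts into a scratch field and subtract internal references.
    gc_refs = dict(refcounts)
    for obj, referents in graph.items():
        for ref in referents:
            gc_refs[ref] -= 1

    # Step 2: anything with a positive count is referenced from outside the set;
    # everything reachable from it is also alive.
    alive = set()
    stack = [obj for obj, n in gc_refs.items() if n > 0]
    while stack:
        obj = stack.pop()
        if obj in alive:
            continue
        alive.add(obj)
        stack.extend(graph[obj])

    # Whatever is left is only reachable from inside the cycle: garbage.
    return set(graph) - alive


# The a <-> b cycle from above, with no external references left:
graph = {"a": ["b"], "b": ["a"]}
refcounts = {"a": 1, "b": 1}
print(find_garbage(graph, refcounts))   # {'a', 'b'}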

When does garbage collection run?

Python optimizes its garbage collector with generational collection because most objects die young. Focusing on younger objects reduces the number of objects each collection pass needs to scan. Objects start in generation 0 (young), and survivors get promoted to generation 1 (old, pending), then 2 (old, visited). Each generation has its own threshold, defined at Include/internal/pycore_runtime_init.h#L140-L144 (v3.14.0).

.young = { .threshold = 2000, },  // gen 0 (young)
.old = {
    { .threshold = 10, },             // gen 1 (old, pending)
    { .threshold = 0, },              // gen 2 (old, visited)
},
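From Python, these knobs are visible through the gc module (a quick sketch; the exact defaults depend on the interpreter version and build):

import gc

print(gc.get_threshold())    # per-generation thresholds, e.g. (2000, 10, 0) on this build
print(gc.get_count())        # current allocation/survivor counts per generation

gc.set_threshold(5000)       # raise the young threshold: fewer, larger collections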

How the allocation count is tracked, however, differs between the standard build and the free-threaded build.

The standard build relies on the Global Interpreter Lock, so only one thread runs Python code at a time and a single global counter can safely track allocation counts. The free-threaded build introduced in Python 3.13 removes that lock, so multiple threads can allocate simultaneously and would race when updating that counter. A new garbage collector implementation was needed, but having every allocation update a single shared counter would leave threads constantly contending for it, essentially reintroducing the bottleneck the lock's removal was meant to eliminate.

The solution? Maintain a local count per thread and periodically flush it to the global count.
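The same buffering pattern, sketched in Python with a lock standing in for the atomic add (illustrative only; CPython does this in C with atomics, as shown in the next section):

import threading

FLUSH_THRESHOLD = 512

global_count = 0
global_lock = threading.Lock()
local = threading.local()

def record_allocation():
    global global_count
    # Buffer updates in a per-thread counter; touch the shared one only occasionally.
    local.count = getattr(local, "count", 0) + 1
    if local.count >= FLUSH_THRESHOLD:
        with global_lock:
            global_count += local.count
        local.count = 0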

Thread-local allocation counting

In Python/gc_free_threading.c#L66-L69 (v3.14.0), a threshold is defined for each thread's local count.

// Each thread buffers the count of allocated objects in a thread-local
// variable up to +/- this amount to reduce the overhead of updating
// the global count.
#define LOCAL_ALLOC_COUNT_THRESHOLD 512

As the comment suggests, when a thread's local count reaches +/-512, it flushes to the global count in Python/gc_free_threading.c#L2106-L2139 (v3.14.0).

static void
record_allocation(PyThreadState *tstate)
{
    struct _gc_thread_state *gc = &((_PyThreadStateImpl *)tstate)->gc;

    gc->alloc_count++;
    if (gc->alloc_count >= LOCAL_ALLOC_COUNT_THRESHOLD) {
        GCState *gcstate = &tstate->interp->gc;
        _Py_atomic_add_int(&gcstate->young.count, (int)gc->alloc_count);
        gc->alloc_count = 0;

        if (gc_should_collect(gcstate)) {
            _Py_ScheduleGC(tstate);
        }
    }
}

static void
record_deallocation(PyThreadState *tstate)
{
    struct _gc_thread_state *gc = &((_PyThreadStateImpl *)tstate)->gc;

    gc->alloc_count--;
    if (gc->alloc_count <= -LOCAL_ALLOC_COUNT_THRESHOLD) {
        GCState *gcstate = &tstate->interp->gc;
        _Py_atomic_add_int(&gcstate->young.count, (int)gc->alloc_count);
        gc->alloc_count = 0;
    }
}

Then, when the global count exceeds the threshold in Python/gc_free_threading.c#L2084-L2104 (v3.14.0), collection is triggered.

static bool
gc_should_collect(GCState *gcstate)
{
    int count = _Py_atomic_load_int_relaxed(&gcstate->young.count);
    int threshold = gcstate->young.threshold;
    if (count <= threshold || threshold == 0) {
        return false;
    }
    // ...
    return true;
}

The bug we found

Python has an API to access these counters, gc.get_count(), which returns a tuple of (gen0, gen1, gen2) counts. On our server, however, we were seeing negative counts like (-2311926, 0, 0). This shouldn't be possible, and it's problematic: with a count of -2311926, it would take another 2311926 + 2000 allocations before the count reaches the threshold and garbage collection triggers again. Pauses between garbage collections kept increasing, and each subsequent collection started taking multiple seconds to complete, grinding our server to a halt.
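A simple way to catch the symptom is to log the counters periodically (a sketch of how you might watch for it; the helper and interval are ours, not anything from CPython):

import gc
import threading
import time

def log_gc_counts(interval=10):
    # Periodically print the per-generation counters; a large negative first
    # value means the young-generation counter has been driven below zero.
    while True:
        print("gc counts:", gc.get_count())
        time.sleep(interval)

threading.Thread(target=log_gc_counts, daemon=True).start()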

The standard build gave us a clue: it also decrements on deallocation, but only when the count is positive. This prevents the negative spiral we observed; see Python/gc.c#L2355-L2381 (v3.14.0).

void
PyObject_GC_Del(void *op)
{
    // ...
    PyGC_Head *g = AS_GC(op);
    if (_PyObject_GC_IS_TRACKED(op)) {
        gc_list_remove(g);
        // ...
    }
    GCState *gcstate = get_gc_state();
    if (gcstate->young.count > 0) {
        gcstate->young.count--;
    }
    // ...
    PyObject_Free(op);
}

The root cause of the bug: the free-threaded record_deallocation function in Python/gc_free_threading.c#L2128-L2139 (v3.14.0) implemented a different strategy. Instead of checking if the count is positive before decrementing, it always decremented and flushed negative values to the global count.

static void
record_deallocation(PyThreadState *tstate)
{
    struct _gc_thread_state *gc = &((_PyThreadStateImpl *)tstate)->gc;

    gc->alloc_count--;
    if (gc->alloc_count <= -LOCAL_ALLOC_COUNT_THRESHOLD) {
        GCState *gcstate = &tstate->interp->gc;
        _Py_atomic_add_int(&gcstate->young.count, (int)gc->alloc_count);  // flushes negative value
        gc->alloc_count = 0;
    }
}

The problem scenario

  1. Multiple threads each have local_count = 300
  2. Garbage collection runs, resets global = 0 (but local counts stay at 300)
  3. Each thread deallocates 812 objects: local = 300 - 812 = -512
  4. Threshold hit, each thread flushes -512 to global
  5. With N threads: global = -512 * N

The compounding happens because garbage collection first resets the global count to zero, then frees objects. Each freed object calls record_deallocation(), driving local counts negative. These negative values get flushed to global without any clamping, and each cycle adds more negative values.
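The arithmetic is easy to simulate (my own sketch, single-threaded for clarity): flushing raw negative values drags the global count further down each collection cycle, while clamping at zero keeps it sane.

# Simulate N threads flushing their local counts right after a collection reset
# the global counter to zero (illustrative numbers from the scenario above).
def flush_unclamped(global_count, local_counts):
    for local_count in local_counts:
        global_count += local_count
    return global_count

def flush_clamped(global_count, local_counts):
    for local_count in local_counts:
        global_count = max(0, global_count + local_count)
    return global_count

locals_after_dealloc = [-512] * 8                # 8 threads, each flushing -512

print(flush_unclamped(0, locals_after_dealloc))  # -4096, and it keeps compounding
print(flush_clamped(0, locals_after_dealloc))    # 0: the fix clamps at zero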

The fix allows thread-local counts to go negative but clamps the global counter to non-negative using a compare-and-exchange loop.

 static void
 record_deallocation(PyThreadState *tstate)
 {
     struct _gc_thread_state *gc = &((_PyThreadStateImpl *)tstate)->gc;

     gc->alloc_count--;
     if (gc->alloc_count <= -LOCAL_ALLOC_COUNT_THRESHOLD) {
         GCState *gcstate = &tstate->interp->gc;
-        _Py_atomic_add_int(&gcstate->young.count, (int)gc->alloc_count);
+        int count = _Py_atomic_load_int_relaxed(&gcstate->young.count);
+        int new_count;
+        do {
+            if (count == 0) {
+                break;
+            }
+            new_count = count + (int)gc->alloc_count;
+            if (new_count < 0) {
+                new_count = 0;
+            }
+        } while (!_Py_atomic_compare_exchange_int(&gcstate->young.count,
+                                                  &count,
+                                                  new_count));
         gc->alloc_count = 0;
     }
 }

The loop loads the current count, computes a new value (clamped to zero if it would go negative), and attempts to swap it in atomically. If the global count is already zero there is nothing to subtract, so it bails out early; if another thread modified the count in the meantime, the compare-and-exchange fails and we retry.

The impact

With the fix applied, garbage collection runs more frequently but each pass completes quickly. Before the fix, our server would go minutes between collections, accumulating millions of objects and resulting in multi-second pauses. Now collections happen every few thousand allocations as intended, and with our fix merged, yours will too.

You can see this with gc.set_debug(gc.DEBUG_STATS), which prints statistics after each collection.
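Enabling it is a one-liner (the stats below go to stderr):

import gc

gc.set_debug(gc.DEBUG_STATS)   # print per-collection stats like the traces below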

Before:

gc: collecting generation 0...
gc: objects in each generation: 4892745 0 0
gc: objects in permanent generation: 0
gc: done, 3.127s elapsed

After:

gc: collecting generation 0...
gc: objects in each generation: 2145 0 0
gc: objects in permanent generation: 0
gc: done, 0.001s elapsed

Many fast collections instead of a few slow ones.

Love this shit? Join us on Discord or email hello@ninetyfive.gg.