NinetyFive Blog
Engineering insights and research from the NinetyFive team
How Python's Garbage Collection Works
November 2025
While debugging long pauses in our inference server, we found a bug in the garbage collector's allocation counting. Tracking it down meant understanding how Python's garbage collection works, and it turns out the source is more readable than you'd expect.
The two-part system
Python uses two complementary mechanisms for memory management.
- Reference counting: frees objects immediately when no references remain
- Cyclic garbage collection: detects and frees reference cycles
Reference counting handles most objects. The cyclic collector only runs periodically to catch cycles that reference counting can't detect.
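You can see this division of labor from Python with gc.is_tracked, which reports whether the cyclic collector is watching an object at all. A quick sketch (results may vary slightly between versions):
import gc
# Atomic objects can never form a cycle, so the cyclic collector ignores them;
# only container objects get tracked.
print(gc.is_tracked(42))        # False: ints are managed by refcounting alone
print(gc.is_tracked("hello"))   # False: same for strings
print(gc.is_tracked([]))        # True: lists can participate in cycles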
Reference counting
Every Python object has a reference count stored in ob_refcnt. When you assign an object to a variable, pass it to a function, or store it in a container, the count increases. When a reference goes away, it decreases.
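You can watch the count change with sys.getrefcount; note that the call itself briefly adds one reference, so the exact numbers below are illustrative:
import sys

x = object()
print(sys.getrefcount(x))   # e.g. 2: the variable x plus the temporary argument

container = [x]             # storing x in a list adds a reference
print(sys.getrefcount(x))   # one higher than before

del container               # dropping the list removes that reference again
print(sys.getrefcount(x))   # back to the previous value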
The core operations are defined in Include/refcount.h (v3.14.0); simplified, they look like this:
// Increment reference count
static inline void Py_INCREF(PyObject *op)
{
op->ob_refcnt++;
}
// Decrement reference count, free if zero
static inline void Py_DECREF(PyObject *op)
{
if (--op->ob_refcnt == 0) {
_Py_Dealloc(op);
}
}
When the count hits zero, _Py_Dealloc is called immediately to free the object, without garbage collection.
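The immediacy is easy to observe in CPython: a finalizer runs the moment the last reference disappears, with no collector pass involved. A tiny sketch:
class Resource:
    def __del__(self):
        print("freed")

r = Resource()
print("dropping the last reference")
del r            # refcount hits zero, so "freed" prints right here
print("done")    # output order: dropping the last reference, freed, done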
The problem with cycles
Reference counting fails when objects reference each other in a cycle.
a = []
b = []
a.append(b) # a -> b
b.append(a) # b -> a
del a, b     # Both have refcount=1, never freed!
After del a, b, both lists still have a reference count of 1 (from each other), so reference counting won't free them. This is where the cyclic garbage collector comes in.
Cyclic garbage collection
The cyclic collector periodically probes container objects (lists, dicts, classes, etc.) to catch unreachable cycles. The core algorithm is in Python/gc.c, lines 1694-1697 (v3.14.0), and even comes with a comment telling you to start there.
/* This is the main function. Read this to understand how the
* collection process works. */
static void
gc_collect_region(PyThreadState *tstate,
PyGC_Head *from,
PyGC_Head *to,
struct gc_collection_stats *stats)
In short: copy each object's reference count to a temporary field, then traverse references and decrement counts. Any object left with count zero is only reachable from within the cycle, aka garbage.
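To watch the collector do this, here is a small sketch; it uses class instances rather than the lists from the earlier example only because plain lists can't be weak-referenced:
import gc
import weakref

class Node:
    pass

a, b = Node(), Node()
a.other, b.other = b, a     # a <-> b reference cycle
probe = weakref.ref(a)      # lets us observe when a is actually freed

del a, b
print(probe() is None)      # False: refcounting alone can't break the cycle

gc.collect()                # the cyclic collector finds the unreachable pair
print(probe() is None)      # True: the cycle has been reclaimed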
When does garbage collection run?
Python optimizes its garbage collector with generational collection because most objects die young. Focusing on younger objects reduces the number of objects each collection pass needs to scan. Objects start in generation 0 (young), and survivors get promoted to generation 1 (old, pending), then 2 (old, visited). Each generation has its own threshold, defined in Include/internal/pycore_runtime_init.h, lines 140-144 (v3.14.0).
.young = { .threshold = 2000, }, // gen 0 (young)
.old = {
{ .threshold = 10, }, // gen 1 (old, pending)
{ .threshold = 0, }, // gen 2 (old, visited)
},
Tracking the allocation count, however, differs between the standard build and the free-threaded build.
- Standard build (with GIL): Python/gc.c (v3.14.0)
- Free-threaded build: Python/gc_free_threading.c (v3.14.0)
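On either build, the thresholds and the live counters are visible from Python through the gc module; exact values depend on your version and build:
import gc

print(gc.get_threshold())   # the three generation thresholds, e.g. (2000, 10, 0)
print(gc.get_count())       # current counts, e.g. (1523, 0, 0); gen 0 fills first

gc.set_threshold(5000)      # the young-generation threshold is tunable if needed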
The standard build relies on the Global Interpreter Lock, so only one thread runs Python code at a time. A single global counter can safely track allocation counts. When the free-threaded build was introduced in Python 3.13, removing the lock meant multiple threads could now allocate simultaneously and race when updating that counter. A new garbage collector implementation was needed, but updating a global counter on every allocation would block threads waiting for access, essentially reintroducing the Global Interpreter Lock.
The solution? Maintain a local count per thread and periodically flush it to the global count.
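The shape of that pattern is easy to show in pure Python; this is only a rough analogy (CPython does it in C with an atomic add rather than a lock):
import threading

FLUSH_THRESHOLD = 512          # mirrors LOCAL_ALLOC_COUNT_THRESHOLD
global_count = 0
global_lock = threading.Lock()
local = threading.local()      # one buffered counter per thread

def record_allocation():
    # Buffer increments thread-locally and touch the shared counter
    # only once every FLUSH_THRESHOLD allocations.
    global global_count
    count = getattr(local, "count", 0) + 1
    if count >= FLUSH_THRESHOLD:
        with global_lock:      # stand-in for the atomic add in CPython
            global_count += count
        count = 0
    local.count = count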
Thread-local allocation counting
In Python/gc_free_threading.c, lines 66-69 (v3.14.0), a threshold is defined for each thread's local count.
// Each thread buffers the count of allocated objects in a thread-local
// variable up to +/- this amount to reduce the overhead of updating
// the global count.
#define LOCAL_ALLOC_COUNT_THRESHOLD 512
As the comment suggests, when a thread's local count reaches +/-512, it flushes to the global count in Python/gc_free_threading.c, lines 2106-2139 (v3.14.0).
static void
record_allocation(PyThreadState *tstate)
{
struct _gc_thread_state *gc = &((_PyThreadStateImpl *)tstate)->gc;
gc->alloc_count++;
if (gc->alloc_count >= LOCAL_ALLOC_COUNT_THRESHOLD) {
GCState *gcstate = &tstate->interp->gc;
_Py_atomic_add_int(&gcstate->young.count, (int)gc->alloc_count);
gc->alloc_count = 0;
if (gc_should_collect(gcstate)) {
_Py_ScheduleGC(tstate);
}
}
}
static void
record_deallocation(PyThreadState *tstate)
{
struct _gc_thread_state *gc = &((_PyThreadStateImpl *)tstate)->gc;
gc->alloc_count--;
if (gc->alloc_count <= -LOCAL_ALLOC_COUNT_THRESHOLD) {
GCState *gcstate = &tstate->interp->gc;
_Py_atomic_add_int(&gcstate->young.count, (int)gc->alloc_count);
gc->alloc_count = 0;
}
}
Then, when the global count exceeds the threshold (the check is in Python/gc_free_threading.c, lines 2084-2104, v3.14.0), collection is triggered.
static bool
gc_should_collect(GCState *gcstate)
{
int count = _Py_atomic_load_int_relaxed(&gcstate->young.count);
int threshold = gcstate->young.threshold;
if (count <= threshold || threshold == 0) {
return false;
}
// ...
return true;
}
The bug we found
Python has an API to access these counters, gc.get_count(), which returns a tuple of (gen0, gen1, gen2) counts. However, on our server, we were getting negative counts like (-2311926, 0, 0). This shouldn't be possible, and it's problematic: with a count of -2311926, it would take another 2311926 + 2000 allocations before the count reaches the threshold and garbage collection triggers again. The gaps between garbage collections kept growing, and each collection started to take multiple seconds to complete, grinding our server to a halt.
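A minimal watcher makes the drift easy to spot; this is a sketch, not our production monitoring:
import gc
import time

for _ in range(30):
    gen0, gen1, gen2 = gc.get_count()
    # On a healthy process gen0 cycles between 0 and the threshold;
    # on an affected free-threaded build it drifts far below zero.
    print(f"gen0={gen0} gen1={gen1} gen2={gen2}")
    time.sleep(1.0)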
The standard build gave us a clue: it also decrements on deallocation, but only when the count is positive. This prevents the negative spiral we observed. The relevant code is PyObject_GC_Del in Python/gc.c, lines 2355-2381 (v3.14.0):
void
PyObject_GC_Del(void *op)
{
// ...
PyGC_Head *g = AS_GC(op);
if (_PyObject_GC_IS_TRACKED(op)) {
gc_list_remove(g);
// ...
}
GCState *gcstate = get_gc_state();
if (gcstate->young.count > 0) {
gcstate->young.count--;
}
// ...
PyObject_Free(op);
}
The root cause of the bug: the free-threaded record_deallocation function in Python/gc_free_threading.c, lines 2128-2139 (v3.14.0), implemented a different strategy. Instead of checking if the count is positive before decrementing, it always decremented and flushed negative values to the global count.
static void
record_deallocation(PyThreadState *tstate)
{
struct _gc_thread_state *gc = &((_PyThreadStateImpl *)tstate)->gc;
gc->alloc_count--;
if (gc->alloc_count <= -LOCAL_ALLOC_COUNT_THRESHOLD) {
GCState *gcstate = &tstate->interp->gc;
_Py_atomic_add_int(&gcstate->young.count, (int)gc->alloc_count); // flushes negative value
gc->alloc_count = 0;
}
}
The problem scenario
- Multiple threads each have local_count = 300
- Garbage collection runs, resets global = 0 (but local counts stay at 300)
- Each thread deallocates 812 objects: local = 300 - 812 = -512
- Threshold hit, each thread flushes -512 to global
- With N threads: global = -512 * N
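With, say, 8 threads (a number picked purely for illustration), one round of flushes already pushes the next collection far out:
threads = 8                          # assumed for illustration
global_count = -512 * threads        # after every thread flushes its buffer
threshold = 2000                     # default young-generation threshold
print(threshold - global_count)      # 6096 allocations before the next GC,
                                     # instead of the usual 2000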
The compounding happens because garbage collection first resets the global count to zero, then frees objects. Each freed object calls record_deallocation(), driving local counts negative. These negative values get flushed to global without any clamping, and each cycle adds more negative values.
The fix allows thread-local counts to go negative but clamps the global counter to non-negative using a compare-and-exchange loop.
static void
record_deallocation(PyThreadState *tstate)
{
struct _gc_thread_state *gc = &((_PyThreadStateImpl *)tstate)->gc;
gc->alloc_count--;
if (gc->alloc_count <= -LOCAL_ALLOC_COUNT_THRESHOLD) {
GCState *gcstate = &tstate->interp->gc;
- _Py_atomic_add_int(&gcstate->young.count, (int)gc->alloc_count);
+ int count = _Py_atomic_load_int_relaxed(&gcstate->young.count);
+ int new_count;
+ do {
+     if (count == 0) {
+         break;
+     }
+     new_count = count + (int)gc->alloc_count;
+     if (new_count < 0) {
+         new_count = 0;
+     }
+ } while (!_Py_atomic_compare_exchange_int(&gcstate->young.count,
+                                           &count,
+                                           new_count));
gc->alloc_count = 0;
}
}
The loop atomically loads the current count, computes a new value (clamped to zero if negative), and attempts to swap it in. If another thread modified the count in the meantime, the compare-and-exchange fails and we retry.
The impact
With the fix applied, garbage collection runs more frequently, but each pass completes quickly. Before the fix, our server would go minutes between collections, accumulating millions of objects and resulting in multi-second pauses. Now collections happen every few thousand allocations as intended, and with our fix merged, yours will too.
You can see this with gc.set_debug(gc.DEBUG_STATS), which prints statistics after each collection.
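Turning it on takes two lines; the report goes to stderr after every collection, and a forced gc.collect() produces one immediately:
import gc

gc.set_debug(gc.DEBUG_STATS)   # report generation sizes and elapsed time per collection
gc.collect()                   # force a collection to see a report right away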
Before:
gc: collecting generation 0...
gc: objects in each generation: 4892745 0 0
gc: objects in permanent generation: 0
gc: done, 3.127s elapsed
After:
gc: collecting generation 0...
gc: objects in each generation: 2145 0 0
gc: objects in permanent generation: 0
gc: done, 0.001s elapsed
Many fast collections instead of few slow ones.
Love this shit? Join us on Discord or email hello@ninetyfive.gg.