NinetyFive Blog
Engineering insights and research from the NinetyFive team
torch.compile() Performance without torch.compile() Overhead
December 2025
When PyTorch 2.0 introduced torch.compile(), it promised significant speedups with a single line of code. And it delivers, but at a cost. The first request to a compiled model can take several minutes while PyTorch traces, optimizes, and compiles your model. For a real-time code completion service where every millisecond counts, that's unacceptable. Blue-green deployments? Never heard of it. Boot your server fast enough and nobody will notice the difference.
We wanted the performance benefits without the warmup penalty. So we dug into what torch.compile() actually does and found that most of the speedup comes from one specific optimization: CUDA graphs.
Where does the torch.compile() speedup come from?
We benchmarked the time to generate one token on a 32-layer transformer. With torch.compile(mode="reduce-overhead"), latency dropped from 11.1ms to 8.1ms.
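For reference, here's a minimal sketch of that kind of measurement, using CUDA events to time a forward pass with and without torch.compile(). The stack of linear layers is a stand-in for the real model, and the shapes are placeholders; this illustrates the methodology, not our actual benchmark harness.
import torch
import torch.nn as nn

# Stand-in model: 32 linear layers in place of a real 32-layer transformer.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(32)]).cuda().eval()
x = torch.randn(1, 1024, device="cuda")

def benchmark(fn, iters=100):
    for _ in range(10):                      # warmup (also triggers compilation, if any)
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per forward pass

with torch.no_grad():
    eager_ms = benchmark(model)
    compiled_ms = benchmark(torch.compile(model, mode="reduce-overhead"))
print(f"eager: {eager_ms:.2f} ms, reduce-overhead: {compiled_ms:.2f} ms")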
When we looked into how this works, we found that torch.compile() leverages NVIDIA's CUDA Graphs API. So we tried using that directly, without torch.compile(), and to our surprise it achieved nearly the same latency.
The speedup comes almost entirely from CUDA graphs eliminating kernel launch overhead.
Aaaand that's basically it! If you're not into marketing disguised as technical blog posts, you can stop here.
How CUDA graphs work
A CUDA graph records a sequence of GPU operations (kernel launches, memory copies, etc.) and replays them as a single unit. Normally, the CPU has to set up and submit each kernel launch individually, and when the kernels are small the GPU finishes each one faster than the CPU can issue the next. With hundreds of small kernels per forward pass, this launch overhead adds up. CUDA graphs eliminate it by letting the CPU issue one command: “replay this graph.”
# Capture phase: record operations into a graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    output = model(static_input)
# Replay phase: execute all operations with one command
static_input.copy_(new_input)
graph.replay()  # Runs entire forward pass
The warmup phase
Using the CUDA graphs API directly has one interesting benefit. Look at the first request latency.
First request latency
- torch.compile(reduce-overhead): 4,152 ms
- Raw CUDA graphs: 17 ms
torch.compile() spends over 4 seconds on the first request tracing your model, running optimization passes, and generating optimized kernels. And that's on one entrypoint! With varying shapes and many different entrypoints, we end up with nearly minutes of startup overhead. Meanwhile, our CUDA graphs approach carries almost none of that overhead.
We can still improve on 17ms, though. Before capturing a CUDA graph, you need to run a warmup pass to let CUDA initialize its caches and allocate memory. So in practice, the code flow looks like this.
# Warmup phase: run once to initialize CUDA state
output = model(static_input)
# Capture phase: record operations into a graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    output = model(static_input)
# Replay phase: execute all operations with one command
static_input.copy_(new_input)
graph.replay()
This warmup pass produces perfectly valid output, but tools like torch.compile() discard it. Our trick is to return that result to the user and defer graph capture to the second request.
- First request: Run eager (warmup), return the result immediately
- Second request: Capture the CUDA graph while computing, still return a valid result
- Third+ requests: Replay the captured graph at full speed
@cudagraphify
def forward(x):
    return model(x)
# Request 1: Runs eager (warmup), returns valid result
# Request 2: Captures graph, returns valid result
# Request 3+: Replays graph at full speed
This gets us down to an essentially imperceptible warmup phase overhead. In a serving environment, you might argue that warmup time doesn't matter. But faster warmup reduces time to deployment and speeds up local development, and it's really the little things that make this thing fly.
Ok. I see you're unconvinced by the desire to save 10ms. Fine. The real reason is that, written correctly, it's actually less code to return the warmup pass's output as a valid result.
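For the curious, here's roughly what that can look like. This is a minimal sketch assuming a single fixed input shape and inference-only execution; our production version handles multiple entrypoints and shapes, so treat the body as illustrative rather than as our actual implementation.
import torch

def cudagraphify(fn):
    # Sketch: first call runs eager (doubling as the CUDA warmup), second call
    # captures the graph and replays it once, later calls just replay.
    state = {"calls": 0, "graph": None, "static_in": None, "static_out": None}

    def wrapper(x):
        state["calls"] += 1
        if state["calls"] == 1:
            # Request 1: plain eager execution, returned directly to the user.
            return fn(x)
        if state["graph"] is None:
            # Request 2: capture against static buffers. Capture only records the
            # kernels, so we still replay below to produce a valid result.
            state["static_in"] = x.clone()
            state["graph"] = torch.cuda.CUDAGraph()
            with torch.cuda.graph(state["graph"]):
                state["static_out"] = fn(state["static_in"])
        # Requests 2 and 3+: copy the new input into the captured buffer and replay.
        state["static_in"].copy_(x)
        state["graph"].replay()
        return state["static_out"].clone()  # clone so the next replay can't clobber it

    return wrapper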
The prefilling gotcha
There's one complication with CUDA graphs. When you capture a graph, it records pointers to specific memory locations. Different tensor shapes get allocated to different memory, so you can't reuse a graph captured for one shape with another.
That's problematic because input prompts can be any length, so the input shape varies depending on how long the prompt is. torch.compile(dynamic=True) solves this by reasoning about shapes symbolically and generating kernels that work for any size. But that's a much harder problem, requiring the compiler to prove that optimizations are valid across all possible shapes. Our problem is simpler because we only vary on one dimension, the input sequence length. A naive approach would capture separate CUDA graphs for different sequence lengths, each with its own buffers, but that wastes O(n²) memory. Instead, we can allocate the maximum size tensor once and reuse slices of it for shorter sequences.
- Without sharing: 896 units total
- With sharing: 512 units total
The catch is that we don't know the largest input size ahead of time. We could pre-allocate for some maximum sequence length, but it turns out that it's easier to grow on demand: when a request comes in with a larger sequence length than we've seen before, we invalidate all cached graphs and recapture with larger shared buffers. In practice, a few requests with long prompts early on stabilize the buffer size for the rest of the session.
With shared buffers, we're back to O(n) memory.
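Here's a rough sketch of the bookkeeping, assuming batch size 1, a single input tensor, and inference-only execution; the class name and structure are for illustration, not our production code. It keeps one shared input buffer sized to the largest sequence length seen so far and one captured graph per distinct length, and recaptures everything when the buffer has to grow.
import torch

class GraphCache:
    def __init__(self, fn, hidden_dim):
        self.fn = fn                 # e.g. lambda x: model(x)
        self.hidden_dim = hidden_dim
        self.max_len = 0             # largest sequence length seen so far
        self.shared_in = None        # shared input buffer, sliced per request
        self.graphs = {}             # seq_len -> (CUDAGraph, output tensor)
        self.warmed = set()          # lengths that have had an eager warmup run

    def __call__(self, x):
        seq_len = x.shape[1]
        if seq_len > self.max_len:
            # New maximum: captured graphs point into the old buffer, so drop
            # them all and allocate a larger shared buffer.
            self.max_len = seq_len
            self.shared_in = torch.zeros(1, seq_len, self.hidden_dim, device="cuda")
            self.graphs.clear()
            self.warmed.clear()
        view = self.shared_in[:, :seq_len]   # slice of the shared buffer
        view.copy_(x)
        if seq_len not in self.warmed:
            # First request at this length: run eager, which doubles as warmup.
            self.warmed.add(seq_len)
            return self.fn(view)
        if seq_len not in self.graphs:
            # Second request at this length: capture a graph over the slice.
            graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(graph):
                out = self.fn(view)
            self.graphs[seq_len] = (graph, out)
        graph, out = self.graphs[seq_len]
        graph.replay()                       # input was copied into the slice above
        return out.clone()
Each length goes through the same eager, capture, replay progression as the decorator above, so every request still returns a valid result.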
Principled objections you might offer
Why not use PyTorch's built-in CUDA graph API? PyTorch provides torch.cuda.make_graphed_callables(), which does something similar. The difference is startup behavior: it runs its warmup iterations up front and discards their results, so you pay that cost before serving the first request instead of folding it into one.
Why not megakernels? You can achieve impressive performance by fusing everything into hand-optimized CUDA kernels. The tradeoff is flexibility: when you want to experiment with a new attention mechanism, add speculative decoding, or try a different sampling strategy, megakernels require rewriting optimized CUDA code. With CUDA graphs over standard PyTorch, we can iterate on model architecture in Python while maintaining production-grade performance. For a research-oriented team that ships to production, this balance matters.
The result
Most of the speedup from torch.compile() comes from CUDA graphs eliminating kernel launch overhead. The additional optimizations, like operator fusion and custom Triton kernels, add seconds to your startup time for marginal steady-state gains.
By managing CUDA graphs directly, we get the lion's share of the performance gains with near-instant warmup.
Eager for efficiency? Join us on Discord or email hello@ninetyfive.gg.