December 2025
torch.compile() Performance without torch.compile() Overhead
How we reach state-of-the-art inference speeds by managing CUDA graphs directly.
Engineering insights and research from the NinetyFive team
A deep dive into CPython's garbage collection internals, from reference counting to cyclic collection, with links to the actual source code. We encountered long pauses in the free-threaded build and traced them to a bug.
How we use Fenwick trees to achieve 10x faster tokenization updates for real-time code completion.
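As background for the post above: a Fenwick tree (binary indexed tree) maintains prefix sums over a mutable array in O(log n) per update and query, which is what makes incremental updates cheap compared with recomputing prefix sums from scratch. A minimal generic sketch, not the NinetyFive implementation; the token lengths and the byte-count use case below are illustrative assumptions:

```python
class FenwickTree:
    """Binary indexed tree: O(log n) point updates and prefix sums."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # 1-indexed internally

    def update(self, i, delta):
        """Add delta to element i (0-indexed)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # step to the next node covering index i

    def prefix_sum(self, i):
        """Sum of elements 0..i inclusive (0-indexed)."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total

# Hypothetical use: track per-token byte lengths so that the byte
# offset of any token prefix stays queryable as tokens change.
ft = FenwickTree(8)
for idx, tok_len in enumerate([3, 1, 4, 1, 5]):
    ft.update(idx, tok_len)
print(ft.prefix_sum(2))  # bytes covered by the first three tokens → 8
```

Editing one token only requires a single `update` with the length delta, rather than rewriting every downstream prefix sum.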
Language models are the latest way to autocomplete text as the user types. However, a naive implementation results in an autocomplete that feels jarring.