NinetyFive Blog
Engineering insights and research from the NinetyFive team
Incremental Tokenization with Fenwick Trees
November 2025
A NinetyFive code completion request breaks down roughly as 16ms of tokenization and 50ms of inference. That's roughly a quarter of the latency spent just turning text into tokens. We thought we could do better.
Our autocomplete model operates on tokens, not text. Before we can run inference, we convert the code into a sequence of tokens. Take a line like this one:
def calculate_sum(numbers):
When a developer types a character, we retokenize the entire file to send to the model for completion. But most edits only affect nearby tokens, not the entire file. We can skip retokenizing the parts that didn't change.
Prelude
To speed up tokenization, tokenizers do a fast first pass that splits text at whitespace and punctuation before running a more expensive algorithm on each chunk. This means tokens never cross these boundaries, known as pretokenization boundaries.
def backpropagation(gradients):
Notice that ␣backpropagation is split into three tokens, and all of them sit within the same pretokenization boundaries. We call this grouping a pretoken.
We can use this to our advantage: split the document into pretokens and store the tokens for each. When the user edits the file, we find the affected pretoken and retokenize just that part.
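To make this concrete, here is a rough sketch of a pretokenizer: a single regex pass that keeps a leading space attached to the word that follows it and splits out whitespace and punctuation. The pattern and names are illustrative, not our production rules, but the principle is the same.

import re

# Illustrative pretokenization rule (an assumption, not our production regex):
# a leading space stays attached to the word that follows it; runs of other
# whitespace and of punctuation become their own pretokens.
PRETOKEN_RE = re.compile(r" ?\w+|\s+|[^\w\s]+")

def split_pretokens(text):
    return PRETOKEN_RE.findall(text)

print(split_pretokens("def backpropagation(gradients):"))
# ['def', ' backpropagation', '(', 'gradients', '):']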
When a request comes in to modify a file, we receive the start position, end position, and the new text.
before: def backward(grad):
edit:   { start: 4, end: 12, text: "backpropagation" }
after:  def backpropagation(grad):
How do we find the pretokens that are affected? We could scan through them, adding up their lengths until we pass the edit position. But that's O(n) per edit. We can do better.
Enter the Fenwick tree
A Fenwick tree, also known as a Binary Indexed Tree, is a data structure that supports two operations in O(log n) time.
- Point update: modify xᵢ
- Prefix sum: compute x₀ + x₁ + ⋯ + xᵢ
The key insight is that we can use pretoken lengths as the elements. The prefix sum at index i gives us the character position where pretoken i ends. To find which pretoken contains a given character position, we traverse the Fenwick tree, accumulating partial sums until we find the matching pretoken.
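Here is a minimal sketch of such a tree over pretoken lengths (1-indexed to match the tables below; the class and method names are ours for illustration):

class Fenwick:
    # 1-indexed Fenwick tree over pretoken lengths.
    def __init__(self, lengths):
        self.n = len(lengths)
        self.tree = [0] * (self.n + 1)
        for i, length in enumerate(lengths, start=1):
            self.add(i, length)

    def add(self, i, delta):
        # Point update: add delta to the length of pretoken i.
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i      # jump to the next node that covers i

    def prefix_sum(self, i):
        # Prefix sum: the character position where pretoken i ends.
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i      # drop the lowest set bit
        return total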
How it works
Consider a document with 8 pretokens of lengths [4, 7, 3, 8, 5, 2, 6, 4].
| pretoken     | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
| ------------ | - | -- | -- | -- | -- | -- | -- | -- |
| length       | 4 | 7  | 3  | 8  | 5  | 2  | 6  | 4  |
| ends at char | 4 | 11 | 14 | 22 | 27 | 29 | 35 | 39 |
The Fenwick tree stores partial sums. Each node covers a power-of-2 range determined by its index.
| node index | 1   | 2      | 3   | 4            | 5   | 6      | 7   | 8      |
| ---------- | --- | ------ | --- | ------------ | --- | ------ | --- | ------ |
| covers     | [1] | [1, 2] | [3] | [1, 2, 3, 4] | [5] | [5, 6] | [7] | [1..8] |
| stores     | 4   | 11     | 3   | 22           | 5   | 7      | 6   | 39     |
Node 8 stores the sum of all pretokens (39). Node 4 stores pretokens 1-4 (22). Node 6 stores pretokens 5-6 (7). The pattern: node i covers a number of pretokens equal to its lowest set bit, ending at pretoken i. Node 6 (0b110) has lowest bit 2, so it covers 2 pretokens ending at pretoken 6: pretokens 5-6. Node 4 (0b100) has lowest bit 4, so it covers 4 pretokens ending at pretoken 4: pretokens 1-4.
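The lowest set bit is cheap to compute with the standard bit trick i & -i:

def lowbit(i):
    # Number of pretokens that node i covers.
    return i & -i

print([lowbit(i) for i in (4, 6, 8)])   # [4, 2, 8]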
The elegance of this structure becomes clear when visualized: each node spans exactly the pretokens it sums, forming a binary tree where parent-child relationships emerge naturally from bit manipulation.
To find which pretoken contains character 25, we traverse the tree from the largest power of 2, accumulating sums. If adding a node would exceed our target, we go left (skip it), otherwise we go right (take it) and continue.
query:  find pretoken containing char 25
step 1: tree[8] = 39 > 25?               yes, skip
step 2: tree[4] = 22 <= 25?              yes, take it (sum = 22, pos = 4)
step 3: tree[6] = 7, sum + 7 = 29 > 25?  yes, skip
step 4: tree[5] = 5, sum + 5 = 27 > 25?  yes, skip
result: pretoken 5 (chars 22-27)
The prefix sum for each pretoken requires summing at most O(log n) nodes.
Thus, a Fenwick tree allows us to find which pretoken contains a given character position in O(log n) time.
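Building on the Fenwick sketch above, the descent can be written as a small helper (a sketch with illustrative names; the walkthrough's numbers are reproduced at the end):

def find_pretoken(fw, pos):
    # Descend from the largest power-of-two node: take a node whenever doing so
    # keeps the accumulated prefix sum at or below pos, otherwise skip it and
    # try a smaller jump.
    idx, acc = 0, 0
    step = 1
    while step * 2 <= fw.n:
        step *= 2
    while step > 0:
        nxt = idx + step
        if nxt <= fw.n and acc + fw.tree[nxt] <= pos:
            acc += fw.tree[nxt]
            idx = nxt
        step //= 2
    return idx + 1   # the first pretoken whose end position exceeds pos

fw = Fenwick([4, 7, 3, 8, 5, 2, 6, 4])
print(find_pretoken(fw, 25))   # 5, matching the walkthrough above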
The boundary expansion problem
Stepping back, recall our original problem: given an edit at some character position, we want to find the affected pretokens and retokenize only those. With a Fenwick tree in hand, we can find which pretokens a given edit spans in O(log n) time.
There's a subtlety, though. Modifying text near a boundary can cause the boundary to change or disappear completely.
before: hello world
edit:   delete the leading space from " world"
after:  helloworld   ← pretokens merged!
When this happens, we need to detect it and expand our update region. We solve this with an iterative boundary check.
- Identify the affected region based on the edit position
- Include one pretoken on each side as “context” for boundary checking
- Pretokenize the region with the context pretokens
- Check if the context pretokens survived intact (weren't merged)
- If they merged, expand the region and repeat
In practice, boundary expansion rarely iterates more than once. Most edits are localized within a pretoken boundary, so the initial region is stable.
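Here is a minimal sketch of the stability check, assuming a split_pretokens helper like the regex sketch earlier; the function and parameter names are illustrative, not our production code:

def region_is_stable(pretokens, first, last, region_text, split_pretokens):
    # pretokens:   the current list of pretoken strings (0-indexed here)
    # first, last: indices of the pretokens touched by the edit
    # region_text: the text of pretokens first..last with the edit applied
    # Take one untouched pretoken of context on each side of the edit.
    left = pretokens[first - 1] if first > 0 else ""
    right = pretokens[last + 1] if last + 1 < len(pretokens) else ""

    candidate = split_pretokens(left + region_text + right)

    # Stable if the context pretokens survive unchanged at the ends of the
    # re-pretokenized text; if they merged with the edit, the caller widens
    # the region and tries again.
    left_ok = not left or (candidate and candidate[0] == left)
    right_ok = not right or (candidate and candidate[-1] == right)
    return bool(left_ok and right_ok)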
With boundary expansion we can implement the full algorithm for maintaining pretokenized strings.
def replace(start, end, text):
    # Find affected pretokens via Fenwick tree
    first = fenwick.find_pretoken(start)
    last = fenwick.find_pretoken(end)

    # Expand boundaries until stable
    while boundaries_changed(first, last, text):
        first, last = expand(first, last)

    # Retokenize only the affected region
    new_pretokens = pretokenize(first, last, text)
    new_tokens = tokenize(new_pretokens)

    # Update Fenwick tree with new lengths
    fenwick.update(first, last, new_pretokens)

The astute reader might notice that Fenwick trees don't allow inserting or deleting pretoken boundaries. In practice, we solve this by using a treap instead of a pure Fenwick tree. The same general principles apply; the Fenwick tree just made for a more visually appealing walkthrough in this blog post. A treap implementation is left as an exercise for the reader.
The impact
We benchmarked the incremental approach against naive full retokenization across different document sizes and edit types. At 100k characters, our incremental approach is over 100x faster than naive retokenization.
For real-time code completion, every millisecond counts. With this change, our tokenization step consistently takes less than 1ms before we send the request off to the GPU. The 30ms we save on a large file is 30ms less latency before the model can start generating. Combined with our coverage beam search for handling partial tokens, we can provide instant, accurate completions even in large codebases.
The Fenwick tree is a beautiful example of how the right data structure transforms an O(n) operation into O(log n). It's simple to implement, has minimal memory overhead, and meshes well with our incremental update pattern.
Sometimes the best optimizations aren't about clever algorithms or GPU kernels. They're about recognizing that most work is unnecessary and finding the right data structure to exploit that.
Magnate of minimal moves? Join us on Discord or email hello@ninetyfive.gg.