NinetyFive Blog

Engineering insights and research from the NinetyFive team

Incremental Tokenization with Fenwick Trees

November 2025

A NinetyFive code completion request breaks down roughly as 16ms tokenization and 50ms inference. That's roughly a quarter of the latency spent just turning text into tokens. We thought we could do better.

Our autocomplete model operates on tokens, not text. Before we can run inference, we convert the code into a sequence of tokens.

def calculate_sum(numbers):
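For illustration, here is roughly what that conversion looks like with tiktoken's cl100k_base encoding standing in for a BPE tokenizer; this is a stand-in for the example, not necessarily the tokenizer our model uses.

# Illustration only: tiktoken's cl100k_base encoding stands in for
# whatever BPE tokenizer the completion model actually uses.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("def calculate_sum(numbers):")
print([enc.decode([i]) for i in ids])
# something like ['def', ' calculate', '_sum', '(', 'numbers', '):']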

When a developer types a character, we retokenize the entire file to send to the model for completion. But most edits only affect nearby tokens, not the entire file. We can skip retokenizing the parts that didn't change.

Prelude

To speed up tokenization, tokenizers do a fast first pass that splits text at whitespace and punctuation before running a more expensive algorithm on each chunk. This means tokens never cross these boundaries, known as pretokenization boundaries.

def backpropagation(gradients):

Notice that ␣backpropagation is split into three tokens, all of which sit between the same pair of pretokenization boundaries. We call this grouping a pretoken.
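A toy pretokenizer makes the idea concrete. This sketch splits only at whitespace and punctuation, keeping a leading space attached to each word; real pretokenization regexes (GPT-2 style and later) are more elaborate, but the boundary idea is the same.

import re

# Simplified, illustrative pretokenizer: attach a leading space to each word
# and split out runs of punctuation. Real tokenizers use a more involved regex.
PRETOKEN_RE = re.compile(r" ?\w+|\s+|[^\w\s]+")

def split_pretokens(text):
    return PRETOKEN_RE.findall(text)

print(split_pretokens("def backpropagation(gradients):"))
# ['def', ' backpropagation', '(', 'gradients', '):']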

We can use this to our advantage: split the document into pretokens and store the tokens for each. When the user edits the file, we find the affected pretoken and retokenize just that part.

When a request comes in to modify a file, we receive the start position, end position, and the new text.

before: def backward(grad):
edit: { start: 4, end: 12, text: "backpropagation" }
after: def backpropagation(grad):
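As a sketch, the edit payload and its application to the raw text look like this, assuming 0-indexed offsets with an exclusive end (the Edit class and apply helper are our illustration, not the actual request schema):

from dataclasses import dataclass

@dataclass
class Edit:
    start: int  # first character replaced (0-indexed)
    end: int    # one past the last character replaced
    text: str   # replacement text

def apply(doc, edit):
    return doc[:edit.start] + edit.text + doc[edit.end:]

print(apply("def backward(grad):", Edit(4, 12, "backpropagation")))
# def backpropagation(grad):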

How do we find the pretokens that are affected? We could scan through them, adding up their lengths until we pass the edit position. But that's O(n) per edit. We can do better.

Enter the Fenwick tree

A Fenwick tree, also known as a Binary Indexed Tree, is a data structure that supports two operations in O(log n) time: updating a single element and computing a prefix sum over the elements.

The key insight is that we can use pretoken lengths as the elements. The prefix sum at index i gives us the character position where pretoken i ends. To find which pretoken contains a given character position, we traverse the Fenwick tree, accumulating partial sums until we find the matching pretoken.
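A minimal sketch of the structure with just those two operations (indices are 1-based, as is conventional for Fenwick trees; this is textbook code, not our production implementation):

# Minimal textbook Fenwick tree; illustrative, not production code.
class Fenwick:
    def __init__(self, n):
        self.tree = [0] * (n + 1)  # 1-indexed; tree[0] is unused

    def add(self, i, delta):
        # Point update: add delta to element i in O(log n)
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i  # move to the next node whose range covers i

    def prefix_sum(self, i):
        # Sum of elements 1..i in O(log n)
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total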

How it works

Consider a document with 8 pretokens of lengths [4, 7, 3, 8, 5, 2, 6, 4].

pretoken        1   2   3   4   5   6   7   8
length          4   7   3   8   5   2   6   4
ends at char    4  11  14  22  27  29  35  39
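The "ends at char" row is simply the running prefix sum of the lengths:

from itertools import accumulate

lengths = [4, 7, 3, 8, 5, 2, 6, 4]
print(list(accumulate(lengths)))
# [4, 11, 14, 22, 27, 29, 35, 39]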

The Fenwick tree stores partial sums. Each node covers a power-of-2 range determined by its index.

node index  1    2       3    4             5    6       7    8
covers      [1]  [1, 2]  [3]  [1, 2, 3, 4]  [5]  [5, 6]  [7]  [1..8]
stores      4    11      3    22            5    7       6    39

Node 8 stores the sum of all pretokens (39). Node 4 stores pretokens 1-4 (22). Node 6 stores pretokens 5-6 (7). The pattern: node i covers a number of pretokens equal to its lowest set bit, ending at pretoken i. Node 6 (0b110) has lowest bit 2, so it covers 2 pretokens ending at pretoken 6: pretokens 5-6. Node 4 (0b100) has lowest bit 4, so it covers 4 pretokens ending at pretoken 4: pretokens 1-4.
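The "lowest set bit" comes straight from two's-complement arithmetic:

def span(i):
    # i & -i isolates the lowest set bit of i, i.e. how many
    # pretokens node i covers.
    return i & -i

print(span(6))  # 2 -> node 6 covers pretokens 5-6
print(span(4))  # 4 -> node 4 covers pretokens 1-4
print(span(8))  # 8 -> node 8 covers pretokens 1-8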

The elegance of this structure becomes clear when visualized: each node spans exactly the pretokens it sums, forming a binary tree where parent-child relationships emerge naturally from bit manipulation.

[Diagram: the eight Fenwick nodes drawn above the pretoken axis (indices 1-8), each node spanning exactly the pretokens it sums]

To find which pretoken contains character 25, we traverse the tree from the largest power of 2, accumulating sums. If adding a node would exceed our target, we go left (skip it), otherwise we go right (take it) and continue.

query: find pretoken containing char 25

step 1: tree[8] = 39 > 25? yes, skip
step 2: tree[4] = 22 <= 25? yes, take it (sum = 22, pos = 4)
step 3: tree[6] = 7, sum + 7 = 29 > 25? yes, skip
step 4: tree[5] = 5, sum + 5 = 27 > 25? yes, skip

result: pretoken 5 (chars 22-27)

The prefix sum for each pretoken requires summing at most O(log n) nodes.

pretoken 1: Node 1
pretoken 2: Node 2
pretoken 3: Node 2 + Node 3
pretoken 4: Node 4
pretoken 5: Node 4 + Node 5
pretoken 6: Node 4 + Node 6
pretoken 7: Node 4 + Node 6 + Node 7
pretoken 8: Node 8

Thus, a Fenwick tree allows us to find which pretoken contains a given character position in O(log n) time.
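For completeness, here is a sketch of that search written against the minimal Fenwick class above; again illustrative rather than our production code, and it mirrors the find_pretoken helper used in the pseudocode later in the post.

# Illustrative sketch: binary-lift from the largest power of two, taking a
# node whenever the running sum stays <= char_pos.
def find_pretoken(fenwick, char_pos):
    tree = fenwick.tree
    pos, total = 0, 0
    step = 1
    while step * 2 < len(tree):
        step *= 2
    while step > 0:
        nxt = pos + step
        if nxt < len(tree) and total + tree[nxt] <= char_pos:
            total += tree[nxt]  # take the node; the target lies to its right
            pos = nxt
        step //= 2
    return pos + 1  # first pretoken whose prefix sum exceeds char_pos

# Rebuild the example tree and repeat the char-25 query from above.
fw = Fenwick(8)
for i, length in enumerate([4, 7, 3, 8, 5, 2, 6, 4], start=1):
    fw.add(i, length)
print(find_pretoken(fw, 25))  # -> 5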

The boundary expansion problem

Stepping back, recall our original problem: given an edit at some character position, we want to find the affected pretokens and retokenize only those. With a Fenwick tree in hand, we can find which pretokens a given edit spans in O(log n) time.

There's a subtlety, though. Modifying text near a boundary can cause the boundary to change or disappear completely.

before: hello world
edit: delete the leading space from " world"
after: helloworld ← pretokens merged!

When this happens, we need to detect it and expand our update region. We solve this with an iterative boundary check.

  1. Identify the affected region based on the edit position
  2. Include one pretoken on each side as “context” for boundary checking
  3. Pretokenize the region with the context pretokens
  4. Check if the context pretokens survived intact (weren't merged)
  5. If they merged, expand the region and repeat

In practice, boundary expansion rarely iterates more than once. Most edits fall within a single pretoken, so the initial region is stable.
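To make the check concrete, here is one way it could look, reusing the toy split_pretokens from earlier. This is our illustration of the idea; the actual implementation works on pretoken indices in the document rather than raw strings.

# Illustrative only: did the context pretokens survive when the edited
# region is re-pretokenized together with them?
def context_survived(left_ctx, right_ctx, region_text):
    pieces = split_pretokens(left_ctx + region_text + right_ctx)
    left_ok = not left_ctx or (pieces and pieces[0] == left_ctx)
    right_ok = not right_ctx or (pieces and pieces[-1] == right_ctx)
    return left_ok and right_ok

# Editing inside " backward" -> " backpropagation": boundaries are stable.
print(context_survived("def", "(", " backpropagation"))  # True
# Deleting the space in "hello world": "hello" and "world" merge -> expand.
print(context_survived("hello", "", "world"))            # False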

With boundary expansion we can implement the full algorithm for maintaining pretokenized strings.

def replace(start, end, text):
    # Find affected pretokens via Fenwick tree
    first = fenwick.find_pretoken(start)
    last = fenwick.find_pretoken(end)

    # Expand boundaries until stable
    while boundaries_changed(first, last, text):
        first, last = expand(first, last)

    # Retokenize only the affected region
    new_pretokens = pretokenize(first, last, text)
    new_tokens = tokenize(new_pretokens)

    # Update Fenwick tree with new lengths
    fenwick.update(first, last, new_pretokens)

The astute reader might notice that Fenwick trees are fixed-size: they don't support inserting or deleting elements, which is exactly what adding or removing a pretoken boundary requires. In practice, we solve this by using a treap instead of a pure Fenwick tree. The same general principles apply, though, and the Fenwick tree made for a more visually appealing walkthrough in this post. A treap implementation is left as an exercise for the reader.

The impact

We benchmarked the incremental approach against naive full retokenization across different document sizes and edit types. At 100k characters, our incremental approach is over 100x faster than naive retokenization.

For real-time code completion, every millisecond counts. With this change, our tokenization step consistently takes less than 1ms before we send the completion request to the GPU. The 30ms we save on a large file is 30ms less latency before the model can start generating. Combined with our coverage beam search for handling partial tokens, we can provide instant, accurate completions even in large codebases.

The Fenwick tree is a beautiful example of how the right data structure transforms an O(n) operation into O(log n). It's simple to implement, has minimal memory overhead, and meshes well with our incremental update pattern.

Sometimes the best optimizations aren't about clever algorithms or GPU kernels. They're about recognizing that most work is unnecessary and finding the right data structure to exploit that.

Magnate of minimal moves? Join us on Discord or email hello@ninetyfive.gg.