NinetyFive

The fastest autocomplete

Solving Character Prefix Conditioning with Coverage Beam Search

June 2025

Language models are the latest way to autocomplete text the user is typing. However, implementing this naively results in an autocomplete that feels jarring.


The model makes poor suggestions: often invalid words, and sometimes in the wrong language. These failures occur because modern language models are trained on sequences of tokens, not characters. In this article, we dive deeper into the problem and an efficient solution.

The problem

Consider the phrase “Deep Recurrent Neural Networks”. With the Qwen 2.5 tokenizer, the phrase is split into the tokens

Deep Recurrent Neural Networks

Now, imagine the user's cursor in the middle of the word “Recurrent”. Tokenizing the input before the cursor results in the tokens

Deep Recurre 

Uh oh! Now the last token the model sees is not urrent but rather urre. This small distinction has a big impact on the output. In fact, it's only at the token boundaries that the model is able to make good predictions.

Deep Re LU Networks for Nonlinear
Deep Rec urrent Neural Networks for Time
Deep Recu ursion
Deep Recur : A Deep Recurrent Neural
Deep Recurr ance Networks for Time Series
Deep Recurre nCE: A Deep Recurrent
Deep Recurren CE: A Deep Recurrent
Deep Recurrent  Neural Networks for Time

How can we improve the model's generated suggestions so they are more consistent and relevant?

The solution

At a high level, our solution is to beam search over the unstable region of the tokenizer, maximizing the joint probability of the tokens that cover the unstable region, and to use those tokens to represent the text before the user's cursor.

Let's walk through that step by step.

First, tokenizers preprocess the input text by splitting it with a regex into a sequence of small chunks which are then tokenized individually. Pre-splitting improves the performance of the tokenizer: the tokenizer search is O(mn) in the chunk length m, so keeping each chunk short approximates a linear, O(n) algorithm. The regex is complex, but the key detail is that it always splits on whitespace. Therefore, in the input “Deep Recurrent Neural Networks”, the tokenizer tokenizes “Deep”, “ Recurrent”, “ Neural”, and “ Networks” separately, and we know that a token boundary must occur after each of them.
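As a rough illustration, here is a much-simplified stand-in for the pre-splitting step. The real Qwen/GPT-style pattern also handles contractions, digits, punctuation, and Unicode categories; the `PRESPLIT` pattern below captures only the splits-on-whitespace property that matters for this article:

```python
import re

# Simplified stand-in for a BPE pre-tokenization regex. Like the real
# pattern, it never lets a chunk span a whitespace boundary: each chunk
# is a run of non-space characters, optionally with one leading space.
PRESPLIT = re.compile(r" ?\S+")

def presplit(text: str) -> list[str]:
    """Split text into chunks that are tokenized independently.

    A token boundary is guaranteed at every chunk boundary, so the
    tokenization of one chunk can never be changed by a later chunk.
    """
    return PRESPLIT.findall(text)

# presplit("Deep Recurrent Neural Networks")
# → ["Deep", " Recurrent", " Neural", " Networks"]
```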

The text that comes after the last guaranteed token boundary is known as the unstable region. Everything before the unstable region is tokenized without any changes.

Deep Recurre
    └──────┘ unstable region
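Locating the unstable region can be sketched as below, under the simplifying assumption from above that whitespace always forces a token boundary (the real boundary follows the full pre-split regex):

```python
def split_unstable(text: str) -> tuple[str, str]:
    """Split text into a stable prefix and the unstable region.

    Simplifying assumption: whitespace always forces a token boundary,
    so everything up to the last space tokenizes unchanged, and only the
    final chunk (including its leading space) is unstable.
    """
    cut = text.rfind(" ")
    if cut == -1:
        return "", text  # no boundary: the whole text is unstable
    return text[:cut], text[cut:]

# split_unstable("Deep Recurre") → ("Deep", " Recurre")
```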

Next, produce a table of valid tokens at each character offset of the unstable region. A valid token at an offset is one that matches the remaining text there: either the token starts with the remaining text, or the remaining text starts with the token.

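The article's implementation isn't reproduced here, but the table construction can be sketched as a brute-force scan over a vocabulary list (a real tokenizer would use a trie or similar index):

```python
def valid_tokens(unstable: str, vocab: list[str]) -> dict[int, list[str]]:
    """Map each character offset in the unstable region to its valid tokens.

    A token is valid at offset i when it matches the remaining text
    unstable[i:]: either the token is a prefix of the remaining text
    (it fits inside the region), or the remaining text is a prefix of
    the token (the token extends past the region's end).
    """
    table = {}
    for i in range(len(unstable)):
        rest = unstable[i:]
        table[i] = [t for t in vocab
                    if t.startswith(rest) or rest.startswith(t)]
    return table
```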

Some prior attempts to solve this problem stop around here, with approaches like masking the output tokens to the set of valid tokens. This is insufficient: in our example, where the unstable region is ␣Recurre, the highest-likelihood token is ␣ (a space), which isn't the optimal tokenization. As foreshadowing: this approach is equivalent to ours with a beam width of k=1 and is too greedy.

Using this definition of valid tokens, we take the approach proposed by Vieira et al. (2024) and produce sequences of tokens that cover the unstable region.

With a table of valid tokens, a covering corresponds to a valid path through the table, so the table represents the space of all possible completions. For the unstable text ␣Recurre we have the table

Paths for the tokenizations ␣Recurred, ␣Recurrense, and ␣Recurrent

[Table: one column per character offset of ␣Recurre, headed by the remaining suffixes ␣Recurre, Recurre, ecurre, curre, urre, rre, re, e. Each column lists the vocabulary tokens valid at that offset, e.g. ␣Rec, Recurrently, current, urrent.]

For example, the tokenizations ␣Recurred, ␣Recurrense, and of course ␣Recurrent are valid coverings, among many other paths through the table. In total, for the input ␣Recurre with the Qwen 2.5 tokenizer there are 78,372 possible valid coverings. The best suggestion is the covering with the maximum likelihood, where a covering's likelihood is defined as

P(␣Recurred) = P(␣R) × P(ec | ␣R) × P(urred | ␣Rec)

To find the maximum-likelihood covering, we use beam search, observing that each factor in the product is found in the model's output logits. We generate candidate tokens for each beam in parallel, maintaining the joint probability of each beam of tokens and keeping the k most likely beams at each step. A beam terminates once its covering's length exceeds the length of the unstable region. Once we find a covering, its probability is a lower bound for all future beams. The lower bound speeds up the traversal and reduces the long-tail effects of searching over very large tables, as it terminates unlikely beams before they ever reach a covering.
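Putting the pieces together, the search can be sketched as below. The `next_token_prob` callback is a hypothetical stand-in for a lookup into the model's output logits, and the validity check is the prefix test described earlier:

```python
def coverage_beam_search(unstable, vocab, next_token_prob, k=4):
    """Find the maximum-likelihood token covering of the unstable region.

    next_token_prob(prefix_tokens, token) -> conditional probability;
    in the real system this comes from the model's output logits.
    """
    beams = [([], 1.0)]   # open beams: (tokens so far, joint probability)
    best = ([], 0.0)      # best covering found so far: our lower bound
    while beams:
        candidates = []
        for tokens, p in beams:
            pos = sum(len(t) for t in tokens)
            rest = unstable[pos:]
            for t in vocab:
                if not (t.startswith(rest) or rest.startswith(t)):
                    continue  # not a valid token at this offset
                q = p * next_token_prob(tokens, t)
                if q <= best[1]:
                    continue  # prune: extending can never beat the bound
                if pos + len(t) >= len(unstable):
                    best = (tokens + [t], q)  # covering reached: raise bound
                else:
                    candidates.append((tokens + [t], q))
        candidates.sort(key=lambda b: -b[1])
        beams = candidates[:k]  # keep the k most likely open beams
    return best
```

A toy model with hand-picked probabilities shows the greedy trap from earlier: a high-probability first token (the bare space) loses to ␣Rec once the joint probability of the full covering is compared.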

Let's walk through how beam search works step by step. The first beam is the stable region and we generate the probabilities of the valid tokens.

Beams                 Next Valid Token Probabilities
Deep (p=1.0)          ␣ (p=2.1×10⁻⁴), ␣R (p=1.2×10⁻⁴), ␣Rec (p=8.4×10⁻⁶), ␣Re (p=6.9×10⁻⁶)

Next, we take the top k beams (in this case all of them since there are only four) and generate next valid token probabilities for each of those beams. The probability of the beam is the product of the previous beam probability and the next token probability.

Beams                 Next Valid Token Probabilities
Deep␣ (p=2.1×10⁻⁴)    Rec (p=5.2×10⁻⁸), Re (p=1.2×10⁻⁹), R (p=3.6×10⁻¹¹)
Deep␣R (p=1.2×10⁻⁴)   ec (p=1.0×10⁻⁶), e (p=3.2×10⁻⁸)
Deep␣Rec (p=8.4×10⁻⁶) urrent (p=7.2×10⁻¹), urrence (p=3.7×10⁻²), urrences (p=4.3×10⁻³), ur (p=1.3×10⁻³), urrency (p=1.1×10⁻⁴), urr (p=1.0×10⁻⁴)
Deep␣Re (p=6.9×10⁻⁶)  current (p=4.6×10⁻⁵), cur (p=2.2×10⁻⁶), curr (p=2.2×10⁻⁶), currently (p=9.4×10⁻⁷), currentState (p=7.6×10⁻⁷), currentTime (p=6.0×10⁻⁷)

This step shows why beam search is productive: the highest-probability greedy candidate is ␣ (a space), but its subsequent valid tokens have low probabilities. Compare this to ␣Rec, which has a lower initial probability but whose next valid tokens have higher probabilities, leading to a higher joint probability. We repeat the process one more time, taking the top k=4 beams.

Beams                       Next Valid Token Probabilities
Deep␣Recurrent (p=6.0×10⁻⁶)   (covering)
Deep␣Recurrence (p=3.1×10⁻⁷)  (covering)
Deep␣Recurrences (p=3.6×10⁻⁸) (covering)
Deep␣Recur (p=1.1×10⁻⁸)       rent (p=1.3×10⁻³), rence (p=8.3×10⁻⁴), red (p=2.5×10⁻⁴), ree (p=2.4×10⁻⁴), rection (p=2.0×10⁻⁴), rec (p=1.1×10⁻⁴)

Now, among the top k=4 candidates, ␣Recurrent, ␣Recurrence, and ␣Recurrences each form a covering, and ␣Recurrent is our most likely covering so far. In practice, the Deep␣Recur beam is terminated at this point: its probability (p=1.1×10⁻⁸) is already below the lower bound of p=6.0×10⁻⁶ set by Deep␣Recurrent, and extending the beam can only decrease it. But for the sake of demonstration we will continue the process one more step, selecting the highest-probability next valid tokens.

Beams
Deep␣Recurrent (p=6.0×10⁻⁶)
Deep␣Recur + rent → Deep␣Recurrent (p=1.4×10⁻¹¹)
Deep␣Recur + rence → Deep␣Recurrence (p=8.8×10⁻¹²)
Deep␣Recur + red → Deep␣Recurred (p=2.8×10⁻¹²)
Deep␣Recur + ree → Deep␣Recurree (p=2.6×10⁻¹²)

Finally, all the beams have reached a covering state, so we terminate the beam search. From the output, the most probable candidate is ␣Recurrent. You might be wondering why the probabilities are so low; intuitively, one might expect a higher probability for the most likely covering. This is because the probabilities are taken over all possible sequences of tokens the model could generate, not just the coverings explored in this beam search.

Another way to understand this approach is as a parallel best-first traversal through the table, greedily expanding the k most promising paths in parallel.

[Figure: beam search, by BogdanShevchenko, CC BY-SA 4.0]

Applying the algorithm to the autocomplete suggestions, the user experiences

Deep Re inforcement Learning for Multi
Deep Rec ognition of the Importance of
Deep Recu rsive Neural Networks for Image
Deep Recur sive Neural Networks for Image
Deep Recurr ent Neural Networks for Time
Deep Recurre nt Neural Networks for Time
Deep Recurren t Neural Networks for Time
Deep Recurrent  Neural Networks for Time

Now the suggested completions are more coherent and stable, and we make the best suggestion earlier!

The results

To analyze the impact of this approach, we take a block of text and at each character offset, autocomplete the text until the generation no longer matches the reference text. The longer the suggestion match, the better.
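This evaluation loop can be sketched as follows, with `complete` as a hypothetical stand-in for the autocomplete system under test:

```python
def match_lengths(reference: str, complete) -> list[int]:
    """At every character offset, ask for a completion of the prefix and
    count how many characters agree with the rest of the reference."""
    lengths = []
    for i in range(1, len(reference)):
        suggestion = complete(reference[:i])
        rest = reference[i:]
        n = 0
        while n < min(len(suggestion), len(rest)) and suggestion[n] == rest[n]:
            n += 1
        lengths.append(n)
    return lengths

# A perfect completer matches all the way to the end of the reference:
# match_lengths("abcd", lambda p: "abcd"[len(p):]) → [3, 2, 1]
```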

With a beam width of k=2, the generated tokens match over 5× longer, and the variance is tighter than when naively producing suggestions, corroborating more stable and higher-quality suggestions. Additionally, increasing the beam width further has diminishing marginal returns. Therefore, to keep things tidy and performant, a beam width of k=2 is sufficient.

The intuition is that one alternate search path is sufficient to find a more optimal tokenization.

The applications

The next article will explore other applications of this solution beyond autocompletes (hint: structured output generation).

Join us on Discord or send us an email at hello@ninetyfive.gg and we'll let you know when that drops.

Of course, this solution is implemented in NinetyFive: the world's fastest autocomplete. Try us out!