NinetyFive
The fastest autocomplete
Solving Character Prefix Conditioning with Coverage Beam Search
June 2025
Language models are the latest way to autocomplete text as the user types. However, a naive implementation produces an autocomplete that feels jarring.
The model makes poor suggestions: often invalid words, sometimes in the wrong language. These failures occur because modern language models are trained on sequences of tokens, not characters. In this article, we dive deeper into the problem and an efficient solution.
The problem
Consider the phrase “Deep Recurrent Neural Networks”. With the Qwen 2.5 tokenizer, the phrase is split into the tokens
Deep | ␣Rec | urrent | ␣Neural | ␣Networks
Now, imagine the user's cursor in the middle of the word “Recurrent”. Tokenizing the input before the cursor results in the tokens
Deep | ␣Rec | urre
Uh oh! Now the last token the model sees is not urrent but rather urre. This small distinction has a big impact on the output. In fact, it's only at the token boundaries that the model is able to make good predictions.
Deep Re LU Networks for Nonlinear
Deep Rec urrent Neural Networks for Time
Deep Recu ursion
Deep Recur : A Deep Recurrent Neural
Deep Recurr ance Networks for Time Series
Deep Recurre nCE: A Deep Recurrent
Deep Recurren CE: A Deep Recurrent
Deep Recurrent Neural Networks for Time
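If you want to reproduce the mismatch yourself, a quick check with the Hugging Face transformers tokenizer looks something like the sketch below (the checkpoint name and the exact token splits shown in the comments are illustrative; any Qwen 2.5 variant behaves similarly):

```python
# Sketch: compare the tokenization of the full word vs. the text up to the cursor.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")  # illustrative checkpoint

full = "Deep Recurrent Neural Networks"
before_cursor = "Deep Recurre"  # the cursor sits in the middle of "Recurrent"

print(tok.tokenize(full))           # ends on a clean boundary, e.g. [..., 'urrent', ...]
print(tok.tokenize(before_cursor))  # ends mid-word, e.g. [..., 'urre']
```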
How can we improve the model's generated suggestions so they are more consistent and relevant?
The solution
At a high level, our solution is to beam search over the unstable region of the tokenizer, maximizing the joint probability of the tokens that cover it, and to use that covering of tokens to represent the text before the user's cursor.
Let's walk through that step by step.
First, tokenizers preprocess the input text by splitting it with a regex into a sequence of blocks of text which are then tokenized individually. Pre-splitting improves the performance of the tokenizer because the tokenizer's search is O(mn), and pre-splitting keeps m small to approximate a linear, O(n) algorithm. The regex is complex, but the key detail is that it always splits on whitespace. Therefore, in the input “Deep Recurrent Neural Networks”, the tokenizer tokenizes “Deep”, “ Recurrent”, “ Neural”, and “ Networks” separately, and we know that a token boundary must occur after each of them.
The text that comes after the last guaranteed token boundary is known as the unstable region. Everything before the unstable region is tokenized as usual, without any changes.
Deep Recurre
    └──────┘ unstable region
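Locating the unstable region can be sketched with a simplification: the real pre-tokenizer regex is far more involved, but since it always splits on whitespace, the last whitespace-delimited chunk is a safe stand-in for the unstable region.

```python
import re

def split_unstable(text: str) -> tuple[str, str]:
    """Split text into (stable, unstable) at the last guaranteed token boundary.

    Simplification: the real pre-tokenizer regex is more complex, but it always
    starts a new block at whitespace, so the final whitespace-delimited chunk
    (including its leading space) cannot share a token with what comes before it.
    """
    match = re.search(r"\s\S*$", text)  # last whitespace and everything after it
    if match is None:
        return "", text                 # no boundary found: everything is unstable
    return text[:match.start()], text[match.start():]

stable, unstable = split_unstable("Deep Recurre")
print(repr(stable), repr(unstable))     # 'Deep' ' Recurre'
```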
Next, we produce a table of valid tokens at each character offset of the unstable region. A valid token at an offset is one where the token is a prefix of the remaining text, or the remaining text is a prefix of the token. In code, this check can be sketched as follows.
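(A sketch: `vocab` stands in for the tokenizer's vocabulary as a list of token strings; a real implementation would index the vocabulary, e.g. with a trie, rather than scanning it.)

```python
def valid_tokens(vocab: list[str], suffix: str) -> list[str]:
    """Tokens allowed at a given character offset of the unstable region.

    A token is valid if it is a prefix of the remaining text (it consumes part
    of the suffix) or the remaining text is a prefix of it (it consumes the
    rest of the suffix and extends past the cursor).
    """
    return [t for t in vocab if t.startswith(suffix) or suffix.startswith(t)]

def valid_token_table(vocab: list[str], unstable: str) -> dict[int, list[str]]:
    """Valid tokens at every character offset of the unstable region."""
    return {i: valid_tokens(vocab, unstable[i:]) for i in range(len(unstable))}
```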
Some prior attempts to solve this problem stop around here, with approaches like masking the output tokens to the set of valid tokens. This is insufficient: in our example, where the unstable region is ␣Recurre, the highest-likelihood valid token is ␣ (a lone space), which isn't part of the optimal tokenization. Foreshadowing: this approach is equivalent to ours with a beam width of k=1 and is too greedy.
Using this definition of valid tokens, we take the approach proposed by Vieira et al. (2024) and produce sequences of tokens that cover the unstable region.
With a table of valid tokens, a covering corresponds to a valid path through the table, and the table therefore represents the space of all possible completions. For the unstable text ␣Recurre we have the table
| ␣Recurre | Recurre | ecurre | curre | urre | rre | re | e |
|---|---|---|---|---|---|---|---|
| ␣ | R | e | currently | urrent | r | referer | eneg |
| ␣R | Re | ec | cu | ur | rr | rey | expl |
| ␣Re | Rec | | currentColor | urred | | removeAttr | emente |
| ␣Rec | | | cur | urrencies | | resents | environ |
| | | | currentIndex | urrets | | rels | ero |
| | | | curr | urrect | | refund | ense |
| | | | curre | urret | | r | eldom |
| | | | ... | ... | | ... | ... |
For example, the tokenizations Recurred, Recurrence, and of course Recurrent are valid coverings, among many other paths through the table. In total, for the input ␣Recurre with the Qwen 2.5 tokenizer, there are 78,372 possible valid coverings. The best suggestion is the covering with maximum likelihood. The covering's likelihood is defined as
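$$P(t_1, \dots, t_m \mid s) = \prod_{i=1}^{m} P\left(t_i \mid s,\, t_1, \dots, t_{i-1}\right),$$

where $s$ is the stable prefix and $t_1, \dots, t_m$ are the tokens of the covering.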
To find that maximum-likelihood covering, we use beam search, observing that each factor in the product is available from the model's output logits. We generate candidate tokens for each beam in parallel, maintaining the joint probability of each beam's tokens and keeping the k most likely beams at each step. A beam terminates once its length reaches or exceeds the length of the unstable region, i.e. once it has become a covering. When we find a covering, its probability becomes a lower bound for all future beams. The lower bound speeds up the traversal and reduces the long-tail cost of searching over very large tables, as it terminates unlikely beams before they reach a covering.
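A condensed sketch of that loop is below. `model_next_probs` is a hypothetical hook returning the model's next-token distribution for a given token sequence; the validity check and lower-bound pruning follow the description above, and a production version would batch the model calls and restrict scoring to the precomputed valid-token table.

```python
import heapq

def coverage_beam_search(model_next_probs, vocab, stable_tokens, unstable, k=4):
    """Find the maximum-likelihood token covering of the unstable region.

    model_next_probs(tokens) -> dict[str, float] is an assumed hook giving the
    model's next-token distribution. A beam is (prob, tokens, chars_consumed);
    it becomes a covering once its tokens reach or pass the end of `unstable`.
    """
    beams = [(1.0, list(stable_tokens), 0)]
    best_prob, best_tokens = 0.0, None

    while beams:
        candidates = []
        for prob, tokens, consumed in beams:
            suffix = unstable[consumed:]
            next_probs = model_next_probs(tokens)
            for tok in vocab:
                if not (tok.startswith(suffix) or suffix.startswith(tok)):
                    continue  # not a valid token at this offset
                new_prob = prob * next_probs.get(tok, 0.0)
                if new_prob <= best_prob:
                    continue  # lower-bound pruning: cannot beat the best covering
                if consumed + len(tok) >= len(unstable):
                    # Covering found: it becomes the new lower bound.
                    best_prob, best_tokens = new_prob, tokens + [tok]
                else:
                    candidates.append((new_prob, tokens + [tok], consumed + len(tok)))
        # Keep only the k most probable unfinished beams for the next step.
        beams = heapq.nlargest(k, candidates, key=lambda beam: beam[0])

    return best_tokens, best_prob
```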
Let's walk through how beam search works step by step. The first beam is the stable region and we generate the probabilities of the valid tokens.
| Beams | Next Valid Token Probabilities | | | | | |
|---|---|---|---|---|---|---|
| Deep p=1.0 | ␣ p=2.1×10⁻⁴ | ␣R p=1.2×10⁻⁴ | ␣Rec p=8.4×10⁻⁶ | ␣Re p=6.9×10⁻⁶ | | |
Next, we take the top k beams (in this case all of them since there are only four) and generate next valid token probabilities for each of those beams. The probability of the beam is the product of the previous beam probability and the next token probability.
| Beams | Next Valid Token Probabilities | | | | | |
|---|---|---|---|---|---|---|
| Deep␣ p=2.1×10⁻⁴ | Rec p=5.2×10⁻⁸ | Re p=1.2×10⁻⁹ | R p=3.6×10⁻¹¹ | | | |
| Deep␣R p=1.2×10⁻⁴ | ec p=1.0×10⁻⁶ | e p=3.2×10⁻⁸ | | | | |
| Deep␣Rec p=8.4×10⁻⁶ | urrent p=7.2×10⁻¹ | urrence p=3.7×10⁻² | urrences p=4.3×10⁻³ | ur p=1.3×10⁻³ | urrency p=1.1×10⁻⁴ | urr p=1.0×10⁻⁴ |
| Deep␣Re p=6.9×10⁻⁶ | current p=4.6×10⁻⁵ | cur p=2.2×10⁻⁶ | curr p=2.2×10⁻⁶ | currently p=9.4×10⁻⁷ | currentState p=7.6×10⁻⁷ | currentTime p=6.0×10⁻⁷ |
This step shows why beam search is productive: the highest-probability greedy candidate is ␣ (a lone space), but its subsequent valid tokens have low probabilities. Compare this to ␣Rec, which has a lower initial probability but whose next valid tokens have much higher probabilities, leading to a higher joint probability. We repeat the process one more time, taking the top k=4 beams.
| Beams | Next Valid Token Probabilities | | | | | |
|---|---|---|---|---|---|---|
| Deep␣Recurrent p=6.0×10⁻⁶ | | | | | | |
| Deep␣Recurrence p=3.1×10⁻⁷ | | | | | | |
| Deep␣Recurrences p=3.6×10⁻⁸ | | | | | | |
| Deep␣Recur p=1.1×10⁻⁸ | rent p=1.3×10⁻³ | rence p=8.3×10⁻⁴ | red p=2.5×10⁻⁴ | ree p=2.4×10⁻⁴ | rection p=2.0×10⁻⁴ | rec p=1.1×10⁻⁴ |
Now, among the top k=4 candidates, Recurrent, Recurrence, and Recurrences each form a covering, and Recurrent is our most likely covering so far. In practice, the Deep Recur beam would be terminated at this point: its probability, p=1.1×10⁻⁸, is already below the p=6.0×10⁻⁶ of Deep Recurrent, so no extension of it can do better. But for the sake of demonstration we will continue the process one more step, selecting the highest-probability next valid tokens.
| Beams | Next Valid Token Probabilities | | | | | |
|---|---|---|---|---|---|---|
| Deep␣Recurrent (␣Rec + urrent) p=6.0×10⁻⁶ | | | | | | |
| Deep␣Recurrent (␣Recur + rent) p=1.4×10⁻¹¹ | | | | | | |
| Deep␣Recurrence (␣Recur + rence) p=8.8×10⁻¹² | | | | | | |
| Deep␣Recurred (␣Recur + red) p=2.8×10⁻¹² | | | | | | |
| Deep␣Recurree (␣Recur + ree) p=2.6×10⁻¹² | | | | | | |
Finally, all the beams have reached a covering, so we terminate the beam search. From the output, the most probable candidate is Recurrent. You might be wondering why the probabilities are so low; intuitively, one might expect a higher probability for the most likely covering. This is because the probabilities are taken over all possible token sequences the model could generate, not just the coverings explored in this beam search.
Another way to understand this approach is as a parallel traversal of the table, greedily advancing the k most probable paths at each step.

Applying the algorithm to the autocomplete suggestions, the user now sees
Deep Re inforcement Learning for Multi
Deep Rec ognition of the Importance of
Deep Recu rsive Neural Networks for Image
Deep Recur sive Neural Networks for Image
Deep Recurr ent Neural Networks for Time
Deep Recurre nt Neural Networks for Time
Deep Recurren t Neural Networks for Time
Deep Recurrent Neural Networks for Time
Now the suggested completions are more coherent and stable, and we make the best suggestion earlier!
The results
To analyze the impact of this approach, we take a block of text and, at each character offset, autocomplete the text until the generation no longer matches the reference text. The longer the suggestion matches, the better.
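A sketch of this measurement, assuming a `complete(prefix)` hook for whichever autocomplete is under test (naive or coverage beam search):

```python
def matched_length(reference: str, suggestion: str) -> int:
    """Number of leading characters of the suggestion that match the reference."""
    count = 0
    for expected, actual in zip(reference, suggestion):
        if expected != actual:
            break
        count += 1
    return count

def evaluate(text: str, complete) -> list[int]:
    """Slide the cursor over every character offset and score each suggestion."""
    scores = []
    for cursor in range(1, len(text)):
        suggestion = complete(text[:cursor])  # autocomplete from this prefix
        scores.append(matched_length(text[cursor:], suggestion))
    return scores
```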
With a beam width of k=2, the generated tokens match the reference over 5× longer, and with tighter variance, than naively produced suggestions, confirming more stable and higher-quality suggestions. Additionally, increasing the beam width further yields diminishing marginal returns. Therefore, to keep things tidy and performant, a beam width of k=2 is sufficient.
The intuition is that a single alternate search path is usually enough to find a better tokenization.
The applications
The next article will explore other applications of this solution beyond autocompletes (hint: structured output generation).
Join us on Discord or send us an email at hello@ninetyfive.gg and we'll let you know when that drops.
Of course, this solution is implemented in NinetyFive: the world's fastest autocomplete. Try us out!