NinetyFive

The fastest autocomplete

Solving Character Prefix Conditioning with Coverage Beam Search

June 2025

Language models are the latest way to autocomplete text the user is typing. However, implementing this naively results in an autocomplete that feels jarring.


The model makes poor suggestions: often invalid words, and sometimes in the wrong language. These failures occur because modern language models are trained on sequences of tokens, not characters. In this article, we dive deeper into the problem and an efficient solution.

The problem

Consider the phrase “Deep Recurrent Neural Networks”. With the Qwen 2.5 tokenizer, the phrase is split into the tokens

Deep Recurrent Neural Networks

Now, imagine the user's cursor in the middle of the word “Recurrent”. Tokenizing the input before the cursor results in the tokens

Deep Recurre 

Uh oh! Now the last token the model sees is not urrent but rather urre. This small distinction has a big impact on the output. In fact, it's only at the token boundaries that the model is able to make good predictions.

Deep Re LU Networks for Nonlinear
Deep Rec urrent Neural Networks for Time
Deep Recu ursion
Deep Recur : A Deep Recurrent Neural
Deep Recurr ance Networks for Time Series
Deep Recurre nCE: A Deep Recurrent
Deep Recurren CE: A Deep Recurrent
Deep Recurrent  Neural Networks for Time

How can we improve the model's generated suggestions so they are more consistent and relevant?

The solution

At a high level, our solution is to beam search over the unstable region of the tokenizer, maximizing the joint probability of the tokens that cover the unstable region, and to use those tokens to represent the text before the user's cursor.

Let's walk through that step by step.

First, tokenizers preprocess the input text by splitting it with a regex into a sequence of small chunks which are then tokenized individually. Pre-splitting improves the performance of the tokenizer: the tokenizer search is O(mn) in the chunk length m, so keeping each chunk short approximates a linear, O(n) algorithm. The regex is complex, but the key detail is that it always splits on whitespace. Therefore, in the input “Deep Recurrent Neural Networks”, the tokenizer tokenizes “Deep”, “ Recurrent”, “ Neural”, and “ Networks” separately, and we know that a token boundary must occur after each of them.
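As a rough illustration, here is a much-simplified stand-in for the pre-splitting step. The real Qwen/GPT-style pattern also handles contractions, digits, punctuation, and Unicode categories; the `PRESPLIT` pattern below captures only the splits-on-whitespace property that matters for this article:

```python
import re

# Simplified stand-in for a BPE pre-tokenization regex. Like the real
# pattern, it never lets a chunk span a whitespace boundary: each chunk
# is a run of non-space characters, optionally with one leading space.
PRESPLIT = re.compile(r" ?\S+")

def presplit(text: str) -> list[str]:
    """Split text into chunks that are tokenized independently.

    A token boundary is guaranteed at every chunk boundary, so the
    tokenization of one chunk can never be changed by a later chunk.
    """
    return PRESPLIT.findall(text)

# presplit("Deep Recurrent Neural Networks")
# → ["Deep", " Recurrent", " Neural", " Networks"]
```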

The text that comes after the last guaranteed token boundary is known as the unstable region. Everything before the unstable region is tokenized without any changes.

Deep Recurre
    └──────┘ unstable region
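Locating the unstable region can be sketched as below, under the simplifying assumption from above that whitespace always forces a token boundary (the real boundary follows the full pre-split regex):

```python
def split_unstable(text: str) -> tuple[str, str]:
    """Split text into a stable prefix and the unstable region.

    Simplifying assumption: whitespace always forces a token boundary,
    so everything up to the last space tokenizes unchanged, and only the
    final chunk (including its leading space) is unstable.
    """
    cut = text.rfind(" ")
    if cut == -1:
        return "", text  # no boundary: the whole text is unstable
    return text[:cut], text[cut:]

# split_unstable("Deep Recurre") → ("Deep", " Recurre")
```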

Next, produce a table of valid tokens at each character offset of the unstable region. A valid token at an offset is one that matches the remaining text there: either the token starts with the remaining text, or the remaining text starts with the token.

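The article's implementation isn't reproduced here, but the table construction can be sketched as a brute-force scan over a vocabulary list (a real tokenizer would use a trie or similar index):

```python
def valid_tokens(unstable: str, vocab: list[str]) -> dict[int, list[str]]:
    """Map each character offset in the unstable region to its valid tokens.

    A token is valid at offset i when it matches the remaining text
    unstable[i:]: either the token is a prefix of the remaining text
    (it fits inside the region), or the remaining text is a prefix of
    the token (the token extends past the region's end).
    """
    table = {}
    for i in range(len(unstable)):
        rest = unstable[i:]
        table[i] = [t for t in vocab
                    if t.startswith(rest) or rest.startswith(t)]
    return table
```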

Some prior attempts to solve this problem stop around here, with approaches like masking the output tokens to the set of valid tokens. This is insufficient: in our example, where the unstable region is ␣Recurre, the highest-likelihood token is ␣ (a space), which isn't the optimal tokenization. As foreshadowing: this approach is equivalent to ours with a beam width of k=1 and is too greedy.

Using this definition of valid tokens, we take the approach proposed by Vieira et al. (2024) and produce sequences of tokens that cover the unstable region.

With a table of valid tokens, a covering corresponds to a valid path through the table, so the table represents the space of all possible completions. For the unstable text ␣Recurre we have the table

Paths for the tokenizations ␣Recurred, ␣Recurrense, and ␣Recurrent

[Table: one column per character offset of ␣Recurre, headed by the remaining suffixes ␣Recurre, Recurre, ecurre, curre, urre, rre, re, e. Each column lists the vocabulary tokens valid at that offset, e.g. ␣Rec, Recurrently, current, urrent.]

For example, the tokenizations ␣Recurred, ␣Recurrense, and of course ␣Recurrent are valid coverings, among many other paths through the table. In total, for the input ␣Recurre with the Qwen 2.5 tokenizer there are 78,372 possible valid coverings. The best suggestion is the covering with the maximum likelihood, where a covering's likelihood is defined as

P(␣Recurred) = P(␣R) × P(ec | ␣R) × P(urred | ␣Rec)

To find the maximum-likelihood covering, we use beam search, observing that each factor in the product is found in the model's output logits. We generate candidate tokens for each beam in parallel, maintaining the joint probability of each beam of tokens and keeping the k most likely beams at each step. A beam terminates once its covering's length exceeds the length of the unstable region. Once we find a covering, its probability is a lower bound for all future beams. The lower bound speeds up the traversal and reduces the long-tail effects of searching over very large tables, as it terminates unlikely beams before they ever reach a covering.
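Putting the pieces together, the search can be sketched as below. The `next_token_prob` callback is a hypothetical stand-in for a lookup into the model's output logits, and the validity check is the prefix test described earlier:

```python
def coverage_beam_search(unstable, vocab, next_token_prob, k=4):
    """Find the maximum-likelihood token covering of the unstable region.

    next_token_prob(prefix_tokens, token) -> conditional probability;
    in the real system this comes from the model's output logits.
    """
    beams = [([], 1.0)]   # open beams: (tokens so far, joint probability)
    best = ([], 0.0)      # best covering found so far: our lower bound
    while beams:
        candidates = []
        for tokens, p in beams:
            pos = sum(len(t) for t in tokens)
            rest = unstable[pos:]
            for t in vocab:
                if not (t.startswith(rest) or rest.startswith(t)):
                    continue  # not a valid token at this offset
                q = p * next_token_prob(tokens, t)
                if q <= best[1]:
                    continue  # prune: extending can never beat the bound
                if pos + len(t) >= len(unstable):
                    best = (tokens + [t], q)  # covering reached: raise bound
                else:
                    candidates.append((tokens + [t], q))
        candidates.sort(key=lambda b: -b[1])
        beams = candidates[:k]  # keep the k most likely open beams
    return best
```

A toy model with hand-picked probabilities shows the greedy trap from earlier: a high-probability first token (the bare space) loses to ␣Rec once the joint probability of the full covering is compared.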

Let's walk through how beam search works step by step. The first beam is the stable region and we generate the probabilities of the valid tokens.

Beams                 Next Valid Token Probabilities
Deep (p=1.0)          ␣ (p=2.1×10⁻⁴), ␣R (p=1.2×10⁻⁴), ␣Rec (p=8.4×10⁻⁶), ␣Re (p=6.9×10⁻⁶)

Next, we take the top k beams (in this case all of them since there are only four) and generate next valid token probabilities for each of those beams. The probability of the beam is the product of the previous beam probability and the next token probability.

Beams                 Next Valid Token Probabilities
Deep␣ (p=2.1×10⁻⁴)    Rec (p=5.2×10⁻⁸), Re (p=1.2×10⁻⁹), R (p=3.6×10⁻¹¹)
Deep␣R (p=1.2×10⁻⁴)   ec (p=1.0×10⁻⁶), e (p=3.2×10⁻⁸)
Deep␣Rec (p=8.4×10⁻⁶) urrent (p=7.2×10⁻¹), urrence (p=3.7×10⁻²), urrences (p=4.3×10⁻³), ur (p=1.3×10⁻³), urrency (p=1.1×10⁻⁴), urr (p=1.0×10⁻⁴)
Deep␣Re (p=6.9×10⁻⁶)  current (p=4.6×10⁻⁵), cur (p=2.2×10⁻⁶), curr (p=2.2×10⁻⁶), currently (p=9.4×10⁻⁷), currentState (p=7.6×10⁻⁷), currentTime (p=6.0×10⁻⁷)

This step shows why beam search is productive: the highest-probability greedy candidate is ␣ (a space), but its subsequent valid tokens have low probabilities. Compare this to ␣Rec, which has a lower initial probability but whose next valid tokens have higher probabilities, leading to a higher joint probability. We repeat the process one more time, taking the top k=4 beams.

Beams                       Next Valid Token Probabilities
Deep␣Recurrent (p=6.0×10⁻⁶)   (covering)
Deep␣Recurrence (p=3.1×10⁻⁷)  (covering)
Deep␣Recurrences (p=3.6×10⁻⁸) (covering)
Deep␣Recur (p=1.1×10⁻⁸)       rent (p=1.3×10⁻³), rence (p=8.3×10⁻⁴), red (p=2.5×10⁻⁴), ree (p=2.4×10⁻⁴), rection (p=2.0×10⁻⁴), rec (p=1.1×10⁻⁴)

Now, among the top k=4 candidates, ␣Recurrent, ␣Recurrence, and ␣Recurrences each form a covering, and ␣Recurrent is our most likely covering so far. In practice, the Deep␣Recur beam is terminated at this point: its probability (p=1.1×10⁻⁸) is already below the lower bound of p=6.0×10⁻⁶ set by Deep␣Recurrent, and extending the beam can only decrease it. But for the sake of demonstration we will continue the process one more step, selecting the highest-probability next valid tokens.

Beams
Deep␣Recurrent (p=6.0×10⁻⁶)
Deep␣Recur + rent → Deep␣Recurrent (p=1.4×10⁻¹¹)
Deep␣Recur + rence → Deep␣Recurrence (p=8.8×10⁻¹²)
Deep␣Recur + red → Deep␣Recurred (p=2.8×10⁻¹²)
Deep␣Recur + ree → Deep␣Recurree (p=2.6×10⁻¹²)

Finally, all the beams have reached a covering state, so we terminate the beam search. From the output, the most probable candidate is ␣Recurrent. You might be wondering why the probabilities are so low; intuitively, one might expect a higher probability for the most likely covering. This is because the probabilities are taken over all possible sequences of tokens the model could generate, not just the coverings explored in this beam search.

Another way to understand this approach is as a parallel best-first traversal through the table, greedily expanding the k most promising paths in parallel.

[Figure: beam search, by BogdanShevchenko, CC BY-SA 4.0]

Applying the algorithm to the autocomplete suggestions, the user experiences

Deep Re inforcement Learning for Multi
Deep Rec ognition of the Importance of
Deep Recu rsive Neural Networks for Image
Deep Recur sive Neural Networks for Image
Deep Recurr ent Neural Networks for Time
Deep Recurre nt Neural Networks for Time
Deep Recurren t Neural Networks for Time
Deep Recurrent  Neural Networks for Time

Now the suggested completions are more coherent and stable, and we make the best suggestion earlier!

The results

To analyze the impact of this approach, we take a block of text and at each character offset, autocomplete the text until the generation no longer matches the reference text. The longer the suggestion match, the better.
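This evaluation loop can be sketched as follows, with `complete` as a hypothetical stand-in for the autocomplete system under test:

```python
def match_lengths(reference: str, complete) -> list[int]:
    """At every character offset, ask for a completion of the prefix and
    count how many characters agree with the rest of the reference."""
    lengths = []
    for i in range(1, len(reference)):
        suggestion = complete(reference[:i])
        rest = reference[i:]
        n = 0
        while n < min(len(suggestion), len(rest)) and suggestion[n] == rest[n]:
            n += 1
        lengths.append(n)
    return lengths

# A perfect completer matches all the way to the end of the reference:
# match_lengths("abcd", lambda p: "abcd"[len(p):]) → [3, 2, 1]
```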

With a beam width of k=2, the generated tokens match over 5× longer, and the variance is tighter than when naively producing suggestions, corroborating more stable and higher-quality suggestions. Additionally, increasing the beam width further has diminishing marginal returns. Therefore, to keep things tidy and performant, a beam width of k=2 is sufficient.

The intuition is that one alternate search path is sufficient to find a more optimal tokenization.

The applications

The next article will explore other applications of this solution beyond autocompletes (hint: structured output generation).

Join us on Discord or send us an email at hello@ninetyfive.gg and we'll let you know when that drops.

Of course, this solution is implemented in NinetyFive: the world's fastest autocomplete. Try us out!