Uncertainty as a compute budget: a small adaptive language model

Most tokens in a generated sequence are easy: “of” after “a lot”, the next word in a near-verbatim quotation, the closing bracket of an opened parenthesis. Autoregressive language models are highly confident about most of their outputs, so a lot of inference compute goes into producing tokens the model would emit with near-certainty after a single forward pass.

A small subset of tokens is genuinely hard: a step in a multi-step calculation, a factual claim the model is hedging on, the resolution of a syntactic ambiguity that could go several ways. These are the tokens where extra compute might actually change the answer.

LANTERN — the open-source project this note is about — is a small language model that tries to spend compute where it matters: light forward passes by default, deeper recursion and Bayesian refinement only when its own uncertainty estimates say a token is hard. The code lives at github.com/CodeHalwell/LANTERN; this write-up explains the design.

The compute-shape mismatch

Standard transformer inference is uniform: every generated token gets the same number of layers, the same attention pattern, the same number of samples. The cost of a token is constant; the difficulty of a token is not. That mismatch is the room for adaptive computation, and there are two complementary ways to exploit it:

Vary depth. Spend more layers on tokens that need more reasoning. Universal Transformers, PonderNet, and adaptive halting work all live here.
Vary sampling effort. Spend more samples — Monte Carlo dropout, self-consistency, branch-and-rank — on tokens where a single forward pass is suspect.

LANTERN does both, controlled by a single uncertainty score, and uses an attention pattern that’s already cheap enough to make the per-token cost of extra depth low.

The recursive sparse backbone

The transformer block is weight-shared and reusable: the same parameters are applied multiple times in sequence, with the number of recursive steps decided at runtime.

def recur_block(h, steps):
    # Apply the same weight-shared block `steps` times.
    for _ in range(steps):
        h = Block(h)        # same weights each iteration
    return h

Conceptually this gives “depth on demand” without paying for it in parameter count. The base inference path runs the block a small fixed number of times (steps_base); the model can spend up to steps_reasoning iterations on a token if its uncertainty controller says so. An optional adaptive-halting head learns per-token stopping probabilities, so the extra steps are themselves data-dependent rather than a flat upper bound.
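The halting behaviour described above can be sketched in a few lines. This is a minimal illustration in plain Python, not LANTERN's implementation: `recur_with_halting`, `toy_block`, `toy_halt`, and the `threshold` parameter are hypothetical names, and the stopping rule (accumulate the probability of having halted, stop once it passes a threshold) is one common inference-time choice assumed here.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def recur_with_halting(
    h: Vector,
    block: Callable[[Vector], Vector],
    halt_prob: Callable[[Vector], float],
    steps_base: int,
    steps_max: int,
    threshold: float = 0.99,
) -> Tuple[Vector, int]:
    """Apply one shared block repeatedly and stop early once halting mass is high.

    Hypothetical helper, not LANTERN's API. `halt_prob(h)` is read as the
    probability of stopping at this step given that we have not stopped yet.
    """
    cumulative = 0.0  # probability of having halted by now
    steps_used = 0
    for _ in range(steps_max):
        h = block(h)  # same weights on every iteration
        steps_used += 1
        if steps_used < steps_base:
            continue  # the base depth always runs
        cumulative += (1.0 - cumulative) * halt_prob(h)
        if cumulative >= threshold:
            break
    return h, steps_used


def toy_block(h: Vector) -> Vector:
    """Toy stand-in for the shared transformer block: halves the state."""
    return [0.5 * x for x in h]


def toy_halt(h: Vector) -> float:
    """Toy halting head: more confident as the state settles towards zero."""
    return 1.0 - min(1.0, max(abs(x) for x in h))


h_out, steps_used = recur_with_halting([1.0, -0.8], toy_block, toy_halt, steps_base=2, steps_max=8)
_, steps_never = recur_with_halting([1.0, -0.8], toy_block, lambda h: 0.0, steps_base=2, steps_max=8)
print(steps_used, steps_never)
```

With a halting head that never fires, the loop falls back to the flat upper bound `steps_max`, so that is the case worth comparing against when checking whether a learned head saves anything.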

Attention inside the block is sparse:

Sliding-window attention of width $w$. Each token attends only to the last $w$ tokens, reducing the per-layer attention cost from $\mathcal{O}(L^2)$ to $\mathcal{O}(L \cdot w)$.
Global tokens. A small set of special positions — the BOS token and any inserted reasoning tokens — are attended to by every token, preserving long-range conditioning without paying for full attention.
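To make the sparsity pattern concrete, here is a small sketch that builds the attention mask for a causal sliding window plus global positions and counts how many query-key pairs survive. The function names are hypothetical, and two details are assumptions rather than facts about LANTERN: the window is taken to include the current token, and position 0 stands in for the BOS token.

```python
from typing import List, Sequence


def sparse_attention_mask(
    seq_len: int, window: int, global_positions: Sequence[int] = (0,)
) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    global_set = set(global_positions)
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q                 # never look at future tokens
            in_window = q - k < window      # the last `window` tokens, q included
            row.append(causal and (in_window or k in global_set))
        mask.append(row)
    return mask


def attended_pairs(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs that attention actually has to score."""
    return sum(sum(row) for row in mask)


seq_len, window = 8, 3
mask = sparse_attention_mask(seq_len, window)
sparse_pairs = attended_pairs(mask)
dense_pairs = seq_len * (seq_len + 1) // 2  # full causal attention
print(sparse_pairs, dense_pairs)
```

Each row holds at most `window` entries plus the global positions, which is where the $\mathcal{O}(L \cdot w)$ count comes from.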

The combination matters: sliding-window attention is what makes “run the block four extra times on this token” affordable. If the per-step cost were quadratic in sequence length, adaptive recursion would just move the bottleneck around.

Three signals for “I don’t know”

A language model’s confidence about its next token is a multi-faceted object. LANTERN combines three signals that capture genuinely different sources of uncertainty:

1. Distributional entropy

The standard signal: the entropy of the next-token distribution $p(x_{t+1} \mid x_{\le t})$,

$$H \;=\; -\sum_i p_i \log p_i.$$

High entropy → the model is spreading mass across many candidates. Low entropy → it has a clear winner. Cheap to compute (already have the logits), and well-understood as a calibration signal.

The complementary quantity is the top probability $p_{\max}$. Low $H$ and high $p_{\max}$ usually agree, but they disagree informatively in long-tailed distributions: high $p_{\max}$ with high $H$ means “I’m fairly sure about my top pick, but the long tail is unusually heavy”.
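A short sketch shows how $H$ and $p_{\max}$ can come apart. It uses plain Python rather than tensors, the helper names are hypothetical, and the logits are made up for illustration: one distribution puts its leftover mass on a single rival token, the other spreads the same leftover mass over a thousand tail tokens.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_and_pmax(logits: Sequence[float]) -> Tuple[float, float]:
    """Return (entropy in nats, top probability) of the next-token distribution."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy, max(probs)


# Same top probability, different shapes (illustrative logits only).
two_way = [9.0, math.log(1000.0)]      # leftover mass on one rival token
long_tail = [9.0] + [0.0] * 1000       # leftover mass spread over 1000 tokens

h_two, pmax_two = entropy_and_pmax(two_way)
h_tail, pmax_tail = entropy_and_pmax(long_tail)
h_uniform, _ = entropy_and_pmax([0.0, 0.0, 0.0, 0.0])
print(round(pmax_two, 3), round(h_two, 3), round(h_tail, 3))
```

Both distributions have the same $p_{\max}$, yet the long-tailed one has markedly higher entropy, which is the disagreement the paragraph above describes.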

2. Semantic dispersion

Distributional entropy can’t tell you why the model is uncertain. Two failure modes look identical on entropy alone:

The top-$k$ tokens are paraphrases or near-synonyms (big, large, huge). Sampling any of them gives a fluent, correct output.
The top-$k$ tokens point at semantically distinct continuations (agree, disagree, abstain). The choice changes the meaning.

LANTERN distinguishes them by looking at the embedding-space variance of the top-$k$ candidates, weighted by their probabilities:

$$\mu \;=\; \sum_{i \in \mathrm{top}_k} p_i \, e_i, \qquad \sigma^2 \;=\; \sum_{i \in \mathrm{top}_k} p_i \,\lVert e_i - \mu \rVert^2.$$

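# topk_probs: (k,) probabilities, topk_indices: (k,) token ids (PyTorch tensors)
topk_embeddings = embedding_matrix[topk_indices]                       # (k, d)
centroid  = (topk_probs[:, None] * topk_embeddings).sum(0)             # (d,) weighted mean
variance  = (topk_probs[:, None] * (topk_embeddings - centroid).pow(2)).sum()  # scalar sigma^2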

Interpretation:

High $H$, low $\sigma^2$ → synonyms / paraphrases. Sample normally; the answer doesn’t really depend on the choice.
High $H$, high $\sigma^2$ → genuinely different meanings. This is the one to spend more compute on.

It’s a heuristic — embedding spaces aren’t perfectly aligned with semantics — but it’s a useful refinement over raw entropy and almost free to compute.
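The two cases can be told apart on a toy example. The sketch below is plain Python with made-up two-dimensional embeddings, and `semantic_dispersion` is a hypothetical helper. One assumption goes beyond the formula as written: the top-$k$ probabilities are renormalised to sum to one so that $\mu$ is a proper weighted mean.

```python
from typing import List, Sequence


def semantic_dispersion(probs: Sequence[float], embeddings: Sequence[Sequence[float]]) -> float:
    """Probability-weighted variance of candidate embeddings around their centroid."""
    total = sum(probs)
    weights = [p / total for p in probs]  # assumption: renormalise over the top-k
    dim = len(embeddings[0])
    centroid: List[float] = [
        sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)
    ]
    return sum(
        w * sum((e[d] - centroid[d]) ** 2 for d in range(dim))
        for w, e in zip(weights, embeddings)
    )


probs = [0.4, 0.3, 0.3]  # identical entropy in both cases

# Illustrative embeddings only: near-synonyms sit close together,
# distinct continuations point in different directions.
synonyms = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
distinct = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

sigma2_synonyms = semantic_dispersion(probs, synonyms)
sigma2_distinct = semantic_dispersion(probs, distinct)
print(sigma2_synonyms < sigma2_distinct)
```

Because the probabilities are the same in both calls, entropy cannot separate the two cases; only the spread of the embeddings does.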

3. Epistemic uncertainty via MC dropout

The two signals above are computed from a single forward pass; they describe the aleatoric uncertainty captured by the model’s softmax. What they cannot tell you is whether the model itself is uncertain — whether different plausible weight configurations would produce very different distributions.

MC dropout is the cheap approximation: keep dropout active at inference, run the model $S$ times, and measure how much the predictive distribution moves between samples.

from lantern.uncertainty.bayesian import bayesian_step

mean_probs, epistemic_uncertainty = bayesian_step(
    model, hidden_states, lm_head, num_samples=5,
)

It’s not a real Bayesian posterior — it’s a variational approximation under specific assumptions on the dropout layers — but it captures “the model is unsure in a way a single forward pass would hide”. LANTERN treats it as one signal among three rather than a ground truth, which is the appropriate amount of trust to put in it.
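How the disagreement between samples gets turned into a number is a design choice. The sketch below shows one common option, the mutual information between prediction and dropout mask (entropy of the mean distribution minus the mean of the per-sample entropies). This is an assumption for illustration: the write-up does not say which measure `bayesian_step` uses, and `decompose_uncertainty` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(
    sampled_probs: Sequence[Sequence[float]],
) -> Tuple[List[float], float, float]:
    """Split predictive uncertainty using S stochastic forward passes.

    Returns (mean distribution, expected per-sample entropy, epistemic term).
    """
    num_samples = len(sampled_probs)
    vocab = len(sampled_probs[0])
    mean_probs = [sum(s[i] for s in sampled_probs) / num_samples for i in range(vocab)]
    total = entropy(mean_probs)
    expected = sum(entropy(s) for s in sampled_probs) / num_samples
    epistemic = max(0.0, total - expected)  # clamp tiny negative rounding noise
    return mean_probs, expected, epistemic


# Samples that agree: all uncertainty is already visible in a single pass.
_, _, epi_agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])

# Samples that disagree: each pass looks confident, but they contradict each other.
mean_disagree, _, epi_disagree = decompose_uncertainty([[0.9, 0.1], [0.1, 0.9]])
print(epi_agree, round(epi_disagree, 3))
```

Both examples have the same mean distribution, so a single-pass entropy would rate them equally; only the second has a non-zero epistemic term.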

Composing the signals

The three signals are reduced to a single uncertainty score:

$$U \;=\; a \cdot H \;+\; b \cdot \sigma^2 \;-\; c \cdot p_{\max} \;+\; \lambda \cdot U_{\text{epistemic}},$$

with weights chosen so that each term is on a comparable scale at typical operating points. The score drives a four-band controller:

Range	Behaviour
$U < \tau_{\text{low}}$	Confident: single forward pass, normal sampling.
$\tau_{\text{low}} \le U < \tau_{\text{mid}}$	Moderate: consider refined sampling (e.g. lower temperature).
$\tau_{\text{mid}} \le U < \tau_{\text{high}}$	High: run Bayesian refinement (MC dropout samples).
$U \ge \tau_{\text{high}}$	Very high: inject THINK token, deepen recursion.

The thresholds are exposed as configuration so the same model can be operated more or less cautiously depending on the workload — strict factual tasks want low thresholds, casual generation wants high ones.
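A compact sketch of the score and the band lookup might look like this. The class and function names are hypothetical, and every default weight and threshold is a placeholder chosen for illustration, not a value taken from the LANTERN repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerConfig:
    """Weights and thresholds; all defaults are illustrative placeholders."""
    a: float = 1.0         # weight on entropy H
    b: float = 1.0         # weight on semantic dispersion sigma^2
    c: float = 1.0         # weight on top probability p_max
    lam: float = 1.0       # weight on the epistemic term
    tau_low: float = 0.5
    tau_mid: float = 1.0
    tau_high: float = 2.0


def composite_score(
    entropy: float, dispersion: float, p_max: float, epistemic: float, cfg: ControllerConfig
) -> float:
    """U = a*H + b*sigma^2 - c*p_max + lambda*U_epistemic."""
    return cfg.a * entropy + cfg.b * dispersion - cfg.c * p_max + cfg.lam * epistemic


def band(score: float, cfg: ControllerConfig) -> str:
    """Map a score onto the four controller bands."""
    if score < cfg.tau_low:
        return "confident"
    if score < cfg.tau_mid:
        return "moderate"
    if score < cfg.tau_high:
        return "high"
    return "very_high"


cfg = ControllerConfig()
easy = band(composite_score(0.1, 0.01, 0.97, 0.0, cfg), cfg)
middling = band(composite_score(1.2, 0.2, 0.6, 0.0, cfg), cfg)
hard = band(composite_score(2.0, 0.9, 0.3, 0.4, cfg), cfg)
print(easy, middling, hard)
```

Operating the same model more cautiously then amounts to constructing a `ControllerConfig` with lower thresholds, which shifts more tokens into the expensive bands.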

The generation loop

At every step the controller is consulted after the cheap pass and escalates only if needed. The expensive paths are conditional, not default.

for t in range(max_tokens):
    # 1. Cheap forward pass with base recursion depth.
    h      = recur_block(h, steps=steps_base)
    logits = lm_head(h[:, -1, :])

    # 2. Single-pass uncertainty from this forward.
    U = composite_uncertainty(logits, embeddings)

    # 3. Spend more samples only from the "high" band upwards (see the controller table).
    if U >= tau_mid:
        mean_probs, U_epi = bayesian_step(model, h, lm_head, num_samples=5)
        U_total = U + λ * U_epi
    else:
        U_total = U

    # 4. Spend more depth only if even the samples disagree strongly.
    if U_total >= tau_high:
        inject_think_token()
        h      = recur_block(h, steps=steps_reasoning)
        logits = lm_head(h[:, -1, :])   # refresh logits from the deeper state

    next_token = sample(probs_from(logits, U_total))

Two properties are worth calling out explicitly:

Cost is upper-bounded by the worst-case branch. As the loop is written, a token costs at most steps_base + steps_reasoning block iterations (the cheap pass always runs first) plus S MC dropout samples. That puts a hard ceiling on per-token latency, even under adversarial inputs.
Cost is lower-bounded by the cheap branch. Easy tokens pay only steps_base iterations and no MC samples. On benign inputs the amortised cost is close to a standard small transformer.
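The two bounds, and the amortised cost between them, reduce to simple arithmetic. The sketch below assumes the deeper recursion runs on top of the base pass, as in the loop above, and that every token reaching the deep branch has also paid for MC sampling. `amortised_cost` is a hypothetical helper, and the fractions and step counts are invented for illustration, not measurements.

```python
from typing import Tuple


def amortised_cost(
    frac_mc: float,
    frac_deep: float,
    steps_base: int,
    steps_reasoning: int,
    num_samples: int,
) -> Tuple[float, float]:
    """Average (block iterations, MC dropout samples) per generated token.

    frac_mc:   fraction of tokens that trigger MC dropout sampling
    frac_deep: fraction of tokens that also trigger deeper recursion
    """
    if not 0.0 <= frac_deep <= frac_mc <= 1.0:
        raise ValueError("expected 0 <= frac_deep <= frac_mc <= 1")
    block_iterations = steps_base + frac_deep * steps_reasoning
    mc_samples = frac_mc * num_samples
    return block_iterations, mc_samples


# Illustrative settings only.
steps_base, steps_reasoning, num_samples = 2, 8, 5

best_case = amortised_cost(0.0, 0.0, steps_base, steps_reasoning, num_samples)
worst_case = amortised_cost(1.0, 1.0, steps_base, steps_reasoning, num_samples)
typical = amortised_cost(0.10, 0.02, steps_base, steps_reasoning, num_samples)
print(best_case, worst_case, typical)
```

Block iterations and MC samples are reported separately because their relative cost depends on how much of the model each dropout sample re-runs, which this sketch does not assume.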

Honest caveats

Uncertainty estimation in language models is an unsolved problem at the research level; I want to be clear about what this design does and doesn’t claim:

MC dropout is a weak Bayesian approximation. Its theoretical justification depends on the dropout layers being interpreted as a specific variational family, and the empirical literature has plenty of cases where its calibration is poor. Treat $U_{\text{epistemic}}$ as a useful signal, not a posterior.
Embedding-space variance is a proxy for meaning, not meaning. It works because tied embeddings in language models do cluster semantically related tokens, but it can mislead on rare tokens, on homographs, and across languages.
Entropy alone is well known to be miscalibrated on modern LLMs. Combining it with semantic dispersion and MC dropout helps but does not produce a calibrated probability: the score is best treated as an ordering of tokens by relative difficulty, not as an absolute measurement.
THINK token / extra recursion is not chain-of-thought. Injecting a sentinel and running the block more times asks the model to use the extra compute to refine its representation; it does not by itself force coherent step-by-step reasoning. Whether the extra depth helps depends on what the recursive block has actually learned to do at depths beyond steps_base, which is something to monitor empirically.

The honest framing is: this is a system that lets the model spend more compute on hard tokens. Whether the resulting tokens are better depends on the underlying model, the calibration of the thresholds, and the kind of task. The interesting research question is whether the uncertainty signal is reliable enough to be useful as a controller — not whether uncertainty estimates are perfect.

What I want to evaluate next

The design is in place; the open questions are empirical.

Calibration of the composite score. Reliability diagrams on a held-out generation task, ideally with both easy and hard subsets, to check that high $U$ actually correlates with token-level error.
Accuracy vs. compute curves. The honest plot is task accuracy against amortised compute per token, sweeping the thresholds. The hope is that the adaptive path beats a fixed-depth model at the same average cost. The risk is that the cheap branch covers nearly everything and the expensive branch fires on the wrong tokens.
THINK token utility. Does the model use the inserted sentinel as a marker to switch behaviour, or does it ignore it? Concrete diagnostics: attention to the THINK token from subsequent positions, and ablation of the token vs. just running more recursion steps.
Sliding-window failure cases. Long-range dependencies that exceed the window are exactly the cases where global tokens have to do all the work. It’s worth stress-testing on tasks that require retrieval from beyond the window to see where the sparse pattern breaks down.
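For the first item, the core of a reliability check is small: sort tokens by $U$, cut them into equal-count bins, and see whether the error rate rises from bin to bin. The sketch below is a hypothetical helper run on a hand-made toy sample; the numbers are synthetic and say nothing about how LANTERN actually behaves.

```python
from typing import List, Sequence, Tuple


def error_rate_by_bin(
    scores: Sequence[float], is_error: Sequence[bool], num_bins: int = 4
) -> List[Tuple[float, float]]:
    """Return (mean score, error rate) per equal-count bin, lowest scores first."""
    pairs = sorted(zip(scores, is_error))
    n = len(pairs)
    bins = []
    for b in range(num_bins):
        chunk = pairs[b * n // num_bins:(b + 1) * n // num_bins]
        if not chunk:
            continue  # more bins than tokens
        mean_score = sum(s for s, _ in chunk) / len(chunk)
        error_rate = sum(1 for _, e in chunk if e) / len(chunk)
        bins.append((mean_score, error_rate))
    return bins


def is_monotone(bins: Sequence[Tuple[float, float]]) -> bool:
    """True when the error rate never decreases as the score increases."""
    rates = [rate for _, rate in bins]
    return all(lo <= hi for lo, hi in zip(rates, rates[1:]))


# Synthetic toy data: per-token uncertainty scores and whether the token was wrong.
toy_scores = [0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.5, 2.0]
toy_errors = [False, False, False, True, False, True, True, True]

toy_bins = error_rate_by_bin(toy_scores, toy_errors, num_bins=2)
print(toy_bins, is_monotone(toy_bins))
```

A monotone curve is the minimum the controller needs, since it only relies on $U$ ordering tokens by difficulty; a full reliability diagram would additionally ask whether the score maps onto an actual error probability.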

Adaptive computation for transformers is an old idea that keeps coming back because the underlying observation — that token difficulty is wildly non-uniform — is hard to argue with. The interesting design question is what controls the adaptation. LANTERN’s bet is that a small ensemble of cheap uncertainty signals, treated honestly as approximations rather than ground truth, is enough to make compute-on-demand inference work at the small-model end of the curve.