A multi-layer perceptron (MLP) stacked on top of a multi-head attention mechanism (Vaswani et al., 2023), where attention signals which high-entropy tokens should be amplified and which less important tokens should be diminished.
ELI5: Mom often writes a shopping list consisting of n items to buy. Your job is to guess what the last item on the list will be.
Most implementations are autoregressive. Most major SOTA models are decoder-only; encoder-decoder models have lagged behind due to their expensive encoding phase.
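To make the autoregressive loop concrete, here is a minimal greedy-decoding sketch; `model` is a stand-in for any decoder-only network mapping a token prefix to next-token logits (the interface is assumed, not tied to a specific library):

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids: torch.Tensor, eos_id: int, max_new_tokens: int = 64) -> torch.Tensor:
    """Greedy autoregressive decoding: repeatedly predict the next token and append it.

    `model` is assumed to map token ids of shape [1, T] to logits of shape
    [1, T, vocab_size] (e.g. a causal LM's logits output).
    """
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                 # [1, T, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)       # feed it back in
        if next_id.item() == eos_id:                              # stop once EOS is produced
            break
    return input_ids
```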
llama.cpp surprised many people (myself included) with how quickly you can run large LLMs on small computers, e.g. 7B runs @ ~16 tok/s on a MacBook. Wait don't you need supercomputers to work…
Figure 2: MTP implementation in DeepSeek, which keeps the complete causal chain for the prediction of each token at each depth
tl/dr: predict n tokens at once, via a shared trunk and n dedicated output heads.
Note that during inference, only one head (the next-token head) is employed.
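A minimal sketch of that layout, with a hypothetical `trunk` module and plain linear heads standing in for the paper's per-head transformer layers (names and shapes are illustrative, not the exact architecture):

```python
import torch
import torch.nn as nn

class MultiTokenPredictor(nn.Module):
    """Shared trunk + n dedicated heads, predicting tokens t+1, ..., t+n."""

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n: int = 4):
        super().__init__()
        self.trunk = trunk                                    # shared transformer trunk
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n)]
        )                                                     # one head per future offset

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        z = self.trunk(x)                                     # [B, T, d_model]
        return [head(z) for head in self.heads]               # n sets of logits, one per offset

    @torch.no_grad()
    def predict_next(self, x: torch.Tensor) -> torch.Tensor:
        # Inference-time path: only the next-token head is used.
        z = self.trunk(x)
        return self.heads[0](z)[:, -1, :].argmax(dim=-1)
```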
Byte-Latent Transformer
idea: learn directly from raw bytes and skip the tokenizer/detokenizer step altogether.
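A minimal sketch of what skipping the tokenizer means in practice: the input vocabulary collapses to the 256 byte values plus a few special symbols, so any UTF-8 string maps straight to model inputs. The special-token ids below are made up for illustration and are not BLT's actual scheme (BLT additionally groups bytes into latent patches):

```python
import torch

BOS, EOS = 256, 257          # hypothetical special ids appended after the 256 byte values

def bytes_to_ids(text: str) -> torch.Tensor:
    """Encode a UTF-8 string as a sequence of byte ids — no tokenizer involved."""
    ids = [BOS] + list(text.encode("utf-8")) + [EOS]
    return torch.tensor(ids, dtype=torch.long)

def ids_to_text(ids: torch.Tensor) -> str:
    """Invert the mapping, dropping special ids — no detokenizer involved."""
    raw = bytes(i for i in ids.tolist() if i < 256)
    return raw.decode("utf-8", errors="replace")

print(bytes_to_ids("héllo"))   # multi-byte UTF-8 characters become several byte ids
```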
Feynman-Kac
Let $V$ be the vocabulary of a given transformer model, and $S = V^{*}$ the set of multi-token strings. Assume $V$ contains the token EOS, and write $F \subseteq S$ for the set of EOS-terminated strings.
Feynman-Kac Transformer model
is a tuple $(s_0, \{M_t\}_{t \ge 1}, \{G_t\}_{t \ge 1})$ where:
$s_0 \in S$ is an initial state, which we take to be the empty string $\epsilon$
$M_t(s_t \mid s_{t-1}, f_\theta)$ is a Markov kernel from $s_{t-1} \in F^c$ to $s_t \in S$, parameterised by a transformer network $f_\theta : F^c \to \mathbb{R}^{|V|}$ mapping non-EOS-terminated strings to vectors of logits
$G_t(s_{t-1}, s_t, f_\theta)$ is a potential function, mapping a pair $(s_{t-1}, s_t) \in F^c \times S$ to a real-valued non-negative score.
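A minimal sketch of these ingredients in code, assuming a hypothetical `logits_fn` playing the role of $f_\theta$ and a user-supplied `constraint` predicate used to build a hard 0/1 potential; this is an interface illustration only, not the implementation from Lew et al. (2023):

```python
import torch
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeynmanKacTransformer:
    """(s0, M_t, G_t): initial state, Markov kernel, and potential."""
    s0: list[int]                                   # initial state (empty token string)
    logits_fn: Callable[[list[int]], torch.Tensor]  # f_theta: prefix -> logits over V
    constraint: Callable[[list[int]], bool]         # user-supplied predicate on strings

    def M_t(self, s_prev: list[int]) -> list[int]:
        """Markov kernel: extend a non-EOS-terminated string by one sampled token."""
        probs = torch.softmax(self.logits_fn(s_prev), dim=-1)
        token = torch.multinomial(probs, num_samples=1).item()
        return s_prev + [token]

    def G_t(self, s_prev: list[int], s_new: list[int]) -> float:
        """Potential: non-negative score reweighting the chain; here 0/1 for a hard constraint."""
        return 1.0 if self.constraint(s_new) else 0.0
```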
Goal: generate from the distribution $\mathbb{P}$ that reweights the Markov chain $M$ by the potential functions $G_t$. We define the step-$t$ filtering posteriors:

$$\mathbb{P}_t(s) = \frac{\mathbb{E}_M\!\left[\prod_{\tau=1}^{t} G_\tau(s_{\tau-1}, s_\tau, f_\theta)\, \mathbf{1}[s_t = s]\right]}{\mathbb{E}_M\!\left[\prod_{\tau=1}^{t} G_\tau(s_{\tau-1}, s_\tau, f_\theta)\right]}$$
Algorithm 4: Sequential Monte Carlo Transformer Steering
Input: $N$ (number of particles), $K$ (branching factor), Feynman-Kac Transformer model $(s_0, \{M_t\}_{t \ge 1}, \{G_t\}_{t \ge 1})$
Output: weighted particle approximation $\{(x_i, w_i)\}_{i=1,\dots,N}$ of the posterior $\mathbb{P}$
Output: unbiased estimate $\hat{Z}$ of the partition function $Z = \mathbb{E}_M\!\left[\prod_{t=1}^{T} G_t(s_{t-1}, s_t, f_\theta)\right]$
Initialize $f_\theta \leftarrow \text{CachedTransformer}()$
Initialize $(x_i, w_i) \leftarrow (s_0, 1)$ for $i = 1, \dots, N$
Initialize $t \leftarrow 1$
while $x_i \notin F$ for some $i \in \{1, \dots, N\}$ do
  $K_i \leftarrow K\,(1 - \mathbf{1}_F(x_i)) + \mathbf{1}_F(x_i)$ for $i = 1, \dots, N$
  $N' \leftarrow \sum_{i=1}^{N} K_i$
  for $i \in \{1, \dots, N\}$ do
    if $x_i \in F$ then
      Set $(x_{i,1}, w_{i,1}) \leftarrow \left(x_i,\ w_i \cdot \tfrac{N'}{N}\right)$
    else
      Generate $x_{i,k} \sim M_t(\cdot \mid x_i, f_\theta)$ for $k = 1, \dots, K$
      Set $w_{i,k} \leftarrow w_i \cdot G_t(x_i, x_{i,k}, f_\theta) \cdot \tfrac{N'}{KN}$ for $k = 1, \dots, K$
    end if
  end for
  Set normalized weights $\hat{w}_{i,k} \leftarrow \frac{w_{i,k}}{\sum_{j=1}^{N} \sum_{l=1}^{K_j} w_{j,l}}$ for $i = 1, \dots, N$ and $k = 1, \dots, K_i$
  Set $c^* \leftarrow \inf\left\{c \in \mathbb{R}_{>0} \;\middle|\; \sum_{i=1}^{N} \sum_{k=1}^{K_i} \left(1 \wedge c\,\hat{w}_{i,k}\right) \ge N\right\}$
  Set $(I_{\mathrm{det}}, I_{\mathrm{stoch}}, I_{\mathrm{strat}}) \leftarrow \left(\{(i,k) \mid c^*\hat{w}_{i,k} \ge 1\},\ \{(i,k) \mid c^*\hat{w}_{i,k} < 1\},\ \varnothing\right)$
  Set $\alpha \leftarrow \frac{\sum_{(i,k) \in I_{\mathrm{stoch}}} \hat{w}_{i,k}}{N - |I_{\mathrm{det}}|}$ and generate $U \sim \mathrm{Uniform}([0, \alpha])$
  for $(i,k) \in I_{\mathrm{stoch}}$ do
    Set $U \leftarrow U - \hat{w}_{i,k}$
    if $U < 0$ then
      Set $I_{\mathrm{strat}} \leftarrow I_{\mathrm{strat}} \cup \{(i,k)\}$
      Set $U \leftarrow U + \alpha$
    end if
  end for
  Set particles $\{(x_i, w_i)\}_{i=1,\dots,|I_{\mathrm{det}}|} \leftarrow \left\{\left(x_{j,k},\ w_{j,k} \cdot \tfrac{N}{N'}\right) \;\middle|\; (j,k) \in I_{\mathrm{det}}\right\}$
  Set particles $\{(x_i, w_i)\}_{i=|I_{\mathrm{det}}|+1,\dots,N} \leftarrow \left\{\left(x_{j,k},\ \tfrac{N}{c^* N'} \sum_{l=1}^{N} \sum_{m=1}^{K_l} w_{l,m}\right) \;\middle|\; (j,k) \in I_{\mathrm{strat}}\right\}$
  Set $t \leftarrow t + 1$
end while
return $\left(\{(x_i, w_i)\}_{i=1,\dots,N},\ \hat{Z} = \tfrac{1}{N} \sum_{i=1}^{N} w_i\right)$
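For intuition, a heavily simplified Python sketch of the same loop: particles are extended by $M_t$, reweighted by $G_t$, and resampled. It uses plain multinomial resampling instead of the stratified residual scheme above, drops the branching factor $K$ and KV caching, and assumes the hypothetical `FeynmanKacTransformer` interface sketched earlier — an illustration, not the algorithm as published:

```python
import torch

def smc_steering(fk, N: int = 8, eos: int = 2, max_steps: int = 64):
    """Simplified SMC steering over a FeynmanKacTransformer-like object `fk`.

    `fk` must provide: fk.s0 (initial token list), fk.M_t(prefix) -> extended prefix,
    fk.G_t(prev, new) -> non-negative float. Returns weighted particles and an
    estimate of the partition function Z.
    """
    particles = [list(fk.s0) for _ in range(N)]
    weights = torch.ones(N)

    for _ in range(max_steps):
        unfinished = [i for i, p in enumerate(particles) if not (p and p[-1] == eos)]
        if not unfinished:
            break

        # Extend each unfinished particle with M_t and reweight it with G_t.
        for i in unfinished:
            new = fk.M_t(particles[i])
            weights[i] = weights[i] * fk.G_t(particles[i], new)
            particles[i] = new

        if weights.sum() == 0:                 # every particle violated the constraint
            break

        # Multinomial resampling when weights degenerate (effective sample size < N/2).
        norm = weights / weights.sum()
        ess = 1.0 / (norm ** 2).sum()
        if ess < N / 2:
            idx = torch.multinomial(norm, num_samples=N, replacement=True)
            particles = [list(particles[j]) for j in idx.tolist()]
            # Keep the Z estimate unbiased: resampled particles share the mean weight.
            weights = torch.full((N,), weights.mean().item())

    z_hat = weights.mean().item()
    return list(zip(particles, weights.tolist())), z_hat
```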
Remark
Gloeckle et al. (2024) employ n=4. The order of the forward and backward passes in an n-token prediction model with n=4 heads on a shared trunk works as follows:
```python
# Memory-efficient order from the paper: run each head's forward/backward
# sequentially so only one head's logits/gradients are alive at a time,
# accumulating the trunk-output gradient in d.grad, then backprop the trunk once.
z = model.shared(x)              # shared trunk forward
d = z.detach()                   # cut the graph at the trunk output...
d.requires_grad = True           # ...but let head gradients accumulate in d.grad
for i in range(n):
    p = model.heads[i](d)        # forward head i
    loss(p, y[i]).backward()     # backward head i; grads accumulate into d.grad
z.backward(gradient=d.grad)      # single backward pass through the shared trunk
```
Gloeckle, F., Idrissi, B. Y., Rozière, B., Lopez-Paz, D., & Synnaeve, G. (2024). Better & Faster Large Language Models via Multi-token Prediction. arXiv preprint arXiv:2404.19737 [arXiv]
Lew, A. K., Zhi-Xuan, T., Grand, G., & Mansinghka, V. K. (2023). Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs. arXiv preprint arXiv:2306.03081 [arXiv]
Qin, R., Li, Z., He, W., Zhang, M., Wu, Y., Zheng, W., & Xu, X. (2024). Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv preprint arXiv:2407.00079 [arXiv]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2023). Attention Is All You Need. arXiv preprint arXiv:1706.03762 [arXiv]