Idea: “draft-and-verify” — use smaller models to generate lookahead (draft) tokens (quick explanation from Karpathy)
Intuitively:
- we generate a small set of lookahead tokens, typically 2-5 tokens, with a smaller speculator (draft model)
- we use the larger model to “verify” the input sequence plus the draft tokens, then replace any tokens the rejection sampler rejects
In a sense, we verify these tokens in parallel instead of decoding them autoregressively (see the sketch below).
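A minimal NumPy sketch of the verification step under the standard speculative-sampling acceptance rule (this is illustrative, not vLLM's actual rejection sampler; `verify_draft` and its argument shapes are assumptions made for the example):

```python
import numpy as np

def verify_draft(p_target, p_draft, draft_tokens, rng=None):
    """Accept or reject draft tokens with the speculative-sampling rule.

    p_target: (k+1, vocab) target-model probabilities (extra row for the bonus token)
    p_draft:  (k, vocab)   draft-model probabilities at the k draft positions
    draft_tokens: (k,)     tokens proposed by the draft model
    Returns the accepted prefix plus one corrected or bonus token.
    """
    rng = rng or np.random.default_rng()
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i, tok] / p_draft[i, tok]):
            out.append(int(tok))
            continue
        # On rejection, resample from the residual max(0, p_target - p_draft).
        residual = np.maximum(p_target[i] - p_draft[i], 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(residual.size, p=residual)))
        return out
    # All k drafts accepted: take a free "bonus" token from the extra target row.
    out.append(int(rng.choice(p_target.shape[1], p=p_target[-1])))
    return out
```

Because one forward pass of the large model scores all k draft positions at once, each iteration emits between 1 and k+1 tokens instead of exactly one.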
A few techniques such as n-gram and EAGLE are supported in vLLM (configuration sketch below).
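A hedged sketch of enabling n-gram speculative decoding in vLLM. The config keys follow recent vLLM releases and may differ by version; the model name is just an example, not taken from the source.

```python
from vllm import LLM, SamplingParams

# Sketch only: speculative_config keys follow recent vLLM releases and may vary.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # example model, not prescriptive
    speculative_config={
        "method": "ngram",            # draft tokens come from n-gram prompt lookup
        "num_speculative_tokens": 5,  # how many lookahead tokens to propose per step
        "prompt_lookup_max": 4,       # longest n-gram to match against the prompt
    },
)

outputs = llm.generate(
    ["Speculative decoding works by"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```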
MLP Speculator
via combined token/embedding speculators: small MLP heads that draft tokens from the base model's hidden state plus the embeddings of recently sampled tokens
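A rough PyTorch sketch of the idea behind a combined token/embedding speculator. The class name and structure here are illustrative assumptions, not vLLM's `MLPSpeculator` implementation.

```python
import torch
import torch.nn as nn

class TinyMLPSpeculator(nn.Module):
    """Conceptual sketch of an MLP speculator (not vLLM's MLPSpeculator class).

    Each lookahead position combines the base model's last hidden state with the
    embedding of the most recently sampled token, then predicts the next token.
    """

    def __init__(self, hidden_size: int, vocab_size: int, num_lookahead: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # One small head per lookahead position.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(2 * hidden_size, hidden_size),
                nn.GELU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_lookahead)
        )

    def forward(self, hidden: torch.Tensor, last_token: torch.Tensor) -> list[torch.Tensor]:
        """hidden: (batch, hidden_size) last hidden state from the base model.
        last_token: (batch,) token the base model just sampled."""
        drafts = []
        tok = last_token
        for head in self.heads:
            # Combine the base model's hidden state with the latest token embedding.
            logits = head(torch.cat([hidden, self.embed(tok)], dim=-1))
            tok = logits.argmax(dim=-1)  # greedy draft token for this position
            drafts.append(tok)
            # The real speculator also carries an updated internal state between
            # heads; this sketch keeps only the fixed base hidden state.
        return drafts
```

The drafted tokens are then verified by the base model exactly as in the draft-and-verify loop above.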