Idea: “draft-and-verify”, using smaller models to generate lookahead tokens (quick explanation from karpathy)

Intuitively:

  • we generate a small set of lookahead tokens, typically 2-5, with smaller speculators
  • the larger model then “verifies” the input sequence + draft tokens (draft tokens rejected by the rejection sampler are replaced)

In a sense, we verify the draft tokens in parallel instead of decoding them autoregressively, one token at a time.
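The intuition above can be sketched as a toy draft-and-verify step. This is a minimal illustration, not vLLM's implementation: `draft_model`, `target_model`, `VOCAB_SIZE` and `speculative_step` are hypothetical names, and both "models" are hand-written distributions over a 4-token vocabulary. The accept/reject rule used here is the standard one from the speculative sampling literature (accept a draft token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q)); the note itself only says "rejection sampler", so treat that choice as an assumption.

```python
import random

VOCAB_SIZE = 4  # toy vocabulary size (hypothetical)


def _normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def draft_model(context):
    """Hypothetical small speculator: cheap next-token distribution."""
    last = context[-1] if context else 0
    nxt = (last + 1) % VOCAB_SIZE
    return _normalize([4.0 if t == nxt else 1.0 for t in range(VOCAB_SIZE)])


def target_model(context):
    """Hypothetical large model: the distribution we actually want to sample from."""
    last = context[-1] if context else 0
    nxt = (last + 1) % VOCAB_SIZE
    return _normalize(
        [6.0 if t == nxt else 2.0 if t == last else 1.0 for t in range(VOCAB_SIZE)]
    )


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(context, k, rng):
    """Run one draft-and-verify step and return the 1 to k + 1 tokens it produces."""
    # 1. Draft: the small model proposes k lookahead tokens autoregressively.
    draft_tokens, draft_probs = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_model(ctx)
        tok = sample(q, rng)
        draft_tokens.append(tok)
        draft_probs.append(q)
        ctx.append(tok)

    # 2. Verify: score every draft prefix with the target model.
    #    A real system does this in a single batched forward pass.
    target_probs = [
        target_model(list(context) + draft_tokens[:i]) for i in range(k + 1)
    ]

    # 3. Accept or reject each draft token, left to right.
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: replace the token with a sample from max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(_normalize(residual), rng))
        return out

    # Every draft token was accepted: take one bonus token from the target model.
    out.append(sample(target_probs[k], rng))
    return out
```

Calling `speculative_step([0], k=4, rng=random.Random(0))` in a loop, appending the returned tokens to the context each time, gives a full decoding loop. Each step emits at least one token, so the worst case matches ordinary autoregressive decoding in tokens per target-model pass.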

A few techniques, such as n-gram and EAGLE, are supported in vLLM.

MLP Speculator

via combined tokens/embedding speculators

abs: https://arxiv.org/abs/2404.19124v1

SPiRE

MagicDec

ngram

EAGLE