A multi-layer perceptron (MLP) stacked on top of a multi-head attention mechanism (Vaswani et al., 2023), where attention signals which high-entropy tokens should be amplified and which less important tokens should be diminished.
ELI5: Mom often writes a shopping list consisting of n items to buy. Your job is to guess what the last item on the list will be.
Most implementations are autoregressive. Most major SOTA models are decoder-only; encoder-decoder models have lagged behind due to their expensive encoding phase.
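To make the autoregressive loop concrete, here is a minimal greedy-decoding sketch; `model` is a stand-in for any decoder-only network mapping a token prefix to next-token logits (the interface is assumed, not tied to a specific library):

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids: torch.Tensor, eos_id: int, max_new_tokens: int = 64) -> torch.Tensor:
    """Greedy autoregressive decoding: repeatedly predict the next token and append it.

    `model` is assumed to map token ids of shape [1, T] to logits of shape
    [1, T, vocab_size] (e.g. a causal LM's logits output).
    """
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                 # [1, T, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)       # feed it back in
        if next_id.item() == eos_id:                              # stop once EOS is produced
            break
    return input_ids
```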
llama.cpp surprised many people (myself included) with how quickly you can run large LLMs on small computers, e.g. 7B runs @ ~16 tok/s on a MacBook. Wait don't you need supercomputers to work…
Figure 2: MTP implementation in DeepSeek, which keeps the complete causal chain for the prediction of each token at each depth
tl/dr: predict n tokens at once, via a shared trunk and n dedicated output heads.
Note that during inference, only one head (the next-token head) is employed.
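A minimal sketch of that layout, with a hypothetical `trunk` module and plain linear heads standing in for the paper's per-head transformer layers (names and shapes are illustrative, not the exact architecture):

```python
import torch
import torch.nn as nn

class MultiTokenPredictor(nn.Module):
    """Shared trunk + n dedicated heads, predicting tokens t+1, ..., t+n."""

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n: int = 4):
        super().__init__()
        self.trunk = trunk                                    # shared transformer trunk
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n)]
        )                                                     # one head per future offset

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        z = self.trunk(x)                                     # [B, T, d_model]
        return [head(z) for head in self.heads]               # n sets of logits, one per offset

    @torch.no_grad()
    def predict_next(self, x: torch.Tensor) -> torch.Tensor:
        # Inference-time path: only the next-token head is used.
        z = self.trunk(x)
        return self.heads[0](z)[:, -1, :].argmax(dim=-1)
```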
Byte-Latent Transformer
idea: learn directly from raw bytes and skip the tokenizer/detokenizer step altogether.
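A minimal sketch of what skipping the tokenizer means in practice: the input vocabulary collapses to the 256 byte values plus a few special symbols, so any UTF-8 string maps straight to model inputs. The special-token ids below are made up for illustration and are not BLT's actual scheme (BLT additionally groups bytes into latent patches):

```python
import torch

BOS, EOS = 256, 257          # hypothetical special ids appended after the 256 byte values

def bytes_to_ids(text: str) -> torch.Tensor:
    """Encode a UTF-8 string as a sequence of byte ids — no tokenizer involved."""
    ids = [BOS] + list(text.encode("utf-8")) + [EOS]
    return torch.tensor(ids, dtype=torch.long)

def ids_to_text(ids: torch.Tensor) -> str:
    """Invert the mapping, dropping special ids — no detokenizer involved."""
    raw = bytes(i for i in ids.tolist() if i < 256)
    return raw.decode("utf-8", errors="replace")

print(bytes_to_ids("héllo"))   # multi-byte UTF-8 characters become several byte ids
```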
Feynman-Kac
Let $V$ be the vocabulary of a given transformer model, and $S = V^{*}$ the set of multi-token strings. Assume $V$ contains the token EOS, and write $F \subseteq S$ for the set of EOS-terminated strings.
Feynman-Kac Transformer model
is a tuple $(s_0, \{M_t\}_{t \ge 1}, \{G_t\}_{t \ge 1})$ where:
$s_0 \in S$ is an initial state, which we take to be the empty string $\epsilon$
$M_t(s_t \mid s_{t-1}, f_\theta)$ is a Markov kernel from $s_{t-1} \in F^c$ to $s_t \in S$, parameterised by a transformer network $f_\theta : F^c \to \mathbb{R}^{|V|}$ mapping non-EOS-terminated strings to vectors of logits
$G_t(s_{t-1}, s_t, f_\theta)$ is a potential function, mapping a pair $(s_{t-1}, s_t) \in F^c \times S$ to a real-valued non-negative score.
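A minimal sketch of these ingredients in code, assuming a hypothetical `logits_fn` playing the role of $f_\theta$ and a user-supplied `constraint` predicate used to build a hard 0/1 potential; this is an interface illustration only, not the implementation from Lew et al. (2023):

```python
import torch
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeynmanKacTransformer:
    """(s0, M_t, G_t): initial state, Markov kernel, and potential."""
    s0: list[int]                                   # initial state (empty token string)
    logits_fn: Callable[[list[int]], torch.Tensor]  # f_theta: prefix -> logits over V
    constraint: Callable[[list[int]], bool]         # user-supplied predicate on strings

    def M_t(self, s_prev: list[int]) -> list[int]:
        """Markov kernel: extend a non-EOS-terminated string by one sampled token."""
        probs = torch.softmax(self.logits_fn(s_prev), dim=-1)
        token = torch.multinomial(probs, num_samples=1).item()
        return s_prev + [token]

    def G_t(self, s_prev: list[int], s_new: list[int]) -> float:
        """Potential: non-negative score reweighting the chain; here 0/1 for a hard constraint."""
        return 1.0 if self.constraint(s_new) else 0.0
```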
Goal: generate from the distribution $\mathbb{P}$ that reweights the Markov chain $M$ by the potential functions $G_t$. We define the step-$t$ filtering posteriors:

$$\mathbb{P}_t(s) = \frac{\mathbb{E}_M\!\left[\prod_{\tau=1}^{t} G_\tau(s_{\tau-1}, s_\tau, f_\theta)\, \mathbf{1}[s_t = s]\right]}{\mathbb{E}_M\!\left[\prod_{\tau=1}^{t} G_\tau(s_{\tau-1}, s_\tau, f_\theta)\right]}$$
Algorithm 4: Sequential Monte Carlo Transformer Steering
Input: $N$ (number of particles), $K$ (branching factor), Feynman-Kac Transformer model $(s_0, \{M_t\}_{t \ge 1}, \{G_t\}_{t \ge 1})$
Output: weighted particle approximation $\{(x_i, w_i)\}_{i=1,\dots,N}$ of the posterior $\mathbb{P}$
Output: unbiased estimate $\hat{Z}$ of the partition function $Z = \mathbb{E}_M\!\left[\prod_{t=1}^{T} G_t(s_{t-1}, s_t, f_\theta)\right]$
Initialize $f_\theta \leftarrow \text{CachedTransformer}()$
Initialize $(x_i, w_i) \leftarrow (s_0, 1)$ for $i = 1, \dots, N$
Initialize $t \leftarrow 1$
while $x_i \notin F$ for some $i \in \{1, \dots, N\}$ do
  $K_i \leftarrow K\,(1 - \mathbf{1}_F(x_i)) + \mathbf{1}_F(x_i)$ for $i = 1, \dots, N$
  $N' \leftarrow \sum_{i=1}^{N} K_i$
  for $i \in \{1, \dots, N\}$ do
    if $x_i \in F$ then
      Set $(x_{i,1}, w_{i,1}) \leftarrow \left(x_i,\ w_i \cdot \tfrac{N'}{N}\right)$
    else
      Generate $x_{i,k} \sim M_t(\cdot \mid x_i, f_\theta)$ for $k = 1, \dots, K$
      Set $w_{i,k} \leftarrow w_i \cdot G_t(x_i, x_{i,k}, f_\theta) \cdot \tfrac{N'}{KN}$ for $k = 1, \dots, K$
    end if
  end for
  Set normalized weights $\hat{w}_{i,k} \leftarrow \frac{w_{i,k}}{\sum_{j=1}^{N} \sum_{l=1}^{K_j} w_{j,l}}$ for $i = 1, \dots, N$ and $k = 1, \dots, K_i$
  Set $c^* \leftarrow \inf\left\{c \in \mathbb{R}_{>0} \;\middle|\; \sum_{i=1}^{N} \sum_{k=1}^{K_i} \left(1 \wedge c\,\hat{w}_{i,k}\right) \ge N\right\}$
  Set $(I_{\mathrm{det}}, I_{\mathrm{stoch}}, I_{\mathrm{strat}}) \leftarrow \left(\{(i,k) \mid c^*\hat{w}_{i,k} \ge 1\},\ \{(i,k) \mid c^*\hat{w}_{i,k} < 1\},\ \varnothing\right)$
  Set $\alpha \leftarrow \frac{\sum_{(i,k) \in I_{\mathrm{stoch}}} \hat{w}_{i,k}}{N - |I_{\mathrm{det}}|}$ and generate $U \sim \mathrm{Uniform}([0, \alpha])$
  for $(i,k) \in I_{\mathrm{stoch}}$ do
    Set $U \leftarrow U - \hat{w}_{i,k}$
    if $U < 0$ then
      Set $I_{\mathrm{strat}} \leftarrow I_{\mathrm{strat}} \cup \{(i,k)\}$
      Set $U \leftarrow U + \alpha$
    end if
  end for
  Set particles $\{(x_i, w_i)\}_{i=1,\dots,|I_{\mathrm{det}}|} \leftarrow \left\{\left(x_{j,k},\ w_{j,k} \cdot \tfrac{N}{N'}\right) \;\middle|\; (j,k) \in I_{\mathrm{det}}\right\}$
  Set particles $\{(x_i, w_i)\}_{i=|I_{\mathrm{det}}|+1,\dots,N} \leftarrow \left\{\left(x_{j,k},\ \tfrac{N}{c^* N'} \sum_{l=1}^{N} \sum_{m=1}^{K_l} w_{l,m}\right) \;\middle|\; (j,k) \in I_{\mathrm{strat}}\right\}$
  Set $t \leftarrow t + 1$
end while
return $\left(\{(x_i, w_i)\}_{i=1,\dots,N},\ \hat{Z} = \tfrac{1}{N} \sum_{i=1}^{N} w_i\right)$
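For intuition, a heavily simplified Python sketch of the same loop: particles are extended by $M_t$, reweighted by $G_t$, and resampled. It uses plain multinomial resampling instead of the stratified residual scheme above, drops the branching factor $K$ and KV caching, and assumes the hypothetical `FeynmanKacTransformer` interface sketched earlier — an illustration, not the algorithm as published:

```python
import torch

def smc_steering(fk, N: int = 8, eos: int = 2, max_steps: int = 64):
    """Simplified SMC steering over a FeynmanKacTransformer-like object `fk`.

    `fk` must provide: fk.s0 (initial token list), fk.M_t(prefix) -> extended prefix,
    fk.G_t(prev, new) -> non-negative float. Returns weighted particles and an
    estimate of the partition function Z.
    """
    particles = [list(fk.s0) for _ in range(N)]
    weights = torch.ones(N)

    for _ in range(max_steps):
        unfinished = [i for i, p in enumerate(particles) if not (p and p[-1] == eos)]
        if not unfinished:
            break

        # Extend each unfinished particle with M_t and reweight it with G_t.
        for i in unfinished:
            new = fk.M_t(particles[i])
            weights[i] = weights[i] * fk.G_t(particles[i], new)
            particles[i] = new

        if weights.sum() == 0:                 # every particle violated the constraint
            break

        # Multinomial resampling when weights degenerate (effective sample size < N/2).
        norm = weights / weights.sum()
        ess = 1.0 / (norm ** 2).sum()
        if ess < N / 2:
            idx = torch.multinomial(norm, num_samples=N, replacement=True)
            particles = [list(particles[j]) for j in idx.tolist()]
            # Keep the Z estimate unbiased: resampled particles share the mean weight.
            weights = torch.full((N,), weights.mean().item())

    z_hat = weights.mean().item()
    return list(zip(particles, weights.tolist())), z_hat
```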
Remark
Gloeckle et al. (2024) employ n=4. The order of the forward and backward passes in an n-token prediction model with n=4 heads on a shared trunk works as follows:
```python
# Memory-efficient order from the paper: run each head's forward/backward
# sequentially so only one head's logits/gradients are alive at a time,
# accumulating the trunk-output gradient in d.grad, then backprop the trunk once.
z = model.shared(x)              # shared trunk forward
d = z.detach()                   # cut the graph at the trunk output...
d.requires_grad = True           # ...but let head gradients accumulate in d.grad
for i in range(n):
    p = model.heads[i](d)        # forward head i
    loss(p, y[i]).backward()     # backward head i; grads accumulate into d.grad
z.backward(gradient=d.grad)      # single backward pass through the shared trunk
```
Gloeckle, F., Idrissi, B. Y., Rozière, B., Lopez-Paz, D., & Synnaeve, G. (2024). Better & Faster Large Language Models via Multi-token Prediction. arXiv preprint arXiv:2404.19737 [arXiv]
Lew, A. K., Zhi-Xuan, T., Grand, G., & Mansinghka, V. K. (2023). Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs. arXiv preprint arXiv:2306.03081 [arXiv]
Qin, R., Li, Z., He, W., Zhang, M., Wu, Y., Zheng, W., & Xu, X. (2024). Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv preprint arXiv:2407.00079 [arXiv]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2023). Attention Is All You Need. arXiv preprint arXiv:1706.03762 [arXiv]