softmax

\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}

where y \in \mathbb{R}^k
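A minimal numerically stable sketch in plain C++ (function and variable names are mine); the usual max-subtraction keeps the exponentials from overflowing:

#include <algorithm>
#include <cmath>
#include <vector>

// softmax(y)_i = exp(y_i) / sum_j exp(y_j), assuming y is non-empty.
std::vector<float> softmax(const std::vector<float>& y) {
    float m = *std::max_element(y.begin(), y.end());
    std::vector<float> out;
    out.reserve(y.size());
    float sum = 0.0f;
    for (float v : y) {
        float e = std::exp(v - m);  // shift by the max so exp never overflows
        out.push_back(e);
        sum += e;
    }
    for (float& v : out) v /= sum;
    return out;
}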

exp()

Abdelkhalik et al. (2022): AMD's RDNA3 instruction set provides V_LDEXP_F32 for this.

Usually a lot better than computing 2**t directly, mostly for numerical stability reasons.

ARM even designed a dedicated instruction for it in its instruction set (FEXPA, shown below)!

pseudocode-exp-fexpa.cpp
// Pseudocode representing the computation flow; the helpers (round, fexpa,
// evaluate_polynomial) and constants are placeholders, and real SVE code
// would use svfloat32_t with ACLE intrinsics rather than NEON's float32x4_t.
float32x4_t exp_sve2(float32x4_t x) {
    // Step 1: Range reduction
    // N = round(x * log2(e)), r = x - N * ln(2)   [reduced argument]
    float32x4_t N = round(x * LOG2_E);
    float32x4_t r = x - N * LN_2;
 
    // Step 2: FEXPA instruction provides the 2^N approximation
    // In hardware: FEXPA Z0.S, Z1.S
    float32x4_t exp_approx = fexpa(N); // result of FEXPA
 
    // Step 3: Polynomial evaluation of exp(r) on the reduced argument,
    // typically Horner's method with reduced-precision coefficients since
    // we already start from a good approximation
    float32x4_t exp_r = evaluate_polynomial(r);
 
    // Step 4: Combine results: exp(x) = 2^N * exp(r)
    return exp_approx * exp_r;
}

Advantages of FEXPA:

  • single instruction latency for initial approximation
  • vectorized ops for batch processing

On GPUs we can use a bit shift (1 << x, for integer x) or CUDA's exp2.
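A one-line sketch of that identity on the host side (constant and names are mine); CUDA's exp2f device function maps onto the same idea:

#include <cmath>

// exp(x) = 2^(x * log2(e)); hardware exp2 is usually cheaper than exp.
inline float exp_via_exp2(float x) {
    constexpr float kLog2E = 1.4426950408889634f; // log2(e)
    return std::exp2(x * kLog2E);
}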

Optimization in llama.cpp: ggerganov/llama.cpp#7154

RoPE

sigmoid

\text{sigmoid}(x) = \frac{1}{1+e^{-x}}
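A small numerically stable sketch of my own; it splits on the sign of x so the exponential never overflows:

#include <cmath>

float sigmoid(float x) {
    // For x >= 0 use 1 / (1 + e^{-x}); for x < 0 rewrite as e^{x} / (1 + e^{x}).
    if (x >= 0.0f) return 1.0f / (1.0f + std::exp(-x));
    float e = std::exp(x);
    return e / (1.0f + e);
}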

ReLU

\text{FFN}(x, W_{1}, W_{2}, b_{1}, b_{2}) = \max(0, xW_{1}+b_{1})W_{2} + b_{2}

A version in T5 without bias:

\text{FFN}_\text{ReLU}(x,W_{1},W_{2}) = \max(xW_{1},0)W_{2}
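A compact sketch of the bias-free ReLU FFN above; the row-major layout and helper names (matvec, ffn_relu) are my own illustrative choices:

#include <algorithm>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // row-major: Mat[i][j], shape rows x cols

// y = x W, with x of length n and W of shape n x m
Vec matvec(const Vec& x, const Mat& W) {
    Vec y(W[0].size(), 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < y.size(); ++j)
            y[j] += x[i] * W[i][j];
    return y;
}

// FFN_ReLU(x, W1, W2) = max(x W1, 0) W2
Vec ffn_relu(const Vec& x, const Mat& W1, const Mat& W2) {
    Vec h = matvec(x, W1);
    for (float& v : h) v = std::max(v, 0.0f);
    return matvec(h, W2);
}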

Swish

Ramachandran et al. (2017) introduce Swish, an alternative to ReLU that works better on deeper models across different tasks.

f(x) = x \cdot \text{sigmoid}(\beta x), \quad \beta \text{ a constant parameter}
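Reusing the sigmoid sketch above, Swish is a one-liner; β = 1 recovers SiLU, the variant most frameworks ship:

// Swish: x * sigmoid(beta * x); beta = 1 gives SiLU.
float swish(float x, float beta = 1.0f) {
    return x * sigmoid(beta * x);
}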

Gated Linear Units and Variants

component-wise product of two linear transformations of the inputs, one of which is sigmoid-activated.

Shazeer (2020) introduces a few GLU variants that yield improvements in the Transformer architecture.

\begin{aligned} \text{GLU}(x,W,V,b,c) &= \sigma(xW+b) \otimes (xV+c) \\ \text{Bilinear}(x,W,V,b,c) &= (xW+b) \otimes (xV+c) \end{aligned}

Other GLU variants:

\begin{aligned} \text{ReGLU}(x,W,V,b,c) &= \max(0, xW+b) \otimes (xV+c) \\ \text{GEGLU}(x,W,V,b,c) &= \text{GELU}(xW+b) \otimes (xV+c) \\ \text{SwiGLU}(x,W,V,b,c) &= \text{Swish}_\beta(xW+b) \otimes (xV+c) \end{aligned}

The FFN for Transformer layers then becomes:

\begin{aligned} \text{FFN}_\text{GLU}(x,W,V,W_{2}) &= (\sigma (xW) \otimes xV)W_{2} \\ \text{FFN}_\text{Bilinear}(x,W,V,W_{2}) &= (xW \otimes xV)W_{2} \\ \text{FFN}_\text{ReGLU}(x,W,V,W_{2}) &= (\max(0, xW) \otimes xV)W_{2} \\ \text{FFN}_\text{GEGLU}(x,W,V,W_{2}) &= (\text{GELU}(xW) \otimes xV)W_{2} \\ \text{FFN}_\text{SwiGLU}(x,W,V,W_{2}) &= (\text{Swish}_\beta(xW) \otimes xV)W_{2} \end{aligned}

note: reduce the number of hidden units d_\text{ff} (the second dimension of W and V, and the first dimension of W_2) by a factor of 2/3 when comparing these layers, so the parameter count matches the original two-matrix FFN.
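A sketch of FFN_SwiGLU reusing matvec, Vec/Mat, and swish from the sketches above; shapes and names are mine, and d_ff here would be the 2/3-scaled width from the note:

// FFN_SwiGLU(x, W, V, W2) = (Swish_1(x W) ⊗ x V) W2
Vec ffn_swiglu(const Vec& x, const Mat& W, const Mat& V, const Mat& W2) {
    Vec gate = matvec(x, W);                 // gate branch, width d_ff
    Vec lin  = matvec(x, V);                 // linear branch, width d_ff
    Vec h(gate.size());
    for (std::size_t i = 0; i < h.size(); ++i)
        h[i] = swish(gate[i]) * lin[i];      // element-wise (⊗) gating
    return matvec(h, W2);                    // project back to d_model
}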

JumpReLU

Erichson et al. (2019) propose JumpReLU to improve robustness against adversarial examples.

Rajamanoharan et al. (2024) then apply this in Gated SAEs to improve reconstruction fidelity.

J(z) \coloneqq z H(z - \kappa) = \begin{cases} 0 & \text{if } z \leq \kappa \\ z & \text{if } z > \kappa \end{cases}

where H is the Heaviside step function and \kappa the threshold.
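As code this is just a thresholded identity (sketch is mine):

// JumpReLU: zero at or below the threshold kappa, identity above it.
float jump_relu(float z, float kappa) {
    return (z > kappa) ? z : 0.0f;
}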

momentum

See also SGD, Cornell’s CS6787, gradient descent

In the case of a quadratic function f(x) = \frac{1}{2} x^2, gradient descent gives x_{t+1} = x_t - \alpha x_t = (1-\alpha)x_t

Think of convergence rate

\mid x_{t+1} - 0 \mid = \mid 1 - \alpha \mid \mid x_t - 0 \mid

If we instead set a different curvature, f(x) = 2x^2, then x_{t+1} = x_t - 4 \alpha x_t = (1-4 \alpha)x_t

step size

step size depends on curvature for one-dimensional quadratics

more curvature means smaller ideal step size

how would this play out for general quadratics?

for a PSD symmetric A

f(x) = \frac{1}{2} x^T A x

gradient descent has the update step

x_{t+1} = x_t - \alpha A x_t = (I - \alpha A)x_t

convergence rate would be

\begin{aligned} \max_{x} \frac{\|(I - \alpha A)x\|}{\|x\|} &= \max_{x} \frac{1}{\|x\|} \left\| \left( I - \alpha \sum_{i=1}^{n} \lambda_i u_i u_i^T \right) x \right\| \\[8pt] &= \max_{x} \frac{\left\|\sum_{i=1}^{n} (1- \alpha \lambda_i) u_i u_i^T x\right\|}{\left\|\sum_{i=1}^{n} u_i u_i^T x\right\|} \\ &= \max_i \mid 1- \alpha \lambda_i \mid \\ &= \max(1-\alpha \lambda_{\text{min}}, \alpha \lambda_{\text{max}} - 1) \end{aligned}

optimal convergence rate

optimal value occurs when

1 - \alpha \lambda_{\text{min}} = \alpha \lambda_{\text{max}} - 1 \Rightarrow \alpha = \frac{2}{\lambda_{\text{max}} + \lambda_{\text{min}}}

with rate

\frac{\lambda_{\text{max}} - \lambda_{\text{min}}}{\lambda_{\text{max}} + \lambda_{\text{min}}}

We denote \kappa = \frac{\lambda_{\text{max}}}{\lambda_{\text{min}}} as the condition number of the matrix A, so the optimal rate can be rewritten as \frac{\kappa - 1}{\kappa + 1}.
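A tiny numerical check of the analysis above (my own sketch): run gradient descent on a diagonal quadratic with eigenvalues λ_min and λ_max at the optimal step size, and the per-step contraction settles at (κ − 1)/(κ + 1):

#include <cmath>
#include <cstdio>

int main() {
    const double lmin = 1.0, lmax = 10.0;        // eigenvalues of a diagonal A
    const double alpha = 2.0 / (lmax + lmin);    // optimal step size from above
    double x[2] = {1.0, 1.0};                    // coordinates along u_min, u_max

    for (int t = 0; t < 20; ++t) {
        double norm_before = std::hypot(x[0], x[1]);
        x[0] -= alpha * lmin * x[0];             // gradient of (1/2) x^T A x is A x
        x[1] -= alpha * lmax * x[1];
        double rate = std::hypot(x[0], x[1]) / norm_before;
        std::printf("step %2d  contraction %.4f\n", t, rate);
    }
    // Expected contraction: (kappa - 1) / (kappa + 1) = (10 - 1) / (10 + 1) ≈ 0.818
}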

poorly conditioned

Problems with larger condition numbers converge slower.

Intuitively, these are problems that are highly curved in some directions but flat in others.

Polyak

also known as the “heavy ball method”

idea: add an extra momentum term to gradient descent

x_{t+1} = x_t - \alpha \nabla f(x_t) + \beta (x_t - x_{t-1})

tl;dr: if the current gradient step is in the same direction as the previous step, move a little further in that direction
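A minimal sketch of the heavy-ball update on f(x) = x^2 (the function and constants are my own illustrative choices):

#include <cstdio>

int main() {
    const double alpha = 0.1, beta = 0.9;   // step size and momentum (illustrative)
    double x = 1.0, x_prev = 1.0;           // start at x_0 = x_{-1} = 1

    for (int t = 0; t < 200; ++t) {
        double grad = 2.0 * x;              // f(x) = x^2, so grad f(x) = 2x
        double x_next = x - alpha * grad + beta * (x - x_prev);
        x_prev = x;
        x = x_next;
    }
    std::printf("x after 200 steps: %g\n", x);  // approaches the minimum at 0
}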

Nesterov

See also paper, momentum

idea:

  • first take a step in the direction of accumulated momentum
  • compute the gradient at the “lookahead” position,
  • make the update using this gradient.

definition

For a parameter vector \theta, the update can be expressed as

\begin{aligned} v_t &= \mu v_{t-1} + \nabla L(\theta_t + \mu v_{t-1}) \\ \theta_{t+1} &= \theta_t - \alpha v_t \end{aligned}
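A toy sketch of this scheme on L(θ) = θ²/2, written in the velocity form popularized by Sutskever et al. (v holds the displacement and the step size is folded into the velocity update), which matches the three bullet points above; constants are illustrative:

#include <cstdio>

int main() {
    const double alpha = 0.1, mu = 0.9;        // step size and momentum (illustrative)
    double theta = 1.0, v = 0.0;               // parameter and velocity

    for (int t = 0; t < 100; ++t) {
        double lookahead = theta + mu * v;     // 1. step along accumulated momentum
        double grad = lookahead;               // 2. gradient at lookahead, L(x) = x^2 / 2
        v = mu * v - alpha * grad;             // 3. update velocity with that gradient
        theta = theta + v;                     //    ... and take the step
    }
    std::printf("theta after 100 steps: %g\n", theta);  // approaches the minimum at 0
}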

Achieves better convergence rates than plain gradient descent:

function type | gradient descent | Nesterov AG
Smooth | \Theta(\frac{1}{T}) | \Theta(\frac{1}{T^{2}})
Smooth & Strongly Convex | \Theta(\exp(-\frac{T}{\kappa})) | \Theta(\exp(-\frac{T}{\sqrt{\kappa}}))

optimal assignments for parameters

\alpha = \frac{1}{\lambda_{\text{max}}}, \quad \beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}