softmax
$$\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}}, \quad \text{where } y \in \mathbb{R}^k$$
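A minimal C++ sketch of computing this; the shift by $\max(y)$ (standard practice, not part of the formula above) leaves the result unchanged but keeps the exponents from overflowing:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// softmax(y)_i = exp(y_i - max(y)) / sum_j exp(y_j - max(y))
// The shift by max(y) cancels in the ratio but keeps every exponent <= 0.
std::vector<float> softmax(const std::vector<float>& y) {
    float m = *std::max_element(y.begin(), y.end());
    std::vector<float> out(y.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < y.size(); ++i) {
        out[i] = std::exp(y[i] - m);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}
```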
exp()
Abdelkhalik et al. (2022) microbenchmark NVIDIA's hardware instructions; AMD's RDNA3 instruction set provides V_LDEXP_F32
These are usually much better than computing 2**t directly, simply for numerical stability reasons
ARM designed a special instruction for it: FEXPA (SVE)!
```cpp
// pseudocode-exp-fexpa.cpp
// Pseudocode representing the computation flow
// (round, fexpa, LOG2_E, LN2, evaluate_polynomial are placeholders):
float32x4_t exp_sve2(float32x4_t x) {
    // Step 1: Range reduction
    //   N = round(x * log2(e)),  r = x - N * ln(2)   [reduced argument]
    float32x4_t N = round(x * LOG2_E);
    float32x4_t r = x - N * LN2;

    // Step 2: FEXPA instruction provides the 2^N approximation
    //   In hardware: FEXPA Z0.S, Z1.S
    float32x4_t exp_approx = fexpa(N);

    // Step 3: Polynomial evaluation of exp(r), typically Horner's method
    //   with reduced-precision coefficients, since FEXPA already supplies
    //   a good starting approximation
    float32x4_t exp_r = evaluate_polynomial(r);

    // Step 4: Combine results: exp(x) = 2^N * exp(r)
    return exp_approx * exp_r;
}
```
Advantages of FEXPA:
single-instruction latency for the initial approximation
vectorized ops for batch processing
On GPUs we can use a bit shift (1 << x) to build integer powers of two
or CUDA’s exp2
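Both tricks rest on the identity $e^{x} = 2^{x \log_2 e}$. A host-side C++ sketch, with `std::exp2` standing in for whatever fast $2^t$ primitive the hardware provides:

```cpp
#include <cmath>
#include <cstdio>

// e^x = 2^(x * log2(e)): one multiply plus whatever fast 2^t primitive
// the hardware offers (FEXPA, V_LDEXP_F32, exp2, ...).
float exp_via_exp2(float x) {
    const float kLog2e = 1.4426950408889634f; // log2(e)
    return std::exp2(x * kLog2e);
}

int main() {
    std::printf("exp_via_exp2(1.5) = %f, std::exp(1.5) = %f\n",
                exp_via_exp2(1.5f), std::exp(1.5f));
    return 0;
}
```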
Optimization in llama.cpp: ggerganov/llama.cpp#7154
RoPE
(Su et al., 2023)
sigmoid
$$\text{sigmoid}(x) = \frac{1}{1+e^{-x}}$$
ReLU
$$\text{FFN}(x, W_{1}, W_{2}, b_{1}, b_{2}) = \max(0, xW_{1}+b_{1})W_{2} + b_{2}$$
A version in T5 without bias:
$$\text{FFN}_\text{ReLU}(x, W_{1}, W_{2}) = \max(xW_{1}, 0)W_{2}$$
Swish
Ramachandran et al. (2017) introduce an alternative to ReLU that works better on deeper models across different tasks.
$$f(x) = x \cdot \text{sigmoid}(\beta x) \quad\text{where } \beta \text{ is a constant parameter}$$
Gated Linear Units and Variants
component-wise product of two linear transformations of the inputs, one of which is sigmoid-activated.
Shazeer (2020) introduces GLU variants that yield improvements in the Transformer architecture.
$$\begin{aligned}
\text{GLU}(x,W,V,b,c) &= \sigma(xW+b) \otimes (xV+c) \\
\text{Bilinear}(x,W,V,b,c) &= (xW+b) \otimes (xV+c)
\end{aligned}$$
Other GLU variants:
$$\begin{aligned}
\text{ReGLU}(x,W,V,b,c) &= \max(0, xW+b) \otimes (xV+c) \\
\text{GEGLU}(x,W,V,b,c) &= \text{GELU}(xW+b) \otimes (xV+c) \\
\text{SwiGLU}(x,W,V,b,c) &= \text{Swish}_\beta(xW+b) \otimes (xV+c)
\end{aligned}$$
The FFN in Transformer layers then becomes:
$$\begin{aligned}
\text{FFN}_\text{GLU}(x,W,V,W_{2}) &= (\sigma (xW) \otimes xV)W_{2} \\
\text{FFN}_\text{Bilinear}(x,W,V,W_{2}) &= (xW \otimes xV)W_{2} \\
\text{FFN}_\text{ReGLU}(x,W,V,W_{2}) &= (\max(0, xW) \otimes xV)W_{2} \\
\text{FFN}_\text{GEGLU}(x,W,V,W_{2}) &= (\text{GELU}(xW) \otimes xV)W_{2} \\
\text{FFN}_\text{SwiGLU}(x,W,V,W_{2}) &= (\text{Swish}_\beta(xW) \otimes xV)W_{2}
\end{aligned}$$
note: reduce the number of hidden units $d_\text{ff}$ (the second dimension of $W$ and $V$, and the first dimension of $W_{2}$) by a factor of $\frac{2}{3}$ when comparing these layers, so that the three-matrix GLU FFN keeps the same parameter count as the original two-matrix FFN
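A minimal C++ sketch of $\text{FFN}_\text{SwiGLU}$ from the list above, with naive loops; the dimensions, helper names, and $\beta = 1$ are illustrative choices, not from any particular implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // row-major: Mat[i][j] = entry (i, j)

// y = x * M, where x has length d_in and M has shape d_in x d_out
Vec matvec(const Vec& x, const Mat& M) {
    Vec y(M[0].size(), 0.0f);
    for (std::size_t i = 0; i < M.size(); ++i)
        for (std::size_t j = 0; j < y.size(); ++j)
            y[j] += x[i] * M[i][j];
    return y;
}

// Swish with beta = 1 (illustrative choice)
float swish(float v) { return v / (1.0f + std::exp(-v)); }

// FFN_SwiGLU(x, W, V, W2) = (Swish_beta(xW) ⊗ xV) W2
// W, V: d_model x d_ff;  W2: d_ff x d_model
Vec ffn_swiglu(const Vec& x, const Mat& W, const Mat& V, const Mat& W2) {
    Vec gate = matvec(x, W);   // Swish-activated path
    Vec val  = matvec(x, V);   // linear path
    for (std::size_t j = 0; j < gate.size(); ++j)
        gate[j] = swish(gate[j]) * val[j]; // element-wise (⊗) gating
    return matvec(gate, W2);
}
```

With $d_\text{ff}$ scaled by $\frac{2}{3}$, the three matrices $W, V, W_{2}$ hold roughly the same number of parameters as the two matrices of $\text{FFN}_\text{ReLU}$.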
JumpReLU
Erichson et al. (2019) propose JumpReLU to improve robustness against adversarial examples.
Rajamanoharan et al. (2024) then apply this to improve reconstruction fidelity in sparse autoencoders (JumpReLU SAEs).
$$J(z) \coloneqq z\,H(z - \kappa) = \begin{cases} 0 & \text{if } z \leq \kappa \\ z & \text{if } z > \kappa \end{cases}$$
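Here $H$ is the Heaviside step function and $\kappa$ the threshold: values at or below $\kappa$ are zeroed, larger values pass through unchanged (hence the jump at $\kappa$). A one-line C++ sketch:

```cpp
// JumpReLU: pass z through only when it clears the threshold kappa.
float jump_relu(float z, float kappa) {
    return (z > kappa) ? z : 0.0f;
}
```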
momentum
See also: SGD, Cornell’s CS6787, gradient descent
In the case of a quadratic $f(x) = \frac{1}{2} x^2$, the gradient-descent update is $x_{t+1} = x_t - \alpha x_t = (1-\alpha)x_t$
Think of the convergence rate: $\mid x_{t+1} - 0 \mid = \mid 1 - \alpha \mid \mid x_t - 0 \mid$
With a different curvature, e.g. $f(x) = 2x^2$, we get $x_{t+1} = x_t - 4 \alpha x_t = (1-4 \alpha)x_t$
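Concretely, convergence for $f(x) = 2x^2$ requires $|1 - 4\alpha| < 1$, i.e. $\alpha < \frac{1}{2}$, whereas $f(x) = \frac{1}{2}x^2$ tolerates any $\alpha < 2$.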
step size depends on curvature for one-dimensional quadratics
more curvature means smaller ideal step size
how would this play out for general quadratics?
for PSD symmetric $A$:
$$f(x) = \frac{1}{2} x^T Ax$$
gradient descent has the update step
$$x_{t+1} = x_t - \alpha A x_t = (I - \alpha A)x_t$$
and, writing $A = \sum_{i=1}^{n} \lambda_i u_i u_i^T$ in its eigendecomposition, the worst-case convergence rate is
$$\begin{aligned}
\max_{x} \frac{\|(I - \alpha A)x\|}{\|x\|} &= \max_{x} \frac{1}{\|x\|} \left\| \left( I - \alpha \sum_{i=1}^{n} \lambda_i u_i u_i^T \right) x \right\| \\[8pt]
&= \max_{x} \frac{\left\|\sum_{i=1}^{n} (1- \alpha \lambda_i) u_i u_i^T x\right\|}{\left\|\sum_{i=1}^{n} u_i u_i^T x\right\|} \\
&= \max_i \mid 1- \alpha \lambda_i \mid \\
&= \max(1-\alpha \lambda_{\text{min}},\ \alpha \lambda_{\text{max}} - 1)
\end{aligned}$$
optimal value occurs when
$$1 - \alpha \lambda_{\text{min}} = \alpha \lambda_{\text{max}} - 1 \Rightarrow \alpha = \frac{2}{\lambda_{\text{max}} + \lambda_{\text{min}}}$$
with rate
$$\frac{\lambda_{\text{max}} - \lambda_{\text{min}}}{\lambda_{\text{max}} + \lambda_{\text{min}}}$$
We denote $\kappa = \frac{\lambda_{\text{max}}}{\lambda_{\text{min}}}$ as the condition number of matrix $A$.
Problems with larger condition numbers converge slower.
Intuitively, these are problems that are highly curved in some directions but flat in others.
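For example, with $\lambda_{\text{max}} = 100$ and $\lambda_{\text{min}} = 1$ (so $\kappa = 100$), the optimal rate is $\frac{99}{101} \approx 0.98$: each gradient step removes only about 2% of the remaining error.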
Polyak
also known as the “heavy ball method”
idea: add an extra momentum term to gradient descent
$$x_{t+1} = x_t - \alpha \nabla f(x_t) + \beta (x_t - x_{t-1})$$
tl;dr: if the current gradient step is in the same direction as the previous step, move a little further in that direction
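A minimal C++ sketch of the heavy-ball update on a toy 1D quadratic; the objective, $\alpha$, and $\beta$ are illustrative choices:

```cpp
#include <cstdio>

// Heavy-ball step: x_{t+1} = x_t - alpha * grad(x_t) + beta * (x_t - x_{t-1})
int main() {
    auto grad = [](double x) { return 4.0 * x; }; // gradient of f(x) = 2x^2
    const double alpha = 0.1, beta = 0.9;
    double x_prev = 1.0, x = 1.0; // x_{-1} = x_0 = 1
    for (int t = 0; t < 100; ++t) {
        const double x_next = x - alpha * grad(x) + beta * (x - x_prev);
        x_prev = x;
        x = x_next;
    }
    std::printf("x after 100 steps: %g\n", x);
    return 0;
}
```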
momentum for 1D quadratics
$$f(x) = \frac{\lambda}{2} x^{2}$$
momentum GD gives
$$\begin{aligned}
x_{t+1} &= x_t - \alpha \lambda x_t + \beta (x_t - x_{t-1}) \\
&= (1+\beta - \alpha \lambda) x_t - \beta x_{t-1}
\end{aligned}$$
characterizing momentum:
start with $x_{t+1} = (1+\beta -\alpha \lambda) x_t - \beta x_{t-1}$
trick: let $x_t = \beta^{t/2}z_t$
$$z_{t+1} = \frac{1 + \beta - \alpha \lambda}{\sqrt{\beta}} z_t - z_{t-1}$$
let $u = \frac{1+\beta -\alpha \lambda}{2 \sqrt{\beta}}$, then
$$z_{t+1} = 2 u z_t - z_{t-1}$$
so $z_t$ is a degree-$t$ polynomial in $u$ (this is the same recurrence that generates the Chebyshev polynomials)
Nesterov
See also: paper, momentum
idea:
first take a step in the direction of the accumulated momentum
compute the gradient at the “lookahead” position,
make the update using this gradient.
For a parameter vector $\theta$, the update can be expressed as
$$\begin{aligned}
v_t &= \mu v_{t-1} + \nabla L(\theta_t - \alpha \mu v_{t-1}) \\
\theta_{t+1} &= \theta_t - \alpha v_t
\end{aligned}$$
where $\theta_t - \alpha \mu v_{t-1}$ is the lookahead point that the momentum term alone would carry $\theta_t$ to.
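A minimal C++ sketch of this update on the same toy quadratic; again the objective, $\mu$, and $\alpha$ are illustrative:

```cpp
#include <cstdio>

// Nesterov: evaluate the gradient at the lookahead point theta - alpha*mu*v,
// then v <- mu*v + grad(lookahead) and theta <- theta - alpha*v.
int main() {
    auto grad = [](double th) { return 4.0 * th; }; // gradient of f(theta) = 2*theta^2
    const double mu = 0.9, alpha = 0.1;
    double theta = 1.0, v = 0.0;
    for (int t = 0; t < 100; ++t) {
        const double lookahead = theta - alpha * mu * v; // where momentum alone would land
        v = mu * v + grad(lookahead);
        theta -= alpha * v;
    }
    std::printf("theta after 100 steps: %g\n", theta);
    return 0;
}
```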
Achieves better convergence rates:

| function type | gradient descent | Nesterov AG |
| --- | --- | --- |
| Smooth | $\Theta(\frac{1}{T})$ | $\Theta(\frac{1}{T^{2}})$ |
| Smooth & Strongly Convex | $\Theta(\exp(-\frac{T}{\kappa}))$ | $\Theta(\exp(-\frac{T}{\sqrt{\kappa}}))$ |
optimal assignments for parameters
$$\alpha = \frac{1}{\lambda_{\text{max}}}, \quad \beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$$
Bibliography
Abdelkhalik, H., Arafa, Y., Santhi, N., & Badawy, A.-H. (2022). Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis. arXiv preprint arXiv:2208.11174.
Erichson, N. B., Yao, Z., & Mahoney, M. W. (2019). JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks. arXiv preprint arXiv:1904.03750.
Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., & Nanda, N. (2024). Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv preprint arXiv:2407.14435.
Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv preprint arXiv:1710.05941.
Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202.
Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2023). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.