Often consists of a single MLP layer (a linear map followed by a ReLU for the encoder, plus a linear decoder), trained on a subset of the data the main LLM was trained on.
empirical example: if we wish to interpret all features related to the author Camus, we might train an SAE on all available text by Camus to interpret the “similar” features in Llama-3.1
definition
We wish to decompose a model's activation $x \in \mathbb{R}^n$ into a sparse, linear combination of feature directions:
$$x \approx x_0 + \sum_{i=1}^{M} f_i(x)\, d_i$$
where $d_i$: latent unit-norm feature directions ($M \gg n$), and $f_i(x) \ge 0$: the corresponding feature activation for $x$.
Thus, the baseline architecture of SAEs is a linear autoencoder with an L1 penalty on the activations.
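One standard instantiation (the parameter names $W_{\text{enc}}$, $W_{\text{dec}}$, $b_{\text{enc}}$, $b_{\text{dec}}$ are assumed here, following common SAE write-ups rather than this text):

$$f(x) = \operatorname{ReLU}(W_{\text{enc}}\, x + b_{\text{enc}}), \qquad \hat{x} = W_{\text{dec}}\, f(x) + b_{\text{dec}}$$

where the columns of $W_{\text{dec}}$ play the role of the unit-norm feature directions $d_i$ above.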
We want high reconstruction fidelity at a given sparsity level, as measured by L0; in practice this trade-off is handled via a mixture of a reconstruction loss and L1 regularization.
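One common (assumed, not from this text) way to track that trade-off is to measure the average number of active latents (L0) alongside the fraction of variance the reconstruction leaves unexplained; a minimal sketch in PyTorch:

```python
import torch

def sae_metrics(x: torch.Tensor, x_hat: torch.Tensor, f: torch.Tensor):
    """Sparsity / fidelity metrics for a batch of activations.

    x:     original activations,     shape (batch, d)
    x_hat: SAE reconstructions,      shape (batch, d)
    f:     feature activations f(x), shape (batch, M)
    """
    l0 = (f != 0).float().sum(dim=-1).mean()  # avg. number of active features
    fvu = ((x_hat - x) ** 2).sum() / ((x - x.mean(dim=0)) ** 2).sum()  # fraction of variance unexplained
    return l0.item(), fvu.item()
```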
The sparsity loss term can be reduced without affecting reconstruction simply by scaling up the norms of the decoder weights (and scaling down the activations to match); this is prevented by constraining the norms of the columns of $W_{\text{dec}}$ during training.
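A minimal sketch of this constraint, assuming the decoder is stored as a matrix whose columns are the feature directions (the function name and shapes are illustrative assumptions):

```python
import torch

@torch.no_grad()
def renormalize_decoder_columns(W_dec: torch.Tensor, eps: float = 1e-8) -> None:
    """Project each decoder column (feature direction d_i) back to unit norm.

    W_dec has shape (d, M): one column per latent feature. Typically called
    after every optimizer step, so the L1 term cannot be reduced by rescaling.
    """
    norms = W_dec.norm(dim=0, keepdim=True).clamp_min(eps)
    W_dec.div_(norms)
```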
Idea: the output of the encoder, $f(x)$, has two roles
detects which features are active ⇐ L1 is crucial to ensure sparsity in the decomposition
estimates the magnitudes of active features ⇐ L1 is an unwanted bias
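As a rough sketch of how these two roles can be decoupled, in the spirit of Gated SAEs (Rajamanoharan et al., 2024), one path can decide which features are on while a separate path, untouched by the L1 penalty, estimates their magnitudes (the parametrization below is a simplification; the weight sharing and auxiliary loss of the original are omitted):

```python
import torch
import torch.nn as nn

class GatedEncoderSketch(nn.Module):
    """Toy separation of 'which features are active' from 'how strong they are'."""

    def __init__(self, d: int, m: int):
        super().__init__()
        self.gate = nn.Linear(d, m)  # detection path: receives the L1 penalty
        self.mag = nn.Linear(d, m)   # magnitude path: no L1 penalty

    def forward(self, x: torch.Tensor):
        gate_pre = self.gate(x)
        active = (gate_pre > 0).float()      # which features fire
        magnitude = torch.relu(self.mag(x))  # how strongly they fire
        f = active * magnitude               # feature activations used for reconstruction
        sparsity_penalty = torch.relu(gate_pre).sum(dim=-1).mean()  # L1 applied to the gate only
        return f, sparsity_penalty
```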
The loss function of SAEs combines an MSE reconstruction loss with a sparsity term:
$$\mathcal{L}(x, f(x), \hat{x}) = \|\hat{x} - x\|_2^2 / d + c\,\|f(x)\|_1$$
where $d$: dimensionality of $x$, and $\hat{x}$: the SAE reconstruction of $x$.
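A minimal sketch of this loss in code, assuming the sparsity term is the L1 norm of the feature activations and the reconstruction term is the squared error divided by the dimensionality:

```python
import torch

def sae_loss(x: torch.Tensor, x_hat: torch.Tensor, f: torch.Tensor, c: float) -> torch.Tensor:
    """Reconstruction + sparsity loss, averaged over the batch.

    x:     original activations, shape (batch, d)
    x_hat: reconstructions,      shape (batch, d)
    f:     feature activations,  shape (batch, M)
    c:     sparsity coefficient
    """
    d = x.shape[-1]
    recon = ((x_hat - x) ** 2).sum(dim=-1).mean() / d  # ||x_hat - x||^2 / d
    sparsity = f.abs().sum(dim=-1).mean()              # ||f(x)||_1
    return recon + c * sparsity
```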
The reconstruction is therefore not perfect: only the first term rewards reconstruction, and the sparsity term favours smaller values of $f(x)$, so feature activations will be suppressed.
illustrative example
consider one binary feature in one dimension: $x = 1$ with probability $p$ and $x = 0$ otherwise. Ideally, the optimal SAE would extract a feature activation $f(x) \in \{0, 1\}$ and have decoder weight $W_d = 1$.
However, suppose we train the SAE by optimizing the loss $\mathcal{L}(x, f(x), \hat{x})$ above, and say the encoder outputs a feature activation $a$ when $x = 1$ and $0$ otherwise; ignoring the bias terms, the optimization problem can be written out as below.
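Reconstructing that objective from the loss defined above (taking $d = 1$, writing $W_d$ for the decoder weight, and noting that only the $x = 1$ case contributes to the expected loss), the problem is:

$$\min_{a \ge 0,\; W_d} \; p\left[(a\,W_d - 1)^2 + c\,a\right]$$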
If we hold the decoder $\hat{x}(\cdot)$ fixed, the L1 term pushes $f(x) \to 0$, while the reconstruction loss pushes $f(x)$ high enough to produce an accurate reconstruction. The optimal value lies somewhere in between.
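Concretely, taking $W_d = 1$ in the objective above and minimizing over $a$ (a worked step under the assumptions just stated):

$$\frac{d}{da}\left[(a - 1)^2 + c\,a\right] = 2(a - 1) + c = 0 \quad\Rightarrow\quad a^{*} = 1 - \frac{c}{2}$$

so the learned activation is shrunk below the true value of $1$ by $c/2$: this is the feature suppression the L1 penalty induces.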
However, rescaling the shrunk feature activations (Sharkey, 2024) is not necessarily enough to overcome the bias induced by L1: an SAE might have learnt sub-optimal encoder and decoder directions that are not improved by this fix.
Bibliography
Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language Modeling with Gated Convolutional Networks. arXiv preprint arXiv:1612.08083 [arXiv]
Erichson, N. B., Yao, Z., & Mahoney, M. W. (2019). JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks. arXiv preprint arXiv:1904.03750 [arXiv]
Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., Shah, R., & Nanda, N. (2024). Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv preprint arXiv:2404.16014 [arXiv]
Sharkey, L. (2024). Addressing Feature Suppression in SAEs. AI Alignment Forum. [alignment forum]