
by Apollo Research, introduction

Goal:

  • faithfulness: the decomposition should identify a set of components that sum to the parameters of the network
  • minimal: it should use as few components as possible to replicate the network’s behaviour on the training distribution
  • simple [1]: each component shouldn’t be computationally expensive

Bussmann et al. (2024, LessWrong) show that sparse dictionary learning does not surface canonical units of analysis for interpretability, suffers from reconstruction errors, and leaves feature geometry unexplained.

In a sense, it is unclear how to explain sparsely activating directions in activation space. Additionally, we don’t have a full construction of cross-layer features, which we would need to really understand what the network is doing [2].

They refer to the decomposed circuits as mechanisms [3], and frame the problem as “finding vectors within parameter space”:

Parameter components are trained for three things:

  • They sum to the original network’s parameters
  • As few as possible are needed to replicate the network’s behavior on any given datapoint in the training data
  • They are individually ‘simpler’ than the whole network.
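The first objective (components summing to the original parameters) can be illustrated with a minimal sketch. The 2x2 `target` matrix, the three `components`, and the mean-squared-error form of `faithfulness_loss` are assumptions chosen for illustration, not the authors' implementation.

```python
# Toy target weights for a single 2x2 layer (hypothetical values).
target = [[1.0, 2.0],
          [0.0, 3.0]]

# Three hypothetical parameter components, each with the target's shape.
components = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 2.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 3.0]],
]

def sum_components(components):
    """Add the components together entry by entry."""
    rows, cols = len(components[0]), len(components[0][0])
    return [[sum(c[i][j] for c in components) for j in range(cols)]
            for i in range(rows)]

def faithfulness_loss(target, components):
    """Mean squared difference between the target and the summed components."""
    total = sum_components(components)
    squared = [(t - s) ** 2
               for t_row, s_row in zip(target, total)
               for t, s in zip(t_row, s_row)]
    return sum(squared) / len(squared)

print(faithfulness_loss(target, components))  # 0.0 for this toy example
```

A loss of zero means the components add up exactly to the target; any mismatch in a single entry raises the loss.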

Important

We can determine which parameters are actually used during a given forward pass via attribution (most of them are redundant for that input!)
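As a rough sketch of how attribution could pick out the parameters in use, consider a single linear layer f(x) = W x, where the gradient of output o with respect to W[i][j] is x[j] when o == i and zero otherwise. Weighting a component's entries by that gradient and summing therefore reduces to the matrix-vector product of the component with x. The root-mean-square aggregation over outputs and the top-k selection below are assumptions for illustration, and `attribution` and `top_k_components` are hypothetical helper names.

```python
import math

def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in matrix]

def attribution(component, x):
    """Gradient-weighted attribution of one component for f(x) = W x.

    Because d f_o / d W_ij = x_j if o == i else 0, the gradient-weighted
    sum over the component's entries equals (component @ x)_o. The scores
    for all outputs are aggregated with a root mean square (an assumption).
    """
    contributions = matvec(component, x)
    return math.sqrt(sum(v * v for v in contributions) / len(contributions))

def top_k_components(components, x, k):
    """Return indices of the k highest-attribution components and all scores."""
    scores = [attribution(c, x) for c in components]
    ranked = sorted(range(len(components)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Hypothetical components of a 2x2 layer and a single input.
components = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 2.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 3.0]],
]
x = [1.0, 0.0]

active, scores = top_k_components(components, x, k=1)
print(active, scores)
```

For this input only the first component touches a nonzero coordinate of x, so it is the only one with a nonzero score; the other two are redundant on this datapoint.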

Figure 1: Decomposition of parameters, or APD

Remarks

  1. Simple means that each component spans as few ranks and as few layers as possible.

  2. Sparse crosscoders can address this, but they eventually run into reconstruction errors, because they reconstruct features through a learned mapping rather than interpreting the activation space directly.

  3. ‘Circuit’ makes it sound a bit like the structures in question involve many moving parts, but in constructions such as those discussed in (Hänni et al., 2024) and mathematical framework for superposition, a part of the network algorithm can be as small as a single isolated logic gate or query-key lookup.
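The notion of simplicity in remark 1 (few ranks, few layers) can be made concrete with a crude count. The sketch below is a discrete proxy for illustration only, not necessarily the penalty used in training; `matrix_rank`, `complexity`, and the two example components are hypothetical.

```python
def matrix_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank

def complexity(component):
    """Return (total rank across layers, number of layers with nonzero rank)."""
    ranks = {name: matrix_rank(w) for name, w in component.items()}
    layers_used = sum(1 for r in ranks.values() if r > 0)
    return sum(ranks.values()), layers_used

# A component confined to one rank-1 matrix in one layer.
simple_component = {
    "layer0": [[1.0, 2.0], [2.0, 4.0]],
    "layer1": [[0.0, 0.0], [0.0, 0.0]],
}
# A component that is full rank in both layers.
complex_component = {
    "layer0": [[1.0, 0.0], [0.0, 1.0]],
    "layer1": [[0.0, 1.0], [1.0, 0.0]],
}

print(complexity(simple_component))
print(complexity(complex_component))
```

Under this proxy, a component that is rank 1 and lives in a single layer counts as simpler than one that is full rank across two layers.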