LLMs

large language models, often implemented as autoregressive transformers models.

GPTs and friends

Most variants of LLMs are decoder-only (Radford et al., 2019)

Have “capabilities” to understand natural language.

Exhibits emergent behaviour of intelligence, but probably not AGI due to observer-expectancy effect.

One way or another is a form of behaviourism, through reinforcement learning. It is being “told” what is good or bad, and thus act accordingly towards the users. However, this induces confirmation bias where one aligns and contains his/her prejudices towards the problem.

Scalability

Incredibly hard to scale, mainly due to their large memory footprint and tokens memory allocation.

Optimization

I did a talk at HackTheNorth 2023 on this topic and rationale behind building OpenLLM

Quantization: reduce computational and memory costs of running inference with representing the weight and activations with low-precision data type
Continuous batching: Implementing Paged Attention with custom scheduler to manage swapping kv-cache for better resource utilisation
Different Attention variants, for better kernels and hardware optimisation (Think of Flash Attention 3, Radix Attention, TreeAttention, etc.)
Byte-Latent Transformer: idea to use entropy-based sampling to choose next tokens instead of token-level decoding. ¹

on how we are being taught.

How would we assess thinking?

Similar to calculator, it simplifies and increase accessibility to the masses, but in doing so lost the value in the action of doing math.

We do math to internalise the concept, and practice to thinking coherently. Similarly, we write to help crystalised our ideas, and in the process improve through the act of putting it down.

The process of rephrasing and arranging sentences poses a challenges for the writer, and in doing so, teach you how to think coherently. Writing essays is an exercise for students to articulate their thoughts, rather than testing the understanding of the materials.

on ethics

as philosophical tool.

To create a better representations of the world for both humans and machines to understand, we can truly have assistive tools to enhance our understanding of the world surround us

AI generated content

Don’t shit where you eat, Garbage in, garbage out. The quality of the content is highly dependent on the quality of the data it was trained on, or model are incredibly sensitive to data variances and biases.

Bland doublespeak

Here's a real problem though. Most people find writing hard and will get AIs to do it for them whenever they can get away with it. Which means bland doublespeak will become the default style of writing. Ugh.
— Paul Graham (@paulg) 25 février 2024

machine-assisted writings

source: creative fiction with GPT-3

Idea: use sparse autoencoders to guide ideas generations

Good-enough

"How did we get AI art before self-driving cars?" IMHO this is the single best heuristic for predicting the speed at which certain AI advances will happen. pic.twitter.com/yAo6pwEsxD
— Joshua Achiam (@jachiam0) 1 décembre 2022

This only occurs if you only need a “good-enough” item where value outweighs the process.

However, one should always consider to put in the work, rather than being “ok” with good enough. In the process of working through a problem, one will learn about bottleneck and problems to be solved, which in turn gain invaluable experience otherwise would not achieved if one fully relies on the interaction with the models alone.

as search

These models are incredibly useful for summarization and information gathering. With the taxonomy of RAG or any other CoT tooling, you can pretty much augment and produce and improve search-efficiency bu quite a lot.

notable mentions:

perplexity.ai: RAG-first search engine
explorer.globe.engineer: tree-based information retrieval
Exa labs

Programming

Overall should be a net positive, but it’s a double-edged sword.

as end-users

Source

I think it’s likely that soon all computer users will have the ability to develop small software tools from scratch, and to describe modifications they’d like made to software they’re already using

as developers

Tool that lower of barrier of entry is always a good thing, but it often will lead to probably even higher discrepancies in quality of software

Increased in productivity, but also increased in technical debt, as these generated code are mostly “bad” code, and often we have to nudge and do a lot of prompt engineering.

Truthfulness

Preference data to train against dense (model)

Judges → evaluator models

Dense models ← (reasoning)

UX → how to get in front of users?

Data provenance and governance ?

You don’t need frontier lab resources for frontier lab automated LLM evaluation.

To prove this, we’re open-sourcing j1-nano and j1-micro: two absurdly tiny (600M & 1.7B parameters) but mighty reward models competitive with orders-of-magnitude larger peers.

j1-nano and j1-micro… pic.twitter.com/HvMLJMKpvF
— Leonard Tang (@leonardtang_) 27 mai 2025

outperform opus and gpt-4o

youtube/v=veT2VI4vHyU, initial exploration, glossary

The subfield of alignment that delves into reverse engineering of a neural network, especially LLMs

To attack the curse of dimensionality, the question remains: how do we hope to understand a function over such a large space, without an exponential amount of time? ²

Topics:

open problems

Sharkey et al. (2025)

differentiate between “reverse engineering” versus “concept-based”
- reverse engineer:
  - decomposition → hypotheses → validation
    - Decomposition via dimensionality reduction
- drawbacks with SDL:
  - SDL reconstruction error are way too high (Rajamanoharan et al., 2024, p. see section 2.3)
  - SDL assumes linear representation hypothesis against non-linear feature space.
  - SDL leaves feature geometry unexplained

inference

Application in the wild: Goodfire and Transluce

How we would do inference with SAE?

Quick 🧵 and some of quick introspection into how they might run inference https://t.co/1N3JrxFHSp
— aaron (@aarnphm_) 25 septembre 2024

idea: treat SAEs as a logit bias, similar to guided decoding

steering

refers to the process of manually modifying certain activations and hidden state of the neural net to influence its outputs

For example, the following is a toy example of how a decoder-only transformers (i.e: GPT-2) generate text given the prompt “The weather in California is”

flowchart LR
  A[The weather in California is] --> B[H0] --> D[H1] --> E[H2] --> C[... hot]

To steer to model, we modify $H_2$ layers with certain features amplifier with scale 20 (called it $H_{3}$ )³

flowchart LR
  A[The weather in California is] --> B[H0] --> D[H1] --> E[H3] --> C[... cold]

One usually use techniques such as sparse autoencoders to decompose model activations into a set of interpretable features.

For feature ablation, we observe that manipulation of features activation can be strengthened or weakened to directly influence the model’s outputs

example: Panickssery et al. (2024) uses contrastive activation additions to steer Llama 2

contrastive activation additions

intuition: using a contrast pair for steering vector additions at certain activations layers

Uses mean difference which produce difference vector similar to PCA:

Given a dataset $\mathcal{D}$ of prompt $p$ with positive completion $c_p$ and negative completion $c_n$ , we calculate mean-difference $v_\text{MD}$ at layer $L$ as follow:

v_\text{MD} = \frac{1}{\mid \mathcal{D} \mid} \sum_{p,c_p,c_n \in \mathcal{D}} a_L(p,c_p) - a_L(p, c_n)

implication

by steering existing learned representations of behaviors, CAA results in better out-of-distribution generalization than basic supervised finetuning of the entire model.

superposition hypothesis

tl/dr

phenomena when a neural network represents more than $n$ features in a $n$ -dimensional space

Linear representation of neurons can represent more features than dimensions. As sparsity increases, model use superposition to represent more features than dimensions.

neural networks “want to represent more features than they have neurons”.

When features are sparsed, superposition allows compression beyond what linear model can do, at a cost of interference that requires non-linear filtering.

reasoning: “noisy simulation”, where small neural networks exploit feature sparsity and properties of high-dimensional spaces to approximately simulate much larger much sparser neural networks

In a sense, superposition is a form of lossy compression

importance

sparsity: how frequently is it in the input?
importance: how useful is it for lowering loss?

over-complete basis

reasoning for the set of $n$ directions ⁴

features

A property of an input to the model

When we talk about features (Elhage et al., 2022, p. see “Empirical Phenomena”), the theory building around several observed empirical phenomena:

Word Embeddings: have direction which corresponding to semantic properties (Mikolov et al., 2013). For example:
```
V(king) - V(man) = V(monarch)
```
Latent space: similar vector arithmetics and interpretable directions have also been found in generative adversarial network.

We can define features as properties of inputs which a sufficiently large neural network will reliably dedicate a neuron to represent (Elhage et al., 2022, p. see “Features as Direction”)

ablation

refers to the process of removing a subset of a model’s parameters to evaluate its predictions outcome.

idea: deletes one activation of the network to see how performance on a task changes.

zero ablation or pruning: Deletion by setting activations to zero
mean ablation: Deletion by setting activations to the mean of the dataset
random ablation or resampling

residual stream

figure1: Residual stream illustration

intuition: we can think of residual as highway networks, in a sense portrays linearity of the network ⁵

residual stream $x_{0}$ has dimension $\mathit{(C,E)}$ where

$\mathit{C}$ : the number of tokens in context windows and
$\mathit{E}$ : embedding dimension.

Attention mechanism $\mathit{H}$ process given residual stream $x_{0}$ as the result is added back to $x_{1}$ :

x_{1} = \mathit{H}{(x_{0})} + x_{0}

grokking

See also: writeup, code, circuit threads

A phenomena discovered by (Power et al., 2022) where small algorithmic tasks like modular addition will initially memorise training data, but after a long time ti will suddenly learn to generalise to unseen data

empirical claims

related to phase change

Bibliographie

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread. [transformer circuit]
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385 [arXiv]
Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. In L. Vanderwende, H. Daumé III, & K. Kirchhoff (Eds.), Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 746–751). Association for Computational Linguistics. https://aclanthology.org/N13-1090
Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., & Turner, A. M. (2024). Steering Llama 2 via Contrastive Activation Addition. arXiv preprint arXiv:2312.06681 [arXiv]
Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint arXiv:2201.02177 [arXiv]
Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., Shah, R., & Nanda, N. (2024). Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv preprint arXiv:2404.16014 [arXiv]
Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., Goldowsky-Dill, N., Heimersheim, S., Ortega, A., Bloom, J., Biderman, S., Garriga-Alonso, A., Conmy, A., Nanda, N., Rumbelow, J., Wattenberg, M., Schoots, N., Miller, J., Michaud, E. J., … McGrath, T. (2025). Open Problems in Mechanistic Interpretability. arXiv preprint arXiv:2501.16496 [arXiv]
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway Networks. arXiv preprint arXiv:1505.00387 [arXiv]
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners.

Think of decoding each text into dynamic patches, and thus actually improving inference efficiency. See also link ↩
good read from Lawrence C for ambitious mech interp. ↩
An example steering function can be:
$H_{3} = H_{2} + \text{steering\_strength} * \text{SAE}.W_{\text{dec}}[20] * \text{max\_activation}$ ↩
Even though features still correspond to directions, the set of interpretable direction is larger than the number of dimensions ↩
Constructing models with a residual stream traces back to early work by the Schmidhuber group, such as highway networks (Srivastava et al., 2015) and LSTMs, which have found significant modern success in the more recent residual network architecture (He et al., 2015).

In Transformers, the residual stream vectors are often called the “embedding.” We prefer the residual stream terminology, both because it emphasizes the residual nature (which we believe to be important) and also because we believe the residual stream often dedicates subspaces to tokens other than the present token, breaking the intuitions the embedding terminology suggests. ↩

large language models, often implemented as autoregressive transformers models.

GPTs and friends

Most variants of LLMs are decoder-only (Radford et al., 2019)

Have “capabilities” to understand natural language.

Exhibits emergent behaviour of intelligence, but probably not AGI due to observer-expectancy effect.

Scalability

Incredibly hard to scale, mainly due to their large memory footprint and tokens memory allocation.

Optimization

I did a talk at HackTheNorth 2023 on this topic and rationale behind building OpenLLM

Quantization: reduce computational and memory costs of running inference with representing the weight and activations with low-precision data type
Continuous batching: Implementing Paged Attention with custom scheduler to manage swapping kv-cache for better resource utilisation
Different Attention variants, for better kernels and hardware optimisation (Think of Flash Attention 3, Radix Attention, TreeAttention, etc.)
Byte-Latent Transformer: idea to use entropy-based sampling to choose next tokens instead of token-level decoding. ¹

on how we are being taught.

How would we assess thinking?

Similar to calculator, it simplifies and increase accessibility to the masses, but in doing so lost the value in the action of doing math.

We do math to internalise the concept, and practice to thinking coherently. Similarly, we write to help crystalised our ideas, and in the process improve through the act of putting it down.

on ethics

as philosophical tool.

To create a better representations of the world for both humans and machines to understand, we can truly have assistive tools to enhance our understanding of the world surround us

AI generated content

Bland doublespeak

Here's a real problem though. Most people find writing hard and will get AIs to do it for them whenever they can get away with it. Which means bland doublespeak will become the default style of writing. Ugh.
— Paul Graham (@paulg) 25 février 2024

machine-assisted writings

source: creative fiction with GPT-3

Idea: use sparse autoencoders to guide ideas generations

Good-enough

"How did we get AI art before self-driving cars?" IMHO this is the single best heuristic for predicting the speed at which certain AI advances will happen. pic.twitter.com/yAo6pwEsxD
— Joshua Achiam (@jachiam0) 1 décembre 2022

This only occurs if you only need a “good-enough” item where value outweighs the process.

as search

notable mentions:

perplexity.ai: RAG-first search engine
explorer.globe.engineer: tree-based information retrieval
Exa labs

Programming

Overall should be a net positive, but it’s a double-edged sword.

as end-users

Source

I think it’s likely that soon all computer users will have the ability to develop small software tools from scratch, and to describe modifications they’d like made to software they’re already using

as developers

Tool that lower of barrier of entry is always a good thing, but it often will lead to probably even higher discrepancies in quality of software

Increased in productivity, but also increased in technical debt, as these generated code are mostly “bad” code, and often we have to nudge and do a lot of prompt engineering.

Truthfulness

Preference data to train against dense (model)

Judges → evaluator models

Dense models ← (reasoning)

UX → how to get in front of users?

Data provenance and governance ?

You don’t need frontier lab resources for frontier lab automated LLM evaluation.

To prove this, we’re open-sourcing j1-nano and j1-micro: two absurdly tiny (600M & 1.7B parameters) but mighty reward models competitive with orders-of-magnitude larger peers.

j1-nano and j1-micro… pic.twitter.com/HvMLJMKpvF
— Leonard Tang (@leonardtang_) 27 mai 2025

outperform opus and gpt-4o

url: thoughts/mechanistic-interpretability
mechanistic interpretability
youtube/v=veT2VI4vHyU, initial exploration, glossary

The subfield of alignment that delves into reverse engineering of a neural network, especially LLMs

To attack the curse of dimensionality, the question remains: how do we hope to understand a function over such a large space, without an exponential amount of time? ²

Topics:

sparse autoencoder

sparse crosscoders

Attribution parameter decomposition

open problems

Sharkey et al. (2025)

differentiate between “reverse engineering” versus “concept-based”

reverse engineer:

decomposition → hypotheses → validation

Decomposition via dimensionality reduction

drawbacks with SDL:

SDL reconstruction error are way too high (Rajamanoharan et al., 2024, p. see section 2.3)

SDL assumes linear representation hypothesis against non-linear feature space.

SDL leaves feature geometry unexplained

inference

Application in the wild: Goodfire and Transluce

How we would do inference with SAE?

Quick 🧵 and some of quick introspection into how they might run inference https://t.co/1N3JrxFHSp
— aaron (@aarnphm_) 25 septembre 2024

idea: treat SAEs as a logit bias, similar to guided decoding

steering

refers to the process of manually modifying certain activations and hidden state of the neural net to influence its outputs

For example, the following is a toy example of how a decoder-only transformers (i.e: GPT-2) generate text given the prompt “The weather in California is”
flowchart LR
  A[The weather in California is] --> B[H0] --> D[H1] --> E[H2] --> C[... hot]
To steer to model, we modify $H_2$ layers with certain features amplifier with scale 20 (called it $H_{3}$ )³
flowchart LR
  A[The weather in California is] --> B[H0] --> D[H1] --> E[H3] --> C[... cold]
One usually use techniques such as sparse autoencoders to decompose model activations into a set of interpretable features.

For feature ablation, we observe that manipulation of features activation can be strengthened or weakened to directly influence the model’s outputs

example: Panickssery et al. (2024) uses contrastive activation additions to steer Llama 2

contrastive activation additions

intuition: using a contrast pair for steering vector additions at certain activations layers

Uses mean difference which produce difference vector similar to PCA:

Given a dataset $\mathcal{D}$ of prompt $p$ with positive completion $c_p$ and negative completion $c_n$ , we calculate mean-difference $v_\text{MD}$ at layer $L$ as follow:
$v_\text{MD} = \frac{1}{\mid \mathcal{D} \mid} \sum_{p,c_p,c_n \in \mathcal{D}} a_L(p,c_p) - a_L(p, c_n)$

implication

by steering existing learned representations of behaviors, CAA results in better out-of-distribution generalization than basic supervised finetuning of the entire model.

superposition hypothesis

tl/dr

phenomena when a neural network represents more than $n$ features in a $n$ -dimensional space

Linear representation of neurons can represent more features than dimensions. As sparsity increases, model use superposition to represent more features than dimensions.

neural networks “want to represent more features than they have neurons”.

When features are sparsed, superposition allows compression beyond what linear model can do, at a cost of interference that requires non-linear filtering.

reasoning: “noisy simulation”, where small neural networks exploit feature sparsity and properties of high-dimensional spaces to approximately simulate much larger much sparser neural networks

In a sense, superposition is a form of lossy compression

importance

sparsity: how frequently is it in the input?

importance: how useful is it for lowering loss?

over-complete basis

reasoning for the set of $n$ directions ⁴

features

A property of an input to the model

When we talk about features (Elhage et al., 2022, p. see “Empirical Phenomena”), the theory building around several observed empirical phenomena:
Word Embeddings: have direction which corresponding to semantic properties (Mikolov et al., 2013). For example:
V(king) - V(man) = V(monarch)
Latent space: similar vector arithmetics and interpretable directions have also been found in generative adversarial network.
We can define features as properties of inputs which a sufficiently large neural network will reliably dedicate a neuron to represent (Elhage et al., 2022, p. see “Features as Direction”)

ablation

refers to the process of removing a subset of a model’s parameters to evaluate its predictions outcome.

idea: deletes one activation of the network to see how performance on a task changes.

zero ablation or pruning: Deletion by setting activations to zero

mean ablation: Deletion by setting activations to the mean of the dataset

random ablation or resampling

residual stream

figure1: Residual stream illustration

intuition: we can think of residual as highway networks, in a sense portrays linearity of the network ⁵

residual stream $x_{0}$ has dimension $\mathit{(C,E)}$ where

$\mathit{C}$ : the number of tokens in context windows and

$\mathit{E}$ : embedding dimension.

Attention mechanism $\mathit{H}$ process given residual stream $x_{0}$ as the result is added back to $x_{1}$ :
$x_{1} = \mathit{H}{(x_{0})} + x_{0}$
grokking

See also: writeup, code, circuit threads

A phenomena discovered by (Power et al., 2022) where small algorithmic tasks like modular addition will initially memorise training data, but after a long time ti will suddenly learn to generalise to unseen data

empirical claims

related to phase change

Lien vers l'original

LLMs

Étiquette

publié à

modifié à

durée

source

Scalability

Optimization

on how we are being taught.

on ethics

as philosophical tool.

AI generated content

machine-assisted writings

Good-enough

as search

Programming

as end-users

as developers

Truthfulness

mechanistic interpretability

open problems

inference

steering

contrastive activation additions

superposition hypothesis

importance

over-complete basis

features

ablation

residual stream

grokking

Bibliographie

Vous pourriez aimer ce qui suit

Liens retour

LLMs

Étiquette

publié à

modifié à

durée

source

Scalability

Optimization

on how we are being taught.

on ethics

as philosophical tool.

AI generated content

machine-assisted writings

Good-enough

as search

Programming

as end-users

as developers

Truthfulness

mechanistic interpretability

open problems

inference

steering

contrastive activation additions

superposition hypothesis

importance

over-complete basis

features

ablation

residual stream

grokking

Bibliographie

Remarque

Vous pourriez aimer ce qui suit

Liens retour