See also: Overton Window and this blog on alignment research
The act of aligning oneself with a particular group or ideology. This can be done for a variety of reasons, including:
- To gain social acceptance
- To gain power
- To gain resources
In the context of large language models, alignment is often framed as a way to mitigate "hallucination" during token generation.
To align a model is, roughly, to teach it to generate tokens that fall within the bounds of the Overton Window. A toy sketch of that "bounding" idea follows below.
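As a crude illustration only (real alignment is done through training, not decode-time filtering), one can picture "staying within the bounds" as masking out tokens that fall outside an acceptable set before sampling. Everything here (`allowed_ids`, the vocabulary-sized logits) is a hypothetical placeholder, not any particular model's API:

```python
import torch

def sample_within_bounds(logits: torch.Tensor, allowed_ids: torch.Tensor) -> int:
    """Toy sketch: only tokens in `allowed_ids` (a 1-D tensor of token ids,
    standing in for the 'acceptable' set) can be sampled."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0                      # keep allowed tokens untouched
    probs = torch.softmax(logits + mask, dim=-1) # disallowed tokens get probability 0
    return torch.multinomial(probs, num_samples=1).item()
```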
The goal is to build an aligned system that helps us solve other alignment problems.
Should we build ethically aligned systems, or morally aligned systems?
One of mechanistic interpretability's goals is to ablate harmful features.
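A minimal sketch of what "ablating a feature" can mean in practice: remove the component of a layer's activations along a given feature direction. The model, layer path, and `harmful_feature_direction` below are hypothetical placeholders, not a specific interpretability library's API:

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that removes the component of the layer's
    output along `direction` (assumed to be a unit vector)."""
    def hook(module, inputs, output):
        coeff = (output @ direction).unsqueeze(-1)  # projection onto the feature
        return output - coeff * direction           # activations with the feature ablated
    return hook

# Hypothetical usage on some transformer MLP layer:
# handle = model.transformer.h[10].mlp.register_forward_hook(
#     make_ablation_hook(harmful_feature_direction)
# )
# ... run the model, then handle.remove()
```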
design
See also Information Theory
RSP (Responsible Scaling Policy)
published by Anthropic
The idea is to create a standard for risk-mitigation strategy as AI systems advance: essentially a scale for judging how capable a system is of causing harm.