LESSWRONG

Activation Engineering

Edited by David Udell, last updated 29th Aug 2023

Activation engineering is the direct manipulation of activation vectors inside a trained machine learning model, potentially as a way to steer the model's behavior.

Activation engineering can be contrasted with two other strategies for steering models: fine-tuning a model toward the desired behavior, and crafting prompts that elicit a particular response.
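
To make the contrast concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything taken from the tagged posts: the toy two-layer network, its weights `W_IN` and `W_OUT`, and the helper names are hypothetical. Building the steering vector as the difference between the hidden activations of two contrasting inputs is one common choice, assumed here for illustration. The weights never change (no fine-tuning) and the input never changes (no prompt crafting); only the hidden activations are edited during the forward pass.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[Vector]

# Hypothetical toy weights: 2 inputs -> 3 hidden units -> 1 output.
W_IN: Matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT: Matrix = [[1.0, -1.0, 0.5]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def hidden_activations(x: Vector) -> Vector:
    """Hidden-layer activations (ReLU) for input x."""
    return [max(0.0, h) for h in matvec(W_IN, x)]


def steering_vector(x_pos: Vector, x_neg: Vector) -> Vector:
    """Assumed construction: difference of hidden activations on two inputs."""
    pos, neg = hidden_activations(x_pos), hidden_activations(x_neg)
    return [p - n for p, n in zip(pos, neg)]


def forward(x: Vector, steering: Optional[Vector] = None,
            scale: float = 1.0) -> Vector:
    """Run the toy model, optionally adding a scaled vector to the hidden layer."""
    hidden = hidden_activations(x)
    if steering is not None:
        # The intervention: edit the activations, not the weights or the input.
        hidden = [h + scale * s for h, s in zip(hidden, steering)]
    return matvec(W_OUT, hidden)


if __name__ == "__main__":
    x = [1.0, 2.0]
    v = steering_vector([1.0, 0.0], [0.0, 1.0])
    print("unsteered:", forward(x))
    print("steered:  ", forward(x, steering=v))
    print("steered x2:", forward(x, steering=v, scale=2.0))
```

The `scale` parameter is also an assumption of this sketch: it controls how strongly the added vector shifts the hidden state, so the same input can yield different outputs without retraining.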

Posts tagged Activation Engineering (15 of 54 shown)

Steering GPT-2-XL by adding an activation vector (TurnTrout, Monte M, David Udell, lisathiergart, Ulisse Mini; 2y; 439 karma; 98 comments; Ω)
Modulating sycophancy in an RLHF model via activation steering (Nina Panickssery; 2y; 69 karma; 20 comments; Ω)
Mechanistically Eliciting Latent Behaviors in Language Models (Andrew Mack, TurnTrout; 1y; 214 karma; 43 comments; Ω)
Reducing sycophancy and improving honesty via activation steering (Nina Panickssery; 2y; 122 karma; 18 comments; Ω)
Maze-solving agents: Add a top-right vector, make the agent go to the top-right (TurnTrout, peligrietzer, lisathiergart; 2y; 101 karma; 17 comments; Ω)
Extracting and Evaluating Causal Direction in LLMs' Activations (Fabien Roger, simeon_c; 3y; 29 karma; 5 comments; Ω)
An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs (j_we; 1y; 37 karma; 6 comments; Ω)
Programming Refusal with Conditional Activation Steering (Bruce W. Lee; 1y; 41 karma; 0 comments)
ActAdd: Steering Language Models without Optimization (technicalities, TurnTrout, lisathiergart, David Udell, Ulisse Mini, Monte M; 2y; 105 karma; 3 comments; Ω)
I found >800 orthogonal "write code" steering vectors (Jacob G-W, TurnTrout; 1y; 104 karma; 19 comments)
Representation Tuning (Christopher Ackerman; 1y; 35 karma; 9 comments; Ω)
Evaluating hidden directions on the utility dataset: classification, steering and removal (Annah, shash42; 2y; 25 karma; 3 comments; Ω)
Understanding Counterbalanced Subtractions for Better Activation Additions (ojorgensen; 2y; 21 karma; 0 comments)
LLMs Universally Learn a Feature Representing Token Frequency / Rarity (Sean Osier; 1y; 13 karma; 5 comments)
Understanding and controlling a maze-solving policy network (TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell; 2y; 334 karma; 28 comments; Ω)