LESSWRONG
Activation Engineering
• Applied to "Avoiding jailbreaks by discouraging their representation in activation space" by Guido Bergman, 18d ago
• Applied to "Validating / finding alignment-relevant concepts using neural data" by Bogdan Ionut Cirstea, 24d ago
• Applied to "[Paper] Programming Refusal with Conditional Activation Steering" by Bruce W. Lee, 1mo ago
• Applied to "Activation Engineering Theories of Impact" by kubanetics, 3mo ago
• Applied to "I found >800 orthogonal "write code" steering vectors" by Jacob G-W, 3mo ago
• Applied to "An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs" by Jan Wehner, 3mo ago
• Applied to "Control Vectors as Dispositional Traits" by Jan Wehner, 3mo ago
• Applied to "Representation Tuning" by Jan Wehner, 3mo ago
• Applied to "LLMs Universally Learn a Feature Representing Token Frequency / Rarity" by Sean Osier, 4mo ago
• Applied to "Jailbreak steering generalization" by Nina Panickssery, 4mo ago
• Applied to "Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller" by Henry Cai, 4mo ago
• Applied to "Introducing SARA: a new activation steering technique" by Alejandro Tlaie, 4mo ago
• Applied to "Mechanistically Eliciting Latent Behaviors in Language Models" by TurnTrout, 6mo ago
• Applied to "How well do truth probes generalise?" by mishajw, 8mo ago
• Applied to "Causal Graphs of GPT-2-Small's Residual Stream" by David Udell, 8mo ago