LESSWRONG

rusheb
399Ω1010

Posts

Comments

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
rusheb · 2y · Ω140

“If your hypothesis predicts that model performance will be preserved if you swap the input to any other input which has a particular property, but no other inputs in the dataset have that property, causal scrubbing can’t test your hypothesis.”

Would it be possible to make interventions which we expect not to preserve the model's behaviour, and assert that the behaviour does in fact change? 

115 · Ablations for “Frontier Models are Capable of In-context Scheming” · 7mo · 1 comment
210 · Frontier Models are Capable of In-context Scheming · Ω · 7mo · 24 comments
93 · Apollo Research 1-year update · Ω · 1y · 0 comments
55 · A starter guide for evals · Ω · 2y · 2 comments
38 · Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation · Ω · 2y · 2 comments
45 · Understanding mesa-optimization using toy models · Ω · 2y · 6 comments