totally speculating, but intuitively I expect that:
- with SDF + finetuning, self-conceptualization as sandbagging is causally upstream of sandbagging
- with prompt-distillation, self-conceptualization as sandbagging is causally downstream of sandbagging
I guess the core thing is trying to get an MO that sandbags reliably despite never being directly fine-tuned to sandbag. This could be accomplished by:
a) synthetic document finetuning on a scenario which motives the model to sandbag (+ finetuning on other behavior implies by the scenario)
b) (logit-based) prompt distillation of sandbagging instructions on unrelated tasks (hoping that the prompt is sufficiently internalized such that the model generalizes to sandbagging)
My guess is that b) won't work and that a) is doable but hard to get working reliably (such that the blue team ~always wins). Even so, I think its worth-while to evaluate white-box methods in this setup, because I expect white-box methods to generalize better than black-box methods to models that are more competently sandbagging.
Really cool work.
I do think "Our sandbagging models probably don't conceptualise themselves as sandbagging." is a pretty fundamental limitation, and that a model organism that did conceptualize themselves as sandbagging would yield substantially different results.
very excited for SDF + finetuning / rl MOs in this direction.
Great post and great format (particularly liked "generalization upstream of reward signals" which I sort of had intuitions about (from reading your work) but hadn't seen presented so crisply)
I'd be excited to see more treatment of generalization upstream of reward signals (i.e. hypothesized mechanisms for the reward function learning algorithms, mapping to potential ML setups), though all of this has genuine potential capability externalities.
My overall (fairly uniformed low-confidence) takes:
Hmm I guess there's no guarantee that KL does better, and since we don't have great metrics for "internal faithfulness", maybe its just better to transparently optimize the flawed metric (task ce + sparsity).
Though as Robin notes on the AMI post, I do think the next step in this style of research is handling negative heads and self repair in a principled way.
yeah I think I agree with all that... (like psychopath can definitely learn language, accomplish things in the world, etc)
maybe the thought experiment with the 18yr old just prompted me to think about old arguments around "the consequentialist core" that aren't centrally about approval reward (and more about whether myopic rewards can elicit consequentialist-ish and aligned planning).
insofar as human capabilities are not largely explained by consequentialist planning (as suggested by the approval reward picture), this should make us more optimistic about human-level AGI alignment.
further, this picture might suggest that the cheapest way to human level-level AGI might route through approval reward-like mechanisms, giving us a large negative alignment tax.
ofc you might think getting approval reward to work is actually a very narrow target, and even if early human-level AGIs aren't coherent consequentialists, they will use some other mechanism for learning that doesn't route through approval reward and thus doesn't inherit the potentially nice alignment properties (or you could think that early human-level AGIs will be more like coherent consequentialists).
Noticed that you use task cross entropy loss instead of KL when learning task masks (Appendix 4.5, Loss Function) This is maybe a reasonable design choice, but important to note that this will ablate any "negative" nodes (and indirectly cause you to ignore positive nodes which overcome the negative nodes).
Overall, I suspect that this causes the subnetworks to miss important model computations (but obviously decreases the size of the subnetwork)
Aside: is there a reason there isn't a top-level link-post for this paper? (if not I'll create one)
Its worth disambiguating two critiques in Richards comment:
1) the AI safety community doesn't try to fundamentally understand intelligence
2) the AI safety community doesn't try to solve alignment for smarter than human AI systems
Tbc, they are somewhat related (i.e. people trying to fundamentally understand intelligence tend to think about alignment more) but clearly distinct. The "mainstream" AI safety crowd (myself included) is much more sympathetic to 2 than 1 (indeed Neel has said as much).
There's something to the idea that "marginal progress doesn't fee like marginal progress from the inside". Like, even if no one breakthrough or discovery "solves alignment", a general frame of "lets find principled approaches" is often more generative than "let's find the cheapest 80/20 approach" (both can be useful, and historically the safety community has probably leaned too far towards principled, but maybe the current generation is leaning too far the other way)
That most complexity theorists believe P != NC gives a kind of theoretical quasi-support to ideas like "alignment research is inherently sequential" and single-piece flows
(I think arguments like this connecting mathematical arguments to distantly related settings is mostly silly, but I still find them nice, idk...)