I give careers advice for 80,000 hours, but views I express here are my own.
James Hoffman's coffee videos have this kind of vibe. The "tasting every Nespresso pod" one is a clear example, but I also really appreciate e.g. the explanations of how to blind taste
Thanks, both for the thoughts and encouragement!
I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:
Appreciate you doing a quick version. I'm excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon.
P(Doom) for each scenario would also be useful.
Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like:- P(scenario B leads to doom | scenario A turns out not to be an issue by default)- P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve)- P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved)etc.
I think my suggest usage is slightly better but I'm not sure it's worth the effort of trying to make people change, though I find 'camouflage' as a term useful when I'm trying to explain to people.
Good question. I think there's a large overlap between them, including most of the important/scary cases that don't involve deceptive alignment (which are usually both). I think listing examples feels like the easiest way of explaining where they come apart:- There's some kinds of 'oversight failure' which aren't 'scalable oversight failure' e.g. the ball grabbing robot hand thing. I don't think the problem here was oversight simply failing to scale to superhuman. This does count as camouflage.- There's also some kinds of scalable oversight failure where the issue looks more like 'we didn't try at all' than 'we tried, but selecting based only on what we could see screwed us'. Someone just deciding to deploy a system and essentially just hoping that it's aligned would fall into this camp, but a more realistic case would be something like only evaluating a system based on its immediate effects, and then the long-run effects being terrible. You might not consider this a 'failure of scalable oversight', and instead want to call it a 'failure to even try scalable oversight', but I think the line is blurry - maybe people tried some scalable oversight stuff, it didn't really work, and then they gave up and said 'short term is probably fine'.- I think most failures of scalable oversight have some story which roughly goes "people tried to select for things that would be good, and instead got things that looked like they would be good to the overseer". These count as both.
Ok, I think this might actually be a much bigger deal than I thought. The basic issue is that weight decay should push things to be simpler if they can be made simpler without harming training loss.
This means that models which end up deceptively aligned should expect their goals to shift over time (to ones that can be more simply represented). Of course this also means that, even if we end up with a perfectly aligned model, if it isn't yet capable of gradient hacking, we shouldn't expect it to stay aligned, but instead we should expect weight decay to push it towards a simple proxy for the aligned goal, unless it is immediately able to realise this and help us freeze the relevant weights (which seems extremely hard).
(Intuitions here mostly coming from the "cleanup" stage that Neel found in his grokking paper)
[epistemic status: showerthought]
Getting language models to never simulate a character which does objectionable stuff is really hard. But getting language models to produce certain behaviours for a wide variety of prompts is much easier.
If we're worried about conditioning generative models getting more dangerous the more powerful LMs get, what if we fine tuned in an association between [power seeking/deception] and [wild overconfidence, including missing obvious flaws in plans, doing long "TV supervillain" style speeches before carrying out the final stage of a plan, etc.].
If the world model that gets learned is one where power seeking has an extremely strong association with poor cognition, maybe we either get bad attempts at treacherous turns before good ones, or models learning to be extremely suspicious of instrumental reasoning leading to power seeking given it's poor track record.
I still think there's something here and still think that it's interesting, but since writing it has occurred to me that something like root access to the datacenter, including e.g. 'write access to external memory of which there is no oversight', could bound the potential drift problem at lower capability levels than I was initially thinking for a 'pure' gradient-hack of the sort described here.
I think there's quite a big difference between 'bad looking stuff gets selected away' and 'design a poisoned token' and I was talking about the former in the top level comment, but as it happens I don't think you need to work that hard to find very easy ways to hide signals in LM outputs and recent empirical work like this seems to back that up.
The different kinds of deception thing did eventually get written up and posted!
If goal-drift prevention comes after perfect deception in the capabilities ladder, treacherous turns are a bad idea.Prompted by a thought from a colleague, here's a rough sketch of something that might turn out to be interesting once I flesh it out.- Once a model is deceptively aligned, it seems like SGD is most just going to improve search/planning ability rather than do anything with the mesa-objective.- But because 'do well according the overseers' is the correct training strategy irrespective of the mesa-objective, there's also no reason that SGD would preserve the mesa objective.- I think this means we should expect it to 'drift' over time.- Gradient hacking seems hard, plausibly harder than fooling human oversight.- If gradient hacking is hard, and I'm right about the drift thing, then I think there are setups where something that looks more like "trade with humans and assist with your own oversight" beats "deceptive alignment + eventual treacherous turn" as a strategy. - In particular, it feels like this points slightly in the direction of a "transparency is self-promoting/unusually stable" hypothesis, which is exciting.