Previously alexrjl.
I give careers advice for 80,000 Hours, but views I express here are my own.
I think my suggested usage is slightly better, but I'm not sure it's worth the effort of trying to make people change, though I find 'camouflage' a useful term when I'm trying to explain this to people.
Good question. I think there's a large overlap between them, including most of the important/scary cases that don't involve deceptive alignment (which are usually both). I think listing examples feels like the easiest way of explaining where they come apart:
- There are some kinds of 'oversight failure' which aren't 'scalable oversight failure', e.g. the ball-grabbing robot hand thing. I don't think the problem here was oversight simply failing to scale to superhuman. This does count as camouflage.
- There are also some kinds of scalable oversight failure where the issue looks more like 'we didn't try at all' than 'we tried, but selecting based only on what we could see screwed us'. Someone just deciding to deploy a system and essentially just hoping that it's aligned would fall into this camp, but a more realistic case would be something like only evaluating a system based on its immediate effects, and then the long-run effects being terrible. You might not consider this a 'failure of scalable oversight', and instead want to call it a 'failure to even try scalable oversight', but I think the line is blurry: maybe people tried some scalable oversight stuff, it didn't really work, and then they gave up and said 'short term is probably fine'.
- I think most failures of scalable oversight have some story which roughly goes "people tried to select for things that would be good, and instead got things that looked like they would be good to the overseer". These count as both.
Ok, I think this might actually be a much bigger deal than I thought. The basic issue is that weight decay should push things to be simpler if they can be made simpler without harming training loss.
This means that models which end up deceptively aligned should expect their goals to shift over time (to ones that can be more simply represented). Of course this also means that, even if we end up with a perfectly aligned model, if it isn't yet capable of gradient hacking, we shouldn't expect it to stay aligned, but instead we should expect weight decay to push it towards a simple proxy for the aligned goal, unless it is immediately able to realise this and help us freeze the relevant weights (which seems extremely hard).
(Intuitions here mostly coming from the "cleanup" stage that Neel found in his grokking paper)
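To make the weight decay intuition concrete, here is a minimal toy sketch, not a claim about how real networks behave. It assumes plain gradient descent with L2 weight decay on two scalar weights: `w_used`, which the loss depends on, and `w_free`, which the loss never touches. All names and hyperparameters are made up for illustration.

```python
def train_with_weight_decay(steps=200, lr=0.1, weight_decay=0.1):
    """Toy gradient descent with L2 weight decay on two scalar weights.

    Hypothetical setup: the loss is (w_used - target)^2, so w_free has
    zero loss gradient and is only ever moved by the decay term.
    """
    w_used, w_free = 0.0, 1.0
    target = 2.0
    for _ in range(steps):
        grad_used = 2.0 * (w_used - target)  # gradient of the loss w.r.t. w_used
        grad_free = 0.0                      # the loss does not depend on w_free
        # Weight decay adds weight_decay * w to each weight's gradient.
        w_used -= lr * (grad_used + weight_decay * w_used)
        w_free -= lr * (grad_free + weight_decay * w_free)
    return w_used, w_free


if __name__ == "__main__":
    w_used, w_free = train_with_weight_decay()
    print(f"w_used = {w_used:.4f}, w_free = {w_free:.4f}")
```

In this toy, `w_free` shrinks by a factor of `1 - lr * weight_decay` every step, while `w_used` settles slightly below its target, because the loss gradient pushes back against the decay. The analogy to goal drift is loose: it only shows that anything the training loss is indifferent to gets no protection from the decay term.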
[epistemic status: showerthought] Getting language models to never simulate a character which does objectionable stuff is really hard. But getting language models to produce certain behaviours for a wide variety of prompts is much easier.
If we're worried about conditioning generative models getting more dangerous the more powerful LMs get, what if we fine-tuned in an association between [power seeking/deception] and [wild overconfidence, including missing obvious flaws in plans, doing long "TV supervillain" style speeches before carrying out the final stage of a plan, etc.]?
If the world model that gets learned is one where power seeking has an extremely strong association with poor cognition, maybe we either get bad attempts at treacherous turns before good ones, or models learning to be extremely suspicious of instrumental reasoning leading to power seeking, given its poor track record.
I still think there's something here and still think that it's interesting, but since writing, it has occurred to me that something like root access to the datacenter, including e.g. 'write access to external memory of which there is no oversight', could bound the potential drift problem at lower capability levels than I was initially thinking for a 'pure' gradient-hack of the sort described here.
I think there's quite a big difference between 'bad looking stuff gets selected away' and 'design a poisoned token' and I was talking about the former in the top level comment, but as it happens I don't think you need to work that hard to find very easy ways to hide signals in LM outputs and recent empirical work like this seems to back that up.
If goal-drift prevention comes after perfect deception in the capabilities ladder, treacherous turns are a bad idea.
Prompted by a thought from a colleague, here's a rough sketch of something that might turn out to be interesting once I flesh it out.
- Once a model is deceptively aligned, it seems like SGD is mostly just going to improve search/planning ability rather than do anything with the mesa-objective.
- But because 'do well according to the overseers' is the correct training strategy irrespective of the mesa-objective, there's also no reason that SGD would preserve the mesa-objective.
- I think this means we should expect it to 'drift' over time.
- Gradient hacking seems hard, plausibly harder than fooling human oversight.
- If gradient hacking is hard, and I'm right about the drift thing, then I think there are setups where something that looks more like "trade with humans and assist with your own oversight" beats "deceptive alignment + eventual treacherous turn" as a strategy.
- In particular, it feels like this points slightly in the direction of a "transparency is self-promoting/unusually stable" hypothesis, which is exciting.
Could you explain your model here of how outreach to typical employees becomes net negative?
The path of: [low level OpenAI employees think better about x-risk -> improved general OpenAI reasoning around x-risk -> improved decisions] seems high EV to me.
I think the obvious way this becomes net negative is if the first (unstated) step in the causal chain is actually false:
[People who don't have any good ideas for making progress on alignment try to 'buy time' by pitching people who work at big ML labs on AI x-risk -> low level OpenAI employees think better about x-risk]
A concern of mine, especially when ideas about this kind of untargeted outreach are framed as "this is the thing to do if you can't make technical progress", is that [low level OpenAI employees think better about x-risk] will often instead be something like [low level employees' suspicion that the "AI doomer crowd" doesn't really know what it's talking about is reinforced], or [low level employee now thinks worse about x-risk].
Thanks, both for the thoughts and encouragement!
Appreciate you doing a quick version. I'm excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon.
Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like:
- P(scenario B leads to doom | scenario A turns out not to be an issue by default)
- P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve)
- P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved)
etc.