What you described is an example of training based on cognitive oversight in my view. When I said 'there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others' I was thinking of things like this. There's a spectrum from 'just make the rate at which the evilness detector fires a term in the loss function' to 'hill climb on your evilness detector (but don't use it during training)' to 'when the evilness detector fires, try to understand mechanistically what went wrong and then modify training to fix the problem' all the way to 'when the evilness detector fires, demand a global moratorium, burn all the GPUs, and eventually try a new paradigm years later (repeat until you find a paradigm where that paradigm's version of the evilness detector doesn't fire)'. In each of these there is some pressure for cognitive obfuscation, but each approach has less pressure than the earlier ones imo (and it's ultimately a race between optimisation pressure for alignment and optimisation pressure for obfuscation). I don't see the point of building a cognitive oversight technique if you aren't going to take actions on the basis of it!
My actual view on the right amount of optimisation against cognitive oversight is a complicated mess with lots of parts: something like, have a bunch of techniques, train on ~none of them (unless you have loads and they're not super correlated), hill climb on some of them, and reserve the highest-effort, highest-quality ones for sounding the alarm if we aren't succeeding at aligning the models.
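To make the two ends of that spectrum concrete, here's a minimal training-loop sketch. Everything in it (the toy model, the `evilness_score` detector, the threshold) is a hypothetical placeholder rather than any real setup: the first variant feeds the detector straight into the gradient, the second only consults it on a held-out alignment eval and sounds an alarm.

```python
import torch

# Toy placeholders (illustrative assumptions, not a real setup).
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def task_loss(batch):
    return model(batch).pow(2).mean()  # stand-in for the real training objective

def evilness_score(batch):
    # Stand-in for a white-box detector that reads the model's internals.
    return model(batch).abs().mean()

def step_detector_in_loss(batch, lam=0.1):
    # One end of the spectrum: the detector's firing rate is a loss term, so the
    # optimizer directly pressures the model to stop triggering it (including,
    # potentially, by obfuscating the relevant cognition).
    loss = task_loss(batch) + lam * evilness_score(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def step_detector_held_out(batch, heldout_batch, threshold=0.5):
    # Further along the spectrum: train only on the task loss; the detector is
    # consulted on a held-out alignment eval, and firing triggers investigation
    # rather than a gradient update, so any obfuscation pressure is indirect.
    loss = task_loss(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        if evilness_score(heldout_batch) > threshold:
            print("ALARM: detector fired on held-out eval; pause and investigate")

step_detector_held_out(torch.randn(8, 16), torch.randn(8, 16))
```

Even the held-out variant exerts some selection pressure once you repeatedly act on the alarm; the point is just that it is much less direct than putting the score in the gradient.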
Really great post, strong upvoted. I was a fan in particular of the selection of research agendas you included at the end.
but there’s no similar fundamental reason that cognitive oversight (e.g. white-box-based oversight like with interpretability) has to get harder with scale
I'm curious why you think this? It seems like there's some preliminary evidence that models are becoming increasingly capable of manipulating their thoughts. But more importantly, I'm wondering what role you think this cognitive oversight will play in making safe powerful AI. If this oversight is used as a feedback signal, then I would expect that avoiding obfuscating the cognition does become harder at scale. I do think we should train based on cognitive oversight at least some amount, and there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others (e.g. using white-box techniques only in held-out alignment evals rather than as part of the loss function), but doesn't it still seem like cognitive oversight should predictably get somewhat harder with scale, even if it scales more favourably than behavioural oversight?
Copy-pasting from a Slack thread:
I'll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:
Some random thoughts on what goals powerful AIs will have more generally:
What are solenoidal flux corrections in this context?
Thanks for this post!
Caveat: I haven't read this very closely yet, and I'm not an economist. I'm finding it hard to understand why you think it's reasonable to model an increase in capabilities as an increase in the number of parallel copies. That is: in the returns-to-R&D section, you look at data on how increasing numbers of human-level researchers in AI affect algorithmic progress, but we have ~no data on what happens when you sample researchers from a very different (and superhuman) capability profile. It seems entirely plausible to me that a few months into the intelligence explosion, the best AI researchers are qualitatively superintelligent enough that their research advances per month aren't the sort of thing that could be done by ~any number of humans[1] acting in parallel in a month. I acknowledge that this is probably not tractable to model, but that itself seems like a problem, because this qualitative superintelligence seems to me to be a (maybe the) key driving force of the intelligence explosion. (I've tried to make this concern concrete in the toy sketch below, after the intuition pumps.)
Some intuition pumps for why this seems reasonably likely:
or at least only by an extremely large number of humans, who are doing something more like brute-force search and less like thinking
This is basically the same idea as Dwarkesh's point that a human-level LLM should be able to make all sorts of new discoveries by connecting dots that humans can't connect because we can't read and take in the whole internet.
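Here is the toy sketch I mentioned: the kind of parallel-researchers progress law I understand the returns-to-R&D section to be fitting, plus a crude quality multiplier showing where qualitative superintelligence would have to enter. The functional form, parameter names, and numbers are all my own illustrative assumptions, not the post's.

```python
def algorithmic_progress_rate(n_researchers, quality=1.0, r=0.4, k=1.0):
    """Toy progress law: rate ~ k * (quality * n_researchers) ** r.

    n_researchers: number of human-level researcher copies working in parallel
    quality: crude per-researcher capability multiplier (1.0 = top human)
    r: parallelisation exponent, the thing historical headcount data can estimate
    k: units constant
    """
    return k * (quality * n_researchers) ** r

# In this functional form, more copies can always substitute for higher quality
# (doubling quality is exactly equivalent to doubling headcount), so the two
# calls below print the same number. The worry in my comment is that reality
# may not have this property, in which case an r fitted from varying human
# headcounts tells you little about the quality >> 1 regime that drives the
# intelligence explosion.
print(algorithmic_progress_rate(1_000, quality=1.0))
print(algorithmic_progress_rate(10, quality=100.0))
```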
I think the biggest alignment-relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I've updated about this and definitely acknowledge I was wrong.[3] I don't think it totally changes the picture though: I'm still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.
Curious to hear how you would revisit this prediction in light of reasoning models? Seems like you weren't as wrong as you thought a year ago, but maybe you still think there are some key ways your predictions about RL fine-tuning were off?
This was really useful to read; thanks very much for writing these posts!
Very happy you did this!
Do you have any idea whether the difference between unlearning success on synthetic facts fine-tuned in after pretraining vs. real facts introduced during pretraining comes mainly from the 'synthetic' part or the 'fine-tuning' part? I.e., if you took the synthetic facts dataset and spread it out through the pretraining corpus, do you expect it would be any harder to unlearn the synthetic facts? Or maybe this question doesn't make sense because you'd have to make the dataset much larger or something to get the facts learned at all during pretraining? If so, it seems like a pretty interesting research question to try to understand which properties a dataset of synthetic facts needs to have to defeat unlearning.
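For what it's worth, here's roughly what I mean by 'spread it out through the pretraining corpus'; a hypothetical sketch where the function name, repeat count, and interleaving scheme are all my own illustrative choices:

```python
import random

def interleave_synthetic_facts(pretraining_docs, synthetic_fact_docs, n_repeats=3, seed=0):
    """Scatter (possibly repeated) synthetic-fact documents through a pretraining corpus.

    Hypothetical sketch of 'spreading the synthetic facts through pretraining'
    instead of fine-tuning on them afterwards. In practice each fact probably
    needs to be repeated and paraphrased enough to be learned at all at
    pretraining scale, which is part of the open question above.
    """
    rng = random.Random(seed)
    mixed = list(pretraining_docs)
    for doc in synthetic_fact_docs:
        for _ in range(n_repeats):
            mixed.insert(rng.randrange(len(mixed) + 1), doc)
    return mixed
```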
Upvoted. The way I think about the case for ambitious interp is:
I do think that pragmatic interp is great, but I don't want the field to move away from ambitious interp entirely. My guess is that people in favour of moving away from ambitious interp mostly contest (3.a) (and secondarily maybe I think (2) is a bigger deal than they do, although maybe not). I don't think I would disagree much about how promising the existing agendas are at the object level: I just think that 'promising' should mean 'could lead somewhere given 10 years of effort, plus maybe lots of AI labour', which might correspond to research that feels very limited right now.