I'm sympathetic to this -- thanks Thomas.
Hi Steve -- thanks for this comment. I can see how the vibe of the talk/piece might call to mind something like "studying/intervening on an existing AI system" rather than focusing on how it's trained/constructed, but I do mean for the techniques I discuss to cover both. For example, and re: your Bob example, I talk about our existing knowledge of human behavior as an example of behavioral science here -- and I talk a lot about studying training as a part of behavioral science, e.g.:
Let’s call an AI’s full range of behavior across all safe and accessible-for-testing inputs its “accessible behavioral profile.” Granted the ability to investigate behavioral profiles of this kind in-depth, it also becomes possible to investigate in-depth the effect that different sorts of interventions have on the profile in question. Example effects like this include: how the AI’s behavioral profile changes over the course of training; how the behavioral profile varies across different forms of training; how it responds to other kinds of interventions on the AI’s internals (though: this starts to border on “transparency tools”); how it varies based on the architecture of the AI; etc. Here I sometimes imagine a button that displays some summary of an AI’s accessible behavioral profile when pressed. In principle, you could be pressing that button constantly, whenever you do anything to an AI, and seeing what you can learn.
And techniques for training/constructing AIs that benefit from understanding/direct design of their internals would count as "transparency tools" for me.
I'm a bit confused about your overall picture here. Sounds like you're thinking something like:
"almost everything in the world is evaluable via waiting for it to fail and then noticing this. Alignment and bridge-building aren't like this, but most other things are... Also, the way we're going to automate long-horizon tasks is via giving AIs long-term goals. In particular: we'll give them goal 'get long-term human approval/reward', which will lead to good-looking stuff until the AIs take over in order to get more reward. This will work for tons of stuff but not for alignment, because you can't give negative reward for the alignment failure we ultimately care about, which is the AIs taking over."
Is that roughly right?
I think it's a fair point that if it turns out that current ML methods are broadly inadequate for automating basically any sophisticated cognitive work (including capabilities research, biology research, etc -- though I'm not clear on your take on whether capabilities research counts as "science" in the sense you have in mind), it may be that whatever new paradigm ends up successful messes with various implicit and explicit assumptions in analyses like the one in the essay.
That said, I think if we're ignorant about what paradigm will succeed re: automating sophisticated cognitive work and we don't have any story about why alignment research would be harder, it seems like the baseline expectation (modulo scheming) would be that automating alignment is comparably hard (in expectation) to automating these other domains. (I do think, though, that we have reason to expect alignment to be harder even conditional on needing other paradigms, because I think it's reasonable to expect some of the evaluation challenges I discuss in the post to generalize to other regimes.)
I'm happy to say that easy-to-verify vs. hard-to-verify is what ultimately matters, but I think it's important to be clear about what makes something easier vs. harder to verify, so that we can be clear about why alignment might or might not be harder than other domains. And imo empirical feedback loops and formal methods are amongst the most important factors there.
If we assume that the AI isn't scheming to actively withhold empirically/formally verifiable insights from us (I do think this would make life a lot harder), then it seems to me like this is reasonably similar to other domains in which we need to figure out how to elicit as-good-as-human-level suggestions from AIs that we can then evaluate well. E.g., it's not clear to me why this would be all that different from "suggest a new transformer-like architecture that we can then verify improves training efficiency a lot on some metric."
Or put another way: at least in the context of non-schemers, the thing I'm looking for isn't just "here's a way things could be hard." I'm specifically looking for ways things will be harder than in the context of capabilities (or, to a lesser extent, in other scientific domains where I expect a lot of economic incentives to figure out how to automate top-human-level work). And in that context, generic pessimism about e.g. heavy RL doesn't seem like it's enough.
Sure, maybe there's a band of capability where you can take over but you can't do top-human-level alignment research (and where your takeover plan doesn't involve further capabilities development that requires alignment). It's not the central case I'm focused on, though.
Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not.
Is the thought here that the AIs trying to take over aren't improving their capabilities in a way that requires paying an alignment tax? E.g. if the tax refers to a comparison between (a) rushing forward on capabilities in a way that screws you on alignment vs. (b) pushing forward on capabilities in a way that preserves alignment, AIs that are fooming will want to do (b) as well (though they may have an easier time of it for other reasons). But if it refers to e.g. "humans will place handicaps on AIs that they need to ensure are aligned, including AIs they're trying to use for alignment research, whereas rogue AIs that have freed themselves from human control will also be able to get rid of these handicaps," then yes, that's an advantage the rogue AIs will have (though note that they'll still need to self-exfiltrate etc).
Re: examples of why superintelligences create distinctive challenges: superintelligences seem more likely to be schemers, more likely to be able to systematically and successfully mess with the evidence provided by behavioral tests and transparency tools, harder to exert option-control over, better able to identify and pursue strategies humans hadn't thought of, harder to supervise using human labor, etc.
If you're worried about shell games, it's OK to round off "alignment MVP" to hand-off-ready AI, and to assume that the AIs in question need to be able to make and pursue coherent long-term plans.[1] I don't think the analysis in the essay alters that much (for example, I think very little rests on the idea that you can get by with myopic AIs), and better to err on the side of conservatism.
I wanted to set aside "hand-off" here because in principle, you don't actually need to hand-off until humans stop being able to meaningfully contribute to the safety/quality of the automated alignment work, which doesn't necessarily need to start around the time we have AIs capable of top-human-level alignment work (e.g., human evaluation of the research, or involvement in other aspects of control -- e.g., providing certain kinds of expensive, trusted supervision -- could persist after that). And when exactly you hand-off depends on a bunch of more detailed, practical trade-offs.
As I said in the post, one way that humans still being involved might not bottleneck the process is if they're only reviewing the work to figure out whether there's a problem they need to actively intervene on.
even if human labor is still playing a role in ensuring safety, it doesn’t necessarily need to directly bottleneck the research process – or at least, not if things are going well. For example: in principle, you could allow a fully-automated alignment research process to proceed forward, with humans evaluating the work as it gets produced, but only actively intervening if they identify problems.
And I think you can still likely radically speed up and scale up your alignment research even if, e.g., you still care about humans reviewing and understanding the work in question.
Though for what it's worth I don't think that the task of assessing "is this a good long-term plan for achieving X" needs itself to involve long-term optimization for X. For example: you could do that task over five minutes in exchange for a piece of candy.
Fair point, and it's plausible that I'm taking a certain subset of development pathways too much for granted. That is: I'm focused in the essay on threat models that proceed via the automation of capabilities R&D, but it's possible that this isn't necessary.
I appreciated the detailed discussion and literature review here -- thanks.