The current default view seems to roughly be:
A common example used for motivating scalable oversight is the "AI CEO".
My views are:
If you don't like AI systems doing tasks that humans can't evaluate, I think you should be concerned about the fact that people keep building larger models and fine-tuning them in ways that elicit intelligent behavior.
Indeed, I think current scaling up of language models is likely net negative (given our current level of preparedness) and will become more clearly net negative over time as risks grow. I'm very excited about efforts to monitor and build consensus about these risks, or to convince or pressure AI labs to slow down development as further scaling becomes more risky.
But I think something has gone quite wrong if our collective strategy ends up being "We keep training smarter systems, but fortunately we are only able to fine-tune them to do tasks with external feedback signals from the world." You're going to get increasingly smart systems that can predict and manipulate observable features of the world; this is just asking for a catastrophic failure from reward hacking or deceptive alignment.
Even the world where you keep building larger and larger language models, but avoid ever training them to act in the world, seems like a recipe for trouble. It creates an unstable situation where anyone who fine-tunes a model can cause a catastrophe, or where a tiny behavioral quirk of a model could lead it to take over the world.
If we want to avoid AI systems taking over the world then I think we should strongly prefer to do it by stopping the creation of systems smart enough to do so. That seems like a much better place for norms.
I think this entire picture becomes even more compelling if you are mostly worried about deceptive alignment. I think training bigger models seems like the overwhelmingly dominant risk factor, and making models of a fixed size more useful seems like an extremely plausible intervention for reducing the risk of deceptive alignment.
Perhaps the big difference is that I think it matters a lot whether AI systems are smart enough to be an AI CEO; and not so much whether people are actually trying to employ their AI as a CEO. The risk comes from a deceptive aligned AI (or reward-hacking AI) having that level of competence. If you are in that unfortunate world, then you probably do unfortunately want a bunch of aligned AIs doing complex stuff to help you survive.
If there is an attempt to build consensus in the ML community around "Hey you can train language models just try not to have them do complicated stuff" then I wouldn't push back on it. But I think saying "Don't try to align AI systems that do complex tasks because it interferes with norms against using AIs to do complex tasks" is actively counterproductive. And I think this is a less reasonable ask than "just stop building such big language models," and so I'll argue against it being the main ask the safety community focuses on.
I understand your point of view and think it is reasonable.However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other. I see the argument you are making, but I think success on these asks are likely highly correlated via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL or other training schemes that seem designed to induce agentyness and you don't do tasks that use an agentic supervision signal, then you probably don't get agents for a long time (if ever).
When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in practice, the alignment teams with compute are mostly just working on this area). I'd be in favor of figuring out how to get superhuman supervision for specific things related to normative factors/human values (e.g., superhuman wellbeing supervision), but researching superhuman supervision simpliciter will be the aim of the capabilities community.
Don't worry, the capabilities community will relentlessly maximize vanilla accuracy, and we don't need to help them.
I think I disagree with lots of things in this post, sometimes in ways that partly cancel each other out.
(A very quick response):
Agree with (1) and (2). I am ambivalent RE (3) and the replaceability arguments.RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try and figure out how to do it as safely as possible as a secondary objective".