If memory serves me well, I was informed by Hendrycks' overview of catastrophic risks. I don't think it's a perfect categorization, but I think it does a good job laying out some risks that feel "less speculative" (e.g., malicious use, race dynamics as a risk factor that could cause all sorts of threats) while including those that have been painted as "more speculative" (e.g., rogue AIs).
I've updated toward the importance of explaining & emphasizing risks from sudden improvements in AI capabilities, AIs that can automate AI research, and intelligence explosions. I also think there's more appetite for that now than there used to be.
What work do you think is most valuable on the margin (for those who agree with you on many of these points)?
I'd be curious to hear more about your main disagreements.
Good point. I think it's helpful when people working on schemes with the rough flavor of "we do X, and then X helps us get to a useful AI that does not take over" try to specify roughly how capable they expect the "useful AI that does not take over" to be.
Would be curious to hear more about the kinds of tasks that Ryan and Buck expect the first "transformatively useful but still controllable" AI will be able to do (perhaps expressed in multiplier terms, like 30X human SW engineer, perhaps expressed in terms of the kinds of novel reasoning it can perform, and perhaps expressed in terms of the kinds of research tasks or policy tasks it would be able to meaningfully help with).
I'm excited to see how the AI control research direction evolves.
After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:
I'd be excited to see more posts that specifically engage with the strongest counterpoints to claims #2-4.
Some more on #2 & #4:
The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned.
I think those pessimistic about control evals could say something like "the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team." Of course, you note in the post some reasons why we should expect our red-team to have advantages over models, but you also recognize that this won't scale toward arbitrarily powerful AIs.
In some ways, this feels analogous to the following situation:
Here's the analogy for control:
I'd be curious to hear more about how you're thinking about this (and apologies if some sections of the post already deal with this; feel free to quote them if I missed them in my initial skim). Specific questions:
More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I sort of see you taking on some sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.
A few thoughts (written quickly, prioritizing speed over precision):
My impression is that the Shoggoth meme was meant to be a simple meme that says "hey, you might think that RLHF 'actually' makes models do what we value, but that's not true. You're still left with an alien creature that you don't understand and that could be quite scary."
Most of the Shoggoth memes I've seen look more like this, where the disgusting/evil aspects are toned down. They depict an alien that kinda looks like an octopus. I do agree that the picture evokes some sort of "I should be scared/concerned" reaction. But I don't think it does so in a "see, AI will definitely be evil" way; it does so in a "look, RLHF just adds a smiley face to a foreign alien thing. And yeah, it's pretty reasonable to be scared about this foreign alien thing that we don't understand" way.
To be a bit bolder, I think Shoggoth is reacting to the fact that RLHF gives off a misleading impression of how safe AI is. If I were to use provocative phrasing, I could say that RLHF serves as "propaganda". Let's put aside the fact that you and I might disagree about how much "true evidence" RLHF provides RE how easy alignment will be. It seems pretty clear to me that RLHF [and the subsequent deployment of RLHF'd models] spreads an overly-rosy "meme" that gives people a misleading perspective of how well we understand AI systems, how safe AI progress is, etc.
From this lens, I see Shoggoth as a counter-meme. It basically says "hey look, the default is for people to think that these things are friendly assistants, because that's what the AI companies have turned them into, but we should remember that actually we are quite confused about the alien cognition behind the RLHF smiley face."
If the timer starts to run out, then slap something together based on the best understanding we have. 18-24 months is about how long I expect it to take to slap something together based on the best understanding we have.
Can you say more about what you expect to be doing after you have slapped together your favorite plans/recommendations? I'm interested in getting a more concrete understanding of how you see your research (eventually) getting implemented.
Suppose that after the 18-24 month process, you have 1-5 concrete suggestions that you want AGI developers to implement. Is the idea essentially that you would go to the superalignment team (and the equivalents at other labs) and say "hi, here's my argument for why you should do X"? What kinds of implementation-related problems, if any, do you see coming up?
I ask this partially because I think some people are kinda like "well, in order to do alignment research that ends up being relevant, I need to work at one of the big scaling labs in order to understand the frames/ontologies of people at the labs, the constraints/restrictions that would come up if trying to implement certain ideas, get better models of the cultures of labs to see what ideas will simply be dismissed immediately, identify cruxes, figure out who actually makes decisions about what kinds of alignment ideas will end up being used for GPT-N, etc etc."
My guess is that you would generally encourage people to not do this, because they generally won't have as much research freedom & therefore won't be able to work on core parts of the problem that you see as neglected. I suspect many would agree that there is some "I lose freedom" cost, but that this might be outweighed by the "I get better models of what kind of research labs are actually likely to implement" benefit, and I'm curious how you view this trade-off (or if you don't even see this as a legitimate trade-off).
Does the "rater problem" (raters have systematic errors) simply apply to step one in this plan? I agree that once you have a perfect reward model, you no longer need human raters.
But it seems like the "rater problem" still applies if we're going to train the reward model using human feedback. Perhaps I'm too anchored to thinking about things in an RLHF context, but it seems like at some point in the process we need to have some way of saying "this is true" or "this chain-of-thought is deceptive" that involves human raters.
Is the idea something like:
Or am I totally off//the idea is different than this//the "yet-to-be-worked-out-techniques" would involve getting the reward model to learn stuff without ever needing feedback from human raters?
I'd also be curious to know why (some) people downvoted this.
Perhaps it's because you imply that some OpenAI folks were captured, and maybe some people think that that's unwarranted in this case?
Sadly, the more-likely explanation (IMO) is that policy discussions can easily become tribal, even on LessWrong.
I think LW still does better than most places at rewarding discourse that's thoughtful/thought-provoking and resisting tribal impulses, but I wouldn't be surprised if some people were doing something like "ah he is saying something Against AI Labs//Pro-regulation, and that is bad under my worldview, therefore downvote."
(And I also think this happens the other way around as well, and I'm sure people who write things that are "pro AI labs//anti-regulation" are sometimes unfairly downvoted by people in the opposite tribe.)