Leveling Up: advice & resources for junior alignment researchers

Wiki Contributions


If memory serves me well, I was informed by Hendrycks' overview of catastrophic risks. I don't think it's a perfect categorization, but I think it does a good job laying out some risks that feel "less speculative" (e.g., malicious use, race dynamics as a risk factor that could cause all sorts of threats) while including those that have been painted as "more speculative" (e.g., rogue AIs).

I've updated toward the importance of explaining & emphasizing risks from sudden improvements in AI capabilities, AIs that can automate AI research, and intelligence explosions. I also think there's more appetite for that now than there used to be. 

What work do you think is most valuable on the margin (for those who agree with you on many of these points)?

I'd be curious to hear more about your main disagreements.

Good point. I think it's helpful when people working schemes with the rough flavor of "we do X, and then X helps us get to a useful AI that does not takeover" try to specify roughly how capable they expect the "useful AI does not takeover" to be.

Would be curious to hear more about the kinds of tasks that Ryan and Buck expect the first "transformatively useful but still controllable" AI will be able to do (perhaps expressed in multiplier terms, like 30X human SW engineer, perhaps expressed in terms of the kinds of novel reasoning it can perform, and perhaps expressed in terms of the kinds of research tasks or policy tasks it would be able to meaningfully help with).


I'm excited to see how the AI control research direction evolves. 

After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:

  • Claim 1: Labs should control powerful AI systems (very intuitive, even obvious)
  • Claim 2: It is (much?) easier to evaluate control than alignment (somewhat intuitive, particularly if one buys the argument that alignment evals are very hard because you always have the "what if the AI is just scheming and subverting the eval" concern.) 
  • Claim 3: We (Buck and Ryan) have identified some specific techniques that we believe are promising for control (not intuitive; technical arguments needed).
  • Claim 4: If we succeed in controlling "transformatively useful AI", then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).

I'd be excited to see more posts that specifically engage with the strongest counterpoints to claims #2-4. 

Some more on #2 & #4:

The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned.

I think those pessimistic about control evals could say something like "the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team." Of course, you note in the post some reason why we should expect our red-team to have advantages over models, but also you recognize that this won't scale toward arbitrarily powerful AIs.

In some ways, this feels analogous to the following situation:

  • Alice from the Scaling Department: Look, our current "alignment" techniques are working for GPT-4. So let's scale up more! I get that at some point we won't be able to align the AI, but we're not there yet, so we should be able to keep going.
  • Bob from the Preparedness Team: Well, the problem is we don't know when our alignment techniques will stop working... like, it's quite plausible that the next AI system will be able to game our evals.
  • Alice from the Scaling Department: OK, well let me know when you have clear evidence that our alignment techniques or dangerous capability evals will fail.

Here's the analogy for control:

  • Alice from the Scaling Department: Look, our current control techniques are working for GPT-4. So let's scale up more! I get that at some point our control evals aren't going to be able to control arbitrarily powerful AI, but we're not there yet, so we should be able to keep going.
  • Bob from the Preparedness Team: Well, the problem is we don't know when our control evals will stop working... like, it's quite plausible that the next AI system will be better at coming up with attacks than our red-team.
  • Alice from the Scaling Department: OK, well let me know when you have clear evidence that our control evals will fail.

I'd be curious to hear more about how you're thinking about this (and apologies if some sections of the post already deal with this– feel free to quote them if I missed them in my initial skim). Specific questions:

  1. How do you react to the analogy above? Do you expect that control evals will do a better job than other types of evals at determining when we have reached the point where scaling further is (directly/imminently) dangerous? Or more simply, will it be clear when we need to stop?
  2. More broadly, what do you think we should do once we're starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
  3. Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals"). 

More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I sort of see you taking on some sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.

A few thoughts (written quickly, prioritizing speed over precision):

  1. I think that epistemics are hard & there are surely several cases in which people are biased toward high P(doom). Examples: Yudkowsky was one of the first thinkers/writers about AI, some people might have emotional dispositions that lead them toward anxious/negative interpretations in general, some people find it "cool" to think they're one of the few people who are able to accurately identify the world is ending, etc.
  2. I also think that there are plenty of factors biasing epistemics in the "hopeful" direction. Examples: The AI labs have tons of money and status (& employ large fractions of the community's talent), some people might have emotional dispositions that lead them toward overly optimistic/rosy interpretations in general, some people might find it psychologically difficult to accept premises that lead them to think the world is ending, etc.
  3. My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the "high P(doom)" side. 
  4. I also think there's an important distinction between "I personally think this argument is wrong" and "look, here's an example of propaganda + poor community epistemics." In general, I suspect community epistemics are better when people tend to respond directly to object-level points and have a relatively high bar for saying "not only do I think you're wrong, but also here are some ways in which you and your allies have poor epistemics." (IDK though, insofar as you actually believe that's what's happening, it seems good to say aloud, and I think there's a version of this that goes too far and polices speech reproductively, but I do think that statements like "community epistemics have been compromised by groupthink and fear" are pretty unproductive and could be met with statements like "community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress." 
  5. I am quite worried about tribal dynamics reducing the ability for people to engage in productive truth-seeking discussions. I think you've pointed out how some of the stylistic/tonal things from the "high P(doom)//alignment hard" side have historically made discourse harder, and I agree with several of your critiques. More recently, though, I think that the "low P(doom)//alignment not hard" side seem to be falling into similar traps (e.g., attacking strawmen of those they disagree with, engaging some sort of "ha, the other side is not only wrong but also just dumb/unreasonable/epistemically corrupted" vibe that predictably makes people defensive & makes discourse harder.

My impression is that the Shoggath meme was meant to be a simple meme that says "hey, you might think that RLHF 'actually' makes models do what we value, but that's not true. You're still left with an alien creature who you don't understand and could be quite scary."

Most of the Shoggath memes I've seen look more like this, where the disgusting/evil aspects are toned down. They depict an alien that kinda looks like an octopus. I do agree that the picture evokes some sort of "I should be scared/concerned" reaction. But I don't think it does so in a "see, AI will definitely be evil" way– it does so in a "look, RLHF just adds a smiley face to a foreign alien thing. And yeah, it's pretty reasonable to be scared about this foreign alien thing that we don't understand."

To be a bit bolder, I think Shoggath is reacting to the fact that RLHF gives off a misleading impression of how safe AI is. If I were to use proactive phrasing, I could say that RLHF serves as "propaganda". Let's put aside the fact that you and I might disagree about how much "true evidence" RLHF provides RE how easy alignment will be. It seems pretty clear to me that RLHF [and the subsequent deployment of RLHF'd models] spreads an overly-rosy "meme" that gives people a misleading perspective of how well we understand AI systems, how safe AI progress is, etc.

From this lens, I see Shoggath as a counter-meme. It basically says "hey look, the default is for people to think that these things are friendly assistants, because that's what the AI companies have turned them into, but we should remember that actually we are quite confused about the alien cognition behind the RLHF smiley face."

If the timer starts to run out, then slap something together based on the best understanding we have. 18-24 months is about how long I expect it to take to slap something together based on the best understanding we have.

Can you say more about what you expect to be doing after you have slapped together your favorite plans/recommendations? I'm interested in getting a more concrete understanding of how you see your research (eventually) getting implemented.

Suppose after the 18-24 month process, you have 1-5 concrete suggestions that you want AGI developers to implement. Is the idea essentially that you would go to the superalignment team (and the equivalents at other labs) and say "hi, here's my argument for why you should do X?" What kinds of implementation-related problems, if any, do you see coming up?

I ask this partially because I think some people are kinda like "well, in order to do alignment research that ends up being relevant, I need to work at one of the big scaling labs in order to understand the frames/ontologies of people at the labs, the constraints/restrictions that would come up if trying to implement certain ideas, get better models of the cultures of labs to see what ideas will simply be dismissed immediately, identify cruxes, figure out who actually makes decisions about what kinds of alignment ideas will end up being used for GPT-N, etc etc."

My guess is that you would generally encourage people to not do this, because they generally won't have as much research freedom & therefore won't be able to work on core parts of the problem that you see as neglected. I suspect many would agree that there is some "I lose freedom" cost, but that this might be outweighed by the "I get better models of what kind of research labs are actually likely to implement" benefit, and I'm curious how you view this trade-off (or if you don't even see this as a legitimate trade-off). 

Does the "rater problem" (raters have systematic errors) simply apply to step one in this plan? I agree that once you have a perfect reward model, you no longer need human raters.

But it seems like the "rater problem" still applies if we're going to train the reward model using human feedback. Perhaps I'm too anchored to thinking about things in an RLHF context, but it seems like at some point in the process we need to have some way of saying "this is true" or "this chain-of-thought is deceptive" that involves human raters. 

Is the idea something like:

  • Eliezer: Human raters make systematic errors
  • OpenAI: Yes, but this is only a problem if we have human raters indefinitely provide feedback. If human raters are expected to provide feedback on 10,000,000 responses under time-pressure, then surely they will make systematic errors.
  • OpenAI: But suppose we could train a reward model on a modest number of responses and we didn't have time-pressure. For this dataset of, say, 10,000 responses, we are super careful, we get a ton of people to double-check that everything is accurate, and we are nearly certain that every single label is correct. If we train a reward model on this dataset, and we can get it to generalize properly, then we can get past the "humans make systematic errors" problem.

Or am I totally off//the idea is different than this//the "yet-to-be-worked-out-techniques" would involve getting the reward model to learn stuff without ever needing feedback from human raters?

I'd also be curious to know why (some) people downvoted this.

Perhaps it's because you imply that some OpenAI folks were captured, and maybe some people think that that's unwarranted in this case?

Sadly, the more-likely explanation (IMO) is that policy discussions can easily become tribal, even on LessWrong.

I think LW still does better than most places at rewarding discourse that's thoughtful/thought-provoking and resisting tribal impulses, but I wouldn't be surprised if some people were doing something like "ah he is saying something Against AI Labs//Pro-regulation, and that is bad under my worldview, therefore downvote."

(And I also think this happens the other way around as well, and I'm sure people who write things that are "pro AI labs//anti-regulation" are sometimes unfairly downvoted by people in the opposite tribe.)

Load More