All of Akash's Comments + Replies

What work do you think is most valuable on the margin (for those who agree with you on many of these points)?

Daniel Kokotajlo · 3h
Depends on comparative advantage I guess. 

I'd be curious to hear more about your main disagreements.

Good point. I think it's helpful when people working on schemes with the rough flavor of "we do X, and then X helps us get to a useful AI that does not take over" try to specify roughly how capable they expect that useful, non-takeover AI to be.

Would be curious to hear more about the kinds of tasks that Ryan and Buck expect the first "transformatively useful but still controllable" AI will be able to do (perhaps expressed in multiplier terms, like 30X human SW engineer, perhaps expressed in terms of the kinds of novel reasoning it can perform, and perhaps expressed in terms of the kinds of research tasks or policy tasks it would be able to meaningfully help with).


I'm excited to see how the AI control research direction evolves. 

After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:

  • Claim 1: Labs should control powerful AI systems (very intuitive, even obvious)
  • Claim 2: It is (much?) easier to evaluate control than alignment (somewhat intuitive, particularly if one buys the argument that alignment evals are very hard because you always have the "what if the AI is just scheming and subverting the eval" concern.) 
  • Claim 3: We (Buck and
... (read more)

Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals").

We're thinking about systems which are capable of autonomously doing nearly all key research tasks. In particular, speeding up R&D by 30x probably requires heavily autonomous AIs run... (read more)

We think it's likely that, prior to finding that evals are unworkable, we find that control techniques would fail to ensure reasonable levels of safety and usefulness. That is, in order to ensure safety for typical use cases, we have to impose restrictions that heavily limit usefulness. See this comment for more discussion of what could fail.

Regardless of what fails first, we can discuss what we should do. First, you'd notify the key stakeholders that control doesn't suffice for the next AIs (and ideally produce legible demonstrations of this, etc.). If we've already reached a sufficient level of AI output that we're operating at pretty crazy and near-maximal speeds (e.g. the Amdahl's law style argument here), then we're in pretty good shape from the perspective of a control plan (from a technical perspective). We'd hope that in this position we can ensure some reasonable delay prior to being forced to build and use AIs in ways that break control. We might be forced to build more powerful AIs in this situation because other actors are building insanely powerful AIs that we might need to engage in active conflict against (and we can't just stop these AIs from being trained, which might be way less risky), and it could also be that internal forces want to irrationally rush to making more powerful AIs.

In the case where we're not operating at sufficiently high output, the situation is considerably worse. If the AI lab is operating in a sensible way, then the hope would be that you do a careful tradeoff between likely exogenous risk and the risk that it seems like you would be incurring from various usages of the AI[1]. So if exogenous risk isn't there, we'd just stop and work on developing better approaches. If we need to proceed without further progress, then we'd just try to make the best safety/usefulness tradeoffs that we can afford.

----------------------------------------

1. Given reasonable guesses at P(scheming) for different AIs and more detailed
A key hope here is that we can use various tests to get a general sense for whether AIs can importantly outcompete our red team. See this other comment for more discussion. So, here's how I would hope the actual situation with Alice and Bob goes:

Alice from the Scaling Department: Look, our current control techniques are working for GPT-5. So let's scale up more! I get that at some point our control techniques aren't going to be able to control arbitrarily powerful AI, but we're not there yet, so we should be able to keep going.

Bob from the Control Red Team: Indeed, our current control techniques are very likely to be working. Our protocol suffices for controlling current models on the currently approved applications, assuming we remain vigilant and maintain the necessary invariants (our design basis). Part of why we think control evaluations currently work for assessing safety is that we performed a variety of capability evaluations which indicate that, as long as we prevent steganography in some key places, AIs are pretty unlikely to think of better attack plans than we thought of as the red team. This involves looking at how good AIs are at generating strategies in a bunch of different contexts, and then doing some analysis which allows us to transfer this into an overall conservative guess about the AI's ability to create good attack plans against our actual protocol (but we can't necessarily produce the AI's best guesses at attack plans; we're relying on the fact that capability evaluation is easier than elicitation, as discussed in footnote 8). We also have rough and conservative tests for the extent to which AIs can sandbag on our current evaluations, and that seems to be fine for now. From our current rough extrapolations, the first place where we'll start running into issues with control as we build more powerful models is that we'll need to start imposing heavy restrictions on AI affordances in domains XYZ, which will really cut into the usefulness (relative t
I'll respond to this more later, but first I'd like to clarify a narrow point about when building superintelligence is acceptable.

I think that purely black-box AI control is very unlikely to scale to superintelligence, but there are other plausible routes to safety (alignment, or various approaches which make use of the internals of models but which aren't well described as alignment). So, the condition is more like "when we can be confident in the safety of superintelligence," and there are a variety of routes here.

Separately, if we don't have an absolutely airtight argument for safety, we would ideally delay building superintelligence if it isn't needed to solve pressing problems. (And indeed, we don't think that superintelligence is needed to solve a wide range of problems.)

More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I sort of see you taking on some sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.

A few thoughts (written quickly, prioritizing speed over precision):

  1. I think that epistemics are hard & there are surely several cas
... (read more)

See also "Other people are wrong" vs "I am right", reversed stupidity is not intelligence, and the cowpox of doubt.

My guess is that it's relatively epistemically corrupting and problematic to spend a lot of time engaging with weak arguments.

I think it's tempting to make the mistake of thinking that debunking a specific (bad) argument is the same as debunking a conclusion. But actually, these are extremely different operations. One requires understanding a specific argument while the other requires level headed investigation of the overall situation. Separa... (read more)

Thanks for this, I really appreciate this comment (though my perspective is different on many points). It's true that I spend more effort critiquing bad doom arguments. I would like to note that when e.g. I read Quintin I generally am either in agreement or neutral. I bet there are a lot of cases where you would think "that's a poor argument" and I'd say "hm, I don't think Akash is getting the point (and it'd be good if someone could give a better explanation)."

However, it's definitely not true that I never critique optimistic arguments which I consider poor. For example, I don't get why Quintin (apparently) thinks that spectral bias is a reason for optimism, and I've said as much on one of his posts. I've said something like "I don't know why you seem to think you can use this mathematical inductive bias to make high-level intuitive claims about what gets learned. This seems to fall into the same trap that 'simplicity' theorizing does." I probably criticize or express skepticism of certain optimistic arguments at least twice a week, though not always on public channels. And I've also pushed back on people being unfair, mean, or mocking of "doomers" on private channels.

I think both statements are true to varying degrees (the former more than the latter in the cases I'm considering). They're true and people should say them. The fact that I work at a lab absolutely affects my epistemics (though I think the effect is currently small). People should totally consider the effect which labs are having on discourse.

I do consider myself to have a high bar for this, and the bar keeps getting passed, so I say something.

EDIT: Though I don't mean for my comments to imply "someone and their allies" have bad epistemics. Ideally I'd like to communicate "hey, something weird is in the air guys, can't you sense it too?". However, I think I'm often more annoyed than that, and so I don't communicate that how I'd like.

My impression is that the Shoggoth meme was meant to be a simple meme that says "hey, you might think that RLHF 'actually' makes models do what we value, but that's not true. You're still left with an alien creature who you don't understand and could be quite scary."

Most of the Shoggoth memes I've seen look more like this, where the disgusting/evil aspects are toned down. They depict an alien that kinda looks like an octopus. I do agree that the picture evokes some sort of "I should be scared/concerned" reaction. But I don't think it does so in a "see, AI ... (read more)

If the timer starts to run out, then slap something together based on the best understanding we have. 18-24 months is about how long I expect it to take to slap something together based on the best understanding we have.

Can you say more about what you expect to be doing after you have slapped together your favorite plans/recommendations? I'm interested in getting a more concrete understanding of how you see your research (eventually) getting implemented.

Suppose after the 18-24 month process, you have 1-5 concrete suggestions that you want AGI developers to... (read more)

(This is not my main response, go read my other comment first.)

I ask this partially because I think some people are kinda like "well, in order to do alignment research that ends up being relevant, I need to work at one of the big scaling labs in order to understand the frames/ontologies of people at the labs, the constraints/restrictions that would come up if trying to implement certain ideas, get better models of the cultures of labs to see what ideas will simply be dismissed immediately, identify cruxes, figure out who actually makes decisions about what

... (read more)

Good question. As an overly-specific but illustrative example, let's say that the things David and I have been working on the past month (broadly in the vein of retargeting) go unexpectedly well. What then?

The output is definitely not "1-5 concrete suggestions that you want AGI developers to implement". The output is "David and John release a product to control an image generator/language model via manipulating its internals directly, this works way better than prompting, consumers love it and the major labs desperately play catch-up".

More general backgrou... (read more)

Does the "rater problem" (raters have systematic errors) simply apply to step one in this plan? I agree that once you have a perfect reward model, you no longer need human raters.

But it seems like the "rater problem" still applies if we're going to train the reward model using human feedback. Perhaps I'm too anchored to thinking about things in an RLHF context, but it seems like at some point in the process we need to have some way of saying "this is true" or "this chain-of-thought is deceptive" that involves human raters. 

Is the idea something like:

  • E
... (read more)
Daniel Kokotajlo · 2mo
Leo already answered, but yes, the rater problem applies to step one of that plan. And the hope is that progress can be made on solving the rater problem in this setting, because e.g. our model isn't situationally aware.

Short answer: The core focus of the "yet to be worked out techniques" is to figure out the "how do we get it to generalize properly" part, not the "how do we be super careful with the labels" part.

Longer answer: We can consider weak to strong generalization as actually two different subproblems:

  • generalizing from correct labels on some easy subset of the distribution (the 10,000 super careful definitely 100% correct labels)
  • generalizing from labels which can be wrong and are more correct on easy problems than hard problems, but we don't exactly know when
... (read more)

I'd also be curious to know why (some) people downvoted this.

Perhaps it's because you imply that some OpenAI folks were captured, and maybe some people think that that's unwarranted in this case?

Sadly, the more-likely explanation (IMO) is that policy discussions can easily become tribal, even on LessWrong.

I think LW still does better than most places at rewarding discourse that's thoughtful/thought-provoking and resisting tribal impulses, but I wouldn't be surprised if some people were doing something like "ah he is saying something Against AI Labs//Pro-re... (read more)

I appreciate the comment, though I think there's a lack of specificity that makes it hard to figure out where we agree/disagree (or more generally what you believe).

If you want to engage further, here are some things I'd be excited to hear from you:

  • What are a few specific comms/advocacy opportunities you're excited about//have funded?
  • What are a few specific comms/advocacy opportunities you view as net negative//have actively decided not to fund?
  • What are a few examples of hypothetical comms/advocacy opportunities you've been excited about?
  • What do you think
... (read more)

(An extra-heavy “personal capacity” disclaimer for the following opinions.) Yeah, I hear you that OP doesn’t have as much public writing about our thinking here as would be ideal for this purpose, though I think the increasingly adversarial environment we’re finding ourselves in limits how transparent we can be without undermining our partners’ work (as we’ve written about previously).

The set of comms/advocacy efforts that I’m personally excited about is definitely larger than the set of comms/advocacy efforts that I think OP should fund, si... (read more)

They mention three types of mitigations:

  • Asset protection (e.g., restricting access to models to a limited set of named people, general infosec)
  • Restricting deployment (only models with a risk score of "medium" or below can be deployed)
  • Restricting development (models with a risk score of "critical" cannot be developed further until safety techniques have been applied that get it down to "high." Although they kind of get to decide when they think their safety techniques have worked sufficiently well.)

My one-sentence reaction after reading the doc for the first ... (read more)

I believe labels matter, and I believe the label "preparedness framework" is better than the label "responsible scaling policy." Kudos to OpenAI on this. I hope we move past the RSP label.

I think labels will matter most when communicating to people who are not following the discussion closely (e.g., tech policy folks who have a portfolio of 5+ different issues and are not reading the RSPs or PFs in great detail).

One thing I like about the label "preparedness framework" is that it raises the question "prepared for what?", which is exactly the kind of question I want policy people to be asking. PFs imply that there might be something scary that we are trying to prepare for. 

Thanks for this overview, Trevor. I expect it'll be helpful– I also agree with your recommendations for people to consider working at standard-setting organizations and other relevant EU offices.

One perspective that I see missing from this post is what I'll call the advocacy/comms/politics perspective. Some examples of this with the EU AI Act:

  • Foundation models were going to be included in the EU AI Act, until France and Germany (with lobbying pressure from Mistral and Aleph Alpha) changed their position.
  • This initiated a political/comms battle between those
... (read more)
Thanks for these thoughts! I agree that advocacy and communications is an important part of the story here, and I'm glad for you to have added some detail on that with your comment. I’m also sympathetic to the claim that serious thought about “ambitious comms/advocacy” is especially neglected within the community, though I think it’s far from clear that the effort that went into the policy research that identified these solutions, or the work on the ground in Brussels, should have been shifted at the margin to the kinds of public communications you mention. I also think Open Phil’s strategy is pretty bullish on supporting comms and advocacy work, but it has taken us a while to acquire the staff capacity to gain context on those opportunities and begin funding them, and perhaps there are specific opportunities that you're more excited about than we are.

For what it’s worth, I didn’t seek significant outside input while writing this post and think that's fine (given the alternative of writing it quickly, posting it here, disclaiming my non-expertise, and getting additional perspectives and context from commenters like yourself). However, I have spoken with about a dozen people working on AI policy in Europe over the last couple months (including one of the people whose public comms efforts are linked in your comment) and would love to chat with more people with experience doing policy/politics/comms work in the EU.

We could definitely use more help thinking about this stuff, and I encourage readers who are interested in contributing to OP’s thinking on advocacy and comms to do any of the following:

  • Write up these critiques (we do read the forums!);
  • Join our team (our latest hiring round specifically mentioned US policy advocacy as a specialization we'd be excited about, but people with advocacy/politics/comms backgrounds more generally could also be very useful, and while the round is now closed, we may still review general applications); and/or
  • Introduce yo

What are some examples of situations in which you refer to this point?

My own model differs a bit from Zach's. It seems to me like most of the publicly-available policy proposals have not gotten much more concrete. It feels a lot more like people were motivated to share existing thoughts, as opposed to people having new thoughts or having more concrete thoughts.

Luke's list, for example, is more of a "list of high-level ideas" than a "list of concrete policy proposals." It has things like "licensing" and "information security requirements"– it's not an actual bill or set of requirements. (And to be clear, I still like Luke's p... (read more)

Thanks for all of this! Here's a response to your point about committees.

I agree that the committee process is extremely important. It's especially important if you're trying to push forward specific legislation. 

For people who aren't familiar with committees or why they're important, here's a quick summary of my current understanding (there may be a few mistakes):

  1. When a bill gets introduced in the House or the Senate, it gets sent to a committee. The decision is made by the Speaker of the House or the presiding officer in the Senate. In practice, howev
... (read more)

WTF do people "in AI governance" do?

Quick answer:

  1. A lot of AI governance folks primarily do research. They rarely engage with policymakers directly, and they spend much of their time reading and writing papers.
  2. This was even more true before the release of GPT-4 and the recent wave of interest in AI policy. Before GPT-4, many people believed "you will look weird/crazy if you talk to policymakers about AI extinction risk." It's unclear to me how true this was (in a genuine "I am confused about this & don't think I have good models of this" way). Regardles
... (read more)
My prediction is that the AI safety community will overestimate the difficulty of policymaker engagement/outreach. I think that the AI safety community has quickly and accurately taken social awkwardness and nerdiness into account, and factored that out of the equation. However, they will still overestimate the difficulty of policymaker outreach, on the basis that policymaker outreach requires substantially above-average sociability and personal charisma.

Even among the many non-nerd extroverts in the AI safety community, who have above-average or well above-average social skills (e.g. ~80th or 90th percentile), the ability to do well in policy requires an extreme combination of traits that produce intense charismatic competence, such as the traits required for a sense of humor near the level of a successful professional comedian (e.g. ~99th or 99.9th percentile). This is because the policy environment, like that of corporate executives, selects for charismatic extremity. Because people who are introspective or think about science at all are very rarely far above the 90th percentile for charisma, even if only the obvious natural extroverts are taken into account, the AI safety community will overestimate the difficulty of policymaker outreach.

I don't think they will underestimate the value of policymaker outreach (in fact I predict they are overestimating the value, due to American interests in using AI for information warfare pushing AI decisionmaking towards inaccessible and inflexible parts of natsec agencies). But I do anticipate them underestimating the feasibility of policymaker outreach.
That makes sense and sounds sensible, at least pre-ChatGPT.

I would also suspect that #2 (finding/generating good researchers) is more valuable than #1 (generating or accelerating good research during the MATS program itself).

One problem with #2 is that it's usually harder to evaluate and takes longer to evaluate. #2 requires projections, often over the course of years. #1 is still difficult to evaluate (what is "good alignment research" anyways?) but seems easier in comparison.

Also, I would expect correlations between #1 and #2. Like, one way to evaluate "how good are we doing at training researchers//who are the ... (read more)

Thanks for writing this! I’m curious if you have any information about the following questions:

  1. What does the MATS team think are the most valuable research outputs from the program?

  2. Which scholars was the MATS team most excited about in terms of their future plans/work?

IMO, these are the two main ways I would expect MATS to have impact: research output during the program and future research output/career trajectories of scholars.

Furthermore, I’d suspect things to be fairly tails-based (where EG the top 1-3 research outputs and the top 1-5 scholars ... (read more)

We expect to achieve more impact through (2) than (1); per the theory of change section above, MATS' mission is to expand the talent pipeline. Of course, we hope that scholars produce great research through the program, but often we are excited about scholars doing research (1) as a means for scholars to become better researchers (2). Other times, these goals are in tension. For example, some mentors explicitly de-emphasize research outputs, encouraging scholars to focus instead on pivoting frequently to develop better taste.

If you have your own reason to think (1) is comparably important to (2), I wonder if you also think providing mentees to researchers whose agendas you support is a similarly important source of MATS' impact? From the Theory of Change section:
Juan Gil · 3mo
I also find it plausible that the top 1-5 scholars are responsible for most of the impact, and we want to investigate this to a greater extent. Unfortunately, it's difficult to evaluate the impact of a scholar's research and career trajectory until more like 3-12 months after the program, so we decided to separate that analysis from the retrospective of the summer 2023 program. We've begun collecting this type of information (for past cohorts) via alumni surveys and other sources and hope to have another report out in the next few months that closer tracks the impact that we expect MATS to have.

I'd expect the amount of time this all takes to be a function of the time-control.

Like, if I have 90 mins, I can allocate more time to all of this. I can consult each of my advisors at every move. I can ask them follow-up questions.

If I only have 20 mins, I need to be more selective. Maybe I only listen to my advisors during critical moves, and I evaluate their arguments more quickly. Also, this inevitably affects the kinds of arguments that the advisors give.

Both of these scenarios seem pretty interesting and AI-relevant. My all-things-considered guess wo... (read more)

Is there a reason you’re using 3 hour time control? I’m guessing you’ve thought about this more than I have, but at first glance, it feels to me like this could be done pretty well with EG 60-min or even 20-min time control.

I’d guess that having 4-6 games that last 20-30 mins each is better than having 1 game that lasts 2 hours.

(Maybe I’m underestimating how much time it takes for the players to give/receive advice. And ofc there are questions about the actual situations with AGI that we’re concerned about— EG to what extent do we expect time pressure to be a relevant factor when humans are trying to evaluate arguments from AIs?)

It takes a lot of time for advisors to give advice, the player has to evaluate all the suggestions, and there's often some back-and-forth discussion. It takes much too long to make moves in under a minute.

Thanks! And why did Holden have the ability to choose board members (and be on the board in the first place)?

I remember hearing that this was in exchange for OP investment into OpenAI, but I also remember Dustin claiming that OpenAI didn’t actually need any OP money (would’ve just gotten the money easily from another investor).

Is your model essentially that the OpenAI folks just got along with Holden and thought he/OP were reasonable, or is there a different reason Holden ended up having so much influence over the board?

My model is that this was a mixture of a reputational trade (OpenAI got to hire a bunch of talent from EA spaces and make themselves look responsible to the world) and just actually financial ($30M was a substantial amount of money at that point in time). Sam Altman has many times said he quite respects Holden, so that made up a large fraction of the variance. See e.g. this tweet: 

Clarification/history question: How were these board members chosen?

My current model is that Holden chose them: Tasha in 2018, and Helen in 2021 when he left and chose her as his successor board member. I don't know, but I think it was close to a unilateral decision on his side (like, I don't think anyone at OpenAI had much reason to trust Helen outside of Holden's endorsement, so my guess is he had a lot of leeway).

Thanks for this dialogue. I find Nate and Oliver's "here's what I think will actually happen" thoughts useful.

I also think I'd find it useful for Nate to spell out "conditional on good things happening, here's what I think the steps look like, and here's the kind of work that I think people should be doing right now. To be clear, I think this is all doomed, and I'm only saying this because Akash directly asked me to condition on worlds where things go well, so here's my best shot."

To be clear, I think some people do too much "play to your outs" reasoning. In... (read more)

Thanks for sharing this! I'm curious if you have any takes on Nate's comment or Oliver's comment:


I don't think we have any workable plan for reacting to the realization that dangerous capabilities are upon us. I think that when we get there, we'll predictably either (a) optimize against our transparency tools or otherwise walk right off the cliff-edge anyway, or (b) realize that we're in deep trouble, and slow way down and take some other route to the glorious transhumanist future (we might need to go all the way to WBE, or at least dramatically switch

... (read more)

Adding a datapoint here: I've been involved in the Control AI campaign, which was run by Andrea Miotti (who also works at Conjecture). Before joining, I had heard some integrity/honesty concerns about Conjecture. So when I decided to join, I decided to be on the lookout for any instances of lying/deception/misleadingness/poor integrity. (Sidenote: At the time, I was also wondering whether Control AI was just essentially a vessel to do Conjecture's bidding. I have updated against this– Control AI reflects Andrea's vision. My impression is that Conjectu... (read more)

I don't think aysja was endorsing "hope" as a strategy– at least, that's not how I read it. I read it as "we should hold leaders accountable and make it clear that we think it's important for people to state their true beliefs about important matters."

To be clear, I think it's reasonable for people to discuss the pros and cons of various advocacy tactics, and I think asking "to what extent do I expect X advocacy tactic will affect peoples' incentives to openly state their beliefs?" makes sense.

Separately, though, I think the "accountability frame" is impor... (read more)

I think a lot of alignment folk have made positive updates in response to the societal response to AI xrisk.

This is probably different than what you're pointing at (like maybe your claim is more like "Lots of alignment folks only make negative updates when responding to technical AI developments" or something like that).

That said, I don't think the examples you give are especially compelling. I think the following position is quite reasonable (and I think fairly common):

  • Bing Chat provides evidence that some frontier AI companies will fail at alignment even
... (read more)

Thanks for sharing this! A few thoughts:

It is likely that at ASL-4 we will require a detailed and precise understanding of what is going on inside the model, in order to make an “affirmative case” that the model is safe.

I'd be extremely excited for Anthropic (or ARC or other labs) to say more about what they believe would qualify as an affirmative case for safety. I appreciate this sentence a lot, and I think a "strong version" of affirmative safety (one that goes beyond "we have not been able to detect danger" toward "we have an understanding of the system we a... (read more)

Did you mean ASL-2 here?

My understanding is that their commitment is to stop once their ASL-3 evals are triggered. They hope that their ASL-3 evals will be conservative enough to trigger before they actually have an ASL-3 system, but I think that's an open question. I've edited my comment to say "before Anthropic scales beyond systems that trigger their ASL-3 evals". See this section from their RSP below (bolding my own):

"We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e., before continuing training beyond when ASL-3 evaluations... (read more)

Ok, we agree. By "beyond ASL-3" I thought you meant "stuff that's outside the category ASL-3" instead of "the first thing inside the category ASL-3".

Yep, that summary seems right to me. (I also think the "concrete commitments" statement is accurate.)

Yeah, I also think putting the burden of proof on scaling (instead of on pausing) is safer and probably appropriate. I am hesitant about it on process grounds; it seems to me like evidence of safety might require the scaling that we're not allowing until we see evidence of safety. On net, it seems like the right decision on the current margin, but the same lock-in concerns (if we do the right thing now for the wrong reasons, perhaps we will do the wrong thing for the same reasons in the future) worry me about simply switching the burden of proof (instead of coming up with a better system to evaluate risk).

I got the impression that Anthropic wants to do the following things before it scales beyond systems that trigger their ASL-3 evals:

  1. Have good enough infosec so that it is "unlikely" for non-state actors to steal model weights, and state actors can only steal them "with significant expense."
  2. Be ready to deploy evals at least once every 4X in effective compute
  3. Have a blog post that tells the world what they plan to do to align ASL-4 systems.

The security commitment is the most concrete, and I agree with Habryka that these don't seem likely to cause Anthropic to... (read more)

I got the impression that Anthropic wants to do the following things before it scales beyond ASL-3:

Did you mean ASL-2 here? This seems like a pretty important detail to get right. (What they would need to do to scale beyond ASL-3 is meet the standard of an ASL-4 lab, which they have not developed yet.)

I agree with Habryka that these don't seem likely to cause Anthropic to stop scaling:

By design, RSPs are conditional pauses; you pause until you have met the standard, and then you continue. If you get the standard in place soon enough, you don't need to paus... (read more)

Thank you for this, Zvi!

Reporting requirements for foundation models, triggered by highly reasonable compute thresholds.

I disagree, and I think the compute threshold is unreasonably high. I don't even mean this in a "it is unreasonable because an adequate civilization would do way better" sense – I currently mean it in a "I think our actual civilization, even with all of its flaws, could have expected better" sense.

There are very few companies training models above 10^26 FLOP, and it seems like it would be relatively easy to simply say "hey, we are doing this trainin... (read more)

4 · Neel Nanda · 4mo
My guess is that the threshold is a precursor to more stringent regulation on people above the bar, and that it's easier to draw a line in the sand now and stick to it. I feel pretty fine with it being so high
I think part 2 that details the reactions will provide important color here - if this had impacted those other than the major labs right away, I believe the reaction would have been quite bad, and that setting it substantially lower would have been a strategic error and also a very hard sell to the Biden Administration. But perhaps I am wrong about that. They do reserve the ability to change the threshold in the future. 

Since OpenAI hasn't released its RDP, I agree with the claim that Anthropic's RSP is currently more concrete than OpenAI's RDP. In other words, "Anthropic has released something, and OpenAI has not yet released something."

That said, I think this comment might make people think Anthropic's RSP is more concrete than it actually is. I encourage people to read this comment, as well as the ensuing discussion between Evan and Habryka. 

Especially this part of one of Habryka's comments:

The RSP does not specify the conditions under which Anthropic would stop s

... (read more)

FWIW I read Anthropic's RSP and came away with the sense that they would stop scaling if their evals suggested that a model being trained either registered as ASL-3 or was likely to (if they scaled it further). They would then restart scaling once they 1) had a definition of the ASL-4 model standard and lab standard and 2) met the standard of an ASL-3 lab.

Do you not think that? Why not?

I think it’s good for proponents of RSPs to be open about the sorts of topics I’ve written about above, so they don’t get confused with e.g. proposing RSPs as a superior alternative to regulation. This post attempts to do that on my part. And to be explicit: I think regulation will be necessary to contain AI risks (RSPs alone are not enough), and should almost certainly end up stricter than what companies impose on themselves.

Strong agree. I wish ARC and Anthropic had been more clear about this, and I would be less critical of their RSP posts if they had s... (read more)

Thanks for the thoughts! Some brief (and belated) responses:

  • I disagree with you on #1 and think the thread below your comment addresses this.
  • Re: #2, I think we have different expectations. We can just see what happens, but I'll note that the RSP you refer to is quite explicit about the need for further iterations (not just "revisions" but also the need to define further-out, more severe risks).
  • I'm not sure what you mean by "an evals regime in which the burden of proof is on labs to show that scaling is safe." How high is the burden you're hoping for? If they need to definitively rule out risks like "The weights leak, then the science of post-training enhancements moves forward to the point where the leaked weights are catastrophically dangerous" in order to do any further scaling, my sense is that nobody (certainly not me) has any idea how to do this, and so this proposal seems pretty much equivalent to "immediate pause," which I've shared my thoughts on. If you have a lower burden of proof in mind, I think that's potentially consistent with the work on RSPs that is happening (it depends on exactly what you are hoping for).
  • I agree with the conceptual point that improvements on the status quo can be net negative for the reasons you say. When I said "Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that's hard to articulate," I didn't mean to say that there's no way this can make intellectual sense. To take a quick stab at what bugs me: I think to the extent a measure is an improvement but insufficient, the strategy should be to say "This is an improvement but insufficient," accept the improvement and count on oneself to win the "insufficient" argument. This kind of behavior seems to generalize to a world in which everyone is clear about their preferred order of object-level states of the world, and incentive gradients consistently point in the right direction (especially

Why do you think RSPs don't put the burden of proof on labs to show that scaling is safe?

I think the RSP frame is wrong, and I don't want regulators to use it as a building block. My understanding is that labs are refusing to adopt an evals regime in which the burden of proof is on labs to show that scaling is safe. Given this lack of buy-in, the RSP folks concluded that the only thing left to do was to say "OK, fine, but at least please check to see if the system will imminently kill you. And if we find proof that the system is pretty clearly dangerous

... (read more)
OpenAI's RDP name seems nicer than the RSP name, for roughly the reason they explain in their AI summit proposal (and also 'risk-informed' feels more honest than 'responsible'):
8 · Oliver Sourbut · 4mo
I think we should keep the acronym, but change the words as necessary! Imagine (this is tongue in cheek and I don't have strongly-held opinions on RSPs yet)

Cross-posted to the EA Forum.

One reason I'm critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it's OK to keep going.

It's hard to take anything else you're saying seriously when you say things like this; it seems clear that you just haven't read Anthropic's RSP. I think that the current conditions and resulting safeguards are insufficient to prevent AI existential risk, but to say that it doesn't make them clear is just patently fals... (read more)

1 · M. Y. Zuo · 4mo
Can you link an example of what you believe to be a well-worded RSP?

Do you mind pointing me to the section? I skimmed your post again, and the only relevant thing I saw was this part:

  1. Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
  2. By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
  3. U
... (read more)
I agree that this is a problem, but it strikes me that we wouldn't necessarily need a concrete eval - i.e. we wouldn't need [by applying this concrete evaluation process to a model, we can be sure we understand it sufficiently]. We could have [here is a precise description of what we mean by "understanding a model", such that we could, in principle, create an evaluation process that answers this question]. We can then say in an RSP that certain types of model must pass an understanding-in-this-sense eval, even before we know how to write an understanding-in-this-sense eval. (Though it's not obvious to me that defining the right question isn't already most of the work.)

Personally, I'd prefer that this were done already - i.e. that anything we think is necessary should be in the RSP at some level of abstraction / indirection. That might mean describing properties an eval would need to satisfy. It might mean describing processes by which evals could be approved - e.g. deferring to an external board. [Anthropic's Long-Term Benefit Trust doesn't seem great for this, since it's essentially just Paul who'd have relevant expertise (?? I'm not sure about this - it's just unclear that any of the others would).]

I do think it's reasonable for labs to say that they wouldn't do this kind of thing unilaterally - but I would want them to push for a more comprehensive setup when it comes to policy.

@evhub I think it's great when you and other RSP supporters make it explicit that (a) you don't think they're sufficient and (b) you think they can lead to more meaningful regulation.

With that in mind, I think the onus is on you (and institutions like Anthropic and ARC) to say what kinds of regulations you support & why. And then I think most of the value will come from "what actual regulations are people proposing" and not "what is someone's stance on this RSP thing which we all agree is insufficient."

Except for the fact that there are ways to talk ab... (read more)

I did—I lay out a plan for how to get from where we are now to a state where AI goes well from a policy perspective in my RSP post.

I appreciated the section that contrasted the "reasonable pragmatist strategy" with what people in the "pragmatist camp" sometimes seem to be doing. 

EAs in AI governance would often tell me things along the lines of "trust me, the pragmatists know what they're doing. They support strong regulations, they're just not allowed to say it. At some point, when the time is right, they'll come out and actually support meaningful regulations. They appreciate the folks who are pushing the Overton Window etc etc."

I think this was likely wrong, or at least overso... (read more)

Congrats on launching!

Besides funding, are there things other people could provide to help your group succeed? Are there particular types of people you’d be excited to collaborate with, get mentorship/advice from, etc?

Thanks Akash. Speaking for myself, I have plenty of experience supervising MSc and PhD students and running an academic research group, but scientific institution building is a next-level problem. I have spent time reading and thinking about it, but it would be great to be connected to people with first-hand experience or who have thought more deeply about it, e.g.

  • People with advice on running distributed scientific research groups
  • People who have thought about scientific institution building in general (e.g. those with experience starting FROs in bioscienc
... (read more)

I just want to say that I thought this was an excellent dialogue. 

It is very rare for two people with different views/perspectives to come together and genuinely just try to understand each other, ask thoughtful questions, and track their updates. This dialogue felt like a visceral, emotional reminder that this kind of thing is actually still possible. Even on a topic as "hot" as AI pause advocacy. 

Thank you to Holly, Rob, and Jacob. 

I'll also note that I've been proud of Holly for her activism. I remember speaking with her a bit when she wa... (read more)

I'm interested in having dialogues about various things, including:

  • Responsible scaling policies
  • Pros and cons of the Bay Area//how the vibe in Bay Area community has changed [at least in my experience]
  • AI policy stuff in general (takes on various policy ideas, orgs, etc.)
  • Takeaways from talking to various congressional offices (already have someone for this one but I think it'd likely be fine for others to join)
  • General worldview stuff & open-ended reflection stuff (e.g., things I've learned, mistakes I've made, projects I've been proudest of, things I'm uncertain about, etc.)

I was thinking about this passage:

RSPs offer a potential middle ground between (a) those who think AI could be extremely dangerous and seek things like moratoriums on AI development, and (b) those who think that it’s too early to worry about capabilities with catastrophic potential. RSPs are pragmatic and threat-model driven: rather than arguing over the likelihood of future dangers, we can...

I think "extreme" was subjective and imprecise wording on my part, and I appreciate you catching this. I've edited the sentence to say "Instead, ARC implies that the morato... (read more)

Thanks, Zach! Responses below:

But the reason that ARC and Anthropic didn't write this 'good RSP' isn't that they're reckless, but because writing such an RSP is a hard open problem

I agree that writing a good RSP is a hard open problem. I don't blame ARC for not having solved the "how can we scale safely" problem. I am disappointed in ARC for communicating about this poorly (in their public blog post, and [speculative/rumor-based] maybe in their private government advocacy as well).

I'm mostly worried about the comms/advocacy/policy implications. If Anthrop... (read more)

I would love to see competing RSPs (or, better yet, RTDPs, as @Joe_Collman pointed out in a cousin comment).


Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.

I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:

What would a good RSP look like?

  • Clear commitments along the line
... (read more)

I happen to think that the Anthropic RSP is fine for what it is, but it just doesn't actually make any interesting claims yet. The key thing is that they're committing to actually having an ASL-4 criteria and safety argument in the future. From my perspective, the Anthropic RSP effectively is an outline for the sort of thing an RSP could be (run evals, have safety buffer, assume continuity, etc) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.

I would have preferred if they included tentative proposals for ASL-4 evaluations ... (read more)

9 · Lukas Finnveden · 4mo
Are you thinking about this post? I don't see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?

What would a good RSP look like?

  • Clear commitments along the lines of "we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities."
  • Clear commitments regarding what happens if the evals go off (e.g., "if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.")
  • Clear commitments regarding the safeguards that will be used once evals go off (e.g., "if a model scores above a 20 on the Cotra Situational Awareness Screen
... (read more)

Strongly agree with almost all of this.

My main disagreement is that I don't think the "What would a good RSP look like?" description is sufficient without explicit conditions beyond evals. In particular that we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding - and that we shouldn't expect to understand how and why it's insufficient before reality punches us in the face.

Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them... (read more)


It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs.

I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.

On top of that, the posts seem to have this "don't listen to the people who are

... (read more)

Thanks for this; seems reasonable to me. One quick note is that my impression is that it's fairly easy to set up a 501(c)4. So even if [the formal institution known as MIRI] has limits, I think MIRI would be able to start a "sister org" that de facto serves as the policy arm. (I believe this is accepted practice & lots of orgs have sister policy orgs.) 

(This doesn't matter right now, insofar as you don't think it would be an efficient allocation of resources to spin up a whole policy division. Just pointing it out in case your belief changes and the 501(c)3 thing felt like the limiting factor). 

Thanks for this response!

Do you have any candidates for the list in mind?

I didn't have any candidate mistakes in mind. After thinking about it for 2 minutes, here are some possible candidates (though it's not clear to me that any of them were actually mistakes):

  • Eliezer's TIME article explicitly mentions that states should "be willing to destroy a rogue datacenter by airstrike." On one hand, it seems important to clearly/directly communicate the kind of actions that Eliezer believes the world will need to be willing to take in order to keep us safe. On the
... (read more)
5 · Gretta Duleba · 4mo
Re: the wording about airstrikes in TIME: yeah, we did not anticipate how that was going to be received, and it's likely we would have wordsmithed it a bit more to make the meaning more clear had we realized. I'm comfortable calling that a mistake. (I was not yet employed at MIRI at the time, but I was involved in editing the draft of the op-ed, so it's at least as much on me as anybody else who was involved.)

Re: policy division: we are limited by our 501(c)3 status as to how much of our budget we can spend on policy work, and here 'budget' includes the time of salaried employees. Malo and Eliezer both spend some fraction of their time on policy, but I view it as unlikely that we'll spin up a whole 'division' about that. Instead, yes, we partner with/provide technical advice to CAIP and other allied organizations. I don't view failure-to-start-a-policy-division as a mistake, and in fact I think we're using our resources fairly well here.

Re: critiquing existing policy proposals: there is undoubtedly more we could do here, though I lean more in the direction of 'let's say what we think would be almost good enough' rather than simply critiquing what's wrong with other proposals.

I'm excited to see MIRI focus more on public communications. A few (optional) questions:

  • What do you see as the most important messages to spread to (a) the public and (b) policymakers?
  • What mistakes do you think MIRI has made in the last 6 months?
  • Does MIRI also plan to get involved in policy discussions (e.g. communicating directly with policymakers, and/or advocating for specific policies)?
  • Does MIRI need any help? (Or perhaps more precisely "Does MIRI need any help from the right kind of person with the right kind of skills, and if so, what would that person or those skills look like?")
9 · Gretta Duleba · 4mo
  That's a great question that I'd prefer to address more comprehensively in a separate post, and I should admit up front that the post might not be imminent, as we are currently hard at work on getting the messaging right and it's not a super quick process.

  Huh, I do not have a list prepared, and I am not entirely sure where to draw the line around what's interesting to discuss and what's not; furthermore, it often takes some time to have strong intuitions about what turned out to be a mistake. Do you have any candidates for the list in mind?

  We are limited in our ability to directly influence policy by our 501(c)3 status; that said, we do have some latitude there and we are exercising it within the limits of the law. See for example this tweet by Eliezer.

  Yes, I expect to be hiring in the comms department relatively soon but have not actually posted any job listings yet. I will post to LessWrong about it when I do.

I suspect that lines like this are giving people the impression that you [Nate] don't think there are (realistic) things that you can improve, or that you've "given up". 

I do have some general sense here that those aren't emotionally realistic options for people with my emotional makeup.

I have a sense that there's some sort of trap for people with my emotional makeup here. If you stay and try to express yourself despite experiencing strong feelings of frustration, you're "almost yelling". If you leave because you're feeling a bunch of frustration and

... (read more)

Huh, I initially found myself surprised that Nate thinks he's adhering to community norms. I wonder if part of what's going on here is that "community norms" is a pretty vague phrase that people can interpret differently. 

Epistemic status: Speculative. I haven't had many interactions with Nate, so I'm mostly going off of what I've heard from others + general vibes. 

Some specific norms that I imagine Nate is adhering to (or exceeding expectations in):

  • Honesty
  • Meta-honesty
  • Trying to offer concrete models and predictions
  • Being (internally) open to ackno
... (read more)

In academia, for instance, I think there are plenty of conversations in which two researchers (a) disagree a ton, (b) think the other person's work is hopeless or confused in deep ways, (c) honestly express the nature of their disagreement, but (d) do so in a way where people generally feel respected/valued when talking to them.

My model says that this requires them to still be hopeful about local communication progress, and happens when they disagree but already share a lot of frames and concepts and background knowledge. I, at least, find it much harde... (read more)

But I think some people possess the skill of "being able to communicate harsh truths accurately in ways where people still find the interaction kind, graceful, respectful, and constructive." And my understanding is that's what people like TurnTrout are wishing for.

Kinda. I'm advocating less for the skill of "be graceful and respectful and constructive" and instead looking at the lower bar of "don't be overtly rude and aggressive without consent; employ (something within 2 standard deviations of) standard professional courtesy; else social consequences." I want to be clear that I'm not wishing for some kind of subtle mastery, here. 

But I think some people possess the skill of "being able to communicate harsh truths accurately in ways where people still find the interaction kind, graceful, respectful, and constructive." And my understanding is that's what people like TurnTrout are wishing for.

This is a thing, but I'm guessing that what you have in mind involves a lot more than you're crediting of not actually trying for the crux of the conversation. As just one example, you can be "more respectful" by making fewer "sweeping claims" such as "you are making such and such error in rea... (read more)

Engaging with people in ways such that they often feel heard/seen/understood

This is not a reasonable norm. In some circumstances (including, it sounds like, some of the conversations under discussion) meeting this standard would require a large amount of additional effort, not related to the ostensible reason for talking in the first place.

Engaging with people in ways such that they rarely feel dismissed/disrespected

Again, a pretty unreasonable norm. For some topics, such as "is what you're doing actually making progress towards that thing you've ar... (read more)

TurnTrout, if you're comfortable sharing, I'd be curious to hear more about the nature of your interaction with Nate in July 2022.

Separately, my current read of this thread is something like "I wish it was easier for Nate to communicate with people, but it seems (at least to me) like the community is broadly aware that Nate can be difficult to communicate with & I think his reputation (at least in the AIS circles I'm in) already matches this. Also, it seems like he (at least sometimes) tries to be pretty clear/explicit in warning people about his commu... (read more)

it seems (at least to me) like the community is broadly aware that Nate can be difficult to communicate with & I think his reputation (at least in the AIS circles I'm in) already matches this. Also, it seems like he (at least sometimes) tries to be pretty clear/explicit in warning people about his communication/mentoring style.

I disagree on both points. I wasn't aware before talking with him, and I'd been in alignment and around the Bay for years. Some of my MATS mentees had been totally unaware and were considering engaging with him about something. P... (read more)

I second this: I skimmed part of Nate's comms doc, but it's unclear to me what TurnTrout is talking about unless he's talking about "being blunt". It sounds like, overall, there's something other than bluntness going on, because I feel like we already know about bluntness / we've thought a lot about the upsides and downsides of blunt people before.

I’m excited to see both of these papers. It would be a shame if we reached superintelligence before anyone had put serious effort into examining if CoT prompting could be a useful way to understand models.

I also had some uneasiness while reading some sections of the post. For the rest of this comment, I’m going to try to articulate where that’s coming from.

Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.

I notice that I’m pretty confused when I read this sentence. I didn’t read ... (read more)
