Leveling Up: advice & resources for junior alignment researchers

Wiki Contributions



Thanks! Despite the lack of SMART goals, I still feel like this reply gave me a better sense of what your priorities are & how you'll be assessing success/failure.

One failure mode– which I'm sure is already on your radar– is something like: "MIRI ends up producing lots of high-quality stuff but no one really pays attention. Policymakers and national security people are very busy and often only read things that (a) directly relate to their work or (b) are sent to them by someone who they respect."

Another is something like: "MIRI ends up focusing too much on making arguments/points that are convincing to general audiences but fail to understand the cruxes/views of the People Who Matter." (A strawman version of this is something like "MIRI ends up spending a lot of time in the Bay and there's lots of pressure to engage a bunch with the cruxes/views of rationalists, libertarians, e/accs, and AGI company employees. Meanwhile, the kinds of conversations happening among natsec folks & policymakers look very different, and MIRI's materials end up being less relevant/useful to this target audience."

I'm extremely confident that these are already on your radar, but I figure it might be worth noting that these are two of the failure modes I'm most worried about. (I guess besides the general boring failure mode along the lines of "hiring is hard and doing anything is hard and maybe things just stay slow and when someone asks what good materials you guys have produced the answer is still 'we're working on it'.)

(Final note: A lot of my questions and thoughts have been critical, but I should note that I appreciate what you're doing & I'm looking forward to following MIRI's work in the space! :D)


Thank you! I still find myself most curious about the "how will MIRI make sure it understands its audience" and "how will MIRI make sure its materials are read by policymakers + natsec people" parts of the puzzle. Feel free to ignore this if we're getting too in the weeds, but I wonder if you can share more details about either of these parts.

There is also an audience-specific component, and to do well on that, we do need to understand our audience better. We are working to recruit beta readers from appropriate audience pools.

There are several approaches here, most of which will not be executed by the comms team directly, we hand off to others


I'm surprised why some people are so interested in the idea of liability for extreme harms. I understand that from a legal/philosophical perspective, there are some nice arguments about how companies should have to internalize the externalities of their actions etc.

But in practice, I'd be fairly surprised if liability approaches were actually able to provide a meaningful incentive shift for frontier AI developers. My impression is that frontier AI developers already have fairly strong incentives to avoid catastrophes (e.g., it would be horrible for Microsoft if its AI model caused $1B in harms, it would be horrible for Meta and the entire OS movement if an OS model was able to cause $1B in damages.)

And my impression is that most forms of liability would not affect this cost-benefit tradeoff by very much. This is especially true if the liability is only implemented post-catastrophe. Extreme forms of liability could require insurance, but this essentially feels like a roundabout and less effective way of implementing some form of licensing (you have to convince us that risks are below an acceptable threshold to proceed.)

I think liability also has the "added" problem of being quite unpopular, especially among Republicans. It is easy to attack liability regulations as anti-innovation, argue that that it creates a moat (only big companies can afford to comply), and argue that it's just not how America ends up regulating things (we don't hold Adobe accountable for someone doing something bad with Photoshop.)

To be clear, I don't think "something is politically unpopular" should be a full-stop argument against advocating for it.

But I do think that "liability for AI companies" scores poorly both on "actual usefulness if implemented" and "political popularity/feasibility." I also think the "liability for AI companies" advocacy often ends up getting into abstract philosophy land (to what extent should companies internalize externalities) and ends up avoiding some of the "weirder" points (we expect AI has a considerable chance of posing extreme national security risks, which is why we need to treat AI differently than Photoshop.)

I would rather people just make the direct case that AI poses extreme risks & discuss the direct policy interventions that are warranted.

With this in mind, I'm not an expert in liability and admittedly haven't been following the discussion in great detail (partly because the little I have seen has not convinced me that this is an approach worth investing into). I'd be interested in hearing more from people who have thought about liability– particularly concrete stories for how liability would be expected to meaningfully shift incentives of labs. (See also here). 

Stylistic note: I'd prefer replies along the lines of "here is the specific argument for why liability would significantly affect lab incentives and how it would work in concrete cases" rather than replies along the lines of "here is a thing you can read about the general legal/philosophical arguments about how liability is good."


the artifacts we're producing are very big and we want to get them right.

To the extent that this can be shared– What are the artifacts you're most excited about, and what's your rough prediction about when they will be ready?

Moreover, how do you plan to assess the success/failure of your projects? Are there any concrete metrics you're hoping to achieve? What does a "really good outcome" for MIRI's comms team look like by the end of the year, and what does a "we have failed and need to substantially rethink our approach, speed, or personnel" outcome look like?

(I ask partially because one of my main uncertainties right now is how well MIRI will get its materials in front of the policymakers and national security officials you're trying to influence. In the absence of concrete goals/benchmarks/timelines, I could imagine a world where MIRI moves at a relatively slow pace, produces high-quality materials with truthful arguments, but this content isn't getting to the target audience, and the work isn't being informed by the concerns/views of the target audience.)


Got it– thank you! Am I right in thinking that your team intends to influence policymakers and national security officials, though? If so, I'd be curious to learn more about how you plan to get your materials in front of them or ensure that your materials address their core points of concern/doubt.

Put a bit differently– I feel like it would be important for your team to address these questions insofar as your team has the following goals:

The main audience we want to reach is policymakers – the people in a position to enact the sweeping regulation and policy we want – and their staff.

We are hopeful about reaching a subset of policy advisors who have the skill of thinking clearly and carefully about risk, particularly those with experience in national security.


Thank you for this update—I appreciate the clear reasoning. I also personally feel that the AI policy community is overinvested in the "say things that will get you points" strategy and underinvested in the "say true things that help people actually understand the problem" strategy. Specifically, I feel like many US policymakers have heard "be scared of AI because of bioweapons" but have not heard clear arguments about risks from autonomous systems, misalignment, AI takeover, etc. 

A few questions:

  1. To what extent is MIRI's comms team (or technical governance team) going to interact directly with policymakers and national security officials? (I personally suspect you will be more successful if you're having regular conversations with your target audience and taking note of what points they find confusing or unconvincing rather than "thinking from first principles" about what points make a sound argument.)
  2. To what extent is MIRI going to contribute to concrete policy proposals (e.g., helping offices craft legislation or helping agencies craft specific requests)?
  3. To what extent is MIRI going to help flesh out how its policy proposals could be implemented? (e.g., helping iron out the details of what a potential international AI compute governance regime would look like, how it would be implemented, how verification would work, what society would do with the time it buys)
  4. Suppose MIRI has an amazing resource about AI risks. How does MIRI expect to get national security folks and important policymakers to engage with it?

(Tagging @lisathiergart in case some of these questions overlap with the work of the technical governance team.)


6 respondents thought AI safety could communicate better with the wider world. The AI safety community do not articulate the arguments for worrying about AI risk well enough, come across as too extreme or too conciliatory, and lean into some memes too much or not enough.

I think this accurately captures a core debate in AI comms/AI policy at the moment. Some groups are worried about folks coming off as too extreme (e.g., by emphasizing AI takeover and loss-of-control risks) and some groups are worried about folks worrying so much about sounding "normal" that they give an inaccurate or incomplete picture of the risks (e.g., by getting everyone worried about AI-generated bioweapons, even if the speaker does not believe that "malicious use from bioweapons" is the most plausible or concerning threat model.) 

My own opinion is that I'm quite worried that some of the "attempts to look normal" have led to misleading/incorrect models of risk. These models of risk (which tend to focus more on malicious use than risks from autonomous systems) do not end up producing reasonable policy efforts.

The tides seem to be changing, though—there have been more efforts to raise awareness about AGI, AGI takeover, risks from autonomous systems, and risks from systems that can produce a decisive strategic advantage. I think these risks are quite important for policymakers to understand, and clear/straightforward explanations of them are rare. 

I also think status incentives are discouraging (some) people from raising awareness about these threat models– people don't want to look silly, dumb, sci-fi, etc. But IMO one of the most important comms/policy challenges will be getting people to take such threat models seriously, and I think there are ways to explain such threat models legitimately. 


Thanks for looking into this! A few basic questions about the Trust:

1. Do we know if trustees can serve multiple terms? See below for a quoted section from Anthropic's site:

Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees.

2. Do we know what % of the board is controlled by the trustees, and by when it is expected to be a majority?

The Trust is an independent body of five financially disinterested members with an authority to select and remove a portion of our Board that will grow over time (ultimately, a majority of our Board).

3. Do we know if Paul is still a Trustee, or does his new role at USAISI mean he had to step down?

The initial Trustees are:

Jason Matheny: CEO of the RAND Corporation
Kanika Bahl: CEO & President of Evidence Action
Neil Buddy Shah: CEO of the Clinton Health Access Initiative (Chair)
Paul Christiano: Founder of the Alignment Research Center
Zach Robinson: Interim CEO of Effective Ventures US


Which of the institutions would you count as AGI labs? (genuinely curious– usually I don't think about academic labs [relative to like ODA + Meta + Microsoft] but perhaps there are some that I should be counting.)

And yeah, OP funding is a weird metric because there's a spectrum of how much grantees are closely tied to OP. Like, there's a wide spectrum from "I have an independent research group and got 5% of my total funding from OP" all the way to like "I get ~all my funding from OP and work in the same office as OP and other OP allies and many of my friends/colleagues are OP etc."

That's why I tried to use the phrase "close allies/grantees", to convey more of this implicit cultural stuff than merely "have you ever received OP $." My strong impression is that the authors of the paper are much more intellectually/ideologically/culturally independent from OP, relative to the list of 17 interviewees presented above. 

Load More