Leveling Up: advice & resources for junior alignment researchers



It seems to me like one (often obscured) reason for the disagreement between Thomas and Habryka is that they are thinking about different groups of people when they define "the field."

To assess the % of "the field" that's doing meaningful work, we'd want to do something like [# of people doing meaningful work]/[total # of people in the field].

Who "counts" in the denominator? Should we count anyone who has received a grant from the LTFF with the word "AI safety" in it? Only the ones who have contributed object-level work? Only the ones who have contributed object-level work that passes some bar? Should we count the Anthropic capabilities folks? Just the EAs who are working there?

My guess is that Thomas was using a more narrowly defined denominator (e.g., not counting most people who got LTFF grants and went off to do PhDs without contributing object-level alignment stuff; not counting most Anthropic capabilities researchers who have never-or-minimally engaged with the AIS community) whereas Habryka was using a more broadly defined denominator.

I'm not certain about this, and even if it's true, I don't think it explains the entire effect size. But I wouldn't be surprised if roughly 10-30% of the difference between Thomas and Habryka might come from unstated assumptions about who "counts" in the denominator. 

(My guess is that this also explains "vibe-level" differences to some extent. I think some people who look out into the community and think "yeah, I think people here are pretty reasonable and actually trying to solve the problem and I'm impressed by some of their work" are often defining "the community" more narrowly than people who look out into the community and think "ugh, the community has so much low-quality work and has a bunch of people who are here to gain influence rather than actually try to solve the problem.")

Quick note that this is from a year ago: March 4, 2022. (Might be good to put this at the top of the post so people don't think it's from 2023.)

Answer by Akash, Mar 07, 2023

I think a lot of threat models (including modern threat models) are found in, or heavily inspired by, old MIRI papers. I also think MIRI papers provide unusually clear descriptions of the alignment problem, why MIRI expects it to be hard, and why MIRI thinks intuitive ideas won't work (see e.g., Intelligence Explosion: Evidence and Import, Intelligence Explosion Microeconomics, and Corrigibility). 

Regarding more recent stuff, MIRI has been focusing less on research output and more on shaping discussion around alignment. They are essentially "influencers" in the alignment space. Some people I know label this as "not real research", which I think is true in some sense, but I think more about "what was the impact of this" than "does it fit into the definition of a particular term."

For specifics, List of Lethalities and Death with Dignity have had a pretty strong effect on discourse in the alignment community (whether or not this is "good" depends on the degree to which you think MIRI is correct and the degree to which you think the discourse has shifted in a good vs. bad direction). On how various plans miss the hard bits of the alignment challenge remains one of the best overviews/critiques of the field of alignment, and the sharp left turn post is a recent piece that is often cited to describe a particularly concerning (albeit difficult to understand) threat model. Six dimensions of operational adequacy is currently one of the best (and only) posts that tries to envision a responsible AI lab. 

Some people have found the 2021 MIRI Dialogues to be extremely helpful at understanding the alignment problem, understanding threat models, and understanding disagreements in the field. 

I believe MIRI occasionally advises people at other organizations (like Redwood, Conjecture, Open Phil) on various decisions. It's unclear to me how impactful their advice is, but it wouldn't surprise me if one or more orgs had changed their mind about meaningful decisions (e.g., grantmaking priorities or research directions) partially as a result of MIRI's advice. 

There's also MIRI's research, though I think this gets less attention at the moment because MIRI isn't particularly excited about it. But my guess is that if someone made a list of all the alignment teams, MIRI would currently have 1-2 teams in the top 20. 

With my comments, I was hoping to spark more of a back-and-forth. Having failed at that, I'm guessing part of the problem is that I didn't phrase my disagreements bluntly or strongly enough, while also noting various points of agreement, which might have overall made it sound like I had only minor disagreements.

Did you ask for more back-and-forth, or were you hoping Sam would engage in more back-and-forth without being explicitly prompted?

If it's the latter, I think the "maybe I made it seem like I only had minor disagreements" hypothesis is less likely than the "maybe Sam didn't even realize that I wanted to have more of a back-and-forth" hypothesis. 

I also suggest asking more questions when you're looking for back-and-forth. To me, a lot of your comments didn't seem to be inviting much back-and-forth, but adding questions would've changed this (even simple things like "what do you think?" or "Can you tell me more about why you believe X?")

Answer by Akash, Mar 03, 2023

Does this drive a "race to the bottom," where more lenient evals teams get larger market share?

I appreciate you asking this, and I find this failure mode plausible. It reminds me of one of the failure modes I listed here (where a group proposing strict evals gets outcompeted by a group proposing looser evals).

Governance failure: We are outcompeted by a group that develops (much less demanding) evals/standards (~10%). Several different groups develop safety standards for AI labs. One group has expertise in AI privacy and data monitoring, another has expertise in ML fairness and bias, and a third is a consulting company that has advised safety standards in a variety of high-stakes contexts (e.g., biosecurity, nuclear energy). 

Each group proposes their own set of standards. Some decision-makers at top labs are enthusiastic about The Unified Evals Agreement, but others are skeptical. In addition to object-level debates, there are debates about which experts should be trusted. Ultimately, lab decision-makers end up placing more weight on teams with experience implementing safety standards in other sectors, and they go with the standards proposed by the consulting group. Although these standards are not mutually exclusive with The Unified Evals Agreement, lab decision-makers are less motivated to adopt new standards (“we just implemented evals-- we have other priorities right now.”). The Unified Evals Agreement is crowded out by standards that have much less emphasis on long-term catastrophic risks from AI systems. 

Nonetheless, the "vibe" I get is that people seem quite confident that this won't happen. Perhaps because labs care a lot about x-risk and want to have high-quality evals, perhaps because lots of the people working on evals have good relationships with labs, and perhaps because they expect there aren't many groups working on evals/standards (except for the x-risk-motivated folks).

However, many of these people might not have a sufficient “toolbox” or research experience to have much marginal impact in short timelines worlds.

I think this is true for some people, but I also think people tend to overestimate the number of years it takes to have enough research experience to contribute.

I think a few people have been able to make useful contributions within their first year (though in fairness they generally had backgrounds in ML or AI, so they weren't starting completely from scratch), and several highly respected senior researchers have just a few years of research experience. (And they, on average, had less access to mentorship/infrastructure than today's folks). 

I also think people often overestimate the amount of time it takes to become an expert in a specific area relevant to AI risk (like subtopics in compute governance, information security, etc.)

Finally, I think people should try to model community growth & neglectedness of AI risk in their estimates. Many people have gotten interested in AI safety in the last 1-3 years. I expect that many more will get interested in AI safety in the upcoming years. Being one researcher in a field of 300 seems more useful than being one researcher in a field of 1500. 

With all that in mind, I really like this exercise, and I expect that I'll encourage people to do this in the future:

  1. Write out your credences for AGI being realized in 2027, 2032, and 2042;
  2. Write out your plans if you had 100% credence in each of 2027, 2032, and 2042;
  3. Write out your marginal impact in lowering P(doom) via each of those three plans;
  4. Work towards the plan that is the argmax of your marginal impact, weighted by your credence in the respective AGI timelines.
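The four steps above can be sketched in a few lines of Python. This is only a rough illustration, not the original author's method: the credences and impact numbers are made-up placeholders, the names `credences`, `marginal_impact`, and `best_plan` are hypothetical, and the sketch assumes (as a simplification) that each plan only pays off in the timeline it was written for.

```python
# Hypothetical placeholder numbers; substitute your own estimates.
# Step 1: credence that AGI is realized around each year.
credences = {2027: 0.2, 2032: 0.3, 2042: 0.5}

# Steps 2-3: estimated marginal impact (arbitrary units of P(doom) reduction)
# of the plan you would follow if you were certain of that timeline.
marginal_impact = {2027: 0.5, 2032: 1.0, 2042: 0.8}


def best_plan(credences, marginal_impact):
    """Step 4: return the year whose plan maximizes credence * impact."""
    scores = {year: credences[year] * marginal_impact[year] for year in credences}
    best_year = max(scores, key=scores.get)
    return best_year, scores


best_year, scores = best_plan(credences, marginal_impact)
print(f"Work towards the plan written for {best_year}")
```

With these placeholder inputs, the 2042 plan wins even though the 2032 plan has the highest impact conditional on its timeline, because the credence weighting matters as much as the impact estimate.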

I appreciate the comment and think I agree with most of it. Was there anything in the post that seemed to disagree with this reasoning?

I downvoted the post because I don't think it presents strong epistemics. Some specific critiques:

  • The author doesn't explain the reasoning that produced the updates. (They link to posts, but I don't think it's epistemically sound to link to them and say "I made updates and you can find the reasons why in these posts." At best, people read the posts, and then come away thinking "huh, I wonder which of these specific claims/arguments were persuasive to the poster.")
  • The author recommends policy changes (to LW and the field of alignment) that (in my opinion) don't seem to follow from the claims presented. (The claim "LW and the alignment community should shift their focuses" does not follow from "there is a 50-70% chance of alignment by default". See comment for more).
  • The author doesn't explain their initial threat model, why it was dominated by deception, and why they're unconvinced by other models of risk & other threat models.

I do applaud the author for sharing the update and expressing an unpopular view. I also feel some pressure to not downvote it because I don't want to be "downvoting something just because I disagree with it", but I think in this case it really is the post itself. (I didn't downvote the linked post, for example).

In other words, I now believe a significant probability, on the order of 50-70%, that alignment is solved by default.

Let's suppose that you are entirely right about deceptive alignment being unlikely. (So we'll set aside things like "what specific arguments caused you to update?" and tricky questions about modest epistemology/outside views).

I don't see how "alignment is solved by default with 50-70% probability" justifies claims like "capabilities progress is net positive" or "AI alignment should change purpose to something else."

If a doctor told me I had a disease that had a 50-70% chance to resolve on its own, otherwise it would kill me, I wouldn't go "oh okay, I should stop trying to fight the disease."

The stakes are also not symmetrical. Getting (aligned) AGI 1 year sooner is great, but it only leads to one extra year of flourishing. Getting unaligned AGI leads to a significant loss over the entire far-future. 

So even if we have a 50-70% chance of alignment by default, I don't see how your central conclusions follow.
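To make the asymmetry concrete, here is a toy expected-value comparison. Every number is an illustrative assumption, not an estimate from the post or the comment: the far future is valued at a hypothetical 1,000 "years of flourishing", the chance of alignment by default is set to 0.6 (inside the 50-70% range), and the names `expected_value`, `FUTURE_VALUE`, and `P_ALIGNED` are made up for this sketch.

```python
FUTURE_VALUE = 1000.0  # hypothetical value of the far future, in flourishing-years
P_ALIGNED = 0.6        # assumed chance of alignment by default (50-70% range)


def expected_value(p_aligned, extra_years=0.0):
    """Expected value if an unaligned outcome is worth 0 and an aligned
    outcome is worth the far future plus any years gained by arriving sooner."""
    return p_aligned * (FUTURE_VALUE + extra_years)


# Option A: speed up AGI by one year, leaving the alignment odds unchanged.
ev_accelerate = expected_value(P_ALIGNED, extra_years=1.0)

# Option B: no speed-up, but alignment work raises the odds by one percentage point.
ev_reduce_risk = expected_value(P_ALIGNED + 0.01)

print(ev_accelerate, ev_reduce_risk)
```

Under these assumptions, a one-point improvement in the odds is worth far more than a one-year speed-up, and the gap grows as the assumed value of the far future grows.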

I don't agree with everything in the post, but I do commend Sam for writing it. I think it's a rather clear and transparent post that summarizes some important aspects of his worldview, and I expect posts like this to be extremely useful for discourse about AI safety.

Here are four parts I found especially clear & useful to know:

Thoughts on safety standards

We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year. At some point, it may be important to get independent review before starting to train future systems, and for the most advanced efforts to agree to limit the rate of growth of compute used for creating new models. We think public standards about when an AGI effort should stop a training run, decide a model is safe to release, or pull a model from production use are important. Finally, we think it’s important that major world governments have insight about training runs above a certain scale.

Thoughts on openness

We now believe we were wrong in our original thinking about openness, and have pivoted from thinking we should release everything (though we open source some things, and expect to open source more exciting things in the future!) to thinking that we should figure out how to safely share access to and benefits of the systems. We still believe the benefits of society understanding what is happening are huge and that enabling such understanding is the best way to make sure that what gets built is what society collectively wants (obviously there’s a lot of nuance and conflict here)

Connection between capabilities and safety

Importantly, we think we often have to make progress on AI safety and capabilities together. It’s a false dichotomy to talk about them separately; they are correlated in many ways. Our best safety work has come from working with our most capable models. That said, it’s important that the ratio of safety progress to capability progress increases.

Thoughts on timelines & takeoff speeds

AGI could happen soon or far in the future; the takeoff speed from the initial AGI to more powerful successor systems could be slow or fast. Many of us think the safest quadrant in this two-by-two matrix is short timelines and slow takeoff speeds; shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang, and a slower takeoff gives us more time to figure out empirically how to solve the safety problem and how to adapt.

It’s possible that AGI capable enough to accelerate its own progress could cause major changes to happen surprisingly quickly (and even if the transition starts slowly, we expect it to happen pretty quickly in the final stages). We think a slower takeoff is easier to make safe, and coordination among AGI efforts to slow down at critical junctures will likely be important (even in a world where we don’t need to do this to solve technical alignment problems, slowing down may be important to give society enough time to adapt).
