Resources that (I think) new alignment researchers should know about

Orpheus16

EDIT: This post was last updated in February 2023. It may no longer be fully up to date.

Here are some resources that I think new alignment researchers should know about.

Starting off

AGI Safety Fundamentals is the most common starting point. There’s also the (new) Alignment 201 course, which is meant to be read after AGI Safety Fundamentals. Note that one disadvantage of reading a popular curriculum is that it might get you thinking in the same ways as everyone else (which might be especially concerning if people are currently coming up with the same ideas).

Exposing yourself to different sources than everyone else seems like a pretty good way to come up with different ideas than everyone else. As a result, I often recommend people spend a few weeks trying out alternative approaches.

For example, you might start thinking about the problem for yourself, dive deeply into the agenda of a researcher you respect, write down your naive hypotheses about AI alignment, dive into the list of lethalities, write about what you agree and disagree with, try out alignment problems, or perform other exercises (see below).

On the margin, I think more people should try out these “alternative” approaches. I’m especially excited when people are able to write, distill, and analyze works relatively early on in their upskilling journey.

There are also some readings that are meant to help you think about how to think about alignment. These are mostly about how to think rather than what to think (or what to think about), which makes me more comfortable recommending them widely. (Note, though, that the point about diversifying and seeking out alternative perspectives still applies.)

Some examples of readings about how to think about alignment research:

Most people start with the same few bad ideas (4 mins)
7 traps that (we think) new alignment researchers often fall into (4 mins)
More is different for AI (3 mins)
Worst-case thinking in AI alignment (8 mins)
Principles for alignment/agency projects (5 mins)
Taboo “outside view” (10 mins)

Nonetheless, I think AGI Safety Fundamentals is a decent place to start. I recommend completing it quickly: I expect you can do it in 1-4 weeks, as opposed to 8. Note, though, that it is useful to spend time thinking, reflecting, and writing.

Diving Deeper

OK, but what next? Broadly, you have three options: independent research, a program run by the AI safety community, or a program run by the rest of the world.

Independent Research

This is the easiest way to start, and it’s the most unstructured. You can follow your curiosities. You can go through an existing reading list. You can skill up in machine learning. You can do whatever you want, basically.

If you take this route, I recommend staying focused on trying to solve the problem (or specific subproblems). It’s easy to spend a lot of time “learning” without actually doing alignment research. That said, if you don’t know any linear algebra, calculus, or ML, nearly everyone recommends getting (at least) a basic understanding of these. See also 7 traps that (we think) new alignment researchers often fall into.

I usually recommend that people start off by spending at least 1-2 weeks doing research without much contact with others. I think it’s useful to be forced to come up with your own opinions before you get anchored by the ideas of others.

After that, many people find it helpful to connect with alignment researchers, visit AI safety hubs, and find “sparring partners” to discuss ideas with. There are also some Slack channels and Discord servers.

Programs within the AI Safety Community

SERI-Mats: Upskill in alignment research with a mentor

Refine: Upskill in conceptual alignment research with an emphasis on thinking for yourself (EDIT: No longer running)

Remix: Contribute to mechanistic interpretability research (EDIT: See Redwood Research's summer internship)

MLAB: Upskill in machine learning (advanced)

ML Safety Scholars: Upskill in machine learning (beginners)

Philosophy Fellowship: For grad students and PhDs in philosophy

PIBBSS: For social scientists and natural scientists

Long-Term Future Fund: For funding to support upskilling, AIS projects, and other ways to improve the long-term future

Other Programs

Online machine learning courses (e.g., this one)

Internships, non-alignment research projects, and courses in ML/math/CS.

Graduate schools (depending on your timelines)

Funding

I mentioned the Long-Term Future Fund earlier. You can apply for funding if you are interested in skilling up. On the margin, I think people should have a lower bar for applying. The application often takes 1-2 hours. If you are unsure whether you should apply, I suggest you apply.

You can also message me if you are unsure whether to apply (I do not work for EA Funds, and everything I say is my own opinion).

Exercises

I mentioned that I’m excited for people to do more “thinking for themselves” and writing. But what does this actually look like?

Here are some exercises that you may want to consider doing:

Naive Hypotheses: Write down naive hypotheses for how to solve alignment
Opinions on arguments: Take a post and explicitly write down which arguments you agree with and which arguments you disagree with
Vivek Hebbar’s alignment problems: A list of alignment problems that focus on core problems & encourage independent thinking
What 2026 looks like: Write down a detailed account of what you expect to happen in the upcoming years.
Conceptual alignment projects: Write down a list of conceptual (or empirical) alignment projects that you think are important.
Macrostrategy action plans: Write down your best guess of what everyone in the EA community should be prioritizing. Consider conditioning on certain assumptions about the world (e.g., if AGI is coming in 3 years, what should the AIS community be doing? What recommendations would I have for specific people, or for large portions of the community?)
Write down your current plan for solving alignment (similar to naive hypotheses, though a bit different in structure).
Taboo words like “complexity” or “inner alignment” or “AGI” or “reward”, and see how this affects your thinking.
If you’re working on a specific problem, ask yourself what success looks like. If you solve the specific problem, what happens next? How does this get us closer to aligned AI?
Distillations: Write up summaries of papers, conversations, videos, etc.

I'm grateful to Thomas Larsen and Olivia Jimenez for feedback on this post. If you want more suggestions from me and Olivia, check out this EAGxVirtual talk.

Thanks for writing this! Some other useful lists of resources:

AI Safety Support's giant list of useful links. It's got a lot of good stuff in there and stays pretty up to date
List of AI safety technical courses and reading lists
"AI safety resources and materials" tag on the EA Forum
"Collections and resources" tag on the EA Forum
"Collections and resources" tag on LessWrong
"List of links" tag on LessWrong

This is super valuable! Would you be happy for this content to be integrated into Stampy to be maintained as a living document?

Akash -- this is very helpful; thanks for compiling it!

I'm struck that much of the advice for newbies interested in 'AI alignment with human values' is focused very heavily on the 'AI' side of alignment, and not on the 'human values' side of alignment -- despite the fact that many behavioral and social sciences have been studying human values for many decades.

It might be helpful to expand lists like these to include recommended papers, books, blogs, videos, etc that can help alignment newbies develop a more sophisticated understanding of the human psychology side of alignment.

I have a list of recommended nonfiction books here, but it's not alignment-focused. From this list though, I think that many alignment researchers might benefit from reading 'The blank slate' (2002) by Steven Pinker, 'The righteous mind' (2012) by Jonathan Haidt, 'Intelligence' (2016) by Stuart Ritchie, etc.

Primarily because right now, we're not even close to that goal. We're trying to figure out how to avoid deceptive alignment right now.

If we're nowhere close to solving alignment well enough that even a coarse-grained description of actual human values is relevant yet, then I don't understand why anyone is advocating further AI research at this point.

Also, 'avoiding deceptive alignment' doesn't really mean anything if we don't have a relatively rich and detailed description of what 'authentic alignment' with human values would look like.

I'm truly puzzled by the resistance that the AI alignment community has against learning a bit more about the human values we're allegedly aligning with.

I agree with this. What if it was actually possible to formalize morality? (Cf «Boundaries» for formalizing an MVP morality.) Inner alignment seems like it would be a lot easier with a good outer alignment function!

Mostly because ambitious value learning is really fucking hard, and this proposal falls into all the problems that ambitious or narrow value learning has.

You're right though that AI capabilities will need to slow down, and I am not hopeful here.

Thanks for writing this! I would also add the CHAI internship to the list of programs within the AI safety community.