[An email with a bunch of links I sent an experienced ML researcher interested in learning about Alignment / x-safety.]

David Scott Krueger

[[FYI: I'm just copying this in and removing a few bits; apologies for formatting; I don't intend to post the attachments]]

EtA: I (embarrassingly and unfortunately) understated Richard Ngo's technical background in the email; I left what I originally wrote in ~~strikethrough.~~

EtA(2): I thought a bit more context would be useful here:
- This email was designed for a particular person after several in person conversations.
- I'm not proposing this as anything like "The best things to send to an arbitrary 'experienced ML researcher interested in learning about Alignment / x-safety'".
- I didn't put a ton of effort into this.
- I aimed to present a somewhat diverse and representative sampling of x-safety stuff.

EtA (3): Suggestions / Feedback welcome!

OK I figure perfect is the enemy of the good and this is already a doozy of an email, so I'm just going to send it :)

I think it would be great to get a better sense of what sort of materials are a good introduction for someone in your situation, so please let me know what you find most/least useful/etc.!

A few top recommendations...

- read about DeepMind's "safety, robustness, and assurance" breakdown of alignment
- sign up for the Alignment and ML Safety newsletters, and skim through the archives

- Look at this syllabus I mocked up for my UofT application (attached). I tried to focus on ML and include a lot of the most important ML papers, although there's a lot missing.

- Read about RL from human preferences if you haven't already. I imagine you might've seen the "backflipping noodle" blog post. I helped author a research agenda based on such approaches: "Scalable agent alignment via reward modeling". The research I talked about in my talk is part of a project on understanding reward model hacking; we wrote a short grant proposal for that (attached). I'm in the midst of rethinking how large a role I expect reward modeling (or RL more generally) to play in future AI systems, but I previously considered this one of the highest priority directions to work on. Others (e.g. Jan, first author of the agenda) are working on showing what you can do with reward modeling; I'm more concerned with figuring out if it is a promising approach at all or not (I suspect not, because of power-seeking / instrumental convergence).

- Look at the "7 Alternatives for agent alignment" in the agenda for a brief overview of alternative approaches to specification.
- Look at "10.1 Related research agendas" in ARCHES for a quick overview of various research agendas.

- The most widely known/cited agenda is

Concrete Problems in AI Safety

- My thesis (attached) doesn't include as much as I remember, but you still might find the sections on x-risk and Alignment a good+quick read.

- People have described the AGI fundamentals course as the best resource for alignment newcomers. Sections 4-6 seem the most relevant. It is aimed at people who also don't know about AI, and Richard (the author) ~~is a philosopher without much technical training~~ is a research scientist at OpenAI (AI Futures team), did a PhD in Philosophy of ML and a Masters in ML, and was a research engineer at DeepMind for two years. I haven't looked at it closely but I think it emphasizes the "pseudo-academic" literature, framing, and terminology more than I would like.

I'm personally not super excited about any of the technical research directions people have proposed, but I know a lot of people are a lot more excited about various agendas than I am. Rather than saying much more, I figured I would just share some resources with you and I'll be curious to hear your thoughts about any of this that you end up looking at!

Some random context:
A lot of thinking and writing in alignment happens outside academia. There's a whole pseudo-academic field with its own jargon that does a lot of reinventing the wheel and insight porn. I think it's definitely worth knowing about this stuff, since there are a lot of good ideas, and a lot of the hard work of alignment is about just trying to get any technical angle of attack on the problem. But I find it hard to keep track of, lacking in rigor, and overall I'm frustrated by the lack of clear standards, which seems to create a bit of a nepotistic / cult-of-personality vibe. This stuff is mostly posted on the Alignment Forum / LessWrong. I'd like to encourage more cross-pollination between these communities and academia, but there is something of a mutual lack of respect. I think having clearer standards for this pseudo-academic field, and leaning into the speculative/pre-paradigmatic/less-technical nature of a lot of the work, would help.
More context:
I did a lot of reading and talking to people about Alignment years ago, and haven't kept up as much with recent stuff that people have written, partially for lack of time, and partially because I feel like I already know / have thought about most things enough... But I also think a lot of the more recent stuff is probably a lot better written or further developed.

- Here are a few articles from LessWrong and EA forum that are specifically about the relevance of various kinds of technical research for AI x-safety:

https://forum.effectivealtruism.org/posts/hNPCo4kScxccK9Ham/open-problems-in-ai-x-risk-pais-5
https://www.lesswrong.com/posts/hvGoYXi2kgnS3vxqb/some-ai-research-areas-and-their-relevance-to-existential-1

https://www.lesswrong.com/posts/fRsjBseRuvRhMPPE5/an-overview-of-11-proposals-for-building-safe-advanced-ai

https://www.lesswrong.com/posts/FDJnZt8Ks2djouQTZ/how-do-we-become-confident-in-the-safety-of-a-machine

- A few articles motivating x-safety:
https://www.lesswrong.com/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai
https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/

- A few articles discussing more nuanced versions of AI x-risk than the "spherical cow":
https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic
https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like

Two hot new AI safety orgs are Redwood Research and Anthropic.
- This is Redwood's first (I think) publication/project (I haven't read it, but have talked to people about it. You can read about the motivation here):

Adversarial Training for High-Stakes Reliability

- This gives a good sense of how Anthropic is thinking about Alignment:

A General Language Assistant as a Laboratory for Alignment

- A lot of people are now excited about trying to align foundation models. Here's a post on that: https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models

I like thinking of this as basically being about getting models to do instruction following; examples include InstructGPT and this earlier Google paper:

generalizable Language Models with Instruction Fine-Tuning

- Paul Christiano is probably the biggest name (after Eliezer Yudkowsky) in the non-academic technical AI Alignment community.

-- The idea you mentioned of having AIs help you align something smarter is very similar to "Iterated Distillation and Amplification (IDA)" (which is itself quite vague/general...)

--- IDA can be viewed as motivated by "Humans consulting HCH" (HCH): https://ai-alignment.com/humans-consulting-hch-f893f6051455

--- https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616
--- https://arxiv.org/abs/1810.08575

--- Debate is similar in spirit to IDA: https://arxiv.org/abs/1805.00899
--- So is recursive reward modeling (mentioned in our reward modeling agenda; this paper also uses it: https://arxiv.org/abs/2109.10862)
-- I guess Paul now thinks IDA won't be competitive with other approaches, so has pivoted to this new thing called "Eliciting Latent Knowledge (ELK)": https://www.lesswrong.com/posts/zjMKpSB2Xccn9qi5t/elk-prize-results

- People are excited about interpretability, especially Chris Olah's work. I remain skeptical that this will pay off (for a variety of reasons), but it does look more promising than I originally expected.

https://www.lesswrong.com/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety

https://distill.pub/2020/circuits/?utm_campaign=Dynamically%20Typed&utm_medium=email&utm_source=Revue%20newsletter
https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

- Stuart Russell's approach is "assistance games". He talks about this in his book Human Compatible, which was a quick/fun read! The first paper is Cooperative IRL, a more recent (probably clearer) one is

Benefits of Assistance over Reward Learning

I don't think many others in the community are excited about this idea ATM. Rohin and Dima used to be, but seem convinced (maybe partially through discussions with me?) that the best way to get assistance-like behavior would be to just use reward modeling to train an AI to follow an assistance-like interaction protocol when interacting with people. This makes it seem like assistance isn't the best goal, and we should instead start with trying to learn some simpler interaction protocols that could help with IDA-style "bootstrapping".

Newsletters + Podcasts (some of these already mentioned above):

- IMO, Rohin Shah's alignment newsletter has historically been the best source for learning about and keeping track of developments in AI Alignment, and has done a good job of covering stuff both inside and outside of academia. Rohin has been super busy since starting at DeepMind, and is not keeping it up very well at the moment.
- Dan Hendrycks recently started a newsletter as well that is more ML focused. I think he shares my frustration (and more!) with the non-academic alignment community. While Rohin is maybe more focused on work that is motivated by x-safety, Dan is more focused on what seems relevant to x-safety.
- Daniel Filan has a podcast.
- FLI had a podcast, e.g. https://futureoflife.org/2020/04/15/an-overview-of-technical-ai-alignment-in-2018-and-2019-with-buck-shlegeris-and-rohin-shah/
- I've heard the 80k podcast has some of the best intro material; it's generally pretty good, IMO. Here's one: https://80000hours.org/podcast/episodes/paul-christiano-ai-alignment-solutions/

A few important peer-reviewed papers from AI Alignment researchers:


Concrete Problems in AI Safety

Adversarial Training for High-Stakes Reliability

A General Language Assistant as a Laboratory for Alignment

generalizable Language Models with Instruction Fine-Tuning

Benefits of Assistance over Reward Learning

Cooperative Inverse Reinforcement Learning

Deep reinforcement learning from human preferences

Benchmarking Neural Network Robustness

Value Alignment Verification

The Effects of Reward Misspecification

Optimal Policies Tend to Seek Power
