(My understanding of) What Everyone in Technical Alignment is Doing and Why

elifland

One insight this has generated so far is that Reward is not the optimization target — instead, it is more like that which reinforces. People sometimes refer to inner aligning an RL agent with respect to the reward signal, but this doesn't actually make sense.

Grumble grumble. Savvy people have known that reward is not the optimization target for at least five years, probably more like a decade. It's true that various people don't know this yet & so I'm glad that post was written, but it's a bit unfair to credit shard theory with having generated that idea. (I think TurnTrout would agree with this, his post says that alignment people seem to be aware of this point already)
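A toy illustration of the "that which reinforces" reading, offered as a sketch rather than anything taken from the linked post: a REINFORCE-style update on a two-armed bandit. The function names, learning rate, and reward values are all assumptions made for the example. The thing to notice is that the policy (a softmax over logits) never takes reward as an input; reward only appears as a scalar that scales each parameter update.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy on a deterministic bandit with REINFORCE.

    arm_rewards, steps, lr and seed are illustrative choices, not values
    from the discussion above.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(logits)  # the policy: no reward term in here
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = arm_rewards[action]
        for i in range(len(logits)):
            # d/d logit_i of log pi(action) is onehot(action)_i - probs_i
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            # reward enters only here, as a multiplier on the update
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)


final_probs = reinforce_bandit([0.0, 1.0])
```

Whether the resulting policy is usefully described as "wanting" reward is exactly the question under dispute in this thread; the sketch only shows where reward sits mechanically in this one algorithm.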

[-]David Scott Krueger (formerly: capybaralet)3yΩ6123

I don't consider this a settled question; is there rigorous technical work establishing that "Reward is not the optimization target"?

[-]Daniel Kokotajlo3yΩ8162

Depends on your standards for "rigorous technical work" and "establishing." In some sense nothing on this topic is sufficiently rigorous, and in some sense nothing on this topic has been established yet. I think the Risks from Learned Optimization paper might be what you are looking for. There's also evhub's recent talk. And of course, TurnTrout's post that was linked above. And again, I just pulled these off the top of my head; the ideas in them have been floating around for a while.

I'd be interested to hear an argument that reward is the optimization target, if you've got one!

I suspect that this is an issue that will be cleared up by everyone being super careful and explicit and nitpicky about their definitions. (Because I think a big part of what's going on here is that people aren't doing that and so they are getting subtly confused and equivocating between importantly different statements, and then on top of that other people are misunderstanding their words)

[-]David Scott Krueger (formerly: capybaralet)3yΩ5157

Thanks! I don't think those meet my criteria. I also suspect "everyone being super careful and explicit and nitpicky about their definitions" is lacking, and I'd consider that a basic and essential component of rigorous technical work.

8Daniel Kokotajlo3y

Agreed! Got an argument that reward is the optimization target?

9David Scott Krueger (formerly: capybaralet)3y

I don't think this framing of it being the optimization target or not is very helpful. It's like asking "does SGD converge?" or "will my supervised learning model learn the true hypothesis?" The answer will depend on a number of factors, and it's often not best thought of as a binary thing. e.g. for agents that do planning based on optimizing a reward function, it seems appropriate to say that reward is the optimization target. Here's another argument: maybe it's the field of RL, and not Alex Turner, who is right about this: https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target#Appendix__The_field_of_RL_thinks_reward_optimization_target (I'm not sure Alex characterizes the field's beliefs correctly, and I'm sort of playing devil's advocate with that one (not a big fan of "outside views"), but it's a bit odd to act like the burden of proof is on someone who agrees with the relevant academic field).

6Daniel Kokotajlo3y

Thanks! I'm not sure the framing is helpful either, but reading Turner's linked appendix it does seem like various people are making some sort of mistake that can be summarized as "they seem to think the policy / trained network should be understood as trying to get reward, as preferring higher-reward outcomes, as targeting reward..." (And Turner says he himself was one of them despite doing a PhD in RL theory.) Like I said above, I think there's probably room for improvement here: if everyone defined their terms better, this problem would clear up and go away. I see Turner's post as movement in this direction but by no means the end of the journey.

Re your first argument: If I understand you correctly, you are saying that if your AI design involves something like Monte Carlo tree search using a reward-estimator module (I don't know what the technical term for that is) and the reward-estimator module is just trained to predict reward, then it's fair to describe the system as optimizing for the goal of reward. Yep, that seems right to me, modulo concerns about inner alignment failures in the reward-estimator module. I don't see this as contradicting Alex Turner's claims, but maybe it does.

Re your second argument, the appeal to authority: I suppose in a vacuum, not having thought about it myself or heard any halfway decent arguments, I'd defer to the RL field on this matter. But I have thought about it a bit myself and I have heard some decent arguments, and that effect is stronger than the deference effect for me, and I think this is justified.

4David Scott Krueger (formerly: capybaralet)3y

RE appeal to authority: I mostly mentioned it because you asked for an argument and I figured I would just provide any decent ones I thought of OTMH. But I have not provided anything close to my full thoughts on the matter, and probably won't, due to bandwidth.

5Steven Byrnes3y

Often, when an RL agent imagines a possible future roll-out, it does not evaluate whether that possible future is good or bad by querying an external ground-truth reward function; instead, it queries a learned value function. When that’s the case, the thing that the agent is foresightedly “trying” / “planning” to do is to optimize the learned value function, not the reward function. Right? For example, I believe AlphaZero can be described this way—it explores some number of possible future scenarios (I’m hazy on the details), and evaluates how good they are based on querying the learned value function, not querying the external ground-truth reward function, except in rare cases where the game is just about to end. I claim that, if we make AGI via model-based RL (as I expect), it will almost definitely be like that too. If an AGI has a (nonverbal) idea along the lines of “What if I try to invent a new microscope using (still-somewhat-vague but innovative concept)”, I can’t imagine how on earth you would build an external ground-truth reward function that can be queried with that kind of abstract hypothetical. But I find it very easy to imagine how a learned value function could be queried with that kind of abstract hypothetical. (You can say “OK fine but the learned value function will asymptotically approach the external ground-truth reward function”. However, that might or might not be true. It depends on the algorithm and environment. I expect AGIs to be in a nonstationary environment with vastly too large an action space to fully explore, and full of irreversible actions that make full exploration impossible anyway. In that case, we cannot assume that there’s no important difference between “trying” to maximize the learned value function versus “trying” to maximize the reward function.) Sorry if I’m misunderstanding. (My own discussion of this topic, in the context of a specific model-based RL architecture, is Section 9.5 here.)
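A minimal sketch of the distinction drawn above, under toy assumptions: a depth-limited planner that scores imagined rollouts with whatever leaf evaluator it is handed. The environment (positions on a line), the hypothetical `learned_value` table, and all function names are invented for illustration; this is not AlphaZero's search or the architecture from the linked Section 9.5. The learned table was, by assumption, fit only on states the agent has visited, so it knows nothing about the rewarded state.

```python
from itertools import product


def plan_first_action(state, actions, step, leaf_value, depth):
    """Search all depth-limited action sequences, score each by
    leaf_value(final_state), and return the first action of the best one."""
    best_score, best_action = float("-inf"), None
    for seq in product(actions, repeat=depth):
        s = state
        for a in seq:
            s = step(s, a)
        score = leaf_value(s)
        if score > best_score:
            best_score, best_action = score, seq[0]
    return best_action


def step(position, action):
    """Toy dynamics: move left (-1) or right (+1) on a line."""
    return position + action


def true_reward(position):
    """External ground truth (assumed): only position 3 is rewarded."""
    return 1.0 if position == 3 else 0.0


# Hypothetical learned value table; position 3 was never visited.
learned_value = {-2: 0.6, -1: 0.3, 0: 0.1, 1: 0.0, 2: 0.0}


def value_fn(position):
    """Learned evaluator: unvisited states default to 0.0."""
    return learned_value.get(position, 0.0)


# The planner that queries the learned value function...
action_under_value = plan_first_action(0, (-1, 1), step, value_fn, depth=3)
# ...versus one that could query ground-truth reward on hypotheticals.
action_under_reward = plan_first_action(0, (-1, 1), step, true_reward, depth=3)
```

In this toy the two planners head in opposite directions from the same start state, and `true_reward` is never called by the first one. That is the sense in which such an agent is "trying" to optimize the learned value function; whether the two converge depends on the algorithm and environment, as the comment says.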

[-]Rohin Shah3y*Ω15414

Note: I link to a bunch of stuff below in the context of the DeepMind safety team, this should be thought of as "things that particular people do" and may not represent the views of DeepMind or even just the DeepMind safety team.

I just don't know much about what the [DeepMind] technical alignment work actually looks like right now

We do a lot of stuff, e.g. of the things you've listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:

Eliciting latent knowledge (see ELK prizes, particularly the submission from Victoria Krakovna & Vikrant Varma & Ramana Kumar)
LLM alignment (lots of work discussed in the podcast with Geoffrey you mentioned)
Scalable oversight (same as above)
Mechanistic interpretability (unpublished so far)
Externalized Reasoning Oversight (my guess is that this will be published soon) (EDIT: this paper)
Communicating views on alignment (e.g. the post you linked, the writing that I do on this forum is in large part about communicating my views)
Deception + inner alignment (in particular examples of goal misgeneralization)
Understanding agency (see e.g. discovering agents, most of Ramana's posts)

And in ...

3Thomas Larsen3y

Thank you for this thoughtful response; I didn't know about most of these projects. I've linked this comment in the DeepMind section and made some modifications, both for clarity and to include a bit more. This is useful to know.

[-]Vika3yΩ5104

Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more".

Some corrections for your overall description of the DM alignment team:

I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
I would put DM alignment in the "fairly hard" bucket (p(doom) = 10-50%) for alignment difficulty, and the "mixed" bucket for "conceptual vs applied"

3Thomas Larsen3y

Sorry for the late response, and thanks for your comment, I've edited the post to reflect these.

2Vika3y

No worries! Thanks a lot for updating the post

[-]David Scott Krueger (formerly: capybaralet)3yΩ203523

The main thing missing here is academic groups (like mine at Cambridge https://www.davidscottkrueger.com/). This is a pretty glaring oversight, although I'm not that surprised since it's LW.

Some other noteworthy groups in academia led by people who are somewhat connected to this community:
- Jacob Steinhardt (Berkeley)
- Dylan Hadfield-Menell (MIT)
- Sam Bowman (NYU)
- Roger Grosse (UofT)

More at https://futureoflife.org/team/ai-existential-safety-community/ (although I think the level of focus on x-safety and engagement with this community varies substantially among these people).

BTW, FLI is itself worth a mention, as is FHI, maybe in particular https://www.fhi.ox.ac.uk/causal-incentives-working-group/ if you want to focus on technical stuff.

Some other noteworthy groups in academia led by people who are perhaps less connected to this community:
- Aleksander Madry (MIT)
- Percy Liang (Stanford)
- Scott Niekum (UMass Amherst)

These are just examples.

[-]elifland3y120

(speaking for just myself, not Thomas but I think it’s likely he’d endorse most of this)

I agree it would be great to include many of these academic groups; the exclusion wasn’t out of any sort of malice. Personally I don’t know very much about what most of these groups are doing or their motivations; if any of them want to submit brief write-ups I’d be happy to add them! :)

edit: lol, Thomas responded with a similar tone while I was typing

[-]David Reber3y106

The causal incentives working group should get mentioned, it's directly on AI safety: though it's a bit older I gained a lot of clarity about AI safety concepts via "Modeling AGI Safety Frameworks with Causal Influence Diagrams", which is quite accessible even if you don't have a ton of training in causality.

[-]Thomas Larsen3y100

Sorry about that, and thank you for pointing this out.

For now I've added a disclaimer (footnote 2 right now; I might make this more visible/clear but am not sure what the best way of doing that is). I will try to add a summary of some of these groups once I have read some of their papers; currently I have not read much of their research.

Edit: agree with Eli's comment.

0Gunnar_Zarncke3y

Can you provide some links to these groups?

[-]aog3yΩ8188

These professors all have a lot of published papers in academic conferences. It’s probably a bit frustrating to not have their work summarized, and then be asked to explain their own work, when all of their work is published already. I would start by looking at their Google Scholar pages, followed by personal websites and maybe Twitter. One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

[-]Thomas Larsen3yΩ6108

Agree with both aogara and Eli's comment.

One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday, and while I could quite quickly summarize the work itself, it was quite hard for me to figure out the motivations.

[-]David Scott Krueger (formerly: capybaralet)3yΩ31112

There's a lot of work that could be relevant for x-risk but is not motivated by it. Some of it is more relevant than work that is motivated by it. An important challenge for this community (to facilitate scaling of research funding, etc.) is to move away from evaluating work based on motivations, and towards evaluating work based on technical content.

6elifland3y

See The academic contribution to AI safety seems large and comments for some existing discussion related to this point

7joshc3y

PAIS #5 might be helpful here. It explains how a variety of empirical directions are related to X-Risk and probably includes many of the ones that academics are working on.

7aog3y

Agreed it's really difficult for a lot of the work. You've probably seen it already but Dan Hendrycks has done a lot of work explaining academic research areas in terms of x-risk (e.g. this and this paper). Jacob Steinhardt's blog and field overview and Sam Bowman's Twitter are also good for context.

[-]David Reber3y125

I second this, that it's difficult to summarize AI-safety-relevant academic work for LW audiences. I want to highlight the symmetric difficulty of trying to summarize the mountain of blog-post-style work on the AF for academics.

In short, both groups have steep reading/learning curves that are under-appreciated when you're already familiar with it all.

7elifland3y

Fair, I see why this would be frustrating and apologize for any frustration caused. In an ideal world we would have read many of these papers and summarized them ourselves, but that would have taken a lot of time and I think the post was valuable to get out ASAP. ETA: Probably it would have been better to include more of a disclaimer on the "everyone" point from the get-go, I think not doing this was a mistake.

5aog3y

(Also, this is an incredibly helpful writeup and it’s only to be expected that some stuff would be missing. Thank you for sharing it!)

6JohnMalin3y

I don't think the onus should be on the reader to infer x-risk motivations. In academic ML, it's the author's job to explain why the reader should care about the paper. I don't see why this should be different in safety. If it's hard to do that in the paper itself, you can always e.g. write a blog post explaining safety relevance (as mentioned by aogara, people are already doing this, which is great!). There are often many different ways in which a paper might be intended to be useful for x-risks (and ways in which it might not be). Often the motivation for a paper (even in the groups mentioned above) may be some combination of it being an interesting ML problem, interests of the particular student, and various possible thoughts around AI safety. It's hard to try to disentangle this from the outside by reading between the lines.

2Morpheus3y

On the other hand, there are a lot of reasons to believe the authors to be delusional about the promises of their research and its theory of impact. I think the most I get personally out of posts like this is having this 3rd party perspective that I can compare with my own.

0johnswentworth3y

On the one hand, yeah, probably frustrating. On the other hand, that's the norm in academia: people publish work and then nobody reads it.

[-]David Reber3y2716

Anecdotally, I've found the same said of Less Wrong / Alignment Forum posts among AI safety / EA academics: that it amounts to an echo chamber that no one else reads.

I suspect both communities are taking their collective lack of familiarity with the other as evidence that the other community isn't doing their part to disseminate their ideas properly. Of course, neither community seems particularly interested in taking the time to read up on the other, and seems to think that the other community should simply mimic their example (LWers want more LW synopses of academic papers, academics want AF work to be published in journals).

Personally I think this is symptomatic of a larger camp-ish divide between the two, which is worth trying to bridge.

[-]aog3y2617

All of these academics are widely read and cited. Looking at their Google Scholar profiles, every one of them has more than 1,000 citations, and half have more than 10,000. Outside of LessWrong, lots of people in academia and industry labs already read and understand their work. We shouldn't disparage people who are successfully bringing AI safety into the mainstream ML community.

[-]Thomas Larsen3y301

Just made a fairly large edit to the post after lots of feedback from commenters. My most recent changes include the following:

Note limitations in introduction (lack academics, not balanced depth proportional to people, not endorsed by researchers)
Update CLR as per Jesse's comment
Add FAR
Update brain-like AGI to include this.
Rewrite shard theory section
- Brain <-> shards
effort: 50 -> 75 hours :)
Add this paper to DeepMind
Add some academics (David Krueger, Sam Bowman, Jacob Steinhardt, Dylan Hadfield-Menell, FHI)
Add other category
Summary table updates:
- Update links in table to make sure they work.
- Add scale of organization
- Add people

Thank you to everyone who commented, it has been very helpful.

[-]TW1233y260

Thanks so much for writing this! I think it's a very useful resource to have. I wanted to add a few thoughts on your description of CAIS, which might help make it more accurate.

[Note: I worked full time at CAIS from its inception until a couple weeks ago. I now work there on a part time basis while finishing university. This comment hasn't been reviewed by others at CAIS, but I'm pretty confident it's accurate.]

For somebody external to CAIS, I think you did a fairly good job describing the organization so thank you! I have a couple things I'd probably change:

First, our outreach is not just to academics, but also to people in industry. We usually use the term "ML community" rather than "academia" for this reason.
Second, the technical research side of the organization is about a lot more than robustness. We do research in Trojans as you mention, which isn't robustness, but also in machine ethics, cooperative AI, anomaly detection, forecasting, and probably more areas soon. We are interested in most of the areas in Open Problems in AI X-Risk, but the extent to which we're actively working on them varies.
I also think it might be good to add our newly-announced (so maybe after you wrote

...

8Thomas Larsen3y

Thank you Thomas, I really appreciate you taking the time to write out your comment, it is very useful feedback. I've linked your comment in the post and rewritten the description of CAIS.

3TW1233y

Thanks! I really appreciate it, and think it's a lot more accurate now. Nitpicks: I think the MLSS link is currently broken. Also, in the headline table, it still emphasizes model robustness perhaps more than is warranted.

2Thomas Larsen3y

Right! I've changed both.

[-]Lone Pine3y1115

I confused CAIS with Drexler's Comprehensive AI Services. Can you add a clarification stating that they are different things?

1Thomas Larsen3y

Good point. We've added the Center for AI Safety's full name into the summary table which should help.

6TW1233y

Also, as to your comment: (My personal opinion, not necessarily the opinion of CAIS) I pretty much agree. It's the job of the concretizers (and also grantmakers to some extent) to incentivize/nudge research to be in a useful direction rather than a nonuseful direction, and for fieldbuilding to shift researchers towards more explicitly considering x-risk. But, as you say, competition can be a valuable force; if you can set the incentives right, it might not be necessary for all researchers to be caring about x-risk. If you can give them a fun problem to solve and make sure it's actually relevant and they are only rewarded for actually relevant work, then good research could still be produced. Relevant research has been produced by the ML community before by people who weren't explicitly thinking about x-risk (mostly "accidentally", i.e. not because anyone who cared about x-risk told them/incentivized them to, but hopefully this will change). Also, iterative progress involves making progress that works now but might not in the future. That's ok, as long as some of it does in fact work in the future.

8Thomas Larsen3y

Yeah I think the difficulty of setting this up correctly is the main crux. I'm quite uncertain on this, but I'll give the argument my model of John Wentworth makes against this: The Trojan detection competition does seem roughly similar to deception, and if you can find Trojans really well, it's plausible that you can find deceptive alignment. However, what we really need is a way to exert optimization pressure away from deceptive regions of parameter space. And right now, afaik, we have no idea how strongly deception is favored. I can imagine using methods from this competition to put a small amount of pressure away from this, by, e.g., restarting whenever you see deception, or running SGD on your interpreted deception. But this feels sketchy because 1) you are putting pressure on these tools, and you might just steer into regions of space where they fail, and 2) you are training a model until it becomes deceptive: eventually, a smart deceptive model will be actively trying to beat these tools. So what I really want is an understanding of the generators of deceptive alignment, which could take the form of a formal version of the argument given here, so that I can prevent entering the deceptive regions of parameter space in the first place. Could you link an example? I am curious what you have in mind. I'm guessing something like the ROME paper?

1joshc3y

Thoughts on John's comment: this is a problem with any method for detecting deception that isn't 100% accurate. I agree that finding a 100% accurate method would be nice, but good luck. Also, you can somewhat get around this by holding some deception detecting methods out (i.e. not optimizing against them). When you finish training and the held out methods tell you that your AI is deceptive, you start over. Then you have to try to think of another approach that is more likely to actually discourage deception than fool your held out detectors. This is the difference between gradient descent search and human design search, which I think is an important distinction. Also, FWIW, I doubt that trojans are currently a good microcosm for detecting deception. Right now, it is too easy to search for the trigger using brute force optimization. If you ported this over to sequential-decision-making land where triggers can be long and complicated, that would help a lot. I see a lot of current trojan detection research as laying the groundwork for future research that will be more relevant. In general, it seems better to me to evaluate research by asking "where is this taking the field/what follow-up research is this motivating?" rather than "how are the words in this paper directly useful if we had to build AGI right now?" Eventually, the second one is what matters, but until we have systems that look more like agents that plan and achieve goals in the real world, I'm pretty skeptical of a lot of the direct value of empirical research.
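The hold-out procedure described above can be sketched as follows. This is only a schematic under stated assumptions: `detectors`, `train_fn`, and `approaches` are hypothetical stand-ins (a detector is any callable returning True when it flags a model), and nothing here corresponds to an existing tool or to how any lab actually runs this. The structure mirrors a train/validation split: gradient descent only ever sees the first group of detectors, and the second group is used once, as a gate, after training finishes.

```python
import random


def split_detectors(detectors, n_held_out, seed=0):
    """Randomly partition detectors into (optimized_against, held_out)."""
    rng = random.Random(seed)
    shuffled = list(detectors)
    rng.shuffle(shuffled)
    return shuffled[n_held_out:], shuffled[:n_held_out]


def run_with_holdout(train_fn, approaches, optimized_against, held_out):
    """Try each human-designed approach in order.

    train_fn(approach, optimized_against) is assumed to return a trained
    model; it may use the first detector group as a training signal but
    never sees held_out. Returns the first (approach, model) pair that no
    held-out detector flags, or (None, None) if every approach is flagged.
    """
    for approach in approaches:
        model = train_fn(approach, optimized_against)
        if not any(detector(model) for detector in held_out):
            return approach, model
        # Flagged by a held-out detector: discard the model and let the
        # outer (human design) loop move on to a different approach.
    return None, None
```

The outer loop over `approaches` is the "human design search" in the comment: it also applies some selection pressure against the held-out detectors, just far less than gradient descent would, which is why the number of restarts matters.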

[-]Charlie Steiner3y241

Because this is from your perspective, could you say a bit about who you are, what your research tastes are, which of these people you've interacted with?

[-]Thomas Larsen3y270

That makes sense. For me:

Background: I graduated from college at the University of Michigan this spring, I majored in Math and CS. In college I worked on vision research for self-driving cars, and wrote my undergrad thesis on robustness (my linkedin). I spent a lot of time running the EA group at Michigan. I'm currently doing SERI MATS under John Wentworth.
Research taste: currently very bad and confused and uncertain. I want to become better at research and this is mostly why I am doing MATS right now. I guess I especially enjoy reading and thinking about mathy research like Infra-Bayesianism and MIRI embedded agency stuff, but I'll be excited about whatever research I think is the most important.
I'm pretty new to interacting with the alignment sphere (before this summer I had just read things online and taken AGISF). Who I've interacted with (I'm probably forgetting some, but gives a rough idea):
1. 1 conversation with Andrew Critch
2. ~3 conversations with people at each of Conjecture and MIRI
3. ~8 conversations with various people at Redwood
4. Many conversations with people who hang around Lightcone, especially John and other SERI MATS participants (including Tea

...

[-]elifland3y130

Good point. For myself:

Background (see also https://www.elilifland.com/): I did some research on adversarial robustness of NLP models while in undergrad. I then worked at Ought as a software/research engineer for 1.5 years, was briefly a longtermist forecasting entrepreneur then have been thinking independently about alignment strategy among other things for the past 2 months.
Research tastes: I'm not great at understanding and working on super mathy stuff, so I mostly avoided giving opinions on these. I enjoy toy programming puzzles/competitions but got bored of engineering large/complex systems which is part of why I left Ought. I'm generally excited about some level of automating alignment research.
Who I've interacted with:
1. A ton: Ought
2. ~3-10 conversations: Conjecture (vast majority being "Simulacra Theory" team), Team Shard
3. ~1-2 conversations with some team members: ARC, CAIS, CHAI, CLR, Encultured, Externalized Reasoning Oversight, MIRI, OpenAI, John Wentworth, Truthful AI / Owain Evans

[-]JesseClifton3y*2214

[I work at CAIF and CLR]

Thanks for this!

I recommend making it clearer that CAIF is not focused on s-risk and is not formally affiliated with CLR (except for overlap in personnel). While it’s true that there is significant overlap in CLR’s and CAIF’s research interests, CAIF’s mission is much broader than CLR’s (“improve the cooperative intelligence of advanced AI for the benefit of all”), and its founders + leadership are motivated by a variety of catastrophic risks from AI.

Also, “foundational game theory research” isn’t an accurate description of CAIF’s scope. CAIF is interested in a variety of fields relevant to the cooperative intelligence of advanced AI systems. While this includes game theory and decision theory, I expect that a majority of CAIF’s resources (measured in both grants and staff time) will be directed at machine learning, and that we’ll also support work from the social and natural sciences. Also see Open Problems in Cooperative AI and CAIF’s recent call for proposals for a better sense of the kinds of work we want to support.

[ETA] I don’t think “foundational game theory research” is an accurate description of CLR’s scope, either, though I understand how public wr...

1Thomas Larsen3y

Thanks for the update! We've edited the section on CLR to reflect this comment, let us know if it still looks inaccurate.

[-]SteveZ3y1810

I think this is a really nice write-up! As someone relatively new to the idea of AI Safety, having a summary of all the approaches people are working on is really helpful as it would have taken me weeks to put this together on my own.

Obviously this would be a lot of work, but I think it would be really great to post this as a living document on GitHub where you can update and (potentially) expand it over time, perhaps by curating contributions from folks. In particular it would be interesting to see three arguments for each approach: a “best argument for”, “best argument against” and “what I think is the most realistic outcome”, along with uncertainties for each.

9Thomas Larsen3y

Thanks! I probably won't do this, but I agree it would be good. I agree that this would be good, but especially hard to do in a manner endorsed by all parties. I might try to write a second version of this post that tries to write this out, specifically, trying to clarify the assumptions on what the world has to look like for this research to be useful.

1JakubK3y

Maybe the "AI Watch" page could incorporate ideas from this post and serve as an equivalent to "a living document on GitHub."

[-]AdamGleave3y140

One omission from the list is the Fund for Alignment Research (FAR), which I'm a board member of. That's fair enough: FAR is fairly young, and doesn't have a research agenda per se, so it'd be hard to summarize their work from the outside! But I thought it might be of interest to readers so I figured I'd give a quick summary here.

FAR's theory of change is to incubate new, scalable alignment research agendas. Right now I see a small range of agendas being pursued at scale (largely RLHF and interpretability), then a long tail of very diverse agendas being pursued by single individuals (mostly independent researchers or graduate students) or 2-3 person teams. I believe there's a lot of valuable ideas in this long tail that could be scaled, but this isn't happening due to a lack of institutional support. It makes sense that the major organisations want to focus on their own specific agendas -- there's a benefit to being focused! -- but it means a lot of valuable agendas are slipping through the cracks.

FAR's current approach to solving this problem is to build out a technical team (research engineers, junior research scientists, technical communication specialists) and provide support ...

1Thomas Larsen3y

Hi Adam, thank you so much for writing this informative comment. We've added your summary of FAR to the main post (and linked this comment).

[-]Rohin Shah3y132

The NAH is almost certainly not true for ethics itself (this would amount to a form of moral realism).

I don't follow. To get at my confusion:

Do you also think that the NAH is not true for trees because that would amount to a form of tree realism?
Do you think that GPT-N will not be able to answer questions about how humans would make ethical decisions?

Truthful AI

The authors don't view Truthful AI as a solution to alignment.

The default outcome of AGI is doom.

I object to the implication that the linked post argues for this claim: the "without specific countermeasures" part of that post does a lot of work.

4Thomas Larsen3y

Hi Rohin, thank you so much for your feedback. I agree with everything you said and will try to update the post for clarity. Sorry, that part was not well written (or well thought out), and so I'll try to clarify: What I meant by 'is the NAH true for ethics?' is 'do sufficiently intelligent agents tend to converge on the same goals?', which, now that I think about it, is just the negation of the orthogonality thesis. * I'm not sure I understand the tree realism post other than that a tree is a fuzzy category. While I am also fuzzy on the question of 'what are my values', that's not the argument I'm trying to make. * I definitely think GPT-N will be able to answer questions about how humans would make ethical decisions, and wouldn't be surprised if GPT-3 already performs fairly well at this. Thanks for pointing that out, I hadn't read that comment. Hm, yeah sorry for that poor reasoning, I think I should qualify that more. I do think that the default right now is that sufficient countermeasures are likely to not be deployed, but that point definitely deserves to be scrutinized more by me.

[-]Rohin Shah3y103

What I meant by 'is the NAH true for ethics?' is 'do sufficiently intelligent agents tend to converge on the same goals?', which, now that I think about it, is just the negation of the orthogonality thesis.

Ah, got it, that makes sense. The reason I was confused is that NAH applied to ethics would only say that the AI system has a concept of ethics similar to the ones humans have; it wouldn't claim that the AI system would be motivated by that concept of ethics.

[-]Anthony DiGiovanni3y*127

(Speaking for myself as a CLR researcher, not for CLR as a whole)

I don't think it's accurate to say CLR researchers think increasing transparency is good for cooperation. There are some tradeoffs here, such that I and other researchers are currently uncertain whether marginal increases in transparency are net good for AI cooperation. Though, it is true that more transparency opens up efficient equilibria that wouldn't have been possible without open-source game theory. (ETA: some relevant research by people (previously) at CLR here, here, and here.)

[-]AdamGleave3y115

I liked this post and think it'll serve as a useful reference point, I'll definitely send it to people who are new to the alignment field.

But I think it needs a major caveat added. As a survey of alignment research that regularly posts on LessWrong or interacts closely with that community, it does a fine job. But as capybaralet already pointed out, it misses many academic groups. And even some major industry groups are de-emphasized. For example, DeepMind alignment is 20+ people, and has been around for many years. But it's got if anything a slightly less detailed write-up than Team Shard, a small group of people for a few months, or infra-Bayesianism, largely one person for several years.

The best shouldn't be the enemy of the good, and some groups are just quite opaque, but I think it does need to be clear about its limitations. One antidote would be including in the table a sense of # of people, # of years it's been around, and maybe even funding to get a sense of what the relative scale of these different projects is.

2Thomas Larsen3y

Strong upvoted and I quite like this antidote, I will work on adding my guess of the scale of these orgs into the table.

[-]Steven Byrnes3y101

Aligned AI / Stuart Armstrong
The problem is that I don't see how to integrate this approach for solving this problem with deep learning. It seems like this approach might work well for a model-based RL setup where you can make the AI explicitly select for this utility function.

For my part, I was already expecting AGI to be some kind of model-based RL. So I’m happy to make that assumption.

However, when I tried to flesh out model splintering (a.k.a. concept extrapolation) assuming a model-based-RL AGI—see Section 14.4 here—I still couldn’t quite get the whole story to hang together.

(Before publishing that, I sent a draft to Stuart Armstrong, and he told me that he had a great answer but couldn’t make it public yet :-P )

4Thomas Larsen3y

Thanks for linking that! Oooh that is really exciting news.

[-]Ruby3y91

Curated! I think this post is a considerable contribution to the ecosystem and one that many people are grateful for. Progress is made by people building on the works of others, and for that to happen, people have to be aware of the works of others and able to locate those most relevant to them. As the Alignment field grows, it gets progressively harder to keep up with what everyone is up to, what's been tried, where more effort might be useful. Roundups like these enable more people to get a sense of what's happening much more cheaply. And seeing an overv... (read more)

[-]Gunnar_Zarncke3y*93

The Brain-like AGI safety research agenda has proposed multiple research areas, and multiple people are working on some of them:

15.2.1.2 The “Reverse-engineer human social instincts” research program

There is project aintelope (see the project announcement here) that operationalizes this by implementing agents according to Steven's framework. We have applied for LTFF funding.
There is also at least one more researcher actively working on it.

15.2.2.2 The “Easy-to-use super-secure sandbox for AGIs” research program

Encultured AI is working on th

... (read more)

3Thomas Larsen3y

Good point, I've updated the post to reflect this. I'm excited for your project :)

[-]TW1233y93

As somebody who used to be an intern at CHAI, but certainly isn't speaking for the organization:

CHAI seems best approximated as a collection of researchers doing a bunch of different things. There is more reinforcement learning at CHAI than elsewhere, and it's ML research, but it's not top down at all so it doesn't feel that unified. Stuart Russell has an agenda, but his students have their own agendas which only sometimes overlap with his.

[-]JanB3yΩ480

Anthropic is also working on inner alignment, it's just not published yet.

Regarding what "the point" of RL from human preferences with language models is; I think it's not only to make progress on outer alignment (I would agree that this is probably not the core issue; although I still think that it's a relevant alignment issue).

See e.g. Ajeya's comment here:

According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post tha

... (read more)

[-]habryka2yΩ461Review for 2022 Review

These kinds of overview posts are very valuable, and I think this one is as well. I think it was quite well executed, and I've seen it linked a lot, especially to newer people trying to orient to the state of the AI Alignment field, and the ever growing number of people working in it.

[-]Gabe M3y62

Thanks for actually taking the time to organize all the information here, this is and will be very useful!

For OpenAI, you could also link this recent blog post about their approach to alignment research that reinforces the ideas you already gathered. Though maybe that blog post doesn't go into enough detail or engage with those ideas critically and you've already read it and decided to leave it out?

2Thomas Larsen3y

Thank you Gabriel! Yeah good point, I think I should have included that link, updated now to include it.

[-]Gunnar_Zarncke2y40

It's been almost two years. I think it would be valuable to do a review or update to this summary post!

[-]Soroush Pour2y41

For anybody else wondering what "ERO" stands for in the DeepMind section -- it stands for "Externalized Reasoning Oversight" and more details can be found in this paper.

Source: @Rohin Shah's comment.

[-]Roman Leventov3y40

Alignment of Complex Systems Research Group is missing from the post?

[-]Raemon3y40

Note: I wanted to curate this post, but it seemed like it was still in the process of getting revisions based on various feedback. Thomas/elifland, when you think you've made all the edits you're likely to make and new edit-suggestions have trailed off, give me a ping.

[-]RobertM3y43

Great writeup, very happy to see an overview of the field like this.

One note: it looks like the Infra-Bayesianism section is cut off, and ends on a sentence fragment:

In the worlds where AIs solve alignment for us

1Thomas Larsen3y

Thank you Robert! I've fixed that, thanks for pointing that sentence fragment out.

[-]Kerrigan3y30

Humans have different values than the reward circuitry in our brain being maximized, but they are still pointed reliably. These underlying values cause us to not wirehead with respect to the outer optimizer of reward

Is there an already written expansion of this?

[-]jungofthewon3y30

and Ought either builds AGI or strongly influences the organization that builds AGI.

"strongly influences the organization that builds AGI" applies to all alignment research initiatives right? Alignment researchers at e.g. DeepMind have less of an uphill battle but they still have to convince the rest of DeepMind to adopt their work.

2elifland3y

Yes, I didn't mean to imply this was necessarily an Ought-specific problem and I guess it may have been a bit unfair for me to only do a BOTEC on Ought. I included it because I had the most fleshed-out thoughts on it but it could give the wrong impression about relative promise when others don't have BOTECs. Also people (not implying you!) often take my BOTECs too seriously, they're done in this spirit. That being said, I agree that strong within-organization influence feels more likely than across; not sure to what extent.

9Vika3y

I would expect that the way Ought (or any other alignment team) influences the AGI-building org is by influencing the alignment team within that org, which would in turn try to influence the leadership of the org. I think the latter step in this chain is the bottleneck - across-organization influence between alignment teams is easier than within-organization influence. So if we estimate that Ought can influence other alignment teams with 50% probability, and the DM / OpenAI / etc alignment team can influence the corresponding org with 20% probability, then the overall probability of Ought influencing the org that builds AGI is 10%. Your estimate of 1% seems too low to me unless you are a lot more pessimistic about alignment researchers influencing their organization from the inside.
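
A minimal sketch of the arithmetic in this comment, assuming (as the comment implicitly does) that the two steps form a chain whose probabilities simply multiply. The function and variable names are illustrative and not from the post.

```python
def chain_probability(step_probabilities):
    """Multiply the probabilities of sequential steps that must all succeed."""
    result = 1.0
    for p in step_probabilities:
        result *= p
    return result


# Numbers taken from the comment above.
ought_influences_alignment_team = 0.50  # across-organization step
alignment_team_influences_org = 0.20    # within-organization step

overall = chain_probability(
    [ought_influences_alignment_team, alignment_team_influences_org]
)
print(f"Overall probability of influencing the AGI-building org: {overall:.0%}")
```

Treating the second factor as a conditional probability (given that the first step succeeds) is what makes the product meaningful; if the steps were correlated in some other way, the simple product would only be a rough guide.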

3elifland3y

Good point, and you definitely have more expertise on the subject than I do. I think my updated view is ~5% on this step. I might be underconfident about my pessimism on the first step (competitiveness of process-based systems) though. Overall I've updated to be slightly more optimistic about this route to impact.

1jungofthewon3y

All good, thanks for clarifying.

[-]Nicholas Kross3y20

Also possibly relevant (though less detailed): this table I made.

[-]hamnox3y21

I wanna offer feedback on the READING.

at "Off the cuff I’d give something like 10%, 3%, 1% for these respectively (conditioned on the previous premises) which multiplies to .003%", the verbal version doubled back to remind what the referents for each percentage were, then read the sentence again.

that was PERFECT. high value add. made sure the actual point was gotten across, when it would have been very easy to just mentally tune out numerical information.

[-]Patodesu3y21

Even if you think S-risks from AGI are 70 times less likely than X-risks, you should think about how many times worse they would be. For me, they would be several orders of magnitude worse.

[-]peterslattery3y10

Is there a plan to review and revise this to keep it up to date? Or is there something similar that I can look at which is more updated? I have this saved as something to revisit, but I worry now that it could be out of date and inaccurate given the speed of progress.

1peterslattery3y

Also, just as feedback (which probably doesn't warrant any changes being made unless similar feedback is provided), I will flag that it would be good to be able to see posts that this is mentioned in ranked by recency rather than total karma.

[-]Aaron Bergman3y1-1

Note: I'm probably well below median commenter in terms of technical CS/ML understanding. Anyway...

I feel like a missing chunk of research could be described as “seeing DL systems as ‘normal,’ physical things and processes that involve electrons running around inside little bits of (very complex) metal pieces” instead of mega-abstracted “agents.”

The main reason this might be fruitful is that, at least intuitively and to my understanding, failures like “the AI stops just playing chess really well and starts taking over the world to learn how to play c... (read more)

1the gears to ascension3y

I can understand why it would seem excessively abstract, but when we speak of agency, we are in fact talking about patterns in the activations of the gpu's circuit elements - specifically we'd be talking about patterns of numerical feedback where the program forms a causal predictive model of a variable and then, based on the result of the predictive model, does any form of model-predictive control, eg outputting bytes (floats, probably) that encode an action that the action-conditional predictive model evaluates as likely to impact the variable. Merely minimizing loss is insufficient to end up with this outcome in many cases, but on some datasets, with some problem formulations - ones that we expect to come up, such as motor control of a robot in order to walk across a room, for a trivial example, or trying to select videos which maximize probability that a user stays on the website - we can expect that the predictive model, if more precise about the future than a human's predictive model, would allow the gpu code to select actions (motor actions or video selections) that have higher reliability of reaching the target outcome (cross the room, ensure the user stays on the site) that the control loop code evaluated via the predictive model. The worry is that, if an agent is general enough in purpose to form its own subgoals and evaluate those in the predictive model, it could end up doing multi-step plan chaining through this general world-simulator subalgorithm and realize it can attack its creators in one of a great many possible ways.

1Aaron Bergman3y

Ngl I did not fully understand this, but to be clear I don't think understanding alignment through the lens of agency is "excessively abstract." In fact I think I'd agree with the implicit default view that it's largely the single most productive lens to look through. My objection to the status quo is that it seems like the scale/ontology/lens/whatever I was describing is getting 0% of the research attention whereas perhaps it should be getting 10 or 20%. Not sure this analogy works, but if NIH was spending $10B on cancer research, I would (prima facie, as a layperson) want >$0 but probably <$2B spent on looking at cancer as an atomic-scale phenomenon, and maybe some amount at an even lower scale.

3the gears to ascension3y

yeah I was probably too abstract in my reply - to rephrase: a thermostat (or other extremely small control system) is a perfectly valid example of agency. it's not dangerously strong agency or any such thing. but my point is really to say that you're on the right track here, looking at the micro-scale versions of things is very promising.
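
To make the thermostat example concrete, here is a rough sketch of the loop described earlier in this thread: a predictive model of a variable, plus a controller that picks whichever action the model predicts will bring the variable closest to a target. Everything here is a hypothetical toy (the room model, its constants, and all function names), and the simulation assumes the model is a perfect description of the room.

```python
def predict_next_temperature(current, heater_on):
    """Hypothetical one-step room model: the heater adds heat, the room leaks toward 15 degrees."""
    heating = 1.5 if heater_on else 0.0
    leak = 0.1 * (current - 15.0)
    return current + heating - leak


def choose_action(current, target):
    """Pick the action whose predicted outcome lands closest to the target."""
    candidates = [False, True]  # heater off, heater on
    return min(
        candidates,
        key=lambda action: abs(predict_next_temperature(current, action) - target),
    )


def run(start, target, steps):
    """Run the control loop, using the model itself as the simulated room."""
    temperature = start
    history = []
    for _ in range(steps):
        heater_on = choose_action(temperature, target)
        temperature = predict_next_temperature(temperature, heater_on)
        history.append((heater_on, round(temperature, 2)))
    return history


history = run(start=15.0, target=20.0, steps=50)
print(history[:3], "...", history[-1])
```

This controller only looks one step ahead and has two actions, which is part of why it is not "dangerously strong" agency: there is no multi-step planning and no subgoal formation, just a single model query per action.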

^{^}

We may revise the document based on corrections in the comments or future announcements, but don't promise anything. Others are welcome to create future versions or submit summaries of their own approaches for us to edit in. We will note the time it was last edited when we edit things. (ETA: most recent update: 10/9/2022)

^{^}

In this chart, the ? denotes more uncertainty about whether this is a correct description

^{^}

~~I would appreciate someone giving more information on DeepMind's approach to alignment.~~ Update: Rohin has given a helpful summary in a comment.

^{^}

Technically, they just need to span the set of extrapolations, so that the correct extrapolation is just a linear combination of the found classifiers.
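
As a toy illustration of what "span" means here (not Aligned AI's actual method), the sketch below uses two hypothetical found classifiers and a hypothetical correct extrapolation that happens to be a linear combination of them. Two labeled points are then enough to recover the weights. All functions and numbers are made up for illustration.

```python
def classifier_a(point):
    """Hypothetical found classifier that scores the first feature."""
    x, y = point
    return x


def classifier_b(point):
    """Hypothetical found classifier that scores the second feature."""
    x, y = point
    return y


def correct_extrapolation(point):
    """Hypothetical ground truth, chosen to lie in the span of the two classifiers."""
    x, y = point
    return 2.0 * x - 1.0 * y


def solve_weights(points):
    """Solve w_a * a(p) + w_b * b(p) = target(p) at two points using Cramer's rule."""
    (a1, b1, t1), (a2, b2, t2) = [
        (classifier_a(p), classifier_b(p), correct_extrapolation(p)) for p in points
    ]
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("these points do not pin down the weights")
    w_a = (t1 * b2 - t2 * b1) / det
    w_b = (a1 * t2 - a2 * t1) / det
    return w_a, w_b


w_a, w_b = solve_weights([(1.0, 0.0), (0.0, 1.0)])

# The recovered combination agrees with the target on a point it was not fit on.
new_point = (3.0, 4.0)
combined = w_a * classifier_a(new_point) + w_b * classifier_b(new_point)
print(w_a, w_b, combined, correct_extrapolation(new_point))
```

If the correct extrapolation were not in the span of the found classifiers, no choice of weights would reproduce it, which is the condition the footnote is pointing at.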

^{^}

Hold on, how come you are excited about Conjecture automating alignment research but not OpenAI?
Answer: I see a categorical distinction between trying to align agentic and oracle AIs. Conjecture is trying only for oracle LLMs, trained without any RL pressure giving them goals, which seems way safer. OpenAI doing recursive reward modeling / IDA type schemes involves creating agentic AGIs and therefore also faces a lot more alignment issues like convergent instrumental goals, power seeking, goodharting, inner alignment failure, etc.

I think inner alignment can be a problem with LLMs trained purely in a self-supervised fashion (e.g., simulacra becoming aware of their surroundings), but I anticipate it to only be a problem with further capabilities. I think RL trained GPT-6 is a lot more likely to be an x-risk than GPT-6 trained only to do text prediction.

^{^}

To be clear: I am very against proposals for violent pivotal acts that are sometimes brought up, such as destroying other AI labs on the verge of creating a misaligned AGI. This seems bad because 1) violence is bad and isn't dignified. 2) it seems like this intention would make it much harder to coordinate. 3) Setting an AGI loose to pull off a violent pivotal act could incredibly easily disempower humanity: you are intentionally letting the AGI destructively take over.

^{^}

Some cruxes that would change this conclusion are if we don't get prosaic AGI or if solving alignment takes a lot of serial thought, e.g. work that needs to be done by 1 researcher over 10 years, and can't be solved by 10 researchers working for 1 year.

Approach	Problem Focus	Current Approach Summary	Scale
Aligned AI	Model splintering	Solve extrapolation problems.	2-5 researchers, started Feb 2022
ARC	Inaccessible information	ELK + LLM power-seeking evaluation	3 researchers, started April 2021
Anthropic	LLM Outer Alignment (?)[3]	Interpretability + HHH + augmenting alignment research with LLMs	~35? technical staff[3], started May 2021
Brain-like-AGI Safety	Brain-like AGI Safety	Use brains as a model for how AGI will be developed, think about alignment in this context	~4 researchers, started March 2021
Center for AI Safety (CAIS)	Engaging the ML community, many technical problems	Technical research, Infrastructure, and ML community field-building for safety	7-10 FTE, founded in ~March 2022
CHAI	Outer alignment, though CHAI is diverse	Improve CIRL + many other independent approaches.	~20 FTE?, founded in 2016
CLR	Suffering risks	Foundational game theory research	5-10 FTE, founded before 2015
Conjecture	Inner alignment	Interpretability + automating alignment research with LLMs	~20 FTE, announced April 2022
David Krueger	Goal misgeneralization	Empirical examples and understanding ML inductive biases	Academic lab with 7 students
DeepMind	Many including scalable oversight and goal misgeneralization	Many including Debate, discovering agents, ERO, and understanding threat models. [4]	>1000 FTE for the company as a whole, ~20-25 FTE on the alignment + scalable alignment teams
Dylan Hadfield-Menell	Value Alignment	Reward specification + Norms	Academic research lab
Encultured	Multipolar failure from lack of coordination	Video game	~3 people, announced August 2022
Externalized Reasoning Oversight	Deception	Get the reasoning of the AGI to happen in natural language, then oversee that reasoning	~1 person's project for a summer (though others are working on this approach)
FHI	Agent incentives / wireheading (?)	Causal model formalism to study incentives.	~3 people in the causal group / ~20 total?, FHI founded in 2005, Causal group founded in 2021
FAR	Many	Incubate new, scalable alignment research agendas, technical support for existing researchers	4 people on leadership but I'm guessing ~5 more engineers, announced July 2022
MIRI	Many including deception, the sharp left turn, corrigibility is anti-natural	Mathematical research to resolve fundamental confusion about the nature of goals/agency/optimization	11 research staff, founded in approximately 2005
Jacob Steinhardt	Distribution Shift	Conceptual alignment	Academic lab of 9 PhD students + Postdocs
OpenAI	Scalable oversight	RLHF / Recursive Reward Modeling, then automate alignment research	100 capabilities and 30 alignment researchers, founded December 2015.
Ought	Scalable oversight	Supervise process rather than outcomes + augment alignment researchers	10 employees, founded in ~2018
Redwood	Inner alignment (?)	Interpretability + Adversarial Training	12-15 research staff, started sometime before September 2021
Sam Bowman	LLM Outer Alignment	Creating datasets for evaluation + inverse scaling prize	Academic lab
Selection Theorems	Being able to robustly point at objects in the world	Selection Theorems based on natural abstractions	~2 FTE, started around August 2019
Team Shard	Instilling inner values from an outer training loop	Find patterns of values given by current RL setups and humans, then create quantitative rules to do this	~4-6 people, started Spring 2022
Truthful AI	Deception	Create standards and datasets to evaluate model truthfulness	~10 people, one research project

Type of approach	Mostly conceptual	Mixed	Mostly applied
Organization	MIRI, John Wentworth, ARC	Team Shard, CHAI, DeepMind	Conjecture, Encultured, OpenAI, Anthropic, Redwood, Ought
