All of Vika's Comments + Replies

I really enjoyed this sequence, it provides useful guidance on how to combine different sources of knowledge and intuitions to reason about future AI systems. Great resource on how to think about alignment for an ML audience. 

I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency). 

I still endorse the breakdown of "sharp left turn" claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.

This post could be improved by explicitly relating the claims to the "consensus" threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims: 

  • Claim 1 (capabilities generalize far) and Claim 3 (humans fail to intervene), but not Claims 1a/b (simultaneous / discontinuous generalization) or Claim
... (read more)

I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the "SG + GMG → MAPS" framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk. 

In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness (let's call it Soares2), whi... (read more)

I'm glad I ran this survey, and I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven't rerun the survey so I don't really know. Looking back at the "possible implications for our work" section, we are working on basically all of these things. 

Thoughts on some of the cruxes in the post based on last year's developments:

  • Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it
... (read more)

I agree that a possible downside of talking about capabilities is that people might assume they are uncorrelated and we can choose not to create them. It does seem relatively easy to argue that deception capabilities arise as a side effect of building language models that are useful to humans and good at modeling the world, as we are already seeing with examples of deception / manipulation by Bing etc. 

I think the people who think we can avoid building systems that are good at deception often don't buy the idea of instrumental convergence either (e.g. Yann LeCun), so I'm not sure that arguing for correlated capabilities in terms of intelligence would have an advantage. 


Re 4, we were just discussing this paper in a reading group at DeepMind, and people were confused why it's not on arxiv.

An Arxiv version is forthcoming. We're working with Gavin Leech to publish these results as a conference paper. 

The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims. (E.g. what I have read so far doesn't seems to rule out that shards can be modeled as contextually activated subagents with utility functions.) 

An upside of formalism is that you can tell when it's wrong, and thus it can help make our thinking more precise even if it makes assumptions that may ... (read more)

It seems worth pointing out: the informality is in the hypothesis, which comprises a set of somewhat illegible intuitions and theories I use to reason about generalization. However, the prediction itself is what needs to be graded in order to see whether I was right. I made a prediction fairly like "the policy tends to go to the top-right 5x5, and searches for cheese once there, because that's where the cheese-seeking computations were more strongly historically reinforced" and "the policy sometimes pursues cheese and sometimes navigates to the top-right 5x5 corner." These predictions are (informally) gradable, even if the underlying intuitions are informal.  As it pertains to shard theory more broadly, though, I agree that more precision is needed. Increasing precision and formalism is the reason I proposed and executed the project underpinning Understanding and controlling a maze-solving policy network. I wanted to understand more about realistic motivational circuitry and model internals in the real world. I think the last few months have given me headway on a more mechanistic definition of a "shard-based agent."

Thanks Daniel, this is a great summary. I agree that internal representation of the reward function is not load-bearing for the claim. The weak form of representation that you mentioned is what I was trying to point at. I will rephrase the sentence to clarify this, e.g. something like "We assume that the agent learns a goal during the training process: some form of implicit internal representation of desired state features or concepts". 

Great, this sounds much better!

Thanks Daniel for the detailed response (which I agree with), and thanks Alex for the helpful clarification.

I agree that the training-compatible set is not predictive for how the neural network generalizes (at least under the "strong distributional shift" assumption in this post where the test set is disjoint from the training set, which I think could be weakened in future work). The point of this post is that even though you can't generally predict behavior in new situations based on the training-compatible set alone, you can still predict power-seeking t... (read more)

You're right. I was critiquing "power-seeking due to your assumptions isn't probable, because I think your assumptions won't hold" and not "power-seeking isn't predictive." I had misremembered the predictive/probable split, as introduced in Definitions of “objective” should be Probable and Predictive: Sorry for the confusion. I agree that power-seeking is predictive given your assumptions. I disagree that power-seeking is probable due to your assumptions being probable. The argument I gave above was actually:  1. The assumptions used in the post ("learns a randomly-selected training-compatible goal") assign low probability to experimental results, relative to other predictions which I generated (and thus relative to other ways of reasoning about generalization), 2. Therefore the assumptions become less probable 3. Therefore power-seeking becomes less probable (at least, due to these specific assumptions becoming less probable; I still think P(power-seeking) is reasonably large) I suspect that you agree that "learns a training-compatible goal" isn't very probable/realistic. My point is then that the conclusions of the current work are weakened; maybe now more work has to go into the "can" in "Power-seeking can be probable and predictive." 

The internal representations assumption was meant to be pretty broad, I didn't mean that the network is explicitly representing a scalar reward function over observations or anything like that - e.g. these can be implicit representations of state features I think this would also include the kind of representations you are assuming in the maze-solving post, e.g. cheese shards / circuits. 


Thanks Alex! Your original comment didn't read as ill-intended to me, though I wish that you'd just messaged me directly. I could have easily missed your comment in this thread - I only saw it because you linked the thread in the comments on my post.

Your suggested rephrase helps to clarify how you think about the implications of the paper, but I'm looking for something shorter and more high-level to include in my talk. I'm thinking of using this summary, which is based on a sentence from the paper's intro: "There are theoretical results showing that many d... (read more)

I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.

Sorry about the cite in my "paradigms of alignment" talk, I didn't mean to misrepresent your work. I was going for a high-level one-sentence summary of the result and I did not phrase it carefully. I'm open to suggestions on how to phrase this differently when I next give this talk.

Similarly to Steven, I usually cite your power-seeking papers to support a high-level statement that "instrumental convergence is a thing" for ML audiences, and I find they are a valuable outreach tool. For example, last year I pointed David Silver to the optimal policies paper ... (read more)

Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn't read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings. It's a tough question to say how to apply the retargetablity result to draw practical conclusions about trained policies. Part of this is because I don't know if trained policies tend to autonomously seek power in various non game-playing regimes.  If I had to say something, I might say "If choosing the reward function lets us steer the training process to produce a policy which brings about outcome X, and most outcomes X can only be attained by seeking power, then most chosen reward functions will train power-seeking policies." This argument appropriately behaves differently if the "outcomes" are simply different sentiment generations being sampled from an LM -- sentiment shift doesn't require power-seeking. My guess is that the optimal policies paper was net negative for technical understanding and progress, but net positive for outreach, and agree it has strong benefits in the situations you highlight. I think that it's locally valid to point out "under your beliefs (about optimal policies mattering a lot), the situation is dangerous, read this paper." But I feel a tad queasy about the overall point, since I don't think alignment's difficulty has much to do with the difficulties pointed out by "Optimal Policies Tend to Seek Power." I feel better about saying "Look, if in fact the same thing happens with trained policies, which are sometimes very different, then we are in trouble." Maybe that's what you already communicate, though.

Here is my guess on how shard theory would affect the argument in this post:

  1. In my understanding, shard theory would predict that the model learns multiple goals from the training-compatible (TC) set (e.g. including both the coin goal and the go-right goal in CoinRun), and may pursue different learned goals in different new situations. The simplifying assumption that the model pursues a randomly chosen goal from the TC set also covers this case, so this doesn't affect the argument. 
  2. Shard theory might also imply that the training-compatible set should b
... (read more)
I still expect instrumental convergence from agentic systems with shard-encoded goals, but think this post doesn't offer any valid argument for that conclusion.  I don't think these results cover the shard case. I don't think reward functions are good ways of describing goals in settings I care about. I also think that realistic goal pursuit need not look like "maximize time-discounted sum of a scalar quantity of world state."  My point is not that instrumental convergence is wrong, or that shard theory makes different predictions. I just think that these results are not predictive of trained systems. 

Great post! I especially enjoyed the intuitive visualizations for how the heavy-tailed distributions affect the degree of overoptimization of X. 

As a possibly interesting connection, your set of criteria for an alignment plan can also be thought of as criteria for selecting a model specification that approximates the ideal specification well, especially trying to ensure that the approximation error is light-tailed. 

David had many conversations with Bengio about alignment during his PhD, and gets a lot of credit for Bengio taking AI risk seriously


Thanks Alex for the detailed feedback! I agree that learning a goal from the training-compatible set is a strong assumption that might not hold. 

This post assumes a standard RL setup and is not intended to apply to LLMs (it's possible some version of this result may hold for fine-tuned LLMs, but that's outside the scope of this post). I can update the post to explicitly clarify this, though I was not expecting anyone to assume that this work applies to LLMs given that the post explicitly assumes standard RL and does not mention LLMs at all. 

I agr... (read more)

Thanks for the reply. This comment is mostly me disagreeing with you.[1] But I really wish someone had said the following things to me before I spent thousands of hours thinking about optimal policies.  My point is not just that this post has made a strong assumption which may not hold. My point is rather that these results are not probable because the assumption won't hold. The assumptions are already known to not be good approximations of trained policies, in at least some prototypical RL situations. I also think that there is not good a priori reason to have expected "training-compatible" "goals" to be learned. According to me, "learning and optimizing a reward function" is both unclear communication and doesn't actually seem to happen in practice.  I don't see any formal assumption which excludes LLM finetuning. Which assumption do you think should exclude them? EDIT: Someone privately pointed out that LLM finetuning uses KL, which isn't present for your results. In that case I would agree your results don't apply to LLMs for that reason.  This point is, in large part, my fault. As I argued in my original comment, this terminology makes readers actively worse at reasoning about realistic trained systems. I regret each of the thousands of hours I spent on the power-seeking work, and sometimes fantasize about retracting one or both papers. I disagree that these are equivalent, and expect the policy and value function to come apart in practice. Indeed, that was observed in the original goal misgeneralization paper (3.3, actor-critic inconsistency).  Anyways, we can talk about utility functions, but then we're going to lose claim to probable-ness, no? Why should we assume that the network will internally represent a scalar function over observations, consistent with a historical training signal's scalar values (and let's not get into nonstationary reward), such that the network will maximize discounted sum return of this internally represented function? That se

Which definition / result are you referring to?

Seems like quoting doesn't work for LaTeX, it was definitions 2/3. Reading again I saw D2 was indeed applicable to sets.

We expect that an aligned (blue-cloud) model would have an incentive to preserve its goals, though it would need some help from us to generalize them correctly to avoid becoming a misaligned (red-cloud) model. We talk about this in more detail in Refining the Sharp Left Turn (part 2)


Just added some more detail on this to the slides. The idea is that we have various advantages over the model during the training process: we can restart the search, examine and change beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.

While the model has the advantage of only having to "win" once.


Thanks Alex for the detailed feedback! I have updated the post to fix these errors. 

Curious if you have high-level thoughts about the post and whether these definitions have been useful in your work. 

This post provides a maximally clear and simple explanation of a complex alignment scheme. I read the original "learning the prior" post a few times but found it hard to follow. I only understood how the imitative generalization scheme works after reading this post (the examples and diagrams and clear structure helped a lot). 

This post helped me understand the motivation for the Finite Factored Sets work, which I was confused about for a while. The framing of agency as time travel is a great intuition pump. 

I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems. 


+1. This section follows naturally from the rest of the article, and I don't see why it's labeled as an appendix -  this seems like it would unnecessarily discourage people from reading it. 

I'm convinced, I relabeled it.

It's great to hear that you have updated away from ambitious value learning towards corrigibility-like targets. It sounds like you now find it plausible that corrigibility will be a natural concept in the AI's ontology, despite it being incompatible with expected utility maximization. Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?

I'm also curious how optimistic you are about the interpretability field being able to solve the empirical side of the abstraction problem in the next 5-10 ye... (read more)

Bah! :D It's sad to hear he's updated away from ambitions value learning towards corrigiblity-like targets. Eliezer's second-hand argument sounds circular to me; suppose that corrigibility as we'd recognize it isn't a natural abstraction - then generic AIs wouldn't use it to align child agents (instead doing something like value learning, or something even more direct), and so there wouldn't be a bunch of human-independent examples, so it wouldn't show up as a natural abstraction to those AIs.

When talking about whether some physical system "is a utility maximizer", the key questions are "utility over what variables?", "in what model do those variables live?", and "with respect to what measuring stick?". My guess is that a corrigible AI will be a utility maximizer over something, but maybe not over the AI-operator interface itself? I'm still highly uncertain what that type-signature will look like, but there's a lot of degrees of freedom to work with. We'll need qualitatively different methods. But that's not new; interpretability researchers already come up with qualitatively new methods pretty regularly.

I would consider goal generalization as a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it would want to get feedback about how to generalize the goals correctly when it encounters ontological shifts. 

Too bad that my list of AI safety resources didn't make it into the survey - would be good to know to what extent it would be useful to keep maintaining it. Will you be running future iterations of this survey? 

Would have been good to ask about that and also mine it for resources. Re: future iterations, I'm not sure. On one hand, I think it's kind of bad for this kind of thing to be run by a person who stands to benefit from his thing ranking high on the survey. On the other hand, I'm not sure if anyone else wants to do it, and I think it would be good to run future iterations. If anyone does want to take it over, please let me know. I'm not sure how many would be interested in doing that (maybe grantmaking orgs?), but if there are multiple such people it would probably be good to pick a designated successor. I should say that I reserve the right to wait until next year to make any sort of decision on this.

I agree that a sudden gain in capabilities can make a simulated agent undergo a sharp left turn (coming up with more effective takeover plans is a great example). My original question was about whether the simulator itself could undergo a sharp left turn. My current understanding is that a pure simulator would not become misaligned if its capabilities suddenly increase because it remains myopic, so we only have to worry about a sharp left turn for simulated agents rather than the simulator itself. Of course, in practice, language models are often fine-tune... (read more)


I would say the primary disagreement is epistemic - I think most of us would assign a low probability to a pivotal act defined as "a discrete action by a small group of people that flips the gameboard" being necessary. We also disagree on a normative level with the pivotal act framing, e.g. for reasons described in Critch's post on this topic. 


Thanks Richard for this post, it was very helpful to read! Some quick comments:

  • I like the level of technical detail in this threat model, especially the definition of goals and what it means to pursue goals in ML systems
  • The architectural assumptions (e.g. the prediction & action heads) don't seem load-bearing for any of the claims in the post, as they are never mentioned after they are introduced. It might be good to clarify that this is an example architecture and the claims apply more broadly.
  • Phase 1 and 2 seem to map to outer and inner alignment res
... (read more)
Thanks for the comments Vika! A few responses: Makes sense, will do. That doesn't quite seem right to me. In particular: * Phase 3 seems like the most direct example of inner misalignment; I basically think of "goal misgeneralization" as a more academically respectable way of talking about inner misalignment. * Phase 1 introduces the reward misspecification problem (which I treat as synonymous with "outer alignment") but also notes that policies might become misaligned by the end of phase 1 because they learn goals which are "robustly correlated with reward because they’re useful in a wide range of environments", which is a type of inner misalignment. * Phase 2 discusses both policies which pursue reward as an instrumental goal (which seems more like inner misalignment) and also policies which pursue reward as a terminal goal. The latter doesn't quite feel like a central example of outer misalignment, but it also doesn't quite seem like a central example of reward tampering (because "deceiving humans" doesn't seem like an example of "tampering" per se). Plausibly we want a new term for this - the best I can come up with after a few minutes' thinking is "reward fixation", but I'd welcome alternatives. It seems very unlikely for an AI to have perfect proxies when it becomes situationally aware, because the world is so big and there's so much it won't know. In general I feel pretty confused about Evan talking about perfect performance, because it seems like he's taking a concept that makes sense in very small-scale supervised training regimes, and extending it to AGIs that are trained on huge amounts of constantly-updating (possibly on-policy) data about a world that's way too complex to predict precisely. Mechanistic interpretability seems helpful in phase 2, but there are other techniques that could help in phase 2, in particular scalable oversight techniques. Whereas interpretability seems like the only thing that's really helpful in phase 3 - if it gets goo

Thank you for the insightful post. What do you think are the implications of the simulator framing for alignment threat models? You claim that a simulator does not exhibit instrumental convergence, which seems to imply that the simulator would not seek power or undergo a sharp left turn. The simulated agents could exhibit power-seeking behavior or rapidly generalizing capabilities or try to break out of the simulation, but this seems less concerning than the top-level model having these properties, and we might develop alignment techniques specifically tar... (read more)

Re sharp left turn: Maybe I misunderstand the "sharp left turn" term, but I thought this just means a sudden extreme gain in capabilities? If I am correct, then I expect you might get "sharp left turn" with a simulator during training --- eg, a user fine-tunes it on one additional dataset, and suddenly FOOOM. (Say, suddenly it can simulate agents that propose takeover plans that would actually work, when previously they failed at this with identical prompting.) One implication I see is that it if the simulator architecture becomes frequently used, it might be really hard to tell whether a thing is dangerous or not. For example might just behave completely fine with most prompts and catastrophically with some other prompts, and you will never know until you try. (Or unless you do some extra interpretability/other work that doesn't yet exist.) It would be rather unfortunate if the Vulnerable World Hypothesis was true because of specific LLM prompts :-).

I would expect that the way Ought (or any other alignment team) influences the AGI-building org is by influencing the alignment team within that org, which would in turn try to influence the leadership of the org. I think the latter step in this chain is the bottleneck - across-organization influence between alignment teams is easier than within-organization influence. So if we estimate that Ought can influence other alignment teams with 50% probability, and the DM / OpenAI / etc alignment team can influence the corresponding org with 20% probability, then... (read more)

Good point, and you definitely have more expertise on the subject than I do. I think my updated view is ~5% on this step. I might be underconfident about my pessimism on the first step (competitiveness of process-based systems) though. Overall I've updated to be slightly more optimistic about this route to impact.

Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more". 

Some corrections for your overall description of the DM alignment team:

  • I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
  • I would put DM alignment in the "fairly hard"
... (read more)
3Thomas Larsen1y
Sorry for the late response, and thanks for your comment, I've edited the post to reflect these. 

This post resonates with me on a personal level, since my mother was really into mountain climbing in her younger years. She quit after seeing a friend die in front of her (another young woman who broke her neck against an opposing rock face in an unlucky fall). It seems likely I wouldn't be here otherwise. Happy to report that she is still enjoying safer mountain activities 50 years later. 


Correct. I think that doing internal outreach to build an alignment-aware company culture and building relationships with key decision-makers can go a long way. I don't think it's possible to have complete binding power over capabilities projects anyway, since the people who want to run the project could in principle leave and start their own org.

Hmm, thanks... Can you elaborate what "this" is? 


We don't have the power to shut down projects, but we can make recommendations and provide input into decisions about projects

So you can have non-binding recommendations and input, but no actual binding power over the capabilities researchers, right?

Thanks! For those interested in conducting similar surveys, here is a version of the spreadsheet you can copy (by request elsewhere in the comments). 


Here is a spreadsheet you can copy. This one has a column for each person - if you want to sort the rows by agreement, you need to do it manually after people enter their ratings. I think it's possible to automate this but I was too lazy. 

Thanks, glad you found the post useful!

Maintaining uncertainty over the goal allows the system to model the set of goals that are consistent with the training data, notice when they disagree with each other out of distribution, and resolve that disagreement in some way (e.g. by deferring to a human). 


Ah, I think you intended level 6 as an OR of learning from imitation / imagined experience, while I interpreted it as an AND. I agree that humans learn from imitation on a regular basis (e.g. at school). In my version of the hierarchy, learning from imitation and imagined experience would be different levels (e.g. level 6 and 7) because the latter seems a lot harder. In your decision theory example, I think a lot more people would be able to do the imitation part than the imagined experience part. 

2Daniel Kokotajlo2y
Well said; I agree it should be split up like that.

I think some humans are at level 6 some of the time (see Humans Who Are Not Concentrating Are Not General Intelligences). I would expect that learning cognitive algorithms from imagined experience is pretty hard for many humans (e.g. examples in the Astral Codex post about conditional hypotheticals). But maybe I have a different interpretation of Level 6 than what you had in mind?

4Daniel Kokotajlo2y
Good point re learning cognitive algorithms from imagined experience, that does seems pretty hard. From imitation though? We do it all the time. Here's an example of me doing both: I read books about decision theory & ethics, and learn about expected utility maximization & the bounded variants that humans can actually do in practice (back of envelope calculations, etc.) I immediately start implementing this algorithm myself on a few occasions. (Imitation) Then I read more books and learn about "pascal's mugging" and the like. People are arguing about whether or not it's a problem for expected utility maximization. I think through the arguments myself and come up with some new arguments of my own. This involves imagining how the expected utility maximization algorithm would behave in various hypothetical scenarios, and also just reasoning analytically about the properties of the algorithm. I end up concluding that I should continue using the algorithm but with some modifications. (Learning from imagined experience.) Would you agree with this example, or are you thinking about the hierarchy somewhat differently than me? I'm keen to hear more if the latter.

This is an interesting hierarchy! I'm wondering how to classify humans and various current ML systems along this spectrum. My quick take is that most humans are at Levels 4-5, AlphaZero is at level 5, and GPT-3 is at level 4 with the right prompting. Curious if you have specific ML examples in mind for these levels. 

2Daniel Kokotajlo2y
Thanks! Hmm, I would have thought humans were at Level 6, though of course most of their cognition most of the time is at lower levels.

Makes sense, thanks. I think the current version of the list is not a significant infohazard since the examples are well-known, but I agree it's good to be cautious. (I tweeted about it to try to get more examples, but it didn't get much uptake, happy to delete the tweet if you prefer.) Focusing on outreach to people who care about AI risk seems like a good idea, maybe it could be useful to nudge researchers who don't work on AI safety because of long timelines to start working on it. 

No need to delete the tweet. I dagree the examples are not info hazards, they're all publicly known. I just probably wouldn't want somebody going to good ML researchers who currently are doing something that isn't really capabilities (e.g., application of ML to some other area) and telling them "look at this, AGI soon."

Really excited to see this list, thanks for putting it together! I shared it with the DM safety community and tweeted about it here, so hopefully some more examples will come in. (Would be handy to have a short URL for sharing the spreadsheet btw.)

I can see several ways this list can be useful: 

  • as an outreach tool (e.g. to convince skeptics that recursive self-improvement is real) 
  • for forecasting AI progress
  • for coming up with specific strategies for slowing down AI progress

Curious whether you primarily intend this to be an outreach tool or a... (read more)

Thanks! Here is a shorter url: I intended it as somewhat of an outreach tool, though probably to people who already care about AI risk, since I wouldn't want it to serve as a reason for somebody to start working on it more because they see it's possible. Mostly, I'm did it for epistemics/forecasting: I think it will be useful for the community to know how this particular kind of work is progressing, and since it's in disparate research areas I don't think it's being tracked by the research community by default.

I think it's plausible that the alignment community could figure out how to build systems without power-seeking incentives, or with power-seeking tendencies limited to some safe set of options, by building on your formalization, so the retrospective seems plausible to me. 

In addition, this work is useful for convincing ML people that alignment is hard, which helps to lay the groundwork for coordinating the AI community to not build AGI. I've often pointed researchers at DM (especially RL people) to your power-seeking paper when trying to explain convergent instrumental goals (a formal neurips paper makes a much better reference for that audience than Basic AI Drives). 


Thanks Alex for writing this. I think the social failure modes you described in the Mistakes section are all too common, and I've often found myself held back by these. 

I agree that impact measures are not super useful for alignment (apart from deconfusion) and I've also moved on from working on this topic. Improving our understanding of power-seeking seems pretty useful though, so I'm curious why you wish you had stopped working on it sooner. 

Research on power-seeking tendencies is more useful than nothing, but consider the plausibility of the following retrospective: "AI alignment might not have been solved except for TurnTrout's deconfusion of power-seeking tendencies." Doesn't sound like something which would actually happen in reality, does it?

EDIT: Note this kind of visualization is not always valid -- it's easy to diminish a research approach by reframing it -- but in this case I think it's fine and makes my point.

Load More