Edit: Changed the title.

Or, why I no longer agree with the standard LW position on AI.

In a sense, this is sort of a weird post compared to what LW usually posts on AI.

A lot of this is going to depend on some posts that changed my worldview on AI risk, and they will be linked below:

Deceptive alignment skepticism sequence, especially the 2nd post in the sequence is here:


Evidence of the natural abstractions hypothesis in action:



Summary: The big update I made was that deceptive alignment is far less likely than I thought. Given that deceptive alignment was a big part of my model of how AI risk would happen (about 30-60% of my probability mass was on that failure mode), that takes a big enough bite out of the probability mass of extinction to make increasing AI capabilities positive in expected value. Combine this with the evidence that at least some form of the natural abstractions hypothesis is being borne out by empirical evidence, and I now think the probabilities of AI risk have steeply declined to only 0.1-10%, and all of that probability mass is plausibly reducible to ridiculously low numbers by going to the stars and speeding up technological progress.

In other words, I now believe a significant probability, on the order of 50-70%, that alignment is solved by default.

EDIT: While I explained why I increased my confidence in alignment by default in response to Shiminux, I now believe I was overconfident about the precise probabilities of alignment by default.

What implications does this have, if this rosy picture is correct?

The biggest implication is that technological progress looks vastly positive, compared to what most LWers and the general public think.

This also implies a purpose shift for Lesswrong. For arguably 20 years, the site was focused on AI risk, though it arguably exploded with LLMs and actual AI capabilities being released.

What it will shift to is important, but assuming that this rosy model of alignment is correct, then I'd argue a significant part of the field of AI Alignment should and can change purpose to something else.

As for Lesswrong, I'd say we should probably focus more on progress studies like Jason Crawford and inadequate equilibria and how to change them.

I welcome criticism and discussion of this post, due to its huge implications for LW.


I really wish there was an "agree/disagree" button for posts. I'd like to upvote this post (for epistemic virtue / presenting reasonable "contrarian" views and explaining why one holds them), but I also strongly disagree with the conclusions and suggested policies. (I ended up voting neither up nor down.)

EDIT: After reading Akash's comments, and re-reading the post more carefully: I largely agree with Akash (and updated towards thinking that my standards for "epistemic virtue" are/were too low).

I downvoted the post because I don't think it presents strong epistemics. Some specific critiques:

  • The author doesn't explain the reasoning that produced the updates. (They link to posts, but I don't think it's epistemically sound to link to posts and say "I made updates and you can find the reasons why in these posts." At best, people read the posts, and then come away thinking "huh, I wonder which of these specific claims/arguments were persuasive to the poster.")
  • The author recommends policy changes (to LW and the field of alignment) that (in my opinion) don't seem to follow from the claims presented. (The claim "LW and the alignment community should shift their focuses" does not follow from "there is a 50-70% chance of alignment by default". See comment for more).
  • The author doesn't explain their initial threat model, why it was dominated by deception, and why they're unconvinced by other models of risk & other threat models.

I do applaud the author for sharing the update and expressing an unpopular view. I also feel some pressure to not downvote it because I don't want to be "downvoting something just because I disagree with it", but I think in this case it really is the post itself. (I didn't downvote the linked post, for example).

How much have you looked into other sources of AI risk?

In other words, I now believe a significant probability, on the order of 90-99.9%, that alignment is solved by default.

I am on the fence here, and I wonder what specifically pushed you toward this extremely strong update?

Good question. The major reason I updated so strongly here is that once I realized that deceptive alignment was much more unlikely than I thought, I realized that I needed to heavily upweight the possibility of alignment by default, since deceptive alignment was my key variable for alignment not being solved by default.

The other update is that as AI capabilities increase, we can point to natural abstractions/categories more by default, which neutralizes the pointers problem, in that we can point our AI to what goal we actually want.

I have now edited the post to be somewhat less confident of the probability of alignment by default success.

KataGo misjudging the safety of certain groups seems like a pretty significant blow to the Natural Abstractions Hypothesis to me.

It seems that it takes something more to arrive at abstractions that align with intuitive human abstractions, given that even in this case, where the human version is unambiguously correct relative to what the AI was actually trained to do, and even though it reached superhuman abilities at Go, it still did not arrive at the same concept.

I don't doubt that a more powerful AI will be able to arrive at the correct concept in the case, given the unambiguous correctness of this abstraction at this level. But we already knew that was the easy case, where we can put the relevant criteria directly in the loss function. In the case where it actually matters, it needs to learn how to be Good even though we don't know how to put that directly in the loss function. And it failing at the easy version of this problem seems like significant evidence that it won't actually converge to the same sort of intuitive concepts we use in the harder cases where humans don't even converge all that well.

This is an interesting point, but it doesn't undermine the case that deceptive alignment is unlikely. Suppose that a model doesn't have the correct abstraction for the base goal, but its internal goal is the closest abstraction it has to the base goal. Because the model doesn't understand the correct abstraction, it can't instrumentally optimize for the correct abstraction rather than its flawed abstraction, so it can't be deceptively aligned. When it messes up due to having a flawed goal, that should push its abstraction closer to the correct abstraction. The model's goal will still point to that, and its alignment will improve. This should continue to happen until the base abstraction is correct. For more details, see my comment here

You're not wrong that this is problematic for the natural abstractions hypothesis, and this definitely suggests that my optimism on natural abstractions needs to be lowered.

However, this doesn't yet change my position that capabilities work is net positive, for 2 reasons:

  1. Deceptive alignment was arguably the main risk in that it posed a problem for iteration schemes, and if we remove that problem, a lot of other problems become iterable. In my model, pretty much all problems in AI safety that can be iterated away will be iterated away by default, so we have to focus on the problems that are not amenable to iteration, and right now I see the natural abstractions problem as quite iterable.

  2. We have reasons to suspect that the failure is a capabilities failure, in that convolutional neural nets implement something like a game of telephone, whereas as far as we know we don't have good reason to suspect other algorithms would have the failure mode. And since you already suggest how capabilities work solves the natural abstractions problem in the case of Go, then this implies that natural abstractions are an iterable problem.


Note that lesswrong is also about the philosophy of rationality, or making decisions for our human lives that have the highest probability of success.

While there are various crude hacks, the easiest way to know the highest probability action in your life is to ask an AI you can trust to analyze the situation and do the math.

For about 20 years such an AI was only theoretical with some effort to find what a possible algorithm might look like.

Lesswrong is also about surviving other risks, like plain old aging and being stuck on one planet - both things trustworthy AI makes solvable, and in reality, may have been the only feasible solution. Same with bypassing failed institutions.

Given your probabilities, I think you are underappreciating the magnitude of the downside risks.

I spent some time trying to dig into why we need to worry so much about downside risks of false positives (thinking we're going to get aligned AI when we're not) and how deep the problem goes, but most of the relevant bits of argumentation I would make to convince you to worry are right at the top of the post.

Your first link appears to be broken. Did you mean to link here? It looks like the last letter of the address got truncated somehow. If so, I'm glad you found it valuable!

For what it's worth, although I think deceptive alignment is very unlikely, I still think work on making AI more robustly beneficial and less risky is a better bet than accelerating capabilities. For example, my posts don't address these stories. There are also a lot of other concerns about potential downsides of AI that may not be existential, but are still very important. 

I edited it to include the correct link, thank you for asking.

My difference, and why I framed it as accelerating AI being good, comes down to the fact that in my view of AI risk, as well as most LWers' models of AI risk, deceptive alignment and, to a much lesser extent, the pointers problem are the dominant variables for how likely existential risk is. Given your post as well as some other posts, I had to conclude that much of my and LWers' pessimism over AI capabilities increases was pretty wrong.

Now a values point: I only stated that it was positive expected value to increase capabilities, not that all problems are gone. Not all problems are gone; nevertheless, arguably the biggest problem of AI was functionally a non-problem.

That makes sense. I misread the original post as arguing that capabilities research is better than safety work. I now realize that it just says capabilities research is net positive. That's definitely my mistake, sorry!

I strong upvoted your comment and post for modifying your views in a way that is locally unpopular when presented with new arguments. That's important and hard to do! 

Recalling the 3 subclaims of the Natural Abstraction Hypothesis which I will quote verbatim:

  1. Abstractability: for most physical systems, the information relevant “far away” can be represented by a summary much lower-dimensional than the system itself.
  2. Human-Compatibility: These summaries are the abstractions used by humans in day-to-day thought/language.
  3. Convergence: a wide variety of cognitive architectures learn and use approximately-the-same summaries.

I will note that despite the ordering I think claim 2 is the weakest. I strongly disagree that these claims being partially or completely correct means we can expect an AI not to be deceptive.

  • The Natural Abstraction Hypothesis is not a statement about systems converging to valuing certain abstractions the same, merely that they will use similar summaries of data in their decision making processes. Two opposing chess players may use completely identical abstractions in how they view the board (material advantage, king safety, etc.) but they have directly opposed goals.
  • Extending that last point, we realise that some understanding of human abstractions is a powerful tool for effective deceptive alignment. If there is a behaviour that is selected for when humans are unaware of it (say, reward hacking) but is strongly selected against when humans are aware of it, it is possible the AI will learn "humans dislike this" but that doesn't mean that it will "dislike this".

The point of the natural abstractions hypothesis is really a question of how far can we get using interpretability on AI without goodharting? And the natural abstractions hypothesis says that we can functionally interpret the abstractions that the AI is using, even at really high levels of capabilities.

My interpretation is very wrong in that case. Could you spell out the goodharting connection for me?

Obviously it's a broader question than what I said, but from an AI safety perspective, the value of the natural abstractions hypothesis, conditional on it being right at least partially, is the following:

  1. Interpretability becomes easier as we can get at least some guarantees about how they form abstractions.

  2. Given that they're lower dimensional summaries, there's a chance we can understand the abstractions the AI is using, even when they are alien to us.

As far as Goodhart: a scenario that could come up is that trying to make the model explain itself might instead push us towards the failure mode where we don't have any real understanding, just simple sounding summaries that don't reveal much of anything. The natural abstractions hypothesis says that by default, AIs will make themselves more interpretable as they are more capable, avoiding goodharting interpretability efforts.

That's a really clear explanation.

I was thinking of the general case of Goodharting and hadn't made the connection to Goodharting the explanations.

In other words, I now believe a significant probability, on the order of 50-70%, that alignment is solved by default.

Let's suppose that you are entirely right about deceptive alignment being unlikely. (So we'll set aside things like "what specific arguments caused you to update?" and tricky questions about modest epistemology/outside views).

I don't see how "alignment is solved by default with 50-70% probability" justifies claims like "capabilities progress is net positive" or "AI alignment should change purpose to something else."

If a doctor told me I had a disease that had a 50-70% chance to resolve on its own, otherwise it would kill me, I wouldn't go "oh okay, I should stop trying to fight the disease."

The stakes are also not symmetrical. Getting (aligned) AGI 1 year sooner is great, but it only leads to one extra year of flourishing. Getting unaligned AGI leads to a significant loss over the entire far-future. 

So even if we have a 50-70% chance of alignment by default, I don't see how your central conclusions follow.

I'll make another version of the thought experiment: suppose we can get a genetic upgrade that gives you +1000 utils with 70% probability, or -1000 utils with 30% probability.

Should you take it?

The answer is yes, in expectation, and it will give you +400 utils in expectation.

This is related to a general principle: As long as the probabilities of positive outcomes are over 50% and the costs and benefits are symmetrical, it is a good thing to do that activity.
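The arithmetic behind this principle can be checked with a short sketch. The function names here are illustrative, and the asymmetric payoff in the last line is a made-up number, included only to show how the break-even probability moves once the costs and benefits stop being symmetrical.

```python
def expected_utility(p_good: float, gain: float, loss: float) -> float:
    """Expected utility of a gamble: win `gain` with probability p_good,
    otherwise lose `loss` (given as a positive magnitude)."""
    return p_good * gain - (1.0 - p_good) * loss


def break_even_probability(gain: float, loss: float) -> float:
    """Smallest p_good at which the gamble stops being negative in expectation.
    Found by solving p * gain - (1 - p) * loss = 0 for p."""
    return loss / (gain + loss)


# The thought experiment above: +1000 utils at 70%, -1000 utils at 30%.
symmetric_ev = expected_utility(0.7, 1000.0, 1000.0)
print(f"expected utility: {symmetric_ev:.1f}")  # 400.0

# With symmetric stakes the threshold is exactly 50%.
print(break_even_probability(1000.0, 1000.0))  # 0.5

# Hypothetical asymmetric case (numbers invented for illustration):
# a loss ten times the gain pushes the threshold above 90%.
asymmetric_threshold = break_even_probability(1000.0, 10000.0)
print(f"asymmetric break-even: {asymmetric_threshold:.3f}")
```

The sketch only restates the principle: the 50% threshold depends on the symmetry assumption, which is the premise the surrounding discussion disagrees about.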

And my contention is that AGI/ASI is just a larger version of the thought experiment above. AGI/ASI is a symmetric technology wrt good and bad outcomes, so that's why it's okay to increase capabilities.

I now think the probabilities of AI risk have steeply declined to only 0.1-10%, and all of that probability mass is plausibly reducible to ridiculously low numbers by going to the stars and speeding up technological progress.

I think this is wrong (how does speeding up reduce risk? What do you want to speed up?). I'd actually be interested in the case for this that I was promised in the title.

Specifically, it's the fact that one of the most intractable problems, arguably the core reason why AI safety is so hard to iterate on, is likely a non-problem, and the fact that abstractions at least are interpretable and often converge to human abstractions is a good sign for the natural abstractions hypothesis. Thus, capabilities work shifts from being net-negative to net positive in expectation.

I will change the title to reflect that it's capabilities work that is net positive, and while increasing AI capabilities is one goal, other goals might be evident.

Thus, capabilities work shifts from being net-negative to net positive in expectation.

This feels too obvious to say, but I am not against ever building AGI. Because the stakes are so high and the incentives are aligned all wrong, I think speeding up is bad on the margin. I do see the selfish argument and understand not everyone would like to sacrifice themselves, their loved ones or anyone likely to die before AGI is around for the sake of humanity. Also, making AGI happen sooner is, on the margin, not good for taking over the galaxy, I think. (Somewhere in the EA forum is a good estimate for this. The basic argument is that space colonization is only O(n^2) or O(n^3), so very slow.)

Also if you are very concerned about yourself cryonics seems like the more prosocial version. Like 0.1-10% seems still kinda high for my personal risk preferences.

What kind of information would you look out for, that would make you change your mind about alignment-by-default?

What information would cause you to reverse again? What information would cause you to adjust 50% down? 25%?

I know that most probability mass is some measure of gutfeel, and I don't want to introduce nickel-and-diming, more get a feel for what information you're looking for.

A major way I could become more pessimistic is if the "deceptive alignment is 1% likely" post was wrong in several respects; if that happened, I'd probably revise my beliefs back to where they were originally, at 30-60%.

Another way I could be wrong is if evidence came out that there are more ways for models to be deceptive than the standard scenario of deceptive alignment.

Finally, if the natural abstractions hypothesis didn't hold, or if goodharting was empirically shown to get worse with model capabilities, then I'd update towards quite a bit more pessimism, on the order of at least 20-30% less confidence than I currently hold.

Thanks! Though, hm.

Now I'm noodling how one would measure goodharting.

Interesting take.

Perhaps there was something I misunderstood, but wouldn't AI alignment work and AI capabilities slowdown still have extreme positive expected value even if the probability of unaligned AI is only 0.1-10%?

Let's say the universe will exist for 15 billion more years until the big rip.

Let's say we could decrease the odds of unaligned AI by 1% by "waiting" 20 years longer before creating AGI. We would lose out on 20 years of extreme utility, which is roughly 0.00000013% of the total time (using time as an approximation of utility).

 On net we gain 15 billion * 0.01 - 20 * 0.99 ≈ 150 million years of utility.
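The estimate above can be reproduced in a few lines. This is only a sketch of the commenter's own back-of-the-envelope model, with years of existence standing in for utility; the constant and function names are illustrative.

```python
TOTAL_YEARS = 15e9      # assumed remaining lifetime of the universe, from the comment
DELAY_YEARS = 20        # hypothetical delay before creating AGI
RISK_REDUCTION = 0.01   # assumed drop in the odds of unaligned AI from waiting


def net_gain_years(total_years: float, delay_years: float, risk_reduction: float) -> float:
    """Net expected years of utility gained by delaying, per the formula above:
    the extra chance of keeping the whole future, minus the delayed years
    weighted by the probability that the delay changed nothing."""
    gained = total_years * risk_reduction
    lost = delay_years * (1.0 - risk_reduction)
    return gained - lost


net = net_gain_years(TOTAL_YEARS, DELAY_YEARS, RISK_REDUCTION)
print(f"net gain: about {net / 1e6:.0f} million years")  # about 150 million years
```

The result is dominated by the first term: the 19.8 years lost to the delay are negligible next to 1% of 15 billion years, so the conclusion depends almost entirely on the assumed risk reduction and the choice of time as a utility proxy.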

I do agree that if we start 20 years earlier, we could possibly also populate a little bit more of space, but that should be several orders of magnitudes smaller difference than 1%.

I'm genuinely curious to hear your thoughts on this.


You're choosing a certain death for 32% of currently living humans.  Or at least, the humans alive after [some medical research interval] at the time the AGI delay decision is made.  

The [medical research interval] is the time it requires, with massively parallel research, for a network of AGI systems to learn which medical interventions will prevent most forms of human death, from injury and aging. The economic motivation for a company to research this is obvious.

Delaying AGI is choosing to shift the time until [humans start living their MTBF given perfect bodies and only accidents and murder, which is thousands of years], 20 years into the future.

Note also that cryonics could be made to work, with clear and convincing evidence including revival of lab mammals, probably within a short time.  That [research interval until working cryo] might be months.  

Personally as a member of that subgroup, the improvement in odds ratio for misaligned AI for that 20 year period would need to be greater than 32%, or it isn't worth it.  Or essentially you'd have to show pDoom really was almost 1.0 to justify such a long delay.

Basically you would have to build AGIs and show they all inherently collaborate with each other to kill us by default.  Too few people are convinced by EY, even if he is correct.

There's another issue though, in that the benefits of AGI coming soon aren't considered by the top comment on this thread, and assuming a symmetric or nearly symmetric structure of how much utility it produces, my own values suggest that the positives of AGI outweigh the potential for extinction, especially over longer periods, which is why I have said that capabilities work is net positive.


Also how would you agree on a 20 year delay?

That would have been like, post WW2, a worldwide agreement not to build nukes. "Suuurrre" all the parties would say. "A weapon that lets us win even if outnumbered, we don't need THAT".

And they would basically all defect on the agreement. The weapon is too powerful. Or one side would honor it and be facing neighbors with nuclear arms and none of their own.

Okay, so seems like our disagreement comes down to two different factors:

  1. We have different value functions. I personally don't value currently living humans much more than future living humans, but I agree with the reasoning that to maximize your personal chance of living forever, faster AI is better.

  2. Getting AGI sooner will have much greater positive benefits than simply 20 years of peak happiness for everyone; for example, over billions of years the cumulative effect will be greater than the value from a few hundred thousand years of AGI.

Further, I find the idea of everyone agreeing to delay AGI by 20 years to be as absurd as you suggest, Gerald. I just thought it could be a helpful hypothetical scenario for discussing the subject.

Thanks for amplifying the post that caused your large update. It's pretty fascinating. I haven't thought through it enough yet to know if I find it as compelling as you do.

Let me try to reproduce the argument of that post to see if I've got it:

If an agent already understands the world well (e.g., by extensive predictive training) before you start aligning it (e.g., with RLHF), then the alignment should be easier. The rewards you're giving are probably attaching to the representations of the world that you want them to, because you and the model share a very similar model of the world.

In addition, you really need long term goals for deceptive alignment to happen. Those aren't present in current models, and there's no obvious way for models to develop them if their training signals are local in time.

I agree with both of these points, and I think they're really important - if we make systems with both of those properties, I think they'll be safe.

I'm not sure that AGI will be trained such that it knows a lot before alignment starts. Nor am I sure that it won't have long term goals. I think it will. Let alone ASI.

But I think that tool AI might well continue to be trained that way. And that will give us a little longer to work on alignment. But we aren't likely to stop with tool AI, even if it is enough to transform the world for the better.

What it will shift to is important, but assuming that this rosy model of alignment is correct, then I'd argue a significant part of the field of AI Alignment should and can change purpose to something else.

Even if your forecasting is correct, AI alignment is still so pivotal that even the difference between 1% and 0.1% matters. At most, your post implies that alignment researchers should favor accelerating AI instead of decelerating it.