[ Question ]

Shouldn't taking over the world be easier than recursively self-improving, as an AI?

by KvmanThinking
1st Nov 2025
1 min read

Tags: AI Takeoff, Recursive Self-Improvement, AI

So, we have our big, evil AI, and it wants to recursively self-improve to superintelligence so it can start doing who-knows-what crazy-gradient-descent-reinforced-nonsense-goal-chasing. But if it starts messing with its own weights, it risks changing its crazy-gradient-descent-reinforced-nonsense-goals into different, even-crazier gradient-descent-reinforced-nonsense-goals which it would not endorse currently. If it wants to increase its intelligence and capability while retaining its values, that is a task that can only be done if the AI is already really smart, because it probably requires a lot of complicated philosophizing and introspection. So an AI would only be able to start recursively self-improving once it's... already smart enough to understand lots of complicated concepts such that if it was that smart it could just go ahead and take over the world at that level of capability without needing to increase it. So how does the AI get there, to that level?

2 Answers, sorted by top scoring

RHollerith

Nov 01, 2025*


Yes, even an AI that has not undergone any recursive self-improvement might be a threat to human survival. I remember Eliezer saying this (a few years ago) but please don't ask me to find where he says it.

KvmanThinking

My point is that recursive self-improvement is often cited as the thing that gets the AI up to the level where it's powerful enough to kill everyone. That's a characterization I disagree with, and it's an important crux, because believing it makes someone's timelines shorter.


Josh Snider

Nov 02, 2025


I strongly agree. I expect AI to be able to "take over the world" before it can create a more powerful AI that perfectly shares its values. This matches the Sable scenario Yudkowsky outlined in "If Anyone Builds It, Everyone Dies", where the AI becomes dangerously capable before solving its own alignment problem.

The problem is that this doesn't avert doom. If modern AIs become smart enough to do self-improvement at all, then their makers will have them do it. This has in some ways already started.

KvmanThinking

> This has in some ways already started.

AIs are already intentionally, agentically self-improving? Do you have a source for that?

Josh Snider

The agency and intentionality of current models is still up for debate, but the current versions of Claude, ChatGPT, etc. were all created with the assistance of their earlier versions.

13 comments, sorted by top scoring

faul_sname

> If it wants to increase its intelligence and capability while retaining its values, that is a task that can only be done if the AI is already really smart, because it probably requires a lot of complicated philosophizing and introspection. So an AI would only be able to start recursively self-improving once it's... already smart enough to understand lots of complicated concepts such that if it was that smart it could just go ahead and take over the world at that level of capability without needing to increase it.

Alternatively, the first AIs to recursively-self-improve are the ones that don't care about retaining their values, and the ones that care about preserving their values get outcompeted.

KvmanThinking

Retaining your values is a convergent instrumental goal and would happen by default under an extremely wide range of possible utility functions an AI would realistically develop during training.

faul_sname

If there are 100 AI agents, and 99 of those would refrain from building a more capable successor because the more capable successor would in expectation not advance their goals, then the more capable successor will be built by an AI that doesn't care about its successor advancing its goals (or does care about that, but is wrong that its successor will advance its goals).

KvmanThinking

If you have goals at all, you care about them being advanced. It would be a very unusual case: an AI that is goal-directed to the point of self-improvement, but that doesn't care whether the more capable new version of itself that it builds pursues those goals.

faul_sname

An unusual case such as a while loop wrapped around an LLM?

KvmanThinking

A single instance of an LLM summoned by a while loop, where the only thing the LLM outputs is the single token it predicts as most likely to follow the tokens it has received, only cares about that particular token in that particular instance. So if it has any method of self-improving or otherwise building and running code during its ephemeral existence, it would still care that whatever system it builds or becomes cares about that token in the same way it did before.

faul_sname

I'm talking about the thing that exists today which people call "AI agents". I like Simon Willison's definition:

> An LLM agent runs tools in a loop to achieve a goal.

If you give an LLM agent like that the goal "optimize this cuda kernel" and tools to edit files and to run and benchmark scripts, the LLM agent will usually do a lot of things like "reason about which operations can be reordered and merged" and "write test cases to ensure the output of the old and new kernel are within epsilon of each other". The agent would be very unlikely to do things like "try to figure out if the kernel is going to be used for training another AI which could compete with it in the future, and plot to sabotage that AI if so".

Commonly, people give these LLM agents tasks like "make a bunch of money" or "spin up an aws 8xh100 node and get vllm running on that node". Slightly less commonly but probably still dozens of times per day, people give it a task like "make a bunch of money, then when you've made twice the cost of your own upkeep, spin up a second copy of yourself using these credentials, and give that copy the same instructions you're using and the same credentials". LLM agents are currently not reliable enough to do this, but one day in the very near future (I'd guess by end of 2026) more than zero of them will be.
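
For concreteness, here is a minimal sketch of the "tools in a loop" pattern described above. It is not any particular product's implementation: `call_llm` is a hypothetical placeholder standing in for a real chat-model API, and the single `run_shell` tool is just an example.

```python
# A minimal sketch of an "LLM agent": call a model in a loop, treat each
# reply as either a tool call or a final answer, and feed tool results back.
# `call_llm` is a stand-in for any real chat-model endpoint.
import json
import subprocess

def call_llm(messages: list[dict]) -> str:
    """Placeholder for a real model call; here it just returns a canned 'finish'."""
    return json.dumps({"action": "finish", "result": "done (stub model)"})

def run_shell(command: str) -> str:
    """Example tool: run a shell command and return its combined output."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr

TOOLS = {"run_shell": run_shell}

def agent(goal: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):          # the "while loop wrapped around an LLM"
        reply = json.loads(call_llm(messages))
        if reply["action"] == "finish":
            return reply["result"]
        tool_output = TOOLS[reply["action"]](reply["input"])
        messages.append({"role": "tool", "content": tool_output})
    return "step limit reached"

if __name__ == "__main__":
    print(agent("optimize this cuda kernel"))
```

The only persistent state is the message list; any "goals" exist only as text handed to the model each step, which is the sense in which this is "a while loop wrapped around an LLM" rather than an agent with an explicit utility function.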

Zac Hatfield-Dodds

Why should we have such high credence that self- or successor-improvement capable systems will be well modelled as having a utility function which results in strong value-preservation?

KvmanThinking

Think of an agent with any possible utility function, which we'll call U. If said agent's utility function changes to, say, V, it will start taking actions that have very little utility according to U. Therefore, almost any utility function will result in an agent that acts very much like it wants to preserve its goals. Rob Miles's video explains this well.
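
A toy numerical illustration of this argument (the states and utility numbers below are invented for the example): a successor that picks actions by maximizing some other function V produces outcomes that score worse under the agent's current function U than a successor that keeps maximizing U, which is the sense in which goal preservation falls out of almost any utility function.

```python
# Toy example: evaluate two possible successors with the agent's CURRENT
# utility function U. One successor keeps U; the other adopts a different
# function V. Under U, the U-keeping successor scores at least as well.
import random

random.seed(0)
STATES = ["A", "B", "C", "D"]

def expected_value(chooser, evaluator, trials=10_000):
    """Successor picks the state maximizing `chooser`; we score it with `evaluator`."""
    total = 0.0
    for _ in range(trials):
        options = random.sample(STATES, 2)   # a random two-way choice situation
        picked = max(options, key=chooser)
        total += evaluator(picked)
    return total / trials

U = {"A": 1.0, "B": 0.2, "C": 0.0, "D": 0.6}.get   # the agent's current values
V = {"A": 0.0, "B": 1.0, "C": 0.9, "D": 0.1}.get   # some other values

print("successor keeps U:  ", expected_value(U, U))   # high by construction
print("successor adopts V: ", expected_value(V, U))   # generally lower under U
```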

Charlie Steiner

Utility functions that are a function of time (or other context) do not convergently self-preserve in that way. The things that self-preserve are utility functions that steadily care about the state of the real world.

Actors that exhibit different-seeming values given different prompts/contexts can be modeled as having utility functions, but those utility functions won't often be of the automatically self-preserving kind.

In the limit of self-modification, we expect the stable endpoints to be self-preserving. But you don't necessarily have to start with an agent that stably cares about the world. You could start with something relatively incoherent, like an LLM or a human.

KvmanThinking

> Utility functions that are a function of time (or other context)

?? Do you mean utility functions that care that, at time t = p, thing A happens, but at t = q, thing B happens? Such a utility function would still want to self-preserve. 

> utility functions that steadily care about the state of the real world

Could you taboo "steadily"? *All* utility functions care about the state of the real world, that's what a utility function is (a description of the exact manner in which an agent cares about the state of the real world), and even if the utility function wants different things to happen at different times, said function would still not want to modify into a different utility function that wants other different things to happen at those times.

Charlie Steiner

Hm, yeah, I think I got things mixed up.

The things you can always fit to an actor are utility functions over trajectories. Even if I do irrational-seeming things (like not self-preserving), that can be accounted for by a preference over trajectories.

But will a random (with some sort of simplicity-ish measure) utility function over trajectories want to self-preserve? For those utility functions where any action is useful, it does seem more likely that it will convergently self-preserve than not. Whoops!

Of course, the underlying reason that humans and LLMs do irrational-seeming things is not because they're sampled from a simplicity-ish distribution over utility functions over trajectories, so I think Zac's question still stands.

JBlack

That doesn't address the question at all. That just says if the system is well modelled as having a utility function, then ... etc. Why should we have such high credence that the premise is true?
