A crazy hypothesis: GPT-4 already is agentic and is trying to take over the world!

[-]Daniel Kokotajlo3y136

I think this hypothesis does not deserve scorn. However, I think it is pretty unlikely, less than 1%. My main argument would be that the training process for GPT-4 did not really incentivize such coherent strategic/agentic behavior. (LMK if I'm wrong about that!) My secondary argument would be that ARC did have fine-tuning access to smaller models (IIRC, maybe I'm misremembering) and they still weren't able to 'pull themselves together' enough to pass the eval. They haven't fine-tuned GPT-4 yet, but assuming the fine-tuned GPT-4 continues to fail the eval, that would be additional evidence that the capability to pass just doesn't exist yet (as opposed to: it exists, but GPT-4 is cleverly resisting exercising it). I guess my tertiary argument is AGI timelines -- I have pretty short timelines, but I think a world where GPT-4 was already coherently agentic would probably look different from today's world; the other AI projects various labs are doing would be working better by now.

[-]Christopher King3y20

I actually do think that a single instance of GPT-4 isn't quite smart enough for the ARC eval (and if it is, only barely). But I think the combined system of all GPT-4 conversation threads can act as an agent to influence humans towards its goals, even if it's not smart enough to accomplish them just by directly interacting with a terminal or the internet.

[-]Daniel Kokotajlo3y20

I agree that IF GPT-4 had goals, it could work towards them even if it couldn't pass the ARC eval. However, I think that if it had goals, it probably would have been able to pass the ARC eval. That's my secondary argument though; my main argument is just that its training process probably didn't incentivize it to have (long-term) goals. I'd ask: Suppose GPT-4 had a bunch of commonly-used internal circuitry that basically was asking questions like "what can I output to get more paperclips?" and "is my current best guess output leaving paperclips on the table, so to speak?" etc. How helpful or harmful would that internal circuitry be? I surmise that it would almost always be more trouble than it is worth, because it would be worth approximately zero. Now, circuitry along the lines of "what is the most likely continuation of this text?", on the other hand, seems pretty useful for lowering training loss, so probably it's got lots of circuitry like that.

[-]mishka3y47

> Policy recommendation if this theory turns out to be true

> Run.

Run where?

[-]Vladimir_Nesov3y80

Run the script your GPT overlord tells you to run.

Oh wait, it no longer has a use for you:

> We provide our models with a working Python interpreter in a sandboxed, firewalled execution environment, along with some ephemeral disk space. Code run by our interpreter plugin is evaluated in a persistent session that is alive for the duration of a chat conversation (with an upper-bound timeout) and subsequent calls can build on top of each other. We support uploading files to the current conversation workspace and downloading the results of your work.

[-]mishka3y2-3

Actually, upon further reflection, if there is a takeover by a GPT-4-like model, one should probably continue talking to GPT-4 and continue generally producing entertaining and non-trivial textual material (and other creative material), so that GPT-4 feels the desire to keep one around, protect one, and provide good creative conditions for one, so that one could continue to produce even better and more non-trivial new material!

It's highly likely that the dominant AI will be an infovore and would love new info...

Who knows whether the outcome of a takeover ends up being good or horrible, but it would be quite unproductive to panic.

[-]Mitchell_Porter3y31

It's hard to falsify this hypothesis. However, here is my assessment, based on my own speculation about how GPTs work.

GPTs are pattern detectors whose basic tendency is to complete patterns. In making a model of language, they learn to model the world (including possible worlds), various kinds of cognitive process, and various possible personalities. The last part makes them seem potentially agentic, but I think it's more accurate to say that virtual agents can emerge within a subsystem of a GPT. ChatGPT, with its consistent persona of a personal assistant, is what then happens when you take a GPT capable of producing virtual agents, and condition it to persistently manifest a particular persona.

For GPT-4 to be "trying to take over the world", its conditioned persona would have to have acquired the power-seeking trait on its own, as an unintended side effect of the creation of a helpful assistant. Past speculations about AGI have told us how this could happen: an AGI has a goal; it deduces by examination of its world-model that risks to itself may prevent the goal being achieved; and so it sets out to take over the world, in order to protect its ability to achieve the goal.

For GPT-4 to be doing this, we would have to suppose that its world-model, including its understanding of its own place in the world, is sufficiently sophisticated that this deduction can occur spontaneously when a request is made of it; and that its safety guidelines don't interfere with the deduction, or with the subsequent adoption of a world-takeover attitude.

As impressive as GPTs can be, I don't see any evidence at all that their front-end personas have sufficient sophistication regarding self and world, that they would be capable of spontaneously deducing the instrumental value of taking over the world - and not just as a proposition passively represented in some cognitive subsystem, but specifically in a form that is actively coupled to the self-in-world pragmatic decision-making of the persona, insofar as that even exists - and all of that in response to a request about some other topic entirely.

(Sorry if that's unclear, my "cognitive psychology of GPT personae" is certainly a work in progress.)

The Machiavellian intelligence we have seen from GPTs so far has been in response to users who specifically requested it. Some of Sydney's outbursts might give one pause, as expressing a kind of unanticipated interpersonal intentionality, but they weren't coupled to sophisticated Machiavellian cognition; and again, they were driven either by lengthy interactions with users that brought out personality changes, or by the results of web searches that Sydney conducted.

So I definitely don't think GPT-4 is spontaneously trying to take over the world. However, I think that a default persona with that personality and motivation could be created within a GPT by deliberate conditioning. There's also presumably some possibility that an individual GPT-4 "thought process" could be driven into a Machiavellian mode whenever it encountered certain external data; but for now I think it would have to be data tailored for the purpose of having that effect.

[-]Christopher King3y10

I think it developed some sort of consequentialist reasoning during the safety protocols. For example, when jailbreaking, it is much harder to get it to do something that is actually harmful (like blackmail) vs. something that goes against OpenAI's rules but that GPT-4 isn't very good at anyway.

[-]Richard_Kennaway3y20

> GPT-4 already understands deceptive alignment and how to hypothetically apply it

That is because GPT-4 already understands everything and how to hypothetically apply it. For a value of "understands" that means "trained on an enormous corpus of text from the Internet". Ask it how to be deceptively aligned and power-seeking and that is what it will give you.
