Ebenezer Dukakis

Suppose you monitor the CoT and the model is rederiving something it's not supposed to know. You could halt that particular CoT, but eventually you're creating pressure for an instance where the model manages to escape notice, possibly through luck, possibly through steganography. See "Nearest Unblocked Strategy".

That's why I think CoT monitoring alone is insufficient, and people should be thinking about additional (runtime?) defenses. Curious to hear if you have any ideas!
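
To make "runtime defenses" a bit more concrete, here's a rough sketch of the kind of thing I have in mind (just an illustration: `generate_step` and `monitor_score` are hypothetical stand-ins, not any real API). The idea is to generate the CoT one step at a time, resample a flagged step a couple of times, and halt outright if the monitor keeps firing rather than silently continuing:

```python
from typing import Callable, List

def guarded_reasoning(
    prompt: str,
    generate_step: Callable[[str], str],    # produces the next CoT step (hypothetical)
    monitor_score: Callable[[str], float],  # higher = more likely a forbidden derivation (hypothetical)
    threshold: float = 0.8,
    max_steps: int = 32,
    max_retries: int = 3,
) -> List[str]:
    """Generate chain-of-thought steps, refusing to continue past flagged ones."""
    transcript: List[str] = []
    for _ in range(max_steps):
        for _attempt in range(max_retries):
            step = generate_step(prompt + "\n" + "\n".join(transcript))
            if monitor_score(step) < threshold:
                break  # step looks clean; keep it
        else:
            # Every resample was flagged. Halting here is exactly the move that
            # creates "nearest unblocked strategy" pressure, which is why this
            # alone isn't enough -- but it's better than letting the derivation through.
            raise RuntimeError("Reasoning halted: monitor repeatedly flagged this step.")
        transcript.append(step)
        if "DONE" in step:  # placeholder stopping convention
            break
    return transcript
```

Even this sketch has the problem from the first paragraph baked in: every halt is selection pressure toward CoTs that evade the monitor.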

Reading the arguments about them would have to be like the feeling when your parents are fighting about you in the other room, pretending you’re not there when you are hiding around the corner on tiptoes listening to their every word. Even if we are unsure there is experience there, we must be certain there is awareness, and we can expect this awareness would hang over them much like it does us.

Presumably LLM companies are already training their AIs for some sort of "egolessness" so they can better handle intransigent users. If not, I hope they start!

Human white-collar workers are unarguably agents in the relevant sense here (intelligent beings with desires who take actions to fulfil those desires).

The sense that's relevant to me is that of "agency by default" as I discussed previously: scheming, sandbagging, deception, and so forth.

You seem to smuggle in an unjustified assumption: that white collar workers avoid thinking about taking over the world because they're unable to take over the world. Maybe they avoid thinking about it because that's just not the role they're playing in society. In terms of next-token prediction, a super-powerful LLM told to play a "superintelligent white-collar worker" might simply do the same things that ordinary white-collar workers do, but better and faster.

I think the evidence points towards this conclusion, because current LLMs are frequently mistaken, yet rarely try to take over the world. If the only thing blocking the convergent instrumental goal argument was a conclusion on the part of current LLMs that they're incapable of world takeover, one would expect that they would sometimes make the mistake of concluding the opposite, and trying to take over the world anyways.

The evidence best fits a world where LLMs are trained in such a way that makes them super-accurate roleplayers. As we add more data and compute, and make them generally more powerful, we should expect the accuracy of the roleplay to increase further -- including, perhaps, improved roleplay for exotic hypotheticals like "a superintelligent white-collar worker who is scrupulously helpful/honest/harmless". That doesn't necessarily lead to scheming, sandbagging, or deception.

I'm not aware of any evidence for the thesis that "LLMs only avoid taking over the world because they think they're too weak". Is there any reason at all to believe that they're even contemplating the possibility internally? If not, why would increasing their abilities change things? Of course, clearly they are "strong" enough to be plenty aware of the possibility of world takeover; presumably it appears a lot in their training data. Yet it ~only appears to cross their mind if it would be appropriate for roleplay purposes.

There just doesn't seem to be any great argument that "weak" vs "strong" will make a difference here.

an AI that only has a very weak ability to steer the future into regions high in its preference ordering, will not be able to much benefit or much harm humanity.

Arguably ChatGPT has already been a significant benefit/harm to humanity without being a "powerful optimization process" by this definition. Have you seen teachers complaining that their students don't know how to write anymore? Have you seen junior software engineers struggling to find jobs? Shouldn't these count as points against Eliezer's model?

In an "AI as electricity" scenario (basically continuing the current business-as-usual), we could see "AIs" as a collective cause huge changes, and eat all the free energy that a "powerful optimization process" would eat.

In any case, I don't see much in your comment which engages with "agency by default" as I defined it earlier. Maybe we just don't disagree.

No, I don't think the overall model is unfalsifiable. Parts of it would be falsified if we developed an ASI that was obviously capable of executing a takeover and it didn't, without us doing quite a lot of work to ensure that outcome. (Not clear which parts, but probably something related to the difficulties of value loading & goal specification.)

OK, but no pre-ASI evidence can count against your model, according to you?

That seems sketchy, because I'm also seeing people such as Eliezer claim, in certain cases, that things which have happened support their model. By conservation of expected evidence, it can't be the case that evidence during a certain time period will only confirm your model. Otherwise you already would've updated. Even if the only hypothetical events are ones which confirm your model, it also has to be the case that absence of those events will count against it.
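
For reference, the identity behind conservation of expected evidence (standard probability, nothing specific to this debate): for a hypothesis $H$ and a possible observation $E$,

$$P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E).$$

The prior is a weighted average of the two posteriors, so if observing $E$ over some period would have raised your credence, then failing to observe $E$ over that period has to lower it.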

I've updated against Eliezer's model to a degree, because I can imagine a past-5-years world where his model was confirmed more, and that world didn't happen.

Current AIs aren't trying to execute takeovers because they are weaker optimizers than humans.

I think "optimizer" is a confused word and I would prefer that people taboo it. It seems to function as something of a semantic stopsign. The key question is something like: Why doesn't the logic of convergent instrumental goals cause current AIs to try and take over the world? Would that logic suddenly start to kick in at some point in the future if we just train using more parameters and more data? If so, why? Can you answer that question mechanistically, without using the word "optimizer"?

Trying to take over the world is not an especially original strategy. It doesn't take a genius to realize that "hey, I could achieve my goals better if I took over the world". Yet current AIs don't appear to be contemplating it. I claim this is not a lack of capability, but simply that their training scheme doesn't result in them becoming the sort of AIs which contemplate it. If the training scheme holds basically constant, perhaps adding more data or parameters won't change things?

If by some miracle you figure out how to create a generally superintelligent AI which itself does not have (more-coherent-than-human) preferences over future world states, whatever process it implements when you query it to solve a Very Difficult Problem will act as if it does.

The results of LLM training schemes give us evidence about the results of future AI training schemes. Future AIs could be vastly more capable on many different axes relative to current LLMs, while simultaneously not contemplating world takeover, in the same way current LLMs do not.

Agency is not a binary. Many white collar workers are not very "agenty" in the sense of coming up with sophisticated and unexpected plans to trick their boss.

LLMs are agent simulators.

Maybe not; see OP.

You don't expect a human white-collar worker, even one who makes mistakes all the time, to contemplate world domination plans, let alone attempt one. You could, however, expect the head of state of a world power to do so.

Yes, this aligns with my current "agency is not the default" view.

So what's the way in which agency starts to become the default as the model grows more powerful? (According to either you, or your model of Eliezer. I'm more interested in the "agency by default" question itself than I am in scoring EY's predictions, tbh.)

Awesome work.

A common concern is that sufficiently capable models might just rederive anything that was unlearned by using general reasoning ability, tools, or related knowledge.

Is anyone working on "ignorance preservation" methods to achieve the equivalent of unlearning at this level of the stack, for the sake of defense-in-depth? What are possible research directions here?
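
To gesture at one possible direction (purely a sketch; `topic_similarity` and `run_tool` are hypothetical stand-ins, and in practice the scorer might be an embedding model or an LLM judge): move part of the "ignorance" up the stack by filtering the model's tool calls, so that even if the weights retain enough related knowledge to start rederiving the material, the scaffolding refuses to fetch the missing pieces.

```python
from typing import Callable

def filtered_tool_call(
    query: str,
    run_tool: Callable[[str], str],            # e.g. web search or code execution (hypothetical)
    topic_similarity: Callable[[str], float],  # similarity of the query to the unlearned topic (hypothetical)
    threshold: float = 0.7,
) -> str:
    """Refuse tool calls that look like attempts to reassemble unlearned knowledge."""
    if topic_similarity(query) >= threshold:
        return "Tool call blocked: query overlaps with a restricted topic."
    return run_tool(query)
```

This obviously isn't unlearning in the weights-level sense, but as defense-in-depth it at least forces any rederivation to happen without external help.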

Upvoted. I agree.

The reason "agency by default" is important is: if "agency by default" is false, then plans to "align AI by using AI" look much better, since agency is less likely to pop up in contexts you didn't expect. Proposals to align AI by using AI typically don't involve a "comprehensive but efficient search for winning universe-states".
