It's now been 2.5 years. I think this resolves negatively?
Around 50% within 2 years or over all time?
Thanks for the clarification. My conclusion is that I think your emoji was meant to signal disagreement with the claim that 'opaque vector reasoning makes a difference' rather than a thing I believe.
I had rogue AIs in mind as well, and I'll take your word on "for catching already rogue AIs and stopping them, opaque vector reasoning doesn't make much of a difference".
Why do you think that?
Doesn't the mountain of posts on optimization pressure explain why ending with "U3 was up a queen and was a giga-grandmaster and hardly needed the advantage. Humanity was predictably toast" is actually sufficient? In other words, doesn't someone who understands all the posts on optimization pressure not need the rest of the story after the "U3 was up a queen" part to understand that the AIs could actually take over?
If you disagree, then what do you think the story offers that makes it a helpful concrete example for people who both are skeptical that AIs can take over and already understand the posts on optimization pressure?
Ryan disagree-reacted to the bold part of this sentence in my comment above and I'm not sure why: "This tweet predicts two objections to this story that align with my first and third bullet point (common objections) above."
This seems pretty unimportant to gain clarity on, but I'll explain my original sentence more clearly anyway:
For reference, my third bullet point was the common objection: "How would humanity fail to notice this and/or stop this?"
To my mind, someone objecting that the story is unrealistic because "there's no reason why OpenAI would ever let the model do its thinking steps in opaque vectors instead of written out in English" (as stated in the tweet) is an objection of the form "humanity wouldn't fail to stop AI from sneakily engaging in power-seeking behavior by thinking in opaque vectors." Like it's a "sure, AI could take over if humanity were dumb like that, but there's no way OpenAI would be dumb like that."
It seems like Ryan was disagreeing with this with his emoji, but maybe I misunderstood it.
Good point. At the same time, I think the underlying cruxes that lead people to be skeptical of the possibility that AIs could actually take over are commonly:
I mention these points because people who raise these objections typically wouldn't raise them against the idea of an intelligent alien species invading Earth and taking over.
People generally have no problem granting that aliens may not share our values, may have actuators / the ability to physically wage war against humanity, and could plausibly overpower us with their superior intellect and technological know-how.
Providing a detailed story of what a particular alien takeover process might look like, then, isn't necessarily helpful for addressing the objections people raise about AI takeover.
I'd propose that authors of AI takeover stories should therefore make sure that they aren't just describing aspects of a plausible AI takeover story that could just as easily be aspects of an alien takeover story, but are instead actually addressing people's underlying reasons for being skeptical that AI could take over.
This means doing things like focusing on explaining:
(With this comment I don't intend to make a claim about how well the OP story does these things, though that could be analyzed. I'm just making a meta point about what kind of description of a plausible AI takeover scenario I'd expect to actually engage with the actual reasons for disagreement of the people who say "can the AIs actually take over".)
Edited to add: This tweet predicts two objections to this story that align with my first and third bullet point (common objections) above:
It was a good read, but the issue most people are going to have with this is how U3 develops that misalignment in its thoughts in the first place.
That, plus there's no reason why OpenAI would ever let the model do its thinking steps in opaque vectors instead of written out in English, as it is currently
Thanks for the story. I found the beginning the most interesting.
U3 was up a queen and was a giga-grandmaster and hardly needed the advantage. Humanity was predictably toast.
I think ending the story like this is actually fine for many (most?) AI takeover stories. The "point of no return" has already occurred at this point (unless the takeover wasn't highly likely to be successful), and so humanity's fate is effectively already sealed even though the takeover hasn't happened yet.
What happens leading up to the point of no return is the most interesting part because it's the part where humanity can actually still make a difference to how the future goes.
After the point of no return, I primarily want to know what the (now practically inevitable) AI takeover implies for the future: does it mean near-term human extinction, or a future in which humanity is confined to Earth, or a managed utopia, etc?
Trying to come up with a detailed, concrete, plausible story of what the actual process of takeover looks like seems less interesting (at least to me). So I would have preferred more detail and effort put into the beginning of the story, explaining how humanity managed to fail to stop the creation of a powerful agentic AI that would take over, rather than seeing as much detail and effort put into imagining how the takeover actually happens.
Specifically I’m targeting futures that are at my top 20th percentile of rate of progress and safety difficulty.
Does this mean that you think AI takeover within 2 years is at least 20% likely?
Or are there scenarios where progress is even faster and safety is even more difficult than illustrated in your story and yet humanity avoids AI takeover?
The fake names are a useful reminder and clarification that it's fiction.
I was happy to see the progression in what David Silver is saying re what goals AGIs should have:
David Silver, April 10, 2025 (from 35:33 of DeepMind podcast episode Is Human Data Enough? With David Silver):
David Silver: And so what we need is really a way to build a system which can adapt and which can say, well, which one of these is really the important thing to optimize in this situation. And so another way to say that is, wouldn't it be great if we could have systems where, you know, a human maybe specifies what they want, but that gets translated into a set of different numbers that the system can then optimize for itself completely autonomously.
Hannah Fry: So, okay, an example then: let's say I said, okay, I want to be healthier this year. And that's kind of a bit nebulous, a bit fuzzy. But what you're saying here is that that can be translated into a series of metrics like resting heart rate or BMI or whatever it might be. And a combination of those metrics could then be used as a reward for reinforcement learning. Have I understood that correctly?
Silver: Absolutely correctly.
Fry: Are we talking about one metric, though? Are we talking about a combination here?
Silver: The general idea would be that you've got one thing which the human wants, like "optimize my health." And then the system can learn for itself which rewards help you to be healthier. And so that can be like a combination of numbers that adapts over time. So it could be that it starts off saying, okay, well, you know, right now it's your resting heart rate that really matters. And then later you might get some feedback saying, hang on, I really don't just care about that, I care about my anxiety level or something. And then it includes that in the mixture. And based on feedback it could actually adapt. So one way to say this is that a very small amount of human data can allow the system to generate goals for itself that enable a vast amount of learning from experience.
Fry: Because this is where the real questions of alignment come in, right? I mean, if you said, for instance, let's do a reinforcement learning algorithm that just minimizes my resting heart rate. I mean, quite quickly, zero is a good minimization strategy, which would achieve its objective, just not maybe quite in the way that you wanted it to. I mean, obviously you really want to avoid that kind of scenario. So how do you have confidence that the metrics that you're choosing aren't creating additional problems?
Silver: One way you can do this is to leverage the same answer which has been so effective so far elsewhere in AI, which is that at that level you can make use of some human input. If it's a human goal that we're optimizing, then we probably at that level need to measure, you know, and say, well, the human gives feedback to say, actually, I'm starting to feel uncomfortable. And in fact, while I don't want to claim that we have the answers, and I think there's an enormous amount of research to get this right and make sure that this kind of thing is safe, it could actually help in certain ways in terms of this kind of safety and adaptation. There's this famous example of paving over the whole world with paperclips when a system's been asked to make as many paperclips as possible. If you have a system whose overall goal is really to, you know, support human well-being, and it gets that feedback from humans, and it understands their distress signals and their happiness signals and so forth, then the moment it starts to create too many paperclips and starts to cause people distress, it would adapt that combination and it would choose a different combination and start to optimize for something which isn't going to pave over the world with paperclips. We're not there yet, but I think there are some versions of this which could actually end up not only addressing some of the alignment issues that have been faced by previous approaches to, you know, goal-focused systems, but maybe even be more adaptive and therefore safer than what we have today.