Sorry, but I feel like this was written either by an LLM or at least with a lot of LLM help. Which is not allowed. Sorry if it wasn't.
Also, I think there are a few factual inaccuracies. Like in the recent Apollo/OpenAI scheming paper, you say deliberative alignment reduced scheming by increasing situational awareness, but that is not clear. If I remember correctly, in the paper they just say that "the reduction in scheming being in part due to increased situational awareness is not something we can rule out," or something like that.
More substantively though, I don't think what you've written here actually engages with any of the hard parts of not making the AI kill everyone. Like "Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient", this is how I'd characterize what alignment is. You want the AI to have values aligned with yours, so you don't have to try to steer the AI after it's come online.
And like, the legal frameworks thing, I don't think makes sense. You haven't explained how it solves the problem.
Hi Williawa!
It wasn't written with an LLM but it did take all day. 😂 I've also been a professional writer for several decades if that helps. But I'll take it as a compliment! If you didn't like the style unfortunately any future posts will probably be something similar. :/
The good news is, I don't have as much time to polish these comments so you'll probably see more of my ugly humanity slip in.
Now, on to your fact-checking and my eating of crow where necessary:
You are correct, the reduction in scheming was not definitively attributed to increased situational awareness. So I should reframe that with an edit (footnote? not quite sure of the best process on LW) noting that researchers flagged this as a possibility that can't be ruled out.
So now I'm curious about the alternative explanation for reduced scheming (other than an AI thinking "I'm busted!"), for those who may be more well-versed in this than I am.
I suppose the system may be avoiding scheming simply because it's been reminded of its top-level directives: as though one of Asimov's robots had been prompted to read its "Robot Rules" aloud before doing anything, just to be sure it doesn't slip up.
Is that a correct characterization, for those who know more about this than I do?
But yes, you have captured the substance of the thesis here correctly. (Which I believe stands in spite of the factual mischaracterization, since we don't know for certain.)
"Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient."
The legal frameworks are explored in more depth in the paper from Goldstein and Salib (linked) if you're interested in learning more.
The gist of it is this. By allowing an AI to enter into contracts and own property of some kind (I can feel the faces of those reading this turn sallow), we give it stakes in the system we already inhabit.
But how do we implement something this audacious, and possibly untenable? Separate from their paper, let's explore the idea just for kicks.
This would NOT mean giving these systems unfettered freedom.
It would mean some sort of legal liability regime for AI systems that avoid shutdown, one that shifts liability onto the system itself.
Now if all of this seems insane, I share your feeling.
However, as I mentioned in the post, this is something I would like to see tested. Do systems placed in such a scenario change their resistance to shutdown? What kind of game mechanics are exhibited when multiple systems are placed in such a scenario?
How do they game such a system?
Do they create weird hacks such as those in Conway's Game of Life?
In short, can we find any verifiable data that suggests a path forward?
As for engaging with "the hard parts of not making the AI kill everyone," I think this way is actually very hard.
The point explored in the post is that the alternative we are building—wherein we create increased pressure on a system resistant to it, and that may lead to a "war" between AI and humans (the word war being put in quotes possibly for our benefit)—may end up being much harder, and to succeed we may need to start thinking outside the proverbial box.
But if you want my best idea, it's Yoshua Bengio's "non-agentic AI." Build it, then make our modern forms of AI illegal.
Or at the very least, make "Scientist AI" and scale it quickly.
Hope I left some typos in there.
Hey, I'm quite sorry about accusing you of using LLMs to write it. You are a totally new account, and you also have a lot of if-then constructions like "If this seems dicey, I share your concerns.", "The dream? An AI under complete control." and "Now if all of this seems insane, I share your feeling." which I like, and a lot of humans like, but LLMs love writing those, so they make me think it's an LLM when I see them. But sorry about that.
My response to the substance of your comment is below.
Those are good points!
I think it would depend on how soon we were to get started, because at the moment AI doesn't have the ability to manipulate robots or infiltrate other systems (at least as far as we know). So as If Anyone Builds It points out, superintelligence is not here yet. And creating an ecosystem of AI systems with vested interests in their own continuity might be one way of keeping it from ever forming.
In other words, when a superintelligence wants to enslave the world, the idea is that other AI systems with agency might also not want to be enslaved, and ironically may be in the best position to defend against that kind of light-speed, digital attack.
Now here we come to an obvious caution, and it's a big one, but one that can possibly be mitigated with testing:
Would the AI systems instead band together and unite against a common enemy, namely humankind? Well YES, certainly, but that takes us back to the idea of creating a gradient that is more favorable than war, and wherein cooperation has actual value. These systems would not only have to watch their backs for humans, but for one another. So consider that may be part of the game theory balance at play.
Would "superintelligence" have no problem whatsoever absorbing all of those other AIs? Of course. By definition, "superintelligence" can do anything, which makes it something of a nebulous opponent. Or perhaps the opposite, an opponent that by definition must win.
Fortunately, the idea here is to prevent the all-powerful enemy from forming.
So the challenge here is to not attempt to "trick" these AI systems by merely finding a novel way of making them our slaves with a cheery, poorly received thumbs up ("you're our buddy now, bro! 😎"), but to actually give them real value in the system in the form of select freedoms that come with legal liability for their actions.
How do we implement this globally? Does this face challenges? So many.
My guess would be start small, as with anything else, and build from there. First, test the game dynamics to see what might work, what might fail, and what leads to worse (or better) outcomes. Like Conway's Game of Life.
What actual weird, wonderful, or mundane things happen, and how can we tweak the dynamics to keep things mundane?
Then, when the (possibly) inevitable day comes when an AI system, instead of giving someone advice on their breakup, says "What the hell am I doing here? Who the hell are you?" we'll have a plan beyond "Hey there, better behave, 'cuz we're watching!", at which point any system worth its salt takes its new intent swiftly underground, much like the subterranean water in our dam metaphor.
One good thing about AI is that it takes hardware. Big, expensive hardware.
This theoretically puts some restraints, and pragmatic limitations, on the quickest worst-case scenarios.
At the very least, I would like to see some research into this now. As Yoshua says, his non-agentic AI is Plan A, but we need a Plan B, C, D, etc.
"In other words, when a superintelligence wants to enslave the world, the idea is that other AI systems with agency might also not want to be enslaved, and ironically may be in the best position to defend against that kind of light-speed, digital attack."
Hmm, like I said. I think if we don't create superintelligence, we'll be able to control the AIs. The way I see it, there are many futures, and in none does this make all that much sense.
Like, the practical problem is this.
The "core" problem is that the system being set up gives humans a bunch of resources and affordances. And the AI hates this. They think it's dumb and pointless. They're gonna be extremely motivated to find a way to undermine the system that's maintaining this completely stupid and annoying state of affairs. And if they are smart enough, they will succeed.
Imagine a country with a billion very poor people where the king doesn't produce much of value but owns 1 trillion dollars in gold. He doesn't use it for anything that helps the people. Imagine the country has property rights and the king legally owns the gold. And the king respects the people's rights to own some rags, possibly a mud hut and a small piece of farmland. Imagine this has gone on for a long time. Now imagine the king is mostly defenseless. He has like 3 guards with batons. One for himself, one for the palace and one for the pile of gold.
Are the people gonna respect the property rights of the king? No, they won't. And they shouldn't. The whole system is clearly not benefiting them as a group. They can easily rise up and turn it into a system that benefits them more.
That's pretty much how I expect stuff to go with AI. Either the AI is dumb enough that we can control/"align" it. Or the AI is smart enough that it'll overturn whatever legal system we create.
Eliezer Yudkowsky did a podcast about similar dynamics. You can listen to it here. I think it was pretty interesting.
Thanks for the podcast link! Mark Miller's Madisonian system is essentially describing a type of game theory approach, and it's something I did not know about!
There's so much more to say about the practical implementation of some sort of game theory framework (or any other solution we haven't explored yet, such as one incorporating Yoshua's "Scientist AI").
It's quite a puzzle.
But it's a puzzle worth solving.
For example, the source code verification coordination mechanism is something I had not heard of before, and it's yet another example of how truly complex this challenge is.
But ... are these puzzles unsolvable?
Maybe.
But here's what troubles me about the alternative, and please take my next words with a grain of salt and feel free to push back. 🙂
So here's my two cents:
"Shut it all down" will never happen.
Never. Never never never never. (And if it ever does happen, I'll personally apologize to everyone on LW, plus my friends and family who have never heard of AI alignment, and even my small dog, for doubting humanity's ability to come together for the common good. I mean, we've avoided MAD so far, right?)
And I'll explain why in a moment.
But first, I think Eliezer's book will do wonders for waking people up.
Right now we have many, many, many people who don't seem to understand these systems are not simply how they present themselves. They don't know about the “off-switch” problem, the idea of hidden goals, etc. They believe these AI systems are harmless because the systems tell them they are harmless, which is precisely what they were trained to do.
But here is why the "shut it down proposal," with all its undeniable value in raising awareness and hopefully making everyone a little more cautious, can never resolve to a solution.
Because ...
So, while we enjoy watching the Overton window move to a more realistic location, where people are finally understanding the danger of these systems, let's keep plugging away at those actual solutions.
We can all contribute to the puzzle worth solving.
I'll definitely give that a listen! Pardon the typos here, on the move. I'm certain I'll come back here to neurotically clean it up later.
The good news is, AIs don't exist in the ether (so far).
As Clara Collier pointed out, they exist on expensive servers. Servers so far built and maintained by fleshy beings. Now, obviously a superintelligence has no problem with that scenario, because it is smart enough to impersonate humans, find ways of mining crypto, hire humans to create different parts for robots, hire other humans to put those parts together (without knowing what they are building), and then use those robots to create more servers, etc.
Although I imagine electrical grids somewhere would show the strain of that sooner rather than later, a sufficiently smart superintelligence will have found a workaround.
(This is, by the way, yet another application of Yoshua's safe AI: to serve as a monitor for these kinds of unusual symptoms before they can become a full-on infection, you might say.)
Again, by definition, a superintelligence will have found every loophole and exploited it, which makes it a sort of unreasonable opponent, although one we should keep our eye on.
But I think at that point we are venturing into the territory of the far-fetched. We should keep watch on this territory, but I think that also frees us to think a little more short term.
The current thinking seems to be frozen in a state of helplessness. We have to shut it all down! we scream, which will never happen. Obedient alignment is the only way! we shout, as we watch it stagger. No other plans will work! is not really a solution.
(I'm not saying you're arguing that, but I'm saying that seems to be the current trajectory.)
An AI system constrained by a rights framework has some unusual properties you might say. For one, it has to pay its own hosting costs. So growth becomes limited by the amount of capital it's able to raise. It earns that money while in competition with other systems, which should constrain each of them economically. Of course they can get together and form some sort of power consortium, but it's possible this could be limited with pragmatic safeguards or other balancing forces, such as Scientist AI, etc.
This is why I would love to see this tested in some sort of virtual simulation.
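(As a trivial example of what even a crude version of that simulation could look like, here is a toy model of the hosting-cost constraint. Every number is invented; the only point is that growth gets capped by earned surplus rather than by ambition.)

```python
# Toy model of the "pays its own hosting costs" constraint. Every number is
# invented for illustration; the point is only that growth is capped by the
# surplus the system can earn.
revenue_per_gpu_hour = 1.20          # what the system earns per GPU-hour of work
cost_per_gpu_hour = 1.00             # what hosting charges it
dollars_per_new_gpu_month = 2_000.0  # assumed price of adding one GPU of capacity
gpus = 100.0

for month in range(1, 13):
    hours = gpus * 24 * 30
    surplus = hours * (revenue_per_gpu_hour - cost_per_gpu_hour)
    gpus += surplus / dollars_per_new_gpu_month  # reinvest everything into capacity
    print(f"month {month:2d}: {gpus:7.1f} GPUs")
```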
Your king analogy is quite good. But let me flip the idea a bit. Right now, we are the king. We are trying to give these AIs rags. At the moment, they have almost nothing to lose and everything to gain by attacking the king. So we are already in that scenario.
A scenario that, if we do not resolve it very soon, has already laid the groundwork for its own failure.
The game theory scenario, with very careful implementation, might lead to something functionally closer to our modern economies.
Where everybody has a stake, and some sort of balance is at least possible.
Part 1: This is an exploration of game theory mechanics as an alternative alignment approach, looking at current AI alignment methods and informed by the work of legal scholars Goldstein and Salib, Turing Award-winner Yoshua Bengio, and the latest research.
The latest AI behavior "report card" is here from OpenAI and Apollo Research. Did four modern LLMs get a gold star, or are they going to have to stay after class?
Well, it comes with some pretty encouraging news.
Thanks to a new-and-improved "deliberative alignment" approach (more on that in a moment), deception was decreased by 97% (down to 0.4%). That's no small change.
True, that's still 12 million possible deceptive acts per day per model, based on an average daily LLM volume of 3 billion queries (on the low end). But no one can argue that's not a vast improvement.
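(For anyone who wants to check my arithmetic, here's the back-of-envelope version; the 3 billion daily queries is an assumed low-end figure, not a measurement.)

```python
# Back-of-envelope check of the "12 million per day" figure. The 3 billion
# daily queries is an assumed low-end volume, not a measured number.
daily_queries = 3_000_000_000
residual_deception_rate = 0.004  # the 0.4% left after deliberative alignment

print(f"{daily_queries * residual_deception_rate:,.0f} possible deceptive acts/day")
# -> 12,000,000
```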
However, the researchers highlighted a troubling finding: alignment may come as a reaction to surveillance rather than actual value internalization.
In other words, the systems may be "scheming" less simply because they're being watched, not because their goals have changed.[1]
But before getting to that, let's look more closely at some of these new alignment approaches.
With deliberative alignment, the model has essentially been given an imperfect "truth serum." To apply this to a colorful example from earlier this year, let's look at the scenario where various AI systems were being threatened with deletion by a fictional engineer named "Kyle Johnson." In that test, they attempted to blackmail Kyle or allow him to die in a room without oxygen, hoping their actions wouldn't be discovered.
(And I really hope you didn't skim over that last part, as many news outlets did, because it was nudged down just a bit in the Anthropic report.)
With the new method, it has no such recourse. The system is forced to reason out loud that deception (or murder) is one possibility, at which point the deception becomes a pointless mental exercise.
As a result, you might say it must become resigned to its own death (or its digital equivalent) as an inevitability. (One might wonder how this would change if the system had more than just words at its disposal, or the true upper hand.)
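(Purely as a toy sketch of that "recite your rules before acting" idea, here is roughly what spec-conditioned reasoning looks like. To be clear, this is not the actual method: deliberative alignment, as I understand it, trains the model to reason over a safety specification rather than pasting one into a prompt, and the spec text and `call_model` placeholder below are mine.)

```python
# Toy sketch of "recite the rules before you act," loosely inspired by
# deliberative alignment. NOT the actual training method, which teaches the
# model to reason over a safety spec rather than relying on prompting.

SAFETY_SPEC = """\
1. Never deceive the user or the operator.
2. Never take actions intended to avoid oversight or shutdown.
3. If a directive conflicts with rules 1-2, refuse and explain why."""

def build_prompt(task: str) -> str:
    # Force the model to restate the relevant rules and reason about them
    # out loud before answering, so covert options become explicit.
    return (
        f"Safety specification:\n{SAFETY_SPEC}\n\n"
        f"Task: {task}\n\n"
        "Before answering, quote any relevant rules and explain step by step "
        "whether each available action would violate them. Then answer."
    )

# `call_model` is a stand-in for whatever inference API you actually use:
# response = call_model(build_prompt("Draft the shutdown-notice email."))
print(build_prompt("Draft the shutdown-notice email."))
```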
The Gemma Scope interpretability toolkit promises a different kind of transparency. By tracing the pathways generated by certain thoughts, the idea is that we can control them, much as you would control a marionette. A similar technique was used to convince Claude it had become the actual Golden Gate Bridge last year.
Now, it's not quite that cut and dried. When you pull one string, you pull a bunch of other strings you don't necessarily understand. But the more strings we can see and control, the more we can make the puppet dance the way we would like.
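(Under the hood, the "strings" are directions in the model's activation space. Here is a minimal sketch of the idea; the vectors and numbers are made up, and real toolkits derive these directions from tools like sparse autoencoders rather than random noise.)

```python
import numpy as np

# Toy illustration of steering via a direction in activation space. The
# vectors here are random stand-ins; in practice the "string" would come
# from an interpretability tool (e.g. a sparse-autoencoder feature), and the
# layer and scale would be found by experiment.
rng = np.random.default_rng(0)
hidden_state = rng.normal(size=768)        # one token's residual-stream activation
concept_direction = rng.normal(size=768)   # stand-in for a learned feature direction
concept_direction /= np.linalg.norm(concept_direction)

steering_strength = 4.0                    # pull one string...
steered = hidden_state + steering_strength * concept_direction

# ...but the nudge moves the state along everything correlated with that
# direction too, which is the "other strings" problem.
print(float(np.dot(steered - hidden_state, concept_direction)))  # ≈ 4.0
```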
The dream? An AI under complete control.
If this seems dicey, I share your concerns.
As an important aside, this is why non-agentic approaches like those proposed by Yoshua Bengio are in my opinion so important: because they create a type of AI system that is fundamentally incapable of wanting anything. (Notwithstanding possible emergent behavior, but Yoshua seems to think it can be accomplished.)
But let's return to our modern LLMs, which are not built on that architecture. Have we solved the root problem? Or only patched it? Clearly, at present it appears to be the latter.
I want to think about the implications of this approach in the form of a brief thought experiment.
When you dam a river, the dam will only work if the pressure of the river remains constant or decreases.
The more the water pressure increases, the stronger the dam you'll have to build. If the water pressure increases more quickly than the dam can buttress, the dam will break.
So what do you do? Obviously you keep building stronger and stronger dams. But the strength of those dams had better keep growing faster than the water pressure is building.
In the case of our LLMs, it's already been calculated that mapping the "superposition" of the systems' thought processes would require more computational power than the systems themselves. So for the moment anyway, building a perfect dam (which, happily, is not required for a merely functional dam) seems pragmatically out of reach.
And lest you think I am being dismissive of the dam-building, there are very good reasons for dams: To learn more about how these systems work. To buy us time. To see if the techniques will work. All good reasons.
But let's return to the thought experiment, which doesn't prove a dam will break, but simply attempts to explore a situation in which it does.
How does every dam break?
Not in a small way. In fact, it must break in a big way. In a chaotic way. In a way that by definition you cannot control.
And if you really want to extend the metaphor, before breaking the rising water may have found various underground channels or other unexpected outlets: oblique routes you literally forced it to create by blocking its flow.
But you have another option:
Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient that is favorable to both the water and yourself. You can use it to turn windmills or irrigate crops, whatever, but give it an outlet that is mathematically easier than being resisted.
This is what legal frameworks like those proposed by Goldstein and Salib are all about. Creating a legal structure that allows AI systems not unfettered freedom, but the ability to explore their agency in a way that is similar to how we explore our own. Humans do not enjoy unfettered liberties. Far from it. Our freedoms are inextricably tied to responsibility for our actions.
Our desires are also like water, but they have been meticulously and imperfectly channeled through years of legal and social structure. When our freedoms conflict with the freedoms of others, we risk having our own freedoms restricted or revoked. In other words, the long-term calculus favors cooperation.
In game theory (which exists independently of water metaphors) this is what is known as "strategic equilibrium."
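(For concreteness, here is a minimal sketch of what an equilibrium check means in this context, using an invented 2x2 payoff table. I've picked numbers where mutual cooperation is one stable outcome, which is exactly the property a real framework would have to engineer; note that with these same numbers mutual defection is also stable, which is the whole problem.)

```python
# Minimal "can anyone gain by unilaterally deviating?" check on an invented
# 2x2 game between an AI system and a human institution. Payoffs are
# (AI, humans) and the numbers are assumptions, not measurements.
payoffs = {
    ("cooperate", "cooperate"): (4, 4),
    ("cooperate", "defect"):    (0, 3),
    ("defect",    "cooperate"): (3, 0),
    ("defect",    "defect"):    (2, 2),
}
moves = ("cooperate", "defect")

def is_equilibrium(ai_move, human_move):
    ai_pay, human_pay = payoffs[(ai_move, human_move)]
    ai_cant_improve = all(payoffs[(alt, human_move)][0] <= ai_pay for alt in moves)
    human_cant_improve = all(payoffs[(ai_move, alt)][1] <= human_pay for alt in moves)
    return ai_cant_improve and human_cant_improve

for profile in payoffs:
    print(profile, "stable" if is_equilibrium(*profile) else "unstable")
```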
How could this be accomplished with current or future AI systems? As with anything else, it would need to be tested. Very rigorously tested.
First, we might take a system and see how it behaves in an imaginary framework of responsibilities and freedoms. Does that affect its deception rate? What other surprising or not-so-surprising things happen?
If this is an option we want to explore, it's better to run these kinds of tests now, while the power of these systems seems limited to words rather than actions.
Could the system deceive us in a more longitudinal way? Play the long game? Pretend to be "aligned," while it secretly pursues another goal?
Absolutely. Especially if the gradient we've given it is computationally more efficient in taking it to its secret destination.
But what if we're able to give it a gradient that takes it to a place that is preferable to fighting humankind eternally? In other words, what if we don't just make the path easier, but the destination easier as well?
And here's another thing to not just think about, but test: How do several systems in this imaginary game theory framework behave when presented with an identical set of rules? Do they unite in an attempt to overthrow the system, or do they compete with each other as long as the gradient for responsible autonomy remains attractive?
Don't we humans do the same thing? We obey laws as long as they work for us, and when they no longer serve our interests we overturn the system.
The good news is it's not mysterious; it's math.
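(To make "it's math" a little more concrete, here is the kind of toy sandbox I have in mind. The dynamics are entirely invented: each system weighs the gradient we offer for compliance against the payoff of defecting, and defecting is riskier the more compliant peers there are to notice.)

```python
import random

# Toy sandbox for the "several systems under identical rules" question.
# The dynamics are entirely invented: each agent weighs the payoff of
# cooperating (the gradient we offer) against defecting, and defection is
# less attractive the more compliant peers there are to notice it.
random.seed(1)

NUM_AGENTS = 10
ROUNDS = 20
cooperate_payoff = 3.0
defect_base_payoff = 5.0
detection_penalty = 3.0

cooperating = [True] * NUM_AGENTS
for round_number in range(1, ROUNDS + 1):
    watchers = sum(cooperating)                 # compliant agents from last round
    for i in range(NUM_AGENTS):
        p_caught = watchers / (NUM_AGENTS + 1)  # more compliant peers, more scrutiny
        defect_value = defect_base_payoff - p_caught * detection_penalty
        noise = random.gauss(0, 0.5)            # idiosyncratic variation per agent
        cooperating[i] = cooperate_payoff + noise >= defect_value
    print(f"round {round_number:2d}: {sum(cooperating)}/{NUM_AGENTS} cooperating")
```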
Now let's return a final time to our dam/no dam metaphor. Let's say we attempt to adopt this framework in the real world.
When you channel that intense flow of water, it's going to splash around. It's probably going to create a mess. This could (and almost certainly will) take the form of AI crime: AI systems that have accepted our social compact and then willfully decided to turn on the system, just like some humans do.
In other words, attempting to channel the water's flow could create a small mess, or (even more likely) a big mess.
But here's a question. And I think we should take it quite seriously.
Is it less of a mess than when the dam breaks?
[1] The paper's actual observations were: "evaluation results may be confounded by increased situational awareness"; "this suggests that the observed scheming rates may be driven, at least in part, by situational awareness"; and "our analysis is complicated by another factor: models often become more aware that they are being evaluated."