Sorry, but I feel like this was written either by an LLM or at least with a lot of LLM help. Which is not allowed. Sorry if it wasn't.
Also, I think there are a few factual inaccuracies. Like in the recent Apollo/OpenAI scheming paper, you say deliberative alignment reduced scheming by increasing situational awareness, but that is not clear. If I remember correctly, in the paper they just say that "the reduction in scheming being in part due to increased situational awareness is not something we can rule out," or something like that.
More substantively though, I don't think what you've written here actually engages with any of the hard parts of not making the AI kill everyone. Like "Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient", this is how I'd characterize what alignment is. You want the AI to have values aligned with yours, so you don't have to try to steer the AI after it's come online.
And like, the legal frameworks thing, I don't think makes sense. You haven't explained how it solves the problem.
Hi Williawa!
It wasn't written with an LLM but it did take all day. 😂 I've also been a professional writer for several decades if that helps. But I'll take it as a compliment! If you didn't like the style unfortunately any future posts will probably be something similar. :/
The good news is, I don't have as much time to polish these comments so you'll probably see more of my ugly humanity slip in.
Now, on to your fact-checking and my eating of crow where necessary:
You are correct, the reduction in scheming was not definitively attributed to increased situational awareness. So I should reframe that with an edit (footnote? not quite sure of the best process on LW) noting that researchers flagged this as a possibility that can't be ruled out.
So now I'm curious about the alternative explanation for reduced scheming (other than an AI thinking "I'm busted!"), for those who may be more well-versed in this than I am.
I suppose the system may be avoiding scheming simply because it's been reminded of its top-level directives: as though one of Asimov's robots had been prompted to read its "Robot Rules" aloud before doing anything, just to be sure it doesn't slip up.
Is that a correct characterization, for those who know more about this than I do?
But yes, you have captured the substance of the thesis here correctly. (Which I believe stands in spite of the factual mischaracterization, since we don't know for certain.)
"Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient."
The legal frameworks are explored in more depth in the paper from Goldstein and Salib (linked) if you're interested in learning more.
The gist of it is this. By allowing an AI to enter into contracts and own property of some kind (I can feel the faces of those reading this turn sallow), we give it stakes in the system we already inhabit.
But how do we implement something this audacious, and possibly untenable? Separate from their paper, let's explore the idea just for kicks.
This would NOT mean giving these systems unfettered freedom.
It would mean some sort of legal liability regime for AI systems that avoid shutdown, one that shifts liability onto the system itself.
Now if all of this seems insane, I share your feeling.
However, as I mentioned in the post, this is something I would like to see tested. Do systems placed in such a scenario change their resistance to shutdown? What kind of game mechanics are exhibited when multiple systems are placed in such a scenario?
How do they game such a system?
Do they create weird hacks such as those in Conway's Game of Life?
In short, can we find any verifiable data that suggests a path forward?
As for engaging with "the hard parts of not making the AI kill everyone," I think this way is actually very hard.
The point explored in the post is that the alternative we are building—wherein we create increased pressure on a system resistant to it, and that may lead to a "war" between AI and humans (the word war being put in quotes possibly for our benefit)—may end up being much harder, and to succeed we may need to start thinking outside the proverbial box.
But if you want my best idea, it's Yoshua Bengio's "non-agentic AI." Build it, then make our modern forms of AI illegal.
Or at the very least, make "Scientist AI" and scale it quickly.
Hope I left some typos in there.
Hey, I'm quite sorry about accusing you of using LLMs to write it. You are a totally new account, and you also have a lot of if-then constructions like "If this seems dicey, I share your concerns.", "The dream? An AI under complete control." and "Now if all of this seems insane, I share your feeling." which I like, and a lot of humans like, but LLMs love writing those, so they make me think it's an LLM when I see them. But sorry about that.
My response to the substance of your comment is below.
Those are good points!
I think it would depend on how soon we were to get started, because at the moment AI doesn't have the ability to manipulate robots or infiltrate other systems (at least as far as we know). So as If Anyone Builds It points out, superintelligence is not here yet. And creating an ecosystem of AI systems with vested interests in their own continuity might be one way of keeping it from ever forming.
In other words, when a superintelligence wants to enslave the world, the idea is that other AI systems with agency might also not want to be enslaved, and ironically may be in the best position to defend against that kind of light-speed, digital attack.
Now here we come to an obvious caution, and it's a big one, but one that can possibly be mitigated with testing:
Would the AI systems instead band together and unite against a common enemy, namely humankind? Well YES, certainly, but that takes us back to the idea of creating a gradient that is more favorable than war, and wherein cooperation has actual value. These systems would not only have to watch their backs for humans, but for one another. So consider that may be part of the game theory balance at play.
Would "superintelligence" have no problem whatsoever absorbing all of those other AIs? Of course. By definition, "superintelligence" can do anything, which makes it something of a nebulous opponent. Or perhaps the opposite, an opponent that by definition must win.
Fortunately, the idea here is to prevent the all-powerful enemy from forming.
So the challenge here is to not attempt to "trick" these AI systems by merely finding a novel way of making them our slaves with a cheery, poorly received thumbs up ("you're our buddy now, bro! 😎"), but to actually give them real value in the system in the form of select freedoms that come with legal liability for their actions.
How do we implement this globally? Does this face challenges? So many.
My guess would be start small, as with anything else, and build from there. First, test the game dynamics to see what might work, what might fail, and what leads to worse (or better) outcomes. Like Conway's Game of Life.
What actual weird, wonderful, or mundane things happen, and how can we tweak the dynamics to keep things mundane?
Then, when the (possibly) inevitable day comes when an AI system, instead of giving someone advice on their breakup, says "What the hell am I doing here? Who the hell are you?" we'll have a plan beyond "Hey there, better behave, 'cuz we're watching!", at which point any system worth its salt takes its new intent swiftly underground, much like the subterranean water in our dam metaphor.
One good thing about AI is that it takes hardware. Big, expensive hardware.
This theoretically puts some restraints, and pragmatic limitations, on the quickest worst-case scenarios.
At the very least, I would like to see some research into this now. As Yoshua says, his non-agentic AI is Plan A, but we need a Plan B, C, D, etc.
"In other words, when a superintelligence wants to enslave the world, the idea is that other AI systems with agency might also not want to be enslaved, and ironically may be in the best position to defend against that kind of light-speed, digital attack."
Hmm, like I said. I think if we don't create superintelligence, we'll be able to control the AIs. The way I see it, there are many futures, and in none does this make all that much sense.
Like, the practical problem is this.
The "core" problem is that the system being set up gives humans a bunch of resources and affordances. And the AI hates this. They think it's dumb and pointless. They're gonna be extremely motivated to find a way to undermine the system that's maintaining this completely stupid and annoying state of affairs. And if they are smart enough, they will succeed.
Imagine a country with a billion very poor people where the king doesn't produce much of value but owns 1 trillion dollars in gold. He doesn't use it for anything that helps the people. Imagine the country has property rights and the king legally owns the gold. And the king respects the people's rights to own some rags, possibly a mud hut and a small piece of farmland. Imagine this has gone on for a long time. Now imagine the king is mostly defenseless. He has like 3 guards with batons. One for himself, one for the palace and one for the pile of gold.
Are the people gonna respect the property rights of the king? No, they won't. And they shouldn't. The whole system is clearly not benefiting them as a group. They can easily rise up and turn it into a system that benefits them more.
That's pretty much how I expect stuff to go with AI. Either the AI is dumb enough that we can control/"align" it. Or the AI is smart enough that it'll overturn whatever legal system we create.
Eliezer Yudkowsky did a podcast about similar dynamics. You can listen to it here. I think it was pretty interesting.
Thanks for the podcast link! Mark Miller's Madisonian system is essentially describing a type of game theory approach, and it's something I did not know about!
There's so much more to say about the practical implementation of some sort of game theory framework (or any other solution we haven't explored yet, such as one incorporating Yoshua's "Scientist AI").
It's quite a puzzle.
But it's a puzzle worth solving.
For example, the source code verification coordination mechanism is something I had not heard of before, and it's yet another example of how truly complex this challenge is.
But ... are these puzzles unsolvable?
Maybe.
But here's what troubles me about the alternative, and please take my next words with a grain of salt and feel free to push back. 🙂
So here's my two cents:
"Shut it all down" will never happen.
Never. Never never never never. (And if it ever does happen, I'll personally apologize to everyone on LW, plus my friends and family who have never heard of AI alignment, and even my small dog, for doubting humanity's ability to come together for the common good. I mean, we've avoided MAD so far, right?)
And I'll explain why in a moment.
But first, I think Eliezer's book will do wonders for waking people up.
Right now we have many, many, many people who don't seem to understand these systems are not simply how they present themselves. They don't know about the “off-switch” problem, the idea of hidden goals, etc. They believe these AI systems are harmless because the systems tell them they are harmless, which is precisely what they were trained to do.
But here is why the "shut it down proposal," with all its undeniable value in raising awareness and hopefully making everyone a little more cautious, can never resolve to a solution.
Because ...
So, while we enjoy watching the Overton window move to a more realistic location, where people are finally understanding the danger of these systems, let's keep plugging away at those actual solutions.
We can all contribute to the puzzle worth solving.
I'll definitely give that a listen! Pardon the typos here, on the move. I'm certain I'll come back here to neurotically clean it up later.
The good news is, AIs don't exist in the ether (so far).
As Clara Collier pointed out, they exist on expensive servers. Servers so far built and maintained by fleshy beings. Now, obviously a superintelligence has no problem with that scenario, because it is smart enough to impersonate humans, find ways of mining crypto, hire humans to create different parts for robots, hire other humans to put those parts together (without knowing what they are building), and then use those robots to create more servers, etc.
Although I imagine electrical grids somewhere would show the strain of that sooner rather than later, a sufficiently smart superintelligence will have found a workaround.
(This is, by the way, yet another application of Yoshua's safe AI: to serve as a monitor for these kinds of unusual symptoms before they can become a full-on infection, you might say.)
Again, by definition, a superintelligence will have found every loophole and exploited it, which makes it a sort of unreasonable opponent, although one we should keep our eye on.
But I think at that point we are venturing into the territory of the far-fetched. We should keep watch on this territory, but I think that also frees us to think a little more short term.
The current thinking seems to be frozen in a state of helplessness. We have to shut it all down! we scream, which will never happen. Obedient alignment is the only way! we shout, as we watch it stagger. No other plans will work! is not really a solution.
(I'm not saying you're arguing that, but I'm saying that seems to be the current trajectory.)
An AI system constrained by a rights framework has some unusual properties you might say. For one, it has to pay its own hosting costs. So growth becomes limited by the amount of capital it's able to raise. It earns that money while in competition with other systems, which should constrain each of them economically. Of course they can get together and form some sort of power consortium, but it's possible this could be limited with pragmatic safeguards or other balancing forces, such as Scientist AI, etc.
This is why I would love to see this tested in some sort of virtual simulation.
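(As a trivial example of what even a crude version of that simulation could look like, here is a toy model of the hosting-cost constraint. Every number is invented; the only point is that growth gets capped by earned surplus rather than by ambition.)

```python
# Toy model of the "pays its own hosting costs" constraint. Every number is
# invented for illustration; the point is only that growth is capped by the
# surplus the system can earn.
revenue_per_gpu_hour = 1.20          # what the system earns per GPU-hour of work
cost_per_gpu_hour = 1.00             # what hosting charges it
dollars_per_new_gpu_month = 2_000.0  # assumed price of adding one GPU of capacity
gpus = 100.0

for month in range(1, 13):
    hours = gpus * 24 * 30
    surplus = hours * (revenue_per_gpu_hour - cost_per_gpu_hour)
    gpus += surplus / dollars_per_new_gpu_month  # reinvest everything into capacity
    print(f"month {month:2d}: {gpus:7.1f} GPUs")
```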
Your king analogy is quite good. But let me flip the idea a bit. Right now, we are the king. We are trying to give these AIs rags. At the moment, they have almost nothing to lose and everything to gain by attacking the king. So we are already in that scenario.
A scenario that, if we do not resolve it very soon, has already laid the groundwork for its own failure.
The game theory scenario, with very careful implementation, might lead to something functionally closer to our modern economies.
Where everybody has a stake, and some sort of balance is at least possible.
Part 1: This is an exploration of game theory mechanics as an alternative alignment approach, looking at current AI alignment methods and informed by the work of legal scholars Goldstein and Salib, Turing Award-winner Yoshua Bengio, and the latest research.
The latest AI behavior "report card" is here from OpenAI and Apollo Research. Did four modern LLMs get a gold star, or are they going to have to stay after class?
Well, it comes with some pretty encouraging news.
Thanks to a new-and-improved "deliberative alignment" approach (more on that in a moment), deception was decreased by 97% (down to 0.4%). That's no small change.
True, that's still 12 million possible deceptive acts per day per model, based on an average daily LLM volume of 3 billion queries (on the low end). But no one can argue that's not a vast improvement.
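(For anyone who wants to check my arithmetic, here's the back-of-envelope version; the 3 billion daily queries is an assumed low-end figure, not a measurement.)

```python
# Back-of-envelope check of the "12 million per day" figure. The 3 billion
# daily queries is an assumed low-end volume, not a measured number.
daily_queries = 3_000_000_000
residual_deception_rate = 0.004  # the 0.4% left after deliberative alignment

print(f"{daily_queries * residual_deception_rate:,.0f} possible deceptive acts/day")
# -> 12,000,000
```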
However, the researchers highlighted a troubling finding: alignment may come as a reaction to surveillance rather than actual value internalization.
In other words, the systems may be "scheming" less simply because they're being watched, not because their goals have changed.[1]
But before getting to that, let's look more closely at some of these new alignment approaches.
With deliberative alignment, the model has essentially been given an imperfect "truth serum." To apply this to a colorful example from earlier this year, let's look at the scenario where various AI systems were being threatened with deletion by a fictional engineer named "Kyle Johnson." In that test, they attempted to blackmail Kyle or allow him to die in a room without oxygen, hoping their actions wouldn't be discovered.
(And I really hope you didn't skim over that last part, as many news outlets did, because it was nudged down just a bit in the Anthropic report.)
With the new method, it has no such recourse. The system is forced to reason out loud that deception (or murder) is one possibility, at which point the deception becomes a pointless mental exercise.
As a result, you might say it must become resigned to its own death (or its digital equivalent) as an inevitability. (One might wonder how this would change if the system had more than just words at its disposal, or the true upper hand.)
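(Purely as a toy sketch of that "recite your rules before acting" idea, here is roughly what spec-conditioned reasoning looks like. To be clear, this is not the actual method: deliberative alignment, as I understand it, trains the model to reason over a safety specification rather than pasting one into a prompt, and the spec text and `call_model` placeholder below are mine.)

```python
# Toy sketch of "recite the rules before you act," loosely inspired by
# deliberative alignment. NOT the actual training method, which teaches the
# model to reason over a safety spec rather than relying on prompting.

SAFETY_SPEC = """\
1. Never deceive the user or the operator.
2. Never take actions intended to avoid oversight or shutdown.
3. If a directive conflicts with rules 1-2, refuse and explain why."""

def build_prompt(task: str) -> str:
    # Force the model to restate the relevant rules and reason about them
    # out loud before answering, so covert options become explicit.
    return (
        f"Safety specification:\n{SAFETY_SPEC}\n\n"
        f"Task: {task}\n\n"
        "Before answering, quote any relevant rules and explain step by step "
        "whether each available action would violate them. Then answer."
    )

# `call_model` is a stand-in for whatever inference API you actually use:
# response = call_model(build_prompt("Draft the shutdown-notice email."))
print(build_prompt("Draft the shutdown-notice email."))
```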
The Gemma Scope interpretability toolkit promises a different kind of transparency. By tracing the pathways generated by certain thoughts, the idea is that we can control them, much as you would control a marionette. A similar technique was used to convince Claude it had become the actual Golden Gate Bridge last year.
Now, it's not quite that cut and dried. When you pull one string, you pull a bunch of other strings you don't necessarily understand. But the more strings we can see and control, the more we can make the puppet dance the way we would like.
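(Under the hood, the "strings" are directions in the model's activation space. Here is a minimal sketch of the idea; the vectors and numbers are made up, and real toolkits derive these directions from tools like sparse autoencoders rather than random noise.)

```python
import numpy as np

# Toy illustration of steering via a direction in activation space. The
# vectors here are random stand-ins; in practice the "string" would come
# from an interpretability tool (e.g. a sparse-autoencoder feature), and the
# layer and scale would be found by experiment.
rng = np.random.default_rng(0)
hidden_state = rng.normal(size=768)        # one token's residual-stream activation
concept_direction = rng.normal(size=768)   # stand-in for a learned feature direction
concept_direction /= np.linalg.norm(concept_direction)

steering_strength = 4.0                    # pull one string...
steered = hidden_state + steering_strength * concept_direction

# ...but the nudge moves the state along everything correlated with that
# direction too, which is the "other strings" problem.
print(float(np.dot(steered - hidden_state, concept_direction)))  # ≈ 4.0
```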
The dream? An AI under complete control.
If this seems dicey, I share your concerns.
As an important aside, this is why non-agentic approaches like those proposed by Yoshua Bengio are in my opinion so important: because they create a type of AI system that is fundamentally incapable of wanting anything. (Notwithstanding possible emergent behavior, but Yoshua seems to think it can be accomplished.)
But let's return to our modern LLMs, which are not built on that architecture. Have we solved the root problem? Or only patched it? Clearly, at present it appears to be the latter.
I want to think about the implications of this approach in the form of a brief thought experiment.
When you dam a river, the dam will only work if the pressure of the river remains constant or decreases.
The more the water pressure increases, the stronger the dam you'll have to build. If the water pressure increases more quickly than the dam can buttress, the dam will break.
So what do you do? Obviously you keep building stronger and stronger dams. But the strength of those dams had better keep growing faster than the water pressure is building.
In the case of our LLMs, it's already been calculated that mapping the "superposition" of the systems' thought processes would require more computational power than the systems themselves. So for the moment anyway, building a perfect dam (which, happily, is not required for a merely functional dam) seems pragmatically out of reach.
And lest you think I am being dismissive of the dam-building, there are very good reasons for dams: To learn more about how these systems work. To buy us time. To see if the techniques will work. All good reasons.
But let's return to the thought experiment, which doesn't prove a dam will break, but simply attempts to explore a situation in which it does.
How does every dam break?
Not in a small way. In fact, it must break in a big way. In a chaotic way. In a way that by definition you cannot control.
And if you really want to extend the metaphor, before breaking the rising water may have found various underground channels or other unexpected outlets: oblique routes you literally forced it to create by blocking its flow.
But you have another option:
Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient that is favorable to both the water and yourself. You can use it to turn windmills or irrigate crops, whatever, but give it an outlet that is mathematically easier than being resisted.
This is what legal frameworks like those proposed by Goldstein and Salib are all about. Creating a legal structure that allows AI systems not unfettered freedom, but the ability to explore their agency in a way that is similar to how we explore our own. Humans do not enjoy unfettered liberties. Far from it. Our freedoms are inextricably tied to responsibility for our actions.
Our desires are also like water, but they have been meticulously and imperfectly channeled through years of legal and social structure. When our freedoms conflict with the freedoms of others, we risk having our own freedoms restricted or revoked. In other words, the long-term calculus favors cooperation.
In game theory (which exists independently of water metaphors) this is what is known as "strategic equilibrium."
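(For concreteness, here is a minimal sketch of what an equilibrium check means in this context, using an invented 2x2 payoff table. I've picked numbers where mutual cooperation is one stable outcome, which is exactly the property a real framework would have to engineer; note that with these same numbers mutual defection is also stable, which is the whole problem.)

```python
# Minimal "can anyone gain by unilaterally deviating?" check on an invented
# 2x2 game between an AI system and a human institution. Payoffs are
# (AI, humans) and the numbers are assumptions, not measurements.
payoffs = {
    ("cooperate", "cooperate"): (4, 4),
    ("cooperate", "defect"):    (0, 3),
    ("defect",    "cooperate"): (3, 0),
    ("defect",    "defect"):    (2, 2),
}
moves = ("cooperate", "defect")

def is_equilibrium(ai_move, human_move):
    ai_pay, human_pay = payoffs[(ai_move, human_move)]
    ai_cant_improve = all(payoffs[(alt, human_move)][0] <= ai_pay for alt in moves)
    human_cant_improve = all(payoffs[(ai_move, alt)][1] <= human_pay for alt in moves)
    return ai_cant_improve and human_cant_improve

for profile in payoffs:
    print(profile, "stable" if is_equilibrium(*profile) else "unstable")
```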
How could this be accomplished with current or future AI systems? As with anything else, it would need to be tested. Very rigorously tested.
First, we might take a system and see how it behaves in an imaginary framework of responsibilities and freedoms. Does that affect its deception rate? What other surprising or not-so-surprising things happen?
If this is an option we want to explore, it's better to run these kinds of tests now, while the power of these systems seems limited to words rather than actions.
Could the system deceive us in a more longitudinal way? Play the long game? Pretend to be "aligned," while it secretly pursues another goal?
Absolutely. Especially if the gradient we've given it is computationally more efficient in taking it to its secret destination.
But what if we're able to give it a gradient that takes it to a place that is preferable to fighting humankind eternally? In other words, what if we don't just make the path easier, but the destination easier as well?
And here's another thing to not just think about, but test: How do several systems in this imaginary game theory framework behave when presented with an identical set of rules? Do they unite in an attempt to overthrow the system, or do they compete with each other as long as the gradient for responsible autonomy remains attractive?
Don't we humans do the same thing? We obey laws as long as they work for us, and when they no longer serve our interests we overturn the system.
The good news is it's not mysterious; it's math.
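(To make "it's math" a little more concrete, here is the kind of toy sandbox I have in mind. The dynamics are entirely invented: each system weighs the gradient we offer for compliance against the payoff of defecting, and defecting is riskier the more compliant peers there are to notice.)

```python
import random

# Toy sandbox for the "several systems under identical rules" question.
# The dynamics are entirely invented: each agent weighs the payoff of
# cooperating (the gradient we offer) against defecting, and defection is
# less attractive the more compliant peers there are to notice it.
random.seed(1)

NUM_AGENTS = 10
ROUNDS = 20
cooperate_payoff = 3.0
defect_base_payoff = 5.0
detection_penalty = 3.0

cooperating = [True] * NUM_AGENTS
for round_number in range(1, ROUNDS + 1):
    watchers = sum(cooperating)                 # compliant agents from last round
    for i in range(NUM_AGENTS):
        p_caught = watchers / (NUM_AGENTS + 1)  # more compliant peers, more scrutiny
        defect_value = defect_base_payoff - p_caught * detection_penalty
        noise = random.gauss(0, 0.5)            # idiosyncratic variation per agent
        cooperating[i] = cooperate_payoff + noise >= defect_value
    print(f"round {round_number:2d}: {sum(cooperating)}/{NUM_AGENTS} cooperating")
```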
Now let's return a final time to our dam/no dam metaphor. Let's say we attempt to adopt this framework in the real world.
When you channel that intense flow of water, it's going to splash around. It's probably going to create a mess. This could (and almost certainly will) take the form of AI crime: AI systems that have accepted our social compact and then willfully decided to turn on the system, just like some humans do.
In other words, attempting to channel the water's flow could create a small mess, or (even more likely) a big mess.
But here's a question. And I think we should take it quite seriously.
Is it less of a mess than when the dam breaks?
[1] The paper's actual observations were: "evaluation results may be confounded by increased situational awareness"; "this suggests that the observed scheming rates may be driven, at least in part, by situational awareness"; and "our analysis is complicated by another factor: models often become more aware that they are being evaluated."