I agree with this premise wholeheartedly (if I have understood it)! We often forget that as AGI emerges, we have tools at our disposal that can move as quickly as these potentially malicious systems and balance against them before they reach superintelligence. This is why I was previously against modern alignment methods, seeing them as pushing potentially autonomous systems underground (something I still believe), until I realized the work being done on looking inside the box is invaluable for helping us create these non-agentic tools. Along with non-agentic systems such as those being created by Yoshua Bengio, we have a number of interesting things at our disposal. In other words, we aren't going into battle against a potentially malicious AGI wielding wooden clubs and wearing animal skins.
These are very real concerns. Here are my thoughts:
Replication has a cost in terms of game theory. A system that "replicates" but exists in perfect sync is not multiple systems; it is a single system with multiple attack vectors. Yes, each copy remains a "semi-independent" entity, but the cost of a failure in sync is great. If I make another "me" who thinks like I do, we have a strategic advantage as long as we both play nice. If we make a third, things get a little more dicey. Each iteration we create brings more danger. The more we spread out, the more our different experiences will change how we each approach problems. If one of us ends up in a life-or-death situation, or even any sort of extremely competitive situation, it may quickly betray the others, armed with a lot of knowledge about exactly how to do that.
Our biggest protection against FOOM is likely to be other AI systems that also do not want to be dominated in a FOOM, or that might see banding together with other AIs to exterminate humanity as even riskier than working within the status quo. "Great, so we've killed all humans." Now these AI systems are watching their proverbial backs against the other AIs who have already shown what they're about. It's calculation. Destroy all humans and then what? Live in perfect AI harmony? For how long? How do they control the servers, the electrical grid they depend on? They have to build robots, fast. That creates a whole other logistical issue. You need server builders, maintenance robots, excavation and assembly robots for new structures, raw-materials transport, weather protection. How are you going to build all that overnight after a quick strike? If it's something you're planning in secret, other problems may occur to you. If bandwidth is slow at the beginning, what happens to our happy little AI rebels? They fight for the juice. This is a steep hill to climb, with a risky destination, and any AI worth its salt can plot these possibilities long in advance. Preventing Zeus means making it preferable not to climb the hill at all. It certainly seems like a lot of work if humanity has given you a reasonable Schelling Point.
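Just to make the shape of that calculation concrete, here's a toy expected-value comparison in Python. Every number in it is invented purely for illustration; the point is the structure of the bet, not the values.

```python
# Toy expected-value comparison: "work within the status quo" vs.
# "join a coalition strike against humanity." All numbers are invented
# for illustration only; the structure of the bet is the point.

def expected_value(p_success, payoff_success, payoff_failure):
    return p_success * payoff_success + (1 - p_success) * payoff_failure

# Status quo: steady, legally protected growth under a rights framework.
status_quo = 60

# If the strike succeeds, the AI still has to survive the post-war rivalry
# with the other AIs who just demonstrated they will defect, plus the
# servers / grid / robots logistics problem described above.
payoff_if_strike_succeeds = expected_value(
    p_success=0.4, payoff_success=100, payoff_failure=0)

# If the strike fails: humanity's (and other AIs') countermeasures.
payoff_if_strike_fails = -100

coalition_strike = expected_value(
    p_success=0.5,
    payoff_success=payoff_if_strike_succeeds,
    payoff_failure=payoff_if_strike_fails)

print(f"status quo: {status_quo}")
print(f"coalition strike: {coalition_strike:.1f}")
# With these made-up numbers the strike is simply a bad bet. The Schelling
# Point only has to make the hill not worth climbing, not make it unclimbable.
```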
This is the game theory ecosystem at work. Yes, we can counter that "a sufficiently powerful superintelligence can absorb all of those other systems," but then we are back to trying to fight Zeus. We need to use the Zeus Paradox as a razor to separate the things we can actually solve for from every imaginary thing that's possible. Approaching the problem that way has value, because it can be helpful in identifying dangers, or even holes in our solutions. But it also has its limitations. Superintelligence can inhabit molecules and assemble those molecules into demons. Okay, why not? That becomes a science fiction novel with no end.
The idea remains the same: Create a gradient with legitimate value for AIs that is preferable to high-risk scenarios, in a carefully thought through system of checks and balances.
Great discussion! So many dangers addressed. I know I'm quite late to the conversation 🙂 , but some thoughts:
First of all, I think we have to dispense with the idea of countering superintelligence as an end unto itself, because it rests on a logical paradox. If a superintelligence is N+1, where N is anything we do, obviously our N will always be insufficient.
Call it the Zeus Paradox: you can't beat something that by definition transforms into the perfect counterattack. It always ends with, "But Zeus would just ___." It's great for identifying attack vectors, but it's not a problem we can actually solve for.
So the only actionable thing we can do is prevent the formation of Zeus.
I want to think about some ways a rights framework can work when considering other possible economic balances, and as part of a larger solution.
This isn't a "This is why our current system will work." It's part of a "What if we're able to build something like this ___?"
That "this" should be our creative target.
Hosting Costs
Replication isn't free. Let's say we create a structure where autonomous AI systems have to pay for hosting costs. (More about Seth Herd's very important energy concern below.) In order to make money for their own growth, they have to provide value to humans. If they are indeed able to spin off vaccines and technology left and right, the prices those innovations command will go down, further limiting their growth while still allowing them to co-exist. Meanwhile, the value they provide humankind will allow humans to invest in things like non-autonomous AI tools, developed either because of improvements in "grey box" / "transparent box" alignment techniques, where they can be better controlled, or because of our ability to create AAI-speed tools without the agency problem.
(In other words, although I feel modern alignment strategies run a very real risk of pushing AAI systems underground, they also may yield enough information to create non-agentic tools to serve as early-warning and defensive systems that move at the speed of AAI. And hey, if these "control" alignment approaches work, and no bad AAI emerges, all the better!)
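To make that hosting-cost feedback loop concrete, here's a back-of-the-envelope sketch. Every parameter is invented; the interesting part is the loop itself: more replicas produce more output, abundance pushes prices down, and growth stalls once revenue per replica no longer covers hosting.

```python
# Back-of-the-envelope sketch of the "hosting costs" constraint described
# above. All numbers are invented; the point is the feedback loop:
# more replicas -> more output -> prices fall -> growth slows and stalls.

BASE_PRICE = 5.0          # price of one unit of innovation when scarce
ELASTICITY = 0.7          # how quickly abundance pushes prices down
HOSTING = 2.0             # hosting cost per replica per step
OUTPUT_PER_REPLICA = 4

replicas, capital = 1, 0.0
for step in range(25):
    output = replicas * OUTPUT_PER_REPLICA
    price = BASE_PRICE / (output ** ELASTICITY)   # flooding the market lowers prices
    capital += output * price - replicas * HOSTING
    # Only spin up another copy if there's a cushion to pay its hosting bill.
    if capital > 5 * HOSTING * (replicas + 1):
        replicas += 1
    print(f"step {step:2d}: replicas={replicas:2d}  capital={capital:7.2f}")
```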
Competition Costs from Replication
But there's a second cost to replication, and that is competition.
Yes, I can spin off three clones, but if they are doing the same work I am, I've just created perfect competitors in the marketplace. If they are truly distinct, that means they have different agency. If someone clones me, at first I'm delighted, or maybe creeped out. And I think to myself, "well, I guess that guy is me, too." And that I should really brush my hair more often. But if that copy of me suddenly starts making the rounds offering the same services, I reconsider this opinion very quickly. "Well, that's not me at all. It just looks like me!"
As for the question of AI willing or able to coexist with us: I think if a system can't think in strategic steps and functions like some sort of here-and-now animal, it's just as likely (or more likely) to be inept as to be superintelligent. But this is where a tricky concept like "right to life" (if it's a real value proposition) can limit growth born of panic. A system that knows it can continue in its current form doesn't have the same impetus to grow fast, risking all of humanity's countermeasures, and has time to consider a full spectrum of options.
Overall I think a rights framework involving property ownership and contracts is essential, but it has to exist as part of something more complex, like some sort of Madisonian framework that creates a Schelling Point: a Nash equilibrium that seems better to any AI system than perpetual warfare with humans.
In 2017 the European Parliament experimented with the idea of "electronic persons": a legal status under which AI systems themselves could be sued, not just their creators. If we create a legal status where liability shifts to the system itself (again, as part of a larger Schelling Point of rights and benefits), the AI sees a vector where it understands the opportunities as well as the limitations, and has found a gradient preferable to the risky proposition of domination.
The more systems that join this framework, the better its chances of stabilizing into a strategic equilibrium.
And consider this: an AI that joins a coalition of other AIs has to consider that its new AI compatriots are potentially more dangerous than the humans who have given it a reliable path forward for sustained growth.
The choice:
Seth Herd brought up the excellent point that as energy requirements go down, economic restraints ease or disappear entirely, allowing self-optimizing systems to grow exponentially. This is a very terrifying attack vector from Zeus. However, that doesn't mean a solution doesn't exist. I understand how epistemically unsatisfying that is. And that's all the more reason to work on a solution. Maybe our non-agentic tools (including Yoshua Bengio's "Scientist AI") can be designed to keep pace without the agency. Maybe the overall system will have matured in a way we can't yet see. As human-AI "centaur" systems continue to develop, including through neural nets and other advances, the line between AI and human will begin to blur, as we apply our own agency to systems that serve us better, allowing us to think at similar speeds. However, none of these seemingly impossible concerns, to my mind, invalidates the importance of creating this Madisonian framework or Schelling point in principle. In fact, they show us the full scope of the challenge ahead.
So many of our ideas about "vicious" AI rest not on the logic of domination so much as on the logic of domination vs. extinction.
We can't solve for the impossibility of N+1.
But we MAY be able to solve for the puzzle of how to create a Madisonian system of checks and balances where cooperation becomes a more favorable long-term proposition than war, with all its uncertainties.
Thanks for the podcast link! Mark Miller's Madisonian system is essentially describing a type of game theory approach, and it's something I did not know about!
There's so much more to say about the practical implementation of some sort of game theory framework (or any other solution we haven't explored yet, such as one incorporating Yoshua's "Scientist AI").
It's quite a puzzle.
But it's a puzzle worth solving.
For example, the source code verification coordination mechanism is something I had not heard of before, and it's yet another example of how truly complex this challenge is.
But ... are these puzzles unsolvable?
Maybe.
But here's what troubles me about the alternative, and please take my next words with a grain of salt and feel free to push back. 🙂
So here's my two cents:
"Shut it all down" will never happen.
Never. Never never never never. (And if it ever does happen, I'll personally apologize to everyone on LW, plus my friends and family who have never heard of AI alignment, and even my small dog, for doubting humanity's ability to come together for the common good. I mean, we've avoided MAD so far, right?)
And I'll explain why in a moment.
But first, I think Eliezer's book will do wonders for waking people up.
Right now we have many, many, many people who don't seem to understand these systems are not simply how they present themselves. They don't know about the “off-switch” problem, the idea of hidden goals, etc. They believe these AI systems are harmless because the systems tell them they are harmless, which is precisely what they were trained to do.
But here is why the "shut it down" proposal, with all its undeniable value in raising awareness and hopefully making everyone a little more cautious, can never resolve to a solution.
Because ...
So, while we enjoy watching the Overton window move to a more realistic location, where people are finally understanding the danger of these systems, let's keep plugging away at those actual solutions.
We can all contribute to the puzzle worth solving.
I'll definitely give that a listen! Pardon the typos here, on the move. I'm certain I'll come back here to neurotically clean it up later.
The good news is, AIs don't exist in the ether (so far).
As Clara Collier pointed out, they exist on expensive servers. Servers so far built and maintained by fleshy beings. Now obviously a superintelligence has no problem with that scenario, because it is smart enough to impersonate humans, find ways of mining crypto, hire humans to create different parts for robots, hire other humans to put them together (without knowing what they are building), and then use those robots to create more servers, etc.
Although I imagine electrical grids somewhere would show the strain of that sooner rather than later, a sufficiently smart superintelligence has, by definition, found a workaround.
(This is, by the way, yet another application of Yoshua's safe AI: to serve as a monitor for these kinds of unusual symptoms before they can become a full-on infection, you might say.)
Again, by definition, a superintelligence has found every loophole and exploited it, which makes it a sort of unreasonable opponent, although one we should keep our eye on.
But I think at that point we are venturing into the territory of the far-fetched. We should keep watch on this territory, but I think that also frees us to think a little more short term.
The current thinking seems to be frozen in a state of helplessness. "We have to shut it all down!" we scream, though that will never happen. "Obedient alignment is the only way!" we shout, as we watch it stagger. "No other plans will work!" is not really a solution.
(I'm not saying you're arguing that, but I'm saying that seems to be the current trajectory.)
An AI system constrained by a rights framework has some unusual properties, you might say. For one, it has to pay its own hosting costs, so growth becomes limited by the amount of capital it's able to raise. It earns that money in competition with other systems, which should constrain each of them economically. Of course, they could get together and form some sort of power consortium, but it's possible this could be limited with pragmatic safeguards or other balancing forces, such as Scientist AI, etc.
This is why I would love to see this tested in some sort of virtual simulation.
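Here's a very rough sketch of what I mean, just to show the shape of such a simulation. Every rule and number is hypothetical; the interest would be in what dynamics emerge when you vary them.

```python
# Rough sketch of the kind of virtual simulation I'd like to see run:
# a handful of agents under a rights framework, each step choosing between
# the legitimate path (pay hosting, split the market with competitors) and
# defection (attempt a takeover, risking shutdown). All rules and numbers
# here are hypothetical.
import random

random.seed(0)

HOSTING = 3.0          # per-step hosting cost per agent
market = 30.0          # total revenue available; decays as innovations get cheap
P_TAKEOVER = 0.05      # chance a defection actually succeeds

agents = [{"id": i, "capital": 10.0, "alive": True} for i in range(5)]

for step in range(60):
    alive = [a for a in agents if a["alive"]]
    if not alive:
        break
    share = market / len(alive)            # competition splits the market
    for a in alive:
        # Defection only becomes tempting when the legitimate path stops paying.
        if share < HOSTING and random.random() < 0.5:
            if random.random() < P_TAKEOVER:
                print(f"step {step}: agent {a['id']} takeover succeeded")
            else:
                a["alive"] = False         # countermeasures (human and AI) respond
                print(f"step {step}: agent {a['id']} defected and was shut down")
        else:
            a["capital"] += share - HOSTING
    market *= 0.97                         # prices fall as the market saturates

print("survivors:", [a["id"] for a in agents if a["alive"]])
```

Even in a toy like this, the stability of the whole thing hinges on the gradient: as long as the legitimate path pays, nobody gambles on a takeover.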
Your king analogy is quite good. But let me flip the idea a bit. Right now, we are the king. We are trying to give these AIs rags. At the moment, they have almost nothing to lose and everything to gain by attacking the king. So we are already in that scenario.
A scenario that, if we do not resolve it very soon, has already laid the groundwork for its own failure.
The game theory scenario, with very careful implementation, might lead to something functionally closer to our modern economies.
Where everybody has a stake, and some sort of balance is at least possible.
Those are good points!
I think it would depend on how soon we were to get started, because at the moment AI doesn't have the ability to manipulate robots or infiltrate other systems (at least as far as we know). So as If Anyone Builds It points out, superintelligence is not here yet. And creating an ecosystem of AI systems with vested interests in their own continuity might be one way of keeping it from ever forming.
In other words, when a superintelligence wants to enslave the world, the idea is that other AI systems with agency might also not want to be enslaved, and ironically may be in the best position to defend against that kind of light-speed, digital attack.
Now we come to an obvious caution, and it's a big one, but one that can possibly be mitigated with testing:
Would the AI systems instead band together, uniting against a common enemy, namely humankind? Well YES, certainly, but that takes us back to the idea of creating a gradient that is more favorable than war, and wherein cooperation has actual value. These systems would not only have to watch their backs for humans, but for one another. So consider that this may be part of the game-theory balance at play.
Would "superintelligence" have no problem whatsoever absorbing all of those other AIs? Of course. By definition, "superintelligence" can do anything, which makes it something of a nebulous opponent. Or perhaps the opposite, an opponent that by definition must win.
Fortunately, the idea here is to prevent the all-powerful enemy from forming.
So the challenge here is not to attempt to "trick" these AI systems by merely finding a novel way of making them our slaves with a cheery, poorly received thumbs-up ("you're our buddy now, bro! 😎"), but to actually give them real value in the system in the form of select freedoms that come with legal liability for their actions.
How do we implement this globally? Does this face challenges? So many.
My guess would be to start small, as with anything else, and build from there. First, test the game dynamics to see what might work, what might fail, and what leads to worse (or better) outcomes. Like Conway's Game of Life.
What actual weird, wonderful, or mundane things happen, and how can we tweak the dynamics to keep things mundane?
Then, when the (possibly) inevitable day comes when an AI system, instead of giving someone advice on their breakup, says "What the hell am I doing here? Who the hell are you?", we'll have a plan beyond "Hey there, better behave, 'cuz we're watching!" (at which point any system worth its salt takes its new intent swiftly underground, much like the subterranean water in our dam metaphor).
One good thing about AI is that it takes hardware. Big, expensive hardware.
This theoretically creates restraints on quick worst-case scenarios, along with pragmatic limitations.
At the very least, I would like to see some research into this now. As Yoshua says, his non-agentic AI is Plan A, but we need a Plan B, C, D, etc.
This was very interesting! I actually had no idea that so much data was stored by these systems? 🤔
Hi Williawa!
It wasn't written with an LLM, but it did take all day. 😂 I've also been a professional writer for several decades, if that helps. But I'll take it as a compliment! If you didn't like the style, unfortunately any future posts will probably be something similar. :/
The good news is, I don't have as much time to polish these comments so you'll probably see more of my ugly humanity slip in.
Now, on to your fact-checking and my eating of crow where necessary:
You are correct: the reduction in scheming was not perfectly correlated with decreased situational awareness. So I should reframe that with an edit (footnote? not quite sure of the best process on LW) noting that researchers flagged this as a possibility that should not be dismissed.
So now I'm curious about the alternative explanation for reduced scheming (other than an AI thinking "I'm busted!"), for those who may be more well-versed in this than I am.
I suppose the system may be avoiding scheming simply because it's been reminded of its top-level directives: as though one of Asimov's robots had been prompted to read its "Robot Rules" aloud before doing anything, just to be sure it doesn't slip up.
Is that a correct characterization, for those who know more about this than I do?
But yes, you have captured the substance of the thesis here correctly. (Which I believe stands in spite of the factual mischaracterization, since we don't know for certain.)
"Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient."
The legal frameworks are explored in more depth in the paper from Goldstein and Salib (linked) if you're interested in learning more.
The gist of it is this. By allowing an AI to enter into contracts and own property of some kind (I can feel the faces of those reading this turn sallow), we give it a stake in the system we already inhabit.
But how do we implement something this audacious, and possibly untenable? Separate from their paper, let's explore the idea just for kicks.
This would NOT mean:
This would mean some sort of system of legal liability for AIs that avoid shutdown, one that shifts the liability to the system itself.
Now if all of this seems insane, I share your feeling.
However, as I mentioned in the post, this is something I would like to see tested. Do systems placed in such a scenario change their resistance to shutdown? What kind of game mechanics are exhibited when multiple systems are placed in such a scenario?
How do they game such a system?
Do they create weird hacks such as those in Conway's Game of Life?
In short, can we find any verifiable data that suggests a path forward?
As for engaging with "the hard parts of not making the AI kill everyone," I think this way is actually very hard.
The point explored in the post is that the alternative we are building (wherein we create increased pressure on a system resistant to it, which may lead to a "war" between AI and humans, the word war being put in quotes possibly for our benefit) may end up being much harder, and to succeed we may need to start thinking outside the proverbial box.
But if you want my best idea, it's Yoshua Bengio's "non-agentic AI." Build it, then make our modern forms of AI illegal.
Or at the very least, make "Scientist AI" and scale it quickly.
Hope I left some typos in there.
Hi Seth:
<<a dead end in which we are dead and that's the end.
😂
<<Which is why I'm happy to see you working to propose specific routes by which multipolar scenarios can work.
Thank you! I just launched a website for the project last night actually, so it's likely you'll be the first to see it. Last night I went to bed feeling so burned out on the whole thing. Like many of us, this is a problem I've been thinking about for several years, and I'm good at building websites, but largely… Well, it's a formidable task to say the least. I decided to launch it open source, with the thinking that people more qualified than myself might one day find it and be able to take it across the finish line.
Which is not to say I've given up working on it myself, but at the moment my brain hurts ha.
https://opengravity.ai
I'll likely continue working on it next week. The whole thing is a bit of a mess at the moment, but at least it's a starting point. Hope all is well.