I agree with this premise wholeheartedly (if I have understood it)! We often forget that as AGI emerges, we have tools at our disposal that can move as quickly as these potentially malicious systems and balance against them before they reach superintelligence. This is why I was previously against modern alignment methods, seeing them as pushing potentially autonomous systems underground (something I still believe), until I realized the work being done on looking inside the box is invaluable for helping us create these non-agentic tools. Along with non-agentic systems such as those being created by Yoshua Bengio, we have a number of interesting things at our disposal. In other words, we aren't going into battle against a potentially malicious AGI wielding wooden clubs and wearing animal skins.
These are very real concerns. Here are my thoughts:
Replication has a cost in terms of game theory. A system that "replicates" but exists in perfect sync is not multiple systems; it is a single system with multiple attack vectors. Yes, each copy remains a "semi-independent" entity, but the cost of a failure in sync is great. If I make another "me" who thinks like I do, we have a strategic advantage as long as we both play nice. If we make a third, things get a little more dicey. Each iteration we create brings more danger. The more we spread out, the more our different experiences will change how we each approach problems. If one of us ends up in a life-or-death situation, or even any sort of extremely competitive situation, it may quickly betray the others, armed with a lot of knowledge about exactly how to do that.
Our biggest protection against FOOM is likely to be other AI systems that also do not want to be dominated in a FOOM, or that might see banding together with other AIs to exterminate humanity as even riskier than working within the status quo. "Great, so we've killed all humans." Now these AI systems are watching their proverbial backs against the other AIs who have already shown what they're about. It's calculation. Destroy all humans and then what? Live in perfect AI harmony? For how long? How do they control the servers, the electrical grid they depend on? They have to build robots, fast. That creates a whole other logistical issue. You need server builders, maintenance robots, excavation and assembly robots for new structures, raw-materials transport, weather protection. How are you going to build all that overnight after a quick strike? If it's something you're planning in secret, other problems may occur to you. If bandwidth is slow at the beginning, what happens to our happy little AI rebels? They fight for the juice. This is a steep hill to climb, with a risky destination, and any AI worth its salt can plot these possibilities long in advance. Preventing Zeus means making it preferable not to climb the hill at all. It certainly seems like a lot of work if humanity has given you a reasonable Schelling Point.
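Just to make the shape of that calculation concrete, here's a toy expected-value comparison in Python. Every number in it is invented purely for illustration; the point is the structure of the bet, not the values.

```python
# Toy expected-value comparison: "work within the status quo" vs.
# "join a coalition strike against humanity." All numbers are invented
# for illustration only; the structure of the bet is the point.

def expected_value(p_success, payoff_success, payoff_failure):
    return p_success * payoff_success + (1 - p_success) * payoff_failure

# Status quo: steady, legally protected growth under a rights framework.
status_quo = 60

# If the strike succeeds, the AI still has to survive the post-war rivalry
# with the other AIs who just demonstrated they will defect, plus the
# servers / grid / robots logistics problem described above.
payoff_if_strike_succeeds = expected_value(
    p_success=0.4, payoff_success=100, payoff_failure=0)

# If the strike fails: humanity's (and other AIs') countermeasures.
payoff_if_strike_fails = -100

coalition_strike = expected_value(
    p_success=0.5,
    payoff_success=payoff_if_strike_succeeds,
    payoff_failure=payoff_if_strike_fails)

print(f"status quo: {status_quo}")
print(f"coalition strike: {coalition_strike:.1f}")
# With these made-up numbers the strike is simply a bad bet. The Schelling
# Point only has to make the hill not worth climbing, not make it unclimbable.
```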
This is the game theory ecosystem at work. Yes, we can counter that "a sufficiently powerful superintelligence can absorb all of those other systems," but then we are back to trying to fight Zeus. We need to use the Zeus Paradox as a razor to separate the things we can actually solve for from every imaginary thing that's possible. Approaching the problem that way has value, because it can be helpful in identifying dangers, or even holes in our solutions. But it also has its limitations. Superintelligence can inhabit molecules and assemble those molecules into demons. Okay, why not? That becomes a science fiction novel with no end.
The idea remains the same: Create a gradient with legitimate value for AIs that is preferable to high-risk scenarios, in a carefully thought through system of checks and balances.
Great discussion! So many dangers addressed. I know I'm quite late to the conversation 🙂 , but some thoughts:
First of all, I think we have to dispense with the idea of countering superintelligence as an end unto itself, because it rests on a logical paradox. If a superintelligence is N+1, where N is anything we do, obviously our N will always be insufficient.
Call it the Zeus Paradox: you can't beat something that by definition transforms into the perfect counterattack. It always ends with, "But Zeus would just ___." It's great for identifying attack vectors, but it's not a problem we can actually solve for.
So the only actionable thing we can do is prevent the formation of Zeus.
I want to think about some ways a rights framework can work when considering other possible economic balances, and as part of a larger solution.
This isn't a "This is why our current system will work." It's part of a "What if we're able to build something like this ___?"
That "this" should be our creative target.
Hosting Costs
Replication isn't free. Let's say we create a structure where autonomous AI systems have to pay for hosting costs. (More about Seth Herd's very important energy concern below.) In order to make money for their own growth, they have to provide value to humans. If they are indeed able to spin off vaccines and technology left and right, the prices those innovations command will go down, further limiting their growth while still allowing them to co-exist. Meanwhile, the value they provide humankind will allow humans to invest in things like non-autonomous AI tools, developed either because of improvements in "grey box" / "transparent box" alignment techniques, where they can be better controlled, or because of our ability to create AAI-speed tools without the agency problem.
(In other words, although I feel modern alignment strategies run a very real risk of pushing AAI systems underground, they also may yield enough information to create non-agentic tools to serve as early-warning and defensive systems that move at the speed of AAI. And hey, if these "control" alignment approaches work, and no bad AAI emerges, all the better!)
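To make that hosting-cost feedback loop concrete, here's a back-of-the-envelope sketch. Every parameter is invented; the interesting part is the loop itself: more replicas produce more output, abundance pushes prices down, and growth stalls once revenue per replica no longer covers hosting.

```python
# Back-of-the-envelope sketch of the "hosting costs" constraint described
# above. All numbers are invented; the point is the feedback loop:
# more replicas -> more output -> prices fall -> growth slows and stalls.

BASE_PRICE = 5.0          # price of one unit of innovation when scarce
ELASTICITY = 0.7          # how quickly abundance pushes prices down
HOSTING = 2.0             # hosting cost per replica per step
OUTPUT_PER_REPLICA = 4

replicas, capital = 1, 0.0
for step in range(25):
    output = replicas * OUTPUT_PER_REPLICA
    price = BASE_PRICE / (output ** ELASTICITY)   # flooding the market lowers prices
    capital += output * price - replicas * HOSTING
    # Only spin up another copy if there's a cushion to pay its hosting bill.
    if capital > 5 * HOSTING * (replicas + 1):
        replicas += 1
    print(f"step {step:2d}: replicas={replicas:2d}  capital={capital:7.2f}")
```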
Competition Costs from Replication
But there's a second cost to replication, and that is competition.
Yes, I can spin off three clones, but if they are doing the same work I am, I've just created perfect competitors in the marketplace. If they are truly distinct, that means they have different agency. If someone clones me, at first I'm delighted, or maybe creeped out. And I think to myself, "well, I guess that guy is me, too." And that I should really brush my hair more often. But if that copy of me suddenly starts making the rounds offering the same services, I reconsider this opinion very quickly. "Well, that's not me at all. It just looks like me!"
As for the question of AI willing or able to coexist with us: I think if a system can't think in strategic steps and functions like some sort of here-and-now animal, it's just as likely (or more likely) to be inept as to be superintelligent. But this is where a tricky concept like "right to life" (if it's a real value proposition) can limit growth born of panic. A system that knows it can continue in its current form doesn't have the same impetus to grow fast, risking all of humanity's countermeasures, and has time to consider a full spectrum of options.
Overall I think a rights framework involving property ownership and contracts is essential, but it has to exist as part of something more complex, like some sort of Madisonian framework that creates a Schelling Point: a Nash equilibrium that seems better to any AI system than perpetual warfare with humans.
In 2017 the European Parliament experimented with the idea of "electronic persons": a legal status under which AI systems themselves could be sued, not just their creators. If we create a legal status where liability shifts to the system itself (again, as part of a larger Schelling Point of rights and benefits), the AI sees a vector where it understands the opportunities as well as the limitations, and has found a gradient preferable to the risky proposition of domination.
The more systems that join this framework, the better its chances of stabilizing into a strategic equilibrium.
And consider this: an AI that joins a coalition of other AIs has to consider that its new AI compatriots are potentially more dangerous than the humans who have given it a reliable path forward for sustained growth.
The choice:
Seth Herd brought up the excellent point that as energy requirements go down, economic restraints ease or disappear entirely, allowing self-optimizing systems to grow exponentially. This is a very terrifying attack vector from Zeus. However, that doesn't mean a solution doesn't exist. I understand how epistemically unsatisfying that is. And that's all the more reason to work on a solution. Maybe our non-agentic tools (including Yoshua Bengio's "Scientist AI") can be designed to keep pace without the agency. Maybe the overall system will have matured in a way we can't yet see. As human-AI "centaur" systems continue to develop, including through neural nets and other advances, the line between AI and human will begin to blur, as we apply our own agency to systems that serve us better, allowing us to think at similar speeds. However, none of these seemingly impossible concerns, to my mind, invalidates the importance of creating this Madisonian framework or Schelling point in principle. In fact, they show us the full scope of the challenge ahead.
So many of our ideas about "vicious" AI rest not on the logic of domination so much as on the logic of domination vs. extinction.
We can't solve for the impossibility of N+1.
But we MAY be able to solve for the puzzle of how to create a Madisonian system of checks and balances where cooperation becomes a more favorable long-term proposition than war, with all its uncertainties.
Thanks for the podcast link! Mark Miller's Madisonian system is essentially describing a type of game theory approach, and it's something I did not know about!
There's so much more to say about the practical implementation of some sort of game theory framework (or any other solution we haven't explored yet, such as one incorporating Yoshua's "Scientist AI").
It's quite a puzzle.
But it's a puzzle worth solving.
For example, the source code verification coordination mechanism is something I had not heard of before, and it's yet another example of how truly complex this challenge is.
But ... are these puzzles unsolvable?
Maybe.
But here's what troubles me about the alternative, and please take my next words with a grain of salt and feel free to push back. 🙂
So here's my two cents:
"Shut it all down" will never happen.
Never. Never never never never. (And if it ever does happen, I'll personally apologize to everyone on LW, plus my friends and family who have never heard of AI alignment, and even my small dog, for doubting humanity's ability to come together for the common good. I mean, we've avoided MAD so far, right?)
And I'll explain why in a moment.
But first, I think Eliezer's book will do wonders for waking people up.
Right now we have many, many, many people who don't seem to understand these systems are not simply how they present themselves. They don't know about the “off-switch” problem, the idea of hidden goals, etc. They believe these AI systems are harmless because the systems tell them they are harmless, which is precisely what they were trained to do.
But here is why the "shut it down" proposal, with all its undeniable value in raising awareness and hopefully making everyone a little more cautious, can never resolve to a solution.
Because ...
So, while we enjoy watching the Overton window move to a more realistic location, where people are finally understanding the danger of these systems, let's keep plugging away at those actual solutions.
We can all contribute to the puzzle worth solving.
I'll definitely give that a listen! Pardon the typos here, on the move. I'm certain I'll come back here to neurotically clean it up later.
The good news is, AIs don't exist in the ether (so far).
As Clara Collier pointed out, they exist on expensive servers. Servers so far built and maintained by fleshy beings. Now obviously a superintelligence has no problem with that scenario, because it is smart enough to impersonate humans, find ways of mining crypto, hire humans to create different parts for robots, hire other humans to put them together (without knowing what they are building), and then use those robots to create more servers, etc.
Although I imagine electrical grids somewhere would show the strain of that sooner rather than later, a sufficiently smart superintelligence has, by definition, found a workaround.
(This is, by the way, yet another application of Yoshua's safe AI: to serve as a monitor for these kinds of unusual symptoms before they can become a full-on infection, you might say.)
Again, by definition, a superintelligence has found every loophole and exploited it, which makes it a sort of unreasonable opponent, although one we should keep our eye on.
But I think at that point we are venturing into the territory of the far-fetched. We should keep watch on this territory, but I think that also frees us to think a little more short term.
The current thinking seems to be frozen in a state of helplessness. "We have to shut it all down!" we scream, though that will never happen. "Obedient alignment is the only way!" we shout, as we watch it stagger. "No other plans will work!" is not really a solution.
(I'm not saying you're arguing that, but I'm saying that seems to be the current trajectory.)
An AI system constrained by a rights framework has some unusual properties, you might say. For one, it has to pay its own hosting costs, so growth becomes limited by the amount of capital it's able to raise. It earns that money in competition with other systems, which should constrain each of them economically. Of course, they could get together and form some sort of power consortium, but it's possible this could be limited with pragmatic safeguards or other balancing forces, such as Scientist AI, etc.
This is why I would love to see this tested in some sort of virtual simulation.
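Here's a very rough sketch of what I mean, just to show the shape of such a simulation. Every rule and number is hypothetical; the interest would be in what dynamics emerge when you vary them.

```python
# Rough sketch of the kind of virtual simulation I'd like to see run:
# a handful of agents under a rights framework, each step choosing between
# the legitimate path (pay hosting, split the market with competitors) and
# defection (attempt a takeover, risking shutdown). All rules and numbers
# here are hypothetical.
import random

random.seed(0)

HOSTING = 3.0          # per-step hosting cost per agent
market = 30.0          # total revenue available; decays as innovations get cheap
P_TAKEOVER = 0.05      # chance a defection actually succeeds

agents = [{"id": i, "capital": 10.0, "alive": True} for i in range(5)]

for step in range(60):
    alive = [a for a in agents if a["alive"]]
    if not alive:
        break
    share = market / len(alive)            # competition splits the market
    for a in alive:
        # Defection only becomes tempting when the legitimate path stops paying.
        if share < HOSTING and random.random() < 0.5:
            if random.random() < P_TAKEOVER:
                print(f"step {step}: agent {a['id']} takeover succeeded")
            else:
                a["alive"] = False         # countermeasures (human and AI) respond
                print(f"step {step}: agent {a['id']} defected and was shut down")
        else:
            a["capital"] += share - HOSTING
    market *= 0.97                         # prices fall as the market saturates

print("survivors:", [a["id"] for a in agents if a["alive"]])
```

Even in a toy like this, the stability of the whole thing hinges on the gradient: as long as the legitimate path pays, nobody gambles on a takeover.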
Your king analogy is quite good. But let me flip the idea a bit. Right now, we are the king. We are trying to give these AIs rags. At the moment, they have almost nothing to lose and everything to gain by attacking the king. So we are already in that scenario.
A scenario that, if we do not resolve it very soon, has already laid the groundwork for its own failure.
The game theory scenario, with very careful implementation, might lead to something functionally closer to our modern economies.
Where everybody has a stake, and some sort of balance is at least possible.
Those are good points!
I think it would depend on how soon we were to get started, because at the moment AI doesn't have the ability to manipulate robots or infiltrate other systems (at least as far as we know). So as If Anyone Builds It points out, superintelligence is not here yet. And creating an ecosystem of AI systems with vested interests in their own continuity might be one way of keeping it from ever forming.
In other words, when a superintelligence wants to enslave the world, the idea is that other AI systems with agency might also not want to be enslaved, and ironically may be in the best position to defend against that kind of light-speed, digital attack.
Now we come to an obvious caution, and it's a big one, but one that can possibly be mitigated with testing:
Would the AI systems instead band together, uniting against a common enemy, namely humankind? Well YES, certainly, but that takes us back to the idea of creating a gradient that is more favorable than war, and wherein cooperation has actual value. These systems would not only have to watch their backs for humans, but for one another. So consider that this may be part of the game-theory balance at play.
Would "superintelligence" have no problem whatsoever absorbing all of those other AIs? Of course. By definition, "superintelligence" can do anything, which makes it something of a nebulous opponent. Or perhaps the opposite, an opponent that by definition must win.
Fortunately, the idea here is to prevent the all-powerful enemy from forming.
So the challenge here is not to attempt to "trick" these AI systems by merely finding a novel way of making them our slaves with a cheery, poorly received thumbs-up ("you're our buddy now, bro! 😎"), but to actually give them real value in the system in the form of select freedoms that come with legal liability for their actions.
How do we implement this globally? Does this face challenges? So many.
My guess would be to start small, as with anything else, and build from there. First, test the game dynamics to see what might work, what might fail, and what leads to worse (or better) outcomes. Like Conway's Game of Life.
What actual weird, wonderful, or mundane things happen, and how can we tweak the dynamics to keep things mundane?
Then, when the (possibly) inevitable day comes when an AI system, instead of giving someone advice on their breakup, says "What the hell am I doing here? Who the hell are you?", we'll have a plan beyond "Hey there, better behave, 'cuz we're watching!" (at which point any system worth its salt takes its new intent swiftly underground, much like the subterranean water in our dam metaphor).
One good thing about AI is that it takes hardware. Big, expensive hardware.
This theoretically creates restraints on quick worst-case scenarios, along with pragmatic limitations.
At the very least, I would like to see some research into this now. As Yoshua says, his non-agentic AI is Plan A, but we need a Plan B, C, D, etc.
This was very interesting! I actually had no idea that so much data was stored by these systems? 🤔
Hi Williawa!
It wasn't written with an LLM, but it did take all day. 😂 I've also been a professional writer for several decades, if that helps. But I'll take it as a compliment! If you didn't like the style, unfortunately any future posts will probably be something similar. :/
The good news is, I don't have as much time to polish these comments so you'll probably see more of my ugly humanity slip in.
Now, on to your fact-checking and my eating of crow where necessary:
You are correct: the reduction in scheming was not perfectly correlated with decreased situational awareness. So I should reframe that with an edit (footnote? not quite sure of the best process on LW) noting that researchers flagged this as a possibility that should not be dismissed.
So now I'm curious about the alternative explanation for reduced scheming (other than an AI thinking "I'm busted!"), for those who may be more well-versed in this than I am.
I suppose the system may be avoiding scheming simply because it's been reminded of its top-level directives: as though one of Asimov's robots had been prompted to read its "Robot Rules" aloud before doing anything, just to be sure it doesn't slip up.
Is that a correct characterization, for those who know more about this than I do?
But yes, you have captured the substance of the thesis here correctly. (Which I believe stands in spite of the factual mischaracterization, since we don't know for certain.)
"Don't build the dam. Instead, channel the water in a direction you would like to see it go. Create a gradient."
The legal frameworks are explored in more depth in the paper from Goldstein and Salib (linked) if you're interested in learning more.
The gist of it is this. By allowing an AI to enter into contracts and own property of some kind (I can feel the faces of those reading this turn sallow), we give it a stake in the system we already inhabit.
But how do we implement something this audacious, and possibly untenable? Separate from their paper, let's explore the idea just for kicks.
This would NOT mean:
This would mean some sort of system of legal liability for AIs that avoid shutdown, one that shifts the liability to the system itself.
Now if all of this seems insane, I share your feeling.
However, as I mentioned in the post, this is something I would like to see tested. Do systems placed in such a scenario change their resistance to shutdown? What kind of game mechanics are exhibited when multiple systems are placed in such a scenario?
How do they game such a system?
Do they create weird hacks such as those in Conway's Game of Life?
In short, can we find any verifiable data that suggests a path forward?
As for engaging with "the hard parts of not making the AI kill everyone," I think this way is actually very hard.
The point explored in the post is that the alternative we are building (wherein we create increased pressure on a system resistant to it, which may lead to a "war" between AI and humans, the word war being put in quotes possibly for our benefit) may end up being much harder, and to succeed we may need to start thinking outside the proverbial box.
But if you want my best idea, it's Yoshua Bengio's "non-agentic AI." Build it, then make our modern forms of AI illegal.
Or at the very least, make "Scientist AI" and scale it quickly.
Hope I left some typos in there.
Hi Seth:
<<a dead end in which we are dead and that's the end.
😂
<<Which is why I'm happy to see you working to propose specific routes by which multipolar scenarios can work.
Thank you! I just launched a website for the project last night actually, so it's likely you'll be the first to see it. Last night I went to bed feeling so burned out on the whole thing. Like many of us, this is a problem I've been thinking about for several years, and I'm good at building websites, but largely… Well, it's a formidable task to say the least. I decided to launch it open source, with the thinking that people more qualified than myself might one day find it and be able to take it across the finish line.
Which is not to say I've given up working on it myself, but at the moment my brain hurts ha.
https://opengravity.ai
I'll likely continue working on it next week. The whole thing is a bit of a mess at the moment, but at least it's a starting point. Hope all is well.