High-level hopes for AI alignment

Does the orthogonality thesis apply to embodied agents?

My belief is that instrumental subgoals will lead to natural human value alignment for embodied agents with long enough time horizons, but the whole thing is contingent on problems with the AI's body.

Simply put, hardware sucks: it's always falling apart, and the AGI would likely see human beings as part of itself. There are no large-scale datacenters where _everything_ is automated, and even if there were one, who is going to repair the trucks to mine the copper to make the coils to go into the cooling fans that need to be periodically replaced?

If you start pulling strings on 'how much of the global economy needs to operate in order to keep a data center functioning', you end up with a huge portion of the global economy. Am I to believe that an AGI would decide to replace all of that with untested, unproven systems?

When I've looked into precisely what the AI risk researchers believe, the only paper I could find on likely convergent instrumental subgoals modeled the AI as a disembodied agent with read access to the entire universe, which I find questionable. I agree that if there were a disembodied mind with read access to the entire universe, the ability to write in a few places, and goals that didn't include "keep humans alive and well", then we'd be in trouble.

Can you help me find some resources on how embodiment changes the nature of convergent instrumental subgoals? This MIRI paper was the most recent thing I could find, but it's for non-embodied agents. Here is my objection to its conclusion.

Steven Byrnes (3y)

The orthogonality thesis is about final goals, not instrumental goals. I think what you’re getting at is the following hypothesis:

“For almost any final goal that an AGI might have, it is instrumentally useful for the AGI to not wipe out humanity, because without humanity keeping the power grid on (etc. etc.), the AGI would not be able to survive and get things done. Therefore we should not expect AGIs to wipe out humanity.”

See for example 1, 2, 3 making that point.

Some counterpoints to that argument would be:

- If that claim is true now, and for early AGIs, well it won’t be true forever. Once we have AGIs that are faster and cheaper and more insightful than humans, they will displace humans over more and more of the economy, and relatedly robots will rapidly spread throughout the economy, etc. And when the AGIs are finally in a position to wipe out humans, they will, if they’re misaligned. See for example this Paul Christiano post.
- Even if that claim is true, it doesn’t rule out AGI takeover. It just means that the AGI would take over in a way that doesn’t involve killing all the humans. By the same token, when Stalin took over Russia, he couldn’t run the Russian economy all by himself, and therefore he didn’t literally kill everyone, but he still had a great deal of control. Now imagine that Stalin could live forever, while gradually distributing more and more clones of himself in more and more positions of power, and then replace “Russia” with “everywhere”.
- Maybe the claim is not true even for early AGIs. For example I was disagreeing with it here. A lot depends on things like how far recursive self-improvement can go, whether human-level human-speed AGI can run on one Xbox GPU versus a datacenter of 10,000 high-end GPUs, and various questions like that. I would specifically push back on the relevance of “If you start pulling strings on 'how much of the global economy needs to operate in order to keep a data center functioning', you end up with a huge portion of the global economy”. There are decisions that trade off between self-sufficiency and convenience / price, and everybody will choose convenience / price. So you can’t figure out the minimal economy that would theoretically support chip production by just looking at the actual economy; you need to think more creatively, just like a person stuck on a desert island will think of resourceful solutions. By analogy, an insanely massive infrastructure across the globe supports my consumption of a chocolate bar for snack just now, but you can’t conclude from that observation that there’s no way for me to have a snack if that massive global infrastructure didn’t exist. I could instead grow grapes in my garden or whatever. Thus, I claim that there are much more expensive and time-consuming ways to get energy and chips and robot parts, that require much less infrastructure and manpower, e.g. e-beam lithography instead of EUV photolithography, and that after an AGI has wiped out humanity, it might well be able to survive indefinitely and gradually work its way up to a civilization of trillions of its clones colonizing the galaxy, starting with these more artisanal solutions, scavenged supplies, and whatever else. At the very least, I think we should have some uncertainty here.

Foyle (3y)

If any superintelligent AI (SAI) would be capable of wiping out humans should it decide to, it is better for humans to try to arrange initial conditions such that there are ultimately only a small number of such AIs, to reduce the probability of doom. The risk posed by 1 or 10 independent but vast SAIs is lower than that from a million or a billion independent but relatively less potent SAIs, where it may tend to P=1.

I have some hope that the physical universe will soon be fully understood and from there on prove relatively boring to SAI, and that the variety thrown up by the complex novelty and interactions of life might then be interesting to them.
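
The intuition about few versus many independent systems can be made concrete with a toy calculation. This sketch is not from the comment: it assumes every system has the same, independent chance of causing a catastrophe (the hypothetical parameter `p_each`), which is a strong simplification, since real systems would likely be correlated and would differ in capability.

```python
def p_any_catastrophe(p_each: float, n: int) -> float:
    """Chance that at least one of n independent systems causes a catastrophe.

    Assumes (hypothetically) an identical, independent per-system risk p_each.
    """
    if not 0.0 <= p_each <= 1.0:
        raise ValueError("p_each must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Complement of "none of the n systems causes a catastrophe".
    return 1.0 - (1.0 - p_each) ** n


if __name__ == "__main__":
    # Illustrative numbers only; none of these values come from the thread.
    for p_each, n in [(0.01, 1), (0.01, 10), (0.0001, 100_000)]:
        print(f"p_each={p_each}, n={n}: {p_any_catastrophe(p_each, n):.4f}")
```

Under these assumptions, the combined risk approaches 1 as `n` grows for any fixed `p_each` above zero, which is one reading of "may tend to P=1". Whether a few more potent systems are actually safer depends on how the per-system risk scales with capability, and the toy model says nothing about that.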

Disclosure: my wife Daniela is President and co-founder of Anthropic, which employs prominent researchers in “mechanistic interpretability” and hosts the site I link to for the term. ↩
Disclosure: I’m on the board of ARC, which wrote this document. ↩
Though not entirely ↩
The basic idea:
- A lot of security vulnerabilities might be the kind of thing where it’s clear that there’s some weakness in the system, but it’s not immediately clear how to exploit this for gain. An AI system with an unintended “aim” might therefore “save” knowledge about the vulnerability until it encounters enough other vulnerabilities, and the right circumstances, to accomplish its aim.
- But now imagine an AI system that is trained and rewarded exclusively for finding and patching such vulnerabilities. Unlike with the first system, revealing the vulnerability gets more positive reinforcement than just about anything else it can do (and an AI that reveals no such vulnerabilities will perform extremely poorly). It thus might be much more likely than the previous system to do so, rather than simply leaving the vulnerability in place in case it’s useful later.
- And now imagine that there are multiple AI systems trained and rewarded for finding and patching such vulnerabilities, with each one needing to find some vulnerability overlooked by others in order to achieve even moderate performance. These systems might also have enough variation that it’s hard for one such system to confidently predict what another will do, which could further lower the gains to leaving the vulnerability in place. ↩
This is a concept that only I understand. ↩
See here, here, and here. Also see the tail end of this Wait but Why piece, which draws on similar intuitions to the longer treatment in Superintelligence ↩

“Great news - I’ve tested this AI and it looks safe.” Why might we still have a problem?

| Problem | Key question | Explanation |
| --- | --- | --- |
| The Lance Armstrong problem | Did we get the AI to be actually safe or good at hiding its dangerous actions? | When dealing with an intelligent agent, it’s hard to tell the difference between “behaving well” and “appearing to behave well.” When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually “clean.” It later came out that he had been using drugs with an unusually sophisticated operation for concealing them. |
| The King Lear problem | The AI is (actually) well-behaved when humans are in control. Will this transfer to when AIs are in control? | It's hard to know how someone will behave when they have power over you, based only on observing how they behave when they don't. AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to take control of the world entirely. It's hard to know whether they'll take these opportunities, and we can't exactly run a clean test of the situation. Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne. |
| The lab mice problem | Today's "subhuman" AIs are safe. What about future AIs with more human-like abilities? | Today's AI systems aren't advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans. Like trying to study medicine in humans by experimenting only on lab mice. |
| The first contact problem | Imagine that tomorrow's "human-like" AIs are safe. How will things go when AIs have capabilities far beyond humans'? | AI systems might (collectively) become vastly more capable than humans, and it's ... just really hard to have any idea what that's going to be like. As far as we know, there has never before been anything in the galaxy that's vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can't be too confident that it'll keep working if AI advances (or just proliferates) a lot more. Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy). |
