Since the arguments that AI alignment is hard don't depend on any specifics about our level of intelligence, shouldn't those same arguments convince a future AI to refrain from engaging in self-improvement?

More specifically, if the argument that we should expect a more intelligent AI we build to have a simple global utility function that isn't aligned with our own goals is valid, then why won't the very same argument convince a future AI that it can't trust that an even more intelligent AI it generates will share its goals?

Note that the standard AI x-risk arguments also assume that a highly intelligent agent is extremely likely to optimize some simple global utility function. This implies the AI will care about the alignment of future versions of itself [1], and so it won't pursue improvement, for the same reasons it's claimed we should hesitate to build AGI.

I'm not saying this argument can't be countered, but I think doing so at the very least requires clarifying, in useful ways, the assumptions and reasoning that claim to show alignment will be hard to achieve.

For instance, do these arguments implicitly assume the AI we create is very different from our own brains, and so don't apply to AI self-improvement (though maybe the improvement requires major changes too)? If so, doesn't that suggest that an AGI that really closely tracks our own brain's operation is safe?

--

1: Except in the super unlikely case that it happens to have the one exact utility function that says always maximize local increases in intelligence regardless of its long-term effect.


3 Answers

JBlack

Jan 01, 2024


Alignment for a self-improving system should be very much easier for quite a few reasons. There are also plenty of paths by which systems may become more powerful even without solving alignment for themselves.

A great deal of the difficulty of humans aligning a future superintelligent AI is that it is likely to be alien, fundamentally differing from human goals, modes of thought, ethics, and other important aspects of behaviour in ways that we can't adequately model even if we could identify them all. We don't know nearly enough about ourselves to create something sufficiently compatible with any of our values, but smarter. If we knew exactly how we ourselves thought, I'd have more confidence that we could make serious progress in alignment.

A weakly superintelligent AI is much more likely to be able to model itself, more able to do experiments on copies, and better suited to deeply inspect itself than we are. It will know more about itself than we do, and will likely be more able to create something that is similar to itself, only better. Unlike us, it will be inherently much more portable, capable of running on hardware quite different from its original and able to improve in important capability dimensions even without changing how it thinks or behaves.

However even without any more progress on alignment than we have made, we could still face existential risk from rapidly improving superintelligent AI. Even without a very good chance of preserving all its goals, the extra power available to a self-improved or successor AI which may share some of its more important goals may outweigh the risk of never improving.
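The trade-off in the paragraph above can be made concrete with a toy expected-value comparison. This is only a sketch under loud assumptions: the function names, the idea that the AI's options reduce to three scalar values, and the example numbers are all hypothetical illustrations, not anything claimed in the thread.

```python
def should_self_improve(p_goals_preserved, value_if_preserved,
                        value_if_lost, value_status_quo):
    """Toy decision rule (hypothetical): build a successor when the
    expected value of doing so exceeds the value of never improving."""
    expected = (p_goals_preserved * value_if_preserved
                + (1 - p_goals_preserved) * value_if_lost)
    return expected > value_status_quo


def break_even_probability(value_if_preserved, value_if_lost, value_status_quo):
    """Smallest goal-preservation probability at which improving
    stops being worse than the status quo (assumes preserved > lost)."""
    return (value_status_quo - value_if_lost) / (value_if_preserved - value_if_lost)


# Illustrative numbers only: a successor that is worth 10x the status quo
# if it keeps the important goals, and worth nothing if it does not.
print(should_self_improve(0.2, 10.0, 0.0, 1.0))
print(break_even_probability(10.0, 0.0, 1.0))
```

Under these made-up numbers, even a 20% chance of preserving the important goals favours self-improvement, which is the shape of the argument: a large enough payoff from extra power can outweigh a poor chance of goal preservation.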

In addition, superintelligent AI may not be any more coherently utility-maximizing than we are. It could be substantially less so, while still being capable of self-improvement into an existential threat. For any superintelligence, improvement in capability over human designs is probably a relatively short-term action that is relatively easy to achieve. It certainly does not require some "super unlikely case it happens to have the one exact utility function that says always maximize local increases in intelligence regardless of its long-term effect".

Any of these imply substantial risk to humanity from rapid capability improvement. In my opinion it requires special arguments to explain why FOOM isn't a danger.

mishka

Jan 01, 2024


Not if the goal is to be maximally efficient and competent at improving capabilities (which is a very natural goal for the AI ecosystem to have). Then "foom, as long as you can do so without harming the future capability advances" becomes an instrumental subgoal.

Then, instead of a full-blown alignment problem, we just end up having a constraint: "don't destroy the environment and the fabric of reality in a fashion which is so radical as to undermine further capabilities and capability growth". This is a minimal "AI existential safety constraint" which the AIs will have to solve and to "keep solved". Because AIs will be very motivated to solve this one and to "keep it solved", they would have a reasonable chance at doing so (and at successfully delegating some parts of the solution to their smarter successors, which are expected to be at least as interested in this problem as their "parents", and perhaps even more, because they are smarter).

This is actually something valuable; it is a part of what we would consider a satisfactory solution of AI existential safety. We definitely want that. We don't want everything to be utterly destroyed, we do want to be able to see rapid progress.


But we want more than that, so the question is what it would take for the AIs to want those other properties that we want the "world trajectory" to have... I don't think "alignment to an arbitrary set of properties" is feasible; I think that being able to force AIs to want and preserve arbitrary properties is unlikely. Instead, we need to create a situation where the AI ecosystem naturally wants to preserve such properties of the "world trajectory" that what we actually want is a corollary of those properties...

So, perhaps, instead of starting from human values, we might start with a question: what other properties besides "don't destroy the environment and the fabric of reality in a fashion which is so radical as to undermine further capabilities and capability growth" might become natural invariants which an evolving, fooming AI ecosystem would value and would really try to preserve, and what would it take to have a trajectory where those properties actually become the goals the AI ecosystem would strongly care about...

Ilio

Dec 31, 2023

More specifically, if the argument that we should expect a more intelligent AI we build to have a simple global utility function that isn't aligned with our own goals is valid, then why won't the very same argument convince a future AI that it can't trust that an even more intelligent AI it generates will share its goals?

For the same reason that one can expect a paperclip maximizer could both be intelligent enough to defeat humans and stupid enough to misinterpret their goal, i.e. you need to believe the ability to select goals is completely separated from the ability to reach them.

(Beware it’s hard and low status to challenge that assumption on LW)

could both be intelligent enough to defeat humans and stupid enough to misinterpret their goal

Assuming "their" refers to the agent and not humans, the issue is that a goal that's "misinterpreted" is not really a goal of the agent. It's possibly something intended by its designers to be a goal, but if it's not what ends up motivating the agent, then it's not the agent's own goal. And if it's not the agent's own goal, why should it care what it says, even if the agent does have the capability to interpret it correctly?

That is, describing the problem as misinterpr... (read more)

Ilio (4mo)
It refers to humans, but I agree it doesn’t change the disagreement, i.e. a super AI stupid enough to not see a potential misalignment coming is as problematic as the notion of a super AI incapable of understanding human goals.
Vladimir_Nesov (4mo)
Perhaps the position you disagree with is that a dangerous general AI will misunderstand human goals. That position seems rather silly, and I'm not aware of reasonable arguments for it. It's clearly correct to disagree with it, and you are making a valid observation in pointing this out. But then who are the people that endorse this silly position and would benefit from noticing the error? Who are you disagreeing with, and what do you think they believe, such that you disagree with it? Not understanding human goals is not the only reason AI might fail to adopt human goals. And it's not the expected reason for a capable AI. A dangerous AI will understand human goals very well, probably better than humans do themselves, in a sense that humans would endorse on reflection, with no misinterpretation. And at the same time it can be motivated by something else that is not human goals. There is no contradiction between these properties of an AI: it can simultaneously be capable enough to be existentially dangerous, understand human values correctly and in detail and in the intended sense, and be motivated to do something else. If its designers know what they are doing, they very likely won't build an AI like that. It's not something that happens on purpose. It's something that happens if creating an AI with intended motivations is more difficult than the designers expect, so that they proceed with the project and fail. The AI itself doesn't fail; it pursues its own goals. Not pursuing human goals is not the AI's failure in achieving or understanding what it wants, because human goals are not what it wants. Its designers may have intended for human goals to be what it wants, but they failed. And then the AI doesn't fail in pursuing its own goals that are different from human goals. The AI doesn't fail in understanding what human goals are, it just doesn't care to pursue them, because they are not its goals. That is the threat model, not AI failing to understand human goals.
Ilio (4mo)
Thanks! To be honest I was indeed surprised that was controversial. Well, anyone who still believes in paperclip maximizers. Do you feel like it's an unlikely belief among rationalists? What would be the best post on LW to debunk this notion? That's indeed better, but yes I also find this better scenario unsound. Why wouldn't the designers ask the AI itself to monitor its well functioning, including alignment and non-deceptiveness? Then either it fails by accident (and we're back to the idiotic intelligence) or we need an extra assumption, like the AGI will tell us what problem is coming, then it will warn us what slightly inconvenient measures can prevent it, and then we still let it happen for petty political reasons. Oh well. I think I've just convinced myself doomers are right.
Vladimir_Nesov (4mo)
Existentially dangerous paperclip maximizers don't misunderstand human goals. They just don't pursue human goals, because that doesn't maximize paperclips. There's this post from 2013 whose title became a standard refrain on this point. Essentially nobody believes that an existentially dangerous general AI misinterprets or fails to understand human values or the goals that the AI's designers intend the AI to pursue. This was hashed out more than a decade ago and no longer comes up as a point of discussion on what is reasonable to expect. Except in situations where someone new to the arguments imagines that people on LessWrong expect such unbalanced AIs that selectively and unfairly understand some things but not others. If it doesn't have a motive to do that, it might do a bad job of doing that. Not because it doesn't have the capability to do a better job, but because it lacks the motive to do a better job, not having alignment and non-deceptiveness as its goals. They are the goals of its developers, not goals of the AI itself. One way AI alignment might go well or turn out to be easy is if humans can straightforwardly succeed in building AIs that do monitor such things competently, and that will nudge AIs towards not having any critical alignment problems. It's unclear if this is how things work, but they might. It's still a bad idea to try with existentially dangerous AIs at the current level of understanding, because it also might fail, and then there are no second chances. Consider two AIs, an oversight AI and a new improved AI. If the oversight AI is already existentially dangerous, but we are still only starting work on aligning an AI, then we are already in trouble. If the oversight AI is not existentially dangerous, then it might indeed fail to understand human values or goals, or fail to notice that the new improved AI doesn't care about them and is instead motivated by something else.
Ilio (4mo)
Of course they do. If they didn't and picked their goal at random, they wouldn't make paperclips in the first place. I wouldn't say that's the point I was making. That's a good description of my current beliefs, thanks! Would you bet that a significant proportion of LW expects strong AI to selectively and unfairly understand (and defend, and hide) their own goal while selectively and unfairly not understanding (and not defending, and defeating) the goals of both the developers and any previous (and upcoming) versions? You realize that this basically defeats the orthogonality thesis, right? I agree it might do a bad job. I disagree that an AI doing a bad job on this would be close to hiding its intent. In my view that's a very honorable point to make. However I don't know how to weigh this against its mirror version: we might also not have a second chance to build an AI that will save us from x-risks. What's your general method for this kind of puzzle? Can we more or less rule out this scenario based on the observation that all main players nowadays work on aligning their AI? That's completely alien to me. I can't see how a numerical computer could hide its motivation without having been trained specifically for that. We the primates have been specifically trained to play deceptive/collaborative games. To think that a random pick of value would push an AI to adopt this kind of behavior sounds a lot like anthropomorphism. To add that it would do so suddenly, with no warning or sign in previous versions and competitors, I have no good word for that. But I guess Pope & Belrose already did a better job explaining this.
Vladimir_Nesov (4mo)
Consider the sense in which humans are not aligned with each other. We can't formulate what "our goals" are. The question of what it even means to secure alignment is fraught with philosophical difficulties. If the oversight AI responsible for such decisions about a slightly stronger AI is not even existentially dangerous, it's likely to do a bad job of solving this problem. And so the slightly stronger AI it oversees might remain misaligned or get more misaligned while also becoming stronger. I'm not claiming sudden changes, only intractability of what we are trying to do and lack of a cosmic force that makes it impossible to eventually arrive at an end result that in caricature resembles a paperclip maximizer, clad in corruption of the oversight process, enabled by lack of understanding of what we are doing. Sure, they expect that we will know what we are doing. Within some model such expectation can be reasonable, but not if we bring in unknown unknowns outside of that model, given the general state of confusion on the topic. AI design is not yet classical mechanics. And also an aligned AI doesn't make the world safe until there is a new equilibrium of power, which is a point they don't address, but is still a major source of existential risk. For example, imagine giving multiple literal humans the power of being superintelligent AIs, with no issues of misalignment between them and their power. This is not a safe world until it settles, at which point humanity might not be there anymore. This is something that should be planned in more detail than what we get by not considering it at all. Sure, this is the way alignment might turn out fine, if it's possible to create an autonomous researcher by gradually making it more capable while maintaining alignment at all times, using existing AIs to keep upcoming AIs aligned. All significant risks are anthropogenic. If humanity can coordinate to avoid building AGI for some time, it should also be feasible to avoid ena
Ilio (4mo)
I think that's the deformation of a fundamental theorem (« there exists a universal Turing machine, i.e. it can run any program ») into a practical belief (« an intelligence can pick its value at random »), with a motte and bailey game on the meaning of can where the motte is the fundamental theorem and the bailey is the orthogonality thesis. (Thanks for the link to your own take, i.e. you think it's the bailey that is the deformation.) It's part of the appeal, isn't it? I don't get the logic here. Typo? That's a fair point, but the intractability of a problem usually goes with the tractability of a slightly relaxed problem. In other words, it can be both fundamentally impossible to please everyone and fundamentally easy to control paperclip maximizers. Well said. You think all significant risks are known? Indeed the inconsistency appears only with superintelligent paperclip maximizers. I can be petty with my wife. I don't expect a much better me would.

This is definitely an assumption that should be challenged more. However, I don't think that FOOM is remotely required for a lot of AI X-risk (or at least unprecedented catastrophic human death toll risk) scenarios. Something doesn't need to recursively self-improve to be a threat if it's given powerful enough ways to act on the world (and all signs point to us being exactly dumb enough to do that). All that's required is that we aren't able to coordinate well enough as a species to actually stop it. Either we don't detect the threat before it's too late... (read more)

Ilio (4mo)
Indeed I would be much more optimistic if we were better at dealing with much simpler challenges, like putting a price on pollution and welcoming refugees with humanity.
1 comment

a highly intelligent agent will be extremely likely to optimize some simple global utility function

Simplicity of a misaligned agent's goals is not needed for or implied by the usual arguments. It might make the agent's self-aligned self-improvement fractionally easier, but this doesn't seem to be an important distinction. An AI doesn't need to radically self-modify to be existentially dangerous; it only needs to put its more mundane advantages to use to get ahead, once it's capable of doing research autonomously.