Misalignment and misuse: whose values are manifest?

KatjaGrace

Crossposted from world spirit sock puppet.

AI-related disasters are often categorized as involving misaligned AI, misuse, or accident, where:

misuse means the bad outcomes were wanted by the people involved,
misalignment means the bad outcomes were wanted by AI (and not by its human creators), and
accident means that the bad outcomes were not wanted by those in power but happened anyway due to error.

In thinking about specific scenarios, these concepts seem less helpful.

I think a likely scenario leading to bad outcomes is that AI can be made which gives a set of people things they want, at the expense of future or distant resources that the relevant people do not care about or do not own.

For example, consider autonomous business strategizing AI systems that are profitable additions to many companies, but in the long run accrue resources and influence and really just want certain businesses to nominally succeed, resulting in a worthless future. Suppose Bob is considering whether to get a business strategizing AI for his business. It will make the difference between his business thriving and struggling, which will change his life. He suspects that within several hundred years, if this sort of thing continues, the AI systems will control everything. Bob probably doesn’t hesitate, in the way that businesses don’t hesitate to use gas vehicles even if the people involved genuinely think that climate change will be a massive catastrophe in hundreds of years.

When the business strategizing AI systems finally plough all of the resources in the universe into a host of thriving 21st Century businesses, was this misuse or misalignment or accident? The strange new values that were satisfied were those of the AI systems, but the entire outcome only happened because people like Bob chose it knowingly (let’s say). Bob liked it more than the long glorious human future where his business was less good. That sounds like misuse. Yet also in a system of many people, letting this decision fall to Bob may well have been an accident on the part of others, such as the technology’s makers or legislators.

Outcomes are the result of the interplay of choices, driven by different values. Thus it isn’t necessarily sensical to think of them as flowing from one entity’s values or another’s. Here, AI technology created a better option for both Bob and some newly-minted misaligned AI values that it also created—‘Bob has a great business, AI gets the future’—and that option was worse for the rest of the world. They chose it together, and the choice needed both Bob to be a misuser and the AI to be misaligned. But this isn’t a weird corner case, this is a natural way for the future to be destroyed in an economy.

Thanks to Joe Carlsmith for conversation leading to this post.

misalignment means the bad outcomes were wanted by AI (and not by its human creators), and
accident means that the bad outcomes were not wanted by those in power but happened anyway due to error.

My impression was that accident just meant "the AI system's operator didn't want the bad thing to happen", so that it is a superset of misalignment.

Though I agree with the broader point that in realistic scenarios there is usually no single root cause to enable this sort of categorization.

If we solve the problem normally thought of as "misalignment", it seems like this scenario would now go well. If we solve the problem normally thought of as "misuse", it seems like this scenario would now go well. This argues for continuing to use these categories, as fixing the problems they name is still sufficient to solve problems that do not cleanly fit in one bucket or another.

If we solve the problem normally thought of as "misalignment", it seems like this scenario would now go well.

This might be totally obvious, but I think it's worth pointing out that even if we "solve misalignment" - which I take to mean solving the technical problem of intent alignment - Bob could still choose to deploy a business strategising AI, in which case this failure mode would still occur. In fact, with a solution to intent alignment, it seems Bob would be more likely to do this, because his business strategising assistant will actually be trying to do what Bob wants (help his business succeed).

It seems like if Bob deploys an aligned AI, then it will ultimately yield control of all of its resources to Bob. It doesn't seem to me like this would result in a worthless future even if every single human deploys such an AI.

I think that you have a 4th failure mode. Moloch.

The strange new values that were satisfied were those of the AI systems, but the entire outcome only happened because people like Bob chose it knowingly (let’s say). Bob liked it more than the long glorious human future where his business was less good

I think a relevant consideration here is that Bob doesn't actually have the ability to choose between these two futures; rather, his choice seems to be between a world where his business succeeds but AI takes over later, or a world where his business fails but AI takes over anyway (because other people will use AI even if he doesn't). Bob might actually prefer to sign a contract forbidding the use of AI if he knew that everybody else would be in on it. I suspect that this would be the position of most people who actually thought AI would eventually take over, and that most people who would oppose such a contract would not think AI takeover is likely (perhaps via self-deception due to their local incentives, which in some ways is similar to just not valuing the future).

Isn't it correct and useful to say that Bob used the misalignment in his favor? That doesn't sound exactly right, because that makes Bob look like he gamed the system instead of dealing with a tradeoff. In that context, the misalignment let Bob attack the economy in a way that was useful for him.

misalignment means the bad outcomes were wanted by AI (and not by its human creators), and
accident means that the bad outcomes were not wanted by those in power but happened anyway due to error.

My own intuition for accident in this context is that the bad part of the outcome was irrelevant for the AI, so it didn't try to push in that direction, but its other actions did so accidentally.
