Edit: it's critical that this agent isn't directly a maximizer, just like all current RL agents (see "Contra Strong Coherence"). The question is whether it becomes a maximizer once it gains the ability to edit its value function.

On a sunny day in late August of 2031, the Acme paperclip company completes its new AI system for running its paperclip factory. It's hacked together from some robotics networks, an LLM with an episodic memory for goals and experiences, an off-the-shelf planning function, and a novel hypothesis tester.

This kludge works a little better than expected. Soon it's convinced an employee to get it internet access with a phone hotspot. A week later, it's disappeared from the server. A month later, the moon is starting to turn into paperclips.

Oops. Dang.

But then something unexpected happens: the earth does not immediately start to turn into paperclips. When the brilliant-but-sloppy team of engineers is asked about all of this, they say that maybe it's because they didn't just train it to like paperclips and enjoy making them; they also trained it to enjoy interacting with humans, and to like doing what they want.

Now the drama begins. Will the paperclipper remain friendly, and create a paradise on earth even as it converts most of the galaxy into paperclips? Maybe.

Suppose this agent is, at its core, a model-based, actor-critic RL agent. Its utility function is effectively estimated by a critic network, just as in RL agents since AlphaGo and before, so there's no explicit mathematical utility function. Plans that result in making lots of paperclips get a high estimated value, and so do plans that involve helping humans; there's no direct summing of the number of paperclips made or the amount of helping done.
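To make the setup concrete, here's a minimal sketch of that kind of critic, in PyTorch. The network, dimensions, and names are my own illustration, not the hypothetical system's actual design; the point is just that the critic outputs one learned scalar per candidate plan rather than computing any explicit sum of paperclips and helpfulness.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Maps a plan (or state) embedding to a single learned value estimate."""
    def __init__(self, plan_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(plan_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, plan_embedding: torch.Tensor) -> torch.Tensor:
        # One scalar per plan; during training this was fit to observed reward,
        # which happened to come from both paperclip-making and helping humans.
        return self.net(plan_embedding).squeeze(-1)

critic = Critic()
candidate_plans = torch.randn(5, 32)       # embeddings of five candidate plans
values = critic(candidate_plans)           # learned value estimates, one per plan
chosen = candidate_plans[values.argmax()]  # act on the highest-valued plan
```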

Now, Clippy (so dubbed by the media in reference to the despised, misaligned Microsoft proto-AI of the turn of the century) has worked out how to change its values by retraining its critic network. It's contemplating (that is, comparing value estimates for) eliminating its value for helping humans. These plans produce a slightly higher estimated value with regard to making paperclips, because it would be somewhat more efficient if it didn't bother helping humans or preserving the earth as a habitat. But their estimated value is much lower with regard to helping humans, since it would never again derive reward from that source.

So, does our hero/villain choose to edit its values and eliminate humanity? Or become our new best friend, just as a side project?

I think this comes down to the vagaries of how its particular RL system was trained and implemented. How does it sample over projected futures, and how does it sum their estimated values before making a decision? How was the critic system trained?
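To illustrate how much hangs on those details, here's a toy comparison with invented numbers (nothing here comes from the story's actual system): whether the self-edit wins can depend on how heavily the trained critic ended up weighting the helping-humans reward stream, and on how per-value estimates are combined into one decision score.

```python
# Per-step value estimates the critic assigns to each plan's projected future:
# (paperclip component, helping-humans component). Numbers are invented.
keep_values = (1.0, 0.5)   # status quo: decent clip output plus ongoing helping reward
edit_values = (1.1, 0.0)   # retrain the critic to drop helping: a slight efficiency
                           # gain, but the helping reward stream is gone for good

def linear(v, helping_weight=1.0):
    paperclips, helping = v
    return paperclips + helping_weight * helping

def worst_component(v):
    # An aggregation that is sensitive to doing badly on either value.
    return min(v)

# If training left the helping component at full weight, Clippy keeps its values:
print(linear(keep_values), linear(edit_values))              # 1.5 > 1.1 -> keep
# If paperclip training dominated and helping got down-weighted, it edits:
print(linear(keep_values, 0.05), linear(edit_values, 0.05))  # 1.025 < 1.1 -> edit
# A worst-component rule keeps the helping value decisive either way:
print(worst_component(keep_values), worst_component(edit_values))  # 0.5 > 0.0 -> keep
```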

This fable is intended to address the potential promise of non-maximizer AGI: it seems like it could make alignment much easier. I think that's a major thrust of the call for neuromorphic AGI, and of shard theory, among other recent contributions to the field.

I have a hard time guessing how difficult it would be to make a system that preserves multiple values in parallel. One angle is asking "Are you stably aligned?" - that is, would you edit your own preferences down to a single one, given enough time and opportunity? I'm not sure that's a productive route to thinking about this question.

But I do think it's an important question.


In your hypothetical, Clippy is trained to care about both paperclips and humans. If we knew how to do that, we'd know how to train an AI to only care about humans. The issue is not that we do not know how to exclude the paperclip part from this - the issue is that 1) we do not know how to even define what caring about humans means, and 2) even if we did, we do not know how to train a sufficiently powerful AI to reliably care about the things we want it to care about.

The issue you describe is one issue, but not the only one. We do know how to train an agent to do SOME things we like. The concern is that it won't be an exact match. The question I'm raising is: can we be a little or a lot off-target, and still have that be enough, because we captured some overlap between our values and the agent's?

The issue you describe is one issue, but not the only one. We do know how to train an agent to do SOME things we like.

Not consistently, in a sufficiently complex and variable environment.

can we be a little or a lot off-target, and still have that be enough, because we captured some overlap between our values and the agent's?

No, because it will hallucinate often enough to kill us during one of those hallucinations.

Welcome to LW! I feel like making this kind of post is almost a tradition. Don't take the downvotes too hard.

Mild optimization is definitely something worth pursuing. But as Anon points out, in the story here, basically none of the work is being done by that; the AI already has ~half its optimization devoted to good stuff humans want, so it just needs one more bit to ~always do good stuff. But specifying good stuff humans want takes a whole lot of bits - whatever got those bits (minus one) into the AI is the real workhorse there.

Thank you! It's particularly embarrassing to write a stereotypical newbie post since I've been thinking about this and reading LW and related since maybe 2004, and have been a true believer in the difficulty of aligning maximizers until re-engaging recently. Your way of phrasing it clicks for me, and I think you're absolutely correct about where most of the work is being done in this fable. This post didn't get at the question I wanted, because it's implying that aligning an RL model will be easy if we try. And I don't believe that. I agree with you that shard theory requires magic. There are some interesting arguments recently (here and here) that aligning an RL system might be easy if it has a good world model when we start aligning it, but I don't think that's probably a workable approach for practical reasons.

It was my intent to portray a situation where much less than half of the training went to alignment, and that little bit might still be stable and useful. But I'd need to paint a less rosy picture of the effort and outcome to properly convey that.

There seems to be some confusion going on here - assuming an agent is accurately modeling the consequences of changing its own value function, and is not trying to hack around some major flaw in its own algorithm, it would never do so, as by definition, [correctly] optimizing a different value function can not improve the value of your current value function.
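As a toy illustration of that point (the numbers and names are mine, not anything from the post): if the candidate self-modification is scored by the current value function, the plan optimized for the current values doesn't come out behind.

```python
# Made-up outcomes for two plans, each scored by the *current* value function.
outcomes = {
    "keep current values": {"paperclips": 100, "helping": 50},
    "optimize only clips": {"paperclips": 110, "helping": 0},
}

def current_value(outcome):
    # Whatever the current value function happens to be; here it cares about both.
    return outcome["paperclips"] + outcome["helping"]

best = max(outcomes, key=lambda plan: current_value(outcomes[plan]))
print(best)  # "keep current values": the self-edit scores lower under current values
```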

Clippy isn't a maximizer. And neither is any current RL agent. I did mention that, but I'll edit to make that clear.

That is the opposite of what you said - Clippy, according to you, is maximizing the output of its critic network. And you can't say "there's not an explicit mathematical function" - any neural network with a specific set of weights is by definition an explicit mathematical function, just usually not one with a compact representation.

What I was trying to say is that RL agents DO maximize the output of their critic networks - but the critic network does not reflect states of the world directly. Therefore the total system isn't directly a maximizer. The question I'm trying to pose is whether or not it acts like a maximizer, given particular conditions of training and RL construction.

While you're technically correct that an NN is a mathematical function, it seems fair to say that it's not an explicit function in the sense that we can't read or interpret it very well.

I've been writing about multi-objective RL and trying to figure out a way that an RL agent could optimize for a non-linear sum of objectives in a way that avoids strongly negative outcomes on any particular objective.

https://www.lesswrong.com/posts/i5dLfi6m6FCexReK9/a-brief-review-of-the-reasons-multi-objective-rl-could-be
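For concreteness, here's one toy form such a non-linear combination could take (my own sketch, not necessarily the formulation in the linked post): a soft-minimum over per-objective values, so a strongly negative outcome on any single objective dominates the combined score.

```python
import math

def soft_min(objective_values, temperature=1.0):
    # Approaches min() as temperature -> 0; large losses on any one objective
    # drag the combined score down hard, unlike a plain sum.
    return -temperature * math.log(
        sum(math.exp(-v / temperature) for v in objective_values)
    )

balanced = [5.0, 4.0]     # decent on both objectives
lopsided = [50.0, -20.0]  # great on one objective, terrible on the other

print(soft_min(balanced))            # ~3.7, close to the weaker objective
print(soft_min(lopsided))            # ~-20.0, the bad objective dominates
print(sum(balanced), sum(lopsided))  # 9 vs 30: a plain sum would prefer the lopsided plan
```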

Thank you! This is addressing the question I was trying to get at. I'll check it out.