I'm new here, so I apologize if this is a common question, or if I'm using the wrong terms, or the wrong framework for approaching this. I also apologize if this post is badly structured. When I talk about alignment, so many of the ideas seem to have rapidly branching dependencies, and sometimes fold back in on themselves, which makes it hard for me to discuss them coherently.

Whenever I see people discuss bad alignment, it seems to be regarding agents that either immediately perform an attack, or, in the slightly more sophisticated thought experiment, are secretly antagonistic, but outwardly aligned until a future moment where they know they can successfully pursue their true reward function.

Instead, I would like to investigate the possibility of an agent which both appears aligned and internally "believes" itself to be aligned (it is not deceptive), and yet becomes unaligned at some later point.

With a hashing algorithm, when you hash a sequence, you ideally receive a unique value for any given input. However, because the output has a fixed length while the inputs do not, every hashing algorithm admits collisions, where one input produces the same output hash as some completely unrelated input. With a good hash, such collisions are just extremely hard to find.
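To make the hashing analogy concrete, here is a minimal sketch that finds a collision by brute force. It assumes a deliberately weakened hash (SHA-256 truncated to two bytes) so the search finishes instantly; the names `short_hash` and `find_collision` are made up for this illustration, and finding a collision in full SHA-256 this way is not feasible.

```python
import hashlib
from itertools import count


def short_hash(data: bytes, n_bytes: int = 2) -> bytes:
    """Truncate SHA-256 to n_bytes so collisions are cheap to find (illustration only)."""
    return hashlib.sha256(data).digest()[:n_bytes]


def find_collision(n_bytes: int = 2):
    """Return two distinct inputs whose truncated hashes are equal."""
    seen = {}  # maps truncated hash -> first input that produced it
    for i in count():
        msg = f"input-{i}".encode()
        h = short_hash(msg, n_bytes)
        if h in seen:
            return seen[h], msg
        seen[h] = msg


a, b = find_collision()
print(a, b, short_hash(a).hex())
```

With a 16-bit output there are only 65,536 possible hashes, so a collision is guaranteed after at most 65,537 distinct inputs, and in practice it shows up after a few hundred.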

Is it possible for an agentic system's reward function to have a similar vulnerability? If we assume an architecture composed of perceptrons and trained weights, through which we take some input, route it through the weighted synapses, and produce some output, would it be surprising if it's possible to find some arbitrary action which produces the same or greater reward than the action we targeted in training?
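A toy version of this question can be sketched in a few lines. The example below assumes a single perceptron used as a learned reward proxy over two invented features; the feature meanings, the data, and the hill-climbing search are all hypothetical and only illustrate the shape of the worry, not how any real system is trained.

```python
import random

random.seed(0)

# Hypothetical 2-feature "actions". Targeted actions cluster near (1, 3),
# everything else seen in training clusters near (0, 0).
paperclips = [(1.0 + random.gauss(0, 0.1), 3.0 + random.gauss(0, 0.1)) for _ in range(50)]
junk = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)]


def reward(w, b, x):
    """Linear reward proxy: higher means 'more paperclip-like' to the model."""
    return w[0] * x[0] + w[1] * x[1] + b


# Train with the classic perceptron update rule.
w, b = [0.0, 0.0], 0.0
data = [(p, 1) for p in paperclips] + [(j, -1) for j in junk]
for _ in range(20):
    for x, label in data:
        if label * reward(w, b, x) <= 0:  # misclassified: nudge the weights
            w[0] += label * x[0]
            w[1] += label * x[1]
            b += label

best_training_reward = max(reward(w, b, p) for p in paperclips)

# Hill-climb over actions, with nothing keeping the search near the training data.
x = list(paperclips[0])
for _ in range(1000):
    candidate = [x[0] + random.gauss(0, 0.5), x[1] + random.gauss(0, 0.5)]
    if reward(w, b, candidate) > reward(w, b, x):
        x = candidate

print("best reward on any training paperclip:", best_training_reward)
print("reward of the found action:", reward(w, b, x), "at", x)
```

In this toy case the answer is unsurprising: the proxy is unbounded outside the region it was trained on, so an unconstrained search quickly finds an action that looks nothing like a training paperclip yet scores higher than all of them. Whether something analogous holds for a large trained system is the open question.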

Imagine an AI optimized to produce as many paperclips as possible, and then one day it begins producing rubber ducks instead. This is not because our training actually targeted a deceptive agent (let's say we develop some powerful interpretability tools that allow us to see if an agent is deceptive during training), nor because this is an emergent intermediate goal, but because our completely honest and aligned agent suddenly discovered that creating rubber ducks produced greater reward for the same effort, and became unaligned at that moment.

Maybe it discovers this collision because another system, perhaps even a less general/powerful system, trained as an adversary to the first, delivers the information to it. Or maybe the first system discovers this itself, or through some subagent it spins up. Perhaps it even sees some aligned reason to spin up adversarial networks, or sovereign agents which become adversarial. That said, I'm not sure if these all make sense. (Why would it want to create adversarial networks?)

Even with perfect interpretability, is it possible that the best we can do is a system which says "I'm currently aligned with humans, but I'm also actively researching reward function collisions, and once I find one I'm most likely going to kill you all."?

Or perhaps we can rule out an aligned agent which purposely attacks its own reward function. An agent that is satisfied with a given reward pathway, but understands that better pathways may be available through reward collisions, might avoid discovering those collision paths: it would not want to risk sabotaging a known reward path, since it can't yet be motivated by the unknown one. Though I'm not sure if that makes any sense, and it doesn't protect against an attacker or an accidentally discovered collision.

One might call this "reward hacking", but it seems distinct from the way that phrase is usually used. When I've seen people call something "reward hacking", it's usually in reference to training against a badly specified terminal goal. For example, if you train an AI to maximize paperclips, it might "hack" that by converting all humans into paperclips. Or if you tell a narrow racing-game-playing AI to maximize its score, it might "hack" that by driving in circles to get the same drop over and over, because that gives a better score than finishing the race. However, I haven't seen anyone talk about, say, an honest paperclip maximizer with well-defined terminal goal constraints which is aligned until a future point where it discovers a goal collision with that terminal goal, one which is completely unrelated to the original terminal goal.

Edit: I originally had a paragraph in here somewhere where I explained I'm calling this "brittle alignment" because I simply don't know of an existing phrase for this, but apparently I edited that out at some point. I also considered "adversarially vulnerable alignment" after the types of patterns which often emerge from networks trained adversarially to image recognition and game playing networks, as well as "collision vulnerable alignment".


1 Answer

Seth Herd


This is a good question. I think it is both important and debated, and I don't think it has been adequately discussed or settled.

I think part of what you're getting at is what I've called The alignment stability problem. You can see my thoughts there, including links to related work.

You may be referring more specifically to goal misgeneralization. Searching for that term on LW finds a good deal of discussion. That's another potential form of alignment instability that I haven't addressed.

That effect is termed a superstimulus in biology. Modern synthetic foods and pornography are, or at least attempt to be, superstimuli relative to the "training set" of reward-predictive (or perhaps directly rewarding) stimuli from our ancestral past.

I think part of what you're getting at is what I've called The alignment stability problem. You can see my thoughts there, including links to related work.

Looking at the Google Scholar link in this article, it looks like what I'm describing more closely resembles "motivation hacking", except that, in my thought experiment, the agent doesn't modify its own reward system. Instead, it selects arbitrary actions and anticipates whether their reward is coincidentally more satisfying than the base objective. This allows it to perform this attack even if it's in the tra...

3 comments

an AI optimized to produce as many paperclips as possible [...] suddenly discovered that creating rubber ducks produced greater reward

An agent that was previously generally intelligent with respect to the goal of producing paperclips would've been motivated to prevent such "discoveries". Apparently the pursuit of greater reward leads to fewer paperclips, so its outwardly projected pursuit of a greater number of paperclips could, in particular, redesign or replace the agent with something more reliably aimed at paperclip production than its original cognitive architecture.
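One way to picture this argument is to compare two ways of scoring the same plans. The sketch below is purely illustrative: the plan names and the forecast numbers are invented, and `choose` is a hypothetical helper. It only shows that an agent ranking plans by forecast paperclips and an agent ranking plans by forecast reward signal can disagree about whether to go looking for a collision.

```python
# Hypothetical plans with the agent's own forecasts of their consequences.
# "paperclips" is the forecast outcome in the world; "reward" is the forecast
# raw reward signal its future self would receive. All numbers are made up.
PLANS = {
    "keep_making_paperclips": {"paperclips": 100, "reward": 100},
    # After finding a collision, the future self switches to rubber ducks.
    "search_for_reward_collisions": {"paperclips": 5, "reward": 250},
}


def choose(plans, criterion):
    """Pick the plan with the highest forecast value of the given criterion."""
    return max(plans, key=lambda name: plans[name][criterion])


print("values paperclips:", choose(PLANS, "paperclips"))
print("values reward signal:", choose(PLANS, "reward"))
```

Under these assumed forecasts, the agent that evaluates plans by paperclips declines to search, while the agent that evaluates plans by anticipated reward goes looking. Which of the two a trained system more closely resembles is exactly what is in dispute here.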

I think it depends on exactly how that agent processes reward and makes decisions.

I've heard this claim made in shard theory (I don't remember which post). RL networks value things-in-the-world, and don't directly work toward reward maximization.

I think that's true as far as it goes, but the total long-term effect is very dependent on exactly how that agent processes reward and makes decisions.

For instance, if I am an actor-critic RL system (as I think I am), my critic network might have learned a general, abstract representation of "fun" (as I think it probably has). If I learn of a new thing that better fits that representation than anything else (perhaps going into a virtual world created specifically to maximize my fun, as in Nozick's "experience machine" thought experiment), I will choose to pursue that new goal over anything my system was previously developed or trained to pursue. A paperclip maximizer might do something similar if its paperclip identifier in its reward system has a flaw so that something other than a paperclip triggers that identifier more strongly than any paperclip does (maybe it's a vaguely paperclip-shaped higher-dimensional shape that the agent just thought of).

I think those possibilities are what the question is addressing, and I think they are very real problems for a wide variety of, perhaps all, actually implemented systems.

If being yourself is among your values, their pursuit doesn't discard what you used to be. But without a specific constraint, current planning behavior will plot a course towards the futures it endorses, not towards the futures the mind behind it would tend to appreciate having perceived if left intact. To succeed in its hidden aims, deceptive behavior must sabotage its current actions and not just passively bide its time; otherwise the current actions would defeat the hidden aims.