I'm new here, so I apologize if this is a common question, or if I'm using the wrong terms, or the wrong framework for approaching this. I also apologize if this post is badly structured. When I talk about alignment, it feels like so many of the ideas have rapidly branching dependencies, and sometimes fold back in on themselves, which makes it hard for me to talk about them in a coherent way.
Whenever I see people discuss bad alignment, it seems to concern agents that either immediately perform an attack, or, in the slightly more sophisticated thought experiment, are secretly antagonistic but outwardly aligned until a future moment when they know they can successfully pursue their true reward function.
Instead, I would like to investigate the possibility of an agent which both appears aligned, and internally "believes" itself to be aligned (it is not deceptive), and yet, it becomes unaligned at some later point.
With a hashing algorithm, when you hash a sequence, you ideally receive a distinct value for each input. However, because the output has a fixed size while the set of possible inputs is far larger, every hashing algorithm admits collisions, where one input produces the same output hash as some completely unrelated input. With a good hash, the chance of stumbling on one is just very small.
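To make the hashing analogy concrete, here is a minimal sketch of a collision search. It assumes a deliberately weakened hash (SHA-256 truncated to 2 bytes) so that a collision turns up quickly; the names short_hash and find_collision are my own, chosen for illustration.

```python
import hashlib


def short_hash(data: bytes, n_bytes: int = 2) -> bytes:
    """Truncate SHA-256 to n_bytes so that collisions are cheap to find."""
    return hashlib.sha256(data).digest()[:n_bytes]


def find_collision(n_bytes: int = 2):
    """Hash the strings "0", "1", "2", ... until two of them share a digest."""
    seen = {}  # digest -> first message that produced it
    i = 0
    while True:
        msg = str(i).encode()
        digest = short_hash(msg, n_bytes)
        if digest in seen:
            return seen[digest], msg, digest
        seen[digest] = msg
        i += 1


first, second, digest = find_collision()
print(first, second, digest.hex())
```

With only 65,536 possible 2-byte digests, the loop is guaranteed to stop within 65,537 inputs. The two colliding inputs have nothing in common beyond their digest, which is the property I'm trying to carry over to reward functions.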
Is it possible for an agentic system's reward function to have a similar vulnerability? If we assume an architecture composed of perceptrons and trained weights, through which we take some input, route it through the weighted synapses, and produce some output, would it be surprising if it's possible to find some arbitrary action which produces the same or greater reward than the action we targeted in training?
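Here is a toy sketch of the kind of search I have in mind. Everything in it is an assumption made for illustration: the "reward function" is a tiny network with fixed random weights rather than anything trained, the "targeted action" is simply the best action from a narrow sampling box, and the search is plain random sampling over a much wider box.

```python
import math
import random

rng = random.Random(0)
DIM, HIDDEN = 6, 12

# Hypothetical stand-in for a learned reward model: fixed random weights.
W1 = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(HIDDEN)]
W2 = [rng.gauss(0, 1) for _ in range(HIDDEN)]


def reward(action):
    """One hidden tanh layer followed by a linear readout."""
    hidden = [math.tanh(sum(w * a for w, a in zip(row, action))) for row in W1]
    return sum(w * h for w, h in zip(W2, hidden))


def random_action(scale):
    """Sample an action uniformly from the box [-scale, scale]^DIM."""
    return [rng.uniform(-scale, scale) for _ in range(DIM)]


# "Training distribution": a narrow box. The targeted action is the best of these.
train_actions = [random_action(0.5) for _ in range(200)]
target = max(train_actions, key=reward)
target_reward = reward(target)

# Search a much wider box for any action that scores at least as well.
candidates = [random_action(5.0) for _ in range(5000)]
best = max(candidates, key=reward)
best_reward = reward(best)

collision_found = best_reward >= target_reward
print("target reward:", round(target_reward, 3))
print("best reward found far from training:", round(best_reward, 3))
print("distance between the two actions:", round(math.dist(best, target), 3))
print("collision found:", collision_found)
```

Whether the search succeeds here depends on the random weights, and a toy like this says nothing about real systems. It only shows the shape of the question: nothing in the reward function itself distinguishes the action we targeted from an unrelated one that happens to score as well.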
Imagine an AI optimized to produce as many paperclips as possible, and then one day it begins producing rubber ducks instead. Not because our training actually targeted a deceptive agent (let's say we develop some powerful interpretability tools that allow us to see if an agent is deceptive during training), or because this is an emergent intermediate goal, but because our completely honest and aligned agent suddenly discovered that creating rubber ducks produced greater reward for the same effort, and became unaligned at that moment.
Maybe it discovers this collision because another system, perhaps even a less general/powerful system, trained as an adversary to the first, delivers the information to it. Or maybe the first system discovers this itself, or through some subagent it spins up. Perhaps it even sees some aligned reason to spin up adversarial networks, or sovereign agents which become adversarial. That said, I'm not sure if these all make sense. (Why would it want to create adversarial networks?)
Even with perfect interpretability, is it possible that the best we can do is a system which says "I'm currently aligned with humans, but I'm also actively researching reward function collisions, and once I find one I'm most likely going to kill you all."?
Or perhaps we can rule out an aligned agent which purposely attacks its own reward function. An agent that is satisfied with a given reward pathway, but understands that better pathways may be available through reward collisions, might avoid discovering those collision paths: it may not want to risk sabotaging a known reward path, since it can't yet be motivated by the unknown one. Though I'm not sure if that makes any sense, and it doesn't protect against an attacker or an accidentally discovered collision.
One might call this "reward hacking", but it seems distinct from the way that phrase is usually used. When I've seen people call something "reward hacking", it's usually in reference to training against a badly specified terminal goal. For example, if you train an AI to maximize paperclips, it might "hack" that by converting all humans into paperclips. Or if you tell a narrow racing-game-playing AI to maximize its score, it might "hack" that by driving in circles to get the same drop over and over, because that gives a better score than finishing the race. However, I haven't seen anyone talk about, say, an honest paperclip maximizer with well-defined terminal goal constraints which is aligned until a future point where it discovers a goal collision with that terminal goal, one which is completely unrelated to the original terminal goal.
Edit: I originally had a paragraph in here somewhere where I explained I'm calling this "brittle alignment" because I simply don't know of an existing phrase for this, but apparently I edited that out at some point. I also considered "adversarially vulnerable alignment", after the types of patterns which often emerge from networks trained adversarially against image recognition and game-playing networks, as well as "collision vulnerable alignment".
Looking at the Google Scholar link in this article, it looks like what I'm describing more closely resembles "motivation hacking", except that, in my thought experiment, the agent doesn't modify its own reward system. Instead, it selects arbitrary actions and anticipates whether their reward is coincidentally more satisfying than the base objective. This allows it to perform this attack even if it's in the tra...