Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Related to Stuart Armstrong’s post last summer on Model Splintering.)

Let’s say I have a human personal assistant. Call him Ahmed. Ahmed is generally trying to help me. Sometimes a situation comes up where Ahmed feels surprised and confused. Other times, a situation comes up where Ahmed feels torn and conflicted about what to do next. In both those types of situations, I would want Ahmed to act conservatively (only do things which he considers clearly good in all respects)—or better yet, stop and ask me for help. Maybe Ahmed would say something to me like "On the one hand ... On the other hand...". That would be great. I would be very happy for him to do that.

So, we should try to instill a similar behavior in AGIs!

I think something like this is likely to be feasible, at least in the neocortex-like AGIs that I’ve been thinking about.

More specifically: OK sure, maybe we need new breakthroughs in interpretability tools in order to understand what an AGI is confused or conflicted about, and why. But it seems quite tractable to develop methods to recognize that the AGI is confused or conflicted, and then to run some special-purpose code whenever that happens. Or more simply, just have it not take any action about which it feels confused or conflicted.

To start with, I'll elaborate on what I think it would look like to have a circuit that detects when an AGI is confused or conflicted. I'll start with confusion.

What are the algorithmic correlates of feeling confused?

Confusion is pretty straightforward in my picture. You have a strongly-held generative model—one that has made lots of correct predictions in the past—and it predicts X under some circumstance. You have another strongly-held generative model which predicts NOT-X under the same circumstance. The message-passing algorithm flails around, trying in vain to slot these two contradictory expectations into the same composite model, or to deactivate one of them.

That’s confusion.

(I guess in logical induction, confusion would look like two very successful traders with two very different assessments of the value of an asset, or something like that, maybe. I don't know much about logical induction.)

Either way, confusion seems like it should be straightforwardly detectable by monitoring the algorithm.
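To make the monitoring idea concrete, here's a minimal toy sketch. It assumes (hypothetically) that each generative model exposes a confidence score (its track record of correct predictions) and a Boolean prediction about some proposition; a confusion detector just looks for strongly-held models that contradict each other:

```python
from dataclasses import dataclass

@dataclass
class GenerativeModel:
    name: str
    confidence: float   # how strongly held the model is (track record of correct predictions)
    prediction: bool    # the model's prediction for proposition X in the current circumstance

def is_confused(models, strength_threshold=0.9):
    """Flag confusion: two strongly-held models predict X and NOT-X
    under the same circumstance."""
    strong = [m for m in models if m.confidence >= strength_threshold]
    predictions = {m.prediction for m in strong}
    return len(predictions) > 1  # contradictory expectations among strong models

models = [
    GenerativeModel("physics-intuition", confidence=0.95, prediction=True),
    GenerativeModel("social-intuition", confidence=0.97, prediction=False),
    GenerativeModel("weak-hunch", confidence=0.30, prediction=True),
]
print(is_confused(models))  # True: two strong models disagree
```

Of course, a real message-passing system wouldn't have a tidy list of Booleans like this; the point is only that the contradiction signal is mechanically visible, not hidden in the semantics.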

What are the algorithmic correlates of feeling conflicted?

In broad outline, the brain has a world-model, and it's stored mainly in the neocortex. The basal ganglia have a dense web of connections all across the frontal lobe of the neocortex. (The frontal lobe is the home of plans and actions, but not sensory processing.) The basal ganglia's job is to store a database with a reward prediction (well, technically a reward prediction probability distribution) for each different pattern of activity in the frontal lobe of the neocortex. This database is updated using the TD learning algorithm, using the reward-prediction-errors that are famously signaled by dopamine. The reward itself is calculated elsewhere.

There's no point in building a reward prediction database if you're not going to use it, so that's the basal ganglia’s other job: use the database to bias neocortical activity in favor of activity associated with higher reward predictions.
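As a cartoon of the database-plus-TD-learning picture, here's a sketch in which each frontal-lobe activity pattern is identified by a string key (the keys and the scalar-reward simplification are my assumptions, not claims about the brain):

```python
# Toy sketch of the "reward prediction database" idea, assuming each
# frontal-lobe activity pattern can be identified by a key (hypothetical),
# and using a scalar prediction rather than a probability distribution.
class RewardPredictionDB:
    def __init__(self, learning_rate=0.1):
        self.predictions = {}   # pattern -> predicted reward
        self.lr = learning_rate

    def predict(self, pattern):
        return self.predictions.get(pattern, 0.0)

    def td_update(self, pattern, actual_reward):
        """Move the stored prediction toward the observed reward.
        The gap (actual - predicted) is the reward-prediction error,
        the quantity famously signaled by dopamine."""
        error = actual_reward - self.predict(pattern)
        self.predictions[pattern] = self.predict(pattern) + self.lr * error
        return error

db = RewardPredictionDB(learning_rate=0.5)
db.td_update("put-on-sock", 1.0)  # first encounter: error = 1.0, stored prediction becomes 0.5
```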

Now, our thoughts are generally composed of lots of different little generative models snapped together on the fly. Like, “I will put on a sock” involves hand motions and foot motions and a model of how a sock stretches and the idea that your foot will be less cold in the future and the idea of finding your sock in the drawer, etc. etc. etc.

The reward prediction entries in the basal ganglia database do not look like "one single entry for the snapped-together model". Instead they look like lots of entries, one for each individual piece of the snapped-together model. It just has to be that way. How do I know? Because you’ve seen each of the individual pieces many times before, whereas for every snapped-together model, it’s the very first time in your life that that exact model has ever been assembled. After all, you’ve never before put on this particular sock in this particular location with this particular song stuck in your head, etc. etc. You can't have a reward prediction already stored in the database for a whole assembled model, if this is the first time you've ever seen it.

So a key question is: how do you combine a bunch of reward predictions (or reward prediction probability distributions), one from each of the component pieces, to get the aggregate reward prediction for the snapped-together model? Because at the end of the day, you do need a single aggregate reward prediction to drive behavior.

So this immediately suggests an idea for what it looks like to feel conflicted: you are entertaining a plan / generative model which combines at least one very-high-reward-predicting piece with at least one very-negative-reward-predicting piece. Semantically and logically, these two pieces snap together perfectly well into a composite model. But from a reward perspective, it’s self-inconsistent.

For example, in the trolley problem, the "I will save lives" model carries a very high reward prediction, but the "I will take an action that directly causes someone to die" model carries a very negative reward prediction. The same plan activates both of those mental models. Hence we feel conflicted. (More details in the appendix—I think it's a bit more complicated than I'm letting on here.)

As above, it seems to me that it should be straightforward to build an AGI component that watches the models being formed, and watches the queries to the reward prediction database, and thus can flag when the AGI feels conflicted.
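A conflict detector along these lines could be as simple as checking the spread of the component reward predictions. This is a toy sketch under the scalar-reward-prediction simplification (real entries would be probability distributions):

```python
def is_conflicted(piece_reward_predictions, high=0.8, low=-0.8):
    """Flag conflict: the snapped-together model combines at least one
    very-high-reward-predicting piece with at least one very-negative one."""
    return (max(piece_reward_predictions) >= high
            and min(piece_reward_predictions) <= low)

# Trolley problem: "I will save lives" predicts high reward,
# "I will directly cause someone to die" predicts very negative reward.
print(is_conflicted([0.9, -0.95, 0.1]))  # True
print(is_conflicted([0.9, 0.4, 0.1]))    # False
```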

What do we do when the "AGI is confused" detector circuit activates?

OK. So we build an AGI with a circuit that detects when it feels confused. (I'll do conflict in the next section.) So far so good. But now what?

For one thing, when the circuit activates, that's a possible warning that the AGI is having an ontological crisis. Or more generally, it could flag what Stuart Armstrong calls Model Splintering. It could also be something benign, like the AGI has some narrow misunderstanding about protein dynamics.

So OK, the detector circuit just activated. The AGI is very confused about something. Honestly, I'm not really sure what the next step is.

Hmm, maybe instead of looking for confusion in general, you find and flag some safety-critical entries in the world model—"corrections from my supervisor", "my own programming", etc.—and if the AGI has major confusions involving those entries, then you should be concerned, because that could be about to precipitate an ontological crisis that will undermine our control systems. So you automatically pause AGI execution in those cases, and try to figure out what's going on using interpretability tools, or suppress that thought so that the AGI goes and thinks about something else. Or something.

Well, anyway, I think the "AGI is conflicted" part is more important than the "AGI is confused" part. So I'll switch to that.

What do we do when the "AGI is conflicted" detector circuit activates?

My broad idea is: you (somehow) start in a good state, with the AGI having a suite of intuitions and ideas that’s compatible with what you were going for. I’m imagining something like the psychology of a helpful human—there’s a big cloud of values and habits and norms that tend to work together under normal circumstances.

Then the AGI thinks of something weird—“Oh, hey, what if I tile the universe with Hedonium?”. Presumably this is a great idea according to some of its values / habits / norms, and a terrible idea according to other of its values / habits / norms. So the “AGI is conflicted” detector circuit activates.

Now what?

A simple answer is: We set up the reward system such that this kind of idea is not appealing. Like, we get to decide how the reward prediction aggregation works, right? So we can just say: every little piece of the model gets a veto, so that if the AGI has any intuition suggesting that this idea is bad, the idea as a whole just sounds unappealing—as opposed to whatever more even weighting function we humans use (see Appendix). I'll call that idea conservatism.

In other words, you can imagine a continuum from:

RP of the snapped-together model = min(RP of each component piece)

RP of the snapped-together model = mean(RP of each component piece)

RP of the snapped-together model = max(RP of each component piece)

(RP = Reward Prediction.) And that continuum would go smoothly from most conservative (it's a good thought if all of your intuitions support it) to most aggressive (it's a good thought if any of your intuitions support it).

(I think you also need to somehow set up the system so that "do nothing" is the automatically-acceptable default operation when every possibility is unpalatable.)
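The continuum above, plus the "do nothing" default, can be sketched like this (again assuming scalar reward predictions per component piece; the function names and the zero baseline for "do nothing" are my own illustrative choices):

```python
import statistics

def aggregate_rp(piece_rps, mode="min"):
    """Aggregate component reward predictions into one RP for the
    snapped-together model: "min" is most conservative (every piece
    gets a veto), "max" is most aggressive."""
    return {"min": min, "mean": statistics.mean, "max": max}[mode](piece_rps)

def choose_plan(plans, mode="min", do_nothing_rp=0.0):
    """Pick the plan with the best aggregate RP; fall back to 'do nothing'
    when every candidate is unpalatable (below the do-nothing baseline)."""
    best = max(plans, key=lambda pieces: aggregate_rp(pieces, mode), default=None)
    if best is None or aggregate_rp(best, mode) < do_nothing_rp:
        return "do nothing"
    return best

# A hedonium-tiling-flavored plan (one intuition loves it, one hates it)
# versus a boring-but-fine plan:
plans = [[0.9, -0.95], [0.3, 0.2]]
print(choose_plan(plans, mode="min"))  # [0.3, 0.2] — the veto blocks the weird plan
print(choose_plan(plans, mode="max"))  # [0.9, -0.95] — the aggressive rule takes it
```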

(By the way, I think the reward prediction aggregation algorithm is actually more complicated than this, at least in humans, although it could presumably work exactly like this in an AGI. So maybe this is oversimplified, although I still think the main point is valid. See appendix for further discussion.)

So that's one possible answer. It doesn't actually require the conflict-detection circuit I mentioned above, and doesn't require any manual intervention. This kind of conservatism is just built into the AGI's thinking / decision rule.

Another possible answer is: When we detect internal conflict, we pause the AGI's execution. First we suppress one side of the conflict and see (using whatever interpretability methods we have) the thought that comes into the AGI's mind. Then we rewind, and suppress the other side of the conflict, and see the thought that comes into the AGI's mind. Then whichever one we like better, we imprint that choice as a strong update to the reward prediction database.

Maybe there are other possible answers too.

What could go wrong?

Well, this is obviously quite vague, and just one piece of a very incomplete story. You'll notice I've been mentioning that throughout!

Another thing is: there's a need to set a threshold for the detector circuits, or a threshold for the "conservative" thought-selection rule. Too low and the AGI stops being able to do anything, too high and you're missing important things. How do you get it right?

Another thing is: How close can these mechanisms get to our intuitive notions of ontological crises, conservatism, internal conflict, etc.? Any divergence presents a risk.

Another thing is: We want the AGI to like having these mechanisms, so that it doesn't try to subvert them. But that's nothing new; it's equally true of every safety feature of a self-aware AGI. I imagine doing something like finding "the idea of having this mechanism" in the AGI's world-model (somehow), and making sure the corresponding reward-prediction entry in the database stays positive. Or something. I dunno.

Appendix: Speculations on how reward aggregation works in humans / animals

Try to imagine a "falling stationary rock". You can't. The "falling" model and the "stationary" model make incompatible predictions about the same Boolean variables. It just doesn't snap together into a composite model. I'll say that those two models are "semantically incompatible."

In the same way, my best guess right now is that your brain treats it as equally impossible to snap together a high-reward-predicting model with a low-reward-predicting model. They are, let's say, "reward incompatible". The brain demands consensus on the reward prediction from all the pieces that get snapped together, I think.

And moreover, the very fact that a particular high-reward-predicting model is semantically compatible with a low-reward predicting model is automatically counted as evidence that they are, in fact, reward compatible. So just by thinking about the snapped-together composite model, the high-reward-predicting model might lose points, and the low-reward-predicting model might gain points. Try it. Imagine Hitler being very kind to his dog Blondi. Just hold the image in your head for a minute. (I don't know if it's true.) You notice yourself intuitively thinking a little less ill of Hitler, and/or thinking a little less well of dog-lovers in general.

Hence we get the halo effect, and Scott Alexander's "ethnic tension" graphical model ("flow diagram thing") algorithm, etc. When I think about the trolley problem, I think I actually oscillate back and forth between the "killing is bad" thought and the "saving lives is good" thought. I don't think I can hold both aspects in my head at exactly the same time.

Now, on the one hand, this seems utterly insane. Why can't you have two model pieces that are semantically-compatible but reward-incompatible? Plans can have both good and bad aspects, right? So, even if humans work this way, why on earth would we make our AGIs that way???

On the other hand, if we do have an architecture with a single scalar reward, then each piece of the composite model is either predicting that reward correctly, or it's not. So in that sense, we really should be updating the reward predictions of different parts of a snapped-together model to bring them closer to consensus.
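If the brain really does demand consensus, the simplest cartoon of the update is each component's reward prediction drifting toward the composite model's aggregate prediction. This is pure half-baked illustration, matching the appendix's speculative tone; the mean and the rate parameter are arbitrary choices:

```python
def consensus_update(piece_rps, rate=0.1):
    """Nudge each component's reward prediction toward the composite
    model's mean prediction. E.g. holding "Hitler was kind to his dog"
    in mind drags the very-negative piece up a bit and the positive
    dog-lover piece down a bit."""
    mean_rp = sum(piece_rps) / len(piece_rps)
    return [rp + rate * (mean_rp - rp) for rp in piece_rps]

print(consensus_update([-0.9, 0.9]))  # roughly [-0.81, 0.81]: both pieces move toward consensus
```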

(Also, whenever I say to myself, "well that's a just-plain-stupid aspect of the human brain algorithm", I tend to eventually wind up coming around to feeling that it was a necessary design feature after all.)

An additional complication is that the reward is underdetermined by the situation—the reward is a function of your thoughts, not your situation. You can think about how you killed that person with your trolley and thus experience a negative reward, or you can think about how you saved lives and thus experience a positive reward. So you can go back and forth, and never get consensus. You just have two ways to think about it, and each is correctly predicting the future rewards, in a self-fulfilling, self-consistent kind of way.

Anyway, I don't think any of this undermines the main part of this blog post. I think what I said about catching conflict and being conservative should still work, even if the AGI has a "reward-prediction-consensus" architecture rather than a "reward-prediction-aggregation" architecture. But I'm not 100% sure. Also, everything in this appendix is half-baked speculation. As is the rest of the post. :-P


5 comments

As you're aware, I'm very much exploring this approach using multi-objective decision-making, with conservatism achieved by only acting when an action is non-negative on the whole set of objective functions that the agent considers.

The alternative Bayesian AGI approach is also worth thinking about. A conservative Bayesian AGI might not need multiple objectives. For each action, it just needs a single probability distribution over outcomes. If there are multiple theories of how to translate the consequences of its actions into its single utility function, each of those theories might be given some weight, and then they'd be combined into the probability distribution. Then a conservative Bayesian AGI only acts if an action's utility distribution doesn't dip below zero. Or maybe there's always some remote possibility of going below zero, and programming this sort of behavior would be absolutely paralysing. In that case maybe we just make it loss-averse rather than strictly avoiding any possibility of a negative outcome.

Why does this approach only need to be implemented in neocortex-like AGIs? If we have a factored series of value functions in an RL agent, then we should be able to take the same approach? But I guess you are thinking that the basal ganglia learning algorithms already do this for us, so it is a convenient approach?

Side note. I found the distinction between confusion and conflict a bit... confusing! Confusion here is the agent updating a belief while conflict is the agent deciding to take an action?

Why does this approach only need to be implemented in neo-cortex like AGIs?

Oh, I wasn't saying that. I wouldn't know either way, I haven't thought about it. RL is a very big and heterogeneous field. I only know little bits and pieces of it. It's a lot easier to make a specific proposal that applies to a specific architecture—the architecture that I happen to be familiar with—than to try to make a more general proposal. So that's all I did.

What do you mean by "factored series of value functions"? If you're thinking of my other post, maybe that's possible, although not what I had in mind, because humans can feel conflicted, but my other post is talking about a mechanism that does not exist in the human brain.

Confusion here is the agent updating a belief while conflict is the agent deciding to take an action?

Yeah, that's what I was going for.

Thanks, I find your neocortex-like AGI approach really illuminating.

Random thought:

(I think you also need to somehow set up the system so that "do nothing" is the automatically-acceptable default operation when every possibility is unpalatable.)

I was wondering if this is necessarily the best "everything is unpalatable" policy. I could imagine that the best fallback option could also be something like "preserve your options while gathering information, strategizing, and communicating with relevant other agents," assuming that this is not unpalatable too. I guess we may not yet trust the AGI to do this; option preservation might cause much more harm than doing nothing. But I still wonder if there are cases in which every option is unpalatable but doing nothing is clearly worse.

Yeah I was really only thinking about "not yet trust the AGI" as the main concern. Like, I'm somewhat hopeful that we can get the AGI to have a snap negative reaction to the thought of deceiving its operator, but it's bound to have a lot of other motivations too, and some of those might conflict with that. And it seems like a harder task to make sure that the latter motivations will never ever outbid the former, than to just give every snap negative reaction a veto, or something like that, if that's possible.

I don't think "if every option is bad, freeze in place paralyzed forever" is a good strategy for humans :-P and eventually it would be a bad strategy for AGIs too, as you say.