Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I've mentioned in posts twice (and previously in several comments) that I'm excited about predictive coding, specifically the idea that the human brain either is or can be modeled as a hierarchical system of (negative feedback) control systems that try to minimize error in predicting their inputs with some strong (possibly un-updatable) prediction set points (priors). I'm excited because I believe this approach better describes a wide range of human behavior, including subjective mental experiences, than any other theory of how the mind works, it's compatible with many other theories of brain and mind, and it may give us an adequate way to ground human values precisely enough to be useful in AI alignment.

A predictive coding theory of human values

My general theory of how to ground human values in minimization of prediction error is simple and straightforward:

I've thought about this for a while, so I have a fairly robust sense in my mind of how this works that allows me to verify it against a wide variety of situations, but I doubt I've conveyed that to you already. I think it will help if I give some examples of what this theory predicts happens in various situations and how that accounts for the behavior people observe and report in themselves and others.

  • Mixed emotions/feelings are the result of a literal mix of different control systems under the same hierarchy receiving positive and negative signals as a result of producing less or more prediction error.
  • Hard-to-predict people are perceived as creepy or, stated with less nuance, bad.
  • Familiar things feel good by definition: they are easy to predict.
    • Similarly, there's a feeling of loss (bad) when familiar things change.
  • Mental illnesses result from failures of neurons to set good/bad thresholds appropriately, to update set points at an appropriate rate to match current rather than old circumstances, and from sensory input issues causing either prediction error or internally correct predictions that are poorly correlated with reality (this broadly including issues related both to sight, sound, smell, taste, touch and to mental inputs from long term memory, short term memory, and otherwise from other neurons).
  • Desire and aversion are what it feels like to notice prediction error is high and for the brain to take actions it predicts will lower it either by something happening (seeing sensory input) or not happening (not seeing sensory input), respectively.
  • Good and bad feel like natural categories because they are, but ones that are the result of a brain interacting with the world rather than features of the externally observed world.
  • Etc.

Further exploration of these kinds of cases will help verify the theory by showing whether adequate and straightforward applications of it can explain various phenomena (I view it as being in a similar epistemic state to evolutionary psychology, including the threat of misleading ourselves with just-so stories). It does to some extent hinge on questions I'm not situated to evaluate experimentally myself, especially whether or not the brain actually implements hierarchical control systems of the type described. I'm willing to move forward anyway because, even if the brain is not literally made of hierarchical control systems, the theory appears to model what the brain does well enough that whatever theory replaces it will also have to be compatible with many of its predictions. Hence I think we can use it as a provisional grounding even as we keep an eye out for ways in which it may turn out to be an abstraction that we will have to reconsider in the light of future evidence, and I expect that work we do based on it will be amenable to translation to whatever new, more fundamental grounding we may discover in the future.

Relation to AI alignment

So that's the theory. How does it relate to AI alignment?

First note that this theory is naturally a foundation of axiology, or the study of values, and by extension a foundation for the study of ethics, to the extent that ethics is about reasoning about how agents, each with their own (possibly identical) values, interact. This is relevant for reasons I and more recently Stuart Armstrong have explored:

Stuart has been exploring one approach by grounding human values in an improvement on the abstraction for human values used in inverse reinforcement learning that I think of as a behavioral economics theory of human values. My main objection to this approach is that it is behaviorist: it appears to me to be grounded in what can be observed from external human behavior by other agents and has to infer the internal states of agents across a large inferential gap, true values being a kind of hidden and encapsulated variable an agent learns about via observed behavior. To be fair this has proven an extremely useful approach over the past 100 years or so in a variety of fields, but it also suffers an epistemic problem in that it requires lots of inference to determine values, and I believe this makes it a poor choice given the magnitude of Goodharting effects we expect to be at risk from at superintelligence levels of optimization.

In comparison, I view a predictive-coding-like theory of human values as offering a much better method of grounding human preferences. It is

  • parsimonious: the behavioral economics approach to human values allows comparatively complicated value specifications and requires many modifications to make it reflect a wide variety of observed human behavior, whereas this theory lets them be specified in simple terms that become complex by recursive application of the same basic mechanism;
  • requires little inference: if it is totally right, only the inference of measuring neuron activity creates room for epistemic error within the model;
  • captures internal state: true values/internal state is assessed as directly as possible rather than inferred from behavior;
  • broad: works for both rational and non-rational agents without modification;
  • flexible: even if the control theory model is wrong, the general "Bayesian brain" approach is probably right enough for us to make useful progress over what is possible with a behaviorist approach such that we could translate work that assumes predictive coding to another, better model.

Thus I am quite excited about the possibility that a predictive coding approach may allow us to ground human values precisely enough to enable successfully aligning AI with human values.


This is a first attempt to explain what has been my "big idea" for the last year or so now that it has finally come together enough in my head that I'm confident presenting it, so I very much welcome feedback, questions, and comments that may help us move towards a more complete evaluation and exploration of this idea.


It seems like people are talking in circles around each other in these comments, and I think the reason is that Gordon and other people who like predictive processing theory are misusing the word "prediction".

By misuse I mean clearly deviating from common use. I don't really care about sticking to common use, but if you deviate from the expected meaning of a word it is good to let people know.

Let's say I have a model of the future in my head. If I try to adjust the model to fit reality, this model is a prediction. If I try to fit reality to my model, this model is an intention.

If you have a control system that tries to minimise "prediction error" with respect to a "prediction" that it is not able to change, so that the system resorts to changing reality instead, then that is not really a prediction anymore.

As I understand it, predictive processing theory suggests that both updating predictions and executing intentions are optimising for the same thing, which is aligning reality with my internal model. However, there is an important difference, which is what counts as variables and what counts as constants in solving that problem. Gordon mentions in some places that sometimes "predictions" can't be updated.

This means that it won't always be the case that a control system is globally trying to minimize prediction error, but instead is locally trying to minimize prediction error, although it may not be able to become less wrong over time because it can't change the prediction to better predict the input.
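
A minimal sketch of the variables-versus-constants distinction, assuming a one-dimensional world and a simple proportional update. The names `step`, `run`, `model`, and `world` are illustrative and not taken from any predictive processing literature or library. The same error is reduced in both runs; the only difference is which quantity is allowed to move.

```python
def step(model, world, update_model, rate=0.5):
    """Reduce the error (world - model) by moving exactly one of the two values."""
    error = world - model
    if update_model:
        # Prediction: the model is the variable, reality is the constant.
        model += rate * error
    else:
        # Intention / set point: the model is the constant, reality is the variable.
        world -= rate * error
    return model, world


def run(model, world, update_model, steps=20):
    """Apply the same error-reducing step repeatedly."""
    for _ in range(steps):
        model, world = step(model, world, update_model)
    return model, world


# Same starting error, two different ways of closing it.
prediction_result = run(model=0.0, world=10.0, update_model=True)
intention_result = run(model=0.0, world=10.0, update_model=False)
```

In the first run the model converges toward the world and the world never changes; in the second the world is pushed toward the fixed model. An outside observer who only sees the error shrinking cannot tell the two apart.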

There is probably some actual disagreement here (in this comment section) too, but we will not figure that out if we don't agree on what words mean first.

Yeah, this has been a really good comment section for figuring out how my internal models are not as easily conveyed to others as I had hoped. I'll likely write a follow-up post that tries to explain this idea again with revised language and leans more on specifics from existing research on these models. There seem to be some inferential gaps that I had forgotten about, such that what feels like the exciting new part to me (prediction error signal = valence = ground of value) is maybe the least interesting and least important aspect to evaluate for others who don't share my beliefs about how what I'm gesturing at with "predictive coding" and "minimization of prediction error" works.

Do you agree with my clarification?

Because what you are trying to say makes a lot of sense to me if and only if I replace "prediction" with "set point value" for cases when the so-called prediction is fixed.

Set point (control system vocabulary) = Intention/goal (agent vocabulary)

From my understanding I'm happy to talk just in terms of set points if that helps avoid confusion. Things like predictions, goals, intentions, learning, etc. seem to me like ways of talking about control systems with set points and possible set point update mechanisms that function in particular ways that we identify with those things. My original use of "prediction" seems to be confusing to many so I guess I should just stick to "set point" or make more clear what "prediction" means here since I assume (although can't remember) that I picked up using "prediction" to mean "set point" from the relevant literature.

I agree that

  • there's something to the hierarchy thing;
  • if we want, we can always represent values in terms of minimizing prediction error (at least to a close approximation), so long as we choose the right predictions;
  • this might turn out to be the right thing to do, in order to represent the hierarchy thing elegantly (although I don't currently see why, and am somewhat skeptical).

However, I don't agree that we should think of values as being predictable from the concept of minimizing prediction error.

The tone of the following is a bit more adversarial than I'd like; sorry for that. My attitude toward predictive processing comes from repeated attempts to see why people like it, and all the reasons seeming to fall flat to me. If you respond, I'm curious about your reaction to these points, but it may be more useful for you to give the positive reasons why you think your position is true (or even just why it would be appealing), particularly if they're unrelated to what I'm about to say.

Evolved Agents Probably Don't Minimize Prediction Error

If we look at the field of reinforcement learning, it appears to be generally useful to add intrinsic motivation for exploration to an agent. This is the exact opposite of predictability: in one case we add reward for entering unpredictable states, whereas in the other case we add reward for entering predictable states. I've seen people try to defend minimizing prediction error by showing that the agent is still motivated to learn (in order to figure out how to avoid unpredictability). However, the fact remains: it is still motivated to learn strictly less than an unpredictability-loving agent. RL has, in practice, found it useful to add reward for unpredictability; this suggests that evolution might have done the same, and suggests that it would not have done the exact opposite. Agents operating under a prediction-error penalty would likely under-explore.
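
The contrast can be sketched in a few lines, assuming the intrinsic term is simply proportional to the agent's prediction error. The function names, the `beta` coefficient, and the two options are hypothetical and chosen only for illustration; real exploration bonuses are usually more elaborate.

```python
def shaped_reward(extrinsic, prediction_error, beta):
    """Extrinsic reward plus an intrinsic term proportional to prediction error.

    beta > 0: curiosity-style bonus for unpredictable outcomes.
    beta < 0: penalty on prediction error (predictability-seeking).
    """
    return extrinsic + beta * prediction_error


def preferred(options, beta):
    """Return the name of the option with the highest shaped reward."""
    return max(options, key=lambda name: shaped_reward(*options[name], beta))


# Hypothetical options: (extrinsic reward, prediction error of the agent's model).
options = {
    "familiar": (1.0, 0.1),
    "unexplored": (1.0, 0.9),
}

curious_choice = preferred(options, beta=0.5)    # picks "unexplored"
cautious_choice = preferred(options, beta=-0.5)  # picks "familiar"
```

With equal extrinsic payoff, the sign of `beta` alone decides whether the agent gathers new information or stays with what it already knows.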

It's Easy to Overestimate The Degree to which Agents Minimize Prediction Error

I often enjoy variety -- in food, television, etc -- and observe other humans doing so. Naively, it seems like humans sometimes prefer predictability and sometimes prefer variety.

However: any learning agent, almost no matter its values, will tend to look like it is seeking predictability once it has learned its environment well. It is taking actions it has taken before, and steering toward the environmental states similar to what it always steers for. So, one could understandably reach the conclusion that it is reliability itself which the agent likes.

In other words: if I seem to eat the same foods quite often (despite claiming to like variety), you might conclude that I like familiarity when it's actually just that I like what I like. I've found a set of foods which I particularly enjoy (which I can rotate between for the sake of variety). That doesn't mean it is familiarity itself which I enjoy.

I'm not denying that mere familiarity has some positive valence for humans; I'm just saying that for arbitrary agents, it seems easy to over-estimate the importance of familiarity in their values, so we should be a bit suspicious about it for humans too. And I'm saying that it seems like humans enjoy surprises sometimes, and there's evolutionary/machine-learning reasoning to explain why this might be the case.
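
As a toy illustration of this point, consider a greedy bandit agent whose values contain no term for familiarity at all. This is a sketch under strong assumptions (deterministic payoffs, optimistic initial estimates, hypothetical payoff numbers), not a model of any real agent.

```python
def run_bandit(payoffs, steps=60, optimistic_start=10.0):
    """Greedy agent with optimistic initial estimates; it values payoff only."""
    estimates = [optimistic_start] * len(payoffs)
    counts = [0] * len(payoffs)
    history = []
    for _ in range(steps):
        # Pick the arm with the highest current estimate (ties go to the lowest index).
        arm = max(range(len(payoffs)), key=lambda i: estimates[i])
        counts[arm] += 1
        # Incremental sample average of the (deterministic) payoff.
        estimates[arm] += (payoffs[arm] - estimates[arm]) / counts[arm]
        history.append(arm)
    return history


history = run_bandit(payoffs=[1.0, 3.0, 2.0, 0.5])
early_variety = len(set(history[:4]))    # distinct arms tried while still learning
late_variety = len(set(history[-20:]))   # distinct arms chosen once it has learned
```

Early on the agent tries every arm; later it repeats a single arm forever. Someone watching only the late behavior might conclude it likes familiarity, when it simply likes the best payoff it has found.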

We Need To Explain Why Humans Differentiate Goals and Beliefs, Not Just Why We Conflate Them

You mention that good/bad seem like natural categories. I agree that people often seem to mix up "should" and "probably is", "good" and "normal", "bad" and "weird", etc. These observations in themselves speak in favor of the minimize-prediction-error theory of values.

However, we also differentiate these concepts at other times. Why is that? Is it some kind of mistake? Or is the conflation of the two the mistake?

I think the mix-up between the two is partly explained by the effect I mentioned earlier: common practice is optimized to be good, so there will be a tendency for commonality and goodness to correlate. So, it's sensible to cluster them together mentally, which can result in them getting confused. There's likely another aspect as well, which has something to do with social enforcement (ie, people are strategically conflating the two some of the time?) -- but I'm not sure exactly how that works.

The tone of the following is a bit more adversarial than I'd like; sorry for that. My attitude toward predictive processing comes from repeated attempts to see why people like it, and all the reasons seeming to fall flat to me. If you respond, I'm curious about your reaction to these points, but it may be more useful for you to give the positive reasons why you think your position is true (or even just why it would be appealing), particularly if they're unrelated to what I'm about to say.

I'll reply to your points soon because I think doing that is a helpful way for me and others to explore this idea, although it might take me a little time since this is not the only thing I have to do, but first I'll respond to this request that I seemingly left out.

I have two main lines of evidence that come together to make me like this theory.

One is that it's elegant, simple, and parsimonious. Control systems are simple; they look to me to be the simplest thing we might reasonably call "alive" or "conscious" if we try to redefine those terms in ways that are not anchored on our experience here on Earth. I think the reason it's so hard to answer questions about what is alive and what is conscious is that the naive categories we form and give those names are ultimately rooted in simple phenomena involving information "pumping" that locally reduces entropy. Many things do this that lie outside our historical experience of what we could observe generating information, and so it historically made more sense to think of them as "dead" than "alive". In a certain sense this leads me to a position you might call "cybernetic panpsychism", but that's just fancy words for saying there's nothing so special going on in the universe that makes us different from rocks and stars other than (increasingly complex) control systems creating information.

Another is that it fits with a lot of my understanding of human psychology. Western psychology doesn't really get down to a level where it has a solid theory of what's going on at the lowest levels of the mind, but the Buddhist psychology of the abhidharma does, and it says that right after "contact" (stuff interacting with neurons) comes "feeling/sensing", and this is claimed to always contain a signal of positive, negative, or neutral judgement. My own experience with meditation showed me something similar such that when I learned about this theory it seemed like an obviously correct way of explaining what I was experiencing. This makes me strongly believe that any theory of value we want to develop should account for this experience of valence showing up and being attached to every experience.

In light of this second reason, I'll add to my first reason that it seems maximally parsimonious that if we were looking for an origin of valence it would have to be about something simple that could be done by a control system, and the simplest thing it could do that doesn't simply ignore the input is test how far off an observed input is from a set point. If something more complex is going on, I think we'd need an explanation for why sending a signal indicating distance from a set point is not enough.
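
A minimal sketch of that simplest operation, assuming a scalar input. The mapping from distance to a positive/neutral/negative label, including the `tolerance` band, is purely an illustrative assumption and not something the theory specifies.

```python
def error_signal(observed, set_point):
    """Signed distance of an observed input from the set point."""
    return observed - set_point


def valence(observed, set_point, tolerance=0.5):
    """Map the error to a coarse positive / neutral / negative label.

    The tolerance band and the three labels are assumptions made for illustration.
    """
    distance = abs(error_signal(observed, set_point))
    if distance <= tolerance:
        return "positive"
    if distance <= 2 * tolerance:
        return "neutral"
    return "negative"


# Hypothetical example: a body-temperature-like set point of 37.0.
labels = [valence(t, set_point=37.0) for t in (37.0, 37.8, 39.5)]
```

Anything richer than this (learning, weighting, hierarchy) would be built on top of the same distance-from-set-point signal.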

I briefly referenced these above, but left it all behind links.

I think there are also some other lines of evidence that are less compelling to me but seem worth mentioning:

  • People have managed to build AI out of control systems minimizing prediction error, albeit doing so, as I propose is necessary, by having some fixed set points that prevent dark room problems.
  • Neurons do seem to function like simple control systems, though I think we have yet to determine with sufficient certainty that that is all that is going on.
  • Predictive coding admits explanations for many phenomena, but this risks just-so stories of the sort we see when evolutionary psychology tries to claim more than it can.

One is that it's elegant, simple, and parsimonious.

I certainly agree here. Furthermore I think it makes sense to try and unify prediction with other aspects of cognition, so I can get that part of the motivation (although I don't expect that humans have simple values). I just think this makes bad predictions.

Control systems are simple; they look to me to be the simplest thing we might reasonably call "alive" or "conscious" if we try to redefine those terms in ways that are not anchored on our experience here on Earth.

No disagreement here.

and this is claimed to always contain a signal of positive, negative, or neutral judgement.

Yeah, this seems like an interesting claim. I basically agree with the phenomenological claim. This seems to me like evidence in favor of a hierarchy-of-thermostats model (with one major reservation which I'll describe later). However, not particularly like evidence of the prediction-error-minimization perspective. We can have a network of controllers which express wishes to each other separately of predictions. Yes, that's less parsimonious, but I don't see a way to make the first work without dubious compromises.

Here's the reservation which I promised -- if we have a big pile of controllers, how would we know (based on phenomenal experience) that controllers attach positive/negative valence "locally" to every percept?

Forget controllers for a moment, and just suppose that there's any hierarchy at all. It could be made of controller-like pieces, or neural networks learning via backprop, etc. As a proxy for conscious awareness, let's ask: what kind of thing can we verbally report? There isn't any direct access to things inside the hierarchy; there's only the summary of information which gets passed up the hierarchy.

In other words: it makes sense that low-level features like edge detectors and colors get combined into increasingly high-level features until we recognize an object. However, it's notable that our high-level cognition can also purposefully attend to low-level features such as lines. This isn't really predicted by the basic hierarchy picture -- more needs to be said about how this works.

So, similarly, we can't predict that you or I verbally report positive/negative/neutral attaching to percepts from the claim that the sensory hierarchy is composed of units which are controllers. A controller has valence in that it has goals and how-it's-doing on those goals, but why should we expect that humans verbally report the direct experience of that? Humans don't have direct conscious experience of everything going on in neural circuitry.

This is not at all a problem with minimization of prediction error; it's more a question about hierarchies of controllers.

So, similarly, we can't predict that you or I verbally report positive/negative/neutral attaching to percepts from the claim that the sensory hierarchy is composed of units which are controllers. A controller has valence in that it has goals and how-it's-doing on those goals, but why should we expect that humans verbally report the direct experience of that? Humans don't have direct conscious experience of everything going on in neural circuitry.

Yeah, this is a good point and I agree it's one of the things that I am looking for others to verify with better brain imaging technology. I find myself in the position of working ahead of what we can completely verify now because I'm willing to take the bet that it's right or at least right enough that however it's wrong won't throw out the work I do.

In light of this second reason, I'll add to my first reason that it seems maximally parsimonious that if we were looking for an origin of valence it would have to be about something simple that could be done by a control system, and the simplest thing it could do that doesn't simply ignore the input is test how far off an observed input is from a set point. If something more complex is going on, I think we'd need an explanation for why sending a signal indicating distance from a set point is not enough.

I more or less said this in my other comment, but to reply to this directly -- it makes sense to me that you could have a hierarchy of controllers which communicate via set points and distances from set points, but this doesn't particularly make me think set points are predictions.

Artificial neural networks basically work this way -- signals go one way, "degree of satisfaction" goes the other way (the gradient). If the ANN is being trained to make predictions, then yeah, "predictions go one way, distance from set point goes the other" (well, distance + direction). However, ANNs can be trained to do other things as well; so the signals/corrections need not be about prediction.
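
To make the two directions concrete, here is a sketch of a single linear unit trained by per-sample gradient descent in plain Python. The data, learning rate, and function names are assumptions for illustration; a real ANN would have many units and use a library.

```python
def forward(w, b, x):
    """Signal flows one way: input -> output."""
    return w * x + b


def backward(w, b, x, target):
    """Correction flows the other way: gradient of 0.5 * (output - target)**2."""
    diff = forward(w, b, x) - target  # distance and direction from the target
    return diff * x, diff             # d(loss)/dw, d(loss)/db


def train(data, steps=200, lr=0.1):
    """Repeatedly send signals forward and corrections backward."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, target in data:
            grad_w, grad_b = backward(w, b, x, target)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# The target could be "the next input" (prediction) or any other desired output;
# the forward/backward mechanics are identical either way.
w, b = train([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

Nothing in `forward` or `backward` knows whether the target is a prediction; that is decided entirely by what the training data pairs with each input.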

People have managed to build AI out of control systems minimizing prediction error, albeit doing so, as I propose is necessary, by having some fixed set points that prevent dark room problems.

I've seen some results like this. I'm guessing there are a lot of different ways you could do it, but iirc what I saw seemed reasonable if what you want to do is build something like an imitation learner but also bias toward specific desired results. However, I think in that case "minimizing prediction error" meant a different thing than what you mean. So, what are you imagining?

If I take my ANN analogy, then fixing signals doesn't seem to help me do anything much. A 'set-point' is like a forward signal in the analogy, so fixing set points means fixing inputs to the ANN. But a fixed input is more or less a dead input as far as learning goes; the ANN will still just learn to produce whatever output behavior the gradient incentivises, such as prediction of the data. Fixing some of the outputs doesn't seem very helpful either.
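
A small sketch of the "dead input" point, assuming a single linear unit with one ordinary input and one input clamped to a constant. The constant `FIXED`, the data, and the hyperparameters are hypothetical choices for illustration.

```python
FIXED = 5.0  # hypothetical clamped "set point" fed in as a constant input


def train_with_clamped_input(data, epochs=1000, lr=0.02):
    """Fit output = w_x * x + w_fixed * FIXED by per-sample gradient descent."""
    w_x, w_fixed = 0.0, 0.0
    for _ in range(epochs):
        for x, target in data:
            diff = (w_x * x + w_fixed * FIXED) - target
            # Gradient of 0.5 * diff**2 with respect to each weight.
            w_x -= lr * diff * x
            w_fixed -= lr * diff * FIXED
    return w_x, w_fixed


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # targets follow 2 * x + 1
w_x, w_fixed = train_with_clamped_input(data)
offset = w_fixed * FIXED  # the clamped input only contributes this constant
```

In this sketch the clamped input ends up doing the job of a bias term: the unit still fits whatever the gradient rewards, and nothing pulls its output toward the clamped value itself.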

Also, how is this parsimonious?

I find speaking in terms of minimization of prediction error useful to my own intuitions, but it does increasingly look like what I'm really thinking of are just generic homeostatic control systems. I like talking in terms of prediction error because I think it makes the translation to other similar theories easier (I'm thinking other Bayesian brain theories and Friston's free energy theory), but I think it's right to think I'm just thinking about a control system sending signals to hit a set point, even if some of those control systems do learn in a way that looks like Bayesian updating or minimization of prediction error and others don't.

The sense in which I think of this theory as parsimonious is that I don't believe there is a simpler mechanism that can explain what we see. If we could talk about these phenomena in terms of control systems without using signals about distance from set points I'd prefer that, and I think the complexity we get from having to build things out of such simple components is the right move in terms of parsimony rather than having to postulate additional mechanisms. As long as I can explain things adequately without having to introduce more moving parts I'll consider it maximally parsimonious as far as my current knowledge and needs go.

I'm still interested if you can say more about how you view it as minimizing a warped prediction. I mentioned that if you fix some parts of the network, they seem to end up getting ignored rather than producing goal-directed behaviour. Do you have an alternate picture in which this doesn't happen? (I'm not asking you to justify yourself rigorously; I'm curious for whatever thoughts or vague images you have here, though of course all the better if it really works)

Ah, I guess I don't expect it to end up ignoring the parts of the network that can't learn because I don't think error minimization, learning, or anything else is a top-level goal of the network. That is, there are only low-level control systems interacting, and the parts that can't learn avoid being ignored by being more powerful in various ways, probably by being positioned in the network such that they have more influence on behavior than the parts that perform Bayesian learning. This does mean I expect those parts of the network don't learn or learn inefficiently, but they do that because it's adaptive.

For example, I would guess something in humans like the neocortex is capable of Bayesian learning, but it only influences the rest of the system through narrow channels that prevent it from "taking over" and making humans true prediction error minimizers, instead forcing them to do things that satisfy other set points. In buzz words you might say human minds are "complex, adaptive, emergent systems" built out of neurons with most of the function coming bottom up from the neurons or "from the middle", if you will, in terms of network topology.

We Need To Explain Why Humans Differentiate Goals and Beliefs, Not Just Why We Conflate Them
You mention that good/bad seem like natural categories. I agree that people often seem to mix up "should" and "probably is", "good" and "normal", "bad" and "weird", etc. These observations in themselves speak in favor of the minimize-prediction-error theory of values.
However, we also differentiate these concepts at other times. Why is that? Is it some kind of mistake? Or is the conflation of the two the mistake?
I think the mix-up between the two is partly explained by the effect I mentioned earlier: common practice is optimized to be good, so there will be a tendency for commonality and goodness to correlate. So, it's sensible to cluster them together mentally, which can result in them getting confused. There's likely another aspect as well, which has something to do with social enforcement (ie, people are strategically conflating the two some of the time?) -- but I'm not sure exactly how that works.

This seems like an important question: if all these phenomena really are ultimately the same thing and powered by the same mechanisms, why do we make distinctions between them and find those distinctions useful?

I don't have an answer I'm satisfied with, but I'll try to say a few words about what I'm thinking and see if that moves us along.

My first approximation would be that we're looking at things that we experience by different means and so give them different names because when we observe them they present in different ways. Goals (I assume by this you mean the cluster of things we might call desires, aversions, and generally intention towards action) probably tend to be observed by noticing the generation of signals going out that usually generate observable actions (movement, speech, etc.) whereas beliefs (the cluster of things that includes thoughts and maybe emotions) are internal and not sending out signals to action beyond mental action.

I don't know enough to be very confident in that, though, and think like you that it could be due to numerous reasons why it might make sense to think of them as separate even if they are fundamentally not very different.

On my understanding of how things work, goals and beliefs combine to make action, so neither one is really mentally closer to action than the other. Both a goal and a belief can be quite far removed from action (eg, a nearly impossible goal which you don't act on, or a belief about far-away things which don't influence your day-to-day). Both can be very close (a jump scare seems most closely connected to a belief, whereas deciding to move your hand and then doing so is more goal-like -- granted both those examples have complications).

If, in conversation, the distinction comes up explicitly, it is usually because of stuff like this:

  • Alice makes an unclear statement; it sounds like she could be claiming A or wanting A.
    • Bob asks for clarification, because Bob's reaction to believing A is true would be very different from his reaction to believing A is good (or, in more relative terms, knowing Alice endorses one or the other of those). In the first case, Bob might plan under the assumption A; in the second, Bob might make plans designed to make A true.
  • Alice is engaging in wishful thinking, claiming that something is true when really the opposite is just too terrible to consider.
    • Bob wants to be able to rely on Alice's assertions, so Bob is concerned about the possibility of wishful thinking.
    • Or, Bob is concerned for Alice; Bob doesn't want Alice to ignore risks due to ignoring negative possibilities, or fail to set up back-up plans for the bad scenarios.

My point is that it doesn't seem to me like a case of people intuitively breaking up a thing which is scientifically really one phenomenon. Predicting A and wanting A seem to have quite different consequences. If you predict A, you tend to restrict attention to cases where it is true when planning; you may plan actions which rely on it. If you want A, you don't do that; you are very aware of all the cases where not-A. You take actions designed to ensure A.

It's Easy to Overestimate The Degree to which Agents Minimize Prediction Error
I often enjoy variety -- in food, television, etc -- and observe other humans doing so. Naively, it seems like humans sometimes prefer predictability and sometimes prefer variety.
However: any learning agent, almost no matter its values, will tend to look like it is seeking predictability once it has learned its environment well. It is taking actions it has taken before, and steering toward the environmental states similar to what it always steers for. So, one could understandably reach the conclusion that it is reliability itself which the agent likes.
In other words: if I seem to eat the same foods quite often (despite claiming to like variety), you might conclude that I like familiarity when it's actually just that I like what I like. I've found a set of foods which I particularly enjoy (which I can rotate between for the sake of variety). That doesn't mean it is familiarity itself which I enjoy.
I'm not denying that mere familiarity has some positive valence for humans; I'm just saying that for arbitrary agents, it seems easy to over-estimate the importance of familiarity in their values, so we should be a bit suspicious about it for humans too. And I'm saying that it seems like humans enjoy surprises sometimes, and there's evolutionary/machine-learning reasoning to explain why this might be the case.

I've replied about surprise, its benefits, and its mechanism a couple times now. My theory is that surprise is by itself bad but can be made good by having control systems that expect surprise and send a good signal when surprise is seen. Depending on how this gets weighted, this creates a net positive mixed emotion where surprise is experienced as something good and serves many useful purposes.

I think this mostly dissolves the other points you bring up that I read as contingent on thinking the theory doesn't predict humans would find variety and surprise good in some circumstances, but if not please let me know what the remaining concerns are in light of this explanation (or possibly object to my explanation of why we expect surprise to sometimes be net good).

I think this mostly dissolves the other points you bring up that I read as contingent on thinking the theory doesn't predict humans would find variety and surprise good in some circumstances, but if not please let me know what the remaining concerns are in light of this explanation (or possibly object to my explanation of why we expect surprise to sometimes be net good).

Yeah, I noted that I and other humans often seem to enjoy surprise, but I also had a different point I was trying to make -- the claim that it makes sense that you'd observe competent agents doing many things which can be explained by minimizing prediction error, no matter what their goals.

But, it isn't important for you to respond further to this point if you don't feel it accounts for your observations.

Evolved Agents Probably Don't Minimize Prediction Error
If we look at the field of reinforcement learning, it appears to be generally useful to add intrinsic motivation for exploration to an agent. This is the exact opposite of predictability: in one case we add reward for entering unpredictable states, whereas in the other case we add reward for entering predictable states. I've seen people try to defend minimizing prediction error by showing that the agent is still motivated to learn (in order to figure out how to avoid unpredictability). However, the fact remains: it is still motivated to learn strictly less than an unpredictability-loving agent. RL has, in practice, found it useful to add reward for unpredictability; this suggests that evolution might have done the same, and suggests that it would not have done the exact opposite. Agents operating under a prediction-error penalty would likely under-explore.
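A toy sketch of the contrast above, under loud assumptions: the prediction error of a state is modeled as `1 / (1 + visits)` (unfamiliar states are poorly predicted), the agent is purely greedy, and the function `run` and its parameters are invented for illustration rather than taken from any RL library. The only difference between the two agents is the sign on the prediction-error term.

```python
def run(sign, n_states=5, steps=20):
    """Greedy agent whose only reward is sign * prediction_error.

    sign=+1 rewards unpredictability (an exploration bonus);
    sign=-1 penalizes it (prediction-error minimization).
    """
    visits = [0] * n_states
    for _ in range(steps):
        # Assumed error model: the less a state has been visited,
        # the worse the agent predicts it.
        errors = [1.0 / (1 + v) for v in visits]
        rewards = [sign * e for e in errors]
        # Pick the highest-reward state; ties go to the lowest index.
        choice = max(range(n_states), key=lambda s: rewards[s])
        visits[choice] += 1
    return visits


curious = run(+1)        # spreads its visits across all states
error_averse = run(-1)   # returns to the first state it tried, forever
print(curious, error_averse)
```

In this toy setting the error-penalized agent never leaves the first state it visits, which is the under-exploration worry in miniature. Real agents have other drives and stochasticity, so this only illustrates the direction of the effect.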

I ended up replying to this in a separate post since I felt like similar objections kept coming up. My short answer is: minimization of prediction error is minimization of error at predicting input to a control system that may not be arbitrarily free to change its prediction set point. This means that it won't always be the case that a control system is globally trying to minimize prediction error, but instead is locally trying to minimize prediction error, although it may not be able to become less wrong over time because it can't change the prediction to better predict the input.

From an evolutionary perspective my guess is that true Bayesian updating is a fairly recent adaptation, and most minimization of prediction error is minimization of error of mostly fixed prediction set points that are beneficial for survival.

I left a reply to this view at the other comment. However, I don't feel that point connects very well to the point I tried to make.

Your OP talks about minimization of prediction error as a theory of human value, relevant to alignment. It might be that evolution re-purposes predictive machinery to pursue adaptive goals; this seems like the sort of thing evolution would do. However, this leaves the question of what those goals are. You say you're not claiming that humans globally minimize prediction error. But, partly because of the remarks you made in the OP, I'm reading you as suggesting that humans do minimize prediction error, but relative to a skewed prediction.

Are human values well-predicted by modeling us as minimizing prediction error relative to a skewed prediction?

My argument here is that evolved creatures such as humans are more likely to (as one component of value) steer toward prediction error, because doing so tends to lead to learning, which is broadly valuable. This is difficult to model by taking a system which minimizes prediction error and skewing the predictions, because it is the exact opposite.

Elsewhere, you suggest that exploration can be predicted by your theory if there's a sort of reflection within the system, so that prediction error is predicted as well. The system therefore has an overall set-point for prediction error and explores if it's too small. But I think this would be drowned out. If I started with a system which minimizes prediction error and added a curiosity drive on top of it, I would have to entirely cancel out the error-minimization drive before I started to see the curiosity doing its job successfully. Similarly for your hypothesized part. Everything else in the system is strategically avoiding error. One part steering toward error would have to out-vote or out-smart all those other parts.

Now, that's over-stating my point. I don't think human curiosity drive is exactly seeking maximum prediction error. I think it's more likely related to the derivative of prediction error. But the point remains that that's difficult to model as minimization of a skewed prediction error, and requires a sub-part implementing curiosity to drown out all the other parts.

Instead of modeling human value as minimization of error of a skewed prediction, why not step back and model it as minimizing "some kind of error"? This seems no less parsimonious (since you have to specify the skew anyway), and leaves you with all the same controller machinery to propagate error through the system and learn to avoid it.

I have not read all the comments yet, so maybe this is redundant, but anyway...

I think it is plausible that humans and other life forms are mostly made up of layers of control systems, stacked on each other. However, it does not follow from this that humans are trying to minimise prediction error.

There is probably some part of the brain that is trying to minimise prediction error, possibly organised as a control system that tries to keep expectations in line with reality, because it is useful to be able to accurately predict the world.

But if we are a stack of control systems, then I would expect other parts of the brain to be control systems for other things. E.g. Having the correct level of blood sugar, having a good amount of social interaction, having a good amount of variety in our lives.

I can imagine someone figuring out more or less how the prediction control system works and what it is doing, then looking at everything else, noticing the similarity (because it is all types of control systems and evolution tends to reuse structures) and thinking "Hmm, maybe it is all about predictions". But I also think that would be wrong.

In other words: if I seem to eat the same foods quite often (despite claiming to like variety), you might conclude that I like familiarity when it's actually just that I like what I like. I've found a set of foods which I particularly enjoy (which I can rotate between for the sake of variety). That doesn't mean it is familiarity itself which I enjoy.

Agents trade off exploring and exploiting, and when they're exploiting they look like they're minimizing prediction error?

Agents trade off exploring and exploiting, and when they're exploiting they look like they're minimizing prediction error?

That's one hypothesis in the space I was pointing at, but not particularly the thing I expect to be true. Or, maybe I think it is somewhat true as an observation about policies, but doesn't answer the question of how exactly variety and anti-variety are involved in our basic values.

A model which I more endorse:

We like to make progress understanding things. We don't like chaotic stuff with no traction for learning (like TV fuzz). We like orderly stuff more, but only while learning about it; it then fades to zero, meaning we have to seek more variety for our hedonic treadmill. We really like patterns which keep establishing and then breaking expectations, especially if there is always a deeper pattern which makes sense of the exceptions (like music); these patterns maximize the feeling of learning progress.

But I think that's just one aspect of our values, not a universal theory of human values.
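One way to make the "learning progress" model above concrete is to score a stream of prediction errors by how much the error has dropped recently, rather than by its level. This is only a sketch: the function name `learning_progress`, the window size, and the example error sequences are assumptions chosen for illustration.

```python
def learning_progress(errors, window=3):
    """Drop in mean prediction error from the previous window to the latest one."""
    if len(errors) < 2 * window:
        raise ValueError("need at least two full windows of errors")
    older = errors[-2 * window:-window]
    recent = errors[-window:]
    return sum(older) / window - sum(recent) / window


tv_fuzz = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]     # always unpredictable
mastered = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # fully learned
still_learning = [1.0, 0.75, 0.5, 0.5, 0.25, 0.0]

for name, errs in [("tv_fuzz", tv_fuzz),
                   ("mastered", mastered),
                   ("still_learning", still_learning)]:
    print(name, learning_progress(errs))
```

Both the chaotic stream and the fully mastered one score zero, and only the stream where error is falling scores positively, which matches the claim that interest fades once the pattern is learned.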

I think this is sort of sideways. It's true, but I think it also misses the deeper aspects of the theory I have in mind.

Yes, from easily observed behavior that's what it looks like: exploitation is about minimizing prediction error and exploration is about, if not maximizing it, then at least not minimizing it. But the theory says that if we see exploration and the theory is correct, then exploration must somehow be built out of things that are ultimately trying to minimize prediction error.

I hope to give a more precise, mathematical explanation of this theory in the future, but for now I'll give the best English language explanation I can of how exploration might work (keeping in mind we should be able to eventually find out exactly how it works if this theory is right with sufficient brain scanning technology).

I suspect exploration happens because a control system in the brain takes as input how much error minimization it observes as measured by how many good and bad signals get sent in other control systems. It then has a set point for some relatively stable and hard to update amount of bad signals it expects to see, and if it has not been seeing enough surprise/mistakes then it starts sending its own bad signals encouraging "restlessness" or "exploration". This is similar to my explanation of creativity from another comment.
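A minimal sketch of that kind of second-order control system, with the caveat that everything here is hypothetical: the name `restlessness_signal`, the set point of 0.2, and the representation of other systems' outputs as a list of 0/1 "bad signal" flags are assumptions made only to show the shape of the idea.

```python
def restlessness_signal(recent_bad_signals, expected_rate=0.2, gain=1.0):
    """Bad signal emitted when too little error is observed elsewhere.

    recent_bad_signals: 1 where another control system reported error,
    0 where it did not. expected_rate is the (hard to update) set point
    for how much error this system expects to see.
    """
    if not recent_bad_signals:
        raise ValueError("need at least one observation")
    observed_rate = sum(recent_bad_signals) / len(recent_bad_signals)
    shortfall = expected_rate - observed_rate
    # Only a shortfall of surprise produces a signal; excess surprise
    # is left to the first-order systems to handle.
    return max(0.0, gain * shortfall)


print(restlessness_signal([0, 0, 0, 0, 0]))  # everything predictable
print(restlessness_signal([1, 0, 1, 0, 1]))  # plenty of surprise already
```

When nothing has been surprising the system emits a positive "restlessness" signal, and when there has been more error than expected it stays silent.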

It sure seems like it's possible for something to be both unpredictable and good (creativity, agenty people who positively surprise you, children, etc). Or predictable and bad (boring routines, highly derivative art, solitary confinement).

If that doesn't falsify the theory, then what would?

This is a problem if we attempt to explain things only in terms of minimization of prediction error, usually considered in the form of the "dark room" problem. The solution to this is to allow the system to have, as I mention, set points that are slow to update or never update. These ensure humans keep doing things they would otherwise not do because they would be surprising.

To consider your cases of surprising and good and predictable and bad, I believe I have plausible explanations of what's going on in these phenomena, although I will also freely admit that these explanations border on being just-so stories because we currently lack the evidence to verify them from the ground up.

  • surprising and good:
    • creativity: In order to explore the space of possible solutions without getting stuck in local maxima, creativity seems valuable. My theory is that the drive to do surprising things that we call creativity is powered by control systems looking at other control systems and predicting they will generate negative signals indicating error. This makes the systems causing creativity something like financial derivatives.
    • agenty people who positively surprise you: I expect this is a kind of mixed emotion. Surprise is by itself bad, but when it is mixed in with lots of other things that cause other control systems to send good signals because their set points are validated, it can produce a net positive experience. This can even lead us to learn to expect surprise, via secondary control systems that monitor the output of other control systems to see when they are surprised, so that we end up thinking of surprise as secondarily good.
    • children: I expect much of the reasoning around children will be grounded in systems that intentionally don't track truth but instead use set points that are evolutionarily adaptive to get people to do things that are good for differential reproduction and bad for truth.
  • predictable and bad:
    • boring routines: Not all people find boring routines bad, but among those who do I expect the mechanism to be set points encouraging more error ("creativity") as described above.
    • highly derivative art: Derivative art probably looks a lot like boring routines: some people like it because it is predictable, others don't because they are "restless" in that they have control systems expecting more error (I don't think this is exactly openness to experience but it does overlap that psychometric).
    • solitary confinement: This prevents many set points from coming true that are not about predicting reality but about survival and are minimally mutable. People experience solitary confinement as bad because they keep predicting they will be outside, see friends, etc., since to give those up is to give up important set points that enable survival. So it's a kind of continual hell of being disappointed in every moment with the knowledge that it's not going to change.

It's probably helpful to note that when I say "prediction" I'm often equivocating with "set point" in that I don't think of predictions in this theory as necessarily meant to always be predictions of what will actually be seen, even if they often are, but instead as set points in control systems that get them to combine in particular ways by predicting inputs, even if those predictions are sometimes forced by biology and evolution to be consistently wrong or right via observing other signals, etc.

I'm somewhat confused about whether you are claiming something other than Friston's notion that everything the brain is doing can be described as minimizing free energy/prediction error, together with the claims that this is important for understanding what human values are and needs to be understood for AI alignment purposes.

If this is so, it sounds close to a restatement of my 'best guess of how minds work' with some, in my opinion, unhelpful simplifications: ignoring the signal inserted into predictive processing via interoception of bodily states, which is actually an important part of the picture; ignoring the emergent 'agenty' properties of evolutionarily encoded priors; and calling it a theory of human values.

(I'm not sure how to state it positively, but I think it would be great if at least one person from the LW community bothered to actually understand my post, as "understanding each sentence".)

(I'm not sure how to state it positively, but I think it would be great if at least one person from the LW community bothered to actually understand my post, as "understanding each sentence".)

FWIW I'm not actually sure this is possible without you writing a sequence explaining the model. There are too many sentences loaded with inferential distance that I couldn't cross, and I didn't know the relevant places to start to begin to cross them.

It looks like I read your post but forgot about it. I'll have to look at it again.

I am building this theory in a way that I think is highly compatible with Friston, although I also don't have a gears-level understanding of Friston, so I find it easier to think in terms of control systems which appear to offer an equivalent model to me.

(My sense was that Abram engaged pretty extensively with the post, though I can't fully judge since I've historically bounced off of a lot of the predictive processing stuff, including your post)

If you define what humans want in terms of states of the brain, and you don't want the AI to just intervene directly on peoples' brains, there's a lot of extra work that has to happen, which I think will inevitably "de-purify" the values by making them dependent on context and on human behavior. Here's what I think this might look like:

You have some model ("minimize prediction error") that identifies what's good, and try to fit the brain's actual physiology to this model, in order to identify what's going on physically when humans' values are satisfied. But of course what humans want isn't a certain brain state, humans want things to happen in the world. So your AI needs to learn what changes in the world are the "normal human context" in which it can apply this rating based on brain states. But heroin still exists, so if we don't want it to learn that everyone wants heroin, this understanding of the external world has to start getting really value-laden, and maybe value-laden in a way based on human behaviors and not just human brain states.

One further thing to think about: this doesn't engage with meta-ethics. Our meta-ethical desires are about things like what our desires should be, what are good procedural rules of decision-making (simple decision-making procedures often fail to care about continuity of human identity), and how to handle population ethics. The learn-from-behavior side puts these on equal footing with our desire for e.g. eating tasty food, because they're all learned from human behavior. But if you ground our desire for tasty food in brain structure, this at the very least puts opinions on stuff like tasty food and opinions on stuff like theory of identity on very different footings, and might even cause some incompatibilities. Not sure.

Overall I think reading this post increased how good of an idea I think it is to try to ground human liking in terms of a model of brain physiology, but I think this information has to be treated quite indirectly. We can't just give the AI preferences over human brain states, it needs to figure out what these brain states are referring to in the outside world, which is a tricky act of translation / communication in the sense of Quine and Grice.

I appreciate this sentiment, and do think there's a dangerous, bad reduction of values to valence grounded in the operation of the brain that ignores much of what we care about, and that that extra stuff that we care about is also expressed as valence grounded in the operation of the brain. All the concerns you bring up must be computed somewhere, that somewhere is human brains, and if what those brains do is "minimize prediction error" then those concerns are also expressions of prediction error minimization. This to me is what's exciting about a grounding like the one I'm considering: it's embedded in the world in a way that means we don't leave anything out (unless there's some kind of "spooky" physics happening that we can't observe, which I consider unlikely) such that we naturally capture all the complexity you're concerned about, though it may take quite a bit to compute it all.

The difficulty is that we want to take human values and put them into an AI that doesn't do prediction error minimization in the human sense, but instead does superhumanly competent search and planning. But if you have a specific scheme in mind that could outperform humans without leaving anything out, I'd be super interested.

As of yet, no, although this brings up an interesting point, which is that I'm looking at this stuff to find a precise grounding because I don't think we can develop a plan that will work to our satisfaction without it. I realize lots of people disagree with me here, thinking that we need the method first and the value grounding will be worked out instrumentally by the method. I dislike this because it makes it hard to verify the method other than by observing what an AI produced by that method does, and this is a dangerous verification method due to the risk of a "treacherous" turn. Such a turn isn't so much treacherous as it is the one that could have been predicted if we had bothered to have a solid theory of what the method we were using really implied in terms of the thing we cared about, and if we had bothered to know what the thing we cared about fundamentally was.

Also I suspect we will be able to think of our desired AI in terms of control systems and set points, because I think we can do this for everything that's "alive", although it may not be the most natural abstraction to use for its architecture.

One point of confusion that keeps coming up seems worth clarifying in a top level comment to this post:

minimization of prediction error is minimization of error at predicting input to a control system that may not be arbitrarily free to change its prediction set point

This means that it won't always be the case that a control system is globally trying to minimize prediction error, but instead is locally trying to minimize prediction error, although it may not be able to become less wrong over time because it can't change the prediction to better predict the input.

My suspicion is that in humans the neocortex mostly is normatively Bayesian, made out of lots of identical control systems that get specialized to do different things and each one of them can freely update in a Bayesian manner to optimally minimize prediction error. The rest of it is probably a lot less Bayesian, with harder or impossible to update prediction set points that serve purposes ultimately necessary for survival and reproduction that got set via evolutionary processes.

I have two points of confusion about this:

  • How does it work? I made some remarks in this other comment, and more extensive remarks below.
  • How is minimizing error from a fixed/slow-moving set-point different from pursuing arbitrary goals? What's left of the minimization-of-prediction-error hypothesis?

When I think of minimizing prediction error, I think of minimizing error of something which is well-modeled as a predictor. A set-point for sex, say, doesn't seem like this -- many organisms get far less than their satiation level of sex, but the set-point evolved based on genetic fitness, not predictive accuracy. The same is true for other scarce resources in the ancestral environment, such as sugar.

Is your model that evolution gets agents to pursue useful goals by warping predictive circuitry to make false-but-useful predictions? Or is it that evolution would fix the ancient predictive circuitry if it were better at modifying old-but-critical subsystems in big jumps, but can't? I find the second unlikely. The first seems possible, but strains my credulity about modeling the warped stuff as "prediction".

As for the how-does-it-work point: if we start with a predictive hierarchy but then warp some pieces to fix their set-points, how do we end up with something which strategically minimizes the prediction error of those parts? When I think of freezing some of the predictions, it seems like what you get is a world-model which is locked into some beliefs, not something which strategizes to make those predictions true.

As I mentioned in the other comment, I have seen other work which gets agents out of this sort of thing; but it seems likely they had different definitions of key ideas such as minimizing prediction error, so your response would be illuminating.

  • Well-working Bayesian systems minimize prediction error in the sense that they tweak their own weights (that is, probabilities) so as to reduce future error, in response to stimuli. They don't have a tendency to produce outputs now which are expected to reduce later prediction error. This is also true of small parts in a Bayesian network; each is individually responsible for minimizing its own prediction error of downstream info, using upstream info as helpful "freebie" information which it can benefit from in its downstream predictions. So, if you freeze a small part, its downstream neighbors will simply stop using it, because its frozen output is not useful. Upstream neighbors get the easy job of predicting the frozen values. So a mostly-Bayesian system with some frozen parts doesn't seem to start trying to minimize the prediction error of the frozen bit in other ways, because each part is responsible for minimizing its own error.
  • Similarly for artificial neural networks: freezing a sub-network makes its feedforward signal useless to downstream neurons, and its backprop information little more interesting than that. Other systems of predictive hierarchies seem likely to get similar results.

The problem here is that these systems are only trying to minimize prediction error on the current step. A predictive system may have long-term models, but error is only back-propagated in a way which encourages each individual prediction to be more accurate for the time-step it was made, not in a way which encourages outputs to strategically make future inputs easier to predict.

So, the way I see it, in order for a system to strategically act so as to minimize future prediction error of a frozen sub-part, you'd need a part of the system to act as a reinforcement learner whose reward signal was the prediction error of the other part. This is not how parts of a predictive hierarchy tend to behave. Parts of a predictive hierarchy learn to reduce their own predictive error -- and even there, they learn to produce outputs which are more similar to their observations, not to manipulate things so as to better match predictions.
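The distinction in the last sentence can be sketched with two one-line update rules. This is a deliberately simplified illustration, not a model of any real predictive hierarchy: the function names, the scalar state, and the step sizes are all assumptions. Both rules shrink the same error term, but one does it by changing the prediction and the other by changing the world.

```python
def predictor_step(prediction, observation, lr=0.5):
    """Reduce error by moving the prediction toward what was observed."""
    return prediction + lr * (observation - prediction)


def controller_step(world_state, set_point, gain=0.5):
    """Reduce error by acting so the observed state moves toward a fixed set point."""
    return world_state + gain * (set_point - world_state)


# Predictor: the world stays at 10.0, the prediction is what moves.
prediction, observation = 0.0, 10.0
for _ in range(20):
    prediction = predictor_step(prediction, observation)

# Controller: the set point stays at 0.0, the world is what moves.
set_point, world_state = 0.0, 10.0
for _ in range(20):
    world_state = controller_step(world_state, set_point)

print(prediction, observation)   # prediction has converged to the world
print(world_state, set_point)    # world has been pushed to the set point
```

Freezing the prediction in `predictor_step` (setting `lr=0`) just leaves a persistent error. It does not turn the predictor into `controller_step`, which needs a separate pathway from error to action.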

Abram—I've gone back and forth a few times, but currently think that "gradient descent is myopic" arguments don't carry through 100% when the predictions invoke memorized temporal sequences (and hierarchies or abstractions thereof) that extend arbitrarily far into the future. For example, if I think someone is about to start singing "Happy birthday", I'm directly making a prediction about the very next moment, but I'm implicitly making a prediction about the next 30 seconds, and thus the prediction error feedback signal is not just retrospective but also partly prospective.

I agree that we should NOT expect "outputs to strategically make future inputs easier to predict", but I think there might be a non-myopic tendency for outputs that strategically make the future conform to the a priori prediction. See here, including the comments, for my discussion, trying to get my head around this.

Anyway, if that's right, that would seem to be the exact type of non-myopia needed for a hierarchical Bayesian prediction machine to also be able to act as a hierarchical control system. (And sorry again if I'm just being confused.)

I appreciate your thoughts! My own thinking on this is rapidly shifting and I regret that I'm not producing more posts about it right now. I will try to comment further on your linked post. Feel encouraged to PM me if you write/wrote more in this and think I might have missed it; I'm pretty interested in this right now.

It's not necessarily Gordon's view/answer in his model, but my answers are:

  • yes, evolution inserts these 'false predictions'; (Friston calls them fixed priors, which I think is a somewhat unfortunate terminology choice)
  • if you put on Dennett's stances lens #3 (looking at systems as agents), these 'priors' are likely described as 'agents' extracting some features from the p.p. world-modelling apparatus and inserting errors accordingly; you correctly point out that in some architectures such parts would just get ignored, but in my view what happens in humans is more like a board of Bayesian subagents voting
  • note: it's relatively easy to turn a p.p. engine into something resembling reinforcement learning by warping it to seek 'high reward' states, where by states you should not imagine 'states of the world', but 'states of the body'; evolution designed the chemical control circuitry of hormones before - in some sense the predictive processing machinery is built on top of some older control systems, and is seeking goal states defined by them
  • (pure guess) consciousness and language and this style of processing is another layer, where the p.p. machinery is 'predicting' something like a stream of conscious thoughts, which somehow has its own consistency rules and can implement verbal reasoning.

Overall I'm not sure to what extent you expect clean designs from evolution. I would expect messy design, implementing predictive processing for hierarchical world-modelling/action generation, mess of subagents + emotions + hacked connection to older regulatory systems to make the p.p. engine seek evolution's goals, and another interesting thing going on with language and memes.

you correctly point out that in some architectures such parts would just get ignored, but in my view what happens in humans is more like a board of Bayesian subagents voting

How does credit assignment work to determine these subagents' voting power (if at all)? I'm negative about viewing it as 'prediction with warped parts' ("fixed priors"), but setting that aside, one way or the other there's the concrete question of what's actually going on at the learning algorithm level. How do you set something up which is not incredibly myopic? (For example, if subagents are assigned credit based on who's active when actual reward is received, that's going to be incredibly myopic -- subagents who have long-term plans for achieving better reward through delayed gratification can be undercut by greedily shortsighted agents, because the credit assignment doesn't reward you for things that happen later; much like political terms of office making long-term policy difficult.)

Overall I'm not sure to what extent you expect clean designs from evolution.

I wasn't talking about parsimony because I expect the brain to be simple, but rather because a hypothesis which has a lot of extra complexity is less likely to be right. I expect human values to be complex, but still think a desire for parsimony such as sometimes motivates PP to be good in itself -- a parsimonious theory which matched observations well would be convincing in a way a complicated one would not be, even though I expect things to be complicated, because the complicated theory has many chances to be wrong.

Based on

(For example, if subagents are assigned credit based on who's active when actual reward is received, that's going to be incredibly myopic -- subagents who have long-term plans for achieving better reward through delayed gratification can be undercut by greedily shortsighted agents, because the credit assignment doesn't reward you for things that happen later; much like political terms of office making long-term policy difficult.)

it seems to me you have in mind a different model than me (sorry if my description was confusing). In my view, you have the world-modelling, "preference aggregation" and action generation done by the "predictive processing engine". The "subagenty" parts basically extract evolutionarily relevant features of this (like: hunger level), and insert error signals not only about the current state, but about future plans. (Like: if the p.p. engine were planning a trajectory which is harmful to the subagent, the subagent would insert the error signal.)

Overall your first part seems to assume more something like reinforcement learning where parts are assigned credit for good planning. I would expect the opposite: one planning process which is "rewarded" by a committee.
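To make "one planning process rewarded by a committee" a bit more concrete, here is a toy sketch. It is an assumption-laden illustration and not a claim about neural implementation: the subagents `hunger_error` and `fatigue_error`, their thresholds, and the plan fields are all invented. The point is only that the subagents never plan or earn credit; they just attach error to candidate plans, and a single planner picks the plan with the least total error.

```python
def hunger_error(plan):
    """Hypothetical subagent: objects to plans with long gaps between meals."""
    return max(0.0, plan["hours_without_food"] - 6)


def fatigue_error(plan):
    """Hypothetical subagent: objects to plans with too long a waking stretch."""
    return max(0.0, plan["hours_awake"] - 16)


def choose_plan(plans, committee):
    """Single planner: pick the candidate with the lowest summed committee error."""
    return min(plans, key=lambda p: sum(member(p) for member in committee))


plans = [
    {"name": "work through the night", "hours_without_food": 10, "hours_awake": 20},
    {"name": "eat, then sleep", "hours_without_food": 2, "hours_awake": 15},
]
best = choose_plan(plans, [hunger_error, fatigue_error])
print(best["name"])
```

Because the errors are attached to whole future trajectories rather than to whoever was active when reward arrived, this setup sidesteps the credit-assignment question for the subagents, though it leaves open how the planner itself evaluates long-horizon plans.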

parsimonious theory which matched observations well

With parsimony... predictive processing in my opinion explains a lot for a relatively simple and elegant model. On the theory side it's for example

  • how you can make a Bayesian approximator using local computations
  • how hierarchical models can grow in an evolutionarily plausible way
  • why predictions, why actions
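
As a rough illustration of the first point, here is a minimal sketch of a single predictive-coding unit, assuming a Gaussian prior over a hidden cause and a Gaussian observation of it with an identity mapping. These modelling choices, and the names `infer` and `exact_posterior_mean`, are assumptions made for the example. The unit only ever uses two locally available prediction errors, yet its estimate settles on the exact Bayesian posterior mean for this simple case.

```python
def infer(u, prior_mean, prior_var, obs_var, steps=2000, dt=0.01):
    """Estimate a hidden cause from observation u by reducing local prediction errors."""
    phi = prior_mean  # start from the prior expectation
    for _ in range(steps):
        # Precision-weighted error between the estimate and the prior expectation.
        eps_prior = (phi - prior_mean) / prior_var
        # Precision-weighted error between the observation and the estimate.
        eps_obs = (u - phi) / obs_var
        # Gradient ascent on the log posterior, using only the two errors above.
        phi += dt * (eps_obs - eps_prior)
    return phi


def exact_posterior_mean(u, prior_mean, prior_var, obs_var):
    """Closed-form posterior mean for a Gaussian prior and Gaussian likelihood."""
    prior_precision = 1.0 / prior_var
    obs_precision = 1.0 / obs_var
    return (prior_precision * prior_mean + obs_precision * u) / (prior_precision + obs_precision)


estimate = infer(u=2.0, prior_mean=3.0, prior_var=1.0, obs_var=0.5)
exact = exact_posterior_mean(u=2.0, prior_mean=3.0, prior_var=1.0, obs_var=0.5)
print(estimate, exact)
```

The linear-Gaussian case is chosen because the exact answer is available for comparison; with a nonlinear mapping the same kind of update would only approximate the posterior.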

On the side of how things feel for humans from the inside, for example

  • some phenomena about attention
  • what is that feeling when you are e.g. missing the right word, or something seems out of place
  • what's up with psychedelics
  • & more

On the neuroscience side

  • my non-expert impression is that evidence is growing that at least the cortex follows the pattern in which neurons at higher processing stages generate predictions that bias processing at lower levels

I don't think predictive processing should try to explain everything about humans. In one direction, animals are running on predictive processing as well, but are missing some crucial ingredient. In the opposite direction, simpler organisms had older control systems (e.g. hormones); we have them as well, and p.p. must in some sense be stacked on top of that.

For what it's worth, I actually do expect that something like predictive processing is also going on with other systems built out of stuff that is not neurons, such as control systems that use steroids (which include hormones in animals) or RNA or other things for signaling and yet other things for determining set points and error distances. As I have mentioned, I think of living things as being in the same category as steam engine governors and thermostats, all united by the operation of control systems that locally decrease entropy and produce information. Obviously there are distinctions that are interesting and important in various ways, but there are also important ways in which these distinctions are distractions from the common mechanism powering everything we care about.

We can't literally call this predictive coding since that theory is about neurons and brains, so a better name with appropriate historical precedent might be something like a "cybernetic" theory of life. Unfortunately, cybernetics has been cheapened over the years in ways that make it ring of hokum, so maybe there is some other way to name this idea that avoids that issue.
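
For readers who want the thermostat case spelled out, here is a minimal sketch of a negative feedback loop with a set point and an error distance. The function name `run_thermostat` and all numbers (gain, heat leak, temperatures) are made up for illustration, and the controller is deliberately the simplest kind (purely proportional), so it narrows the error without eliminating it.

```python
def run_thermostat(set_point, temperature, outside, steps=200, gain=0.5, leak=0.1):
    """Simulate a heater that pushes temperature toward set_point against heat loss."""
    history = []
    for _ in range(steps):
        # Error distance between where the system "expects" to be and where it is.
        error = set_point - temperature
        # Negative feedback: act in proportion to the error (a heater can only add heat).
        heating = max(0.0, gain * error)
        # The environment pulls the temperature back toward the outside value.
        temperature += heating - leak * (temperature - outside)
        history.append(temperature)
    return history


history = run_thermostat(set_point=20.0, temperature=5.0, outside=5.0)
print(history[-1])
```

With these numbers the room settles near 17.5 instead of drifting to the outside temperature of 5: the loop holds a small, stable error rather than a large one. The same skeleton (signal, set point, error, corrective action) is what would carry over to hormone-based or other non-neural control systems on this view.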

Desire and aversion are what it feels like to notice prediction error is high and for the brain to take actions it predicts will lower it either by something happening (seeing sensory input) or not happening (not seeing sensory input), respectively.

What? It seems like there's more to "good" and "bad" than "familiar" and "surprising", respectively. Things can be unpleasant and familiar, or surprising and good.