Minimization of prediction error as a foundation for human values in AI alignment



I've mentioned in two posts (and previously in several comments) that I'm excited about predictive coding, specifically the idea that the human brain either is, or can be modeled as, a hierarchy of negative-feedback control systems that try to minimize error in predicting their inputs, with some strong (possibly un-updatable) prediction set points (priors). I'm excited for three reasons: I believe this approach describes a wide range of human behavior, including subjective mental experience, better than any other theory of how the mind works; it's compatible with many other theories of brain and mind; and it may give us an adequate way to ground human values precisely enough to be useful in AI alignment.
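To make the "negative feedback control system with a set point" idea concrete, here is a minimal sketch of a single such unit. It is only an illustration under my own assumptions, not a claim about how neurons implement this: the name `ControlUnit`, the proportional `gain`, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ControlUnit:
    """Hypothetical negative-feedback unit with a fixed prediction set point."""

    set_point: float   # the prior: the input value the unit predicts
    gain: float = 0.5  # assumed: fraction of the error corrected per step

    def error(self, observed: float) -> float:
        # Prediction error: how far the input is from the set point.
        return self.set_point - observed

    def correction(self, observed: float) -> float:
        # Negative feedback: push the input back toward the set point.
        return self.gain * self.error(observed)


def simulate(unit: ControlUnit, observed: float, steps: int) -> list[float]:
    """Record the absolute prediction error at each step of the loop."""
    errors = []
    for _ in range(steps):
        errors.append(abs(unit.error(observed)))
        observed += unit.correction(observed)
    return errors


# Example: a unit that "expects" 37.0 but currently observes 30.0.
errors = simulate(ControlUnit(set_point=37.0), observed=30.0, steps=5)
print(errors)  # [7.0, 3.5, 1.75, 0.875, 0.4375]
```

The set point never changes here, which corresponds to the "possibly un-updatable" case; the unit reduces error only by acting on its input.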

A predictive coding theory of human values

My general theory of how to ground human values in minimization of prediction error is simple and straightforward:

I've thought about this for a while, so I have a fairly robust sense of how it works that lets me check it against a wide variety of situations, but I doubt I've conveyed that to you yet. I think it will help if I give some examples of what this theory predicts in various situations, and how those predictions account for the behavior people observe and report in themselves and others.

  • Mixed emotions/feelings are the result of a literal mix of different control systems under the same hierarchy receiving positive and negative signals as a result of producing less or more prediction error.
  • Hard-to-predict people are perceived as creepy or, stated with less nuance, bad.
  • Familiar things feel good by definition: they are easy to predict.
    • Similarly, there's a feeling of loss (bad) when familiar things change.
  • Mental illnesses result from neurons failing to set good/bad thresholds appropriately, from failing to update set points at a rate that matches current rather than old circumstances, and from sensory input issues that cause either prediction error or internally correct predictions that are poorly correlated with reality (sensory input here broadly includes sight, sound, smell, taste, and touch as well as mental inputs from long-term memory, short-term memory, and other neurons).
  • Desire and aversion are what it feels like when the brain notices that prediction error is high and takes actions it predicts will lower it, either by something happening (seeing sensory input) or by something not happening (not seeing sensory input), respectively.
  • Good and bad feel like natural categories because they are, but ones that are the result of a brain interacting with the world rather than features of the externally observed world.
  • Etc.
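A few of the examples above can be sketched in toy code. This is a rough illustration under assumptions of my own rather than a model of real neurons: I assume a unit signals "good" when its absolute prediction error is under a threshold and "bad" otherwise, and that a predictor moves its set point toward each new input at a fixed rate. The names `valence`, `feeling`, and `Predictor`, and all of the numbers, are hypothetical.

```python
def valence(error: float, threshold: float = 1.0) -> int:
    """Assumed signal: +1 ("good") for low prediction error, -1 ("bad") otherwise."""
    return 1 if abs(error) < threshold else -1


def feeling(child_errors: list[float], threshold: float = 1.0) -> str:
    """Summarize the signals of several units under the same parent."""
    signals = [valence(e, threshold) for e in child_errors]
    if all(s > 0 for s in signals):
        return "good"
    if all(s < 0 for s in signals):
        return "bad"
    return "mixed"  # a literal mix of positive and negative signals


class Predictor:
    """Toy unit whose set point drifts toward what it observes."""

    def __init__(self, rate: float = 0.5, prediction: float = 0.0):
        self.rate = rate              # assumed update rate for the set point
        self.prediction = prediction  # current set point

    def observe(self, value: float) -> float:
        error = value - self.prediction
        self.prediction += self.rate * error
        return error


# Mixed feelings: one unit predicts well, its sibling predicts badly.
print(feeling([0.2, 3.0]))  # mixed

# Familiarity: the same input seen repeatedly becomes easy to predict...
p = Predictor()
familiar = [p.observe(10.0) for _ in range(5)]
print(valence(familiar[-1]))  # 1

# ...and a sudden change to the familiar input produces a large error.
changed = p.observe(20.0)
print(valence(changed))  # -1
```

In this toy, a `rate` of zero gives a set point that never updates, and a very small `rate` gives one that tracks old circumstances long after they have changed, which is one way to picture the update-rate failure mentioned in the mental illness example.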

Further exploration of these kinds of cases will help verify the theory by showing whether adequate and straightforward applications of it can explain various phenomena (I view it as being in a similar epistemic state to evolutionary psychology, including the threat of misleading ourselves with just-so stories). It does to some extent hinge on questions I'm not situated to evaluate experimentally myself, especially whether the brain actually implements hierarchical control systems of the type described. I'm willing to move forward anyway because, even if the brain is not literally made of hierarchical control systems, the theory appears to model what the brain does well enough that whatever theory replaces it will also have to be compatible with many of its predictions. Hence I think we can use it as a provisional grounding even as we keep an eye out for ways in which it may turn out to be an abstraction we have to reconsider in light of future evidence, and I expect that work based on it will be amenable to translation to whatever new, more fundamental grounding we may discover in the future.

Relation to AI alignment

So that's the theory. How does it relate to AI alignment?

First note that this theory is naturally a foundation of axiology, or the study of values, and by extension a foundation for the study of ethics, to the extent that ethics is about reasoning about how agents, each with their own (possibly identical) values, interact. This is relevant for reasons I and more recently Stuart Armstrong have explored:

Stuart has been exploring one approach: grounding human values in an improvement on the abstraction for human values used in inverse reinforcement learning, which I think of as a behavioral economics theory of human values. My main objection to this approach is that it is behaviorist: it appears to me to be grounded in what other agents can observe of external human behavior, so it has to infer the internal states of agents across a large inferential gap, with true values being a kind of hidden and encapsulated variable an agent learns about via observed behavior. To be fair, this has proven an extremely useful approach over the past 100 years or so in a variety of fields, but it suffers an epistemic problem in that it requires lots of inference to determine values. I believe this makes it a poor choice given the magnitude of the Goodharting effects we expect to be at risk of under superintelligence levels of optimization.

In comparison, I view a predictive-coding-like theory of human values as offering a much better method of grounding human preferences. It is

  • parsimonious: the behavioral economics approach to human values allows comparatively complicated value specifications and requires many modifications to make it reflect a wide variety of observed human behavior, whereas this theory lets them be specified in simple terms that become complex by recursive application of the same basic mechanism;
  • light on inference: if the theory is totally right, only the inference involved in measuring neuron activity creates room for epistemic error within the model;
  • able to capture internal state: true values (internal state) are assessed as directly as possible rather than inferred from behavior;
  • broad: works for both rational and non-rational agents without modification;
  • flexible: even if the control theory model is wrong, the general "Bayesian brain" approach is probably right enough for us to make useful progress over what is possible with a behaviorist approach such that we could translate work that assumes predictive coding to another, better model.

Thus I am quite excited about the possibility that a predictive coding approach may allow us to ground human values precisely enough to enable successfully aligning AI with human values.


This is a first attempt to explain what has been my "big idea" for the last year or so now that it has finally come together enough in my head that I'm confident presenting it, so I very much welcome feedback, questions, and comments that may help us move towards a more complete evaluation and exploration of this idea.
