Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

With thanks to Rebecca Gorman for helping develop this idea.

I've been constructing toy examples of agents with preferences and biases. There turn out to be many, many different ways of doing this, and none of the toy examples seem very universal. The reason for this is that what we call "bias" can correspond to objects of very different type signatures.

Defining preferences

Before diving into biases, a brief detour into preferences or values. We can talk about revealed preferences (which look at the actions of an agent and deduce preferences by adding the assumption that the agent is fully rational), stated preferences (which adds the assumption that the stated preferences are accurate), and preferences as internal mental judgments.

There are also various versions of idealised preferences[1], extrapolated from other types of preferences, with various consistency conditions.

The picture can get more complicated than this, but that's a rough overview of most ways of looking at preferences. Stated preferences and preferences-as-judgements are of type "binary relation": they allow one to say things like "x is better than y". Revealed preferences and idealised preferences are typically reward/utility functions: they allow one to say "x is worth a, y is worth b".

So despite the vast number of different preferences out there, their type signatures are not that varied. Meta-preferences are preferences over one's own preferences, and have a similar type signature.
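To make the contrast between the two type signatures concrete, here is a minimal Python sketch (all names are illustrative, not from the post): preferences as binary relations versus preferences as reward functions, with the standard observation that a reward function induces a relation but not vice versa.

```python
from typing import Callable, TypeVar

Outcome = TypeVar("Outcome")

# Binary-relation type signature: "x is better than y" (stated preferences,
# preferences-as-judgements). Returns True iff the first outcome is preferred.
PreferenceRelation = Callable[[Outcome, Outcome], bool]

# Reward/utility type signature: "x is worth a" (revealed or idealised
# preferences). Assigns a numeric value to each outcome.
RewardFunction = Callable[[Outcome], float]

def relation_from_reward(r: RewardFunction) -> PreferenceRelation:
    """Every reward function induces a preference relation; the converse fails
    (a bare relation carries no information about *how much* better x is)."""
    return lambda x, y: r(x) > r(y)

# Toy example: a reward over outcomes named by strings.
reward = {"tea": 2.0, "coffee": 1.0}.get
prefers = relation_from_reward(lambda x: reward(x, 0.0))
```

Here `prefers("tea", "coffee")` holds because tea has the higher reward, but the relation alone no longer records that it is worth exactly twice as much.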


Defining biases

In the Occam's razor paper, an agent has a reward function (corresponding to its preferences) and a "planner", which is a map from reward functions to the agent's policies. The agent's bias, in this context, is the way in which its planner differs from the perfectly rational planner.
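A toy version of that formalism can be sketched in Python (a hedged illustration, not the paper's actual construction; all names are made up): a planner is a function from reward functions to policies, and a bias is any way the agent's planner departs from the fully rational one.

```python
import math
from typing import Callable, Dict, List

Action = str
Reward = Callable[[Action], float]
Policy = Dict[Action, float]          # probability of taking each action
Planner = Callable[[Reward], Policy]  # planner: reward function -> policy

ACTIONS: List[Action] = ["a", "b", "c"]

def rational_planner(r: Reward) -> Policy:
    """Perfectly rational planner: all probability on a highest-reward action."""
    best = max(ACTIONS, key=r)
    return {a: (1.0 if a == best else 0.0) for a in ACTIONS}

def boltzmann_planner(r: Reward, beta: float = 1.0) -> Policy:
    """A biased planner: probability proportional to exp(beta * reward).
    The 'bias' is exactly the gap between this and rational_planner."""
    weights = {a: math.exp(beta * r(a)) for a in ACTIONS}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Toy reward: "c" is objectively best.
r = {"a": 0.0, "b": 1.0, "c": 2.0}.get
```

Under this reward, `rational_planner(r)` puts probability 1 on `"c"`, while `boltzmann_planner(r)` merely favours it, choosing worse actions with nonzero probability.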

It seems that any bias could be fit into this general formalism, but most biases are narrower and more specific than that. It is also useful, and sometimes necessary, to talk about biases independently from preferences. Many idealised preferences are defined by looking at preferences after some biases are removed[1:1]. Defining these biases via the preferences would be completely circular. With that in mind, here are various biases with various type signatures. These examples are not exhaustively analysed; most of these biases can be re-expressed in slightly different ways, with different type signatures:

  • Bounded rationality: instead of figuring out all the logical consequences of some decision, the agent only figures out those that it can compute in reasonable time. Type signature: replacing part of the consequence-prediction algorithm.
  • Noisy/Boltzmann rationality: the agent selects actions noisily, with probability proportional to the exponential of their expected reward. Type signature: replacing part of the action-selection algorithm.
  • Availability bias: some relevant information is overweighed compared with other information. Type signature: change in the use of known information.
  • Anchoring bias: I've argued in the past that the anchoring bias can be seen as a preference itself. Alternatively, this could be a systematic flaw in heuristics used to estimate various quantities. Type signature: same as for preferences, or a heuristic flaw.
  • Prejudice: often prejudices are seen as instinctive reactions that we don't consciously endorse and would prefer to be rid of. Type signature: same as for preferences.
  • Akrasia: this bias increases the weight of some short-term preferences over longer-term ones. Type signature: transformation of weights of some preferences within the reward function.
  • Endorsed preference: since we have contradictory preferences, resolving them all into a single reward function will sacrifice some of our current preferences, including some that we endorse and agree with. So endorsed preferences can themselves become biases. Type signature: same as endorsed preferences.
  • Priming: priming, if it works, changes the salience of various aspects of the decision-making algorithm. Type signature: change of salience of specific components in the human decision making process (this is obviously a vague and broad concept).
  • Bias-preference feedback loop: recommender systems can cause radicalisation by directing people to sites/videos that they agree with, in ways that cause their views to become more and more extreme. We might see this as an algorithm exploiting biases (availability and confirmation biases, mostly). But the whole feedback loop is also a bias, in that it departs from perfect rationality. Type signature: change in preferences.
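To illustrate one of the less obvious type signatures in the list above — akrasia as a "transformation of weights within the reward function" — here is a hedged Python sketch (the outcome encoding and the boost factor are invented for the example): the bias is an operator of type Reward → Reward, not a reward function or a relation itself.

```python
from typing import Callable, Tuple

# An outcome here is a (reward_now, reward_later) pair of short-term and
# long-term payoffs; a reward function scores such an outcome.
Outcome = Tuple[float, float]
Reward = Callable[[Outcome], float]

def balanced_reward(outcome: Outcome) -> float:
    """An unbiased agent weighs short- and long-term payoffs equally."""
    now, later = outcome
    return now + later

def akrasia(r: Reward, short_term_boost: float = 3.0) -> Reward:
    """Type signature: Reward -> Reward. Inflates the short-term component
    before handing the outcome to the underlying reward function."""
    def biased(outcome: Outcome) -> float:
        now, later = outcome
        return r((short_term_boost * now, later))
    return biased

tempted = akrasia(balanced_reward)
```

With these numbers, the unbiased agent prefers the patient outcome (1 now, 4 later) over the impulsive one (3 now, 0 later), while the akratic agent's reweighted reward reverses that ranking.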

In conclusion

The point of this is that the term "bias" is extremely broad, covering many different types of objects with different type signatures. So when using the term formally or informally, be sure that whoever you're talking to uses the term the same way. And when you're trying to specify biases within a model, be aware that your formalism may not allow you to express many forms of "bias".

  1. For instance, the paper "Libertarian Paternalism Is Not an Oxymoron" defines preferences as the choices people would make 'if they had complete information, unlimited cognitive abilities and no lack of self-control'. ↩︎ ↩︎
