Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

[Metadata: crossposted from First completed November 19, 2022.]

What are an agent's values? An answer to this question might be a good description of the agent's external behavior and internal workings, without showing how one could modify the agent's workings or origins so that the agent pushes the world in a specific different direction.

Descriptive values

There's some discussion of what can be inferred about the values of an agent based on its behavior and structure. E.g. see Daniel Dennett's intentional stance, and "Occam's razor is insufficient to infer the preferences of irrational agents" by Stuart Armstrong, Sören Mindermann (arxiv), and this post by Vanessa Kosoy.

One could describe an agent as having certain values: the agent's behavior is a boundedly rational attempt to push the world in certain directions. For some purposes, it's useful to have a parsimonious description of an agent's behavior or internal workings in terms of values. For example, such a description could be useful for helping the agent out: to help the agent out, you push the world in the same direction that the agent is trying to push the world.

Specifiable values

A distinct purpose in describing an agent as having values is to answer questions about values in counterfactuals:

  • What determined that the agent would have those values and not other values?
  • Under what circumstances will the agent continue to have those values? E.g., will the agent rewrite itself so that its behavior is no longer well-described as boundedly pursuing those values?
  • How could the agent's values be modified? How could the values be modified in a specific direction, or to a specific state, so that that the modified agent has some specific effect on the world?
  • How could the agent's ontogeny--the process that made it what it is--be altered so that it ends up with some other specific values?

To make these questions more likely to have answers, and to not rely too much on assumptions about what values are, replace the notion of "values" with the notion "what directions a mind ends up pushing the world in".

Quasi-example: explicit utility maximization

An auxiliary question: how, mechanistically, do "the values" determine the behavior? This question might not have an answer, because there might not be some component in the agent that constitutes "the values". For example, in humans, there's no clear value component; there are many in-built behavior-determiners, but they don't fully constitute what we call our values. But, in cases where we clearly understand the mechanism by which an agent's values determine its behavior, answers to other questions about values in counterfactuals might follow.

For example, there's the classic agent model: a system that searches for actions that it predicts will lead in expectation to the most highly-scored world according to its utility function box. The mechanism is explicit in this model. The utility function is embodied, in a box, as an input-output function, and it determines the agent's effects on the world by providing the criterion that the agent uses to select actions. Some answers to the above questions follow. E.g., it's clear at least qualitatively how to modify the agent's values to a specific state: if you want to make the agent cause a certain kind of world, just change the utility function to score that kind of world highly.

Even this example is not so clear cut, and relies on background assumptions. See problems with embedded agency. For example, if we assume that there's already a fixed world (that is, an understanding of what's possible) about which to define the utility function, we sweep under the rug that the understanding behind having such a world had to be gained, and that the gaining of understanding might also change an agent's values.

New Comment
2 comments, sorted by Click to highlight new comments since:

I think AI should threat value function as probabilistic.  I.e. instead of thinking "this world has value of exactly N" it could think something like "I 90% sure that this world has value N+-M, but there is 10% possibility that it could actuall have value -ALOT". And would avoid that world, because it would give a very low expected value on averager.

It's a reasonable idea. See here though: