(A Failed Approach) From Precedent to Utility Function


(Sorry, this post has many errors, so I have rewritten this chapter here.)

In the previous chapter, we introduced how ACI circumvents the is-ought problem: an ought cannot be derived from an is, but the right thing to do can be derived from examples of doing the right thing.

Over evolutionary time, histories are strained through a sieve called natural selection, so organisms receive and remember the right things. That is how living organisms, as examples of natural intelligence, can learn to behave in the right way from evolutionary history.

This chapter provides a more detailed analysis of ACI. We argue that it is possible to derive utility functions from ACI's first principle, so that ACI can be compared with other models of intelligence, such as the rational agent, active inference, and value learning approaches. We will also unpack how ACI acquires the motivation to construct models of the world and to explore it.

The optimal policies’ distribution

ACI learns about policy from history. ACI argues that if the past actions and observations are known, it is possible to establish a posterior probability distribution for an agent’s policy. However, before this, it is necessary to assign a prior probability distribution.

According to the ACI model, an agent’s policy can be of various forms, ranging from simple lookup tables and goal-directed policies to more complex ones that aim to maximize expected total utility or reward. Because ACI is a universal intelligence model, not bound by the limitations of rational behavior, all these types of policies should be considered.

Without loss of generality, a policy $π$ can be defined as a map from histories $h$ to a probability distribution over actions $a \in A$ (following Stuart Armstrong):

$π : H \to Δ A$

A history $h \in H$ is a sequence of observations and actions of an agent, and $H$ is the set of all histories.

Using Solomonoff induction, we can obtain a universal prior that assigns a probability distribution to the policy of an unknown agent, denoted $P(π)$, with a sample space consisting of all possible policies. Roughly speaking, this universal prior is related to the length of the shortest program for a universal Turing machine that can output the policy: shorter descriptions receive more prior weight.

After that, we can obtain a posterior probability distribution over policies, given that the agent's history is known:

Definition 2 (Optimal Policies’ Distribution, OPD) If an agent G follows an unknown policy while its actions and observations are known, in other words, its history $h \in H$ is known, then there is a posterior probability distribution over the policies G may have followed within this history. It is denoted $P(π | h)$ and termed the Optimal Policies’ Distribution (OPD).

However, this definition only applies to semi-static situations, because the agent may follow different policies over the course of $h$. Nevertheless, the OPD may serve as a reasonable approximation in many cases.

If the agent G is always doing the right thing within this history $h$, then the OPD over $π \in Π$ is the OPD conditional on G doing the right thing, which we denote OPD*. OPD* is everything we can know about the meaning behind doing the right thing in history $h$, and it can be used to derive goals and utility functions.
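To make the posterior concrete, here is a minimal sketch under strong simplifying assumptions that are not part of the original argument: the policy space is cut down to three hypothetical candidates, each policy conditions only on the current observation rather than on the full history, and the Solomonoff prior is replaced by a computable stand-in that weights each policy by $2^{-\text{bits}}$ for a made-up description length. All names and numbers are illustrative.

```python
# Toy stand-in for the OPD: a posterior over a tiny, hand-picked policy set.
ACTIONS = ("left", "right")


def always_left(obs):
    # Ignores the observation entirely.
    return {"left": 1.0, "right": 0.0}


def follow_signal(obs):
    # Usually takes the action named by the observation, with some noise.
    return {a: (0.9 if a == obs else 0.1) for a in ACTIONS}


def coin_flip(obs):
    # Chooses uniformly at random.
    return {"left": 0.5, "right": 0.5}


# name -> (policy, assumed description length in bits; hypothetical numbers)
CANDIDATES = {
    "always_left": (always_left, 2),
    "follow_signal": (follow_signal, 4),
    "coin_flip": (coin_flip, 3),
}


def prior():
    """Simplicity prior: weight 2^-bits per policy, normalized to sum to 1."""
    weights = {name: 2.0 ** -bits for name, (_, bits) in CANDIDATES.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def likelihood(policy, history):
    """Probability that `policy` picks the recorded action at every step."""
    p = 1.0
    for obs, action in history:
        p *= policy(obs)[action]
    return p


def opd(history):
    """Posterior P(policy | history) over CANDIDATES via Bayes' rule."""
    pri = prior()
    unnorm = {
        name: pri[name] * likelihood(policy, history)
        for name, (policy, _) in CANDIDATES.items()
    }
    total = sum(unnorm.values())
    if total == 0.0:
        raise ValueError("no candidate policy can produce this history")
    return {name: u / total for name, u in unnorm.items()}


# A history is a list of (observation, action) pairs.
history = [("left", "left"), ("right", "right"), ("right", "right")]
posterior = opd(history)
```

In this toy run, `always_left` is ruled out by the two `right` actions, and most of the remaining mass shifts to `follow_signal` even though it has the longest assumed description.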

From ACI to utility function

People understand and prefer thinking in terms of goals and utilities. Utility is a measure of the rightness of things, and selecting the possible world with the higher expected utility is considered the right choice. A utility function $U$ is defined as a function from worlds to real numbers:

$U : W \to \mathbb{R}$

In ACI, the utility, or rightness, of a possible world can be derived by comparing it with the right history. The right history provides a standard for doing the right things, from which we can construct a metric that always grants the right possible world the highest expected utility.

Unfortunately, there’s no way to compare a history and a world directly.

In the previous section, we defined the OPD and OPD* for a given history. OPD* can be used as a representation of how an agent performs the right actions. By comparing the OPD*s of two different histories, we can plausibly infer how similar they are in terms of being right.

In other words, if a future scenario replicates the past, following the right policy from the past will most likely lead to the right outcome in that future. Thus the negative of the Kullback-Leibler divergence between the OPD*s of the given history and of a possible world is a candidate definition of the utility function.

Definition 3 (Utility function of ACI) If an agent G follows an unknown policy, and its history of doing the right things $h^{*}$ is known, the utility function of a possible future world $w$ should be:

$U(w, h^{*}) = - D_{\mathrm{KL}} [P(π | h^{*}) \,\|\, P(π | w)]$

As the future world is yet to arrive and its rightness remains unknown, the possible world $w$ is assumed to be a “right” history when the utility is calculated.
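Definition 3 can be sketched numerically once both posteriors are available as finite distributions. The snippet below is only an illustration: the policy names and probabilities are made up, the policy space is assumed finite, and the divergence is measured in nats. It also makes one consequence of the definition visible: if the world's posterior gives zero probability to a policy that the right history supports, the utility is negative infinity.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two distributions over the same finite set."""
    total = 0.0
    for name, p_i in p.items():
        if p_i == 0.0:
            continue  # a term with p_i = 0 contributes nothing
        q_i = q.get(name, 0.0)
        if q_i == 0.0:
            return math.inf  # q excludes a policy that p supports
        total += p_i * math.log(p_i / q_i)
    return total


def utility(opd_history, opd_world):
    """U(w, h*) = -D_KL[P(pi | h*) || P(pi | w)], following Definition 3."""
    return -kl_divergence(opd_history, opd_world)


# Hypothetical posteriors over three named policies (illustrative numbers).
opd_h_star = {"always_left": 0.0, "follow_signal": 0.75, "coin_flip": 0.25}
opd_w_similar = {"always_left": 0.0, "follow_signal": 0.70, "coin_flip": 0.30}
opd_w_different = {"always_left": 0.8, "follow_signal": 0.1, "coin_flip": 0.1}

u_same = utility(opd_h_star, opd_h_star)
u_similar = utility(opd_h_star, opd_w_similar)
u_different = utility(opd_h_star, opd_w_different)
```

Under these assumptions the utility is never positive, and it reaches its maximum of zero only when the world's posterior over policies matches that of the right history exactly.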

The fluctuation of utility functions

Unlike the utility function of the rational agent model, the utility function of ACI depends on the history of the agent, and cannot be optimized in the same manner, because it may fluctuate as the history evolves.

As new histories are created, the agent continuously gains information on how to behave correctly. The same possible world may not always have the same utility. An agent's preferences are constantly evolving. Moreover, changes in the environment can also alter the utility function. The agent itself is the river that cannot be stepped into twice.

However, because worlds are stratified by histories, all histories contained in a world are known once the world is known. We can therefore know how the utility function changes, provided we know that all of those histories are doing the right thing.
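The dependence of utility on history can be illustrated with a small sketch. It is a toy under assumptions not in the original: only two hypothetical policies exist, the prior over them is uniform, and histories and worlds are reduced to plain action sequences with no observations. The same world is scored twice, once against a short right history and once against a longer one.

```python
import math

# Two hypothetical policies over actions "a" and "b" (illustrative numbers).
POLICIES = {
    "mostly_a": {"a": 0.8, "b": 0.2},
    "mostly_b": {"a": 0.2, "b": 0.8},
}


def posterior(actions):
    """Posterior over POLICIES given an action sequence, with a uniform prior."""
    unnorm = {
        name: math.prod(dist[x] for x in actions)
        for name, dist in POLICIES.items()
    }
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}


def utility(history, world):
    """Negative KL divergence between the two posteriors, in nats."""
    p, q = posterior(history), posterior(world)
    return -sum(p[n] * math.log(p[n] / q[n]) for n in POLICIES if p[n] > 0)


world = ["a", "a", "b"]                   # one fixed candidate world
early_history = ["a"]                     # the right history so far
later_history = ["a", "b", "b", "b"]      # the same history, extended

u_early = utility(early_history, world)
u_later = utility(later_history, world)
```

With the short history, the world looks almost exactly as "right" as the past, so its utility is close to zero. After three more `b` actions the evidence favors `mostly_b`, and the same world scores noticeably lower.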

But how can we know if a new action or a new history is right or not? The answer lies in the question: from the utility function.

It has been asserted that survival and reproduction can confirm the rightness of an action or history, but relying solely on evolution to justify everything would be very inefficient. An agent requires alternative methods to determine the correctness of actions as quickly as possible. This is where the utility function comes in: it can extend its information about rightness to other worlds.

In the next chapter, we will demonstrate how ACI employs meta-policies to assign rightness to selected samples, and then to sub-policies. Models of the world and goals can also be derived from this process.

Value learning upside down

ACI and the value learning approach are closely related: both aim to derive policy from behavior. The difference is that value learning tries to model a human as an imperfect rational agent and to extract the real values that the human is imperfectly optimizing. However, Stuart Armstrong has shown that this approach is ineffective for humans unless you make strong assumptions about their rationality.

According to ACI, irrational intelligence should not be viewed as an imperfect version of rational agents. Rather, the rational agent model is an imperfect representation of actual intelligence in the real world, because there are no coherence arguments that say you must have goal-directed behavior. ACI believes that the utility function should be viewed as an estimation of the optimal policies' distribution, rather than the opposite. ACI is trying to turn value learning upside down.

There is a concern that, without the assumption of a rational agent, an agent may not be able to exceed the level of performance displayed in the examples, and will be taken exactly to human mimicry and no further. This question will be discussed in the next two or three chapters.
