Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary: we examine a game involving hash functions and show that an algorithm based on resource-bounded Solomonoff induction is outperformed by a simple strategy. This game has some parallels to Vingean reflection.


Betting environments

Consider the following family of environments (which we might call "betting environments"):

  1. For iteration $i = 1, 2, 3, \ldots$:
  2. Draw hidden state $s_i$ from unknown distribution $D$.
  3. Output first observation $o_i$ (determined by $s_i$).
  4. The agent chooses some action $a_i$.
  5. Output second observation $o'_i$ (determined by $s_i$).
  6. The agent observes $o'_i$ and receives reward $r(s_i, a_i)$, where $r$ is an unknown function that always returns a number between 0 and 1.

The idea of these environments is that the agent's action affects nothing except reward, so acting well in these environments only requires predicting the next reward correctly given the agent's action. The second observation has no apparent effect on what the agent's strategy should be, but we'll see later why it is included. First, we will see how one general approach to problems like this fails.
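To make the protocol concrete, here is a minimal Python sketch of this loop. Everything in it is illustrative rather than part of the original setup: `draw_state`, `obs1`, `obs2`, and `reward` stand in for the unknown distribution $D$, the two observation maps, and the reward function $r$, and `agent.act` is whatever policy the agent uses.

```python
# Illustrative sketch of the betting-environment protocol (not from the original post).
# draw_state, obs1, obs2, and reward are placeholders for the unknown D, the two
# observation maps, and the reward function r; `agent` is any object with an act() method.

def run_betting_environment(agent, draw_state, obs1, obs2, reward, num_iterations):
    history = []
    for i in range(1, num_iterations + 1):
        s = draw_state(i)                 # hidden state s_i ~ D (never shown to the agent)
        o1 = obs1(s)                      # first observation
        a = agent.act(history, o1)        # agent acts after seeing only o1 (and history)
        o2 = obs2(s)                      # second observation
        r = reward(s, a)                  # reward in [0, 1]; depends only on s and a
        history.append((o1, a, o2, r))    # the agent can use this record next iteration
    return history
```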

A resource-bounded algorithm for betting environments

Consider the following resource-bounded variant of Solomonoff induction. Each hypothesis $H$ is represented as a program that takes as input a bit string and returns a probability (interpreted as the probability that the next bit will be 1). As in ordinary Solomonoff induction, the prior probability of a hypothesis $H$ of length $\ell(H)$ is proportional to $2^{-\ell(H)}$. The probability that hypothesis $H$ assigns to an observation history $x_{1:m}$ is $\prod_{j=1}^{m} P_H(x_j \mid x_{1:j-1})$, i.e. the product of the conditional probabilities for each bit given the previous bits. We restrict attention to hypotheses $H$ that run in time $O(n^2)$, where $n$ is the length of the input.
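As a rough illustration (not the post's exact definition), here is how the $2^{-\text{length}}$ prior and the bitwise likelihood combine for a finite pool of candidate hypothesis programs. The names `hypotheses` and `lengths_in_bits` are hypothetical, and the sketch simply assumes the $O(n^2)$ runtime restriction has already been enforced on the pool.

```python
import math

# Sketch of scoring hypotheses under the resource-bounded induction described above.
# Each hypothesis is a function: bit prefix (list of 0/1) -> probability next bit is 1.
# We assume the pool `hypotheses` already respects the O(n^2) runtime restriction.

def log2_likelihood(hypothesis, bits):
    """log2 of the probability the hypothesis assigns to the bit string."""
    total = 0.0
    for j, b in enumerate(bits):
        p1 = hypothesis(bits[:j])                 # P(next bit = 1 | bits so far)
        p = p1 if b == 1 else 1.0 - p1
        total += math.log2(max(p, 1e-300))        # guard against log(0)
    return total

def log2_posterior_weights(hypotheses, lengths_in_bits, bits):
    """Unnormalized log2 posterior: prior 2^(-program length) times likelihood."""
    return {name: -lengths_in_bits[name] + log2_likelihood(h, bits)
            for name, h in hypotheses.items()}
```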

Using this variant of Solomonoff induction, we can develop an AIXI-like algorithm for acting in betting environments. Specifically, when choosing an action, this algorithm does the following:

  1. Use resource-bounded Solomonoff induction to predict the distribution of future observations and rewards, given each action.
  2. Most of the time, pick the action that maximizes the expected next reward; otherwise, pick a random action.

The purpose of the random action is to learn the correct distribution for each counterfactual reward in the long run. For example, when choosing action $a_i$ we could pick a random action with probability $1/i$. This way, the rate of exploration approaches 0 over time, but the total number of exploration steps approaches infinity.
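A sketch of this action rule, using the $1/i$ exploration schedule mentioned above; `expected_reward` is a hypothetical stand-in for the expectation computed from the induced predictive distribution.

```python
import random

# Sketch of the action-selection step.  expected_reward(a) is assumed to return the
# predicted next reward for action a under the current resource-bounded induction.

def choose_action(i, actions, expected_reward):
    if random.random() < 1.0 / i:                 # explore with probability 1/i
        return random.choice(list(actions))
    return max(actions, key=expected_reward)      # otherwise exploit
```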

Failure in an unpredictable environment

Here's an example of a betting environment that this algorithm fails on. Suppose we have some family of hash functions $h_n$, one for each natural number $n$, such that $h_n$ returns strings of length $n$. Assume that inverting $h_n$ takes time exponential in $n$. Now consider the following betting environment (taking, say, $n = i$ on iteration $i$):

  • The hidden state $s$ consists of a pair of binary strings $(s_1, s_2)$; the first observation is $s_1$ and the second observation is $s_2$.
  • The distribution $D$ works as follows:
    • The distribution of the second string $s_2$ is uniformly random, with length $n$.
    • 90% of the time, $s_1 = h_n(s_2)$.
    • Otherwise $s_1$ is uniformly random with length $n$.
  • The reward for action 1 is 1 if $h_n(s_2) = s_1$ and 0 otherwise; the reward for action 0 is 0.5.

Immediately, it is clear that always choosing action 1 will yield 0.9 expected reward per iteration. This is the optimal strategy for an agent that lacks the computational resources to tell anything about $s_2$ given $s_1$. I believe that if $h$ is an appropriate family of hash functions, this will be true for all polynomial-time algorithms in the long run.
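A concrete (if toy) instantiation of this environment, using truncated SHA-256 as a stand-in for the hash family $h_n$ and the reward convention above; any hard-to-invert family would do, and the helper names here are illustrative only.

```python
import hashlib
import random

# Toy instantiation of the hash betting environment.  Truncated SHA-256 stands in for
# the hash family h_n (valid for n <= 256 in this sketch); the post only requires that
# h_n be hard to invert.

def h(n, s2):
    """Hash a bit string (Python str of '0'/'1') down to n bits."""
    digest = hashlib.sha256(s2.encode()).digest()
    bits = ''.join(format(byte, '08b') for byte in digest)
    return bits[:n]

def draw_state(n):
    """Sample the hidden state (s1, s2)."""
    s2 = ''.join(random.choice('01') for _ in range(n))
    if random.random() < 0.9:
        s1 = h(n, s2)                                          # 90%: s1 is the hash of s2
    else:
        s1 = ''.join(random.choice('01') for _ in range(n))    # 10%: s1 is uniform
    return s1, s2

def reward(n, s1, s2, action):
    """Action 1 bets that s1 = h_n(s2); action 0 takes the safe 0.5."""
    if action == 0:
        return 0.5
    return 1.0 if h(n, s2) == s1 else 0.0
```

Always playing action 1 here earns reward 1 with probability 0.9, matching the 0.9 figure above.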

Now let's see how our resource-bounded algorithm does on this problem. After a large number of iterations, any winning hypothesis for our resource-bounded Solomonoff induction variant will satisfy:

  1. The distribution of $s_1$ is approximately uniformly random.
  2. Given observations $s_1, s_2$ and action $a$, it predicts the correct reward with high probability.

This is because these conditions state that the hypothesis has close to the correct conditional distributions for $s_1$ and for the reward. If a hypothesis fails to satisfy one of these properties, we can replace it with a variant that does satisfy the property, yielding a hypothesis that assigns infinitely higher log-probability to the data (in the long run) without being much more complicated.

This only leaves the distribution of $s_2$ given $s_1$. Correctly predicting $s_2$ would require reversing the hash function. Given that the hypothesis must run in polynomial time, if we take many samples of $s_2$ given $s_1$ using the hypothesis, we should expect an (exponentially) small fraction to satisfy $h_n(s_2) = s_1$. So any hypothesis will predict that, with high probability, $h_n(s_2) \neq s_1$.

Given this, our algorithm will take action 0, because it achieves expected reward 0.5, while action 1 appears to achieve only an exponentially small expected reward. Obviously, this behavior is highly pathological.
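Spelled out (a quick sketch under the reward convention above): since the winning hypotheses treat $s_2$ as roughly uniform given $s_1$, they estimate

$$\mathbb{E}_{\text{hyp}}[r \mid a = 1] \approx \Pr_{\text{hyp}}[h_n(s_2) = s_1] \approx 2^{-n} \ll 0.5 = \mathbb{E}_{\text{hyp}}[r \mid a = 0],$$

so the algorithm settles on action 0, even though the true expected reward of action 1 is 0.9.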

Conclusion

As an alternative to this algorithm, consider the same algorithm except that it completely ignores the second observation. This new algorithm will correctly predict that, about 90% of the time, action 1 yields reward 1. Therefore, it will take action 1. Intuitively, this seems like the correct behavior in general. Since winning hypotheses will predict the reward given the action and the first observation well, subject to resource bounds, they can be used in a straightforward way for estimating the expected reward of each action.
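A minimal sketch of this modified agent, with a simple frequency estimate standing in for induction run on the reduced observation stream (first observation, action, reward only); the function names are hypothetical, and some exploration of both actions is assumed.

```python
# Sketch of the modified agent from the conclusion: it ignores the second observation
# entirely.  A running frequency estimate stands in for resource-bounded induction on
# the reduced stream (first observation, action, reward).

def estimated_reward(history, action):
    """Average observed reward for this action so far (0.5 if no data yet)."""
    rewards = [r for (o1, a, o2, r) in history if a == action]
    return sum(rewards) / len(rewards) if rewards else 0.5

def act(history, o1, actions=(0, 1)):
    # In the hash environment, action 1 averages about 0.9 and action 0 exactly 0.5,
    # so this agent ends up betting that s1 is the hash of s2.
    return max(actions, key=lambda a: estimated_reward(history, a))
```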

The problem with our original algorithm parallels a problem with naïve approaches to Vingean reflection. If the agent's environment contains a hash function inverter, then it may encounter a situation similar to the betting environment in this post: it will see the hash code of a string before the string itself. Since the agent is unable to predict the behavior of the hash function inverter, it must use abstract reasoning to decide how much utility to expect from each action, rather than actually simulating the hash function inverter. The same reasoning sometimes applies to smart agents other than hash function inverters.

For future research, it will be useful to see how strategies for solving the hash function betting environment (i.e. strategies that predict expected reward 0.9 for action 1) might translate to strategies for Vingean reflection.

Comments

When describing the failure mode, you have the approximate-Solomonoff agent try to predict $s_2$ (by assigning approximately uniform probability, since it can't invert the hash in $O(n^2)$ time), and then plug that distribution into the reward function to compute the "expected reward" of action 1 (very small, since only one $s_2$ can be correct).

However, this problem would be circumvented if the agent did things the opposite way - first trying to predict the reward (by assigning probability 0.9 to $h_n(s_2) = s_1$ based on the frequency data), and only then trying to invert the hash and failing.

There might be a more general rule here: the order of prediction where you fail later rather than earlier gives better results. Or: the more approximate an answer, the less precedence it should have in the final prediction.

This is still a bit unsatisfying - it's not abstract reasoning (at least not obviously) - but I think the equivalent would look more like abstract reasoning if the underlying predictor had to use smarter search on a smaller hypothesis space than Solomonoff induction.

I think the issue presented in the post is that the Solomonoff hypothesis cannot be sampled from, even though we can determine the probability density function computationally. If we were to compute the expected value of the reward based on our action, we run into the curse of dimensionality: there is a single point contributing most of the reward. A Solomonoff inductor would correctly find a probability density function under which $h(s_2) = s_1$ with high probability.

However, I think that if we ask the Solomonoff predictor to predict the reward directly, then it will correctly arrive at a model that predicts the rewards. So we can fix the presented agent.

Yes, I think part of the issue is the gap between being able to sample $s_1$ given $s_2$ and evaluating the density of $s_2$ given $s_1$. However, I am not sure how we should score hypotheses that assign density to future observations (instead of predicting bits one at a time). We will have difficulty computing the relevant conditional densities.

Predicting the rewards directly seems to fix this issue. I don't know if this solution can be generalized to environments other than betting environments.

Typo: you write "90% of the time, " instead of "90% of the time, "

Fixed, thanks.