How to make AIXI-tl incapable of learning

by itaibn0, 27th Jan 2014



Consider a simple game: You are shown a random-looking 512-bit string h. You may then press one of two buttons, labeled '0' and '1'. No matter which button you press, you will then be shown a 256-bit string s such that SHA512(s) = h. In addition, if you pressed '1' you are given $1.
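To make the rules concrete, here is a minimal sketch of one round of the game in Python. The function name play_round and the choose_button callback are made up for illustration, and the hidden string s is assumed to be drawn uniformly at random, which the description above doesn't strictly say.

```python
import hashlib
import secrets

def play_round(choose_button):
    """Play one round; choose_button maps the shown hash h to 0 or 1."""
    s = secrets.token_bytes(32)        # hidden 256-bit string (assumed uniform)
    h = hashlib.sha512(s).digest()     # 512-bit string shown to the player
    b = choose_button(h)               # the player's decision
    if b not in (0, 1):
        raise ValueError("button must be 0 or 1")
    r = b                              # reward in dollars: 1 only if '1' was pressed
    return h, b, r, s                  # s is revealed whichever button was pressed

# A player who ignores h and always presses '1'.
h, b, r, s = play_round(lambda shown_hash: 1)
print(r)  # 1
```

The reward depends only on b, which is why h and s look irrelevant to the decision.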

This game seems pretty simple, right? s and h are irrelevant, and you should simply press '1' all the time (I'm assuming you value receiving money). Well, let's see how AIXI and AIXI-tl fare at this game.

Let's say the machine has already played the game many times. Its memory is h_0, b_0, r_0, s_0, h_1, ..., b_(n-1), r_(n-1), s_(n-1), h_n, where the list is in chronological order, the h_i, r_i and s_i are inputs, the b_i are the machine's decisions, and r_i is the reward signal. It is always the case that r_i=b_i and h_i=SHA512(s_i).
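As a rough illustration of that memory layout, the sketch below builds a made-up three-round history plus a pending h_n and checks the two invariants on the completed rounds. The tuple layout and the name past_rounds_consistent are hypothetical choices, not anything AIXI itself uses.

```python
import hashlib
import secrets

def past_rounds_consistent(rounds):
    """Check r_i == b_i and h_i == SHA512(s_i) for every completed round."""
    return all(
        r == b and hashlib.sha512(s).digest() == h
        for (h, b, r, s) in rounds
    )

# Hypothetical memory: three completed rounds (h_i, b_i, r_i, s_i) ...
rounds = []
for b in (1, 0, 1):
    s = secrets.token_bytes(32)
    rounds.append((hashlib.sha512(s).digest(), b, b, s))

# ... followed by the pending h_n, whose preimage s_n is still hidden.
s_n = secrets.token_bytes(32)
h_n = hashlib.sha512(s_n).digest()

print(past_rounds_consistent(rounds))  # True
```

Note that only the completed rounds can be checked this way. For h_n the machine has nothing to hash, since s_n has not been shown yet.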

First let's look at AIXI. It searches for models that compress and extrapolate this history up to the limit of its planning horizon. One class of such models is this: there is a list s_0, ..., s_N of random 256-bit strings and a list b_0, ..., b_N of (possibly compressible) bits. The history is SHA512(s_0), b_0, b_0, s_0, ..., b_(N-1), s_(N-1), SHA512(s_N), b_N (each b_i appears twice in a row, once as the decision and once as the reward, since r_i=b_i). Here s_i for i<n must match the s_i in its memory, and s_n must be the almost certainly unique value with SHA512(s_n) = h_n. While I don't have a proof, it intuitively seems like this class of models will dominate the machine's probability mass, and repeated arg-max should lead to the action of outputting 1. It wouldn't always do this due to exploration/exploitation considerations and due to the incentive to minimize K(b_0, ..., b_N) built into its prior, but it should do it most of the time. So AIXI seems good.

Now let's consider AIXI-tl. It picks outputs by having provably correct programs assign lower bounds to the expected utility, and picking the one with the best lower bound, where expected utility is as measured by AIXI. This would include accepting the analysis I just made with AIXI if that analysis can be made provably accurate. Here lies a problem: the agent has seen h_n but hasn't seen s_n. Therefore, it can't be certain that there is an s_n with SHA512(s_n)=h_n. Therefore, it can't be certain that the models used for AIXI actually work (this isn't a problem for AIXI since it has infinite computational power and can always determine that there is such an s_n).

There is an ad hoc fix for this for AIXI-tl: Take the same model as before, but h_n is a random string rather than being SHA512(s_n). This seems at first to work okay. However, it adds n, the current time, as an input to the model, which adds to its complexity. Now other models dependent on the current time also need to be considered. For instance, what if h_n was a random string, and in addition r_n=not(b_n)? This model seems more complicated, but maybe in the programming language used for the Solomonoff prior it is shorter. The key point is that the agent won't be able to update past that initial prior no matter how many times it plays the game. In that case AIXI-tl may consistently prefer 0, assuming there aren't any other models it considers that make its behavior even more complicated.
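The "can't update past the prior" point can be sketched with a toy Bayes calculation. Call the ad hoc model A (h_n random, r_n=b_n) and the rival model B (h_n random, r_n=not(b_n)). The program lengths 100 and 97 below are invented numbers, and the sketch simplifies by ignoring the cost of encoding n and the b_i, which both models are assumed to pay equally. Since both models agree on every completed round, they assign the same likelihood to the observed history, and it cancels.

```python
from fractions import Fraction

def posterior_odds(len_a, len_b, n):
    """Posterior odds of model B over model A after n completed rounds.

    len_a, len_b: hypothetical program lengths in bits (made-up values).
    """
    prior_a = Fraction(1, 2 ** len_a)
    prior_b = Fraction(1, 2 ** len_b)
    # Both models treat s_0..s_(n-1) as random 256-bit strings and h_n as a
    # random 512-bit string, and both say r_i = b_i for all past rounds,
    # so they give the observed history the same probability.
    likelihood = Fraction(1, 2 ** (256 * n + 512))
    return (prior_b * likelihood) / (prior_a * likelihood)

print(posterior_odds(100, 97, 0))     # 8
print(posterior_odds(100, 97, 1000))  # 8
```

Under these assumptions the odds stay at 2^(len_a - len_b) however large n gets, so whichever model happens to be shorter in the reference language keeps winning.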

The key problem is that AIXI-tl is handling logical uncertainty badly. The reasonable thing to do upon seeing h_n is to assume that it, like all h_i before it, is the hash of some 256-bit string. Instead, it finds itself unable to prove this fact and is forced into assuming the present h_n is special. This makes it assume the present time is special and makes it incapable of learning from experience.
