Mentioned in

Intuitive Explanation of Solomonoff Induction

1st Dec 2011

9cousin_it

0timtyler

8lessdazed

4lukeprog

0hairyfigment

0lessdazed

1lukeprog

0lessdazed

1lukeprog

0lessdazed

7[anonymous]

3[anonymous]

1[anonymous]

0Bogdan

0[anonymous]

3kilobug

3Daniel_Burfoot

0timtyler

1timtyler

1lavalamp

0moshez

0timtyler

-2JoshuaZ

9cousin_it

1antigonus

2cousin_it

3antigonus

5cousin_it

-2timtyler

-2JoshuaZ

0timtyler

New Comment

31 comments, sorted by Click to highlight new comments since: Today at 4:42 PM

As it turns out, there is an exact recipe for finding truth. It was discovered in the 1960s.

At our current state of knowledge, we cannot have any mathematical theorem about Solomonoff induction finding "truth" in our world, because we don't know the mathematical underpinnings of our world. We can only have theorems about SI outperforming certain other methods of inference on all possible inputs.

The simplest of such theorems says SI outperforms all computable algorithms at the log-score game. This is not a very difficult feat because SI is allowed to be uncomputable. (SI can be constructed as a "mixture" of all computable algorithms, and arbitrarily changing the weights in the "mixture" will give you different but equally powerful versions of SI.)

A more interesting theorem says that SI is also optimal in its own class: possibly uncomputable distributions that can be specified as "lower-semicomputable semimeasures". That's genuinely surprising because most classes of distributions don't have an optimal member within the class itself (see p.11 of Hutter's slides, with only one "Yes" on the main diagonal). But it doesn't seem very relevant to practice and somewhat undermines the special status of SI, because you can make universal mixtures of an even wider and more uncomputable class that will outperform SI the same way SI outperforms computable algorithms.

Thinking further along these lines, finding truth is likely to be an instrumental value rather than a terminal one, so we probably need to look at games other than the log-score game and see if SI can still make money when the rules for rewards and punishments are changed. This road leads to interesting unsolved questions, I wrote about some of them :-)

Right now I'm not sure there's any "deep truth" about applying SI to our world. In particular, there's the troubling possibility that SI's answers to your questions will depend on how you intend to use these answers!

When you describe a problem as conclusively solved when it's not, you risk making your readers lose their sense of wonder. Maybe you could take a page from Feynman? He wasn't afraid of giving the reader a glimpse of unsolved problems now and then. Eliezer also wrote about the perils of presenting a topic to students as if it were set in stone.

Right now I'm not sure there's any "deep truth" about applying SI to our world.

Solomonoff induction isn't *particularly* optimal at real tasks. Its performance on real-world tasks will depend somewhat on the choice of reference machine.

In particular, there's the troubling possibility that SI's answers to your questions will depend on how you intend to use these answers!

I don't *think* so. I put my criticism of that idea there.

As it turns out, there is an exact recipe for finding truth. It was discovered in the 1960s.

The problem is that you don't have time to follow the recipe. To find the truth to even a simple question using this recipe would require you to follow one step after another until long after the heat death of the universe, and you can't do that.

I'm confused. You don't need SI to find the exact truth if you have all information, but without all information, SI finds only the probable. *But even then* it's based on Kolmogorov complexity which is not computable, so SI is not computable, so "exactness" isn't ideal. Isn't it just that other approaches do worse on average?

But this doesn't

solveour problem. It merely points out that other statistical techniques don't fare any better.

Remember that crazy comic book/cartoon spin off *Bayes-Factor*? Where Razor, Hyperparameter, Conjugate, and the gang fight Null, Two-Tails, Propensity, and all the other epic villains? The *Bayes-Factor* team could handle this problem!

You don't need SI to find the exact truth if you have all information, but without all information, SI finds only the probable.

Yes of course. We are always engaged in a probabilistic search for truth.

But even then it's based on Kolmogorov complexity which is not computable, so SI is not computable, so "exactness" isn't ideal. Isn't it just that other approaches do worse on average?

The ideal, if we had infinite computing power, would be to discover what is most probably true. But the ideal is uncomputable, so we seek out approximations of that ideal that *are* computable, especially ones that are computable within a certain domain of problems, like the Bayesian information criterion.

We are always engaged in a probabilistic search for truth.

And a good thing, too, if I understand Tarski's undefinability theorem correctly.

While you don't have to mention Tarski, you do need to remember that some smart people will object to the claim that (in principle) there exists a recipe for truth.

As it turns out, there is a best recipe for finding truth from uncertainty (although it won't outperform recipes that do things like call for the truth as an ingredient and say not to cook anything). And a recipe that's almost as good for every problem was discovered in the 1960s. It was also discovered that no one could find a better one and know that they had discovered it.

To find a slightly better one and know you had would require you to follow one step after another for an infinite amount of time. That's impossible, so the best you can do in theory is to follow almost as good recipe we have.

The problem is that you don't have time to follow this recipe either. To find the truth to even a simple question using this recipe would require you to follow one step after another until long after the heat death of the universe. While this is less than an infinite amount of time, you can't wait that long either.

It was meant as a draft of an alternative, based on my limited understanding. I see two thresholds where I think you see one. The recipe is uncomputable, so it would take longer than "long after the heat death of the universe" or any other finite amount of time to finish. Also, the computable functions most similar to it would take more steps than there is time to do.

I'm going to write an (introductory) class paper on Kolmogorov complexity and Solomonoff induction soon-ish (within the next 2-3 months), and I already intended to write a less serious article or two about it while I'm at it (though not necessarily for LW). I'm not committing to anything, but I might well take you up on this, if it's still open by then.

I just wrote up my understanding of Solomonoff induction. Unfortunately it's a PDF, since I wanted to try out GNU TeXmacs. This might work for the "Binary Sequence Prediction" section:

http://www.ocf.berkeley.edu/~ehy/solomonoff_lightsaber.pdf

It's my first time doing this kind of thing, so please tell me how I can improve! Also, I tried to be careful, but I'm only a freshman, so I may have seriously misunderstood some things...

It was said by others (like lessdazed) but I insist on the difference between "very long to compute" and "not computable". Finding the winning moves in chess may require longer than the universe lifetime, but is theoricaly possible given enough computer power. Kolmogorov complexity is not computable, due to the impossibility of having an halting oracle. A full paragraph about those issues and how to get around them would be worth it IMHO. Reading the text without knowing that makes it feel that SI is just the same kind of "too long to be directly used" problem like winning at chess, while in fact it's a "level" harder (not even computable).

The problem is that you don't have time to follow the recipe.

In my view this is not the main practical problem with SI. In practice it doesn't matter much that you can't find the Kolmogorov complexity; you can still get closer and closer to it and as you do so you will get closer and closer to the truth.

The main practical problem with SI as a principle of statistics is that the Kolmogorov complexity is not an absolute quantity *in the limited-data regime* where most statistical analysis takes place. Imagine a psychologist who hands out N=400 surveys with 10 questions apiece, and each question has 8 possible answers. Then a naive encoding of the data set is 12,000 bits; this quantity is small compared to the Turing machine translation constants and so your estimate of the KC is entirely dependent on your choice of Turing machine.

The *main* practical problem with Solomonoff Induction surely is our lack of good quality computable approximations to it. If you don't have much data, you can *usually* just *gather some more data* - there's a whole world of it out there. It's a trivial solution, but it resolves an awful lot of problems.

I have a question about Solomonoff Induction:

Solomonoff Induction does not deal *directly* with sense data that is known to be uncertain. Uncertainty in sense data can affect predictions. So, for example, if I *know* apriori that my bit detector has a 40% chance of producing a random bit - and so far I have seen: 11101101... - then my estimate of the next symbol being a "0" would be higher as a result of my apriori knowledge.

Is there a "recognised" way of converting sensory biases into "plausible prefixes" - to deal with this issue? Or is there some other way of handling it?

Is it really necessary to explain the concept of *algorithm* in an "intuitive" explanation? This looks like it will turn into a decent explanation for the (IQ > 115 && curious) crowd. I think it's going to have to sacrifice a lot more detail if you want to get down to the (IQ ~ 100 && not particularly curious) crowd. Maybe I will make an attempt.

Luke, can you fix the link to the "Rathmanner & Hutter" article in the references section? It's missing an "http://" in front of it.

I made an introduction to Solomonoff induction a while ago.

I have an introduction to sequence prediction and an introduction to machine forecasting also.

These are in the public domain - so feel free to reuse.

In addition to the issues raised by cousin_it I'd like to raise another related issue: If one uses a strict Solomonoff prior, then one can't ever have a non-zero probability for a sequence to be uncomputable. So say for example we look at the first few thousand digits of the fine-structure constant in binary and find that they correspond exactly to some easily specifiable uncomputable sequence (say involving the Halting problem or the Busy Beaver function). An entity using Solomonoff induction cannot, no matter how many digits it sees, assign the obvious hypothesis any likelyhood.

I think this issue is a red herring. SI can be viewed as a purely predictive device that doesn't assign probabilities to hypotheses, only to observable events. You will find that even if you're endowed with the privileged knowledge that the fine structure constant is a halting oracle, that knowledge provably can't help you win a prediction game against SI, because your computable brain is not very good at predicting a halting oracle. When you're losing to a machine, saying it can't "really entertain a hypothesis" is like saying a submarine can't "really swim" or Rybka can't "really play chess".

You will find that even if you're endowed with the privileged knowledge that the fine structure constant is a halting oracle, that knowledge provably can't help you win a prediction game against SI

We can frequently compute the first several terms in a non-computable sequence, so this statement seems false.

When people talk about the impossibility of "winning" against SI, they usually mean it's impossible to win by more than a constant in the long run.

In that case, I'd say that your response involves special pleading. SI priors are uncomputable. If the fine structure constant is uncomputable, then any uncomputable prior that assigns probability 1 to the constant having its actual value will beat SI in the long run. What is illicit about the latter sort of uncomputable prior that doesn't apply to SI priors? Or am I simply confused somehow? (I'm certainly no expert on this subject.)

SI belongs to a class of priors that could be described as "almost computable" in a certain technical sense. The term is *lower-semicomputable semimeasure*. An interesting thing about SI is that it's also optimal (up to a constant) within its own class, not just better than all puny computable priors. The uncomputable prior you mention does not belong to that class, in some sense it's "more uncomputable" than SI.

An entity using Solomonoff induction cannot, no matter how many digits it sees, assign the obvious hypothesis any likelyhood.

Uncomputable sequences have computable approxiomations. Solomonoff induction could do very well at predicting such sequences. However, it *wouldn't* assume that they are uncomputable. Why should it? That is a bizarre hypothesis - not an "obvious" one. It is surely more likely that the observed finite prefix was generated by some computable method.

Solomonoff induction only deals with finite sequences. It *doesn't* assign p=0 to *any* sequence consistent with its observations so far. Uncomputable sequences are necessarily inifinite - and though Solomonoff induction can't handle them, neither can the observable universe. I think that the case that they matter remains to be made.

Update: Alex Altair has now finished this tutorial!A while ago I began writing a Solomonoff Induction tutorial, but now it's one of those Less Wrong articles I probably will never have time to write.

But, maybe a few of you can take what I've started, team up, and finish the thing. I'm basically just following along with the Solomonoff induction tutorials from Shane Legg (2008, 2004), but making them simpler and readable by a wider audience. I think this would be a valuable thing for Less Wrong to have, and one that would draw even more traffic to our community, but I don't have time to write it myself anymore!

Who wants to be a hero and finish this thing? The original markdown source is available here.

This is a tutorial for beginners. Those familiar with Solomonoff induction may prefer to read Rathmanner & Hutter (2011).

People disagree about things. Some say that vaccines cause autism; others say they don't. Some say everything is physical; others believe in a god. Some say that complicated financial derivatives are essential to a modern competitive economy; others think a nation's economy will do better without them. It's hard to know what is true.

And it's hard to know

how to figure outwhat is true. Some think you should assume the things you are most certain about and then deduce all other beliefs from your original beliefs. Others think you should accept at face value the most intuitive explanations of your personal experience. Others think you should generally agree with the scientific consensus until it is disproven.Wouldn't it be nice if finding out what's true was like baking a cake? What if there was a

recipefor finding out what is true? All you'd have to do isfollow the written instruction exactly, and after the last instruction you'd inevitably find yourself with some sweet, tastytruth!As it turns out, there

isan exact recipe for finding truth. It was discovered in the 1960s.The problem is that you don't have

timeto follow the recipe. To find the truth to even a simple question using this recipe would require you to follow one step after another until long after the heat death of the universe, and you can't dothat.But we can find shortcuts. Suppose you know the

exactrecipe for baking a cake requires you to count out one molecule of H2O at a time until you haveexactly0.5 cups of water. If you did that, you might not finish before the heat death of the universe. But you couldapproximatethat part of the recipe by measuring out something very close to 0.5 cups of water, and you'd probably still end up with a pretty good cake.Similarly, once we know the exact recipe for finding truth, we can try to

approximateit in a way that allows us to finish all the steps sometimebeforethe heat death of the universe.This tutorial explains the exact recipe for finding truth, Solomonoff induction, and also explains some attempts to approximate it in ways that allow us to figure out

nowwhat is (probably) true. Fear not; I shall not assume you know anything beyond grade-school math.Like my Crash Course in the Neuroscience of Human Motivation, this tutorial is

long. You may not have time for it; that's fine. But if youdoread it, I recommend you read it in sections.Contents:## Algorithms

At an early age you learned a set of precisely-defined steps — a 'recipe' or, more formally, an

algorithm— that you could use to find the largest number in an unsorted list of numbers like this:The algorithm you learned probably looked something like this:

Other algorithms could be used to solve the same problem. For example, you could work your way from right to left instead of from left to right. But the point is that if you follow this algorithm exactly, and you have enough time to complete the task, you can't

failto solve the problem. You can't get confused about what one of the steps means or what the next step is. Every instruction tells you exactly what to do next, all the way through to the answer.You probably learned other algorithms, too, like how to find the greatest common divisor of any two integers (see image on right).

But not just

anyset of instructions is a precisely-definedalgorithm. Sometimes, instructions are unclear or incomplete. Consider the following instructions based on an article about the scientific method:This is not an algorithm.

First, many of the terms are not clearly defined. What counts as an observation? What counts as a hypothesis? What would a hypothesis need to be like in order to ‘explain’ the observation? What counts as an experiment that will ‘test’ the hypothesis? What does it mean for experimental results to ‘confirm’ or ‘disconfirm’ a hypothesis?

Second, the instructions may be incomplete. What do we do if we reach step 4 and the experimental results neither ‘confirm’ nor ‘disconfirm’ the hypothesis under consideration, but instead are in some sense ‘neutral’ toward the hypothesis? These instructions don’t tell us what to do in that case.

An algorithm is a well-defined procedure that takes some value or values as input and, after a finite series of steps, generates some value or values as output.

For example, the ‘find the largest number’ algorithm above could take the input {21, 18, 4, 19, 55, 12, 30} and would, after 13 steps, produce the following output: {55}. Or it could take the input {34} and, after 1 step, produce the output: {34}.

What we’re looking for is a precise algorithm that will produce truth as its output.

## Induction

The problem of inference is this: We have a collection of observations (data), and we have a collection of hypotheses about the underlying causes of those observations. We’d like to know which hypothesis is correct, so we can use that knowledge to predict future events.

Suppose your data concern a large set of stock market price changes and other events in the world. You’d like to know the processes responsible for the stock market price changes, because then you can predict what the stock market will do in the future, and make some money.

Or, suppose you are a parent. You come home from work to find a chair propped against the refrigerator, with the cookie jar atop the fridge a bit emptier than before. One hypothesis that leaps to mind is that your young daughter used the chair to reach the cookies. However, many other hypothesis explain the data. Perhaps a very short thief broke into your home and stole some cookies. Perhaps your daughter put the chair in front of the fridge because the fridge door is broken and no longer stays shut, and you forgot that your friend ate a few cookies when he visited last night. Perhaps you moved the chair and ate the cookies yourself while sleepwalking the night before.

All these hypotheses are possible, but intuitively it seems like some hypotheses are more likely than others. If you’ve seen your daughter access the cookies this way before but have never been burgled, then the ‘daughter hypothesis’ seems more plausible. If some expensive things from your bedroom and living room are missing and there is hateful graffiti on your door at the eye level of a very short person, then then ‘short thief’ hypothesis seems more plausible than before. If you suddenly remember that your friend ate a few cookies and broke the fridge door last night, the ‘broken fridge door’ hypothesis gains credibility. If you’ve never been burgled and your daughter is out of town and you have a habit of moving and eating things while sleepwalking, the ‘sleepwalking’ hypothesis is less bizarre.

But the weight you give to each hypothesis depends greatly on your prior knowledge. What if you had just been hit on the head and lost all past memories, and for some reason the most urgent thing you wanted to do was to solve the mystery of the chair in front of the fridge door?

Thenhow would weigh the likelihood of the available hypotheses?## Occam’s Razor

Consider a different inductive problem. A computer program outputs the following sequence of numbers:

Which number comes next? If you guess correctly, you’ll win $500.

In order to predict the next number in the sequence, you make a hypothesis about the process the computer is using to generate these numbers. One obvious hypothesis is that it is simply listing all the odd numbers in ascending order from 1. If that’s true, you should guess 9 to win the $500.

But perhaps the computer is using a different algorithm to generate the numbers. Suppose that

nis the step in the sequence, so that n=1 when it generated ‘1’, n=2 when it generated ‘3’, and so on. Maybe the computer used this equation to calculate each number in the sequence:If so, the next number in the sequence will be 33. (Go ahead, check the calculations.) But doesn’t the first hypothesis seem more likely?

The principle behind this intuition, which goes back to William of Occam, could be stated:

The principle is called Occam’s razor because it ‘shaves away’ unnecessary assumptions.

For example, think about the case of the missing cookies again. In most cases, the ‘daughter’ hypothesis seems to make fewer unnecessary assumptions than the ‘short thief’ hypothesis does. You already know you have a daughter that likes cookies and knows how to move chairs to reach cookies. But in order for the short thief hypothesis to be plausible, you have to assume that (1) a thief found a way to break into your home, that (2) the thief wanted cookies more than anything else from your home, that (3) the thief was, unusually, too short to reach the top of the fridge without the help of a chair, and many other unnecessary assumptions.

Occam’s razor

soundsright, but can it be made more precise, and can it be justified? We will return to those questions later. For now, let us consider another important part of induction, that of updating our beliefs in response to new evidence.## Updating Beliefs

You’re a soldier in combat, crouching in a trench. You know for sure there is just one enemy soldier left on the battlefield, about 400 yards away. You also know that if the remaining enemy is a regular army troop, there’s only a small chance he could hit you with one shot from that distance. But if the remaining enemy is a sniper, then there’s a very good chance he can hit you with one shot from that distance. But snipers are rare, so it’s probably just a regular army troop.

You peek your head out of the trench, trying to get a better look.

Bam!A bullet glance off your helmet and you duck down again.“Okay,” you think. “I know snipers are rare, but that guy just hit me with a bullet from 400 yards away. I suppose it

mightstill be a regular army troop, but there’s a seriously good chance it’s a sniper, since he hit me from that far away.”After another minute, you dare to take another look, and peek your head out of the trench again.

Bam!Another bullet glances off your helmet! You duck down again.“Damn,” you think. “It’s definitely a sniper. No matter how rare snipers are, there’s no way that guy just hit me

twice in a rowfrom that distance if he’s a regular army troop. He’s gotta be a sniper. I’d better call for support.”This is an example of updating beliefs in response to evidence, and we do it all the time.

You start with some

priorbeliefs, and all of them are uncertain. You are 99.99% certain the Earth revolves around the sun, 90% confident your best friend will attend your birthday party, and 40% sure that the song on the radio you’re listening to was played by The Turtles.Then, you encounter new evidence — new observations — and update your beliefs in response.

Suppose you start out 85% confident the one remaining enemy soldier is not a sniper, leaving only 15% credence to the hypothesis that he

isa sniper. But then, a bullet glances off your helmet — an event far more likely if the enemy soldier is a sniper than if he is not. So now you’re only 40% confident he’s not a sniper, and 60% confident heis. Another bullet glances off your helmet, and you update again. Now you’re only 2% confident he’s not a sniper, and 98% confident heisa sniper.Now, I’m about to show you some

math, but don’t run away. The math isn’tsupposedto make sense if you haven’t studied it. Don’t worry; I’m going to explain it all.There’s a theorem in probability theory that tells you how likely one observation is given some other observations. It’s called Bayes’ Theorem.

At this point you may want to take a break and read either tutorial #1, tutorial #2, tutorial #3, or tutorial #4 on Bayes’ Theorem. I’ll only say a

littlebit more about Bayes’ Theorem in this tutorial.The short form of Bayes’ Theorem looks like this:

Let’s unpack this. The H refers to some hypothesis, and the E responds to some evidence. p(x) refers to the probability of x. The pipe symbol | means ‘given’ or ‘assuming’. So, Bayes’ Theorem reads:

You can see how this tells us what we’d like to know. We want to know the probability of some hypothesis — some model of the world that, if correct, will allow us to successfully predict the future — given the evidence that we have. And that’s what Bayes’ Theorem tells us. Now we just plug the numbers in, and solve for p(H|E).

Of course, it’s not easy to “just plug the numbers in.” You aren’t an all-knowing god. You don’t know

exactlyhow likely it is that the enemy soldier would hit your helmet if he’s a sniper, compared to how likely that is if he’s not a sniper. But you can do your best. With enough evidence, it will become overwhelmingly clear which hypothesis is correct.At this point, I'll let the tutorials on Bayes' Theorem above do the heavy lifting on how to update beliefs. Let me get back to the part of induction those tutorials

don'texplain: the choice of priors.## The Problem of Priors

In the example above where you're a soldier in combat, I gave you your starting probabilities: 85% confidence that the enemy soldier was a sniper, and 15% confidence he was not. But what if you don't know your "priors"? What then?

When using Bayes' Theorem to calculate your probabilities, your choice of prior can influence your results greatly. But how can we determine the probability of a hypothesis

beforeseeing any data? What does that evenmean?Legg (2008) explains:

But this doesn't

solveour problem. It merely points out that other statistical techniques don't fare any better.What we'd like is to reduce the potential for abusing one's opportunity to select priors, and to use agreed-upon priors when possible. Thus, many standard "prior distributions" have been developed. Generally, they distribute probability equally across hypotheses.

But to solve the problem of priors once and for all, we'd like to have an acceptable,

universalprior distribution, so that there's no fuzziness about the process of induction. We need arecipe, andalgorithm, for finding out the truth. And there can't be any fuzziness in an algorithm.## Binary Sequence Prediction

[intro to binary sequence prediction]

## Solomonoff and Kolmogorov

[summarize the contributions of Solomonoff and Kolmogorov]

## Solomonoff Lightsaber

[explain the result: Solomonoff Induction]

## Approximations

[survey a few of the approximations of Solomonoff Induction in use today]

## Philosophical Implications

[survey of some philosophical implications of unviersal induction]

## Notes

^{a}This quote and some of the examples to follow are from Legg (2008).## References

Legg (2008).

Machine Superintelligence. PhD thesis, Department of Informatics, University of Lugano.Rathmanner & Hutter (2011). A philosophical treatise of universal induction.