Open Problems Related to Solomonoff Induction

6th Jun 2012

11Incorrect

5Wei Dai

6Adele_L

5Wei Dai

3Kawoomba

4Mitchell_Porter

0Vladimir_Nesov

2Manfred

-1skepsci

2Incorrect

7Wei Dai

1Armok_GoB

9Eliezer Yudkowsky

1Wei Dai

0Eliezer Yudkowsky

0Wei Dai

0Eliezer Yudkowsky

0Wei Dai

0Eliezer Yudkowsky

0Wei Dai

0Eliezer Yudkowsky

7[anonymous]

2endoself

2jsalvatier

5[anonymous]

5Wei Dai

0cousin_it

0Wei Dai

4Patrick

3jacobt

2TimFreeman

2Eliezer Yudkowsky

0Wei Dai

2Eliezer Yudkowsky

0Wei Dai

13p1cd3m0n

3pengvado

23p1cd3m0n

0Kindly

03p1cd3m0n

0Kindly

03p1cd3m0n

2gjm

1Kindly

03p1cd3m0n

0Kindly

03p1cd3m0n

0Kindly

0gjm

1Eliezer Yudkowsky

1Patrick

0G0W51

1Cole Wyeth

03p1cd3m0n

0Username

03p1cd3m0n

0kokotajlod

0TimFreeman

0roll

0koning_robot

-2koning_robot

0Steven_Bukal

1pengvado

0Steven_Bukal

3pengvado

2Steven_Bukal

0timtyler

0timtyler

0Username

0timtyler

0torekp

0timtyler

0torekp

0timtyler

0witzvo

0Incorrect

0Incorrect

-1Incorrect

0Will_Newsome

1Incorrect

3Will_Newsome

1Incorrect

0Will_Newsome

0Will_Newsome

1Incorrect

2Will_Newsome

-1Will_Newsome

0timtyler

-2Will_Newsome

2Will_Newsome

-2Eugine_Nier

4Will_Newsome

0timtyler

0Viliam_Bur

0Will_Newsome

0Viliam_Bur

-2private_messaging

3Wei Dai

-2private_messaging

3Mitchell_Porter

1private_messaging

0Plasmon

0private_messaging

-3[anonymous]

New Comment

104 comments, sorted by Click to highlight new comments since: Today at 8:05 PM

Some comments are truncated due to high volume. (⌘F to expand all)

Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.

Solomonoff Induction would not consider this a description of x as it cannot be used to compute x.

512y

I guess I failed to bridge the inferential gap here. You're right "Solomonoff Induction would not consider this a description of x as it cannot be used to compute x." The point I'm trying to make is that a correct formalization of induction/Occam's Razor ought (in order to satisfy our intuitions) to assign x some probability > 1/3^^^3 since it has a short (even if uncomputable) description, but a formalization based on this P would not, therefore it can't be a correct formalization. But this is the case no matter what P we use (with different x depending on P), hence the "unformalizability" problem.

612y

Could you explain further? Why "ought" it assign it such a probability? As stated, this seems more convincing as an argument that it "ought not" to assign a probability > 1/3^^^3, despite the short "description".

512y

The idea is, however we define P, how can we be that sure that there isn't some kind of uncomputable physics that would allow someone to build a device that can find the lexicographically least object x such that P(x) < 1/3^^^3 and present it us?

312y

Maybe we should just work off of the assumption that there's no relevant uncomputable physics, because if there were, we should probably give up our endeavors anyways, unless we knew how to model an uncomputable reality within a computable AGI-program. As Schmidhuber ever so aptly wrote on his homepage:

412y

You could start out by trying to understand how an AI might invent the concept of uncomputability, and how it might then proceed to the possibility of uncomputable physics. And one way to get started here is by thinking as a cognitive historian, and asking how humans came up with the concept of uncomputability.

0[anonymous]12y

"Give up" is neither a well-defined strategy, nor one that was shown to be preferable to its alternatives.

212y

That gives you a probability inversely proportional to the integral of e^-(the description length) of each number from 0 to infinity. Complexity grows like log(n)-ish. It's an improper prior.
So I agree with Adele that this may be a modus tollens / modus ponens moment.

-111y

If there's some uncomputable physics that would allow someone to build such a device, we ought to redefine what we mean by computable to include whatever the device outputs. After all, said device falsifies the Church-Turing thesis, which forms the basis for our definition of "computable".

212y

Are you essentially saying Solomonoff Induction could be inferior to some other meta-algorithm (or whatever you call it) in uncomputable universes?

712y

Yes. ETA: I'm also making the stronger point that any formalization of induction we can come up with would have a similar problem. (Or at least the kinds of formalizations covered by my arguments.)

112y

Technically not if P is sufficiently long and complex.

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability.

It seems to me that the paradox...

111y

I'm not sure why you say this, because the device is supposed to output the truth of statements about some second-order logic, not about the universe. The device is not describable by the second-order logic via Tarski, so if the device is in the universe, the universe must only be describable by some meta-logic, which implies the device is outputting truth of statements about something strictly simpler than the universe. The agent ought to be able to conceive of this...

011y

In that case I'm not sure what you mean by "second-order analogue of Solomonoff induction". Solomonoff induction is about the universe and proceeds from sensory experiences. What does it mean to induct that something may produce the truth of sentences "about a second-order logic"?

011y

I mean suppose you encounter a device that outputs "true" or "false" whenever you feed it a sentence in some second-order logic, and after doing a lot of tests, you think maybe the device outputs "true" if and only if the input sentence is actually true, regardless of what input it receives. This seems like a straightforward analogue of inducting that something may be a Halting Oracle, which I thought you agreed an agent should be able to do?
I'm not sure if this answers your question. If not, feel free to find me on Skype or Google Chat to talk more.

011y

By "true" do you mean that the physical universe semantically entails the sentence, or that the sentence is a semantic tautology of second-order logic? I'm assuming the latter, since the former case is what I was assuming when I gave my Tarski reply.
So... I'm pretty sure you can represent a set's semantic entailment of a Henkin interpretation of a second-order sentence in a first-order set theory. I'm less sure that you can represent entailment of a second-order sentence inside a second-order set theory, but I'm having trouble seeing why you wouldn't be able to do that. I also think that second-order set theories are supposed to be finitely axiomatizable, though I could be wrong about this. But then I'm not quite sure why we don't get into Tarskian trouble.
Let ST be the finite axiomatization of a second-order set theory. Then in second-order logic, why doesn't the sentence (ST->"All sets: Set entails S") form a truth predicate for S? My guess is that there's just no theorem which says, "X is true (entailed by the mathematical universal set / model we're currently operating inside) iff 'X' is entailed by all sets (inside the current model)".
If this is so, then it looks to me like for any given set of axioms corresponding to a second-order set theory, second-order logic can represent the idea of a device that outputs sentences which are semantically entailed by every individual set inside that theory. So the new answer would be that you are welcome to hypothesize that a device prints out truths of second-order logic, for any given background second-order set theory which provides a universe of models against which those sentences are judged universally semantically true.
In which case the indefinite extensibility gets packed into the choice of set theory that we think this device is judging second-order validity for, if I haven't dropped the bucket somewhere along the way.

011y

Let S="ST -> false". Then S is false in second-order logic (assuming ST is consistent), but (ST->"All sets: Set entails S") is true, because ST's model has to be bigger than any set (I think it's usually taken to be the proper class of all sets), so every set entails "Not ST".
On input S="ST -> false", your device prints out "true", while my device prints out "false". I still want to be able to hypothesize my device. :)

011y

There are totally models of ZFC containing sets that are models of ZFC. See "Grothendieck universe". Is there a reason why it'd be different in second-order logic? I don't think a second-order set theory would pin down a unique model, why would it? Unless you had some axiom stating that there were no more ordinals past a certain point in which case you might be able to get a unique model. Unless I'm getting this all completely wrong, since I'm overrunning my expertise here.
So in retrospect I have to modify this for us to somehow suppose that the device is operating in a particular model of a second-order theory. And then my device prints out "true" (if it's in one of the smallest models) or the device prints out "false" (if it's in a larger model), unless the device is against the background of an ST with an upper bound imposing a unique model, in which case the device does print out "true" for ST -> false and from the outside, we think that this device is about a small collection of sets so this result is not surprising.
Then the question is whether it makes sense to imagine that the device is about the "largest relevant" model of a set theory - i.e., for any other similar devices, you think no other device will ever refer to a larger model than the current one, nor will any set theory successfully force a model larger than the current one - I think that's the point at which things get semantically interesting again.

011y

Second-order set theory is beyond my expertise too, but I'm going by this paper, which on page 8 says:
So I was taking the "obvious alternative" of proper class of all sets to be the standard model for second order set theory. I don't quite understand the paper's own proposed model, but I don't think it's a set either.

011y

I'm not sure I believe in proper classes and in particular, I'm not sure there's a proper class of all sets that could be the model of a second-order theory such that you could not describe any set larger than the model, and as for pinning down that model using axioms I'm pretty sure you shouldn't be able to do that. There are analogues of the Lowenheim-Skolem theorem for sufficiently large infinities in second-order logic, I seem to recall reading.

Is complexity objective?

This one bother me as well. Turing machine basis for complexity favours certain types of theories over others. For example, how many bits of turing machine does it take to predict, say, maxwells equations, which involve *continuous fields*?

(warning: original "research" following)

One temptation is to sweep this under the rug by saying that the complexity required to talk about continuous math is just a constant-sized interpreter that you put on the turing machine, but this can't get around the fact that some thoeries *won't*...

212y

The thing that you're describing is a lot like finding the stationary distribution of a Markov chain. Not all Markov chains have stationary distributions and I can't tell whether this one would, though if it does have one it would be unique. It's an interesting idea.
I should also note that it is not necessary to do anything like this to preform induction over multiple theories. Instead, we can just require a program that outputs known theories in addition to the new theory. We can ask them to be output in any form, such as predictions or source code, since converting source code to predictions is O(1) in program size. Then, if the different theories have more in common, the program computing them all will be shorter than one that must seperately specify elements of very different theories. A slightly different implementation of the same general principle is Schmidhuber's Optimal Ordered Problem Solver.

212y

Some continuous math is commutable, some is not. There are uncomputable real numbers. Computable number.

This is the second mention of second-order logical Solomonoff Induction, but I can't imagine what such a thing would look like.

512y

It's Luke and Eliezer's term, but I guess the idea is similar to the one I had. You take some second-order theory, and let each string in the formal language that constitutes a description of X contribute n^-l to the probability of X, where n is the size of the alphabet, and l is the length of the string. Use the resulting distribution in place of the universal prior in Solomonoff Induction.

012y

Let's say the theory is ZFC. Does the string "0 if the continuum hypothesis if true, otherwise 1" contribute to the probability of 0, or the probability of 1?

012y

(As you most likely know, in the standard interpretation of second order ZFC the continuum hypothesis has a definite truth value, but it can't be proved or disproved in any of the deductive systems that have been proposed for second order ZFC.) My suggestion is that the string "0 if the continuum hypothesis if true, otherwise 1" contribute to the probability of 0 if the continuum hypothesis is true, and the probability of 1 if the continuum hypothesis is false. The problem that we don't know how to deal with a probability distribution like this seems similar to the problem of not knowing how to use the universal prior in the original Solomonoff Induction, both of which seem to be instances of logical/mathematical uncertainty, and there might be a general solution to that problem.
The reason I think this is plausible is that for example P=NP may not be provable or disprovable in any formal deductive system that have been proposed, and yet any agent would still have to make decisions that are affected by its truth value, such as whether to search for a polynomial solution to 3-SAT. If there is some sort of principled way to make those decisions, perhaps the same methods can be used to make decisions involving the continuum hypothesis?

All universal Turing machines can simulate each other with logarithmic slowdown. Saying that the parameter means that complexity "subjective" is like saying the time complexity of Quicksort is "subjective" because the algorithm doesn't specify which programming language to implement it in.

For aliens with a halting oracle:

Suppose the aliens have this machine that may or may not be a halting oracle. We give them a few Turing machine programs and they decide which ones halt and which ones don't. Then we run the programs. Sure enough, none of the ones they say run forever halt, and some of them they say don't run forever will halt at some point. Suppose we repeat this process a few times with different programs.

Now what method should we use to predict the point at which new programs halt? The best strategy seems to be to ask the aliens whi...

Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.

The description of x has to include the description of P, and that has to be computable if a universal distribution is going to assign positive probability to x.

If P has a short computable description, then yes, you can conclude ...

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability

Actually, what Tarski seems to s...

012y

I would like to have a method of induction that for any formal language, assigns a non-zero prior to the existence of a device that can enumerate or decide all true sentences in that language, or alternatively an explanation based on reasonable principles for which such devices should have zero probabilities. Right now we do not have either, and your research program for improving SI (i.e., to base it on second-order logic) will not give us either even if it's successful. So while I'm not sure it makes sense to say I "blame" Solomonoff induction (what could that mean?), you could say that I'm not satisfied with either the status quo or any improvements to it that we can currently foresee.

212y

Give me a set of formal languages over which you can say the phrase "for any formal language", and the truth predicate for the union of the set won't be in any language in the set. I'm still trying to understand how to deal with this inside AI, but I'm not sure that blaming it on second-order logical induction is putting the blame in the right place.

012y

Again, I'm not sure what you mean by "blame" here. If you're saying that Tarski's result represents a problem that affects more than just attempts to generalize Solomonoff induction, then I agree.
BTW, while I have your attention, what's your evaluation of Paul Christiano's FAI design idea, which sort of tries to punt as many philosophical problems as possible (including this one)? I noticed that you didn't comment in that discussion.

If I understand Solomonoff Induction correctly, for all n and p the the sum of the probabilities of all the hypotheses of length n equals the sum of the probabilities of hypotheses of length p. If this is the case, what normalization constant could you possibility use to make all the probabilities sum to one? It seems there are none.

39y

Use a prefix-free encoding for the hypotheses. There's not 2^n hypotheses of length n: Some of the length-n bitstrings are incomplete and you'd need to add more bits in order to get a hypothesis; others are actually a length <n hypothesis plus some gibberish on the end.
Then the sum of the probabilities of all programs of all lengths combined is 1.0. After excluding the programs that don't halt, the normalization constant is Chaitin's Omega.

29y

Unfortunately Chaitin's Omega's incomputable, but even if it wasn't I don't see how it would work as a normalizing constant. Chaitin's Omega is a real number, there is an infinite number of hypotheses, and (IIRC) there is no real number r such that r multiplied by infinite equals one, so I don't see how Chaitin's Omega could possible work as a normalizing constant.

09y

If we assign a value of 2^-n to each hypothesis (that is, each halting program) of length n, then the total value summed over all hypotheses is Chaitin's Omega.
Thus, if we assign a value of 2^-n/Omega to each hypothesis, the total is 1, and this gives us a valid probability distribution.

09y

That would assign a zero probability of hypotheses that take more than n bits to specify, would it not? That sounds far from optimal.

09y

Sorry, I mean that we do this for each value of n. So, given a hypothesis H, it is assigned a probability of 2^-length(H)/Omega.

09y

I still don't see how that would make all the hypotheses sum to 1. Wouldn't that only make the probabilities of all the hypotheses of length n sum to 1, and thus make the sum of all hypotheses sum to > 1? For example, consider all the hypotheses of length 1. Assuming Omega = 1 for simplicity, there are 2 hypotheses, each with a probability of 2^-1/1 = 0.5, making all them sum to 1. There are 4 hypotheses of length 2, each with a probability of 2^-2/1 = 0.25, making them sum to 1. Thus, the sum of the probabilities of all hypotheses of length <= 2 = 2 > 1.
Is Omega doing something I don't understand? Would all hypotheses be required to be some set length?

29y

I think you may be missing two things.
First: the hypotheses/programs (recall: we are representing hypotheses by computer programs in some suitable language) need to be written in a "prefix-free" representation, meaning one in which e.g. if 100111 is a valid program then nothing else beginning 100111 can be. So you don't get two programs of length 1, four of length 2, etc. If you do this, and if you also arrange your coding so that every infinite string of bits starts with some legal program, then the sum over all legal programs of 2^-length is exactly 1. (So Pr(program) = 2^-length(program) defines a probability distribution over programs. It's the same probability distribution as you get from the following procedure: keep generating random bits until the bits you've generated form a legal program.)
Second: you can't just take Ω=1 "for simplicity"; Ω is defined to be what you get by adding up 2^-length for all programs that halt. (Remember that hypotheses are being represented as computer programs here.) Equivalently, if you imagine picking a random well-formed program by choosing random bits, Ω is the probability that the resulting program halts. So it's certainly <1.
[EDITED to fix an ugly but hopefully not confusing typo.]

19y

The hypotheses are encoded in a prefix-free way: no hypothesis is a prefix of another hypothesis.
In particular, this eliminates your example: if both strings of length 1 ("0" and "1") are valid hypotheses, then no string of longer length can be a valid hypothesis (then we can take Omega=1 and assign a probability of 1/2 to each). Or we could have "0", "10", and "110" be valid hypotheses; then we can take Omega = 7/8 and assign probabilities of 4/7, 2/7 and 1/7 to these.
In general, the prefix-free condition ensures that the sum over all hypotheses converges, by Kraft's inequality, which is the real heavy lifter here; the constant is merely there to make the sum over all hypotheses converge to 1 instead.
You might imagine that hypotheses are required to be grammatically correct English sentences (I guess encoded in ASCII or something). In that case, no hypothesis is a prefix of another, because each hypothesis ends with a period.

09y

Right; I forgot that it used a prefix-free encoding. Apologies if the answer to this is painfully obvious, but does having a prefix-free encoding entail that there is a finite number of possible hypotheses?

09y

It doesn't: for example, the set of strings 0, 10, 110, 1110, 11110, 111110, ... is infinite and prefix-free. So is the set of C-style (null-terminated) strings.

09y

I see. Does the method of normalization you gave work even when there is an infinite number of hypotheses?

09y

It does. The infinite sum will always converge, so by normalizing we can make it converge to 1.
Take gjm's approach to explaining the normalization, in which the initial weight of 2^-length assigned to a hypothesis is the probability that you obtain that hypothesis by choosing bits at random. Then the normalization is just a conditional probability: you divide by the probability that, when choosing bits at random, you do eventually end up hitting a hypothesis.

09y

Remark: these are special cases of the schema "all strings that contain exactly one instance of pattern X in possible positions Y", where in the first case X is "0" and Y is "anywhere" and in the second X is "00000000" and Y is "any position that's a multiple of 8 bits".
(Of course there are plenty of other prefix-free sets of bitstrings. For instance: interpret blocks of 8 bits as characters as in Kindly's second example, and say that a valid program is anything that begins with a "(", has properly matched parentheses, and ends with the ")" matching the first "(". This doesn't have the convenient property that every bitstring begins with a valid program; for examples that do, take the exponential Golomb coding or the Elias omega coding of natural numbers.

A kilobit of improbability requires only a kilobit of data to offset it, which isn't very crackpot at all. Proving minimum length is impossible, but proving an upper bound on length is very easy, and that proves a lower bound on probability.

I take your point re: length vs speed. The theorem that I think justifies calling Kolmogorov Complexity objective is this:

"If K1 and K2 are the complexity functions relative to description languages L1 and L2, then there is a constant c (which depends only on the languages L1 and L2) such that |K1(s) - K2(s)| <= c for all strings s."

(To see why this is true, note that you can write a compiler for L2 in L1 and vice versa)

I don't see why code modelling symmetric laws should be longer than code modelling asymmetric laws (I'd expect the reverse; ...

Here's an idea for a modification of Solomonoff induction that no longer has a subjectively-chosen machine to encode hypotheses in. One could instead simply consider how many bits it would take to encode a solution on all universal Turing machines, and making each hypotheses’ prior equal to the average of its prior on each machine. This makes somewhat intuitive sense, as one doesn’t know which “machine” the universe in “programmed” on, so it’s best to just assume a random machine. Unless I’m mistaken, there’s an infinite number of universal Turing machines, but I think algorithms could still approximate the induction.

111mo

How would you take the average over an infinite number of UTM's? You would first need to choose a distribution on them.

09y

of course not

09y

"Of course" implies that the answer is obvious. Why is it obvious?

07y

Dunno what Username was thinking, but here's the answer I had in mind: "Why is it obvious? Because the Problem of Induction has not yet been solved."

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability.

I'm not sure I understand you c...

What does Solomonoff Induction actually say?

I believe this one has been closed ages ago by Alan Turing, and practically demonstrated for approximations by the investigation into busy beaver function for example. We won't be able to know BB(10) from God almighty. Ever.

011y

I'm not sure what you're trying to say here, but if you consider this a relative weakness of Solomonoff Induction, then I think you're looking at it the wrong way. We will know it as well as we possibly could given the evidence available. Humans are subject to the constraints that Solomonoff Induction is subject to, and more.

-211y

Whoops, thread necromancy.

Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity.

I'm very confused about this myself, having just read this introductory paper on the subject.

My understanding is that an "ideal" reference UTM would be a universal turing machine with no bias towards any arbitrary string, but rigorously defining such a machine is an open problem. Based on our observation of UTMs, the more...

112y

Even if you do that, you're left with an infinite number of cliques such that within any given clique the languages can write short interpreters for each other. Picking one of the cliques is just as arbitrary as picking a single language to begin with. i.e. for any given class of what-we-intuitively-think-of-as-complicated programs X, you can design a language that can concisely represent members of X and can concisely represent interpreters with this special case.
It's a function of the language you're writing an interpreter of and the language you're writing it in. "Constant" in that it doesn't depend on the programs you're going to run in the new language. i.e. for any given pair of languages there's a finite amount of disagreement between those two versions of the Solomonoff prior; but for any given number there are pairs of languages that disagree by more than that.
No. There's no law against having a gigabyte-long opcode for NAND, and using all the shorter opcodes for things-we-intuitively-think-of-as-complicated.

012y

So is there then a pragmatic assumption that can be made? Maybe we assume that if I pick a turing machine blindly, without specifically designing it for a particular output string, it's unlikely to be strongly biased towards that string.

312y

What probability distribution over turing machines do you blindly pick it from? That's another instance of the same problem.
Pragmatically, if I non-blindly pick some representation of turing machines that looks simple to me (e.g. the one Turing used), I don't really doubt that it's within a few thousand bits of the "right" version of solomonoff, whatever that means.

212y

Why not?

Apparent Unformalizability of “Actual” Induction

Re: Tarski’s Indefinability of Truth - I don't get this one. Solomonoff Induction estimates probabilities of bitstring sequences - but it's uncomputable - and the best we can do is finite approximations - which of course have limitations and will be able to be improved upon - just as Tarski’s "Indefinability of Truth" says.

Re: Argument via Berry’s Paradox - but the description of x includes P - which could be complex. So the description of x is not short - if you are not given P.

Is complexity objective?

Kinda. Some notions of complexity are more useful than others. However, condition your machine on some real world sense data for a while, and *any* reasonable prior soon gets swamped by data - so in practice, this is not a big deal.

09y

If it was a matter of gathering enough data we wouldn't have to talk about Solomonoff induction at all.

How can we apply Solomonoff when our inputs are not symbol strings?

Inputs can always be represented by bit sequences. Qualia are an irrelevance. Q.E.D.

012y

Doesn't the "irrelevance" depend on an assumption about how our minds work? If we have some analog computation going on, the bit sequences might be a mere approximation to our actual inductive reasoning. Of course, once we put our conclusions into language, those conclusions can be digitized - but that might happen at a relatively late stage of the reasoning game.

012y

Analog computation is another irrelevance - analog systems can always be approximated arbitrarily closely by discrete systems. Witness the ongoing digial revolution.

012y

Sure, if you are designing a system from the ground up. I've never needed more than 12 bits of precision for any real engineering purpose, and I wouldn't pay a single dollar for more. But Wei_Dai's question was about us - or at least reads literally that way. Maybe I took it too literally.

012y

There's no credible evidence that we aren't digital. E.g. see digital physics. If nobody can tell the difference, the issue is probably in the realm of vacuous philosophy..

I'm not sure what you have in mind by basing Solomonoff induction on a second-order logic. Can you give a reference?

Here is what I'd suggest as a baseline. Use John Tromp's Binary Lambda Calculus with some variation of Levin Search to explore time/complexity bounded programs that output some initial sequence of data.

Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity. What’s the significance of the fact that we can’t seem to define a parameterless concept of complexity? That complexity is subjective?

Perhaps we could formalize some argument that for an arbitrarily poor definition of complexity, the error becomes negligible for the average sufficiently complex universe?

[This comment is no longer endorsed by its author]

Solomonoff Induction is defined over symbol strings (for example bit strings) but our perceptions are made of “qualia” instead of symbols. How is Solomonoff Induction supposed to work for us?

Our perceptions have some neurological representation.

[This comment is no longer endorsed by its author]

What does Solomonoff Induction actually say about, for example, whether we live in a creatorless universe that runs on physics?

The entirely depends on your definition of creator. Traditional creators such as the Christian god could potentially have enough explanatory power once properly defined, yet would end up horrendously complex (encoding an entire human-like mind into the primitive natural law) unless you further reduce the construction of the creator to a natural process.

Or the Simulation Argument?

Solomonoff Induction is empirical: unless the ...

012y

People are perfectly fine with fuzzy approximate explanations of phenomena, like Maxwell's equations &c. "Goddidit" is not that different. Trying to get a full causal explanation would mean finding bits of Omega. In the end, decision theory is fundamental, and epistemological abstractions like SI are cool but ultimately irrelevant. This whole "encoding a human-like mind" thing doesn't work like you think it does --- you can interpret SI that way and see some cool implications, just remember it's a useless toy model. ...Just sayin'.

112y

Physics theories import low-complexity mathematical models. "Goddidit" imports complicated human notions of agency. Approximate explanations are fine if we can reason that their implicit complexity is low relative to their explanatory power (a relatively easily satisfied metric, after which competition between theories becomes the key factor).
In Solomonoff Induction, theories that don't explain data must contain that data raw.

312y

Frankly, I think this idea is attractive but ultimately an error. It is indeed true that to an analytical mind with an interest in physics, mathematics feels a lot less complex, in some sense, than intuitive notions of agency. But no matter how much physics or psychology you know, you don't have introspective access to the universal prior --- maybe the prior privileges math over psychology, or maybe it doesn't. All we have is our evidence, often in the form of conclusions drawn from intuitive analyses of what hypotheses have or haven't tended to bear intellectual or instrumental fruit in the past --- id est, we're awfully close to talkin' 'bout pragmatics and decision theory here. And yes, mathematical explanations have been surprisingly effective. But if you look at human history, hypotheses that make use of "complicated human notions of agency" have also been pretty damn effective. It's not obvious what notion of complexity would massively privilege the former over the latter, and at any rate, we have no way of knowing, because you can't find the universal prior in your backyard.

112y

We have objective verification of the low complexity of formalized mathematical theories because we can look at the length of their formal description in say, first-order logic.
Are you really suggesting some model of computation based on human ideas might work better than say, lambda calculus for computing Kolmogorov complexity for Solomonoff Induction? I'm not sure how to argue with that but I would appreciate it if you would state it explicitly.

012y

Right, and that'll be important if we ever run into aliens that for some reason can't wrap their brains around English, but instead can figure out our category theory notation and so on. Or if we're trying to build an FAI, or collaborate with the aforementioned aliens to build FAI.
Apologies, inferential distance, and there's a few meta-level points that I think are important to communicate. But there's inferential distance on the meta level too.

012y

Also keep in mind that algorithmic information/probability theory is actually quite hard to interpret correctly --- the basic, intuitive way to read meaning into the math is not quite the way to do it. cousin_it has a post or two correcting some intuitive errors of interpretation.

112y

I found these:
Intuitive Explanation of Solomonoff Induction - lukeprog
Does Solomonoff always win? - cousin_it
K-complexity of everyday things - cousin_it
Solomonoff Induction, by Shane Legg - cousin_it
I would appreciate it if people could link me to more.

212y

Alas, none of those are the relevant ones I think. I'm actually rather busy visiting home, so I can only justify certain comments to myself, but I hope someone provides the links.
For what it's worth, I'm a little skeptical of lukeprog's understanding of SI --- no offense to him meant, it's just I so happen to believe he made a rather big error when interpreting the math. On the other hand, cousin_it seems to be really on the ball here. Those are just my impressions; I'm a pretend philosopher, not a compsci dude. At any rate I think it'd be just dandy for cousin_it to check Luke's posts and share his impression or critiques.

-112y

Here's one I was thinking of:
The prior of a hypothesis does not depend on its complexity - cousin_it
(If I recall, Nesov's comment clearly demonstrates the important point.)

012y

That post seems to mix together the concept of a prior with the concept of experience.

-212y

http://lesswrong.com/lw/328/description_complexity_an_apology_and_note_on/

212y

Mentioning it anywhere except algorithmic information theory is a sign of confusion. This includes theology and parapsychology. Use just Bayes or, if you want to be all fancy, updateless-like decision theories. I love algorithmic probability to death but it's just not something you should use casually. Too many pitfalls.

-212y

Bayes requires a prior.

412y

No one should ever need to discuss "priors". Focus on the likelihood ratio.

012y

...but that's like comparing apples and cheese!

012y

Approximate explanations have some predictive power.
What experience do you expect if "Goddidit", as opposed to if "Goddidntdoit"?
(Skeletons of angels versus skeletons of dinosaurs? People with supernatural powers versus people working with superstition? Benevolent universe versus indifferent universe?)

012y

—Twelve Virtues of Rationality
It's just, I'm having an amazing time back home, and my time is limited. I don't know your goals, but you might want to try harder to signal that you're really curious and not just asking questions that you think are rhetorical. When you reference common knowledge 'round these parts, like Eliezer's posts, you should expect that the other person is already aware of that knowledge, and that they have real, substantive reasons to think that what they said is not entirely refuted by the contents of said common knowledge.
Of course, asking rhetorical questions is a perfectly decent way to make an argument. It's just that arguments in that sense aren't quite what's called for in situations like these, I think. But that might just be a difference in our epistemic styles, especially if you're Slavic. (Gasp, racism! ;P )

012y

Good point.
Also good point about time being limited, so...
If you'd someday later feel like writing a LW article about similarities between "Goddidit" and Maxwell's equations, or something like that, I will read it.

The interesting bit is that if you could actually use Solomonoff induction as prior, I'm pretty sure you'd be an irreparable aether believer and 'relativity sceptic', with confidence that has so many nines nothing would *ever* convince you. That's because the absolute-speed irreversible aether universes like 3d versions of Conway's game of life are so much more computationally compact than highly symmetric universe with many space-time symmetries (to the point of time and space intermixing in the equations) and the fully reversible fundamental laws of physic...

312y

Can you explain why this is the case? Also, when you say "aether" universes are more computationally compact than "relativity" universes, is this before or after taking into account our observations (i.e., are you restricting your attention to universes that fit our observations, or not)?
Is it possible that what you said is true only if we want the laws of physics to run fast on current computers? I'm afraid that most software people, including myself, probably have bad intuitions about Solomonoff Induction because we only have experience with a very small subset of possible computations, namely those that can be run economically on modern computers. Perhaps the laws of physics can be implemented much more compactly if we ignored efficiency?

-212y

As intuition pump consider the reversible vs irreversible cellular automata. If you pick at random, vast majority will not be reversible. Ditto for the symmetries. (Keep in mind that in Solomonoff probability we feed infinite random tape to the machine. It is no Occam's razor. Elegant simplest deterministic things can be vastly outnumbered by inelegant, even if they are most probable. edit: that is to say you can be more likely to pick something asymmetric even if any particular asymmetric is less likely than symmetric)
There can always be a vast conspiracy explaining the observations... ideally if you could simulate whole universe (or multiverse) from big bang to today and pick out the data matching observations or the conspired lying, then maybe it'd work, but the whole exercise of doing physics is that you are embedded within universe you are studying. edit: and that trick won't work if the code eats a lot of random tape.
I don't think relaxing the fast requirement really helps that much. Consider programming Conway's game of life in Turing machine. Or vice versa. Or the interpreter for general TM on the minimal TM. It gets way worse if you want full rotational symmetry on discrete system.
Of course, maybe one of the small busy beavers is a superintelligence that likes to play with various rules like that. Then I'd be wrong. Can not rule even this possibility out. Kolmogorov/Solomonoff name drop is awesome spice for cooking proven-untestable propositions.
One could argue that second order logic could work better, but this is getting way deep into land of untestable propositions that are even proven untestable, and the appropriate response would be high expectations asian father picture with "why not third order logic?".
edit: also you hit nail on the head on an issue here: i can not be sure that there is no very short way to encode something. You can ask me if I am sure that busy beaver 6 is not anything, and I am not sure! I am not sure it is not the god al

312y

Discrete universes introduce an unobserved parameter (a fundamental length) as well as a preferred frame. Abandoning complex numbers for a countable group in QM would also be very difficult. There is a complexity theory for real-valued objects and if Solomonoff induction could be adapted to that context, continuum theories ought to be favored.

112y

Yep. Well, in any case, the point is that the Solomonoff probability is not in any way 'universal prior', you have to pick the machine, then you can't actually compute anything useful because its uncomputable and you'll never get anything non-trivial all the way down to minimum length.
If you go for changing the machine, you could use laws of physics as you know them as the machine, too (then you'll be irreparable sceptic of any new physics). In the context of laws of physics its just the emulation of one universal computation (laws of physics) with another (what ever stuff you pick). We are only discussing this because some highly self-important people heard of Solomonoff induction and Kolmogorov complexity, modelled those with something ill defined and fuzzy (lacking not only sufficient understanding of those things but the understanding of the concepts in which to understand those, and so on several layers deep, as is usually the case when you choose something really complicated without having spent a lot of time studying towards it) and then used it as universal smart sounding word to say (e.g. here or here or here ). Name-dropping and nothing more.

012y

Consider Penrose's "Angular momentum: An approach to combinatorial space-time" (math.ucr.edu/home/baez/penrose/Penrose-AngularMomentum.pdf)
It seems that euclidean space, at least, can be derived as a limiting case from simple combinatorial principles. It is not at all clear that general relativity does not have kolmogorov complexity comparable to the cellular automata of your "aether universes".

012y

Kolmogorov complexity of GR itself (text of GR or something) is irrelevant. Kolmogorov complexity of universe that has the symmetries of GR and rest of physics, is. Combinatorial principles are nice but it boils down to representing state of the universe with cells on tape of linear turing machine.

Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity. What’s the significance of the fact that we can’t seem to define a parameterless concept of complexity? That complexity is subjective?

In the Kolmogorov-Chaitin Minimum description length approach, the subject must pick a Turing machine whose operations describe the basic operations believed to represent 'simplicity' by the subject...

Solomonoff Induction seems clearly "on the right track", but there are a number of problems with it that I've been puzzling over for several years and have not made much progress on. I think I've talked about all of them in various comments in the past, but never collected them in one place.

## Apparent Unformalizability of “Actual” Induction

## Argument via Tarski’s Indefinability of Truth

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability.

## Argument via Berry’s Paradox

Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.

## Is Solomonoff Induction “good enough”?

Given the above, is Solomonoff Induction nevertheless “good enough” for practical purposes? In other words, would an AI programmed to approximate Solomonoff Induction do as well as any other possible agent we might build, even though it wouldn’t have what we’d consider correct beliefs?

## Is complexity objective?

Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity. What’s the significance of the fact that we can’t seem to define a parameterless concept of complexity? That complexity is subjective?

## Is Solomonoff an ideal or an approximation?

Is it the case that the universal prior (or some suitable generalization of it that somehow overcomes the above "unformalizability problems") is the “true” prior and that Solomonoff Induction represents idealized reasoning, or does Solomonoff just “work well enough” (in some sense) at approximating any rational agent?

## How can we apply Solomonoff when our inputs are not symbol strings?

Solomonoff Induction is defined over symbol strings (for example bit strings) but our perceptions are made of “qualia” instead of symbols. How is Solomonoff Induction supposed to work for us?

## What does Solomonoff Induction actually say?

What does Solomonoff Induction actually say about, for example, whether we live in a creatorless universe that runs on physics? Or the Simulation Argument?