Kevin T. Kelly's Ockham Efficiency Theorem

[-]Wei Dai15y170

As Unknown pointed out, using Kolmogorov complexity is a better bet if winning means finding the truth quickly, instead of just eventually.

The rules of this game, plus the assumption that Nature is maximally adversarial, seem specifically designed so that this particular version of Ockham would be optimal. It doesn't really seem to provide much insight into what one should do in more general/typical settings.

[-]Johnicholas15y120

The question that Kelly is trying to answer is "Why, or in what sense, does Occam's razor work?". Yes, the answer is "It works in that it is worst-case-efficient." He doesn't assume a simplicity-biased prior (which some might characterize as a circular justification of Occam's razor).

I worry that the particular example that I'm presenting here (where the Occam strategy steps in a vaguely linear fashion) is coloring your thinking about Kelly's work generally, which has little or nothing to do with stepping in a vaguely linear fashion, and everything to do with believing the simplest theory compatible with the evidence. (Which, based on your other writings, I think you would endorse.)

Please investigate further, perhaps by reading one of Kelly's papers; I would hate for my poor writing skills to mislead you.

1Wei Dai15y

Is the particles example one of Kelly's own examples, or something you made up to explain his idea? Because in that example at least, we seem to be assuming a game with very specific rules, showing that a particular strategy is optimal for that game, and then calling that strategy "Occam". Compare this with Solomonoff induction, where we only have to assume that the input is computable and can show that Bayesian updating with a prior based on Kolmogorov complexity is optimal in a certain sense. I do endorse that, but there are various aspects of the idea that I'm still confused about, and I'm not seeing how Kelly's work helps to dissolve those confusions. I think I read some of Kelly's own writings the last time cousin_it pointed them out, but again, didn't really "get it" (in the sense of seeing why it's important/interesting/insightful). I'm hoping that I'm just missing something, and you, or someone else, can show me what it is. Or perhaps just point to a specific paper that explains Kelly's insights most clearly?

4Johnicholas15y

The particles example is derived from one of Kelly's examples (marbles in a box). Kolmogorov complexity is unavailable - only approximations to it are available, and they're not unique. There are multiple reasonable grammars that we can use to do MDL, and no clear justification for why we should use one rather than another. For an extreme example, imagine measuring simplicity by "size of the patch to the Catholic Church's dogma." Kelly's notion that Occam means worst-case-efficient-in-mind-changes allows us to avoid the prior "dropping like manna from heaven". I recommend this paper: http://www.hss.cmu.edu/philosophy/kelly/papers/bonn5.pdf but if that one doesn't do it for you, there are others with more cartoons and fewer words: http://www.fitelson.org/few/few_05/kelly_2.pdf Or ones with more words and fewer cartoons: http://www.hss.cmu.edu/philosophy/kelly/papers/Ch4-Glymour%20&%20Kelly-final.pdf

7Wei Dai15y

In the first paper you cite, there is a particles example that is essentially the same example as yours, and Kelly does use it as his main example. The only difference is instead of counting the number of fundamental particles in physics, his example uses a device that we know will emit a finite number of particles. On page 33, Kelly writes about how his idea would handle a modification of the basic example: So suppose this device has been emitting one particle every 10 seconds for the last million seconds. According to Kelly's version of Ockham's Razor (perhaps we should call it Kelly's Razor instead?), we can't predict that the next particle will come 10 seconds later. What use is Kelly's idea, if I want to have a notion of complexity that can help us (or an AI) make decisions in general, instead of just playing some specific games for which it happens to apply?

3Johnicholas15y

You read the paper! Thanks for pointing out that we know somehow that only a finite number of particles will ever be found. To explain the "oneicle" problem: It seems like how a scenario is coded into a game matters. For example, if you viewed the timed particles game as having two possible worlds "The device will always emit a particle every 10 seconds." and "The device will sometimes emit a particle every 10 seconds.", then the first world cannot pretend to be the second world, but the second world can camouflage itself as the first world for a time, and so (Kelly's version of) Occam's razor says the first is simpler - we get the intuitively correct answer. The alternative coding is somewhat analogous to the color "grue" (which is green up until some date, and blue thereafter). You recode the problem to talk about "oneicles", a concept that refers to non-particles up to time 1, and particles thereafter. If you allow this sort of recoding, then you would also allow "twoticles", and the infinite hierarchy of symmetric re-codings causes a problem. I tend to think this is a technical problem that is unlikely to expand into the philosophy part of the theory, but I'm kind of an idiot, and I may be missing something - certainly we would like to avoid coding-dependence. That's a problem (the first problem Kelly mentioned in that paper), but do you really require a theory to have no problems remaining in order for it to be counted as insightful? No one else addresses the question "Where does the prior come from?".

4Wei Dai15y

It would be one thing if Kelly said that the theory currently can't predict that another particle will come in 10 seconds, but he hopes to eventually extend it so that it can make predictions like that. But instead he says that Ockham is mute on the question, and that's the right answer. Neither does Kelly. I don't see how we can go from his idea of Ockham to a Bayesian prior, or how to use it directly in decision making. Kelly's position above suggests that he doesn't consider this to be the problem that he's trying to solve. (And I don't see what is so interesting about the problem that he is trying to solve.)

1Johnicholas15y

Okay, I think we've reached a point of reflective disagreement. I agree with you that Kelly was wrong to be enamored of his formalization's output on the timed particles example; it's either a regrettable flaw that must be lived with, or a regrettable flaw that we should try to fix, and I don't understand enough of the topological math to tell which. However, the unjustified Occam prior in the standard Bayesian account of science is also a regrettable flaw - and Kelly has demonstrated that it's probably fixable. I find that very intriguing, and am willing to put some time into understanding Kelly's approach - even if it dissolves something that I previously cherished (such as MDL-based Occam priors). Reasonable people can reasonably disagree regarding which research avenues are likely to be valuable.

0[anonymous]13y

I am very late to the discussion. I have not read Kelly's papers in detail, so pardon me if my question betrays a fundamental misunderstanding of what you wrote: How can "(Kelly's version of) Occam's razor says the first [world] is simpler" and give us "the intuitively correct answer" if an infinite number of particles will be emitted in the first world, even though Kelly has already specified that the device will only emit a finite number of particles?

0Johnicholas13y

The statements, though contradictory, refer to two different thought experiments.

0Johnicholas13y

The two comments, though contradictory, refer to two different thought experiments.

0[anonymous]13y

I see. Thanks for the explanation.

[-]PhilGoetz15y50

Nice post! Couple quibbles/questions.

Ensemble learning methods are a challenge for "prevents overfitting" justifications of Occam's razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.

It's not a challenge unless you think that each of the weak classifiers should be powerful enough to do the classification correctly. Occam = "Do not multiply entities beyond need", not "Do not multiply entities".

If the scientist, confronted with a long r...

1Johnicholas15y

Can you elaborate your sentence that begins "It's not a challenge.."? My understanding is that if our real justification for "Why do we use Occam's razor?" was "Because that way we avoid overfitting." then if a future statistical technique that outperformed Occam by proposing weirdly complex hypotheses came along, we would embrace it wholeheartedly. Boosting and bagging are merely illustrative of the idea that there might be a statistical technique that achieves good performance in a statistical sense, though nobody believes that their outputs (large ensembles of rules) are "really out there", the way we might believe Schroedinger's equation is "really out there".

1DanielVarga15y

We only use boosting if our set of low-complexity hypotheses does not contain the solution we need. And instead of switching to a larger set of still-low-complexity hypotheses, we do something much cheaper, a second best thing: we try to find a good hypothesis in the convex hull of the original hypothesis space. In short, boosting "outperforms Occam" only in man-hours saved: boosting requires less thinking and less work than properly applying Occam's razor. That really is a good thing, of course.

[-]DSimon15y40

I second the OP's recommendation of Zendo. It could've been named "Induction: The Game", it's a great medium for practicing (and teaching!) some basic science and rationality skills, and it's also a heck of a lot of fun. Plus, you can use the neat little Zendo pyramids (also called Icehouse pieces) for various other games.

1adsenanim15y

I agree, I'm not sure if I should go with Treehouse or IceTowers... If I buy IceTowers I could probably play Treehouse, but not the other way round... :)

0DSimon15y

Hm, I'd never heard of Treehouse before, thanks for the link. I'll try that out next time I bring my Zendo set somewhere. You're right that trying to do Zendo with only a Treehouse set would be very frustrating. Even with a full 60 piece Zendo set, it doesn't take more than a few rounds before you have to start taking apart some of the older experiments to build new ones.

1Pavitra15y

You don't really need the pyramids, though they're definitely neat. You can do it with coins, dice, little glass beads, whatever. With words and phrases, it makes a car game; the game "Green Glass Door" can be seen as an instance. I call this game "playing Science".

[-]Daniel_Burfoot15y40

Ensemble learning methods are a challenge for "prevents overfitting" justifications of Occam's razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.

I'm not sure this is true. Overfitting means that a classifier achieves good empirical performance (on the known data), but bad generalization performance (on new, previously unseen data). The weak classifiers used in Adaboost et al. don't really suffer from overfitting; they just have bad empirical performance (thi...

2Johnicholas15y

According to the MDL / approximate-Kolmogorov-complexity notion of complexity, to measure the complexity of a hypothesis, fix a notation and count bits of the description of that hypothesis in that notation. It's certainly possible to find a notation where the scenario you describe has the small increment in complexity that you expect. But why pick that notation rather than another? How do you justify your priors?

2Daniel_Burfoot15y

Because you need the right notion of complexity in order to prevent overfitting. To prevent overfitting, you need to penalize highly expressive model classes. The MDL complexity measure does not perfectly capture the notion of expressivity, especially if you use a naive encoding method. An ensemble with many subclassifiers may require many bits to specify, but still not be very expressive, if the weights are small.

2Johnicholas15y

I think we're not trying to prevent overfitting. We're talking about the problem of induction. How do we know that a method that prevented overfitting in the past will continue to prevent overfitting in the future? Appealing to "induction has always worked before" is circular. I think it was Hume who asked: How do we know that bread will continue to nourish humans? It always has before, but in what sense are past observations logical grounds for anything? We would need some magical law of nature that says "Nature definitely won't just change all the rules." But of course, there is no such law, and Nature might change all the rules at any time. Suppose there are two kinds of worlds, one where bread always nourishes, and one where bread sometimes nourishes. The first variety is strictly simpler, because the second can camouflage itself as the first, matching the givens for an arbitrary amount of time, but the converse is not true. The Occam strategy of believing the simplest hypothesis consistent with the givens has this pleasant property: It is the best strategy as measured in worst-case mind changes against an adversarial Nature. Therefore, believe that bread will continue to nourish. I worry that I'm just muddying the water here.
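The "worst-case mind changes against an adversarial Nature" claim can be made concrete with a toy simulation of the particle-counting game. This is only a sketch under stated assumptions, not Kelly's own construction: the `occam` and `make_eager` strategies, the `patience` parameter, and the rule Nature follows (withhold the next particle until the scientist's guess matches the count seen so far) are all hypothetical choices made for illustration.

```python
from typing import Callable, List

# A strategy maps the history of observations (1 = particle seen at that step,
# 0 = nothing seen) to a guess for the total number of particles.
Strategy = Callable[[List[int]], int]


def occam(history: List[int]) -> int:
    """Guess exactly the number of particles seen so far."""
    return sum(history)


def make_eager(patience: int) -> Strategy:
    """Hypothetical non-Occam strategy: predict one extra particle until
    `patience` quiet steps in a row have passed, then fall back to the count."""
    def eager(history: List[int]) -> int:
        seen = sum(history)
        quiet = 0
        for observation in reversed(history):
            if observation:
                break
            quiet += 1
        return seen + 1 if quiet < patience else seen
    return eager


def adversarial_mind_changes(strategy: Strategy, total: int,
                             max_steps: int = 10_000) -> int:
    """Play against a Nature that emits the next particle only once the
    scientist's guess equals the number already emitted.
    Returns how many times the scientist's guess changed."""
    history: List[int] = []
    guesses: List[int] = []
    emitted = 0
    for _ in range(max_steps):
        guess = strategy(history)
        guesses.append(guess)
        if emitted == total and guess == total:
            break  # scientist has settled on the truth
        if guess == emitted:  # here emitted < total, so Nature springs a particle
            history.append(1)
            emitted += 1
        else:
            history.append(0)
    return sum(1 for a, b in zip(guesses, guesses[1:]) if a != b)
```

In this toy setup, with three particles, `occam` changes its mind three times while `make_eager(2)` is forced into seven changes, because each over-eager guess has to be retracted before Nature reveals the next particle.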

0PaulAlmond15y

I think Occam's razor is fairly easy to justify in general. I would say that the problem is resolved if we assume that our reference class is "every conceivable, formally describable universe" - and the description in "formally describable" doesn't just describe the universe's state at an instant: It describes the entire universe, and its history, as an object. We should assume that one of those objects corresponds to our world. Once we have that, we have some data from experiments, and we generate a partial model, which is an algorithm that accepts past data and predicts future data. The model needs to work for past data, and we want one that is as likely as possible to predict future data correctly. We are hoping that our partial model correctly describes the behavior of our universe - which is one of the ones in the reference class. The greater the information content in the partial model, the more specific it is being, and the smaller the proportion of possible universes in the reference class that will agree with it. The smaller the information content in the partial model, the more general it is being, and the greater the proportion of possible universes in the reference class that will agree with it and, therefore, the greater will be the chance that one of these universes happens to be the real one (or the one in which we are living if you subscribe to some form of modal realism - not that it matters whether you do or not). This should easily deal with the issue of why, when you see ordered behavior in reality, you should expect to see continued ordered behavior. It doesn't resolve these issues: 1. Why do we see any ordered behavior in the first place? None of this imposes any ordered behavior on reality. It simply says that if you see some you should expect more to follow. Any simplicity that you observe does not imply that reality is simple: it simply means that your partial model is relying on a simple feature of reality that happened to be there - a very

0Johnicholas15y

I think the phrase you used "the proportion of possible universes" isn't, in general, well defined without a measure (effectively a probability distribution) on that space - and there isn't a unique best probability distribution.

[-]Dan_Moore15y40

Kolmogorov complexity also depends on the description language used to create strings. If the language has a recursive feature, for example, this would assign lower complexity to members of the Fibonacci sequence.

So, part of the task of Science is to induce what Nature's description language is.

[-]Daniel_Burfoot15y100

Kolmogorov complexity also depends on the description language used to create strings.

The whole point of the KC is that it doesn't depend on the specific programming language (or Turing machine) selected except up to an additive constant. If your language contains a recursive feature, it will assign lower complexity to Fibonacci numbers than my non-recursive language does, but the difference is at most the codelength required for me to simulate your machine on my machine.
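For reference, the "up to an additive constant" property is usually written as the invariance theorem below. The symbols are standard textbook notation rather than anything from this thread: U and V are any two universal machines (or programming languages), K_U and K_V are the complexities measured with each, and c_{U,V} depends only on the pair of machines, not on the string x.

```latex
\[
  \lvert K_U(x) - K_V(x) \rvert \;\le\; c_{U,V}
  \qquad \text{for every finite string } x .
\]
```

The constant is roughly the length of an interpreter for one machine written for the other, which is the "codelength required for me to simulate your machine on my machine".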

3PaulAlmond15y

I wrote this some years back. It is probably not without its flaws (and got some criticism on Less Wrong), but the general idea may be relevant: http://www.paul-almond.com/WhatIsALowLevelLanguage.htm

5Sebastian_Hagen15y

AFAICT, the 'proposed solution' from WiaLLL doesn't work. The rest of this post will be devoted to proposing a specific attack on it. As such it will be longish - since WiaLLL makes a strong claim about a fundamental aspect of epistemology, it deserves a considered refutation. My attack works on the assumption that current sets count programs, not languages - i.e. two different programs that implement the same language will result in two entries in the next current set (regardless of whether they are written in the same language). Paul Almond has told me that this is what he intended in WiaLLL. Consider a family of turing-complete languages, whose members are called Xn (n ∈ N). I will use 'X-like' to refer to any language from this family. Let X := X8. Programs for any language Xn can be divided into three categories: 1. Programs that begin with n 0 bits. This prefix is solely used to encode that these programs are of type 1; it encodes no other data. The rest of the program source encodes a program in some suitable turing-complete language (the details of which don't matter for this attack). 2. Programs that have a length of 0 (i.e. empty strings). All such programs are defined to implement X. 3. Programs that don't fit either 1 or 2. All such programs are defined to implement X(n+1). It follows that the set of programs of any bounded length (>0) writable in any Xn (n>0) is dominated by interpreters for X-like languages with a greater n than their own. While all such languages are turing-complete, for any given program length the proportion of non-X-like-interpreting programs writable in them is small, and inversely exponential in n. What happens when we run the proposed solution from WiaLLL on this X? Assuming sufficiently high values of N, on each step but the first the current set will consist of a mix of X-likes and other languages. For purposes of calculating a lower bound on the long-term proportion of X-likes, assume that programs for non-X-likes

2AlephNeil15y

(Part 2) But hang on - I can see an easy way to improve on Paul's definition: He looks at all implementations of languages, of length less than or equal to N, takes a simple arithmetic mean and then lets N go to infinity. But how about we take a weighted arithmetic mean where the weight of a given implementation is 1/2^itslength (and then we normalise if necessary so that the weights sum to 1). (This does away with the need for N and taking a limit as N goes to infinity.) Then in your example above, the fact that there is always an implementation of X whose length is 0 means that a fixed proportion of the weight previously assigned to X-like languages 'bleeds away' on every iteration. I think we can use a "pathological" language to make X[0] particularly hard to implement (on average) rather than particularly easy, so it would still be false that "low-level index for Y" measured from X must be the same as measured from Y. But perhaps we can define the "low-level index" of a language Y as the infimum of all low-level indices as measured from languages X? (Needless to say, the trick of weighting the average by 1/2^length is very standard.)

2Sebastian_Hagen15y

Let's see if I understand this correctly. You're redefining current sets to a mapping from the set of all computable and turing-complete languages to their weights in [0,1]. Except for the first step of the process, you will actually have non-zero weights on each of those languages, so we're working with true countably infinite sets here. This might require a stronger hypercomputer to implement than WiaLLL, but I suppose that's fine for the purely theoretical purposes of these constructions. Weighting each program by 1/2^i probably doesn't work, since that still results in a total measure of 1 for each set of all programs with a given length n ∈ N (since there are 2^n programs in each of those sets), but that's a minor detail - weighting by e.g. 1/4^i should get you a workable result. I think this is still attackable in so far as that it's possible to produce a setup that will make X win over any non-X-like. It would lose to certain other X-likes in each round, but this does not hold in the (I → ∞) limit. It's probably not necessary for me to write up the modified attack in full detail; the basic idea is to redefine the categories of Xn programs as follows: 1. Programs that begin with 2n 0 bits. This prefix is solely used to encode that these programs are of type 1; it encodes no other data. The rest of the program source encodes a program in some suitable turing-complete language (the details of which don't matter for this attack). 2. Programs that begin with j ∈ [n,2n) 0 bits. All such programs are defined to implement X. 3. Programs that don't fit either 1 or 2. All such programs are defined to implement X(n+1). So the support of Xn for X approaches 0 in the (n → ∞) limit, which allows the X-like family to keep a large weight through arbitrarily many iterations, and even in the limit of (I → ∞). However, it approaches 0 much slower than their support for any non-X-like, which means that X still wins over every non-X-like at each step of the simplicity c
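The normalisation point made in this comment (weighting each program by 1/2^i versus 1/4^i) is easy to check numerically. The sketch below assumes programs are arbitrary bitstrings, so that there are 2^n of them at length n; the function names are hypothetical.

```python
from fractions import Fraction


def total_weight_at_length(n: int, base: int) -> Fraction:
    """Combined weight of all 2**n bitstrings of length n when each
    individual string is given weight base**(-n)."""
    return Fraction(2 ** n, base ** n)


def partial_sum(max_len: int, base: int) -> Fraction:
    """Total weight of all bitstrings of length 0..max_len."""
    return sum(total_weight_at_length(n, base) for n in range(max_len + 1))
```

With `base=2` every length contributes exactly 1, so the partial sums grow without bound and cannot be normalised; with `base=4` length n contributes 2^-n, and the partial sums stay below 2.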

2AlephNeil15y

So essentially, the problem with Paul Almond's suggestion is that one can find languages X that are 'X[0]-pathological' in the sense that almost all of their children are both (i) good at implementing some fixed language X[0] and (ii) even more X[0]-pathological than they are (this being a recursive definition of 'pathological'). Such languages destroy any hope that Paul's "low-level index for Y" won't depend on which language X we start from.

2Johnicholas15y

Kolmogorov complexity isn't an option, and approximations of Kolmogorov complexity don't have that uniqueness property. Even if we did have that uniqueness property, it's still not all that helpful if you're trying to decide among a finite number of finite-complexity hypotheses. Choice of description language is still sufficient to completely determine your choice of which hypothesis is simplest. As far as I can tell, an MDL-ish complexity measurement that is "unique up to a constant" acts like a prior - in the limit of infinite evidence, the initial bias will wash out. Kelly is fond of analogizing this "justification" of Occam to a flat tire - it helps you get home because you can eventually fix it.
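A toy illustration of how the choice of description language can decide which of two finite hypotheses counts as simplest. Everything here is made up for the example: the two notations, their per-token bit costs, and the two hypotheses are hypothetical, chosen to echo the "recursive feature" example earlier in the thread.

```python
from typing import Dict, List

# Two hypothetical notations: each assigns a cost in bits to primitive tokens.
# They differ only in how expensive recursion is to express.
NOTATION_A: Dict[str, int] = {"const": 2, "add": 3, "recurse": 12}
NOTATION_B: Dict[str, int] = {"const": 2, "add": 3, "recurse": 3}

# Two hypothetical hypotheses about the same short sequence of numbers.
HYPOTHESES: Dict[str, List[str]] = {
    "fibonacci_rule": ["recurse", "add", "const", "const"],
    "lookup_table": ["const"] * 8,
}


def description_length(tokens: List[str], notation: Dict[str, int]) -> int:
    """Bits needed to write the hypothesis down in the given notation."""
    return sum(notation[token] for token in tokens)


def simplest(hypotheses: Dict[str, List[str]], notation: Dict[str, int]) -> str:
    """Name of the hypothesis with the shortest description."""
    return min(hypotheses,
               key=lambda name: description_length(hypotheses[name], notation))
```

Under `NOTATION_A` the lookup table wins (16 bits against 19); under `NOTATION_B` the recursive rule wins (10 bits against 16). The additive-constant guarantee does not prevent this kind of flip for any fixed, finite pair of hypotheses.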

1Daniel_Burfoot15y

Right, but we can find a series of increasingly tight upper bounds to the KC, and the bounds are language-independent in the same sense that the KC itself is. Then for any sufficiently large dataset, we can say that the probability of the data is given by the model that produces the tightest upper bound for the KC. This is nicely rigorous - if you think my model is unjustified and doesn't correspond to the "real" probabilities, all you have to do is produce a new model that achieves a tighter KC-bound.
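One rough way to picture "increasingly tight upper bounds" is to treat off-the-shelf compressors as competing models and keep whichever gives the shortest output. This is a sketch, not the procedure the comment has in mind: general-purpose compressors are a swapped-in stand-in for "models", and a compressed length only bounds the complexity up to the constant size of the matching decompressor.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, Tuple

# Each compressor plays the role of one candidate model of the data.
COMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def tightest_bound(data: bytes) -> Tuple[str, int, Dict[str, int]]:
    """Return (best compressor name, its output size in bytes, all sizes).
    The smallest size is the tightest upper bound among these candidates."""
    sizes = {name: len(compress(data)) for name, compress in COMPRESSORS.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes[best], sizes
```

Anyone who thinks the current winner is a poor model can add another entry to `COMPRESSORS`; it displaces the winner only if it produces a shorter encoding of the same data.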

0Johnicholas15y

Can you help me with the question: "How do you decide which of a finite number of hypotheses, each of which is of finite complexity, is simplest, using approximations of Kolmogorov complexity?" Also, my understanding is that people normally approximate Kolmogorov complexity by using non-Turing-complete models of computation, such as the finite-state automata that Hutter and his co-authors recently used to play Pac-Man: http://il.youtube.com/watch?v=RhQTWidQQ8U If the model of computation is not Turing complete, then complexity measurements derived from it do not have the "uniqueness" property that we're talking about.

4cousin_it15y

Thanks for that link, I wasn't aware of Monte Carlo AIXI. It's amusing how little it took to make Nesov and Roko frightened.

3XiXiDu15y

An interposed question: Can you or someone point me to an introduction or starting point regarding the overall topic related to Kolmogorov complexity (so I am able to follow, not contribute)? Or is an explanation already contained in the sequences? This idea seems to be discussed quite often on LW. Thank you.

8cousin_it15y

I don't think it's been thoroughly explained in the sequences, but why not start with the Wikipedia page? Or this post. On second thought, this is actually a very simple idea that can be explained in a comment. Let me try: The Kolmogorov complexity of a finite string of bytes is the size (in bytes) of the shortest Python program that outputs that string and then halts. Or you could use any other language in place of Python and get a different value of complexity - but it cannot differ by more than a constant as the string grows, because you could always implement a constant-size Python interpreter in the other language to "convert" one description into another. So complexity is not unique, but all definitions of complexity sorta "agree" as we go to infinity. For example, if a string consists of "a" repeated a million times, its Kolmogorov complexity is quite small because you can write a very short program that outputs a million a's and then halts. But this is not the case for all possible strings. Here's a simple way to see why: there can't be more Python programs of length N than there are different byte strings of length N, because any program is itself a byte string. Therefore, you cannot encode all strings of length N using programs of length N/2 or something, because there's not enough of them. Actually this shows that most strings of length N must have complexity close to N. (Obviously it can't be much larger than N, because any string of length N can be output by a program of length N+c that "hardcodes" that specific string.) Another important observation. Imagine you have a huge, random-looking string of bytes. How can you calculate its Kolmogorov complexity? Turns out that you can't, because there's no easier way to do that than, essentially, simulating all programs up to length N and seeing what they output. But this is impossible because any given program can (say) hang idly for a million years and only then start outputting symbols, and you have no
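Two of the points in this explanation can be checked directly in Python: that a million a's has a very short description, and the counting argument for why most strings do not. This is a minimal sketch; the helper names are hypothetical, and the byte alphabet of size 256 is an assumption matching the "size in bytes" convention used above.

```python
import contextlib
import io
from fractions import Fraction

# A short program whose output is one million a's.
SHORT_PROGRAM = 'print("a" * 1000000, end="")'


def run_and_capture(source: str) -> str:
    """Execute a Python program given as text and return what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


def max_fraction_compressible(n: int, k: int, alphabet: int = 256) -> Fraction:
    """Upper bound on the fraction of length-n strings that can have a
    description more than k symbols shorter than themselves: there are
    simply not enough shorter programs to go around."""
    shorter_programs = sum(alphabet ** i for i in range(n - k))  # lengths 0..n-k-1
    return Fraction(shorter_programs, alphabet ** n)
```

`SHORT_PROGRAM` is a few dozen characters long yet describes a megabyte of output, while `max_fraction_compressible(n, k)` is always below 256^-k, so fewer than one string in 65,536 can be shortened by more than two bytes.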

0Daniel_Burfoot15y

Imagine some Oracle told you that obtaining the ultimate theory of physics - the one that unifies relativity and quantum gravity, and all that - was impossible. Then following your logic you might conclude that it was pointless to study physics, since one can never obtain the perfect theory. But that would be wrong; it would still be highly worthwhile to attempt to obtain better and better physical theories.

0XiXiDu15y

Thanks, great! I think this might be another introductory explanation.

-2XiXiDu15y

Here is more: Information vs. Meaning

0wedrifid15y

Which notion do you prefer?

2cousin_it15y

Haven't thought too hard about this question. In my mind complexity is filed away as a "mystery", on the same shelf as frequentism vs Bayesianism, decision theory and other things. I know the state of the art and am convinced that it's unsatisfactory, but how to fix it is unclear. You could carve out a nice research area for yourself if you took any of these questions and ran with it :-)

2[anonymous]15y

I have a vague idea that one shouldn't look for a single correct notion of complexity, but instead try to find some reasonable properties that any measure should have, and study them all at once. For instance, if I propose a measure that turns out to be equivalent to square-root of K-complexity, who's to say it's better or worse? More seriously, one could imagine complexity measures like "the time it would take to explain in English", "the time it would take to explain in Japanese", "the time it would take a smart person to explain", "the time it would take a stupid person to explain"... But when I try to think about "properties a measure should have" all I can come up with is a kind of monotonicity: a complexity measure is a real-valued function on strings whose value on a given string is larger than the value on an (initial?) substring. That is not even true of K-complexity. (E.g. if N is an integer with high complexity, but less than 10 to the one-hundred, then a string of N repeated zeroes will have higher complexity than a string of 10 to the one-hundred zeroes.)

0wedrifid15y

So many topics, so little time! :)

6cousin_it15y

It's amazing how much a person can do if some topic manages to interest them more than the Internet, even for a little while.

2wedrifid15y

A timely reminder. I'd better go back to obsessing about nootropics and see if I cannot amaze myself somewhat. Damn Lesswrong and its "Recent Comments:" and its orange envelope icon.

2cousin_it15y

When you reach a result, be sure to post it and make the Internet a little more alluring for all of us :-)

1wedrifid15y

This suggests a new focus - drugs to treat internet addiction and promote procrastination resistance. The experimental procedure seems obvious. The double-blind, placebo controlled T4ET (Tv Tropes - Time To Exit Test).

0wedrifid15y

This concept sounds like one that could be a useful addition to your 'could', 'probability' mini-series. Putting things into 'python' language, as you have done, makes the concepts much simpler to apply.

2cousin_it15y

That mini-series is about stuff that I make up myself. K-complexity is a well known notion, maybe you could ask JoshuaZ - he was planning to write some introductory math posts.

0wedrifid15y

Ok. I actually considered the 'could' post a well known notion just expressed in an intuitive format. Not a bad idea. Failing that I could always write one myself. But for now I'll just hope your comment above comes up in the custom search when I'm looking for a link target. You touched on most of the important details (albeit briefly) so that was about 2/3 post ready as is.

1Johnicholas15y

Sorry, I thought it was old news in LW circles. Here's the actual paper: http://jveness.info/publications/veness_rl_via_aixi_approx.pdf

-1Vladimir_Nesov15y

It's not the specific result that's frightening, but the overall research direction, where each result contributes exclusively to destroying the world. We've got a small tiny step closer to disaster, job well done.

0Dan_Moore15y

I see. Thanks. Question: How many bits does adding a recursive feature take up?

1cousin_it15y

I'm pretty sure Nature doesn't have one.

4JohannesDahlstrom15y

Obligatory XKCD

3Dan_Moore15y

In the game described by the OP, Nature must invent a law, which must be expressed in a language. So, in the game, there must be a description language. As to whether (real life) Nature can be said to have a description language: I assess the validity of this concept in terms of its success or failure in explaining phenomena. I wouldn't be 'pretty sure' of any hypothesis unless it were tested.

0adsenanim15y

So, in a true language the results of saying something could be equal to actions? I'm not sure how to phrase this... I think say and do: Heat + Air + Fuel and I get Fire? Hmm...

[-]ChrisHibbert15y30

The game of Science vs. Nature is more complicated than that, and it's the interesting structure that allows scientists to make predictions that are better than "what we've seen so far is everything there is." In particular, the interesting thing in both Chemistry and particle Physics is that scientists were able to find regularities in the data (the Periodic Table is one example) that led them to predict missing particles. Once they knew what properties to look for, they usually were able to find them. When a theory predicts particles that a...

[-]JohnDavidBustard15y20

What I love about this post is that, at heart, it challenges the very ability to be rational; it hints at the possibility that what is most effective (and how we ultimately function) is as statistical learning machines whose understanding and predictive ability lie in structures very different from the ones we use to communicate. Our language and formal argument structures define a space of understanding that is more a reflection of our communication technology than the true nature of things. In this way, Occam's razor and Kolmogorov complexity reflect a mo...

[-]Will_Sawin15y20

Successful analysis of uncertainty without using a prior implies some kind of hidden Bayes-structure.

Consider hypothesis 1 and hypothesis 2. Hypothesis 2 can pretend to be hypothesis 1 but not vice versa. What does this mean?

p(D1|H2)>0

p(D2|H1)=0

Observations about D2 are clearly irrelevant to our present inference problem; instead, what matters is this:

p(D1|H1)=1

p(D1|H2)<1

p(D1|H1)>p(D1|H2)

that is, D1 causes a Bayesian agent to update in favor of hypothesis 1, because hypothesis 1 fits the evidence better.

(this is only exactly true if D1 and D2 are...
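To make the update above concrete, here is a minimal sketch in Python. It assumes, beyond what the comment states, that H1 and H2 are mutually exclusive and exhaustive, and the function name `posterior_h1` is made up for illustration. It applies Bayes' rule with p(D1|H1) = 1 and p(D1|H2) = q.

```python
from fractions import Fraction


def posterior_h1(prior_h1, p_d1_given_h2):
    """Posterior probability of H1 after observing D1.

    Assumptions (not from the thread): H1 and H2 are the only two
    hypotheses, and p(D1|H1) = 1 as in the comment above.
    """
    prior_h1 = Fraction(prior_h1)
    q = Fraction(p_d1_given_h2)
    if not (0 < prior_h1 < 1 and 0 <= q <= 1):
        raise ValueError("need 0 < prior < 1 and 0 <= p(D1|H2) <= 1")
    prior_h2 = 1 - prior_h1
    # Bayes' rule: p(H1|D1) = p(D1|H1) p(H1) / sum over hypotheses.
    return prior_h1 / (prior_h1 + prior_h2 * q)


if __name__ == "__main__":
    # With even prior odds and p(D1|H2) = 1/2, D1 favors H1.
    print(posterior_h1(Fraction(1, 2), Fraction(1, 2)))
```

Whenever q < 1 the posterior for H1 exceeds its prior, and when q = 1 (H2 always imitates H1) the data leave the prior unchanged.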

0Johnicholas15y

If I understand you correctly, you're saying that reasoning "non-probabilistically" about possibilities and impossibilities can be viewed as a subset of Bayesian reasoning, where instead of numerical probabilities, one uses qualitative probabilities - zero corresponds to impossible, nonzero corresponds to possible and so on. I agree with you, that you can view it that way; but probably there are ways to view any sort of reasoning as a "subset" of some other domain of reasoning. The best analogy I have is with classical and intuitionistic logic. The implication/negation fragment of intuitionistic propositional logic can be viewed as a subset of classical logic, in that the theorems provable in intuitionistic logic are a strict subset of classical logic theorems (if you map classical implication to intuitionistic implication, and classical negation to intuitionistic negation). However, there's a slightly trickier mapping (http://en.wikipedia.org/wiki/G%C3%B6del%E2%80%93Gentzen_negative_translation) which introduces a lot of double negations, and if you use this mapping, classical propositional logic fits entirely inside of intuitionistic logic, and intuitionistic logic can be viewed as introducing a new "constructive" implication while keeping all of the old truths.

0Will_Sawin15y

What I'm saying is that effective techniques for reasoning non-probabilistically derive their effectiveness from their similarity to Bayesian reasoning. In some special cases, they are identical to Bayesian reasoning, but in other cases, Bayes is superior. This is because Bayes is proved, by theorem, to be Right. Not Unbiased or Robust, but Right. http://lesswrong.com/lw/mt/beautiful_probability/ seems like the appropriate lesswrong post. So I'm not saying it's a subset, I'm saying it's similar to. One way to be similar to something is to copy some but not all of it, which is taking a subset plus adding other stuff. This is almost the same concept, obviously. Using the two logics I know best: Classical propositional logic is Bayesian logic for probabilities of 1 and 0. All results of Bayesian logic can be derived by propositional logic and certain additional assumptions. But these are qualitatively different. I don't know how to describe this difference - which one is fundamental? I think both are, in different ways. What's missing is how you use it in real-world problems, I think. Because it is often the case that one formal system is contained in another, and the other is contained in the one. But formal systems are only useful if they model reality, or serve some other function. Classical logic is used more than intuitionistic logic - is this because it's more useful? (Classical logic is more useful than Bayesian in the world of mathematics, but less in the real world, I think.) I need to think more about this...

[-]JoshuaZ15y10

Is your additive example supposed to be the minimum number of 1s needed to represent n as a product or sum of 1s? It looks like that but if so, 6 should be ((1+1)(1+1+1)) and 11 should be (1+(1+1)(1+1+1+1+1)). 12 is still an "inversion."

ETA A few other numbers seem to be off as a result of the mistake with 6. 30 and 42 also have more compact representations.

6Johnicholas15y

No, it's one of the expressions with the minimum number of characters needed to represent the number, using this grammar:
EXP ::= ( EXP + EXP ) | ( EXP * EXP ) | ( NUM )
NUM ::= 1 | NUM + 1
As you can see, reasonable people can reasonably disagree on what is an appropriate grammar to use for minimum description length, even on arbitrary examples like this one.
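For anyone who wants to check particular numbers against that grammar, here is a rough Python sketch. It assumes that spaces are not counted as characters and that ties may be broken arbitrarily; the function name `shortest` is made up for illustration and is not from the original post.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def shortest(n: int) -> str:
    """One minimum-length expression for n under the grammar above.

    Assumption: length is counted in characters with no spaces, so the
    plain NUM form "(1+1+...+1)" costs 2n + 1 characters.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Candidate 1: the ( NUM ) form, n ones joined by pluses.
    best = "(" + "+".join("1" * n) + ")"
    # Candidate 2: ( EXP + EXP ) over every split a + (n - a).
    for a in range(1, n // 2 + 1):
        cand = "(" + shortest(a) + "+" + shortest(n - a) + ")"
        if len(cand) < len(best):
            best = cand
    # Candidate 3: ( EXP * EXP ) over every factorization a * (n / a).
    for a in range(2, int(n ** 0.5) + 1):
        if n % a == 0:
            cand = "(" + shortest(a) + "*" + shortest(n // a) + ")"
            if len(cand) < len(best):
                best = cand
    return best


if __name__ == "__main__":
    for n in range(1, 13):
        expr = shortest(n)
        print(n, len(expr), expr)
```

Under these assumptions 6 stays in its additive form at 13 characters, while 12 gets a product form of 19 characters, shorter than the 23 characters needed for 11, which is the "inversion" mentioned above.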

0adsenanim15y

An observation about the number sequence:
...
3 == (1+1+1)
4 == (1+1+1+1)
5 == (1+1+1+1+1)
6 == (1+1+1+1+1+1)
7 == (1+1+1+1+1+1+1)
...
I would rewrite this as:
...
3 == (1+1+1)
4 == ((1+1)*(1+1))
5 == (1+1+1+1+1)
6 == ((1+1)*(1+1+1))
7 == (1+1+1+1+1+1+1)
...
where the first inversion is at 4-5. The reason I do this is because of the relation to primes and to make the pattern seem more consistent. Then of course the next step would be exponent notation:
...
8 == ((1+1)^(1+1+1))
...
So the idea of choosing the correct method of patterning can change dramatically with small changes in rules, such that we can have:
8 == (1+1+1+1+1+1+1+1)
8 == ((1+1)*(1+1+1+1))
8 == ((1+1)^(1+1+1))
And the places of inversion will change according to method. A question in an evolutionary context: Say the operators are symbols of evolutionary complexity, increasing from addition to multiplication to exponent; wouldn't their use be situational? Given specific circumstances, could one "8" be more valid than another "8", even though the understanding is that they amount to the same thing? Is the pattern one method makes more valid than the pattern of another? I can't help but think that the method is dependent on the amount of space in which we have to create the pattern: the addition method is OK for a large posting width and small numbers, but as the numbers increase and/or the posting width decreases, method becomes more important, maybe even requiring the creation (evolution) of more operators for more methods ...

3thomblake15y

A little off-topic, but you can greatly shorten your descriptions above. In the case where you're just adding 1's, you could just as well leave off the pluses or the 1's. So 8 == 11111111. And you don't need the parens, since the operations you're actually performing always give higher precedence to addition than to multiplication. So then 8 == 11*11*11 == 11^111. And since you're never multiplying by 1 or multiplying/adding zero, you can use pluses instead of 1's. So: 1 == (empty), 2 == +, 3 == ++, 4 == +*+, 5 == ++++, 6 == +*++, 7 == ++++++, 8 == +^++, 9 == ++^+, 10 == +*++++, and so on.

1adsenanim15y

I think removing the parentheses can affect the overall pattern formation. 12 == (((1+1)^(1+1))*(1+1+1)), but 12 != +^+*++, because we are changing the order of operations: 1+1^1+1*1+1+1 becomes 2^2*3, then 2^6, then 64. Also, I don't think there is a way to use exponents and make 12 without using parentheses. So if we simplify the method of notation, then some operations cannot be used for certain numbers, and this changes the pattern and/or the availability of possible patterns. This topic is discussed in D. Hofstadter's book Gödel, Escher, Bach: An Eternal Golden Braid.

0thomblake15y

Indeed, I hadn't thought far enough ahead to worry about cases where order of operations would matter between exponentiation and multiplication. So parens would have to be acceptable where it's ambiguous. What the default order of operations should be can be left as an exercise for the reader. As for your specific concern, if we left the order of operations as "+ then ^ then *", +^+*++ would work fine.
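The disagreement in this exchange can be checked mechanically. The following is only a sketch: the function name `evaluate` and its `loosest_first` parameter are invented for illustration, a run of k pluses is read as the number k + 1, and exponentiation is assumed to be right-associative.

```python
import math


def evaluate(expr: str, loosest_first: str = "*^") -> int:
    """Evaluate the abbreviated plus-notation without parentheses.

    loosest_first lists the binary operators from loosest-binding to
    tightest-binding. Addition (a run of pluses) always binds tightest.
    """
    def ev(s: str, ops: str) -> int:
        if not ops:
            if s.strip("+"):
                raise ValueError(f"unexpected characters in {s!r}")
            return len(s) + 1  # k pluses stand for k + 1
        op, rest = ops[0], ops[1:]
        values = [ev(part, rest) for part in s.split(op)]
        if op == "*":
            return math.prod(values)
        # Exponentiation, folded from the right (assumed right-associative).
        result = values[-1]
        for base in reversed(values[:-1]):
            result = base ** result
        return result

    return ev(expr, loosest_first)


if __name__ == "__main__":
    print(evaluate("+^+*++", "*^"))  # "+ then ^ then *"
    print(evaluate("+^+*++", "^*"))  # "+ then * then ^"
```

With the order "+ then ^ then *" the string +^+*++ evaluates to 12, and with "+ then * then ^" it evaluates to 64, so both readings in the thread are consistent with their own precedence rule.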

0DSimon15y

We could keep the abbreviated syntax but avoid the necessity of parentheses by using post-op notation, couldn't we?

0thomblake15y

For that, I think we'd need a stack with a line-termination character. In effect, we'd be removing two characters for one, which I suppose would be an improvement.
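As a rough illustration of the post-op idea, here is a small stack-based evaluator in Python. The separator character (a comma) and the function name `eval_postfix` are assumptions made for this sketch; the thread does not settle on any particular syntax.

```python
def eval_postfix(expr: str, sep: str = ",") -> int:
    """Evaluate plus-notation written in postfix (operands first).

    A run of k pluses is the operand k + 1. Two adjacent operands must be
    split by `sep` (a hypothetical choice); '*' and '^' pop two operands.
    The operand 1 (an empty run) is not representable in this sketch.
    """
    stack = []
    run = None  # number of pluses in the operand being read, if any
    for ch in expr:
        if ch == "+":
            run = (run or 0) + 1
            continue
        if run is not None:
            stack.append(run + 1)  # finish the pending operand
            run = None
        if ch in "*^":
            if len(stack) < 2:
                raise ValueError("operator is missing an operand")
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right if ch == "*" else left ** right)
        elif ch != sep:
            raise ValueError(f"unexpected character {ch!r}")
    if run is not None:
        stack.append(run + 1)
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


if __name__ == "__main__":
    print(eval_postfix("+,+^++*"))  # (2 ^ 2) * 3
```

Here 12 takes seven characters with no parentheses, at the cost of one separator between the two adjacent operands.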

0adsenanim15y

So:
* The order of operations can be variable dependent on the number
* The notation method affects pattern
* The pattern can change if primes are always used
* Etc.
How many ways can we change the rules? How many rules are there? It is amazing just how much variation in pattern can be achieved just by changing and/or adding/subtracting rules. There is the general rule of rules: the fewer the rules, the less variable the pattern, and the inverse, the more rules, the more variable the pattern.*
* Sorry for the late edit, but I would also add that the fewer the rules, the less adaptable as well...

[-]Unknowns15y10

If there are a very large number of fundamental particles, it is likely that there is some reason for the number, rather than being a merely random number, and if so, then you would likely reach the truth quicker by using some version of Kolmogorov complexity, rather than counting the particles one by one as you discover them.

[-]cousin_it15y100

This is circular. You assume that the universe is likely to have low K-complexity and conclude that a K-complexity razor works well. Kelly's work requires no such assumption; this is why I think it's valuable.

0Unknowns15y

Yes, "It is likely that there is some reason for the number" implies a low Kolmogorov complexity. But it seems to me that if you look at past cases where we have already discovered the truth, you may find that there were cases where there were indeed reasons, and therefore low K-complexity. If so that would give you an inductive reason to suspect that this will continue to hold; this argument would not be circular. In other words, as I've said previously, in part we learn by induction which type of razor is suitable.

0Oscar_Cunningham15y

But then you've got to justify induction, which is as hard as justifying Occam.

0Unknowns15y

I have a justification for induction too. I may post it at some point.

0khafra15y

Is this a "margin is too small to contain" type of thing? Because I would be very surprised if there were a philosophically airtight location where recursive justification hits bottom.

0Oscar_Cunningham15y

Cool.

0NancyLebovitz15y

Could you expand on this? The only reason I can think of is that the particles have various qualities, and we've got all the possible combinations. I assume that there is some range of numbers which suggest an underlying pattern-- it's vanishingly unlikely that there's a significance to the number of stars in the galaxy. I think there was something in Gregory Bateson about this-- that there's a difference between a number that's part of a system (he was talking about biology, not physics) as distinct from "many".

0[anonymous]15y

So you claim that a K-complexity prior will work better because our universe is likely to have low K-complexity. This is circular - a justification of Occam's razor that takes for granted your own intuitive concept of Occam's razor. Kelly's work makes no such assumption; that's why it looks valuable to me.
