Summary: This post provides a brief discussion of the traditional scientific method, and mentions some areas where the method cannot be directly applied. Then, through a series of thought experiments, a set of minor modifications to the traditional method are presented. The result is a refined version of the method, based on data compression.
Related to: Changing the Definition of Science, Einstein's Arrogance, The Dilemma: Science or Bayes?
ETA: For those who are familiar with notions such as Kolmogorov Complexity and MML, this piece may have a low ratio of novelty:words. The basic point is that one can compare scientific theories by instantiating them as compression programs, using them to compress a benchmark database of measurements related to a phenomenon of interest, and comparing the resulting codelengths (taking into account the length of the compressor itself).
Notes on Traditional Method
This post proposes a refined version of the scientific method which, it will be argued later, is more directly applicable to the problems of interest in artificial intelligence. Before doing so, it is worth briefly examining the traditional method and the circumstances in which it can be applied. The scientific method is not an exact procedure, but a qualitative statement of it goes roughly as follows:
- Observe a natural phenomenon.
- Develop a theory of that phenomenon.
- Use the theory to make a prediction.
- Test the prediction experimentally.
A full discussion of the philosophical significance of the scientific method is beyond the scope of this post, but some brief remarks are in order. The power of the scientific method is in the way it links theory with experimental observation; either one of these alone is worthless. The long checkered intellectual history of humanity clearly shows how rapidly pure theoretical speculation goes astray when it is not tightly constrained by an external guiding force. Pure experimental investigation, in contrast, is of limited value because of the vast number of possible configurations of objects. To make predictions solely on the basis of experimental data, it would be necessary to exhaustively test each configuration.
As articulated in the above list, the goal of the method appears to be the verification of a single theory. This is a bit misleading; in reality the goal of the method is to facilitate selection between a potentially large number of candidate theories. Given two competing theories of a particular phenomenon, the researcher identifies some experimental configuration where the theories make incompatible predictions and then performs the experiment using the indicated configuration. The theory whose predictions fail to match the experimental prediction is discarded in favor of its rival. But even this view of science as a process of weeding out imperfect theories in order to find the perfect one is somewhat inaccurate. Most physicists will admit or disclaim that even their most refined theories are mere approximations, though they are spectacularly accurate approximations. The scientific method can therefore be understood as a technique for using empirical observations to find the best predictive approximation from a large pool of candidates.
A core component of the traditional scientific method is the use of controlled experiments. To control an experiment means essentially to simplify it. To determine the effect of a certain factor, one sets up two experimental configurations which are exactly the same except for the presence or absence of the factor. If the experimental outcomes are different, then it can be inferred that this disparity is due to the special factor.
In some fields of scientific inquiry, however, it is impossible or meaningless to conduct controlled experiments. No two people are identical in all respects, so clinical trials for new drugs, in which the human subject is part of the experimental configuration, can never be truly controlled. The best that medical researchers can do is to attempt to ensure that the experimental factor does not systematically correlate with other factors that may affect the outcome. This is done by selecting at random which patients will receive the new treatment. This method is obviously limited, however, and these limitations lead to deep problems in the medical literature. It is similarly difficult to apply the traditional scientific method to answer questions arising in the field of macroeconomics. No political leader would ever agree to a proposal in which her country's economy was to be used as an experimental test subject. In lieu of controlled experiments, economists attempt to test their theories based on the outcomes of so-called historical experiments, where two originally similar countries implemented different economic policies.
A similar breakdown of the traditional method occurs in computer vision (recall that my hypothesis asserts that perception and prediction are the major components of intelligence, implying that the study of vision is central to the study of AI). Controlled vision experiments can be conducted, but are of very little interest. The physical laws of reflection and optics that govern the image formation process are well understood already. Clearly if the same camera is used to photograph an identical scene twice under constant lighting conditions, the obtained images will be identical or very nearly so. And a deterministic computer vision algorithm will always produce the same result when applied to two identical images. It is not clear, therefore, how to use the traditional method to approach the problems of interest in computer vision, which include tasks like image segmentation and edge detection.
(The field of computer vision will be discussed in later posts. For now, the important thing to realize is that there are deep, deep problems in evaluating computer vision techniques. Given two image segmentation algorithms, how do you decide which one is better? The field has no compelling answer. The lack of empirical rigor in computer vision has been lamented in papers with titles like "Computer Vision Theory: the Lack Thereof" and "Ignorance, Myopia, and Naivete in Computer Vision Systems".)
The modifications to the scientific method are presented through a series of thought experiments related to a fictional character named Sophie.
Episode I: The Shaman
Sophie is a assistant professor of physics at a large American state university. She finds this job vexing for several reasons, one of which is that she has been chosen by the department to teach a physics class intended for students majoring in the humanities, for whom it serves to fill a breadth requirement. The students in this class, who major in subjects like literature, religious studies, and philosophy, tend to be intelligent but also querulous and somewhat disdainful of the "merely technical" intellectual achievements of physics.
In the current semester she has become aware of the presence in her class of a discalced student with a large beard and often bloodshot eyes. This student is surrounded by an entourage of similarly odd-looking followers. Sophie is on good terms with some of the more serious students in the class, and in conversation with them has found out that the odd student is attempting to start a new naturalistic religious movement and refers to himself as a "shaman".
One day while delivering a simple lecture on Newtonian mechanics, she is surprised when the shaman raises his hand and claims that physics is a propagandistic hoax designed by the elites as a way to control the population. Sophie blinks several times, and then responds that physics can't be a hoax because it makes real-world predictions that can be verified by independent observers. The shaman counters by claiming that the so-called "predictions" made by physics are in fact trivialities, and that he can obtain better forecasts by communing with the spirit world. He then proceeds to challenge Sophie to a predictive duel, in which the two of them will make forecasts regarding the outcome of a simple experiment, the winner being decided based on the accuracy of the forecasts. Sophie is taken aback by this but, hoping that by proving the shaman wrong she can break the spell he has cast on some of the other students, agrees to the challenge.
During the next class, Sophie sets up the following experiment. She uses a spring mechanism to launch a ball into the air at an angle A. The launch mechanism allows her to set the initial velocity of the ball to a value of Vi. She chooses as a predictive test the problem of predicting the time Tf that the ball will fall back to the ground after being launched at Ti=0. Using a trivial Newtonian calculation she concludes that Tf = 2 Vi sin(A)/g, sets Vi and A to give a value of Tf=2 seconds, and announces her prediction to the class. She then asks the shaman for his prediction. The shaman declares that he must consult with the wind spirits, and then spends a couple of minutes chanting and muttering. Then, dramatically flaring open his eyes as if to signify a moment of revelation, he grabs a piece of paper, writes his prediction on it, and then hands it to another student. Sophie suspects some kind of trick, but is too exasperated to investigate and so launches the ball into the air. The ball is equipped with an electronic timer that starts and stops when an impact is detected, and so the number registered in the timer is just the time of flight Tf. A student picks up the ball and reports that the result is Tf = 2.134. The shaman gives a gleeful laugh, and the student holding his written prediction hands it to Sophie. On the paper is written 1 < Tf < 30. The shaman declares victory: his prediction turned out to be correct, while Sophie's was incorrect (it was off by 0.134 seconds).
To counter the shaman's claim and because it was on the syllabus anyway, in the next class Sophie begins a discussion of probability theory. She goes over the basic ideas, and then connects them to the experimental prediction made about the ball. She points out that technically, the Newtonian prediction Tf=2 is not an assertion about the exact value of the outcome. Rather it should be interpreted as the mean of a probability distribution describing possible outcomes. For example, one might use a normal distribution with mean of 2 and standard deviation of .3. The reason the shaman superficially seemed to win the contest is that he gave a probability distribution while Sophie gave a point prediction; these two types of forecast are not really comparable. In the light of probability theory, the reason to prefer the Newtonian prediction above the shamanic one, is that it assigns a higher probability to the outcome that actually occurred. Now, plausibly, if only a single trial is used then the Newtonian theory might simply have gotten lucky, so the reasonable thing to do is combine the results over many trials, by multiplying the probabilities together. Therefore the real reason to prefer the Newtonian theory to the shamanic theory is that:
Where the k index runs over many trials of the experiment. Sophie then shows how the Newtonian probability predictions are both more confident and more correct than the shamanic predictions. The Newtonian predictions assign a very large amount of probability to the region around the outcome Tf=2, and in fact it turns out that almost all of the real data outcomes fall in this range. In contrast, the shamanic prediction assigns a relatively small amount of probability to the Tf=2 region, because he has predicted a very wide interval (1 < Tf < 30). Thus while the shamanic prediction is correct, it is not very confident. The Newtonian prediction is correct and highly confident, and so it should be prefered.
Sophie tries to emphasize that the Newtonian probability prediction only works well for the real data. Because of the requirement that probability distributions be normalized, the Newtonian theory can only achieve superior high performance by reassigning probability towards the region around Tf=2 and away from other regions. A theory that does not perform this kind of reassignment cannot achieve superior high performance.
Sophie recalls that some of the students are studying computer science and for their benefit points out the following. Information theory provides the standard equation L(x) = -log P(x) governs the relationship between the probability of an outcome and the length of the optimal code that should be used to represent it. Therefore, given a large data file containing the results of many trials of the ballistic motion experiment, the two predictions (Newtonian and shamanic) can both be used to build specialized programs to compress the data file. Using the codelength/probability conversion, the above inequality can be rewritten as follows:
This inequality indicates an alternative criterion that can be used to decide between two rival theories. Given a data file recording measurements related to a phenomenon of interest, a scientific theory can be used to write a compression program that will shrink the file to a small size. Given two rival theories of the same phenomenon, one invokes the corresponding compressors on a shared benchmark data set, and prefers the theory that achieves a smaller encoded file size. This criterion is equivalent to the probability-based one, but has the advantage of being more tangible, since the quantities of interest are file lengths instead of probabilities.
Episode II: The Dead Experimentalist
Sophie is a theoretical physicist and, upon taking up her position as assistant professor, began a collaboration with a brilliant experimental physicist who had been working at the university for some time. The experimentalist had previously completed the development of an advanced apparatus that allowed the investigation of an exotic new kind of quantum phenomenon. Using data obtained from the new system, Sophie made rapid progress in developing a mathematical theory of the phenomenon. Tragically, just before Sophie was able complete her theory, the experimentalist was killed in a laboratory explosion that also destroyed the special apparatus. After grieving for a couple of months, Sophie decided that the best way to honor her friend's memory would be to bring the research they had been working on to a successful conclusion.
Unfortunately, there is a critical problem with Sophie's plan. The experimental apparatus was extremely complex, and Sophie's late partner was the only person in the world who knew how to use it. He had run many trials of the system before his death, so Sophie had a quite large quantity of data. But she had no way of generating any new data. Thus, no matter how beautiful and perfect her theory might be, she had no way of testing it by making predictions.
One day while thinking about the problem Sophie recalls the incident with the shaman. She remembers the point she had made for the benefit of the software engineers, about how a scientific theory could be used to compress a real world data set to a very small size. Inspired, she decides to apply the data compression principle as a way of testing her theory. She immediately returns to her office and spends the next several weeks writing Matlab code, converting her theory into a compression algorithm. The resulting compressor is highly successful: it shrinks the corpus of experimental data from an initial size of 8.7e11 bits to an encoded size of 3.3e9 bits. Satisfied, Sophie writes up the theory, and submits it to a well-known physics journal.
The journal editors like the theory, but are a bit skeptical of the compression based method for testing the theory. Sophie argues that if the theory becomes widely known, one of the other experts in the field will develop a similar apparatus, which can then be used to test the theory in the traditional way. She also offers to release the experimental data, so that other researchers can test alternative theories using the same compression principle. Finally she promises to release the source code of her program, to allow external verification of the compression result. These arguments finally convince the journal editors to accept the paper.
Episode III: The Upstart Theory
After all the mathematics, software development, prose revisions, and persuasion necessary to complete her theory and have the paper accepted, Sophie decides to reward herself by living the good life for a while. She is confident that her theory is essentially correct, and will eventually be recognized as correct by her colleagues. So she spends her time reading novels and hanging out in coffee shops with her friends.
A couple of months later, however, she receives an unpleasant shock in the form of an email from a colleague which is phrased in consolatory language, but does not contain any clue as to why such language might be in order. After some investigation she finds out that a new paper has been published about the same quantum phenomenon of interest to Sophie. The paper proposes a alternative theory of the phenomenon which bears no resemblance whatever to Sophie's. Furthermore, the paper reports a better compression rate than was achieved by Sophie, on the database that she released.
Sophie reads the new paper and quickly realizes that it is worthless. The theory depends on the introduction of a large number of additional parameters, the values of which must be obtained from the data itself. In fact, a substantial portion of the paper involves a description of a statistical algorithm that estimates optimal parameter values from the data. In spite of these aesthetic flaws, she finds that many of her colleagues are quite taken with the new paper and some consider it to be "next big thing".
Sophie sends a message to the journal editors describing in detail what she sees as the many flaws of the upstart paper. She emphasizes the asthetic weakness of the new theory, which requires tens of thousands of new parameters. The editors express sympathy, but point out that the new theory outperforms Sophie's theory using the performance metric she herself proposed. The beauty of a theory is important, but its correctness is ultimately more important.
Somewhat discouraged, Sophie sends a polite email to the authors of the new paper, congratulating them on their result and asking to see their source code. Their response, which arrives a week later, contains a vague excuse about how the source code is not properly documented and relies on proprietary third party libraries. Annoyed, Sophie contacts the journal editors again and asks them for the program they used to verify the compression result. They reply with a link to a binary version of the program.
When Sophie clicks on the link to download the program, she is annoyed to find it has a size of 800 megabytes. But her annoyance is quickly transformed into enlightenment, as she realizes what happened, and that her previous philosophy contained a serious flaw. The upstart theory is not better than hers; it has only succeeded in reducing the size of the encoded data by dramatically increasing the size of the compressor. Indeed, when dealing with specialized compressors, the distinction between "program" and "encoded data" becomes almost irrelevant. The critical number is not the size of the compressed file, but the net size of the encoded data plus the compressor itself.
Sophie writes a response to the new paper which describes the refined compression rate principle. She begins the paper by reiterating the unfortunate circumstances which forced her to appeal to the principle, and expressing the hope that someday an experimental group will rebuild the apparatus developed by her late partner, so that the experimental predictions made by the two theories can be properly tested. Until that day arrives, standard scientific practice does not permit a decisive declaration of theoretical success. But surely there is some theoretical statement that can be made in the meantime, given the large amount of data available. Sophie's proposal is that the goal should be to find the theory that has the highest probability of predicting a new data set, when it can finally be obtained. If the theories are very simple in comparison to the data being modeled, then the size of the encoded data file is a good way of choosing the best theory. But if the theories are complex, then there is a risk of overfitting the data. To guard against overfitting complex theories must be penalized; a simple way to do this is simply to take into account the codelength required for the compressor itself. The length of Sophie's compressor was negligible, so the net score of her theory is just the codelength of the encoded data file: 3.3e9 bits. The rival theory achieved a smaller size of 2.1e9 for the encoded data file, but required a compressor of 6.7e9 bits to do so, giving a total score of 8.8e9 bits. Since Sophie's net score is lower, her theory should be prefered.
So, LWers, what do you think? For the present, let's leave aside questions like why this might be relevant for AI, and focus on whether or not the method is a legitimate restatement of the traditional method. If you were a physicist observing the dispute and trying to decide which theory to prefer, would you believe Sophie's, Sophie's rivals' theory, or neither? Some plausible objections are:
- A - Refuse to accept the probabilistic interpretation of the Newtonian prediction (or maybe the conversion to a codelength comparison).
- B - Refuse to accept the significance of Sophie's compression result, without knowing more about the format of the original database.
- C - Refuse to accept that Sophie's result will probably generalize to a new data set, or that penalizing large compressors is an adequate way of guarding against overfitting.
(Originality disclaimer: this post, by itself, is not Highly Original. It is heavily influenced by Eliezer's articles linked to above, as well as the ideas of Minimum Description Length and Kolmogorov Complexity, and especially an article by Matt Mahoney providing a rationale for a large text compression benchmark. The new ideas mostly involve implications of the above method, which will be discussed in later posts.)
So, now we have a second uninformative article in your series, in which you're just stating the minimum message length (MML) formalism (as you note at the end), which most people here are already familiar with, and which we already accept as a superior epistemology to traditional science.
And you took a lot more words to say it than you needed to.
Now, if you were just out to present a new, more accessible introduction to MML, that would be great: stuff that helps people understand the foundation of rationality is welcome here. But you're claiming you have a new idea, and yet you've already blown over 5,000 words saying nothing new. Commenters asked you last time to get to the point.
Please do so.
Then we can explain to you that people like Marcus Hutter (who supports the compression benchmark and advised Matt Mahoney) are well aware of this epistemology, and yet still haven't produced a being with the intelligence of a three-year-old, despite having had computable algorithms that implement this for more than three years now. A little more insight is still needed, beyond MML-type induction.
ETA: You know what? Daniel_Burfoot is still getting net positive karma for both articles, despite not advancing any seemingly promising idea. I might as well post my rival research program and corresponding desired model in a top-level article. I probably won't have all the citations to drop, but I'm sure most here will find it more insightful and promising than fluffy, delaying articles like this one. And you'll see something new as well.
I thought it was pretty well-written. (And stating the "obvious" isn't always a bad thing.)
Sure, if you're a fan of fluff and haven't been down the "Ultimate AI Idea" street too many times.
I agree, which is why I said:
But that doesn't apply here because:
Occam's razor is part of the traditional scientific method - according to:
You can't really do much useful science without employing Occam's razor.
Sure, but traditional science's dependence on Occam is only implicit.
That Wikipedia article mentions it by name, though. It says:
"One of the elements of any scientific utility function is the refutability of the model. Another is its simplicity, on the Principle of Parsimony also known as Occam's Razor."
This comment just seems really harsh to me... I understand what you're saying but surely the author doesn't have bad intentions here...
Flying by the intuitive seat of my pants, it seems that minimum message length of theory + data is too kind to moderately complex theories that explain extremely complex data, as compared to very simple theories that explain the data almost as well. Maybe the product of the two MML's, theory and data, would be more apt.
For what it's worth, some of this was new to me, despite Kolmogorov Complexity being somewhat familiar.
Kolmogorov complexity is somewhat familiar to me too.
Optimizing for the sum of the lengths of the theory and the compressed data seems to be the right thing to do because one can always store part of the theory in the compressed data. This doesn't change the sum (theory + compressed data). Optimizing for the product might reward this behavior too much.
While I don't think this is in any obvious way helpful for your formalism, the sections from Sophie are very good descriptors of a lot of the issues related to hypothesis forming and use of Occam's razor.
That said, I'm not sure that your initial summary of the traditional scientific method is accurate. Hypotheses are rarely tested in isolation. Among other problems, hypotheses generally live inside larger frameworks. Moreover, observation of data is theory-laden and often only even makes sense in such large contexts. It might help to read some classical history of science and phil sci stuff since it seems from this very short descriptor of the classic scientific method that you don't have much familiarity with that. Popper, Kuhn, Quine and Lakatos would all be highly recommended.
Edit: Overall, the sections with Sophie are well done enough I almost wonder if they should be split off and added to one of the major sequences or added as an additional essay for one of the sequences.
I think Kevin T Kelly has some slight adjustment of this "the scientific method is compression" paradigm. http://www.andrew.cmu.edu/user/kk3n/ockham/Ockham.htm
As far as I understand, the basic idea is: In order to possibly eventually become correct, you must switch among theories as you acquire more evidence, moving from simpler to more complex, because it's impossible to list theories from more complex to simpler (there are no monotonic descending functions from the natural numbers to the natural numbers, and theories can be Godel-coded).
The claim "the simplest theory that fits the data is also most likely to be correct" (or a variation of that claim regarding compression performance) is a factual claim about the world - a claim that may not be true (aggregate methods, boosting and bagging, can yield better predictors than predicting with the simplest theory).
Kevin Kelly is providing an alternative reason why we should follow simplicity in scientific method, one not based on these dubious factual claims.
I think the majority of research in machine learning indicates that this claim IS true. Certainly all methods of preventing overfitting that I am aware of involve some form of capacity control, regularization, or model complexity penalty. If you can cite a generalization theorem that does not depend on some such scheme, I would be very interested to hear about it.
Echoing what others have written, this is an excellent and well-written intro to MML. But you should be clear from the beginning that this is all standard, at least in certain circles, and that you're just laying the groundwork for something innovative to be posted later. This can only add to the rhetorical strength of the piece, since the reader won't go in with the heightened skepticism that they reserve for the completely novel.
Thanks for the constructive feedback.
This seems very well written and I'd like to complement you on that regard. I find the shaman example amusing and also very fun to read.
For Sophie, if she has a large data set, then her theory should be able to predict a data set for the same experimental configuration, and then the the two data sets would be compared. That is the obvious standard and I'm not sure why it's not permitted here. Perhaps you were trying to emphasize Sophie's desire to go on and test her theory on different experimental parameters, etc.
The original shaman example works very well for me, it is rather basic and doesn't make any very unsubstantiated claims. In the next examples, however, there needs to be more elaboration on the method in which you go from theory --> data. In the post you say,
Without knowing the details of how you go from theory to compressed end product, it's hard to say that this method makes sense. Actually, I would probably be fairly satisfied if you stopped after the second section. But when you introduce the third section, with the competition between colleagues, it implies there is some kind of unknown, nontrivial relation between fitting parameters of the theory, the theory, the compression program, the compression program data size, and the final compressed data.
It all seems pretty vague to make a conclusion like "add the compression program size and the final data size to get the final number".
Read that, voted it up, stopped. Further pages of elaboration unnecessary.
(Downvoted as a rather lengthy and non-explicit presentation of well-known or obvious ideas.)
That's a rather self-centered critique, isn't it? I didn't know about this idea, for example.
Thanks for the vocabulary add. Not worth 10 karma, but do feel free to reply for an upvote. :)
What's the pronunciation?
First "c" hard, second soft. (Compare "recalcitrant", which shares some etymology.)
Is that two syllables or three? My mind refuses to accept the two-syllable pronunciation with a soft c, [dIskalst], as English; it rejects the -lst. But putting a vowel in there, [dIskalsEd], doesn't seem right either, since the -ed suffix doesn't normally have a vowel. Could you provide a phonetic transcription or recording?