Summary: This post provides a brief discussion of the traditional scientific method, and mentions some areas where the method cannot be directly applied. Then, through a series of thought experiments, a set of minor modifications to the traditional method are presented. The result is a refined version of the method, based on data compression.
ETA: For those who are familiar with notions such as Kolmogorov Complexity and MML, this piece may have a low ratio of novelty:words. The basic point is that one can compare scientific theories by instantiating them as compression programs, using them to compress a benchmark database of measurements related to a phenomenon of interest, and comparing the resulting codelengths (taking into account the length of the compressor itself).
Notes on Traditional Method
This post proposes a refined version of the scientific method which, it will be argued later, is more directly applicable to the problems of interest in artificial intelligence. Before doing so, it is worth briefly examining the traditional method and the circumstances in which it can be applied. The scientific method is not an exact procedure, but a qualitative statement of it goes roughly as follows:
- Observe a natural phenomenon.
- Develop a theory of that phenomenon.
- Use the theory to make a prediction.
- Test the prediction experimentally.
A full discussion of the philosophical significance of the scientific method is beyond the scope of this post, but some brief remarks are in order. The power of the scientific method is in the way it links theory with experimental observation; either one of these alone is worthless. The long checkered intellectual history of humanity clearly shows how rapidly pure theoretical speculation goes astray when it is not tightly constrained by an external guiding force. Pure experimental investigation, in contrast, is of limited value because of the vast number of possible configurations of objects. To make predictions solely on the basis of experimental data, it would be necessary to exhaustively test each configuration.
As articulated in the above list, the goal of the method appears to be the verification of a single theory. This is a bit misleading; in reality the goal of the method is to facilitate selection between a potentially large number of candidate theories. Given two competing theories of a particular phenomenon, the researcher identifies some experimental configuration where the theories make incompatible predictions and then performs the experiment using the indicated configuration. The theory whose predictions fail to match the experimental outcome is discarded in favor of its rival. But even this view of science as a process of weeding out imperfect theories in order to find the perfect one is somewhat inaccurate. Most physicists will admit that even their most refined theories are mere approximations, though they are spectacularly accurate approximations. The scientific method can therefore be understood as a technique for using empirical observations to find the best predictive approximation from a large pool of candidates.
A core component of the traditional scientific method is the use of controlled experiments. To control an experiment means essentially to simplify it. To determine the effect of a certain factor, one sets up two experimental configurations which are exactly the same except for the presence or absence of the factor. If the experimental outcomes are different, then it can be inferred that this disparity is due to the special factor.
In some fields of scientific inquiry, however, it is impossible or meaningless to conduct controlled experiments. No two people are identical in all respects, so clinical trials for new drugs, in which the human subject is part of the experimental configuration, can never be truly controlled. The best that medical researchers can do is to attempt to ensure that the experimental factor does not systematically correlate with other factors that may affect the outcome. This is done by selecting at random which patients will receive the new treatment. This method is obviously limited, however, and these limitations lead to deep problems in the medical literature. It is similarly difficult to apply the traditional scientific method to answer questions arising in the field of macroeconomics. No political leader would ever agree to a proposal in which her country's economy was to be used as an experimental test subject. In lieu of controlled experiments, economists attempt to test their theories based on the outcomes of so-called historical experiments, where two originally similar countries implemented different economic policies.
A similar breakdown of the traditional method occurs in computer vision (recall that my hypothesis asserts that perception and prediction are the major components of intelligence, implying that the study of vision is central to the study of AI). Controlled vision experiments can be conducted, but are of very little interest. The physical laws of reflection and optics that govern the image formation process are well understood already. Clearly if the same camera is used to photograph an identical scene twice under constant lighting conditions, the obtained images will be identical or very nearly so. And a deterministic computer vision algorithm will always produce the same result when applied to two identical images. It is not clear, therefore, how to use the traditional method to approach the problems of interest in computer vision, which include tasks like image segmentation and edge detection.
(The field of computer vision will be discussed in later posts. For now, the important thing to realize is that there are deep, deep problems in evaluating computer vision techniques. Given two image segmentation algorithms, how do you decide which one is better? The field has no compelling answer. The lack of empirical rigor in computer vision has been lamented in papers with titles like "Computer Vision Theory: the Lack Thereof" and "Ignorance, Myopia, and Naivete in Computer Vision Systems".)
The modifications to the scientific method are presented through a series of thought experiments related to a fictional character named Sophie.
Episode I: The Shaman
Sophie is an assistant professor of physics at a large American state university. She finds this job vexing for several reasons, one of which is that she has been chosen by the department to teach a physics class intended for students majoring in the humanities, for whom it serves to fill a breadth requirement. The students in this class, who major in subjects like literature, religious studies, and philosophy, tend to be intelligent but also querulous and somewhat disdainful of the "merely technical" intellectual achievements of physics.
In the current semester she has become aware of the presence in her class of a discalced student with a large beard and often bloodshot eyes. This student is surrounded by an entourage of similarly odd-looking followers. Sophie is on good terms with some of the more serious students in the class, and in conversation with them has found out that the odd student is attempting to start a new naturalistic religious movement and refers to himself as a "shaman".
One day while delivering a simple lecture on Newtonian mechanics, she is surprised when the shaman raises his hand and claims that physics is a propagandistic hoax designed by the elites as a way to control the population. Sophie blinks several times, and then responds that physics can't be a hoax because it makes real-world predictions that can be verified by independent observers. The shaman counters by claiming that the so-called "predictions" made by physics are in fact trivialities, and that he can obtain better forecasts by communing with the spirit world. He then proceeds to challenge Sophie to a predictive duel, in which the two of them will make forecasts regarding the outcome of a simple experiment, the winner being decided based on the accuracy of the forecasts. Sophie is taken aback by this but, hoping that by proving the shaman wrong she can break the spell he has cast on some of the other students, agrees to the challenge.
During the next class, Sophie sets up the following experiment. She uses a spring mechanism to launch a ball into the air at an angle A. The launch mechanism allows her to set the initial velocity of the ball to a value of Vi. She chooses as a predictive test the problem of predicting the time Tf at which the ball will fall back to the ground after being launched at Ti=0. Using a trivial Newtonian calculation she concludes that Tf = 2 Vi sin(A)/g, sets Vi and A to give a value of Tf=2 seconds, and announces her prediction to the class. She then asks the shaman for his prediction. The shaman declares that he must consult with the wind spirits, and then spends a couple of minutes chanting and muttering. Then, dramatically flaring open his eyes as if to signify a moment of revelation, he grabs a piece of paper, writes his prediction on it, and then hands it to another student. Sophie suspects some kind of trick, but is too exasperated to investigate and so launches the ball into the air. The ball is equipped with an electronic timer that starts and stops when an impact is detected, and so the number registered in the timer is just the time of flight Tf. A student picks up the ball and reports that the result is Tf = 2.134 seconds. The shaman gives a gleeful laugh, and the student holding his written prediction hands it to Sophie. On the paper is written 1 < Tf < 30. The shaman declares victory: his prediction turned out to be correct, while Sophie's was incorrect (it was off by 0.134 seconds).
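For readers who want to check Sophie's "trivial Newtonian calculation", here is a minimal Python sketch of the formula Tf = 2 Vi sin(A)/g. The story does not say which angle or launch speed she used, so the 45-degree angle and the value g = 9.81 m/s^2 below are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed; the story gives no value)


def time_of_flight(v_i, angle_deg, g=G):
    """Time for the ball to return to launch height: Tf = 2 Vi sin(A) / g."""
    return 2.0 * v_i * math.sin(math.radians(angle_deg)) / g


def launch_speed_for(t_f, angle_deg, g=G):
    """Invert the formula: the launch speed Vi that yields a desired Tf."""
    return t_f * g / (2.0 * math.sin(math.radians(angle_deg)))


# Hypothetical setting: pick an angle, then solve for the speed that gives Tf = 2 s.
angle = 45.0
v_i = launch_speed_for(2.0, angle)
print(f"Vi = {v_i:.3f} m/s at A = {angle} degrees gives Tf = {time_of_flight(v_i, angle):.3f} s")
```

The calculation ignores air resistance and assumes the ball lands at the height from which it was launched, which is one reason a real measurement such as 2.134 seconds can differ from the idealized value of 2.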
To counter the shaman's claim and because it was on the syllabus anyway, in the next class Sophie begins a discussion of probability theory. She goes over the basic ideas, and then connects them to the experimental prediction made about the ball. She points out that technically, the Newtonian prediction Tf=2 is not an assertion about the exact value of the outcome. Rather it should be interpreted as the mean of a probability distribution describing possible outcomes. For example, one might use a normal distribution with a mean of 2 and a standard deviation of 0.3. The reason the shaman superficially seemed to win the contest is that he gave a probability distribution while Sophie gave a point prediction; these two types of forecast are not really comparable. In the light of probability theory, the reason to prefer the Newtonian prediction over the shamanic one is that it assigns a higher probability to the outcome that actually occurred. Now, plausibly, if only a single trial is used then the Newtonian theory might simply have gotten lucky, so the reasonable thing to do is combine the results over many trials, by multiplying the probabilities together. Therefore the real reason to prefer the Newtonian theory to the shamanic theory is that:
$$\prod_k P_N\left(T_f^{(k)}\right) > \prod_k P_S\left(T_f^{(k)}\right)$$
Here P_N and P_S are the probabilities that the Newtonian and shamanic predictions assign to the observed outcome of trial k, and the k index runs over many trials of the experiment. Sophie then shows how the Newtonian probability predictions are both more confident and more correct than the shamanic predictions. The Newtonian predictions assign a very large amount of probability to the region around the outcome Tf=2, and in fact it turns out that almost all of the real data outcomes fall in this range. In contrast, the shamanic prediction assigns a relatively small amount of probability to the Tf=2 region, because he has predicted a very wide interval (1 < Tf < 30). Thus while the shamanic prediction is correct, it is not very confident. The Newtonian prediction is correct and highly confident, and so it should be preferred.
Sophie tries to emphasize that the Newtonian probability prediction only works well for the real data. Because of the requirement that probability distributions be normalized, the Newtonian theory can only achieve superior performance by reassigning probability towards the region around Tf=2 and away from other regions. A theory that does not perform this kind of reassignment cannot achieve superior performance.
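The comparison Sophie describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Newtonian forecast is taken to be the normal distribution with mean 2 and standard deviation 0.3 mentioned as an example in the lecture, and the shaman's interval 1 < Tf < 30 is interpreted as a uniform density over that interval, which is my reading rather than anything the shaman said. The only outcome used is the single measurement reported in the story; the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, lo, hi):
    """Density of a uniform distribution on (lo, hi); zero outside the interval."""
    return 1.0 / (hi - lo) if lo < x < hi else 0.0


def log_likelihood(pdf, outcomes):
    """Sum of log densities, i.e. the log of the product over trials."""
    total = 0.0
    for x in outcomes:
        p = pdf(x)
        if p == 0.0:
            return float("-inf")  # one impossible outcome sinks the whole product
        total += math.log(p)
    return total


def newton(t):
    return normal_pdf(t, 2.0, 0.3)   # example distribution from the lecture


def shaman(t):
    return uniform_pdf(t, 1.0, 30.0)  # assumed reading of "1 < Tf < 30"


outcomes = [2.134]  # the one measurement reported in the story
print("Newtonian log-likelihood:", log_likelihood(newton, outcomes))
print("Shamanic log-likelihood: ", log_likelihood(shaman, outcomes))
```

Working with sums of logarithms rather than raw products avoids numerical underflow when the list of outcomes grows long. The normalization point also shows up directly: the uniform density is spread thinly over a 29-second interval, so it can never be large anywhere.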
Sophie recalls that some of the students are studying computer science and for their benefit points out the following. Information theory provides the standard equation L(x) = -log P(x), which governs the relationship between the probability of an outcome and the length of the optimal code that should be used to represent it. Therefore, given a large data file containing the results of many trials of the ballistic motion experiment, the two predictions (Newtonian and shamanic) can both be used to build specialized programs to compress the data file. Taking the negative logarithm of both sides turns each product into a sum and reverses the direction of the inequality. Using the codelength/probability conversion, and writing L_N and L_S for the codelengths assigned by the Newtonian and shamanic theories, the above inequality can be rewritten as follows:
$$\sum_k L_N\left(T_f^{(k)}\right) < \sum_k L_S\left(T_f^{(k)}\right)$$
This inequality indicates an alternative criterion that can be used to decide between two rival theories. Given a data file recording measurements related to a phenomenon of interest, a scientific theory can be used to write a compression program that will shrink the file to a small size. Given two rival theories of the same phenomenon, one invokes the corresponding compressors on a shared benchmark data set, and prefers the theory that achieves a smaller encoded file size. This criterion is equivalent to the probability-based one, but has the advantage of being more tangible, since the quantities of interest are file lengths instead of probabilities.
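To make the codelength version concrete, here is a hedged Python sketch that converts each forecast into a number of bits per recorded outcome using L(x) = -log2 P(x). A codelength needs a probability rather than a density, so the sketch assumes the timer records values to the nearest 0.001 seconds (suggested by the reading 2.134, but not stated in the story) and computes the probability of each recorded bin. As before, the normal distribution with mean 2 and standard deviation 0.3 is the example from the lecture, the uniform reading of the shaman's interval is an assumption, and all names are hypothetical.

```python
import math

RESOLUTION = 0.001  # seconds; assumed timer resolution, not given in the story


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def newton_bin_prob(t, mean=2.0, sd=0.3, width=RESOLUTION):
    """Probability that the timer reading falls in the bin centered on t."""
    return normal_cdf(t + width / 2, mean, sd) - normal_cdf(t - width / 2, mean, sd)


def shaman_bin_prob(t, lo=1.0, hi=30.0, width=RESOLUTION):
    """Bin probability under a uniform distribution on (lo, hi)."""
    return width / (hi - lo) if lo < t < hi else 0.0


def codelength_bits(prob):
    """Optimal codelength L = -log2 P; impossible outcomes cost infinitely many bits."""
    return float("inf") if prob <= 0.0 else -math.log2(prob)


def total_codelength(bin_prob, outcomes):
    """Sum of per-trial codelengths: the size of the encoded data in bits."""
    return sum(codelength_bits(bin_prob(t)) for t in outcomes)


outcomes = [2.134]  # the one measurement reported in the story
newton_bits = total_codelength(newton_bin_prob, outcomes)
shaman_bits = total_codelength(shaman_bin_prob, outcomes)
print(f"Newtonian code: {newton_bits:.2f} bits, shamanic code: {shaman_bits:.2f} bits")
```

An actual compressor would feed these bin probabilities to an arithmetic coder, which gets within a small overhead of the ideal total. The choice of resolution adds the same constant to both codelengths, so it does not affect which theory wins.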
Episode II: The Dead Experimentalist
Sophie is a theoretical physicist and, upon taking up her position as assistant professor, began a collaboration with a brilliant experimental physicist who had been working at the university for some time. The experimentalist had previously completed the development of an advanced apparatus that allowed the investigation of an exotic new kind of quantum phenomenon. Using data obtained from the new system, Sophie made rapid progress in developing a mathematical theory of the phenomenon. Tragically, just before Sophie was able to complete her theory, the experimentalist was killed in a laboratory explosion that also destroyed the special apparatus. After grieving for a couple of months, Sophie decided that the best way to honor her friend's memory would be to bring the research they had been working on to a successful conclusion.
Unfortunately, there is a critical problem with Sophie's plan. The experimental apparatus was extremely complex, and Sophie's late partner was the only person in the world who knew how to use it. He had run many trials of the system before his death, so Sophie had a quite large quantity of data. But she had no way of generating any new data. Thus, no matter how beautiful and perfect her theory might be, she had no way of testing it by making predictions.
One day while thinking about the problem Sophie recalls the incident with the shaman. She remembers the point she had made for the benefit of the computer science students, about how a scientific theory could be used to compress a real world data set to a very small size. Inspired, she decides to apply the data compression principle as a way of testing her theory. She immediately returns to her office and spends the next several weeks writing MATLAB code, converting her theory into a compression algorithm. The resulting compressor is highly successful: it shrinks the corpus of experimental data from an initial size of 8.7e11 bits to an encoded size of 3.3e9 bits. Satisfied, Sophie writes up the theory, and submits it to a well-known physics journal.
The journal editors like the theory, but are a bit skeptical of the compression based method for testing the theory. Sophie argues that if the theory becomes widely known, one of the other experts in the field will develop a similar apparatus, which can then be used to test the theory in the traditional way. She also offers to release the experimental data, so that other researchers can test alternative theories using the same compression principle. Finally she promises to release the source code of her program, to allow external verification of the compression result. These arguments finally convince the journal editors to accept the paper.
Episode III: The Upstart Theory
After all the mathematics, software development, prose revisions, and persuasion necessary to complete her theory and have the paper accepted, Sophie decides to reward herself by living the good life for a while. She is confident that her theory is essentially correct, and will eventually be recognized as correct by her colleagues. So she spends her time reading novels and hanging out in coffee shops with her friends.
A couple of months later, however, she receives an unpleasant shock in the form of an email from a colleague which is phrased in consolatory language, but does not contain any clue as to why such language might be in order. After some investigation she finds out that a new paper has been published about the same quantum phenomenon of interest to Sophie. The paper proposes an alternative theory of the phenomenon which bears no resemblance whatever to Sophie's. Furthermore, the paper reports a better compression rate than was achieved by Sophie, on the database that she released.
Sophie reads the new paper and quickly realizes that it is worthless. The theory depends on the introduction of a large number of additional parameters, the values of which must be obtained from the data itself. In fact, a substantial portion of the paper involves a description of a statistical algorithm that estimates optimal parameter values from the data. In spite of these aesthetic flaws, she finds that many of her colleagues are quite taken with the new paper and some consider it to be the "next big thing".
Sophie sends a message to the journal editors describing in detail what she sees as the many flaws of the upstart paper. She emphasizes the aesthetic weakness of the new theory, which requires tens of thousands of new parameters. The editors express sympathy, but point out that the new theory outperforms Sophie's theory using the performance metric she herself proposed. The beauty of a theory is important, but its correctness is ultimately more important.
Somewhat discouraged, Sophie sends a polite email to the authors of the new paper, congratulating them on their result and asking to see their source code. Their response, which arrives a week later, contains a vague excuse about how the source code is not properly documented and relies on proprietary third party libraries. Annoyed, Sophie contacts the journal editors again and asks them for the program they used to verify the compression result. They reply with a link to a binary version of the program.
When Sophie clicks on the link to download the program, she is annoyed to find it has a size of 800 megabytes. But her annoyance is quickly transformed into enlightenment, as she realizes what happened, and that her previous philosophy contained a serious flaw. The upstart theory is not better than hers; it has only succeeded in reducing the size of the encoded data by dramatically increasing the size of the compressor. Indeed, when dealing with specialized compressors, the distinction between "program" and "encoded data" becomes almost irrelevant. The critical number is not the size of the compressed file, but the net size of the encoded data plus the compressor itself.
Sophie writes a response to the new paper which describes the refined compression rate principle. She begins the paper by reiterating the unfortunate circumstances which forced her to appeal to the principle, and expressing the hope that someday an experimental group will rebuild the apparatus developed by her late partner, so that the experimental predictions made by the two theories can be properly tested. Until that day arrives, standard scientific practice does not permit a decisive declaration of theoretical success. But surely there is some theoretical statement that can be made in the meantime, given the large amount of data available. Sophie's proposal is that the goal should be to find the theory that has the highest probability of predicting a new data set, when it can finally be obtained. If the theories are very simple in comparison to the data being modeled, then the size of the encoded data file is a good way of choosing the best theory. But if the theories are complex, then there is a risk of overfitting the data. To guard against overfitting, complex theories must be penalized; a simple way to do this is to take into account the codelength required for the compressor itself. The length of Sophie's compressor was negligible, so the net score of her theory is just the codelength of the encoded data file: 3.3e9 bits. The rival theory achieved a smaller size of 2.1e9 bits for the encoded data file, but required a compressor of 6.7e9 bits to do so, giving a total score of 8.8e9 bits. Since Sophie's net score is lower, her theory should be preferred.
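The scoring rule in Sophie's response is simple enough to write down directly. The sketch below uses only the figures quoted in the story; treating Sophie's "negligible" compressor as exactly zero bits is a simplification, and reading the 800 megabyte download as 800 x 2^20 bytes is an assumption that happens to reproduce the quoted 6.7e9 bits. The function name is hypothetical.

```python
def net_score_bits(encoded_bits, compressor_bits):
    """Two-part score: size of the encoded data plus size of the compressor."""
    return encoded_bits + compressor_bits


# Assumption: "800 megabytes" means 800 * 2**20 bytes, at 8 bits per byte.
rival_compressor_bits = 800 * 2**20 * 8  # 6,710,886,400 bits, roughly 6.7e9

# Figures as quoted in the story; Sophie's compressor is treated as 0 bits.
sophie_score = net_score_bits(encoded_bits=3.3e9, compressor_bits=0.0)
rival_score = net_score_bits(encoded_bits=2.1e9, compressor_bits=6.7e9)

print(f"Sophie: {sophie_score:.2e} bits, rival: {rival_score:.2e} bits")
print("Preferred theory:", "Sophie's" if sophie_score < rival_score else "the rival's")
```

Under this rule the rival could only overtake Sophie by shrinking its compressor to less than 1.2e9 bits while keeping the same encoded size, since 3.3e9 - 2.1e9 = 1.2e9.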
So, LWers, what do you think? For the present, let's leave aside questions like why this might be relevant for AI, and focus on whether or not the method is a legitimate restatement of the traditional method. If you were a physicist observing the dispute and trying to decide which theory to prefer, would you believe Sophie's, Sophie's rivals' theory, or neither? Some plausible objections are:
- A - Refuse to accept the probabilistic interpretation of the Newtonian prediction (or maybe the conversion to a codelength comparison).
- B - Refuse to accept the significance of Sophie's compression result, without knowing more about the format of the original database.
- C - Refuse to accept that Sophie's result will probably generalize to a new data set, or that penalizing large compressors is an adequate way of guarding against overfitting.
(Originality disclaimer: this post, by itself, is not Highly Original. It is heavily influenced by Eliezer's articles linked to above, as well as the ideas of Minimum Description Length and Kolmogorov Complexity, and especially an article by Matt Mahoney providing a rationale for a large text compression benchmark. The new ideas mostly involve implications of the above method, which will be discussed in later posts.)