# 20

Personal Blog

Followup To: Play for a Cause, Singularity Institute \$100k Challenge Grant

In the spirit of informal intellectual inquiry and friendly wagering, and with an eye toward raising a bit of money for SIAI, I offer the following two challenges to the LW community.

Challenge #1 - Bayes' Nets Skeptics' Challenge

Many LWers seem to be strong believers in the family of modeling methods variously called Bayes' Nets, belief networks, or graphical models. These methods are the topic of two SIAI-recommended books by Judea Pearl: "Probabilistic Reasoning in Intelligent Systems" and "Causality: Models, Reasoning and Inference".

The belief network paradigm has several attractive conceptual features. One feature is the ability of the networks to encode conditional independence relationships, which are intuitively natural and therefore attractive to humans. Often a naïve investigation of the statistical relationship between variables will produce nonsensical conclusions, and the idea of conditional independence can sometimes be used to unravel the mystery. A good example would be a data set relating to traffic accidents, which shows that red cars are more likely to be involved in accidents. But it's nearly absurd to believe that red cars are intrinsically more dangerous. Rather, red cars are preferred by young men, who tend to be reckless drivers. So the color of a car is not independent of the likelihood of a collision, but it is conditionally independent given the age and sex of the person driving the car. This relationship could be expressed by the following belief network: Where "YM" is true if the driver is a young man, "R" denotes whether the car is red, and "A" indicates an accident. The fact that there is no edge betwee "R" and "A" indicates that they are conditionally independent given the other nodes in the network, in this case "YM".

A key property of the belief network scheme of constructing probability distributions is that they can be used to achieve a good balance between expressive power and model complexity. Consider the family of probability distributions over N variables that can take on K different values. Then the most naïve model is just an N-dimensional array of numbers P(X1, X2, ... XN), requiring K^N parameters to specify. The number of parameters required for a belief network can be drastically lower, if many of the nodes are conditionally independent of one another. For example, if all but one node in the graph has exactly one parent, then the number of parameters required is basically NK^2 (N conditional distributions of the form P(Xchild|Xparent)). This is clearly a drastic improvement for reasonable values of K and N. Even though the model is much less complex, every variable is still related to every other variable - knowing the value of any Xi will change the probability of any Xj, unless some other intervening node is also known.

Another attractive feature of the belief network paradigm is the existence of a fast inference algorithm for updating the probability distributions of some unobserved variables in response to evidence. For example, given that a patient has the feature "smoker=T" and "chest pain=T", the inference algorithm can rapidly calculate the probability distribution of the unobserved variable "heart disease". Unfortunately, there is a big catch - this inference algorithm works only for belief networks that can be expressed as acyclic graphs. If the graph is not acyclic, the computational cost of the inference algorithm is much larger (IIRC, it is exponential in the size of the largest clique in the graph).

In spite of these benefits, belief networks have some serious drawbacks as well. One major flaw is the difficulty of learning networks from data. Here the goal is to obtain a belief network specifying a distribution P(x) that assigns a very high likelihood to an empirical data set X={X1,X2...XM}, where now each X is an N-dimensional vector and there are M samples. The difficulty of learning belief networks stems from the fact that the number of graphs representing relationships between N variables is exponential in N.

There is a second, more subtle flaw in the belief network paradigm. A belief network is defined based on a graph which has one node for each data variable. This is fine, if you know what the correct variables are. But very often the correct variables are unknown; indeed, finding the correct variables is probably the key problem in learning. Once you know that the critical variables are mass, force, and acceleration, it is almost trivial to determine the relation between them. As an example, consider a system composed of three variables A,B,C and X. The unknown relationship between these variables is X~F(A+B+C), where F is some complex distribution that depends on a single parameter. Now, a naïve representation will yield a belief network that looks like the following: If we reencode the variables as A'=A, B'=A'+B, C'=B'+C, then we get the following graph: If you count the number of links this might not seem like much of an improvement. But the number of links is not the important factor; the key factor is the required number of entries in the conditional probability tables. In the first graph, the CPT for X requires O(K^4) entries, where K is again the number of values A, B, and C can take on. But in the second graph the number of entries is O(K^2). So clearly the second graph should be preferred, but the basic belief network learning algorithms provide no way of obtaining it.

On the basis of these remarks I submit the following qualified statement: while the belief network paradigm is mathematically elegant and intuitively appealing, it is NOT very useful for describing real data.

The challenge is to prove the above claim wrong. This can be done as follows. First, find a real data set (see below for definition of the word "real"). Construct a belief network model of the data, in any way you choose. Then post a link to the data set, and I will then attempt to model it using alternative methods of my own choosing (probably Maximum Entropy or a variant thereof). We will then compare the likelihoods achieved by the two methods; higher likelihood wins. If there is ambiguity concerning the validity of the result, then we will compress the data set using compression algorithms based on the models and compare compression rates.  Constructing a compressor from a statistical model is essentially a technical exercise; I can provide a Java implementation of arithmetic encoding. The compression rates must take into account the size of the compressor itself.

The challenge hinges on the meaning of the word "real data". Obviously it is trivial to construct a data set for which a belief network is the best possible model, simply by building a network and then sampling from it. So my requirement is that the data set be non-synthetic. Other than that, there are no limitations - it can be image data, text, speech, machine learning sets, NetFlix, social science databases, etc.

To make things interesting, the loser of the challenge will donate \$100 (more?) to SIAI. Hopefully we can agree on the challenge (but not necessarily resolve it) before the Feb. 28th deadline for matching donations. In principle I will accept challenges until I lose so many that my wallet starts to hurt.

Challenge #2: Compress the GSS

The General Social Survey is a widely used data set in the social sciences. Most analyses based on the GSS tend to use standard statistical tools such as correlation, analysis of variance, and so on. These kinds of analysis run into the usual problems associated with statistics - how do you choose a prior, how do you avoid overfitting, and so on.

I propose a new way of analyzing the data in the GSS - a compression-based challenge as outlined above. To participate in the challenge, you build a model of the data contained in the GSS using whatever methods appeal to you. Then you connect the model to an encoding method and compress the dataset. Whoever achieves the best compression rate, taking into account the size of the compressor itself, is the winner.

The GSS contains data about a great variety of economic, cultural, psychological, and educational factors. If you are a social scientist with a theory of how these factors relate, you can prove your theory by transforming it into a statistical model and then into a compression program, and demonstrating better compression results than rival theories.

If people are interested, I propose the following scheme. There is a \$100 entry fee, of which half will go to SIAI. The other half goes to the winner. Again, hopefully we can agree on the challenge before the Feb. 28th deadline.

# 20

New Comment

We will then compare the likelihoods achieved by the two methods; higher likelihood wins. If there is ambiguity concerning the validity of the result, then we will compress the data set using compression algorithms based on the models and compare compression rates. Constructing a compressor from a statistical model is essentially a technical exercise; I can provide a Java implementation of arithmetic encoding. The compression rates must take into account the size of the compressor itself.

"Likelihood" is ambiguous: a Bayes net can be defined with a single hidden node for "object ID", and all the observed nodes dependent only on the object ID, with conditional probability tables lifted from large look-up tables. This Bayes net would have a low probability from a prior belief distribution over Bayes nets but a high likelihood to the data.

To define compression unambiguously, you should agree on a programming language or executable format, and on runtime and memory bounds on a reference computer, as in the Hutter Prize.

An alternative to a test of compression would be a test of sequential prediction of e.g. individual real-valued or integer measurements, conditional on values of previous measurements according to some ordering. For each measurement, the predictor program would generate a predictive probability distribution, and a judge program would normalize the distribution or test that it summed to 1. The predictor's score would be the log of the product of the sequential likelihoods. Compared to arithmetic compressor/decompressor pairs, sequential predictors might be less technically demanding to program (no problem of synchronizing the compressor and decompressor), but might be more technically demanding to judge (a predictive distribution from Monte Carlo sampling may depend on allowed computation time). The size of the predictor might still need to be included.

The Bayes Net contestant should be permitted to use prior belief distributions over network structures (as here). (This can be encoded as a causal belief about a hidden cause of patterns of causation.)

The Bayes Net contestant may wish to use network search algorithms or priors that hypothesize causal models with unobserved latent nodes.

The Bayes Net contestant should remember that the Skeptic contestant can use compression schemes based on reconstructability analysis, which is relatively powerful, and Markov graphical models, which include all simple Bayes nets for purposes of this problem.

[. . .] while the belief network paradigm is mathematically elegant and intuitively appealing, it is NOT very useful for describing real data.

The challenge is to prove the above claim wrong. [. . .]

The challenge hinges on the meaning of the word "real data". [. . .] Other than that, there are no limitations - it can be image data, text, speech, machine learning sets, NetFlix, social science databases, etc.

Some data are about objects whose existence or relations are causally dependent on objects' properties. Simple Bayesian networks cannot usually represent knowledge of such causal dependencies, but there are more general families of Bayesian models which can. Would it be in the spirit of the challenge for the Bayes Net contestant to use these more general models? Would it be out of the spirit of the challenge for the data to be about such a collection of objects?

"Likelihood" is ambiguous: [...] This Bayes net would have a low probability from a prior belief distribution over Bayes nets but a high likelihood to the data.

Right - this would be what I'd call "cheating" or overfitting the data. We'd have to use the compression rate in this case.

To define compression unambiguously, you should agree on a programming language or executable format, and on runtime and memory bounds on a reference computer

Sure. I'll work out the technical details if anyone wants to enter the contest. I would prefer to use the most recent stable JVM. It seems very unlikely to me that the outcome of the contest will depend on precise selection of time or memory bounds - let's say, the time bound is O(24 hours) and the memory bound in O(2 GB).

An alternative to a test of compression

It's actually not very difficult to implement a compression program using arithmetic coding once you have the statistical model. Other prediction evaluation schemes may work, but compression has methodological crispness: look at the compressed file size, check that the decompressed data matches the original exactly.

Would it be in the spirit of the challenge for the Bayes Net contestant to use these more general models? Would it be out of the spirit of the challenge for the data to be about such a collection of objects?

Basically, when I say "belief networks", what I mean is the use of graphs to define probability distributions and conditional independence relationships.

The spirit of the contest is to use a truly "natural" data set. I admit that this is a bit vague. Really my only requirement is to use a non-synthetic data set. I think I know where you're going with the "causally dependent" line of thinking, but it doesn't bother me too much. I get the feeling that I am walking into a trap, but really I've been planning to make a donation to SIAI anyway, so I don't mind losing.

Unfortunately, there is a big catch - this inference algorithm works only for belief networks that can be expressed as acyclic graphs. If the graph is not acyclic, the computational cost of the inference algorithm is much larger (IIRC, it is exponential in the size of the largest clique in the graph).

Actually, all valid belief networks are acyclic graphs. What you're thinking of is that inference in tree-structured Bayes nets can be solved in polynomial time (e.g. using the Belief Propagation algorithm), whereas for non-tree structured graphs there is no such polynomail time algorithm currently known (iirc it has been proved that inference in Bayes nets is NP complete in general). Loopy Belief Propagation and variational methods can be used as non-deterministic approximate inference algorithms in non-tree structured Bayes nets.

In other words, "acyclic graphs" = graphs that are acyclic after the directionality of the arrows is forgotten, which is probably what he meant.

On the basis of these remarks I submit the following qualified statement: while the belief network paradigm is mathematically elegant and intuitively appealing, it is NOT very useful for describing real data.

It's sometimes possible to automatically reconstruct causal models from data. For example see Cosma Shalizi's thesis and CSSR software, they completely changed my view of the subject.

Thanks cousin_it! Read that thesis, everyone else! I just did, and it's amazing. Among other things, it contains a nice reduction of "emergence", one that isn't magical. Basically a process is emergent just when the fraction of historical memory stored in it which does "useful work" in the form of telling us about the future is greater than this fraction is in the process it derives from (pg 115-116).

More precisely, the fraction is the ratio of the process' excess entropy (mutual information between its semi-infinite past and its semi-infinite future) and its statistical complexity (entropy of the causal states (informally: the class of sets of "inputs" deriving from identifying inputs leading to the same probability distribution over outputs) of the process).

Thermodynamic macrostate processes are emergent because they more efficiently predict the future than their underlying microstate processes.

The thesis also gives a non-trivial necessary condition for describing a process as "self-organizing", which is that its statistical complexity increases over time - the causal architecture of the process does not change, but the amount of information needed to place the process in a state within the architecture increases. For example, a system that will go from uniform behavior to periodic behavior over time is self-organizing.

Anyway, I took most of that straight out of Chapter 11 of Cosma Shalizi's thesis, and that's the concluding summary chapter, so if you're suspicious something I just said isn't very rigorous, check out the paper. You may or may not be disappointed, as from Shalizi's introduction:

A word about the math. I aim at a moderate degree of rigor throughout | but as two wise men have remarked, "One man's rigor is another man's mortis" (Bohren and Albrecht 1998). My ideal has been to keep to about the level of rigor of Shannon (1948)3. In some places (like Appendix B.3), I'm closer to the onset of mortis. No result on which anything else depends should have an invalid proof. There are places, naturally, where I am not even trying to be rigorous, but merely plausible, or even "physical," but it should be clear from context where those are.

Thanks for the thesis link. That looks to be an interesting read and should be quite informative about pattern recognition, entropy, and complexity.

Some of you may remember a previous discussion of Shalizi here in which he was criticized for his position on thermodynamics. (Scroll to the bottom.)

By the way, I thought Pearl gave algorithms for identifying causal structure in Causality or Probabilistic Reasoning. Are they not effective enough?

Probabilistic Reasoning Intelligent Systems

Chpt 8, Learning structure from data

8.2 presents algorithms for learning tree-shaped networks only.

8.3 discusses how to learn tree-shaped networks using hidden independent variables

[-][anonymous]12y 0

Do you mean this? Pearl and Velma, "A theory of inferred causation" As far as I can see the definitions are very similar, but Pearl's "algorithm" requires complete knowledge of the complete distribution, while Shalizi's reconstruction approach works from samples.

In their original formulation, Bayes nets were a way to capture conditional independence properties of probabilistic models. That is, given any probabilistic model for P(Y|X1,X2,X2,...), there is a Bayes net that captures some of the conditional independence relations in your model. Bayes nets certainly cannot capture all possible conditional independence relations: undirected graphical models, for example, capture a different class of independence relations, while factor graphs capture a superset of the independence relations expressible by Bayes nets.

In this light, I'm not sure that your challenge makes sense. Bayes nets are a way of expressing a properties of probabilistic models, rather than a model unto themselves. "Bayes nets" alone is as meaningless a choice of model as "models expressible in Portuguese".

Perhaps a particular Bayes net together with a certain choice of conditional probability function for each arc and a certain choice of inference algorithm would constitute a model.

In this light, I'm not sure that your challenge makes sense. Bayes nets are a way of expressing a properties of probabilistic models, rather than a model unto themselves.

Valid point, but I think in practice it's possible to identify a model as one of some specific family such as "Bayes Net", "Neural Network", "MaxEnt", etc.

Perhaps a particular Bayes net together with a certain choice of conditional probability function for each arc and a certain choice of inference algorithm would constitute a model.

Right, the point is that the challenger can make any reasonable choice for these unspecified components. Ideally someone would say: here is the data set; I'm modeling it using the method described in such-and-such paper; here are some minor revisions to the method of the paper to make it useful in this case; here are the results.

On the basis of these remarks I submit the following qualified statement: while the belief network paradigm is mathematically elegant and intuitively appealing, it is NOT very useful for describing real data.

The challenge is just as wrong; to quote from the wiki:

Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems; this is also referred to as the "no free lunch" theorem. Determining a suitable classifier for a given problem is still more an art than science.

Russell and Norvig, 1st ed. has a good example comparing the performance of a Bayes net with a decision tree on data that was generated by a decision tree-like process, of course the net did not perform as well as a decision tree on that data, surprise, surprise.

To fix your markup, see the Help tab just below the comment box on the right side.

I don't understand what you mean by the claim "the challenge is just as wrong". Of course I'm aware of the NFL theorem. I'm also aware that data from the real world has structure that can be exploited to achieve better results than the NFL theorem seems to permit; if this weren't true, the field of machine learning would be pointless. My claim is that the belief networks framework doesn't really match that real world structure in most cases (but I'm ready to be proved wrong and in fact that's my motivation for making the challenge).