Two Challenges

Some comments:

We will then compare the likelihoods achieved by the two methods; higher likelihood wins. If there is ambiguity concerning the validity of the result, then we will compress the data set using compression algorithms based on the models and compare compression rates. Constructing a compressor from a statistical model is essentially a technical exercise; I can provide a Java implementation of arithmetic encoding. The compression rates must take into account the size of the compressor itself.

"Likelihood" is ambiguous: a Bayes net can be defined with a single hidden node for "object ID", and all the observed nodes dependent only on the object ID, with conditional probability tables lifted from large look-up tables. This Bayes net would have a low probability under a prior belief distribution over Bayes nets but would assign a high likelihood to the data.

To define compression unambiguously, you should agree on a programming language or executable format, and on runtime and memory bounds on a reference computer, as in the Hutter Prize.

An alternative to a test of compression would be a test of sequential prediction of e.g. individual real-valued or integer measurements, conditional on values of previous measurements according to some ordering. For each measurement, the predictor program would generate a predictive probability distribution, and a judge program would normalize the distribution or test that it summed to 1. The predictor's score would be the log of the product of the sequential likelihoods. Compared to arithmetic compressor/decompressor pairs, sequential predictors might be less technically demanding to program (no problem of synchronizing the compressor and decompressor), but might be more technically demanding to judge (a predictive distribution from Monte Carlo sampling may depend on allowed computation time). The size of the predictor might still need to be included.
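A minimal sketch of such a scoring loop, assuming a discrete alphabet and a stand-in predictor (Laplace-smoothed frequency counts). The function name `sequential_log_score` and the choice of predictor are hypothetical, picked only for illustration; a real contestant would plug in their own model.

```python
import math
from collections import Counter

def sequential_log_score(data, alphabet):
    """Sum of log2 predictive probabilities, one prediction per measurement."""
    counts = Counter()
    total = 0.0
    for x in data:
        # Predictor: distribution over the next value given everything seen so far.
        denom = sum(counts.values()) + len(alphabet)
        dist = {a: (counts[a] + 1) / denom for a in alphabet}
        # Judge: check that the predictive distribution sums to 1.
        assert abs(sum(dist.values()) - 1.0) < 1e-9
        total += math.log2(dist[x])
        counts[x] += 1  # only now does the predictor get to see x
    return total

score = sequential_log_score([0, 0, 1, 0, 0, 0, 1, 0], alphabet=[0, 1])
print(score)  # closer to 0 is better; -score is the ideal code length in bits
```

The negated score is the same quantity an arithmetic coder driven by this predictor would approach, which is why the two tests should mostly agree.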

The Bayes Net contestant should be permitted to use prior belief distributions over network structures (as here). (This can be encoded as a causal belief about a hidden cause of patterns of causation.)

The Bayes Net contestant may wish to use network search algorithms or priors that hypothesize causal models with unobserved latent nodes.

The Bayes Net contestant should remember that the Skeptic contestant can use compression schemes based on reconstructability analysis, which is relatively powerful, and Markov graphical models, which include all simple Bayes nets for purposes of this problem.

[. . .] while the belief network paradigm is mathematically elegant and intuitively appealing, it is NOT very useful for describing real data.

The challenge is to prove the above claim wrong. [. . .]

The challenge hinges on the meaning of the word "real data". [. . .] Other than that, there are no limitations - it can be image data, text, speech, machine learning sets, NetFlix, social science databases, etc.

Some data are about objects whose existence or relations are causally dependent on objects' properties. Simple Bayesian networks cannot usually represent knowledge of such causal dependencies, but there are more general families of Bayesian models which can. Would it be in the spirit of the challenge for the Bayes Net contestant to use these more general models? Would it be out of the spirit of the challenge for the data to be about such a collection of objects?

[-]Daniel_Burfoot16y00

"Likelihood" is ambiguous: [...] This Bayes net would have a low probability under a prior belief distribution over Bayes nets but would assign a high likelihood to the data.

Right - this would be what I'd call "cheating" or overfitting the data. We'd have to use the compression rate in this case.
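A rough illustration of why counting the size of the model catches this kind of cheating. Everything here is an assumption made for the example: the data set is synthetic, the look-up-table model is assumed to cost one bit per stored observation, and the Bernoulli parameter is assumed to cost 8 bytes.

```python
import math

def total_bits(per_item_probs, model_size_bytes):
    """Ideal code length of the data under the model, plus the size of the model itself."""
    data_bits = sum(-math.log2(p) for p in per_item_probs)
    return data_bits + 8 * model_size_bytes

# Hypothetical data set: 10,000 binary observations, 90% zeros.
data = [0] * 9000 + [1] * 1000

# "Look-up table" model: predicts every item with probability 1, but has to
# store the data verbatim (assumed here: 1 bit per observation).
lookup_total = total_bits([1.0] * len(data), model_size_bytes=len(data) // 8)

# Simple model: one Bernoulli parameter, assumed to cost 8 bytes to store.
p_one = 0.1
bernoulli_total = total_bits(
    [p_one if x else 1 - p_one for x in data], model_size_bytes=8
)

print(lookup_total, bernoulli_total)
```

The look-up table wins on likelihood alone (zero bits for the data) but loses once its own size is charged against it.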

To define compression unambiguously, you should agree on a programming language or executable format, and on runtime and memory bounds on a reference computer

Sure. I'll work out the technical details if anyone wants to enter the contest. I would prefer to use the most recent stable JVM. It seems very unlikely to me that the outcome of the contest will depend on precise selection of time or memory bounds - let's say the time bound is O(24 hours) and the memory bound is O(2 GB).

An alternative to a test of compression

It's actually not very difficult to implement a compression program using arithmetic coding once you have the statistical model. Other prediction evaluation schemes may work, but compression has methodological crispness: look at the compressed file size, check that the decompressed data matches the original exactly.

Would it be in the spirit of the challenge for the Bayes Net contestant to use these more general models? Would it be out of the spirit of the challenge for the data to be about such a collection of objects?

Basically, when I say "belief networks", what I mean is the use of graphs to define probability distributions and conditional independence relationships.

The spirit of the contest is to use a truly "natural" data set. I admit that this is a bit vague. Really my only requirement is to use a non-synthetic data set. I think I know where you're going with the "causally dependent" line of thinking, but it doesn't bother me too much. I get the feeling that I am walking into a trap, but really I've been planning to make a donation to SIAI anyway, so I don't mind losing.

[-]Alex Flint16y60

Unfortunately, there is a big catch - this inference algorithm works only for belief networks that can be expressed as acyclic graphs. If the graph is not acyclic, the computational cost of the inference algorithm is much larger (IIRC, it is exponential in the size of the largest clique in the graph).

Actually, all valid belief networks are directed acyclic graphs. What you're thinking of is that inference in tree-structured Bayes nets can be solved in polynomial time (e.g. using the Belief Propagation algorithm), whereas for non-tree-structured graphs there is no such polynomial time algorithm currently known (iirc it has been proved that exact inference in Bayes nets is NP-hard in general). Loopy Belief Propagation and variational methods can be used as approximate inference algorithms in non-tree-structured Bayes nets.
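To make the polynomial-time claim concrete, here is a small sketch for the simplest tree, a chain X1 -> X2 -> ... -> Xn. The function names and the example numbers are made up for illustration. Forward message passing costs on the order of n·k² operations for k states per node, while brute-force enumeration of the joint costs on the order of kⁿ.

```python
from itertools import product

def chain_marginal_bp(prior, transitions):
    """Marginal of the last node, computed by passing messages down the chain."""
    msg = list(prior)
    for T in transitions:  # T[i][j] = P(next = j | previous = i)
        msg = [sum(msg[i] * T[i][j] for i in range(len(msg)))
               for j in range(len(T[0]))]
    return msg

def chain_marginal_brute(prior, transitions):
    """Same marginal by summing the full joint (assumes k states at every node)."""
    k, n = len(prior), len(transitions) + 1
    out = [0.0] * k
    for states in product(range(k), repeat=n):
        p = prior[states[0]]
        for t, T in enumerate(transitions):
            p *= T[states[t]][states[t + 1]]
        out[states[-1]] += p
    return out

prior = [0.6, 0.4]
T = [[0.7, 0.3], [0.2, 0.8]]
bp = chain_marginal_bp(prior, [T] * 4)
brute = chain_marginal_brute(prior, [T] * 4)
print(bp, brute)  # the two should agree up to floating-point error
```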

[-]Eliezer Yudkowsky16y30

In other words, "acyclic graphs" = graphs that are acyclic after the directionality of the arrows is forgotten, which is probably what he meant.

[-]cousin_it16y60

On the basis of these remarks I submit the following qualified statement: while the belief network paradigm is mathematically elegant and intuitively appealing, it is NOT very useful for describing real data.

It's sometimes possible to automatically reconstruct causal models from data. For example see Cosma Shalizi's thesis and CSSR software, they completely changed my view of the subject.

[-]SarahNibs16y40

Thanks cousin_it! Read that thesis, everyone else! I just did, and it's amazing. Among other things, it contains a nice reduction of "emergence", one that isn't magical. Basically a process is emergent just when the fraction of historical memory stored in it which does "useful work" in the form of telling us about the future is greater than this fraction is in the process it derives from (pg 115-116).

More precisely, the fraction is the ratio of the process' excess entropy (the mutual information between its semi-infinite past and its semi-infinite future) to its statistical complexity (the entropy of its causal states). Informally, the causal states are the classes you get by grouping together all "inputs" that lead to the same probability distribution over outputs.
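Written out, with the caveat that the symbols below are my shorthand and may not match the thesis's notation exactly:

```latex
% E      : excess entropy, mutual information between past and future
% C_\mu  : statistical complexity, entropy of the causal states \mathcal{S}
% e      : the "fraction of memory doing useful work" (predictive efficiency)
\begin{align*}
  E     &= I\left[\overleftarrow{X};\, \overrightarrow{X}\right], \\
  C_\mu &= H\left[\mathcal{S}\right], \\
  e     &= \frac{E}{C_\mu}.
\end{align*}
% A derived process is emergent, in this sense, when
%   e_{\text{derived}} > e_{\text{underlying}}.
```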

Thermodynamic macrostate processes are emergent because they more efficiently predict the future than their underlying microstate processes.

The thesis also gives a non-trivial necessary condition for describing a process as "self-organizing", which is that its statistical complexity increases over time - the causal architecture of the process does not change, but the amount of information needed to place the process in a state within the architecture increases. For example, a system that will go from uniform behavior to periodic behavior over time is self-organizing.

Anyway, I took most of that straight out of Chapter 11 of Cosma Shalizi's thesis, and that's the concluding summary chapter, so if you're suspicious something I just said isn't very rigorous, check out the paper. You may or may not be disappointed, as from Shalizi's introduction:

A word about the math. I aim at a moderate degree of rigor throughout, but as two wise men have remarked, "One man's rigor is another man's mortis" (Bohren and Albrecht 1998). My ideal has been to keep to about the level of rigor of Shannon (1948). In some places (like Appendix B.3), I'm closer to the onset of mortis. No result on which anything else depends should have an invalid proof. There are places, naturally, where I am not even trying to be rigorous, but merely plausible, or even "physical," but it should be clear from context where those are.

[-]SilasBarta16y00

Thanks for the thesis link. That looks to be an interesting read and should be quite informative about pattern recognition, entropy, and complexity.

Some of you may remember a previous discussion of Shalizi here in which he was criticized for his position on thermodynamics. (Scroll to the bottom.)

By the way, I thought Pearl gave algorithms for identifying causal structure in Causality or Probabilistic Reasoning. Are they not effective enough?

[-]PhilGoetz16y10

Probabilistic Reasoning in Intelligent Systems

Chpt 8, Learning structure from data

8.2 presents algorithms for learning tree-shaped networks only.

8.3 discusses how to learn tree-shaped networks using hidden independent variables.

[-][anonymous]16y00

Do you mean this? Pearl and Verma, "A Theory of Inferred Causation". As far as I can see the definitions are very similar, but Pearl's "algorithm" requires complete knowledge of the full distribution, while Shalizi's reconstruction approach works from samples.

[-]Alex Flint16y20

In their original formulation, Bayes nets were a way to capture conditional independence properties of probabilistic models. That is, given any probabilistic model for P(Y|X1,X2,X3,...), there is a Bayes net that captures some of the conditional independence relations in your model. Bayes nets certainly cannot capture all possible conditional independence relations: undirected graphical models, for example, capture a different class of independence relations, while factor graphs capture a superset of the independence relations expressible by Bayes nets.
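A tiny numerical illustration of what "capturing a conditional independence relation" means, using a made-up three-node chain A -> B -> C. The probability tables are arbitrary numbers chosen for the example. The joint is built from the factorization P(a) P(b|a) P(c|b), and the check confirms that C is independent of A once B is known.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain A -> B -> C.
p_a = {0: 0.3, 1: 0.7}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}

# Joint distribution implied by the graph: P(a, b, c) = P(a) P(b|a) P(c|b).
joint = {
    (a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
    for a, b, c in product((0, 1), repeat=3)
}

def p_c_given_ab(a, b, c):
    """P(C = c | A = a, B = b), computed from the joint alone."""
    return joint[(a, b, c)] / sum(joint[(a, b, cc)] for cc in (0, 1))

# If C is independent of A given B, changing a must not change P(c | a, b).
max_gap = max(
    abs(p_c_given_ab(0, b, c) - p_c_given_ab(1, b, c))
    for b in (0, 1) for c in (0, 1)
)
print(max_gap)  # should be zero up to floating-point error
```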

In this light, I'm not sure that your challenge makes sense. Bayes nets are a way of expressing properties of probabilistic models, rather than a model unto themselves. "Bayes nets" alone is as meaningless a choice of model as "models expressible in Portuguese".

Perhaps a particular Bayes net together with a certain choice of conditional probability function for each arc and a certain choice of inference algorithm would constitute a model.

[-]Daniel_Burfoot16y00

In this light, I'm not sure that your challenge makes sense. Bayes nets are a way of expressing properties of probabilistic models, rather than a model unto themselves.

Valid point, but I think in practice it's possible to identify a model as one of some specific family such as "Bayes Net", "Neural Network", "MaxEnt", etc.

Perhaps a particular Bayes net together with a certain choice of conditional probability function for each arc and a certain choice of inference algorithm would constitute a model.

Right, the point is that the challenger can make any reasonable choice for these unspecified components. Ideally someone would say: here is the data set; I'm modeling it using the method described in such-and-such paper; here are some minor revisions to the method of the paper to make it useful in this case; here are the results.

[-]jstults16y00

On the basis of these remarks I submit the following qualified statement: while the belief network paradigm is mathematically elegant and intuitively appealing, it is NOT very useful for describing real data.

The challenge is just as wrong; to quote from the wiki:

Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems; this is also referred to as the "no free lunch" theorem. Determining a suitable classifier for a given problem is still more an art than science.

Russell and Norvig, 1st ed., has a good example comparing the performance of a Bayes net with a decision tree on data that was generated by a decision-tree-like process. Of course the net did not perform as well as the decision tree on that data; surprise, surprise.

[-]Cyan16y20

Welcome to LessWrong!

To fix your markup, see the Help tab just below the comment box on the right side.

[-]Daniel_Burfoot16y10

I don't understand what you mean by the claim "the challenge is just as wrong". Of course I'm aware of the NFL theorem. I'm also aware that data from the real world has structure that can be exploited to achieve better results than the NFL theorem seems to permit; if this weren't true, the field of machine learning would be pointless. My claim is that the belief networks framework doesn't really match that real world structure in most cases (but I'm ready to be proved wrong and in fact that's my motivation for making the challenge).
