passinglunatic

Finite Factored Sets

I don't understand the motivation behind the fundamental theorem, and I'm wondering if you could say a bit more about it. In particular, it suggests that if I want to propose a family of probability distributions that "represent" observations somehow ("represents" maybe means in the sense of Bayesian predictions or in the sense of frequentist limits), I want to also consider this family to arise from some underlying mutually independent family along with some function. I'm not sure why I would want to propose an underlying family and a function at all, and even if I do, I'm not sure why I would want to suppose it is mutually independent.

One thought I had is that maybe this underlying family of distributions on S is supposed to represent "interventions". The reasoning would be something like: there is some set of things that fully control my observations that *I* can control independently and which also vary independently on their own. I don't find this convincing, though - I don't see why independent controllability should imply independent variation.

Another thought I had was that maybe it arises from some kind of maximum entropy argument, but it's not clear why I would want to maximise the entropy of a distribution on some S for every possible configuration of marginals.

Also, do you know how your model relates to structural equation models with hidden variables? Your factored set S looks a lot like a set of independent "noises", and the function f:S->Y looks a lot like a set of structural equations, and I think it's straightforward to introduce hidden variables as needed to account for any lossiness. In particular, given a model and a compatible orthogonality database, I can construct an SEM by taking all the variables that appear in the database and defining the structural equation for to be . However, I think the set of all SEMs that are compatible with a given orthogonality database is larger than the set of all FFS models that are similarly compatible. This is because SEMs (in the Pearlian sense) can be distinct even if they have "apparently identical" structural equations. For example, and are distinct because interventions on X will have different results, while my understanding is that they will correspond to the same FFS model.
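To make the point about interventions concrete, here is a minimal sketch of two SEMs that agree on every observation but disagree under do(X=1). The pair of models (`model_a`, `model_b`), the single fair-coin noise, and the `do_x` parameter are all assumptions chosen for illustration; they are not necessarily the pair meant above, and the sketch says nothing about which FFS model either one corresponds to.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical setup: a single fair-coin noise variable N.
NOISE = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def model_a(n, do_x=None):
    """Illustrative SEM A: X := N, Y := X."""
    x = n if do_x is None else do_x
    y = x
    return x, y

def model_b(n, do_x=None):
    """Illustrative SEM B: Y := N, X := Y."""
    y = n
    x = y if do_x is None else do_x  # intervening on X cuts the Y -> X equation
    return x, y

def distribution(model, do_x=None):
    """Exact pmf over (x, y), obtained by enumerating the noise values."""
    pmf = defaultdict(Fraction)
    for n, p in NOISE.items():
        pmf[model(n, do_x)] += p
    return dict(pmf)

obs_a, obs_b = distribution(model_a), distribution(model_b)
do_a, do_b = distribution(model_a, do_x=1), distribution(model_b, do_x=1)

print(obs_a == obs_b)  # the observational distributions coincide
print(do_a == do_b)    # the interventional ones do not
```

In model A, setting X also sets Y; in model B, Y keeps following the noise, so the two only come apart once you intervene.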

Your result 2e looks interestingly similar to the DAG result that says and implies something like (where is d-separation). In fact, I think it extends existing graph learning algorithms: in addition to checking independences among the variables as given, you can look for independences between any functions of the given variables. This seems like it would give you many more arrows than usual, though I imagine it would also increase your risk of spurious independences. In fact, I think this connects to approaches to causal identification like regression with subsequent independence test: if is independent of , we prefer , and if is independent of , prefer .
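As a rough illustration of the regression-then-independence-test idea, here is a small exact, discrete analogue rather than the usual statistical procedure. Everything specific in it is an assumption made for the sketch: the variable names X and Y, the mechanism Y = X² + N with uniform X on {0, 1, 2} and uniform N on {0, 1}, the use of the conditional mean as the "regression", and an exact factorisation check standing in for a statistical independence test.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def joint_from_mechanism(cause_vals, noise_vals, mechanism):
    """Exact joint pmf of (cause, effect) for uniform, independent cause and noise."""
    weight = Fraction(1, len(cause_vals) * len(noise_vals))
    pmf = defaultdict(Fraction)
    for c, n in product(cause_vals, noise_vals):
        pmf[(c, mechanism(c, n))] += weight
    return dict(pmf)

def residual_is_independent(pmf):
    """Regress the 2nd coordinate on the 1st (conditional mean), then check
    exactly whether the residual is independent of the regressor."""
    p_reg = defaultdict(Fraction)     # marginal of the regressor
    mean_num = defaultdict(Fraction)  # sum of p * target, per regressor value
    for (a, b), p in pmf.items():
        p_reg[a] += p
        mean_num[a] += p * b
    cond_mean = {a: mean_num[a] / p_reg[a] for a in p_reg}

    p_joint = defaultdict(Fraction)   # joint pmf of (regressor, residual)
    p_res = defaultdict(Fraction)     # marginal of the residual
    for (a, b), p in pmf.items():
        r = b - cond_mean[a]
        p_joint[(a, r)] += p
        p_res[r] += p
    # Independence holds iff the joint pmf factorises everywhere.
    return all(p_joint.get((a, r), Fraction(0)) == p_reg[a] * p_res[r]
               for a in p_reg for r in p_res)

# Assumed ground truth for the sketch: Y = X^2 + N.
pmf_xy = joint_from_mechanism([0, 1, 2], [0, 1], lambda x, n: x * x + n)
pmf_yx = {(y, x): p for (x, y), p in pmf_xy.items()}

forward_ok = residual_is_independent(pmf_xy)   # regress Y on X
backward_ok = residual_is_independent(pmf_yx)  # regress X on Y
print(forward_ok, backward_ok)
```

Here the residual is independent of the regressor only in the X-to-Y direction, which is the asymmetry this family of methods relies on. With real data the exact check would be replaced by a statistical test, which is where the risk of spurious independences comes in.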

Sounds awful to me. I would absolutely hate to live somewhere where I was regularly told what to do and/or expected to fit in with rituals. I tolerate this kind of thing at work because I have to.

What will you say when people come to you saying "I'm not sure this is really worth it for me"? I personally don't think self-improvement is a very stable overall goal. In my cursory acquaintance, most cults/high-demand living situations tend to believe in "something greater" - often something quite ridiculous, but nonetheless something bigger than the individual. Perhaps it is important to have something which seems to trump feelings of personal discomfort.

I think the motivation for the representability of some sets of conditional independences with a DAG is pretty clear, because people already use probability distributions all the time, they sometimes have conditional independences and visuals are nice.

On the other hand, the fundamental theorem relates orthogonality to independences in a family of distributions generated in a particular way. Neither of these things is a natural property of probability distributions in the way that conditional independence is. If I am using probability distributions, it seems to me I'd rather avoid introducing them if I can. Even if the reasons are mysterious, it might be useful to work with models of this type - I was just wondering if there were reasons for doing so that are apparent before you derive any useful results.

Alternatively, is it plausible that you could derive the same results just using probability + whatever else you need anyway? For example, you could perhaps define X to be prior to Y if, relative to some ordering of functions by "naturalness", there is a more natural f(X,Y) such that X⊥f(X,Y) and X⊥̸f(X,Y)|Y than any g(X,Y) such that Y⊥g(X,Y) etc. I have no idea if that actually works!

~~However, I'm pretty sure you'll need something like a naturalness ordering in order to separate "true orthogonality" from "merely apparent orthogonality", which is why I think it's fair to posit it as an element of "whatever else you need anyway".~~ Maybe not.