Here is how I'm currently thinking about this framework and especially inference, in case it's helpful for other folks who have similar priors to mine (or in case something is still wrong).
A description of traditional causal models:
In the factored sets framework:
I think the definition of history is the most natural way to recover something like causal structure in these models.
I'm not sure how much it's about causality. Imagine there's a bunch of envelopes with numbers inside, and one of the following happens:
Alice peeks at three envelopes. Bob peeks at ten, which include Alice's three.
Alice peeks at three envelopes and tells the results to Bob, who then peeks at seven more.
Bob peeks at ten envelopes, then tells Alice the contents of three of them.
Under the FFS definition, Alice's knowledge in each case is "strictly before" Bob's. So it seems less of a causal relationship and more like "depends on fewer basic facts".
Agree it's not totally right to call this a causal relationship.
That said:
So I do feel like there's a case t...
I think I (at least locally) endorse this view, and I think it is also a good pointer to what seems to me to be the largest crux between the my theory of time and Pearl's theory of time.
Makes sense. I think a bit of my naming and presentation was biased by being so surprised by the not on OEIS fact.
I think I disagree about the bipartite graph thing. I think it only feels more natural when comparing to Pearl. The talk frames everything in comparison to Pearl, but I think if you are not looking at Pearl, I think graphs don’t feel like the right representation here. Comparing to Pearl is obviously super important, and maybe the first introduction should just be about the path from Pearl to FFS, but once we are working within the FFS ontology, graphs feel not useful. One crux might be about how I am excited for directions that are not temporal inference from statistical data.
My guess is that if I were putting a lot of work into a very long introduction for e.g. the structure learning community, I might start the way you are emphasizing, but then eventually convert to throwing all the graphs away.
(The paper draft I have basically only ever mentions Pearl/graphs for motivation at the beginning and in the applications section.)
Planned summary for the Alignment Newsletter. I'd be particularly keen for someone to check that I didn't screw up in the section "Orthogonality in Finite Factored Sets":
...This newsletter is a combined summary + opinion for the [Finite Factored Sets sequence](https://www.alignmentforum.org/s/kxs3eeEti9ouwWFzr) by Scott Garrabrant. I (Rohin) have taken a lot more liberty than I usually do with the interpretation of the results; Scott may or may not agree with these interpretations.
## Motivation
One view on the importance of deep learning is that it allows you to automatically _learn_ the features that are relevant for some task of interest. Instead of having to handcraft features using domain knowledge, we simply point a neural net at an appropriate dataset, and it figures out the right features. Arguably this is the _majority_ of what makes up intelligent cognition; in humans it seems very analogous to [System 1](https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow), which we use for most decisions and actions. We are also able to infer causal relations between the resulting features.
Unfortunately, [existing models](https://en.wikipedia.org/wiki/The_Book_of_Why) of causal inference d
Are there any interesting pure combinatorics problems about finite factored sets that you're interested in?
I've thought a good amount about Finite Factored Sets in the past year or two, but I do sure keep going back to thinking about the world primarily in the form of Pearlian causal influence diagrams, and I am not really sure why.
I do think this one line by Scott at the top gave me at least one pointer towards what was happening:
but I'm trained as a combinatorialist, so I'm giving a combinatorics talk upfront.
In the space of mathematical affinities, combinatorics is among the branches of math I feel most averse to, and I think that explains a good...
Let's try category theory.
A partition is a family of sets called parts. A partition morphism X->X' has a function from each part of X to some part of X'. It witnesses that X is finer than X'¹.
The underlying set of a partition is its disjoint union. Call the discrete partition of S DS. The functions S->⊔X correspond to the partition morphisms DS->X. Call the trivial partition of S TS. The functions ⊔X->S correspond to the partition morphisms X->TS. In terser notation, we have D ⊣ ⊔ ⊣ T.
A factorization is a family of partitions called fac...
I was thinking of some terminology that might make it easier to thinking about factoring and histories and whatnot.
A partition can be thought of as a (multiple-choice) question. Like for a set of words, you could have the partition corresponding to the question "Which letter does the word start with?" and then the partition groups together elements with the same answer.
Then a factoring is set of questions, where the set of answers will uniquely identify an element. The word that comes to mind for me is "signature", where an element's signature is the set o...
I have read most of the article, but not yet carefully combed the math or watched the video.
The OEIS gap seems suggestive of "being on the track of something novel (and thus maybe novel and good)".
Reading this, some things clicked for me as "possibly related and worth looking at" that I hadn't really noticed before.
Specifically, I was reminded of how "TF * IDF" is this old pragmatic "it works in practice" mathematical kludge for information retrieval that just... gets the job done better "with" than "without" nearly all of the time? People have...
Let , where and
[...] The second rule says that is orthogonal to itself
Should that be "is not orthogonal to itself"? I thought the meant non-orthogonal, so would think means that is not orthogonal to itself.
(The transcript accurately reflects what was said in the talk, but I'm asking whether Scott misspoke.)
And if X is independent of X XOR Y, we’re actually going to be able to conclude that X is before Y!
It's interesting to translate that to the language of probabilities. For example, your condition holds for any X,Y (possibly dependent) such that P(X)=P(Y)=1/2, but it doesn't make sense to say that X is before Y in every such pair. For a real world example, take X = "person has above median height" and Y = "person has above median age".
Well, imagine we have three boolean random variables. In "general position" there are no independence relations between them, so we can't say much. Constrain them so two of the variables are independent, that's a bit less "general", and we still can't say much. Constrain some more so the xor of all three variables is always 1, that's even less "general", now we can use your method to figure out that the third variable is downstream of the first two. Constrain some more so that some of the probabilities are 1/2, and the method stops working. What I'd like to understand is the intuition, which real world cases have the particular "general position" where the method works.
Hmm, first I want to point out that the talk here sort of has natural boundaries around inference, but I also want to work in a larger frame that uses FFS for stuff other than inference.
If I focus on the inference question, one of the natural questions that I answer is where I talk about grue/bleen in the talk.
I think for inference, it makes the most sense to think about FFS relative to Pearl. We have this problem with looking at smoking/tar/cancer, which is what if we carved into variables the wrong way. What if instead of tar/cancer, we had a variable for "How much bad stuff is in your body?" and "What is the ratio of tar to cancer?" The point is that our choice of variables both matters for the Pearlian framework, and is not empirically observable. I am trying to do all the good stuff in Pearl without the dependence on the variables
Indeed, I think the largest crux between FFS and Pearl is something about variable realism. To FFS, there is no realism to a variable beyond its information content, so it doesn't make sense to have two variables X, X' with the same information, but different temporal properties. Pearl's ontology, on the other hand, has these graphs with variabl...
Cool topic!
Here are some critiques on part 1 of your presentation: Short Combinatorial Talk
1 Colour coding slides by content is a nifty idea. I hadn't seen this before. Unfortunately, even with as little as five headings it is difficult to recall the correspondence between colours and content. Why not try something else. Maybe a designation in the upper right hand corner of each slide?
2. That looks like an interesting diagram on slide 4. Why didn't you explain it?
3. You tend to introduce succinct definitions first, motivating examples later. This worked al...
Curated. This is a fascinating framework that (to the best of my understanding) makes substantive improvements on the Pearlian paradigm. It's also really exciting that you found a new simple sequence.
Re: the writeup, it's explained very clearly, the Q&A interspersed is a very nice touch. I like that the talk factorizes.
I really appreciate the research exploration you do around ideas of agency and am very happy to celebrate the writeups like this when you produce them.
This is really cool!
The example of inferring from the independence of and reminds me of some techniques discussed in Elements of Causal Inference. They discuss a few different techniques for 2-variable causal inference.
One of them, which seems to be essentially analogous to this example, is that if are real-valued variables, then if the regression error (i.e for some constant ) is independent of , it's highly likely that is downstream of . It sounds like factored sets (or some extension to capture continuous-valued variables) migh...
Possible examples: After staring at the definition of a set factorization for a minute, it clicked for me when I thought about Quarto.
Quarto is a simple board game played with 16 pieces (and a 4x4 grid) where each piece is (short or tall) and (light or dark) and (round or square) and (solid or hollow). There's exactly one piece with each combination of attributes; for example, there's exactly one tall dark round hollow piece.
Thus, the full set of 16 pieces can be factored into {{short, tall}, {light, dark}, {round, square}, {solid, hollow}}. &n...
You said that you thought that this could be done in a categorical way. I attempted something which appears to describe the same thing when applied to the category FinSet , but I'm not sure it's the sort of thing you meant by when you suggested that the combinatorial part could potentially be done in a categorical way instead, and I'm not sure that it is fully categorical.
Let S be an object.
For i from 1 to k, let be an object, (which is not anything isomorphic to the product of itself with itself, or at least is not the terminal object) .
Let&n...
I'm confused by the definition of conditional history, because it doesn't seem to be a generalisation of history. I would expect , but both of the conditions in the definition of are vacuously true if . This is independent of what is, so . Am I missing something?
I don't understand the motivation behind the fundamental theorem, and I'm wondering if you could say a bit more about it. In particular, it suggests that if I want to propose a family of probability distributions that "represent" observations somehow ("represents" maybe means in the sense of Bayesian predictions or in the sense of frequentist limits), I want to also consider this family to arise from some underlying mutually independent family along with some function. I'm not sure why I would want to propose an underlying family and a function at all, and...
Hmm, I am not sure what to say about the fundamental theorem, because I am not really understanding the confusion. Is there something less motivating about the fundamental theorem, than the analogous theorem about d-seperation being equivalent to conditional independence in all distributions comparable with the DAG?
Maybe this helps? (probably not): I am mostly imagining interacting with only a single distributions in the class, and the claim about independence in all probability distributions comparable with the structure can be replaced with instead independence in a general position probability distribution comparable with the structure.
I am not thinking of it as related to a maximum entropy argument.
The point about SEMs having more structure that I am ignoring is correct. I think that the largest philosophical difference between my framework and Pearlian one is that I am denying realism of anything beyond the "apparently identical." Another way to think about it is that I am denying realism about there being anything about the variables beyond their information. All of my definitions are only based on the information content of the variables, and so, for example, if you have two...
On the last example with the XOR temporal inference - since the partitions/queries we’re asking about are also possible factors, doesn’t the temporal data in terms of history etc depend on which choice of factorisation we go with?
We have a choice of 2 out of 3 factors each of which corresponds to one of the partitions in question, so surely by factorising in different ways we can make any two of the variables have history of 1 and thus automatically orthogonal?
I was thinking about the difficulty of finite factored sets not understanding the uniform distribution over 4 elements, and it makes me feel like something fundamental needs to be recast. An analogy came to mind about eigenvectors vs. eigenspaces.
What we might like to be true about the unit eigenvectors of a matrix is that they are the unique unit vectors for which the linear transformation preserves direction. But if two eigenvectors have the same eigenvalue, the choice of eigenvectors is not unique--we could choose any pair on that plane. So really, it s...
I'm a bit confused about how X can be independent of both Y and of (X xor Y). What would a probability distribution where this holds look like?
Imagine that I had a bit, and it’s either a 0 or a 1, and it’s either blue or green. And these two facts are primitive and independently generated. And I also have this other concept that’s like, “Is it grue or bleen?”, which is the XOR of blue/green and 0⁄1.
There’s a sense in which we’re inferring X is before Y, and in that case, we can infer that blueness is before grueness. And that’s pointing at the fact that blueness is more primitive, and grueness is a derived property.
I found it helpful to work through this example. Let's say is "is the bit 0 ...
Some general comments:
You mention above that Pearl's ontology 'has blinded us to the obvious next question'. I am very sympathetic to research programmes that try to overcome such blindness, this is the kind or research I have been doing myself recently. The main type of blindness that I have been trying to combat is blindness to complex types of self-referencing and indirect representation that can be present inside online machine learning agents, specifically in my recent work I have added a less blind viewpoint by modifying and...
This is the edited transcript of a talk introducing finite factored sets. For most readers, it will probably be the best starting point for learning about factored sets.
Video:
(Lightly edited) slides: https://intelligence.org/files/Factored-Set-Slides.pdf
1. Short Combinatorics Talk
1m. Some Context
Scott: So I want to start with some context. For people who are not already familiar with my work:
For people who are already familiar with my work, I just want to say that according to my personal aesthetics, the subject of this talk is about as exciting as Logical Induction, which is to say I'm really excited about it. And I'm really excited about this audience; I'm excited to give this talk right now.
1t. Factoring the Talk
This talk can be split into 2 parts:
I suspect that if I were better, I would instead be giving a short pure-math category theory talk; but I'm trained as a combinatorialist, so I'm giving a combinatorics talk upfront.
This talk can also be split into 4 parts differentiated by color: Motivation, Table of Contents, Main Body, and Examples. Combining these gives us 8 parts (some of which are not contiguous):
1b. Set Partitions
All right. Here's some background math:
You can also think of partitions as being like variables on your set S. Viewed in that way, the values of a partition X correspond to which part an element is in.
Or you can think of X as a question that you could ask about a generic element of S. If I have an element of S and it's hidden from you and you want to ask a question about it, each possible question corresponds to a partition that splits up S according to the different possible answers.
We're also going to use the lattice structure of partitions:
Hopefully this is mostly background. Now I want to show something new.
1b. Set Factorizations
A factorization of a set S is a set B of nontrivial partitions of S, called factors, such that for each way of choosing one part from each factor in B, there exists a unique element of S in the intersection of those parts.
So this is maybe a little bit dense. My short tagline of this is: "A factorization of S is a way to view S as a product, in the exact same way that a partition was a way to view S as a disjoint union."
If you take one definition away from this first talk, it should be the definition of factorization. I'll try to explain it from a bunch of different angles to help communicate the concept.
If B={b0,…,bn} is a factorization of S, then there exists a bijection between S and b0×⋯×bn given by s↦([s]b0,…,[s]bn). This bijection comes from sending an element of S to the tuple consisting only of parts containing that element. And as a consequence of this bijection, |S|=∏b∈B|b|.
So we're really viewing S as a product of these individual factors, with no additional structure.
Although we won't prove this here, something else you can verify about factorizations is that all of the parts in a factor have to be of the same size.
We'll write Fact(S) for the set of all factorizations of S, and we'll say that a finite factored set is a pair (S,B), where S is a finite set and B∈Fact(S).
Note that the relationship between S and B is somewhat loopy. If I want to define a factored set, there are two strategies I could use. I could first introduce the S, and break it into factors. Alternatively, I could first introduce the B. Any time I have a finite collection of finite sets B, I can take their product and thereby produce an S, modulo the degenerate case where some of the sets are empty. So S can just be the product of a finite collection of arbitrary finite sets.
To my eye, this notion of factorization is extremely natural. It's basically the multiplicative analog of a set partition. And I really want to push that point, so here's another attempt to push that point:
subsets of S such that the obvious
function from the disjoint union of
the elements of X to S is a bijection.
partitions of S such that the obvious
function to the product of
the elements of B from S is a bijection.
I can take a slightly modified version of the partition definition from before and dualize a whole bunch of the words, and get out the set factorization definition.
Hopefully you're now kind of convinced that this is an extremely natural notion.
Andrew Critch: Scott, in one sense, you're treating "subset" as dual to partition, which I think is valid. And then in another sense, you're treating "factorization" as dual to partition. Those are both valid, but maybe it's worth talking about the two kinds of duality.
Scott: Yeah. I think what's going on there is that there are two ways to view a partition. You can view a partition as "that which is dual to a subset," and you can also view a partition as something that is built up out of subsets. These two different views do different things when you dualize.
Ramana Kumar: I was just going to check: You said you can start with an arbitrary B and then build the S from it. It can be literally any set, and then there's always an S...
Scott: If none of them are empty, yes, you could just take a collection of sets that are kind of arbitrary elements. And you can take their product, and you can identify with each of the elements of a set the subset of the product that projects on to that element.
Ramana Kumar: Ah. So the S in that case will just be tuples.
Scott: That's right.
Brendan Fong: Scott, given a set, I find it very easy to come up with partitions. But I find it less easy to come up with factorizations. Do you have any tricks for...?
Scott: For that, I should probably just go on to the examples.
Joseph Hirsh: Can I ask one more thing before you do that? You allow factors to have one element in them?
Scott: I said "nontrivial," which means it does not have one element.
Joseph Hirsh: "Nontrivial" means "not have one element, and not have no elements"?
Scott: No, the empty set has a partition (with no parts), and I will call that nontrivial. But the empty set thing is not that critical.
I'm now going to move on to some examples.
1e. Enumerating Factorizations
Exercise! What are the factorizations of the set {0,1,2,3}?
Spoiler space:
.
.
.
.
.
.
.
.
.
.
First, we're going to have a kind of trivial factorization:
{ {{0},{1},{2},{3}} } 0 1 2 3 ––––––––––––We only have one factor, and that factor is the discrete partition. You can do this for any set, as long as your set has at least two elements.
Recall that in the definition of factorization, we wanted that for each way of choosing one part from each factor, we had a unique element in the intersection of those parts. Since we only have one factor here, satisfying the definition just requires that for each way of choosing one part from the discrete partition, there exists a unique element that is in that part.
And then we want some less trivial factorizations. In order to have a factorization, we're going to need some partitions. And the product of the cardinalities of our partitions are going to have to equal the cardinality of our set S, which is 4.
The only way to express 4 as a nontrivial product is to express it as 2×2. Thus we're looking for factorizations that have 2 factors, where each factor has 2 parts.
We noted earlier that all of the parts in a factor have to be of the same size. So we're looking for 2 partitions that each break our 4-element set into 2 sets of size 2.
So if I'm going to have a factorization of {0,1,2,3} that isn't this trivial one, I'm going to have to pick 2 partitions of my 4-element set that each break the set into 2 parts of size 2. And there are 3 partitions of a 4-element sets that break it up into 2 parts of size 2. For each way of choosing a pair of these 3 partitions, I'm going to get a factorization.
{ {{0,1},{2,3}}, {{0,2},{1,3}}} 0123{ {{0,1},{2,3}}, {{0,3},{1,2}}} 0132{ {{0,2},{1,3}}, {{0,3},{1,2}}} 0231So there will be 4 factorizations of a 4-element set.
In general you can ask, "How many factorizations are there of a finite set of size n?". Here's a little chart showing the answer for n≤25:
You'll notice that if n is prime, there will be a single factorization, which hopefully makes sense. This is the factorization that only has one factor.
A very surprising fact to me is that this sequence did not show up on OEIS, which is this database that combinatorialists use to check whether or not their sequence has been studied before, and to see connections to other sequences.
To me, this just feels like the multiplicative version of the Bell numbers. The Bell numbers count how many partitions there are of a set of size n. It's sequence number 110 on OEIS out of over 300,000; and this sequence just doesn't show up at all, even when I tweak it and delete the degenerate cases and so on.
I am very confused by this fact. To me, factorizations seem like an extremely natural concept, and it seems to me like it hasn't really been studied before.
This is the end of my short combinatorics talk.
Ramana Kumar: If you're willing to do it, I'd appreciate just stepping through one of the examples of the factorizations and the definition, because this is pretty new to me.
Scott: Yeah. Let's go through the first nontrivial factorization of {0,1,2,3}:
{ {{0,1},{2,3}}, {{0,2},{1,3}}} 0123In the definition, I said a factorization should be a set of partitions such that for each way of choosing one part from each of the partitions, there will be a unique element in the intersection of those parts.
Here, I have a partition that's separating the small numbers from the large numbers: {{0,1},{2,3}}. And I also have a partition that's separating the even numbers from the odd numbers: {{0,2},{1,3}}.
And the point is that for each way of choosing either "small" or "large" and also choosing "even" or "odd", there will be a unique element of S that is the conjunction of these two choices.
In the other two nontrivial factorizations, I replace either "small and large" or "even and odd" with an "inner and outer" distinction.
David Spivak: For partitions and for many things, if I know the partitions of a set A and the partitions of a set B, then I know some partitions of A+B (the disjoint union) or I know some partitions of A×B. Do you know any facts like that for factorizations?
Scott: Yeah. If I have two factored sets, I can get a factored set over their product, which sort of disjoint-unions the two collections of factors. For the additive thing, you're not going to get anything like that because prime sets don't have any nontrivial factorizations.
All right. I think I'm going to move on to the main talk.
2. The Main Talk (It's About Time)
2m. The Pearlian Paradigm
We can't talk about time without talking about Pearlian causal inference. I want to start by saying that I think the Pearlian paradigm is great. This buys me some crackpot points, but I'll say it's the best thing to happen to our understanding of time since Einstein.
I'm not going to go into all the details of Pearl's paradigm here. My talk will not be technically dependent on it; it's here for motivation.
Given a collection of variables and a joint probability distribution over those variables, Pearl can infer causal/temporal relationships between the variables. (In this talk I'm going to use "causal" and "temporal" interchangeably, though there may be more interesting things to say here philosophically.)
Pearl can infer temporal data from statistical data, which is going against the adage that "correlation does not imply causation." It's like Pearl is taking the combinatorial structure of your correlation and using that to infer causation, which I think is just really great.
Ramana Kumar: I may be wrong, but I think this is false. Or I think that that's not all Pearl needs—just the joint distribution over the variables. Doesn't he also make use of intervention distributions?
Scott: In the theory that is described in chapter two of the book Causality, he's not really using other stuff. Pearl builds up this bigger theory elsewhere. But you have some strong ability, maybe assuming simplicity or whatever (but not assuming you have access to extra information), to take a collection of variables and a joint distribution over those variables, and infer causation from correlation.
Andrew Critch: Ramana, it depends a lot on the structure of the underlying causal graph. For some causal graphs, you can actually recover them uniquely with no interventions. And only assumptions with zero-measure exceptions are needed, which is really strong.
Ramana Kumar: Right, but then the information you're using is the graph.
Andrew Critch: No, you're not. Just the joint distribution.
Ramana Kumar: Oh, okay. Sorry, go ahead.
Andrew Critch: There exist causal graphs with the property that if nature is generated by that graph and you don't know it, and then you look at the joint distribution, you will infer with probability 1 that nature was generated by that graph, without having done any interventions.
Ramana Kumar: Got it. That makes sense. Thanks.
Scott: Cool.
I am going to (a little bit) go against this, though. I'm going to claim that Pearl is kind of cheating when making this inference. The thing I want to point out is that in the sentence "Given a collection of variables and a joint probability distribution over those variables, Pearl can infer causal/temporal relationships between the variables.", the words "Given a collection of variables" are actually hiding a lot of the work.
The emphasis is usually put on the joint probability distribution, but Pearl is not inferring temporal data from statistical data alone. He is inferring temporal data from statistical data and factorization data: how the world is broken up into these variables.
I claim that this issue is also entangled with a failure to adequately handle abstraction and determinism. To point at that a little bit, one could do something like say:
"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable. And then I can try to do Pearlian causal inference on this big set of all the variables that I get by forgetting the structure of variables that were given to me."
And the problem is that when you do that, you have a bunch of things that are deterministic functions of each other, and you can't actually infer stuff using the Pearlian paradigm.
So in my view, this cheating is very entangled with the fact that Pearl's paradigm isn't great for handling abstraction and determinism.
2t. We Can Do Better
The main thing we'll do in this talk is we're going to introduce an alternative to Pearl that does not rely on factorization data, and that therefore works better with abstraction and determinism.
Where Pearl was given a collection of variables, we are going to just consider all partitions of a given set. Where Pearl infers a directed acyclic graph, we're going to infer a finite factored set.
In the Pearlian world, we can look at the graph and read off properties of time and orthogonality/independence. A directed path between nodes corresponds to one node being before the other, and two nodes are independent if they have no common ancestor. Similarly, in our world, we will be able to read time and orthogonality off of a finite factored set.
(Orthogonality and independence are pretty similar. I'll use the word "orthogonality" when I'm talking about a combinatorial notion, and I'll use "independence" when I'm talking about a probabilistic notion.)
In the Pearlian world, d-separation, which you can read off of the graph, corresponds to conditional independence in all probability distributions that you can put on the graph. We're going to have a fundamental theorem that will say basically the same thing: conditional orthogonality corresponds to conditional independence in all probability distributions that we can put on our factored set.
In the Pearlian world, d-separation will satisfy the compositional graphoid axioms. In our world, we're just going to satisfy the compositional semigraphoid axioms. The fifth graphoid axiom is one that I claim you shouldn't have even wanted in the first place.
Pearl does causal inference. We're going to talk about how to do temporal inference using this new paradigm, and infer some very basic temporal facts that Pearl's approach can't. (Note that Pearl can also sometimes infer temporal relations that we can't—but only, from our point of view, because Pearl is making additional factorization assumptions.)
And then we'll talk about a bunch of applications.
Excluding the motivation, table of contents, and example sections, this table also serves as an outline of the two talks. We've already talked about set partitions and finite factored sets, so now we're going to talk about time and orthogonality.
2b. Time and Orthogonality
I think that if you capture one definition from this second part of the talk, it should be this one. Given a finite factored set as context, we're going to define the history of a partition.
Let F=(S,B) be a finite factored set. And let X,Y∈Part(S) be partitions of S.
The history of X, written hF(X), is the smallest set of factors H⊆B such that for all s,t∈S, if s∼bt for all b∈H, then s∼Xt.
The history of X, then, is the smallest set of factors H—so, the smallest subset of B—such that if I take an element of S and I hide it from you, and you want to know which part in X it is in, it suffices for me to tell you which part it is in within each of the factors in H.
So the history H is a set of factors of S, and knowing the values of all the factors in H is sufficient to know the value of X, or to know which part in X a given element is going to be in. I'll give an example soon that will maybe make this a little more clear.
We're then going to define time from history. We'll say that X is weakly before Y, written X≤FY, if hF(X)⊆hF(Y). And we'll say that X is strictly before Y, written X<FY, if hF(X)⊂hF(Y).
One analogy one could draw is that these histories are like the past light cones of a point in spacetime. When one point is before another point, then the backwards light cone of the earlier point is going to be a subset of the backwards light cone of the later point. This helps show why "before" can be like a subset relation.
We're also going to define orthogonality from history. We'll say that two partitions X and Y are orthogonal, written X⊥FY, if their histories are disjoint: hF(X)∩hF(Y)={}.
Now I'm going to go through an example.
2e. Game of Life
Let S be the set of all Game of Life computations starting from an [−n,n]×[−n,n] board.
Let R={(r,c,t)∈Z3∣0≤t≤n, |r|≤n−t, |c|≤n−t} (i.e., cells computable from the initial [−n,n]×[−n,n] board). For (r,c,t)∈R, let ℓ(r,c,t)⊆S be the set of all computations such that the cell at row r and column c is alive at time t.
(Minor footnote: I've done some small tricks here in order to deal with the fact that the Game of Life is normally played on an infinite board. We want to deal with the finite case, and we don't want to worry about boundary conditions, so we're only going to look at the cells that are uniquely determined by the initial board. This means that the board will shrink over time, but this won't matter for our example.)
S is the set of all Game of Life computations, but since the Game of Life is deterministic, the set of all computations is in bijective correspondence with the set of all initial conditions. So |S|=2(2n+1)2, the number of initial board states.
This also gives us a nice factorization on the set of all Game of Life computations. For each cell, there's a partition that separates out the Game of Life computations in which that cell is alive at time 0 from the ones where it's dead at time 0. Our factorization, then, will be a set of (2n+1)2 binary factors, one for each question of "Was this cell alive or dead at time 0?".
Formally: For (r,c,t)∈R, let L(r,c,t)={ℓ(r,c,t),S∖ℓ(r,c,t)}. Let F=(S,B), where B={L(r,c,0)∣−n≤r,c≤n}.
There will also be other partitions on this set of all Game of Life computations that we can talk about. For example, you can take a cell and a time t and say, "Is this cell alive at time t?", and there will be a partition that separates out the computations where that cell is alive at time t from the computations where it's dead at time t.
Here's an example of that:
The lowest grid shows a section of the initial board state.
The blue, green, and red squares on the upper boards are (cell, time) pairs. Each square corresponds to a partition of the set of all Game of Life computations, "Is that cell alive or dead at the given time t?"
The history of that partition is going to be all the cells in the initial board that go into computing whether the cell is alive or dead at time t. It's everything involved in figuring out that cell's state. E.g., knowing the state of the nine light-red cells in the initial board always tells you the state of the red cell in the second board.
In this example, the partition corresponding to the red cell's state is strictly before the partition corresponding to the blue cell. The question of whether the red cell is alive or dead is before the question of whether the blue cell is alive or dead.
Meanwhile, the question of whether the red cell is alive or dead is going to be orthogonal to the question of whether the green cell is alive or dead.
And the question of whether the blue cell is alive or dead is not going to be orthogonal to the question of whether the green cell is alive or dead, because they intersect on the cyan cells.
Generalizing the point, fix X=L(rX,cX,tX),Y=L(rY,cY,tY), where (rX,cX,tX),(rY,cY,tY)∈R. Then:
We can also see that the blue and green cells look almost orthogonal. If we condition on the values of the two cyan cells in the intersection of their histories, then the blue and green partitions become orthogonal. That's what we're going to discuss next.
David Spivak: A priori, that would be a gigantic computation—to be able to tell me that you understand the factorization structure of that Game of Life. So what intuition are you using to be able to make that claim, that it has the kind of factorization structure you're implying there?
Scott: So, I've defined the factorization structure.
David Spivak: You gave us a certain factorization already. So somehow you have a very good intuition about history, I guess. Maybe that's what I'm asking about.
Scott: Yeah. So, if I didn't give you the factorization, there's this obnoxious number of factorizations that you could put on the set here. And then for the history, the intuition I'm using is: "What do I need to know in order to compute this value?"
I actually went through and I made little gadgets in Game of Life to make sure I was right here, that every single cell actually could in some situations affect the cells in question. But yeah, the intuition that I'm working from is mostly about the information in the computation. It's "Can I construct a situation where if only I knew this fact, I would be able to compute what this value is? And if I can't, then it can take two different values."
David Spivak: Okay. I think deriving that intuition from the definition is something I'm missing, but I don't know if we have time to go through that.
Scott: Yeah, I think I'm not going to here.
2b. Conditional Orthogonality
So, just to set your expectations: Every time I explain Pearlian causal inference to someone, they say that d-separation is the thing they can't remember. d-separation is a much more complicated concept than "directed paths between nodes" and "nodes without any common ancestors" in Pearl; and similarly, conditional orthogonality will be much more complicated than time and orthogonality in our paradigm. Though I do think that conditional orthogonality has a much simpler and nicer definition than d-separation.
We'll begin with the definition of conditional history. We again have a fixed finite set as our context. Let F=(S,B) be a finite factored set, let X,Y,Z∈Part(S), and let E⊆S.
The conditional history of X given E, written hF(X|E), is the smallest set of factors H⊆B satisfying the following two conditions:
The first condition is much like the condition we had in our definition of history, except we're going to make the assumption that we're in E. So the first condition is: if all you know about an object is that it's in E, and you want to know which part it's in within X, it suffices for me to tell you which part it's in within each factor in the history H.
Our second condition is not actually going to mention X. It's going to be a relationship between E and H. And it says that if you want to figure out whether an element of S is in E, it's sufficient to parallelize and ask two questions:
If both of these questions return "yes", then the point has to be in E.
I am not going to give an intuition about why this needs to be a part of the definition. I will say that without this second condition, conditional history would not even be well-defined, because it wouldn't be closed under intersection. And so I wouldn't be able to take the smallest set of factors in the subset ordering.
Instead of justifying this definition by explaining the intuitions behind it, I'm going to justify it by using it and appealing to its consequences.
We're going to use conditional history to define conditional orthogonality, just like we used history to define orthogonality. We say that X and Y are orthogonal given E⊆S, written X⊥FY∣E, if the history of X given E is disjoint from the history of Y given E: hF(X|E)∩hF(Y|E)={}.
We say X and Y are orthogonal given Z∈Part(S), written X⊥FY∣Z, if X⊥FY∣z for all z∈Z. So what it means to be orthogonal given a partition is just to be orthogonal given each individual way that the partition might be, each individual part in that partition.
I've been working with this for a while and it feels pretty natural to me, but I don't have a good way to push the naturalness of this condition. So again, I instead want to appeal to the consequences.
2b. Compositional Semigraphoid Axioms
Conditional orthogonality satisfies the compositional semigraphoid axioms, which means finite factored sets are pretty well-behaved. Let F=(S,B) be a finite factored set, and let X,Y,Z,W∈Part(S) be partitions of S. Then:
The first four properties here make up the semigraphoid axioms, slightly modified because I'm working with partitions rather than sets of variables, so union is replaced with common refinement. There's another graphoid axiom which we're not going to satisfy; but I argue that we don't want to satisfy it, because it doesn't play well with determinism.
The fifth property here, composition, is maybe one of the most unintuitive, because it's not exactly satisfied by probabilistic independence.
Decomposition and composition act like converses of each other. Together, conditioning on Z throughout, they say that X is orthogonal to both Y and W if and only if X is orthogonal to the common refinement of Y and W.
2b. The Fundamental Theorem
In addition to being well-behaved, I also want to show that conditional orthogonality is pretty powerful. The way I want to do this is by showing that conditional orthogonality exactly corresponds to conditional independence in all probability distributions you can put on your finite factored set. Thus, much like d-separation in the Pearlian picture, conditional orthogonality can be thought of as a combinatorial version of probabilistic independence.
A probability distribution on a finite factored set F=(S,B) is a probability distribution P on S that can be thought of as coming from a bunch of independent probability distributions on each of the factors in B. So P(s)=∏b∈BP([s]b) for all s∈S.
This effectively means that your probability distribution factors the same way your set factors: the probability of any given element is the product of the probabilities of each of the individual parts that it's in within each factor.
The fundamental theorem of finite factored sets says: Let F=(S,B) be a finite factored set, and let X,Y,Z∈Part(S) be partitions of S. Then X⊥FY∣Z if and only if for all probability distributions P on F, and all x∈X, y∈Y, and z∈Z, we have P(x∩z)⋅P(y∩z)=P(x∩y∩z)⋅P(z). I.e., X is orthogonal to Y given Z if and only conditional independence is satisfied across all probability distributions.
This theorem, for me, was a little nontrivial to prove. I had to go through defining certain polynomials associated with the subsets, and then dealing with unique factorization in the space of these polynomials; I think the proof was eight pages or something.
The fundamental theorem allows us to infer orthogonality data from probabilistic data. If I have some empirical distribution, or I have some Bayesian distribution, I can use that to infer some orthogonality data. (We could also imagine orthogonality data coming from other sources.) And then we can use this orthogonality data to get temporal data.
So next, we're going to talk about how to get temporal data from orthogonality data.
2b. Temporal Inference
We're going to start with a finite set Ω, which is our sample space.
One naive thing that you might think we would try to do is infer a factorization of Ω. We're not going to do that because that's going to be too restrictive. We want to allow for Ω to maybe hide some information from us, for there to be some latent structure and such.
There may be some situations that are distinct without being distinct in Ω. So instead, we're going to infer a factored set model of Ω: some other set S, and a factorization of S, and a function from S to Ω.
A model of Ω is a pair (F,f), where F=(S,B) is a finite factored set and f:S→Ω. (f need not be injective or surjective.)
Then if I have a partition of Ω, I can send this partition backwards across f and get a unique partition of S. If X∈Parts(Ω), then f−1(X)∈Parts(S) is given by s∼f−1(X)t⇔f(s)∼Xf(t).
Then what we're going to do is take a bunch of orthogonality facts about Ω, and we're going to try to find a model which captures the orthogonality facts.
We will take as given an orthogonality database on Ω, which is a pair D=(O,N), where O (for "orthogonal") and N (for "not orthogonal") are each sets of triples (X,Y,Z) of partitions of Ω. We'll think of these as rules about orthogonality.
What it means for a model (F,f) to satisfy a database D is:
So we have these orthogonality rules we want to satisfy, and we want to consider the space of all models that are consistent with these rules. And even though there will always be infinitely many models that are consistent with my database, if at least one is—you can always just add more information that you then delete with f—we would like to be able to sometimes infer that for all models that satisfy our database, f−1(X) is before f−1(Y).
And this is what we're going to mean by inferring time. If all of our models (F,f) that are consistent with the database D satisfy some claim about time f−1(X)<Ff−1(Y), we'll say that X<DY.
2e. Two Binary Variables (Pearl)
So we've set up this nice combinatorial notion of temporal inference. The obvious next questions are:
Pearlian temporal inference is really quite powerful; given enough data, it can infer temporal sequence in a wide variety of situations. How powerful is the finite factored sets approach by comparison?
To address that question, we'll go to an example. Let X and Y be two binary variables. Pearl asks: "Are X and Y independent?" If yes, then there's no path between the two. If no, then there may be a path from X to Y, or from Y to X, or from a third variable to both X and Y.
In either case, we're not going to infer any temporal relationships.
To me, it feels like this is where the adage "correlation does not imply causation" comes from. Pearl really needs more variables in order to be able to infer temporal relationships from more rich combinatorial structures.
However, I claim that this Pearlian ontology in which you're handed this collection of variables has blinded us to the obvious next question, which is: is X independent of X XOR Y?
In the Pearlian world, X and Y were our variables, and X XOR Y is just some random operation on those variables. In our world, X XOR Y instead is a variable on the same footing as X and Y. The first thing I do with my variables X and Y is that I take the product X×Y and then I forget the labels X and Y.
So there's this question, "Is X independent of X XOR Y?". And if X is independent of X XOR Y, we're actually going to be able to conclude that X is before Y!
So not only is the finite factored set paradigm non-vacuous, and not only is it going to be able to keep up with Pearl and infer things Pearl can't, but it's going to be able to infer a temporal relationship from only two variables.
So let's go through the proof of that.
2e. Two Binary Variables (Factored Sets)
Let Ω={00,01,10,11}, and let X, Y, and Z be the partitions (/questions):
Let D=(O,N), where O={(X,Z,{Ω})} and N={(Z,Z,{Ω})}. If we'd gotten this orthogonality database from a probability distribution, then we would have more than just two rules, since we would observe more orthogonality and non-orthogonality than that. But temporal inference is monotonic with respect to adding more rules, so we can just work with the smallest set of rules we'll need for the proof.
The first rule says that X is orthogonal to Z. The second rule says that Z is not orthogonal to itself, which is basically just saying that Z is non-deterministic; it's saying that both of the parts in Z are possible, that both are supported under the function f. The {Ω} indicates that we aren't making any conditions.
From this, we'll be able to prove that X<DY.
Proof. First, we'll show that that X is weakly before Y. Let (F,f) satisfy D. Let HX be shorthand for hF(f−1(X)), and likewise let HY=hF(f−1(Y)) and HZ=hF(f−1(Z)).
Since (X,Z,{Ω})∈O, we have that HX∩HZ={}; and since (Z,Z,{Ω})∈N, we have that HZ≠{}.
Since X≤ΩY∨ΩZ—that is, since X can be computed from Y together with Z—HX⊆HY∪HZ. (Because a partition's history is the smallest set of factors needed to compute that partition.)
And since HX∩HZ={}, this implies HX⊆HY, so X is weakly before Y.
To show the strict inequality, we'll assume for the purpose of contradiction that HX = HY.
Notice that Z can be computed from X together with Y—that is, Z≤ΩX∨ΩY—and therefore HZ⊆HX∪HY (i.e., HZ⊆HX). It follows that HZ=(HX∪HY)∩HZ=HX∩HZ. But since HZ is also disjoint from HX, this means that HZ={}, a contradiction.
Thus HX≠HY, so HX⊂HY, so f−1(X)<Ff−1(Y), so X<DY. □
When I'm doing temporal inference using finite factored sets, I largely have proofs that look like this. We collect some facts about emptiness or non-emptiness of various Boolean combinations of histories of variables, and we use these to conclude more facts about histories of variables being subsets of each other.
I have a more complicated example that uses conditional orthogonality, not just orthogonality; I'm not going to go over it here.
One interesting point I want to make here is that we're doing temporal inference—we're inferring that X is before Y—but I claim that we're also doing conceptual inference.
Imagine that I had a bit, and it's either a 0 or a 1, and it's either blue or green. And these two facts are primitive and independently generated. And I also have this other concept that's like, "Is it grue or bleen?", which is the XOR of blue/green and 0/1.
There's a sense in which we're inferring X is before Y, and in that case, we can infer that blueness is before grueness. And that's pointing at the fact that blueness is more primitive, and grueness is a derived property.
In our proof, X and Z can be thought of as these primitive properties, and Y is a derived property that we're getting from them. So we're not just inferring time; we're inferring facts about what are good, natural concepts. And I think that there's some hope that this ontology can do for the statement "you can't really distinguish between blue and grue" what Pearl can do to the statement "correlation does not imply causation".
2b. Applications / Future Work / Speculation
The future work I'm most excited by with finite factored sets falls into three rough categories: inference (which involves more computational questions), infinity (more mathematical), and embedded agency (more philosophical).
Research topics related to inference:
There are a lot of research directions suggested by questions like "How do we do efficient inference in this paradigm?". Some of the questions here come from the fact that we're making fewer assumptions than Pearl, and are in some sense more coming from the raw data.
Then I have the applications that are about extending factored sets to the infinite case:
Everything I've presented in this talk was under the assumption of finiteness. In some cases this wasn't necessary—but in a lot of cases it actually was, and I didn't draw attention to this.
I suspect that the fundamental theorem can be extended to finite-dimensional factored sets (i.e., factored sets where |B| is finite), but it can not be extended to arbitrary-dimension factored sets.
And then, what I'm really excited about is applications to embedded agency:
I focused on the temporal inference aspect of finite factored sets in this talk, because it's concrete and tangible to be able to say, "Ah, we can do Pearlian temporal inference, only we can sometimes infer more structure and we rely on fewer assumptions."
But really, a lot of the applications I'm excited about involve using factored sets to model situations, rather than inferring factored sets from data.
Anywhere that we currently model a situation using graphs with directed edges that represent information flow or causality, we might instead be able to use factored sets to model the situation; and this might allow our models to play more nicely with abstraction.
I want to build up the factored set ontology as an alternative to graphs when modeling agents interacting with things, or when modeling information flow. And I'm really excited about that direction.