The eleventh virtue is scholarship. Study many sciences and absorb their power as your own. Each field that you consume makes you larger.
Aspiring alignment researcher with a keen interest in agent foundations. Studying math, physics, theoretical CS (Harvard 2027). Contact me via Discord: dalcy_me, email: dalcy.mail@gmail.com. They / Them, He / Him.
I've been doing a deep dive on this post, and while the main theorems make sense I find myself quite confused about some basic concepts. I would really appreciate some help here!
The Resampling stuff is a bit confusing too:
if we have a natural latent $\Lambda$, then construct a new natural latent by resampling $\Lambda$ conditional on $X$ (i.e. sample from $P[\Lambda \mid X]$), independently of whatever other stuff we’re interested in.
And finally:
In standard form, a natural latent is always approximately a deterministic function of . Specifically: .
...
Suppose there exists an approximate natural latent over . Construct a new random variable sampled from the distribution . (In other words: simultaneously resample each given all the others.) Conjecture: is an approximate natural latent (though the approximation may not be the best possible). And if so, a key question is: how good is the approximation?
Where is the top result proved, and how is this statement different from the Universal Natural Latent Conjecture below? Also is this post relevant to either of these statements, and if so, does that mean they only hold under strong redundancy?
Does anyone know whether Shannon arrived at entropy from the axiomatic definition first, or from the operational definition first?
I've been thinking about these two distinct ways in which we seem to arrive at new mathematical concepts. Looking at the countless partial information decomposition measures in the literature, all derived or motivated on an axiomatic basis, and not knowing which intuition to prioritize over which, I've been placing less of a premium on axiomatic conceptual definitions than I used to:
The basis of comparison would be its usefulness and ease-of-generalization to better concepts:
(obviously these two feed into each other)
Just finished the local causal states paper, it's pretty cool! A couple of thoughts though:
I don't think the causal states factorize over the dynamical bayes net, unlike the original random variables (by assumption). Shalizi doesn't claim this either.
Also I don't follow the Markov Field part - how would proving:
if we condition on present neighbors of the patch, as well as the parents of the patch, then we get independence of the states of all points at time t or earlier. (pg 16)
... show that the causal states form a Markov field (i.e., satisfy the Markov independencies (local, pairwise, or global) induced by an undirected graph)? I'm not even sure what undirected graph the causal states would be Markov with respect to. Is it the ...
Also for concreteness I think I need to understand its application in detecting coherent structures in cellular automata to better appreciate this construction, though the automata theory part does go a bit over my head :p
a Markov blanket represents a probabilistic fact about the model without any knowledge you possess about values of specific variables, so it doesn't matter if you actually do know which way the agent chooses to go.
The usual definition of Markov blankets is in terms of the model without any knowledge of the specific values as you say, but I think in Critch's formalism this isn't the case. Specifically, he defines the 'Markov Boundary' of (being the non-abstracted physics-ish model) as a function of the random variable (where he writes e.g. ), so it can depend on the values instantiated at .
So I think under this definition of Markov blankets, they can be used to denote agent boundaries, even in physics-ish models (i.e. ones that relate nicely to causal relationships). I'd like to know what you think about this.
I thought if one could solve one NP-complete problem then one can solve all of them. But you say that the treewidth doesn't help at all with the Clique problem. Is the parametrized complexity filtration by treewidth not preserved by equivalence between different NP-complete problems somehow?
All NP-complete problems should have parameters that make the problem polynomial when bounded, trivially so by the => 3-SAT => Bayes Net translation, and using the treewidth bound.
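As a rough illustration of the 3-SAT => Bayes Net step, here is a minimal sketch. It assumes the usual encoding: each propositional variable is a root node with a fair-coin prior, each clause is a deterministic OR node, and a final deterministic AND node collects the clauses, so the formula is satisfiable iff P(AND = 1) > 0. The function names and the signed-integer literal convention (`3` for x3, `-3` for its negation) are my own choices for the example, and the query is answered by brute-force enumeration rather than by any structure-exploiting inference.

```python
from itertools import product

def clause_node(clause, assignment):
    """Deterministic OR node: true iff some literal in the clause is satisfied."""
    return any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)

def prob_formula_true(clauses, n_vars):
    """P(final AND node = 1) when every variable is an independent fair coin.

    Computed by naive marginalization over all 2**n_vars assignments.
    """
    hits = sum(
        all(clause_node(c, assignment) for c in clauses)
        for assignment in product([False, True], repeat=n_vars)
    )
    return hits / 2 ** n_vars

def is_satisfiable(clauses, n_vars):
    """The formula is satisfiable iff the AND node has nonzero probability."""
    return prob_formula_true(clauses, n_vars) > 0
```

The enumeration here is exponential in the number of variables. The treewidth bound matters because inference that exploits the graph structure can answer the same query without enumerating every assignment.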
This isn't the case for the clique problem (finding a max clique) because it's not NP-complete (it's not a decision problem), so we don't necessarily expect its parameterized version to be polynomially tractable. In fact, it's the k-clique problem (yes/no: is there a clique of size at least k?) that is NP-complete. (So by the above translation argument, there certainly exists some graphical quantity that, when bounded, makes the k-clique problem tractable, though I'm not aware of it, or whether it's interesting.)
To me, the interesting question is whether:
Looking at the 3-SAT example ( are the propositional variables, the ORs, and the AND with serving as intermediate ANDs):
I would be interested in a similar analysis for more NP-complete problems known to have natural parameterized complexity characterization.
You mention treewidth - are there other quantities of similar importance?
I'm not familiar with any, though ChatGPT does give me some examples! copy-pasted below:
- Solution Size (k): The size of the solution or subset that we are trying to find. For example, in the k-Vertex Cover problem, k is the maximum size of the vertex cover. If k is small, the problem can be solved more efficiently.
- Treewidth (tw): A measure of how "tree-like" a graph is. Many hard graph problems become tractable when restricted to graphs of bounded treewidth. Algorithms that leverage treewidth often use dynamic programming on tree decompositions of the graph.
- Pathwidth (pw): Similar to treewidth but more restrictive, pathwidth measures how close a graph is to a path. Problems can be easier to solve on graphs with small pathwidth.
- Vertex Cover Number (vc): The size of the smallest vertex cover of the graph. This parameter is often used in graph problems where knowing a small vertex cover can simplify the problem.
- Clique Width (cw): A measure of the structural complexity of a graph. Bounded clique width can be used to design efficient algorithms for certain problems.
- Max Degree (Δ): The maximum degree of any vertex in the graph. Problems can sometimes be solved more efficiently when the maximum degree is small.
- Solution Depth (d): For tree-like or hierarchical structures, the depth of the solution tree or structure can be a useful parameter. This is often used in problems involving recursive or hierarchical decompositions.
- Branchwidth (bw): Similar to treewidth, branchwidth is another measure of how a graph can be decomposed. Many algorithms that work with treewidth also apply to branchwidth.
- Feedback Vertex Set (fvs): The size of the smallest set of vertices whose removal makes the graph acyclic. Problems can become easier on graphs with a small feedback vertex set.
- Feedback Edge Set (fes): Similar to feedback vertex set, but involves removing edges instead of vertices to make the graph acyclic.
- Modular Width: A parameter that measures the complexity of the modular decomposition of the graph. This can be used to simplify certain problems.
- Distance to Triviality: This measures how many modifications (like deletions or additions) are needed to convert the input into a simpler or more tractable instance. For example, distance to a clique, distance to a forest, or distance to an interval graph.
- Parameter for Specific Constraints: Sometimes, specific problems have unique natural parameters, like the number of constraints in a CSP (Constraint Satisfaction Problem), or the number of clauses in a SAT problem.
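To make the "solution size" parameter concrete, here is a minimal sketch of the textbook bounded-search-tree algorithm for k-Vertex Cover. Every edge must have at least one endpoint in the cover, so we branch on the two endpoints of an arbitrary edge and recurse with budget k - 1. The recursion depth is at most k, so the running time is roughly O(2^k · |E|): exponential only in the parameter. The function name and the edge-list representation are my own choices for the example.

```python
def vertex_cover_at_most_k(edges, k):
    """Return a vertex cover of size <= k as a set, or None if none exists.

    edges: list of (u, v) pairs of an undirected graph.
    """
    if not edges:
        return set()          # nothing left to cover
    if k == 0:
        return None           # edges remain but the budget is spent
    u, v = edges[0]
    for pick in (u, v):       # one endpoint of this edge must be in the cover
        remaining = [e for e in edges if pick not in e]
        sub = vertex_cover_at_most_k(remaining, k - 1)
        if sub is not None:
            return sub | {pick}
    return None
```

For example, a triangle has no cover of size 1 but does have one of size 2, while a star is covered by its center alone.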
I like to think of treewidth in terms of its characterization from tree decomposition, a task where you find a clique tree (or junction tree) of an undirected graph.
A clique tree for an undirected graph is a tree such that:
You can check that these properties hold in the example below. I will also refer to nodes of a clique tree as 'cliques'. (image from here)
My intuition for the point of tree decompositions is that you want to coarsen the variables of a complicated graph so that they can be represented in a simpler form (tree), while ensuring the preservation of useful properties such as:
Of course tree decompositions aren't unique (image):
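So we define $\text{treewidth}(G) := \min_{T} \max_{C \in T} |C| - 1$, where $T$ ranges over the clique trees of $G$ and $C$ over the cliques of $T$.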
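Since treewidth is a minimum over all tree decompositions (with the width of a decomposition being its largest clique size minus one, in the standard definition), any single decomposition gives an upper bound on it. Below is a minimal sketch of one common way to get such a bound: greedily eliminate a minimum-degree vertex, connect its remaining neighbors into a clique, and record the largest neighborhood seen. This is a heuristic, not an exact treewidth algorithm, and the function name and adjacency-dict input (assumed symmetric) are my own choices.

```python
def min_degree_width(adjacency):
    """Upper bound on treewidth via greedy min-degree elimination.

    adjacency: dict mapping each vertex to an iterable of its neighbors
    (assumed symmetric, i.e. an undirected graph).
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    width = 0
    while graph:
        # Eliminate a vertex of minimum current degree.
        v = min(graph, key=lambda u: len(graph[u]))
        nbrs = graph.pop(v)
        # {v} plus its neighbors form a clique of size len(nbrs) + 1.
        width = max(width, len(nbrs))
        for a in nbrs:
            graph[a].discard(v)
            graph[a] |= nbrs - {a}   # add fill-in edges among the neighbors
    return width
```

On a path this returns 1 and on a 4-cycle it returns 2, which match the true treewidths. On larger graphs the greedy order can overshoot.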
Bayes Net inference algorithms maintain their efficiency by using dynamic programming over multiple layers.
Level 0: Naive Marginalization
Level 1: Variable Elimination
Level 2: Clique-tree based algorithms — e.g., Sum-product (SP) / Belief-update (BU) calibration algorithms
Level 3: Specialized query-set answering algorithms over a calibrated clique tree.
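The gap between Level 0 and Level 1 can be sketched on a toy example: a chain of binary variables X0 -> X1 -> ... -> X(n-1). Naive marginalization sums the full joint (2^n terms), while variable elimination sums out one variable at a time and reuses the intermediate factor, which is the dynamic programming step. The probability tables and function names below are made up for illustration.

```python
from itertools import product

prior = [0.6, 0.4]                        # P(X0), illustrative numbers
transition = [[0.7, 0.3], [0.2, 0.8]]     # transition[a][b] = P(X_{i+1}=b | X_i=a)

def naive_marginal(n):
    """P(X_{n-1}) by summing the full joint over all 2**n assignments."""
    result = [0.0, 0.0]
    for xs in product(range(2), repeat=n):
        p = prior[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p *= transition[a][b]
        result[xs[-1]] += p
    return result

def eliminated_marginal(n):
    """P(X_{n-1}) by eliminating X0, X1, ... in order: n - 1 small sums."""
    message = prior[:]
    for _ in range(n - 1):
        # Sum out the current variable, leaving a factor over the next one.
        message = [
            sum(message[a] * transition[a][b] for a in range(2))
            for b in range(2)
        ]
    return message
```

Both functions return the same distribution, but the second does work linear in n. Clique-tree algorithms (Level 2) push the same idea further by caching these intermediate factors so that many queries can share them.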
Perhaps I should one day in the far far future write a sequence on Bayes nets.
Some low-effort TOC (this is mostly Koller & Friedman):
Thank you, that is very clarifying!