Dalcy

Tomorrow can be brighter than today
Although the night is cold
The stars may seem so very far away

But courage, hope and reason burn
In every mind, each lesson learned
Shining light to guide our way

Make tomorrow brighter than today

Wiki Contributions

Comments

Sorted by
Dalcy289

The 3-4 Chasm of Theoretical Progress

epistemic status: unoriginal. trying to spread a useful framing of theoretical progress introduced from an old post.

Tl;dr, often the greatest theoretical challenge comes from the step of crossing the chasm from [developing an impractical solution to a problem] to [developing some sort of a polytime solution to a problem], because the nature of their solutions can be opposites.

Summarizing Diffractor's post on Program Search and Incomplete Understanding:

Solving a foundational problem to its implementation often takes the following steps (some may be skipped):

  1. developing a philosophical problem
  2. developing a solution given infinite computing power
  3. developing an impractical solution
  4. developing some sort of polytime solution
  5. developing a practical solution

and he says that it is often during the 3 -> 4 step in which understanding gets stuck and the most technical and brute-force math (and i would add sometimes philosophical) work is needed, because:

  • a common motif in 3) is that they're able to proving interesting things about their solutions, like asymptotic properties, by e.g., having their algorithms iterate through all turing machines, hence somewhat conferring the properties of the really good turing machine solution that exists somewhere in this massive search space to the overall search algorithm (up to a massive constant, usually).
    • think of Levin's Universal Search, AIXItl, Logical Induction.
    • he says such algorithms are secretly a black box algorithm; there are no real gears.
  • Meanwhile, algorithms in 4) have the opposite nature - they are polynomial often because they characterize exploitable patterns that make a particular class of problems easier than most others, which requires Real Understanding. So algorithms of 3) and 4) often look nothing alike.

I liked this post and the idea of the "3-4 chasm," because it explicitly captures the vibes of why I personally felt the vibes that, e.g., AIT, might be less useful for my work: after reading this post, I realized that for example when I refer to the word "structure," I'm usually pointing at the kind of insights required to cross the 3-4 gap, while others might be using the same word to refer to things at a different level. This causes me to get confused as to how some tool X that someone brought up is supposed to help with the 3-4 gap I'm interested in.[1]

Vanessa Cosoy refers to this post, saying (in my translation of her words) that a lot of the 3-4 gap in computational learning theory has to do with our lack of understanding of deep learning theory, like how the NP-complete barrier is circumvented in practical problems, what are restrictions we can put on out hypothesis class to make them efficiently learnable in the same way our world seems efficiently learnable, etc.

  • She mentions that this gap, at least in the context of deep learning theory, isn't too much of a pressing problem because it already has mainstream attention - which explains why a lot of her work seems to lie in the 1-3 regime.

I asked GPT for examples of past crossings of the 3-4 chasm in other domains, and it suggested [Shannon's original technically-constructive-but-highly-infeasible proof for the existence of optimal codes] vs. [recent progress on Turbocodes that actually approach this limit while being very practical], which seems like a perfect example.

  1. ^

    AIT in specific seems to be useful primarily in the 1-2 level?

Dalcy10

Thank you! I tried it on this post and while the post itself is pretty short, the raw content that i get seems to be extremely long (making it larger than the o1 context window, for example), with a bunch of font-related information inbetween. Is there a way to fix this?

Dalcy50

The critical insight is that this is not always the case!

Let's call two graphs I-equivalent if their set of independencies (implied by d-separation) are identical. A theorem of Bayes Nets say that two graphs are I-equivalent if they have the same skeleton and the same set of immoralities.

This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified - namely, across all I-equivalent graphs that are the perfect map of a distribution, some of the edges have identical directions assigned to them.

The IC algorithm (Verma & Pearl, 1990) for finding perfect maps (hence temporal direction) is exactly about exploiting these conditions to orient as many of the edges as possible:

More intuitively, (Verma & Pearl, 1992) and (Meek, 1995) together shows that the following four rules are necessary and sufficient operations to maximally orient the graph according to the I-equivalence (+ acyclicity) constraint:

Anyone interested in further detail should consult Pearl's Causality Ch 2. Note that for some reason Ch 2 is the only chapter in the book where Pearl talks about Causal Discovery (i.e. inferring time from observational distribution) and the rest of the book is all about Causal Inference (i.e. inferring causal effect from (partially) known causal structure).

Dalcy476

The Metaphysical Structure of Pearl's Theory of Time

Epistemic status: metaphysics

I was reading Factored Space Models (previously, Finite Factored Sets) and was trying to understand in what sense it was a Theory of Time.

Scott Garrabrant says "[The Pearlian Theory of Time] ... is the best thing to happen to our understanding of time since Einstein". I read Pearl's book on Causality[1], and while there's math, this metaphysical connection that Scott seems to make isn't really explicated. Timeless Causality and Timeless Physics is the only place I saw this view explained explicitly, but not at the level of math / language used in Pearl's book.

Here is my attempt at explicitly writing down what all of these views are pointing at (in a more rigorous language)—the core of the Pearlian Theory of Time, and in what sense FSM shares the same structure.


Causality leave a shadow of conditional independence relationships over the observational distribution. Here's an explanation providing the core intuition:

  1. Suppose you represent the ground truth structure of [causality / determination] of the world via a Structural Causal Model over some variables, a very reasonable choice. Then, as you go down the Pearlian Rung: SCM [2] Causal Bayes Net [3] Bayes Net, theorems guarantee that the Bayes Net is still Markovian wrt the observational distribution.
    1. (Read Timeless Causality for an intuitive example.)
  2. Causal Discovery then (at least in this example) reduces to inferring the equation assignment directions of the SCM, given only the observational distribution.
  3. The earlier result guarantees that all you have to do is find a Bayes Net that is Markovian wrt the observational distribution. Alongside the faithfulness assumption, this thus reduces to finding a Bayes Net structure G whose set of independencies (implied by d-separation) are identical to that of P (or, finding the Perfect Map of a distribution[4]).
  4. Then, at least some of the edges of the Perfect Map will have its directions nailed down by the conditional independence relations.

The metaphysical claim is that, this direction is the definition of time[5], morally so, based on the intuition provided by the example above.

So, the Pearlian Theory of Time is the claim that Time is the partial order over the variables of a Bayes Net corresponding to the perfect map of a distribution.


Abstracting away, the structure of any Theory of Time is then to:

  1. find a mathematical structure [in the Pearlian Theory of Time, a Bayes Net]
  2. ... that has gadgets [d-separation]
  3. ... that are, in some sense, "equivalent" [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
  4. ... while containing a notion of order [parenthood relationship of nodes in a Bayes Net]
  5. ... while this order induced from the gadget coinciding to that of d-separation [trivially so here, because we're talking about Bayes Nets and d-separation] such that it captures the earlier example which provided the core intuition behind our Theory of Time.

This is exactly what Factored Space Model does:

  1. find a mathematical structure [Factored Space Model]
  2. ... that has gadgets [structural independence]
  3. ... that are, in some sense, "equivalent" [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
  4. ... while containing a notion of order [preorder relation induced by the subset relationship of the History]
  5. ... while this order induced from the gadget coinciding to that of d-separation [by a theorem of FSM] such that it captures the earlier example which provided the core intuition behind our Theory of Time.
  6. while, additionally, generalizing the scope of our Theory of Time from [variables that appear in the Bayes Net] to [any variables defined over the factored space].

... thus justifying calling FSM a Theory of Time in the same spirit that Pearlian Causal Discovery is a Theory of Time.

  1. ^

    Chapter 2, specifically, which is about Causal Discovery. All the other chapters are mostly irrelevant for this purpose.

  2. ^

    By (1) making a graph with edge direction corresponding to equation assignment direction, (2) pushforwarding uncertainties to endogenous variables, and (3) letting interventional distributions be defined by the truncated factorization formula.

  3. ^

    By (1) forgetting the causal semantics, i.e. no longer associating the graph with all the interventional distributions, and only the no intervention observational distribution.

  4. ^

    This shortform answers this question I had.

  5. ^

    Pearl comes very close. In his Temporal Bias Conjecture (2.8.2):

    "In most natural phenomenon, the physical time coincides with at least one statistical time."

    (where statistical time refers to the aforementioned direction.)

    But doesn't go as far as this ought to be the definition of Time.

Dalcy40

The grinding inevitability is not a pressure on you from the outside, but a pressure from you, towards the world. This type of determination is the feeling of being an agent with desires and preferences. You are the unstoppable force, moving towards the things you care about, not because you have to but simply because that’s what it means to care. 

I think this is probably one of my favorite quotes of all time. I translated it to Korean (with somewhat major stylistic changes) with the help of ChatGPT:

의지(意志)라 함은,
하나의 인간으로서,
멈출 수 없는 힘으로 자신이 중요히 여기는 것들을 향해 나아가는 것.

이를 따르는 갈아붙이는 듯한 필연성은,
외부에서 자신을 압박하는 힘이 아닌,
스스로가 세상을 향해 내보내는 압력임을.

해야 해서가 아니라,
단지 그것이 무언가를 소중히 여긴다는 뜻이기 때문에.

Dalcy30

https://www.lesswrong.com/posts/KcvJXhKqx4itFNWty/k-complexity-is-silly-use-cross-entropy-instead

The K-complexity of a function is the length of its shortest code. But having many many codes is another way to be simple! Example: gauge symmetries in physics. Correcting for length-weighted code frequency, we get an empirically better simplicity measure: cross-entropy.

However:

[this] is a well-known notion in algorithmic information theory, and differs from K-complexity by at most a constant

Dalcy4120

Epistemic status: literal shower thoughts, perhaps obvious in retrospect, but was a small insight to me.

I’ve been thinking about: “what proof strategies could prove structural selection theorems, and not just behavioral selection theorems?”

Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.

  • Coherence theorem: Given an agent satisfying some axioms, we can observe their behavior in various conditions and construct , and then the agent’s behavior is equivalent to a system that is maximizing .
    • Says nothing about whether the agent internally constructs  and uses them.
  • (Little Less Silly version of the) Good regulator theorem: A regulator  that minimizes the entropy of a system variable  (where there is an environment variable  upstream of both  and ) without unnecessary noise (hence deterministic) is behaviorally equivalent to a deterministic function of  (despite being a function of ).
    • Says nothing about whether  actually internally reconstructs  and uses it to produce its output.
  • Causal good regulator theorem (summary): Given an agent achieving low regret across various environment perturbations, we can observe their behavior in specific perturbed-environments, and construct  that is very similar to the true environment . Then argue: “hence the agent must have something internally isomorphic to ”. Which is true, but …
    • says nothing about whether the agent actually uses those internal isomorphic-to- structures in the causal history of computing its output.

And I got stuck here wondering, man, how do I ever prove anything structural.

Then I considered some theorems that, if you squint really really hard, could also be framed in the selection theorem language in a very broad sense:

  • SLT: Systems selected to get low loss are likely to be in a degenerate part of the loss landscape.[1]
    • Says something about structure: by assuming the system to be a parameterized statistical model, it says the parameters satisfy certain conditions like degeneracy (which further implies e.g., modularity).

This made me realize that to prove selection theorems on structural properties of agents, you should obviously give more mathematical structure to the “agent” in the first place:

  • SLT represents a system as a parameterized function - very rich!
  • In coherence theorem, the agent is just a single node that outputs decision given lotteries. In the good regulator theorem and the causal good regulator theorem, the agent is literally just a single node in a Bayes Net - very impoverished!

And recall, we actually have an agent foundations style selection theorem that does prove something structural about agent internals by giving more mathematical structure to the agent:

  • Gooder regulator theorem: A regulator is now two nodes instead of one, but the latter-in-time node gets an additional information about the choice of “game” it is being played against (thus the former node acts as a sort of information bottleneck). Then, given that the regulator makes  take minimum entropy, the first node must be isomorphic to the likelihood function .
    • This does say something about structure, namely that an agent (satisfying certain conditions) with an internal information bottleneck (structural assumption) must have that bottleneck be behaviorally equivalent to a likelihood function, whose output is then connected to the second node. Thus it is valid to claim that (under our structural assumption) the agent internally reconstructs the likelihood values and uses it in its computation of the output.

So in short, we need more initial structure or even assumptions on our “agent,” at least more so than literally a single node in a Bayes Net, to expect to be able to prove something structural.

Here is my 5-minute attempt to put more such "structure" to the [agent/decision node] in the Causal good regulator theorem with the hopes that this would make the theorem more structural, and perhaps end up as a formalization of the Agent-like Structure Problem (for World Models, at least), or very similarly the Approximate Causal Mirror hypothesis:

  • Similar setup to the Causal good regulator theorem, but instead of a single node representing an agent’s decision node, assume that the agent as a whole is represented by an unknown causal graph , with a number of nodes designated as input and output, connected to the rest-of-the-world causal graph . Then claim: Agents with low regret must have  that admits an abstracting causal model map (summary) from , and (maybe more structural properties such as) the approximation error should roughly be lowest around the input/output & utility nodes, and increase as you move further away from it in the low-level graph. This would be a very structural claim!
  1. ^

    I'm being very very [imprecise/almost misleading] here—because I'm just trying to make a high-level point and the details don't matter too much—one of the caveats (among many) being that this statement makes the theoretically yet unjustified connection between SGD and Bayes.

Dalcy122

"I always remember, [Hamming] would come into my office and try to solve a problem [...] I had a very big blackboard, and he’d start on one side, write down some integral, say, ‘I ain’t afraid of nothin’, and start working on it. So, now, when I start a big problem, I say, ‘I ain’t afraid of nothin’, and dive into it."

Bruce MacLennan

Dalcy10

The question is whether this expression is easy to compute or not, and fortunately the answer is that it's quite easy! We can evaluate the first term by the simple Monte Carlo method of drawing many independent samples  and evaluating the empirical average, as we know the distribution  explicitly and it was presumably chosen to be easy to draw samples from.

My question when reading this was: why can't we say the same thing about ? i.e. draw many independent samples and evaluate the empirical average? Usually  is also assumed known and simple to sample from (e.g., gaussian).

So far, my answer is:

  • , so assuming  is my data, usually  will be high when  is high, so the samples during MCMC will be big enough to contribute to the sum, unlike blindly sampling from  where most samples will contribute nearly  to the sum.
  • Also another reason being how the expectation can be reduced to the sum of expectations over each of the dimensions of  and  if  and  factorizes nicely.
Dalcy10

Is there a way to convert a LessWrong sequence into a single pdf? Should ideally preserve comments, latex, footnotes, etc.

Load More