Natural Abstractions: Key Claims, Theorems, and Critiques

Leon Lang; Erik Jenner

Brief responses to the critiques:

Results don’t discuss encoding/representation of abstractions

Totally agree with this one, it's the main thing I've worked on over the past month and will probably be the main thing in the near future. I'd describe the previous results (i.e. ignoring encoding/representation) as characterizing the relationship between the high-level and the high-level.

Definitions depend on choice of variables

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. For instance, it doesn't make sense to use variables which "rotate together" the states of five different local patches of spacetime which are not close to each other. (For instance, those five different local patches will generally not be rotated together by default in an evolving agent's sensory feed.)

That does still leave degrees of freedom in how we represent all the local patches, but those are exactly the degrees of freedom which don't matter for natural abstraction. (Under the minimal latent formulation: we can represent each individual variable or set-of-variables-which-we're-making-independent-of-some-other-stuff in a different way without changing anything informationally. Under the redundancy formulation: assume our resampling process allows simultaneous resampling of small sets of variables, to avoid the thing where there's two variables very tightly coupled but they're otherwise independent of everything else. With that modification in place, same argument as the minimal latent formulation applies.)
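A toy sketch of the resampling modification, under assumptions of my own rather than anything from the posts: variables are exact copies of bits, and "information conserved under resampling" is approximated by which nonzero-probability states a resampling move can connect. The function name `conserved_classes` and both example distributions are hypothetical.

```python
from itertools import combinations


def conserved_classes(support, block_size):
    """Group states that repeated resampling can move between.

    A move redraws at most `block_size` coordinates conditional on the rest,
    so it connects two nonzero-probability states only if they differ in at
    most `block_size` coordinates. Each resulting class labels something that
    survives arbitrarily many resampling steps.
    """
    states = sorted(support)
    parent = {s: s for s in states}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(states, 2):
        differing = sum(x != y for x, y in zip(a, b))
        if differing <= block_size:
            parent[find(a)] = find(b)

    classes = {}
    for s in states:
        classes.setdefault(find(s), []).append(s)
    return sorted(classes.values())


# Assumption: A and B are exact copies of one bit; C is an independent bit.
coupled_pair = {(a, a, c) for a in (0, 1) for c in (0, 1)}
# Assumption: the same bit is copied into three variables.
three_copies = {(a, a, a) for a in (0, 1)}

print(len(conserved_classes(coupled_pair, 1)))  # → 2
print(len(conserved_classes(coupled_pair, 2)))  # → 1
print(len(conserved_classes(three_copies, 2)))  # → 2
```

With one variable resampled at a time, the tightly coupled pair looks like it carries conserved information. Resampling two at a time wipes that out, while the bit copied three times still survives.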

Theorems focus on infinite limits, but abstractions happen in finite regimes

Totally agree with this one too, and it has also been a major focus for me over the past couple months.

I'd also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn't yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual "right" formalizations. (The more general principle here is to only add formality when it's the right formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don't yet know the full right formality, then we should sketch at the level we think we do know.)

Missing theoretical support for several key claims

Basically agree with this. In particular, I think the quoted block is indeed a place where I was a bit overexcited at the time and made too strong a claim. More generally, for a while I was thinking of "deterministic constraints" as basically implying "low-dimensional" in practice, based on intuitions from physics. But in hindsight, that's at least not externally-legibly true, and arguably not true in general at all.

Figuring out whether the Universality Hypothesis is true
... What we’re less convinced of is that the current theoretical approach is a good way to tackle this question. One worrying sign is that almost two years after the project announcement (and over three years after work on natural abstractions began), there still haven’t been major empirical tests, even though that was the original motivation for developing all of the theory. ... Of course sometimes experiments do require upfront theory work. But in this case, we think that e.g. empirical interpretability work is already making progress on the Universality Hypothesis, whereas we’re unsure whether the natural abstractions agenda is much closer to major empirical tests than it was two years ago.

See the section on "Low level of precision...". Also, You Are Not Measuring What You Think You Are Measuring is a very relevant principle here - I have lots of (not necessarily externally-legible) bits of evidence about a rough version of natural abstraction, but the details I'm still figuring out are (not coincidentally) exactly the details where it's hard to tell whether we're measuring the right thing.

Abstractions as a bottleneck for agent foundations: The high-level story for why abstractions seem important for formalizing e.g. values seems very plausible to us. It’s less clear to us whether they are necessary (or at least a good first step)

Yeah, I don't think this should be externally-legibly clear right now. I think people need to spend a lot of time trying and failing to tackle agent foundations problems themselves, repeatedly running into the need for a proper model of abstraction, in order for this to be clear.

Accelerating alignment research: The promise behind this motivation is that having a theory of natural abstractions will make it much easier to find robust formalizations of abstractions such as “agency”, “optimizer”, or “modularity”. ... To us, such an outcome seems unlikely, though it may still be worth pursuing

I probably put higher probability on success here than you do, but I don't think it should be legibly clear.

Interpretability: ... Figuring out the real-world meaning of internal network activations is one of the core themes of safety-motivated interpretability work. And reverse-engineering a network into “pseudocode” is not just some separate problem, it’s deeply intertwined. We typically understand the inputs of a network, so if we can figure out how the network transforms these inputs, that can let us test hypotheses for what the meaning of internal activations is.

An intuitive understanding of inputs plus a circuit is not, in general, sufficient to interpret the internal things computed by the circuit. Easy counterargument: neural nets are circuits, so if those two pieces were enough, we'd already be done; there would be no interpretability problem in the first place.

Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple - e.g. edge detectors in Olah's early work, or the sinusoidal elements in Neel's work on modular addition. But this falls apart quickly as the circuits get bigger - e.g. later layers in vision nets, once we get past early things like edge and texture detectors.

Low level of precision and formalization

I mentioned earlier the heuristic of "only add formality when it's the right formality; don't prematurely add ad-hoc formulations just for the sake of making things more formal".

More generally, if you're used to academia, then bear in mind the incentives of academia push towards making one's work defensible to a much greater degree than is probably optimal for truth-seeking. Formalization is one part of this: in academia, the incentive is usually to add ad-hoc formalization in order to get a full formal proof rather than a sketch, even if the ad-hoc formalization added does not match reality well. On the experimental side, the incentive is usually on bulletproof results, rather than gaining lots of information. (... and that's the better case. In the worse case, the incentive is on jumping through certain hoops which are nominally about bulletproofing, but don't even do that job very well, like e.g. statistical significance.) And yes, defensibility does have value even for truth-seeking, but there are tradeoffs and I advise against anchoring too much on academia.

With that in mind: both my current work and most of my work to date is aimed more at truth-seeking than defensibility. I don't think I currently have all the right pieces, and I'm trying to get the right pieces quickly. For that purpose, it's important to make the stuff I think I understand as legible as possible so that others can help. I try to accurately convey my models and epistemic state. But it's not important to e.g. make it easy for others to point out mistakes in places where I didn't think the formality was right anyway. If and when I have all the pieces, then I can worry about defensible proof.

That said, I agree with at least some parts of the critique. Being both precise and readable at the same time is hard, man.

Few experiments
As we briefly discussed earlier, we think it’s worrying that there haven’t been major experiments on the Natural Abstraction Hypothesis, given that John thinks of it as mostly an empirical claim. We would be excited to see more discussion on experiments that can be done right now to test (parts of) the natural abstractions agenda! We elaborate on a preliminary idea in the appendix (though it has a number of issues).

I do love your experiment ideas! The experiments I ran last summer had a similar flavor - relatively-simple checks on MNIST nets - though they were focused on the "information at a distance" lens rather than the redundancy or minimal latent lenses.

Anyway, similar answer here as the previous section: at this point I'm mainly trying to get to the right answers quickly, not trying to provide some impressive defensible proof. I run experiments insofar as they give me bits about what the right answers are.

Erik Jenner (3y)

Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]

Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involves only a few variables. These local transformations aren't a problem because if we, say, resample $n$ variables at a time, then transforming $m < n$ variables doesn't affect redundant information.

I'm tentatively skeptical that we can split transformations up into these local components. E.g. to me it seems that describing some large number $N$ of particles by their center of mass and the distance vectors from the center of mass is a very reasonable description. But it sounds like you have a notion of "reasonable" in mind that's more specific than the set of all descriptions physicists might want to use.
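To make the center-of-mass example concrete, here is the transformation written out, assuming equal masses for simplicity (the symbols $x_i$, $R$, and $r_i$ are my own labels):

```latex
R = \frac{1}{N} \sum_{i=1}^{N} x_i ,
\qquad
r_i = x_i - R = \left(1 - \frac{1}{N}\right) x_i - \frac{1}{N} \sum_{j \neq i} x_j ,
\qquad
\sum_{i=1}^{N} r_i = 0 .
```

Every new variable depends on all $N$ original positions, so at least on its face this is not a transformation that acts on only a few variables at a time.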

I also don't see yet how exactly to make this work given local transformations---e.g. I think my version above doesn't quite work because if you're resampling a finite number $n$ of variables at a time, then I do think transforms involving fewer than $n$ variables can sometimes affect redundant information. I know you've talked before about resampling any finite number of variables in the context of a system with infinitely many variables, but I think we'll want a theory that can also handle finite systems. Another reason this seems tricky: if you compose lots of local transformations, for overlapping local neighborhoods, you get a transformation involving lots of variables. I don't currently see how to avoid that.

I'd also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn't yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual "right" formalizations. (The more general principle here is to only add formality when it's the right formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don't yet know the full right formality, then we should sketch at the level we think we do know.)

Oh, I did not realize from your posts that this is how you were thinking about the results. I'm very sympathetic to the point that formalizing things that are ultimately the wrong setting doesn't help much (e.g. in our appendix, we recommend people focus on the conceptual open problems like finite regimes or encodings, rather than more formalization). We may disagree about how much progress the results to date represent regarding finite approximations. I'd say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).

Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple - e.g. edge detectors in Olah's early work, or the sinusoidal elements in Neel's work on modular addition. But this falls apart quickly as the circuits get bigger - e.g. later layers in vision nets, once we get past early things like edge and texture detectors.

I totally agree with this FWIW, though we might disagree on some aspects of how to scale this to more realistic cases. I'm also very unsure whether I get how you concretely want to use a theory of abstractions for interpretability. My best story is something like: look for good abstractions in the model and then for each one, figure out what abstraction this is by looking at training examples that trigger the abstraction. If NAH is true, you can correctly figure out which abstraction you're dealing with from just a few examples. But the important bit is that you start with a part of the model that's actually a natural abstraction, which is why this approach doesn't work if you just look at examples that make a neuron fire, or similar ad-hoc ideas.

More generally, if you're used to academia, then bear in mind the incentives of academia push towards making one's work defensible to a much greater degree than is probably optimal for truth-seeking.

I agree with this. I've done stuff in some of my past papers that was just for defensibility and didn't make sense from a truth-seeking perspective. I absolutely think many people in academia would profit from updating in the direction you describe, if their goal is truth-seeking (which it should be if they want to do helpful alignment research!)

On the other hand, I'd guess the optimal amount of precision (for truth-seeking) is higher in my view than it is in yours. One crux might be that you seem to have a tighter association between precision and tackling the wrong questions than I do. I agree that obsessing too much about defensibility and precision will lead you to tackle the wrong questions, but I think this is feasible to avoid. (Though as I said, I think many people, especially in academia, don't successfully avoid this problem! Maybe the best quick fix for them would be to worry less about precision, but I'm not sure how much that would help.) And I think there's also an important failure mode where people constantly think about important problems but never get any concrete results that can actually be used for anything.

It also seems likely that different levels of precision are genuinely right for different people (e.g. I'm unsurprisingly much more confident about what the right level of precision is for me than about what it is for you). To be blunt, I would still guess the style of arguments and definitions in your posts only works well for very few people in the long run, but of course I'm aware you have lots of details in your head that aren't in your posts, and I'm also very much in favor of people just listening to their own research taste.

both my current work and most of my work to date is aimed more at truth-seeking than defensibility. I don't think I currently have all the right pieces, and I'm trying to get the right pieces quickly.

Yeah, to be clear I think this is the right call, I just think that more precision would be better for quickly arriving at useful true results (with the caveats above about different styles being good for different people, and the danger of overshooting).

Being both precise and readable at the same time is hard, man.

Yeah, definitely. And I think different trade-offs between precision and readability are genuinely best for different readers, which doesn't make it easier. (I think this is a good argument for separate distiller roles: if researchers have different styles, and can write best to readers with a similar style of thinking, then plausibly any piece of research should have a distillation written by someone with a different style, even if the original was already well written for a certain audience. It's probably not that extreme, I think often it's at least possible to find a good trade-off that works for most people, though hard).

johnswentworth (3y)

We may disagree about how much progress the results to date represent regarding finite approximations. I'd say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).

Perhaps your instincts here are better than mine! Going to the finite case has indeed turned out to be more difficult than I expected at the time of writing most of the posts you reviewed.

JessRiedel (2y)

(Self-promotion warning.) Alexander Gietelink Oldenziel pointed me toward this post after hearing me describe my physics research and noticing some potential similarities, especially with the Redundant Information Hypothesis. If you'll forgive me, I'd like to point to a few ideas in my field (many not associated with me!) that might be useful. Sorry in advance if these connections end up being too tenuous.

In short, I work on mathematically formalizing the intuitive idea of wavefunction branches, and a big part of my approach is based on finding variables that are special because they are redundantly recorded in many spatially disjoint systems. The redundancy aspects are inspired by some of the work done by Wojciech Zurek (my advisor) and collaborators on quantum Darwinism. (Don't read too much into the name; it's all about redundancy, not mutation.) Although I personally have concentrated on using redundancy to identify quantum variables that behave classically without necessarily being of interest to cognitive systems, the importance of redundancy for intuitively establishing "objectivity" among intelligent beings is a big motivation for Zurek.

Building on work by Brandao et al., Xiao-Liang Qi & Dan Ranard made use of the idea of "quantum Markov blankets" in formalizing certain aspects of quantum Darwinism. I think these are playing a very similar role to the (classical) Markov blankets discussed above.

In the section "Definitions depend on choice of variables" of the current post, the authors argue that Wentworth's construction depends on a choice of variables, and that without a preferred choice it's not clear that the ideas are robust. So it's maybe worth noting that a similar issue arises in the definition of wavefunction branches. The approach several researchers (including me) have been taking is to ground the preferred variables in spatial locality, which is about as fundamental a constraint as you can get in physics. More specifically, the idea is that the wavefunction branch decomposition should be invariant under arbitrary local operations ("unitaries") on each patch of space, but not invariant under operations that mix up different spatial regions.

Another basic physics idea that might be relevant is hydrodynamic variables and the relevant transport phenomena. Indeed, Wentworth brings up several special cases (e.g., temperature, center-of-mass momentum, pressure), and he correctly notes that their important role can be traced back to their local conservation (in time, not just under re-sampling). However, while very-non-exhaustively browsing through his other posts on LW it seemed as if he didn't bring up what is often considered their most important practical feature: predictability. Basically, the idea is this: Out of the set of all possible variables one might use to describe a system, most of them cannot be used on their own to reliably predict forward time evolution because they depend on the many other variables in a non-Markovian way. But hydro variables have closed equations of motion, which can be deterministic or stochastic but at the least are Markovian. Furthermore, the rest of the variables in the system (i.e., all the chaotic microscopic degrees of freedom) are usually "as random as possible" -- and therefore unnecessary to simulate -- in the sense that it's infeasible to distinguish them from being in equilibrium (subject, of course, to the constraints implied by the values of the conserved quantities). This formalism is very broad, extending well beyond fluid dynamics despite the name "hydro".
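As a toy illustration of the "closed equations of motion" point, here is a minimal sketch using strong lumpability of a finite Markov chain (the Kemeny-Snell condition). This is a standard stand-in that I am swapping in, not the hydrodynamic formalism itself. The 4-state transition matrix, the partitions, and all function names are made up for the example.

```python
def block_jump_probs(P, partition):
    """For each microstate, total probability of jumping into each block."""
    return [[sum(row[j] for j in block) for block in partition] for row in P]


def is_strongly_lumpable(P, partition, tol=1e-12):
    """Check whether the coarse-grained variable has closed Markov dynamics.

    The condition: all microstates inside one block must have the same
    probability of jumping into any given block.
    """
    probs = block_jump_probs(P, partition)
    for block in partition:
        rows = [probs[i] for i in block]
        for col in zip(*rows):
            if max(col) - min(col) > tol:
                return False
    return True


def lumped_matrix(P, partition):
    """Macro-level transition matrix (meaningful only if lumpable)."""
    probs = block_jump_probs(P, partition)
    return [probs[block[0]] for block in partition]


# Hypothetical micro-dynamics on four microstates (rows sum to 1).
P = [
    [0.5, 0.2, 0.2, 0.1],
    [0.3, 0.4, 0.1, 0.2],
    [0.1, 0.3, 0.4, 0.2],
    [0.2, 0.2, 0.3, 0.3],
]

good = [[0, 1], [2, 3]]  # macro-variable with closed dynamics
bad = [[0, 2], [1, 3]]   # macro-variable whose future depends on micro details

print(is_strongly_lumpable(P, good))  # → True
print(is_strongly_lumpable(P, bad))   # → False
print(lumped_matrix(P, good))
```

Most partitions of the microstates fail the check, which matches the claim that most candidate variables cannot predict their own future without reference to the rest of the system.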

Erik Jenner (2y)

Thanks for that overview and the references!

On hydrodynamic variables/predictability: I (like probably many others before me) rediscovered what sounds like a similar basic idea in a slightly different context, and my sense is that this is somewhat different from what John has in mind, though I'd guess there are connections. See here for some vague musings. When I talked to John about this, I think he said he's deliberately doing something different from the predictability-definition (though I might have misunderstood). He's definitely aware of similar ideas in a causality context, though it sounds like the physics version might contain additional ideas.

Alexander Gietelink Oldenziel (2y)

John has several lenses on natural abstractions:

natural abstraction as information-at-a-distance
natural abstraction = redundant & latent representation of information
natural abstraction = convergent abstraction for 'broad' class of minds

The thing that felt closest to me to the Quantum Darwinism story that Jess was talking about was the 'redundant/latent' story, e.g. https://www.lesswrong.com/posts/N2JcFZ3LCCsnK2Fep/the-minimal-latents-approach-to-natural-abstractions and https://www.lesswrong.com/posts/dWQWzGCSFj6GTZHz7/natural-latents-the-math

Alexander Gietelink Oldenziel (2y)

Curious if @johnswentworth has any takes on this.

Dalcy (2y)

Out of the set of all possible variables one might use to describe a system, most of them cannot be used on their own to reliably predict forward time evolution because they depend on the many other variables in a non-Markovian way. But hydro variables have closed equations of motion, which can be deterministic or stochastic but at the least are Markovian.

This idea sounds very similar to this—it definitely seems extendable beyond the context of physics:

We argue that they are both; more specifically, that the set of macrostates forms the unique maximal partition of phase space which 1) is consistent with our observations (a subjective fact about our ability to observe the system) and 2) obeys a Markov process (an objective fact about the system's dynamics).

So8res (3y)

I'm awarding another $3,000 distillation prize for this piece, with compliments to the authors.

Daniel Kokotajlo (3y)

Big thank you for doing this work!

johnswentworth (3y)

+1, this is probably going to be my new default post to link people to as an intro.

Vanessa Kosoy (10mo), Review for 2023 Review

This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results and the applications to alignment. There's also reasonable criticism.

To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning, and see what kind of abstractions can or cannot be learned (both in terms of sample complexity and in terms of computational complexity).

Some thoughts about natural abstractions inspired by this post:

The concept of natural abstractions seems closely related to my informally conjectured agreement theorem for infra-Bayesian physicalism. In a nutshell, two physicalist agents in the same universe with access to "similar" information should asymptotically arrive at similar beliefs (notably this is false for cartesian agents because of the different biases resulting from the different physical points of view).
A possible formalization of the agreement theorem inspired by my richness of mathematics conjecture: Given two beliefs $Ψ$ and $Φ$, we say that $Ψ ⪯ Φ$ when some conditioning of $Ψ$ on a finite set of observations produces a refinement of some conditioning of $Φ$ on a finite set of observations (see linked shortform for mathematical details). This relation is a preorder. In general, we can expect an agent to learn a sequence of beliefs of the form $Ψ_{0} ≺ Ψ_{1} ≺ Ψ_{2} ≺ \dots$. Here, the sequence can be over physical time, or over time discount, or over a parameter such as "availability of computing resources" or "how much time the world allows you for thinking between decisions": the latter is the natural asymptotic for metacognitive agents (see also logical time). Given two agents, we get two such sequences $\{Ψ_{i}\}$ and $\{Φ_{i}\}$. The agreement theorem can then state that for all $i \in \mathbb{N}$, there exists $j \in \mathbb{N}$ s.t. $Φ_{j} ⪯ Ψ_{i}$ (and vice versa). More precisely, this relation might hold up to some known function $ϵ(i, j)$ s.t. $\lim_{j \to \infty} ϵ(i, j) = 0$.
The "agreement" in the previous paragraph is purely semantic: the agents converge to believing in the same world, but this doesn't say anything about the syntactic structure of their beliefs. This seems conceptually insufficient for natural abstractions. However, maybe there is a syntactic equivalent where the preorder $⪯$ is replaced by morphisms in the category of some syntactic representations (e.g. string machines). It seems reasonable to expect that agents must use such representations to learn efficiently (see also frugal compositional languages).
In this picture, the graphical models used by John are a candidate for the frugal compositional language. I think this might be not entirely off the mark, but the real frugal compositional language is probably somewhat different.

Erik Jenner (10mo), Review for 2023 Review

I think this was a very good summary/distillation and a good critique of work on natural abstractions; I'm less sure it has been particularly useful or impactful.

I'm quite proud of our breakdown into key claims; I think it's much clearer than any previous writing (and in particular makes it easier to notice which sub-claims are obviously true, which are daring, which are or aren't supported by theorems, ...). It also seems that John was mostly on board with it.

I still stand by our critiques. I think the gaps we point out are important and might not be obvious to readers at first. That said, I regret somewhat that we didn't focus more on communicating an overall feeling about work on natural abstractions, and our core disagreements. I had some brief back-and-forth with John in the comments, where it seemed like we didn't even disagree that much, but at the same time, I still think John's writing about the agenda was wildly more optimistic than my views, and I don't think we made that crisp enough.

My impression is that natural abstractions are discussed much less than they were when we wrote the post (and this is the main reason why I think the usefulness of our post has been limited). An important part of the reason I wanted to write this was that many junior AI safety researchers or people getting into AI safety research seemed excited about John's research on natural abstractions, but I felt that some of them had a rosy picture of how much progress there'd been/how promising the direction was. So writing a summary of the current status combined with a critique made a lot of sense, to both let others form an accurate picture of the agenda's progress while also making it easier for them to get started if they wanted to work on it. Since there's (I think) less attention on natural abstractions now, it's unsurprising that those goals are less important.

As for why there's been less focus on natural abstractions, my guess is a combination of at least:

John has been writing somewhat less about it than during his peak-NAH-writing.
Other directions have gotten off the ground and have captured a lot of excitement (e.g. evals, control, and model organisms).
John isn't mentoring for MATS anymore, so junior researchers don't get exposure to his ideas through that.

It's also possible that many became more pessimistic about the agenda without public fanfare, or maybe my impression of relative popularity now vs then is just off.

I still think very high effort distillations and critiques can be a very good use of time (and writing this one still seems reasonable ex ante, though I'd focus more on nailing a few key points and less on being super comprehensive).

aysja (3y)

How does the redundancy definition of abstractions account for numbers, e.g., the number three? It doesn’t seem like “threeness” is redundantly encoded in, for example, the three objects on the floor of my room (rug, sweater, bottle of water) as rotation is in the gear example, since you wouldn’t be able to uncover information about “three” from any one object in particular.

I could imagine some definition based on redundancy capturing “threeness” by looking at a bunch of sets containing three things. But I think the reason the abstraction “three” feels a little strange on this account is that it is both highly natural (math!) but also can be highly “arbitrary,” e.g., “threeness” is wherever a mind can count three distinct objects (and those objects can be maximally unrelated!).

Perhaps counting the three objects on the floor of my room is a non-natural use case of the abstraction “three,” but if so, why? And where is the natural abstraction “three” in the world?

Jonas Hallgren (11mo), Review for 2023 Review

I find myself going back to this post again and again for explaining the Natural Abstraction Hypothesis. When this came out I was very happy as I finally had something I could share on John's work that made people understand it within one post.

carboniferous_umbraculum (3y)

I've only skimmed this, but my main confusions with the whole thing are still on a fairly fundamental level.

You spend some time saying what abstractions are, but when I see the hypothesis written down, most of my confusion is on what "cognitive systems" are and what one means by "most". Afaict it really is a kind of empirical question to do with "most cognitive systems". Do we have in mind something like 'animal brains and artificial neural networks'? If so then surely let's just say that and make the whole thing more concrete; so I suspect not... but in that case... what does it include? And how will we know if 'most' of them have some property? (At the moment, whenever I find evidence that two systems don't share an abstraction that they 'ought to' I can go "well the hypothesis is only most"...)

Jonas Hallgren (3y)

(My attempt at an explanation:)

In short, we care about the class of observers/agents that get redundant information in a similar way.

I think we can look at the specific dynamics of the systems described here to actually get a better perspective on whether the NAH should hold or not:

- I think you can think of the redundant information between you and the thing you care about as a function of all the steps that information passes through to reach you.
If we look at the question, a certain number of things are necessary for the (current implementation of) NAH to hold:
- 1. Redundant information is rare
  - To see whether this is the case, you will want to look at each of the individual interactions and analyse to what degree redundant information is passed on.
  - I guess the question "how brutal is the local optimisation environment?" might be a good way to estimate each information redundancy (A, B, C, D in the picture). Another question is "what level of noise do I expect to be formed at each transition?", as that would tell you to what degree the redundant information is lost in noise. (They pointed this out as the current hypothesis for usefulness in section 2d of the post.)
- 2. The way we access said information is similar
  - If you can determine to what extent the information flow between two agents is similar, you can estimate a probability of natural abstractions occurring in the same way.
  - For example, if we use vision versus hearing, we get two different information channels, so the abstractions will most likely change. (The causal proximity of the individual functions changes with regard to the flow of redundant information.)
Based on this, I would say that the question isn't really whether it is true for NNs & brains in general; it's more helpful to ask what information is abstracted with specific capabilities such as vision or access to language.
So it's more about the class of agents that follow these constraints, which is probably a subset of both NNs & brains in specific information environments.

[-]rpglover643y20

I had an insight about the implications of NAH which I believe is useful to communicate if true and useful to dispel if false; I don't think it has been explicitly mentioned before.

One of Eliezer's examples is "The AI must be able to make a cellularly identical but not molecularly identical duplicate of a strawberry." One of the difficulties is explaining to the AI what that means. This is a problem with communicating across different ontologies--the AI sees the world completely differently than we do. If NAH in a strong sense is true, then this problem goes away on its own as capabilities increase; that is, AGI will understand us when we communicate something that has a coherent natural interpretation, even without extra effort on our part to translate it to the AGI version of machine code.

Does that seem right?

[-]LawrenceC3y40

That seems included in the argument of this section, yes.

[-]Noosphere893y20

Basically this. It has other directions, but I do think the NAH is trying to investigate how hard translating between ontologies is as capabilities scale up.

[-]romeostevensit3yΩ120

Tangentially related: recent discussion raising a seemingly surprising point about LLMs being lossless compression finders https://www.youtube.com/watch?v=dO4TPJkeaaU

[-]Arjun Pitchanathan6moΩ110

representation of a variable $X$ for variable $Y$

Hm, I don't understand what $Y$ is supposed to be here.

[-]Leon Lang6moΩ230

Y would be a target variable that one wants to predict, e.g. in supervised learning.

In other words, one wants to learn a model that predicts Y from X. As an intermediate step, one creates a representation T of X. T does not need to keep *all* information about X, but it needs to keep enough information to be able to extract Y from it. The remaining information about X is minimized.

Does this clarify it?
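
A toy calculation may make the roles of $X$, $T$, and $Y$ described above more concrete. This is only an illustrative sketch: the two-bit variable, the choice of target, and the helper names (`mutual_information`, `target`, `represent`) are assumptions made for the example and do not come from the post. Here $X$ is uniform over four values, $Y$ is its low bit, and $T$ keeps only that bit.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information I(A;B) in bits, treating each (a, b) pair as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal_a = Counter(a for a, _ in pairs)
    marginal_b = Counter(b for _, b in pairs)
    return sum(
        (count / n)
        * math.log2((count / n) / ((marginal_a[a] / n) * (marginal_b[b] / n)))
        for (a, b), count in joint.items()
    )


def target(x):
    """Y: the quantity we want to predict (hypothetical choice: the low bit of X)."""
    return x % 2


def represent(x):
    """T: a representation of X that discards the high bit."""
    return x % 2


xs = [0, 1, 2, 3]  # X is uniform over two bits

h_x = mutual_information([(x, x) for x in xs])                       # H(X) = I(X;X)
i_xy = mutual_information([(x, target(x)) for x in xs])              # what X knows about Y
i_xt = mutual_information([(x, represent(x)) for x in xs])           # what T keeps of X
i_ty = mutual_information([(represent(x), target(x)) for x in xs])   # what T knows about Y

print(f"H(X)={h_x:.1f}  I(X;Y)={i_xy:.1f}  I(X;T)={i_xt:.1f}  I(T;Y)={i_ty:.1f}")
```

In this toy setup, $T$ retains the full 1 bit that $X$ carries about $Y$ while keeping only 1 of the 2 bits in $X$, which is the trade-off the explanation above describes.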

[-]Arjun Pitchanathan6mo30

Yes, thanks!

[-]Review Bot2y*10

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

[-]Roman Leventov3y00

Relevant academic work that will hopefully be interesting to someone wrt. NAH. From Fields et al. (2023):

We show in this paper that control flow in such systems can always be formally described as a tensor network, a factorization of some overall tensor (i.e., high-dimensional matrix) operator into multiple component tensor operators that are pairwise contracted on shared degrees of freedom [48]. In particular, we show that the factorization conditions that allow the construction of a TN are exactly the same as those that allow the identification of distinct, mutually conditionally independent (in quantum terms, decoherent), sets of data on the MB, and hence allow the identification of distinct “objects” or “features” in the environment. This equivalence allows the topological structures of TNs – many of which have been well-characterized in applications of the TN formalism to other domains [48] – to be employed as a classification of control structures in active inference systems; including cells, organisms, and multi-organism communities. It allows, in particular, a principled approach to the question of whether, and to what extent, a cognitive system can impose a decompositional or mereological (i.e., part-whole) structure on its environment. Such structures naturally invoke a notion of locality, and hence of geometry. The geometry of spacetime itself has been described as a particular TN – a multiscale entanglement renormalization ansatz (MERA) [49, 50, 51] – suggesting a deep link between control flow in systems capable of observing spacetime (i.e., capable of implementing internal representations of spacetime) and the deep structure of spacetime as a physical construct.

^{^}

John mentioned a caveat on this to us:

Note that I sometimes hedge about whether "the natural abstractions" are $F(X)$ itself, or whether they're a latent variable of which $F(X)$ is an estimate. The latter is probably the right answer, but we'd expect in typical systems that the estimate is very precise, so the distinction doesn't matter much. (Prototypical example: average particle energy in one chunk of a gas as an estimate of the temperature of the gas.)
[Further explanation after some discussion with us:]
Latent variables, in general, are not necessarily fully determined by the physical state of the universe; that much just naturally drops out of the math. Latents are just these mathematical constructs. They can be predictively useful and powerful, while still mathematically having uncertainty separate from the state of the world.

Another way to frame it: consider the Kolmogorov complexity/Solomonoff induction view. From a God's-eye view, we could observe the entire low-level state of the universe, then find the shortest program which outputs that state. And it's entirely possible that that shortest program contains some variables whose values we are unable to perfectly estimate, even knowing the entire low-level state of the universe. (In the Kolmogorov context, this means that there are multiple different programs with approximately-the-same length which all output the observed universe-state, and all have very similar structure, but assign different values to corresponding variables.) What our uncertainty is over is the values of the latent variables - i.e. the internal variables used by the programs which approximately-maximally compress the low-level universe state. Insofar as the programs are near-optimal compressions, that uncertainty should be small, but it's not necessarily zero. And of course those internal variables can be predictively useful and powerful for modeling the world, even if their values are not fully determinable from the full world-state.

We're not sure whether we fully understand his views here, and in any case think this distinction shouldn't matter too much for the rest of our post, so we won't discuss it further.

^{^}

The (slight) difference is that Gibbs sampling is typically defined as resampling $X_{1}$, then $X_{2}$, and so on, wrapping around to $X_{1}$ after each variable has been resampled once. In contrast, John proposes randomly choosing which variable to resample at each step.
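
The difference between the two update schedules can be sketched in a few lines. This is a minimal illustration under stated assumptions: the two-variable joint table `JOINT` and the function names are hypothetical, chosen only to contrast systematic-scan with random-scan resampling, and are not taken from John's posts.

```python
import random

# Hypothetical joint distribution over two binary variables that tend to agree.
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def resample(state, i, rng):
    """Resample variable i from its conditional distribution given all other variables."""
    weights = []
    for value in (0, 1):
        candidate = list(state)
        candidate[i] = value
        # Unnormalized conditional weight: joint probability with X_i set to `value`.
        weights.append(JOINT[tuple(candidate)])
    new_state = list(state)
    new_state[i] = rng.choices((0, 1), weights=weights)[0]
    return tuple(new_state)


def systematic_scan(state, sweeps, rng):
    """Typical Gibbs sampling: resample X_1, X_2, ... in order, then wrap around."""
    for _ in range(sweeps):
        for i in range(len(state)):
            state = resample(state, i, rng)
    return state


def random_scan(state, steps, rng):
    """Random-scan variant: pick the variable to resample uniformly at random each step."""
    for _ in range(steps):
        i = rng.randrange(len(state))
        state = resample(state, i, rng)
    return state


rng = random.Random(0)
print(systematic_scan((0, 1), sweeps=10, rng=rng))
print(random_scan((0, 1), steps=20, rng=rng))
```

Both schedules use the same single-variable conditional update; they differ only in how the index of the variable to resample is chosen.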

^{^}

Note that it's currently not quite clear in which sense anything converges here; see the appendix for some notes on further formalization of $X^{\infty}$.

^{^}

It’s certainly possible that the connection between theoretical progress so far and future empirical tests is just not meant to be fully legible based on John’s public writing.

