Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is going to be a half-assed review of John Wentworth's research. I studied his work last year, and was kinda hoping to write up a better review, but am lowering my standards on account of how that wasn't happening.

Short version: I've been unimpressed by John's technical ideas related to the natural abstractions hypothesis. He seems to me to have some fine intuitions, and to possess various laudable properties such as a vision for solving the whole dang problem and the ability to consider that everybody else is missing something obvious. That said, I've found his technical ideas to be oversold and underwhelming whenever I look closely.

(By my lights, Lawrence Chan, Leon Lang, and Erik Jenner’s recent post on natural abstractions is overall better than this post, being more thorough and putting a finger more precisely on various fishy parts of John's math. I'm publishing this draft anyway because my post adds a few points that I think are also useful (especially in the section “The Dream”).)

To cite a specific example of a technical claim of John's that does not seem to me to hold up under scrutiny:

John has previously claimed that markets are a better model of intelligence than agents, because while collective agents don't have preference cycles, they're willing to pass up certain gains.

For example, if an alien tries to sell a basket "Alice loses $1, Bob gains $3", then the market will refuse (because Alice will refuse); and if the alien then switches to selling "Alice gains $3, Bob loses $1" then the market will refuse (because Bob will refuse); but now a certain gain has been passed over.

This argument seems straightforwardly wrong to me, as summarized in a stylized dialogue I wrote (that includes more details about the point). If Alice and Bob are sufficiently capable reasoners then they take both trades and even things out using a side channel. (And even if they don't have a side channel, there are positive-EV contracts they can enter into in advance before they know who will be favored. And if they reason using LDT, they ofc don't need to sign contracts in advance.)
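The arithmetic behind the side-channel point can be sketched in a few lines. This is a toy model only: the unanimity rule, the helper names, and the $2 side payment are illustrative assumptions, not anything taken from John's posts or from the dialogue.

```python
# Toy model of the two baskets from the text; amounts are in dollars.
# Each basket maps a trader to that trader's gain (negative = loss).
BASKET_1 = {"Alice": -1, "Bob": 3}
BASKET_2 = {"Alice": 3, "Bob": -1}


def accepts_without_transfers(basket):
    """Unanimity rule (an assumption): accept only if nobody loses."""
    return all(gain >= 0 for gain in basket.values())


def apply_with_side_payment(basket, payer, payee, amount):
    """Accept the basket, then move `amount` from `payer` to `payee`."""
    outcome = dict(basket)
    outcome[payer] -= amount
    outcome[payee] += amount
    return outcome


# Without a side channel, each basket is vetoed by whoever loses.
accepted = [accepts_without_transfers(b) for b in (BASKET_1, BASKET_2)]

# With a side channel, the winner of each basket pays the loser $2
# (a hypothetical split; any payment strictly between $1 and $3 works).
after_1 = apply_with_side_payment(BASKET_1, payer="Bob", payee="Alice", amount=2)
after_2 = apply_with_side_payment(BASKET_2, payer="Alice", payee="Bob", amount=2)
total = {name: after_1[name] + after_2[name] for name in ("Alice", "Bob")}

print(accepted)  # [False, False]
print(after_1)   # {'Alice': 1, 'Bob': 1}
print(total)     # {'Alice': 2, 'Bob': 2}
```

With the transfer in place, each basket on its own already leaves both traders ahead, so neither has a reason to veto it.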

(Aside: A bunch of the difficult labor in evaluating technical claims is in the part where you take a high-falutin' abstract thing like "markets are a better model of intelligence than agents" and pound on it until you get a specific minimal example like "neither of the alien's baskets is accepted by a market consisting of two people named Alice and Bob", at which point the error becomes clear. I haven't seen anybody else do that sort of distillation with John's claims. It seems to me that our community has a dearth of this kind of distillation work. If you're eager to do alignment work, don't know how to help, and think you can do some of this sort of distillation, I recommend trying. MATS might be able to help out.)

I pointed this out to John, and (to John's credit) he seemed to update (in realtime, which is rare) ((albeit with a caveat that communicating the point took a while, and didn't transmit the first few times that I tried to say it abstractly before having done the distillation labor)). The dialogue I wrote recounting that convo is probably not an entirely unfair summary (John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment).

My impression of John's other technical claims about natural abstractions is that they have similar issues. That said, I don't have nearly so crisp a distillation of John's views on natural abstractions, nor nearly so short a refutation. I spent a significant amount of time looking into John’s relevant views (we had overlapping travel plans and conspired to share a cross-country flight together, and a comedy of airline mishaps extended that journey to ~36 hours, during which I just would not stop pestering him) ((we also spent a decent amount of time going back-and-forth online)), and got a few whiffs of things that didn't sit right with me, although I was not quite able to get a sufficiently comprehensive understanding of the technical details to complete the distillation process and pin down my concerns.

This review has languished in my backlog long enough that I have now given up on completing that comprehend-digest-distill process, and so what follows is some poorly-organized and undigested reasons why I'm unpersuaded and unimpressed by John's technical work on natural abstractions. (John: sorry for not having something more distilled. Or more timely.)


The Dream

(wherein Nate attempts to motivate natural abstractions in his own words)

Suppose we have a box containing an ideal gas (that evolves according to classical mechanics, to keep things simple). Suppose we've numbered the particles, and we're tasked with predicting the velocity of particle #57 after some long time interval. Suppose we understand the initial conditions up to some finite degree of precision. In theory, if our knowledge of the initial conditions is precise enough and if we have enough computing power, then we can predict the velocity of particle #57 quite precisely. However, if we lack precision or computing power, the best we can do is probably a Maxwell-Boltzmann distribution, subject to the constraint that the expected energy (of any given particle) is the average energy (per-particle in the initial conditions).

This is interesting, because it suggests a sharp divide between the sorts of predictions that are accessible to a superintelligence, and the predictions that are accessible to an omniscience. Someone with enough knowledge of the initial conditions and with enough compute to simulate the entire history exactly can get the right answer, whereas everyone with substantially less power than that—be they superintelligences or humans—is relegated to a Maxwell-Boltzmann distribution subject to an energy constraint.

(And in real life, not even the gods can know the initial conditions in sufficient detail, because there's no such thing as "the" initial conditions; we're in a quantum multiverse.)

So, in some sense, "energy" (or something equivalent) is an abstraction that even superintelligences must use in some form, if they want to predict the velocity of particle #57 after a really long time. It's not like you get a little smarter and then notice the existence of "double-energy" which lets you predict the velocity even better than we can. There's a gradient of how well you can predict the velocity as you fumble around understanding how physics works, and then there's the Maxwell-Boltzmann prediction that you make once you understand what the heck is going on, and then there's a vast barren plateau from here to "perfect simulation", in which even the superintelligences can do no better.
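A rough numerical illustration of this picture, under loud assumptions: the sketch below is not classical mechanics. It swaps in a toy random-exchange model (pairs of particles pool their energy and split it uniformly at random) as a stand-in for chaotic mixing, and all function and parameter names are hypothetical. Total energy is conserved by construction, and after enough exchanges the single-particle energies should look like an exponential (Boltzmann) distribution whose only free parameter is the average energy.

```python
import random


def simulate_energy_exchange(n_particles=2000, n_collisions=200_000,
                             mean_energy=1.0, seed=0):
    """Toy 'gas': random pairs pool their energy and split it at random.

    This is NOT classical mechanics; it is a hypothetical stand-in for
    chaotic mixing that conserves total energy by construction.
    """
    rng = random.Random(seed)
    # Start far from equilibrium: every particle has exactly the mean energy.
    energies = [mean_energy] * n_particles
    for _ in range(n_collisions):
        i, j = rng.sample(range(n_particles), 2)
        pooled = energies[i] + energies[j]
        share = rng.random()          # uniform split of the pooled energy
        energies[i] = share * pooled
        energies[j] = pooled - energies[i]
    return energies


energies = simulate_energy_exchange()
average = sum(energies) / len(energies)

# If single-particle energies follow an exponential distribution with this
# mean, a fraction 1 - 1/e (about 0.632) of particles lies below the mean.
fraction_below_mean = sum(e < average for e in energies) / len(energies)

print(f"average energy per particle: {average:.6f}")
print(f"fraction below the mean:     {fraction_below_mean:.3f}")
```

The point of the toy is only that the initial condition (every particle at exactly the mean) is forgotten, while the conserved average survives and fixes the shape of the prediction.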

In the simple case of an ideal gas evolving classically, we can probably prove some theorems corresponding to this claim. I haven't seen theorems written from exactly this point of view, but if you're technically inclined you can probably prove something like "time-evolution is ergodic within regions of phase-space of constant energy", or *cough* *cough* chaos *cough* so the only facts that are practically predictable in thermodynamic equilibrium correspond directly to conservation laws. Or something.

This is relevant to our interests, because we sure would like a better understanding of when and where the abstractions of humans and the abstractions of superintelligences overlap. "Convergently useful" low-level abstractions could help us with ontology identification; mastery of convergently-useful abstractions could help us manufacture circumstances that make the AI converge on humane abstractions; etc.

The wild hope, here, is that all human concepts have some nature kinda like "energy" has in the simplest toy model of statistical mechanics. Like, obviously "trees" and "windows" are much more complicated concepts than "energy"; obviously there isn't going to be quite so crisp a notion of "the best concepts you can use short of simulating the whole system". But, like, various distributions from statistical mechanics turn out to be empirically useful even though our universe isn't in thermodynamic equilibrium yet, and so there's some hope that these "idealized" or "convergently instrumentally useful" concepts degrade cleanly into practical real-world concepts like "trees" and "windows". Which are hopefully so convergently instrumentally useful that the AIs also use them.

And, by understanding the circumstances that give rise to convergently-useful abstractions, and by understanding the extent of their reach, we might gain the ability to recognize them inside an AI's mind (rendering it much less alien to us), and/or to distinguish the concepts we care about from nominally nearby ones, and/or to shape the AI's learning such that its abstractions are particularly human-recognizable.

That's the dream of natural abstractions, as I understand it.[1]

I think this is a fine dream. It’s a dream I developed independently at MIRI a number of years ago, in interaction with others. A big reason why I slogged through a review of John's work is because he seemed to be attempting to pursue a pathway that appeals to me personally, and I had some hope that he would be able to go farther than I could have.

John's research plans are broader than just this one dream, as I understand it; I'm going to focus on this one anyway because I think it's a dream that John and I share, and it's the one that I poked at in the past.

When I spent time thinking about this topic, my main hope was that an improved understanding might allow us to shape abstractions in alien minds. I credit John with the additional observation that we might use an improved understanding to recognize abstractions in alien minds.


Natural Abstractions

The following is an idea in pursuit of the above dream that I attribute to John (with my distillation and word choice):

It's no mistake that, in the gas example, the only free variable in the bounded-optimal epistemic state (a Maxwell-Boltzmann distribution) is a conserved quantity (the energy). Intuitively, we can think of physics as decomposing into some facts about the initial conditions that hold true indefinitely, plus a bunch of other facts that are washed away by chaos and time. From that point of view, it's no surprise that the facts (about the initial conditions) that are still practically predictively useful after an arbitrarily long amount of time (and chaotic particle-interaction) are the conserved quantities. If the information hangs around for arbitrary amounts of time, then it's conserved; that's practically a definition of "conserved"!

This yields a candidate definition of "convergently useful" abstractions: they're simply conserved quantities.

Of course, taking that definition literally is too strict. Windows are not literally conserved by the laws of physics, and nor are their defining characteristics (such as width or tint). And, of course, physics is not in thermodynamic equilibrium but for the windows. If we want to find a more general notion of "natural abstractions", we'll need to weaken the demands of literal conservation, and literal thermodynamic equilibrium.

A followup idea that I attribute to John (with my partial distillation):

Consider the workings of a Boltzmann distribution. The probability of a state is penalized by some factor (the coldness) for every unit of energy in that state. This works when the energy in the state is small relative to the energy in the whole system.[2] We can think of the Boltzmann distribution as saying:

"Because the thing we're considering is small, we can ignore its effect on everything else, and estimate its state as being concordant with all the surrounding states. (And, because all non-conserved information has been washed away by chaos and time, we use the maximum-entropy / minimum-information distribution that fits that constraint.)"

What the Boltzmann distribution needs is not that energy is literally physically conserved, but that the average energy can be faithfully inferred from the surroundings. This will happen whenever we have a conserved quantity in thermodynamic equilibrium,[3] but we can just cut to the chase and ask for this "inferable from the surroundings" property instead.
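The "small relative to the whole system" condition can be checked exactly in a dice toy model, where die faces play the role of units of energy and the known sum plays the role of total energy. The sketch below is an illustration under those assumptions (the helper names are hypothetical): it computes the exact distribution of one die given the sum of all the dice, and compares it with a Boltzmann-style distribution proportional to exp(-beta * k) whose mean matches the average face value.

```python
import math


def add_die(counts):
    """Given counts[s] = ways for m dice to sum to s, return counts for m + 1."""
    new = [0] * (len(counts) + 6)
    for s, c in enumerate(counts):
        if c:
            for face in range(1, 7):
                new[s + face] += c
    return new


def first_die_given_sum(n_dice, total):
    """Exact P(first die = k | sum of n_dice fair dice = total), k = 1..6."""
    rest = [1]                      # zero dice: one way to reach sum 0
    for _ in range(n_dice - 1):
        rest = add_die(rest)
    everything = add_die(rest)
    probs = []
    for k in range(1, 7):
        rest_sum = total - k
        ways = rest[rest_sum] if 0 <= rest_sum < len(rest) else 0
        probs.append(ways / everything[total])
    return probs


def boltzmann_over_faces(mean):
    """P(k) proportional to exp(-beta * k) on k = 1..6, with beta fit to `mean`."""
    def mean_at(beta):
        weights = [math.exp(-beta * k) for k in range(1, 7)]
        return sum(k * w for k, w in zip(range(1, 7), weights)) / sum(weights)

    lo, hi = -10.0, 10.0            # mean_at is decreasing in beta
    for _ in range(100):
        mid = (lo + hi) / 2
        if mean_at(mid) > mean:
            lo = mid
        else:
            hi = mid
    weights = [math.exp(-lo * k) for k in range(1, 7)]
    norm = sum(weights)
    return [w / norm for w in weights]


two_dice = first_die_given_sum(2, 5)
many_dice = first_die_given_sum(200, 500)
boltzmann = boltzmann_over_faces(2.5)
worst_gap = max(abs(p - q) for p, q in zip(many_dice, boltzmann))

print(two_dice)     # [0.25, 0.25, 0.25, 0.25, 0.0, 0.0]
print(worst_gap)
```

With two dice the hard constraint dominates (a five or six is impossible), whereas with two hundred dice at the same average the exact conditional distribution should sit very close to the exponential form.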

(And, notably, this is what we're already doing when we use statistical mechanics to predict the behavior of a gas. We don't literally measure the energy in the whole gas, by measuring the positions and velocities and masses of each particle and running an annoying calculation. We let a few of the particles bump against our thermometer, and against our barometer, and against our scale, and then we scribble out some calculations and pretend we know the total energy. Where that last step is invoking the fact that the total energy can be accurately inferred from a few measures of the average energy, at least in a relatively closed system that isn't about to get struck by the oh my god particle.)

Bringing the discussion back to windows: there are lots of properties of any given window that can be inferred by looking not at it, but at all other windows: width, height, tint, texture. Perhaps we're supposed to identify natural abstractions that relate to a given object with all the properties of that object that we could (probabilistically) reconstruct if we forgot everything we knew about that object.

To which I say: That sounds like an interesting idea! I don't understand it, and I'm not sure that it makes sense: for instance, trying to define the abstraction "window" as "that which you can deduce from other windows if you forget everything about one window" seems circular, and I'm not sure whether the circularity is fatal. And I have only a vague understanding of how this is supposed to generalize the case with energy. But if I squint, I can see how maybe it could be fleshed out into something interesting.

At which point John says something like "well today's your lucky day; I have math!". But, unfortunately, I wasn't able to make sense of the math, despite trying.


The Math

John claims a proof-sketch of the following:

Take an infinite set of random variables $X_i$ with distribution $P[X]$. Consider the following process:

- Sample some values $x$ from $P[X]$

- Pick one of the $X_i$s at random (with nonzero probability for each), and resample that $x_i$ value conditional on the values of all the other variables. Repeat until the distribution converges.

Take the most general function $F(X)$ whose value is conserved by this process. (Equivalently, find the most general $F$ whose value is conserved by resampling any individual variable conditional on the others.)

Then I claim that, for any infinite sequence of distinct variables $X_{i_1}, X_{i_2}, \ldots$ drawn from the set $X$, and any finite-entropy function $f(X)$, the mutual information between $f(X)$ and variables in the “tail” of our sequence past $X_{i_n}$ approaches 0 conditional on $F(X)$ as $n \to \infty$:

$$\lim_{n \to \infty} MI\big(f(X);\ X_{i_{n+1}}, X_{i_{n+2}}, \ldots \,\big|\, F(X)\big) = 0$$

(He had another few versions, allegedly with fuller proofs, though I was not able to understand them and focused on this one.)
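The resampling process in the quoted statement can be illustrated on a finite toy, with the loud caveat that the theorem as quoted concerns infinitely many variables, so this sketch shows only what "conserved by resampling" means and says nothing about the claimed limit. The three-variable joint distribution and all names below are hypothetical: two variables are perfect copies of one fair coin, and a third is an independent fair coin.

```python
import random

# Hypothetical finite toy: X = (X1, X2, X3), where X1 and X2 are perfect
# copies of one fair coin and X3 is an independent fair coin.
JOINT = {
    (0, 0, 0): 0.25, (0, 0, 1): 0.25,
    (1, 1, 0): 0.25, (1, 1, 1): 0.25,
}


def resample_one(x, i, rng):
    """Resample coordinate i of x conditional on all other coordinates."""
    candidates = []
    weights = []
    for value in (0, 1):
        y = x[:i] + (value,) + x[i + 1:]
        p = JOINT.get(y, 0.0)
        if p > 0:
            candidates.append(y)
            weights.append(p)
    return rng.choices(candidates, weights=weights, k=1)[0]


def run_resampling(x, n_steps, seed=0):
    """Repeatedly pick a random coordinate and resample it; return the path."""
    rng = random.Random(seed)
    path = [x]
    for _ in range(n_steps):
        x = resample_one(x, rng.randrange(len(x)), rng)
        path.append(x)
    return path


path = run_resampling((1, 1, 0), n_steps=1000)
shared_bit_values = {x[0] for x in path}   # candidate conserved function
third_coin_values = {x[2] for x in path}   # not conserved

print(shared_bit_values)   # {1}
print(third_coin_values)   # {0, 1}
```

In this toy the shared bit is redundantly stored, so resampling any single variable can never change it, while the independent coin is washed out immediately.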

And... I'm not really up for rehashing the whole discussion we had here. But the short version is that I found a counterexample where $X_{i_n}$ is constant, and John was like "oh, no, the $X_{i_n}$s have to be non-repeating", and I was like "wait so this theorem only works if we have a literal infinitude of variables?" and he was like "yes".

And then I constructed another (infinite) example where the mutual information (MI) in the limit was not the limit of the mutual information in the finite approximations, and I was like “???”. And John was like "you got the sign wrong, but yes, the MI in the limit is not the limit of the MIs in each finite approximation." And I was like "Then how is this supposed to tell me anything about windows?? There are only finitely many windows!"[4]


My Concerns

I don't particularly doubt that John's theorem is true. My issue is that, as far as I've been able to figure out, it works only in the case where we have infinitely many independent random variables. I do not know a form that behaves nicely in finite approximations, and I was not able to extract one from John (despite trying).

(This despite John handwaving past a variety of technical subtleties, on the grounds that he's aiming for the virtues of a physicist rather than a mathematician. I'm all for using frameworks that are making sharp empirical predictions regardless of whether we've sorted out the theoretical technicalities. But this isn't an issue where the mathematician is saying "hey wait, do those integrals actually commute?" and the physicist is saying "probably!". This is a case where the math only works in the infinite case and is not shedding light here in a world with only finitely many windows.)

(Like, physicists sometimes write down equations that work well in finite approximations, and ignore the mathematicians as they complain that their series blows up when taken to infinity. If you're doing that, I have no objection. But John's doing the opposite! His theorem works in the infinite case, and doesn't say anything interesting about any finite case! John claiming the virtues of a physicist in his technical work, and then having stuff like this come up when I look closer, feels to me like a microcosm of my overall impression. I found it quite frustrating.)

This shortcoming strikes me as fatal: I already know how to identify abstractions like "energy" in extremely well-behaved cases; I'm trying to understand how we weaken our demands while keeping a working notion of "natural abstraction". Understanding this theorem might teach me something about how to shift from demanding perfect conservation and ergodicity/chaos/whatever outside of that one conservation law, which would be nice. But it doesn't even have anything to say in the finite case. Which means it has nothing to say about windows, of which there are finitely many.

Worse, my ability to understand the theorem is hindered by my inability to construct finite examples.

The math could perhaps be repaired to yield reasonable finite approximations, and those might well teach us something interesting. I didn't manage to get that far. I tried to understand the theorem, and tried a little to repair it to work in finite approximations. John and I disagreed about what sort of methods were likely to lead to repair. I gave up.

To be clear, it seems pretty plausible to me that something like this theorem can be repaired. As I said at the beginning, I think that John has some pretty good intuitions, and is trying to go in some reasonable directions. I'm just disappointed with his actual results, and think that he's often doing shoddy technical work, drawing wrong conclusions from it, and then building off of them enthusiastically.

And, to be clear: I'm interested in seeing a repaired theorem! It's plausible to me that a repaired version has something to teach me about how to identify convergently-useful abstractions. I don't have quite enough hope/vision about it to have done it myself in the past 11 months, but it's on my list of things to do, and I'd love a solution.[5]


The Generalized Koopman-Pitman-Darmois theorem

One reason that I didn't have enough hope/vision myself to attempt to repair John's mutual-information theorem is because I didn't really see the connection from that theorem back to the dream (even if I imagined it making sense in worlds with only finitely many windows, which I don't yet see how to do). Like, OK, sure, perhaps we can take any physical system and ask which properties are kinda-sorta "conserved" in the sense that that information is repeated many times in the larger world. Suppose I grant that. What then? Where are we going?

I tried probing John on this point, and largely wasn't able to make sense of his English-language utterances. But he seemed to be pursuing a dream that I also have, and claimed to have even more math! "You're asking about the generalized Koopman-Pitman-Darmois theorem!", he said.

And, sure, I'm always up for looking at your allegedly-related math. Especially when your English descriptions aren't making sense to me. So I took a look!

I was previously planning to process this section more, and give an intuitive description of the gKPD theorem and a summary of my take, but since this post has languished for a year, I’ll just post a section of chat logs (with some typos fixed) in which I talk John's ear off about KPD. (It contains some context and backreferences that will probably be confusing. Sorry.)


My current guess is pretty strongly that the princess is in another castle. 

Some background, before arguing this: according to me, the way the boltzmann distro works in statmech, is that we imagine that reality is some big system (drawn from a space i'll call $\Omega$) and each $\omega$ has some energy $E(\omega)$. we get to observe some small fragment of that system (drawn from a space i'll call $X$), and each $x$ has its own energy, $E(x)$. then there's some function $\beta$ of the overall-energy, such that for each fixed energy $E$, we have

$$P\big(x \mid E(\omega) = E\big) \propto e^{-\beta(E)\, E(x)}$$

in other words, the probability of finding the observed microstate (of the small fragment of the system we manage to observe) in any given configuration, drops exponentially in the energy of that microstate. 

normally we suppress $E$, b/c we don't know the total energy, and it doesn't much matter to us given that energy is conserved (doubly so if we're taking ratios) 

like, the fact that lower-energy microstates (in the fragment of the system we observe) are exponentially more likely (as a function of their low-energy-ness), remains regardless of what the actual value of $E$ is. 

this seems to me like the standard statmech story. 

it seems plausible to me that you have a very different understanding of statmech. if so, i'm interested how you're interpreting it. 

in particular, it seems plausible to me that you aren't thinking of $x$ as a fragment of some larger system. 

(one thing i like about this hypothesis is that it explains why you think the energy and the temperature "remain" "as latents" or w/e even when you condition on "the entire microstate") 

like, the way I read statmech, the total energy (and thus the temperature) continue to have uncertainty once you condition on the entirety of $x$, but they are fully deterministic in $\omega$ 

(and statmech was only ever telling us about what happens when we observe some fragment of the system; observing the whole microstate of reality being rather difficult) 

i'm not sure what alternative read of statmech you're taking (or even whether you have one) 


what i think KPD says, in general, is that when we have some large system $\Omega$, which has some small-as-a-fraction-of-$\Omega$ fragment $X$ (that itself factors as $X_1, \ldots, X_n$) we're going to observe and some parameter $\Theta$ that renders all the factors of $X$ independent, then if we have some way $G$ of summarizing the relevance of $X$ to $\Theta$ into a summary that's small-in-comparison-to-$X$, then $P[X \mid \Theta]$ is an exponential family with parameters drawn from $\Theta$, and with $\dim(G)$-many statistics. 

and the theorem basically says "that's what '$G$ summarizes the relevance of $X$ to $\Theta$ into a small summary' means", which is legit. 

but -- i'd say -- the magic step is the existence of $G$ in the first place. 

like, suppose we use the john-idea of "look at the data required to screen off two distant variables, take that as $\Theta$". now we have $\Theta$ 

to be useful, $X$ probably has to be something like "the data i observe", or some subset thereof. 

one immediate problem here is that the data i observe isn't obviously sufficiently far apart to be screened off by $\Theta$, b/c, eg, i often observe a bunch of stuff that's right close together. 

but handwave that problem away; like, suppose we cluster our data into chunks that are sufficiently far apart, and call each chunk one of the $X_i$, or whatever. 

what's this $G$ thing? 

$G$ is a function that summarizes all the relevance of the data we've seen so far, to the underlying parameters. 

like, in the statmech case, $G$ is the thing that's estimating the energy from the sample. and in the gear example, $G$ is summarizing a patch to an angular momentum. and in a survey of height or w/e, $G$ is summarizing out the mean and the variance. 

and, yes, once you've found that summary, you're a mere stone's throw from an exponential family. 

but the question I have, is when do those sorts of summaries exist, and why? 

like, the answer is obviously "all the time", but the "why" is the important bit. 

insofar as there is oomph in this approach, i expect it to come from mastery of realities-which-admit-these-sorts-of-summaries. 

KPD assumes such a summary. 

the john-style "look at that which renders distant things conditionally independent" argues that there is a thing to summarize about 

but the princess is explaining why there are so often summaries, and how to find them, or something more like this. 

(i'm not saying it has to be particularly tricky) 

(but i am saying that, so far, i have not been able to understand your math as doing much to answer the questions i started with, about what's up with these summary things) 


i hear you as trying to say something like "$\Theta$ includes only the information that is arbitrarily well-preserved, which can't be all that much, and intuitively all such data should admit sufficient statistics from any small patch of observations, which should be in similarly low quantity" 

and i'm like "i agree that something like this looks like it happens in real life (at least at the statmech level, and maaaybe also at the window level); the mastery comes from understanding in a technical sense why, and under what conditions, this is true" 

where the small elephant in the room is something like "under what technical circumstances is not-much-data sufficiently-well-preserved between one of my window observations and another?" 

and the big elephant in the room is "under what technical circumstances is it cheap to summarize the impact of one window-observation, on the set of all sufficiently-well-preserved-window-info?" 


Part of my interpretation of the gKPD is that this question reduces to $\Theta$ being small. As long as the information-which-induces-conditional-independence-between-far-apart-things is small, expfam stuff kicks in, and it's cheap to summarize the impact of local observations on the set of sufficiently-well-preserved-info. Agree/disagree with that interpretation? 



the heavy lifting in KPD is done not by $\Theta$ being small, but by $G$ existing. 


Ah, sorry, I was failing to distinguish those again.

Ok, the other things you said make more sense now.

So you want to close the gap between small $\Theta$ and small $G$



i'm like "the big elephant in the room is the existence of small (and, in practice, cheap, but we can start w/ small) $G$"

(in particular, i'm like "having mostly digested KPD, it seems to me like it's just pointing a finger at $G$ and saying 'assuming this is akin to assuming the conclusion'") 


Great. Ok, I agree that is a great big gap in the current theorems, and one I had not been paying sufficient attention to (as evidenced by constantly forgetting to distinguish $\Theta$ from $G$ at all). Thank you.


(and also, in the examples i've looked at, existence of $G$ seems like a pretty heavy assumption, above and beyond existence of $\Theta$)
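The survey-of-heights example from the chat log can be made concrete with a small sketch of what it means for a fixed-size summary to capture everything a sample says about the parameters. This assumes i.i.d. Gaussian data, and the samples and function names are hypothetical. The Gaussian log-likelihood depends on the data only through the count, the sum, and the sum of squares, so two different samples with the same summary are indistinguishable to any choice of mean and variance.

```python
import math


def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. Gaussian data, computed point by point."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


def summary(data):
    """Fixed-size summary: (count, sum, sum of squares)."""
    return (len(data), sum(data), sum(x * x for x in data))


# Two different hypothetical samples with the same summary (3, 12, 62).
sample_a = [1, 5, 6]
sample_b = [2, 3, 7]
# Same count and sum, but a different sum of squares.
sample_c = [4, 4, 4]

for mu, sigma in [(0.0, 1.0), (4.0, 2.0), (10.0, 0.5)]:
    ll_a = gaussian_log_likelihood(sample_a, mu, sigma)
    ll_b = gaussian_log_likelihood(sample_b, mu, sigma)
    ll_c = gaussian_log_likelihood(sample_c, mu, sigma)
    print(mu, sigma, math.isclose(ll_a, ll_b), math.isclose(ll_a, ll_c))
```

This only shows what such a summary looks like once it exists. It does not address the question raised in the chat log of when and why small summaries like this are available.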

My overall take from looking into some of John's stuff is a mix of hope, disappointment, and exasperation.

Hope because I think he is barking up some interesting trees, and he's trying to wrap his whole head around big questions that have a chance of shedding lots of light on AI alignment ("are there convergently useful abstractions, and can we learn to recognize and/or manipulate them?").

Disappointment because when I look closer, he seems to regularly get some technical result that doesn't seem very insightful to me, read things off of it that don't seem particularly correct to me, and then barge off in some direction that doesn't seem particularly promising to me. I was prepared for John's lines of approach to seem unpromising to me—that's par for the course—but the thing where he seems to me to put undue weight on his flimsy technical results was a negative update for me.

Exasperation because of how John treats his technical results. My impression has been that he makes lots of high-falutin' nice-sounding English claims, and claims he has technical results to support them, and slings some math around, but when you look closely the math is... suggestive? Kinda? But not really doing what he seemed to be advertising?

Perhaps I've simply been misreading John, and he's been intending to say "I have some beliefs, and separately I have some suggestive technical results, and they feel kinda related to me! Which is not to say that any onlooker is supposed to be able to read the technical results and then be persuaded of any of my claims; but it feels promising and exciting to me!".[6]

I wouldn't be exasperated if it were apparent to me that John’s doing that. But that's not the impression I got from how John billed his technical results, and I spent time trying to understand them (and think I did an OK job) only to find that the technical results weren't the sort of thing that support his claims; they're the sort of thing that're maybe possibly suggestive if you already have his intuitions and are squinting at them the way that he squints. In particular, they didn't seem very suggestive to me.

I think there's some sort of social institution I want to protect here. There's an art to knowing exactly what your technical results say, and knowing when you can carefully and precisely trace your English-language claims all the way back to their technical support. In the telephone theorem post, when John says stuff like "The theorems in this post show that those summaries are estimates/distributions of deterministic (in the limit) constraints in the systems around us.", I read that as John implying that he knows how to cash out some of his English-language claims into the math of the telephone theorem. And that's not what I found when I looked, at least not in a manner that's legible to me.

I think I'm exasperated in part because this seems to me like it erodes an important type of technical social trust we have around these parts (at least among people at John's level of cohesive pursuit of an alignment agenda; I hold most others to a lower standard). I hereby explicitly request that he be more careful about those sorts of implications in the future.

(While also noting that it's entirely plausible here that I'm misreading things; communication is hard; maybe I'm the only idiot in the room who's not able to understand how John's theorems relate to his words.)

Stepping back even further: John's approach here is not how I would approach achieving the dream that we share, and that I sketched out at the top. (We know this, because I've tried before, and I tried differently.) Which doesn't mean his directions are unpromising! My directions didn't pan out, after all. I think he's pursuing some interesting routes, and I'm interested to see where they go.

While I have some qualms about John thinking that merely-suggestive technical results say more than they do, I am sympathetic to a variety of his intuitions. The art of not reading too much into your technical results seems easier to acquire than the art of having good research intuitions, so on net I'm enthusiastic about John's research directions.

(Less enthusiastic than John, probably, on account of how my guess as to how all this plays out is that there are lots and lots of "natural abstractions" when you weaken the notion enough to allow for windows, and which ones a mind pays attention to winds up being highly contingent on specifics of its architecture, training, and objectives. Mastery in this domain surely would have its uses, but I think I'm much less optimistic than John about using the naturalness of abstractions to give a workable descriptive account of human values, which IIUC is part of John's plan.[7])

Also, thanks again to John for putting up with my pestering.


  1. ^

    Insofar as the above feels like a more concise description of why there might be any hope at all in studying natural abstractions, and what those studies might entail, I reiterate that it seems to me like this community has a dearth of distillations. Alternatively, it's plausible to me that John's motivations make more sense to everyone else than they do to me, and/or that my attempts at explanation make no more sense to anybody else than John's.

  2. ^

    Analogy: if you know that the sum of two dice is 5, then you know that the first die definitely didn't come up six. This is some "extra" information above and beyond the fact that the average dice-value is 2.5. If instead you know that the sum of two thousand dice is 5000, then you can basically just ignore that "extra" information, and focus only on the average value. And somewhere around here, there's a theorem saying that the extra information goes to zero in the limit.

  3. ^

    Or, well, when we know all the conserved properties, and the rest of the laws of physics are sufficiently ergodic or chaotic or something; I'm not sure exactly what theorem we'd want here; I'm just trying to summarize my understanding of John's position. I'd welcome further formalization.

  4. ^

    If you want those examples, then… sorry. I'm going to go ahead and say that they're an exercise for the reader. If nobody else can reconstruct them, and you really want them, I might go delve through the chat logs. (My apologies for the inconvenience. Skipping that delve-and-cleanup process is part of the cost of getting this dang thing out at all, rather than never.)

  5. ^

    I also note that I was super annoying in my attempts to extract a working version of this theorem from John. I started out by trying to probe all his verbal intuitions about the "natural abstractions are like conserved quantities" stuff, and then when I couldn't make any sense of that we went to the math. And, because none of his English phrases were making sense to me, I just meticulously tried to understand the details of the math, which involved a whole lot of not knowing what the heck his notation meant, and a whole lot of inability to fill out partial definitions in "the obvious way", which I suspect was frustrating. Sorry John; thanks for putting up with me.

  6. ^

    But John, commenting on a draft of this post, was like "Nope!" and helpfully provided a quote.

  7. ^

    John noted in a draft of this document that this post of his was largely intended as a response to me on this point.
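The dice analogy in footnote 2 can be poked at numerically. Below is a minimal sketch (the function names are made up for illustration, and I use 200 dice rather than two thousand purely to keep the exact arithmetic fast) that computes the exact distribution of the first die given the sum. With two dice summing to 5, a six is flatly impossible; with 200 dice summing to 500, a six is merely less likely, and what's left looks like a smooth tilt towards low faces with mean 2.5. This illustrates the shape of the claim; it is not a proof of any limit theorem.

```python
from fractions import Fraction


def count_sums(num_dice, sides=6):
    """Map each total to the number of ways num_dice fair dice can produce it."""
    counts = {0: 1}
    for _ in range(num_dice):
        nxt = {}
        for total, ways in counts.items():
            for face in range(1, sides + 1):
                nxt[total + face] = nxt.get(total + face, 0) + ways
        counts = nxt
    return counts


def first_die_given_sum(num_dice, total, sides=6):
    """Exact P(first die = face | all dice sum to total), for face = 1..sides."""
    rest = count_sums(num_dice - 1, sides)
    # The first die shows `face` exactly when the other dice sum to total - face.
    weights = [rest.get(total - face, 0) for face in range(1, sides + 1)]
    norm = sum(weights)
    if norm == 0:
        raise ValueError("that total is impossible for this many dice")
    return [Fraction(w, norm) for w in weights]


if __name__ == "__main__":
    for n in (2, 200):
        dist = first_die_given_sum(n, 5 * n // 2)
        print(n, [round(float(p), 4) for p in dist])
```

In both cases the conditional mean of the first die is exactly 2.5 (by symmetry between the dice); what changes with the number of dice is how much the sum tells you beyond that average.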


A meta-related comment from someone who's not deep into alignment (yet) but does work in AI/academia.

My impression on reading LessWrong has been that the people who are deep into alignment research are generally spending a great deal of their time working on their own independent research agendas, which - naturally - they feel are the most fruitful paths to take for alignment.

I'm glad that we seem to be seeing a few more posts of this nature recently (e.g. with Infra-Bayes, etc) where established researchers spend more of their time both investigating and critiquing others' approaches. This is one good way to get alignment researchers to stack more, imo.


Perhaps I've simply been misreading John, and he's been intending to say "I have some beliefs, and separately I have some suggestive technical results, and they feel kinda related to me! Which is not to say that any onlooker is supposed to be able to read the technical results and then be persuaded of any of my claims; but it feels promising and exciting to me!".

For what it's worth, I ask John about once every month or two about his research progress and his answer has so far been (paraphrased) "I think I am making progress. I don't think I have anything to show you that would definitely convince you of my progress, which is fine because this is a preparadigmatic field. I could give you some high-level summaries or we could try to dive into the math, though I don't think I have anything super robust in the math so far, though I do think I have interesting approaches."

You might have had a totally different experience, but I've definitely had the epistemic state so far that John's math was in the "trying to find remotely reasonable definitions with tenuous connection of formalism to reality" stage, and not the "I have actually demonstrated robust connection of math to reality" stage, so I feel very non-misled by John. A good chunk of this impression comes from random short social interactions I've had with John, so someone who engaged more with just his online writing might come away with a different impression (though I've also done that a lot and don't super feel like John has ever tried to sell me in his writing on having super robust math to back things up).


John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results even so.

Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem are a relatively-shallow consequence of the overly-strong assumption of .

My impression was that I had to go digging into the theorems to see what they said, only to be disappointed by how little resemblance they bore to what I'd heard John imply. (And it sounds to me like Lawrence, Leon, and Erik had a similar experience, although I might be misreading them on account of confirmation bias or w/e.)

I acknowledge that it's tricky to draw a line between "someone has math that they think teaches them something, and is inarticulate about exactly what it teaches" and "someone has math that they don't understand and are overselling". The sort of observation that would push me towards the former end in John's case is stuff like: John being able to gesture more convincingly at ways concepts like "tree" or "window" are related to his conserved-property math even in messy finite cases. I acknowledge that this isn't a super legible distinction and that that's annoying.

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

Note that I continue to think John's cool for pursuing this particular research direction, and I'd enjoy seeing his math further fleshed out (and with more awareness on John's part of its current limitations). I think there might be interesting results down this path.

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

In hindsight, I do think the period when our discussions took place was a local maximum of (my own estimate of the extent of applicability of my math), partially thanks to your input and partially because I was in the process of digesting a bunch of the technical results we talked about and figuring out the next hurdles. In particular, I definitely underestimated the difficulty of extending the results to finite approximations.

That said, I doubt that fully accounts for the difference in perception.



For example, if an alien tries to sell a basket "Alice loses $1, Bob gains $3", then the market will refuse (because Alice will refuse); and if the alien then switches to selling "Alice gains $3, Alice loses $1" then the market will refuse (because Bob will refuse); but now a certain gain has been passed over.

(fixed, thanks)

Not fixed!

For example, if an alien tries to sell a basket "Alice loses $1, Bob gains $3", then the market will refuse (because Alice will refuse); and if the alien then switches to selling "Bob gains $3, Alice loses $1"

(oops! thanks. i now once again think it's been fixed (tho i'm still just permuting things rather than reading))

Disclaimer: I have not read Wentworth's post or the linked one, but I know (a little) about finite-sample and asymptotic bounds.

(He had another few versions, allegedly with fuller proofs, though I was not able to understand them and focused on this one.)

I think the key point of the statement is "any finite-entropy function ". This makes sure that the "infinity" in the sampling goes away. That being said, it should be possible to extend the proof to non-independent samples; Cosma Shalizi has done a ton of work on this.

Speaking of abstractions, recently I've been wondering about whether they can be supported by some sort of counting argument. Suppose you live in a universe with volume $V$, and you are trying to create a taxonomy for objects of size $v$. Intuitively, it seems like there should be $\exp(O(v))$ many conceivable such objects. But there's only space for $V/v$ objects, so if $\exp(O(v)) \gg V/v$, we can fall back to "allocate a new label for each unique object", which still allows us to abstract relatively cheaply (at $O(\log(V/v))$ cost). (Which is a tool we regularly use for abstraction! Like for people, companies, etc., we do have some abstract taxonomies, but mostly we refer to them by unique labels. After conditioning on their class of course; clearly "person" contains a lot of information.)

Of course allocating a new label for each unique object is often not so satisfying. In practice we can usually do way better than that, so the counting argument must be missing something. But some of the things the counting argument misses seem to be things that are covered by previous theories of abstraction, so maybe by combining multiple approaches we can get a more comprehensive theory. In a way, if we think of traditional abstraction methods as looking at a low-rank approximation of stuff, then the new "allocate a new label for each unique object" algorithm gives you a sparse approximation of stuff, and we know that stuff is usually well-approximated by low-rank + sparse.
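To put rough numbers on the counting argument, here is a minimal sketch. The specifics are assumptions for illustration only: objects are modelled as bit strings of length `object_size` (so there are `2**object_size` conceivable objects), the universe holds `universe_volume // object_size` of them, and the helper names are invented. It compares the bits needed to pick an object out of everything conceivable against the bits needed for a unique label among the objects that actually fit.

```python
def full_description_bits(object_size: int) -> int:
    """Bits to single out one object among all 2**object_size conceivable ones."""
    return object_size


def label_cost_bits(universe_volume: int, object_size: int) -> int:
    """Bits to give a distinct label to every object that physically fits."""
    slots = universe_volume // object_size
    if slots < 1:
        raise ValueError("the universe cannot hold even one such object")
    # (slots - 1).bit_length() is an exact integer ceil(log2(slots)) for slots >= 1.
    return max(1, (slots - 1).bit_length())


if __name__ == "__main__":
    V, v = 10**12, 1000
    print("full description:", full_description_bits(v), "bits")
    print("unique label:    ", label_cost_bits(V, v), "bits")
```

Under these toy assumptions the label is far cheaper than the full description whenever the number of conceivable objects dwarfs the number of slots, which is the regime the argument cares about.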

Object-level: This updated me closer to Nate's view, though I think abstractions-but-not-necessarily-natural-ones would still be valuable enough to justify quite a lot of focus on them.

Meta-level: This furthers my hunch that the AI alignment field is an absurdly inadequate market, lacking what-seems-to-me-like basic infrastructure. In other fields with smart people, informal results don't sit around, unexamined and undistilled, for months on end.

I'm not even sure where the bottleneck is on this anymore; do we lack infrastructure due to a lack of funds, or a lack of talent? (My current answer: More high-talent people may be needed to get a field paradigm, more medium-talent people would help the infrastructure, funders would help infrastructure, and funders are waiting on a field paradigm. Gah!)

Hey Nate, thanks for the 3/4 ass review of John's research. 

I'm not very familiar with the current state of complex systems, chaos theory, linguistics, etc. research, so take my thoughts with a grain of salt.

However, I am familiar with the metaphysical and epistemological underpinnings of scientific knowledge (think Kant, Hume, Locke, etc.) and of language (think Wittgenstein, Russell, Diamond, etc.). And solely based on that, I agree with your critique of John's approach.

Kant's and Wittgenstein's metaphysical and epistemological ideas create significant barriers for John's goals, not just his methods. Any attempt whatsoever to describe how we can gain abstract knowledge from finite samples needs to seriously and dutifully grapple with the problems they posed, and the naive solutions they criticized.

Kant was obsessed with the questions of universal (true of categories) and necessary (true in all possible worlds) empirical truths and our ability to know them, especially how they related to the problems with empiricism that Hume posed.

Kant believed that experience cannot be the source of all of our ideas or knowledge, and that some of them must instead be contained in our consciousness and greater mind, because of simple statements he considered, such as "There are objects that exist in space and time outside of me" or "Subjects are persistent in time". These statements cannot be proven using a priori or a posteriori methods (try it!), which means that the truths needed to settle them are simply axiomatic.

Additionally, Ludwig Wittgenstein was obsessed with the questions and problems surrounding language and communication. He believed that discussing and understanding the meanings of words independently of their usage and grammar was folly. 

In fact, Wittgenstein's research in his later life was centered around the limits of language and communication. The limits of rules, and inner thought. How do we learn rules? How do we follow rules? How do we know if we have successfully followed a rule? How are rules stored in our minds? Are we just appealing to intuition when we apply or follow rules? Yes, he basically contributed questions, and that's the point.

In terms of communication, Wittgenstein believed that a person's inner thoughts and language could only refer to the immediate contents of his consciousness, and that consequently, not only do we need shared agreements to communicate, but also shared experiences: "agreement not in opinions, but rather in form of life."

I don't mean to say that it is impossible to achieve the goals that you and John share. But that rummaging around in the dark while some of the greatest thinkers stand nearby with soft candles is as tragic as you and John have hinted.

It appears people believe,

  1. I wrote to persuade, not explain (in hindsight I would agree)
  2. I wrote in a condescending tone (in hindsight I would agree)
  3. My critique did not offer anything concrete or any models
  4. My critique was "not even wrong"
  5. My critique was obviously false
  6. My critique was obviously true 
  7. My critique added nothing to the conversation

I'd love for anyone to explain which they thought and why.


And, beside the point, I may have unintentionally (worried about criticism) underplayed my knowledge of chaos theory, complex systems, and linguistics research. But I thought a person who had just read Nate's critique would be especially open to a philosophical (pre-axiomatic or axiomatic) perspective.

My bottom-line thinking on reading John's arguments and thoughts was that John's, and even Nate's, disuse of the shared language provided by Kant and Wittgenstein hinted at either,

1. a lack of understanding of their arguments

2. an understanding of only one or a few interpretations of their arguments

My guess is that most people who downvoted think popular philosophy is unlikely to be relevant for no-nonsense applications like math and alignment.

If you're interested, I generated a critique of John Wentworth's Natural Abstraction thesis using GPT4 here. It's not as good as if I'd actually written it myself, but it was better than I was expecting.

various distributions from statistical mechanics turn out to be empirically useful even though our universe isn't in thermodynamic equilibrium yet, and so there's some hope that these "idealized" or "convergently instrumentally useful" concepts degrade cleanly into practical real-world concepts like "trees" and "windows". Which are hopefully so convergently instrumentally useful that the AIs also use them.

I don't quite understand the turn of phrase ‘degrade cleanly into practical real-world concepts like "trees" and "windows"’ here. Per my understanding of the analogy to statistical mechanics, I would expect there to be an idealized concept of ‘window’ that assumes idealized circumstances that are never present in real life, but which is nevertheless a useful concept... and that the human concept is that idealized concept, and not a degraded version. Because – isn't the point that the actual real-world concept that doesn't assume idealized circumstances is too computationally complex so that all intelligent agents have to fall back to the idealized version?

Maybe that's what you mean and I'm just misreading what you wrote.