Trying to understand John Wentworth's research agenda

johnswentworth; habryka; David Lorell

Oli, you wanna kick us off?

johnswentworth

I mostly plan to observe (and make edits/suggestions) here, but I'll chime in if I think there is something particularly important being missed.

David Lorell

Ok, context for this dialogue is that I was evaluating your LTFF application, and you were like "I would like to do more research on natural abstractions/maybe release products that test my hypotheses, can you give me money?".

I then figured that given that I've been trying to make dialogues happen on LW, we might also be able to create some positive externalities by making a dialogue about your research, which allows other people who read it later to also get to learn more about what you are working on.

I've generally liked a lot of your historical contributions to a bunch of AI Alignment stuff, but you haven't actually written much about your current central research agenda/direction, which does feel like a thing I should understand at least in broad terms before finishing my evaluation of your LTFF application.

My current strawman of your natural abstraction research is approximately "John is working on ontology identification/manipulation, just like everyone else these days". By which I mean, it feels like the central theme of Paul's research, a bunch of the most interesting parts of Anthropic's prosaic steering/interpretability work, and some of the more interesting MIRI work.

Here are some things that I associate with this line of research:

Trying to understand the mapping between concepts that we have and the concepts that an AI would use in the process of trying to achieve its goals/reason about the world
Trying to get to a point where if we have an AI system that arrives at some conclusion, that we have some model of "what reasoning did it very roughly use to arrive at that conclusion"
A major research problem that keeps coming up in this space is "ontology combination", like if I have one argument that uses an ontology that carves reality at joints X, and another argument that carves reality at joints Y, how do I combine them? This feels like one of the central components of heuristic arguments, and also is one of the key components of cartesian frames.

But I am not super confident that any of this matches that closely to the research you've been doing. I also haven't had a sense that people have been making tons of progress here, and at least in the case of Paul's heuristic arguments work I've mostly been jokingly referring to it as "in order to align an AI I just first have to formalize all of epistemology", and while I think that's a cool goal, it does seem like one that's pretty hard and on priors I don't expect someone to just solve it in the next few years.

habryka

Ok, I'm just gonna state our core result from the past few months, then we can unpack how that relates the all the things you're talking about.

Suppose I have two probabilistic models (e.g. from two different agents), which make the same predictions over some "observables" X. For instance, picture two image generators which generate basically the same distribution of images, but using different internal architectures. The two models differ in using different internal concepts - operationalized here as different internal latent variables.

Now, suppose one of the models has some internal latent which (under that model) mediates between two chunks of observable variables - this would be a pretty typical thing for an internal latent in some generative model. Then we can give some simple sufficient conditions on a latent $Λ$ in the other model, such that $Λ \to Λ^{'} \to X$ .

Another possibility (dual to the previous claim): suppose that one of the two models has some internal latent $Λ^{'}$ which (under that model) can be precisely estimated from any of several different subsets of the observables - i.e. we can drop any one part of the observables and still get a precise estimate of $Λ^{'}$ . Then, under the same simple sufficient conditions on $Λ$ in the other model, $Λ^{'} \to Λ \to X$ .

A standard example here is temperature ( $Λ$ ) in an ideal gas. It satisfies the sufficient conditions, so I can look at any model which makes the same predictions about the gas, and look at any latent $Λ^{'}$ in that model which mediates between two far-apart chunks of gas, and I'll find that that $Λ^{'}$ includes temperature (i.e. temperature is a function of $Λ^{'}$ ). Or I can look at any $Λ^{'}$ which can be estimated precisely from any little chunk of the gas, and I'll find that temperature includes $Λ^{'}$ (i.e. $Λ^{'}$ is a function of temperature).

... and crucially, this all plays well with approximation, so if e.g. $Λ^{'}$ approximately mediates between two chunks then we get an approximate claim about the two latents.

johnswentworth

So the upshot of all that is: we have a way to look at some properties of latent variables ("concepts") internal to one model, and say how they'll relate to broad classes of latents in another model.

johnswentworth

Ok, what does it mean to "mediate between two chunks of observable variables"?

habryka

It's mediation in the sense of conditional independence - so e.g., two chunks of gas which aren't too close together have approximately independent low-level state given their temperature, or two not-too-close-together chunks of an image spit out by an image generator are independent conditional on some "long-range" latents upstream of both of them.

johnswentworth

(In particular, this mediation thing is relevant to any mind which is factoring its model of the world in such a way that the two chunks are in different "factors", with some relatively-sparse interaction between them.)

johnswentworth

What do the arrows mean in this equation?

$Λ \to Λ^{'} \to X$

habryka

The arrow-diagrams are using Bayes net notation, so e.g. $Λ \to Λ^{'} \to X$ is saying that $X$ and $Λ$ are independent given $Λ^{'}$ , or equivalently that $Λ$ tells us nothing more about $X$ once we know $Λ^{'}$ .

johnswentworth

(Also, if you'd prefer a description more zoomed-out than this, we can do that instead.)

johnswentworth

Cool, let me think out loud while I am trying to translate this math into more concrete examples for me.

Here is a random example of two programs that use different ontologies but have the same probability distribution over some set of observables.

Let's say the programs do addition and one of them uses integers and the other one uses floating point numbers, and I evaluate them both in the range of 1-100 or so, where the floating point error shouldn't matter.

Does the thing you are saying have any relevance to this matter?

Naively I feel like what you are saying translates to something like "if I know the floating point representation of a number, I should be able to find some way of using that representation in the context of the integer program to get the same result". But I sure don't immediately understand how to do that. I mean, I can translate them, of course, but I don't really know what to say beyond that.

habryka

Ok, so first we need to find either some internal thing which mediates between two parts of the "observables" (i.e. the numbers passed in and out), or some internal thing which we can precisely estimate from subsets of the observables.

For instance: maybe I look inside the integer-adder, and find that there's a carry (e.g. number "carried from" the 1's-place to the 10's-place), and the carry mediates between the most-significant-digits of the inputs/outputs, and the least-significant-digits.

Then, I know that any quantity which can be computed from either the most-significant-digits or the least-significant digits (i.e. the same quantity must be computable from either one) is a function of that carry. So if the floating-point adder is using anything along those lines, I know it's a function of the carry in the integer-adder's ontology.

(In this particular example I don't think we have any such interesting quantity; our "observables" are "already compressed" in some sense, so there's not much nontrivial to say about the relationship between the two programs.)

johnswentworth

Ah, so when you are saying $Λ \to Λ^{'} \to X$ you don't mean "we can find a $Λ$ ", you are saying "if there is a $Λ$ that meets some conditions then we can make it conditionally independent from the observables using $Λ^{'}$ "?

habryka

Yes.

johnswentworth

Ok, cool, that makes more sense to me. What are the conditions on $Λ$ that allow us to do that?

habryka

Turns out we've already met them. The two conditions are:

$Λ$ must mediate between some chunks of $X$
$Λ$ must be precisely-estimable from all but one of those chunks, for any chunk excluded (i.e. we can drop any one chunk and still get a precise estimate of $Λ$ ).

So for instance, in an ideal gas, far-apart chunks are approximately independent given the temperature, and we can precisely estimate the temperature from any one (or a few) chunks. So, temperature in that setting satisfies the conditions. And approximation works, so that would also carry over to approximately-ideal gases.

johnswentworth

Ok, so what we are saying here is something like "if I have two systems that somehow model the behavior of ideal gases, if there is any latent variable that mediates between distant clusters of gas, then I must be able to extract temperature from that variable"?

Of course, we have no guarantee that any such system actually has any latent variable that mediates between the predictions about distant clusters of gas. Like, the thing could just be a big pre-computed lookup table.

habryka

That's exactly right. We do have some reasons to expect mediation-heavy models to be instrumentally convergent, but that's a separate story.

johnswentworth

I find it hard to come up with examples of the "precisely calculable" condition. Like, what are some realistic situations where that is the case?

habryka

Sure, the "precisely calculable" is actually IMO easier to see once you know what to look for. Examples:

We can get a good estimate of the consensus genome of a species by looking at a few members of the species.
... for that matter, we can get a good estimate of all sorts of properties of a species by looking at a few members of the species. Body shape, typical developmental trajectory (incl. e.g. weight or size at each age), behaviors, etc
We can get a good estimate of lots of things about 2009 toyota corollas by looking at a few 2009 toyota corollas.
In clustering language, we can get a very precise estimate of cluster-statistics (e.g. mean and variance, in a mixture-of-gaussians model) by looking at a moderate-sized sample of points from that cluster.

johnswentworth

Notably in these examples, the quantity is not actually "exactly" calculable, but as he's been mentioning we have approximation results which go through fine. (And by "fine," I mean, "with well defined error bounds.")

David Lorell

I do feel a bit confused about how the approximate thing could be true, but let's leave that for later, I can take it as given for now.

habryka

So for instance, if (under my model) there's some characteristic properties of 2009 toyota corollas, such that all 2009 toyota corollas are approximately independent given those properties AND I could estimate those properties from a sample of corollas, then I have the conditions.

(Note that I don't actually need to know the values of the properties - e.g. a species' genome can satisfy the properties under my model even if I don't know that species' whole consensus sequence.)

johnswentworth

Like, I keep wanting to translate the above into sentences of the form "I believe A because of reason X, you believe A because of reason Y, and we are both correct in our beliefs about A. This means that somehow I can say something about the relation between X and Y", but I feel like I can't yet quite translate things into that form.

Like, I definitely can't make the general case of "therefore X and Y can be calculated from each other", which of course they could not be. So you are saying something weaker.

habryka

Under one model, I can estimate the consensus genomic sequence of oak trees from a small sample of oaks. Under some other medieval model, most oak trees are approximately-independent given some mysterious essence of oak-ness, and that mysterious essence could in-principle be discovered by examining a sample of oak trees, though nobody knows the value of that mysterious essence (they've just hypothesized its existence). Well, good news: the consensus genomic sequence is implicitly included in the mysterious essence of oak-ness.

More precisely: any way of building a joint model which includes both the consensus sequence and the mysterious essence of oak-ness, both with their original relationships to all the "observable" oaks, will say that the consensus genomic sequence is a function of the mysterious essence, i.e. the oaks themselves can tell us nothing additional about the consensus sequence once we know the value of the mysterious essence.

johnswentworth

Ok, maybe I am being dumb, but like, let's say you carved your social security number into two oak trees. Now the consensus genomic sequence with your social security number appended can be estimated from any sample of oaks that leaves at most one out. Under some other medieval model there is some mysterious essence of oak-ness that can be discovered via sampling some oak trees. Now, it's definitely not the case that the consensus genomic sequence plus your social security number is implicitly included in the mysterious essence of oak-ness.

habryka

Yeah, that's a great example. In this case, the social security number would be excluded from the mysterious essence by having an "estimate from subset" requirement which allows throwing out more than just one oak (and indeed it is pretty typical that we want a stronger requirement along those lines).

However, the flip side is arguably more interesting: if you carve your social security number into enough oak trees, over all the oak trees to ever exist, then your social security number will naturally be part of the "cluster of oak trees". People can change which "concepts" are natural, by changing the environment.

johnswentworth

I think this would benefit from more concreteness. John, can you specify your SSN for us?

David Lorell

Ok, so this must be related to the "approximate" stuff I don't understand. Like, if I had to translate the above into the more formal criterion, you would say it fails because like, the oak trees are not actually independent without knowledge of your social security number, but I am kind of failing to make the example work out here.

habryka

I don't think the current example is about "approximation" in the sense we were using it earlier. There's a different notion of "approximation" which is more relevant here: we can totally weaken the conditions to "most chunks of observables are independent given latent", and then the same conclusions still hold. (The reason is that, if most chunks of observables are independent given latent, then we can still find some chunks which are independent given the latent, and then we just use those chunks as the foundation for the claim.)

(More generally, the sort of way we imagine using this machinery is to go looking around in models for chunks of observables over which we can find latents satisfying the conditions; we don't need to use all the observables all the time.)

johnswentworth

Ok, I would actually be interested in a rough proof outline here, but probably we should spend some more time talking about higher-level stuff.

habryka

Sure, what sorts of higher-level questions do you have?

johnswentworth

Like, if you had to give a very short and highly-compressed summary of how this kind of research helps with AI not killing everyone, what would it be? I could also make a first guess and then you can correct me.

habryka

My own very short story is something like:

Well, it really matters to understand the conditions under which AI will develop the same abstractions as humans do, since when they do, if we can also identify and manipulate those abstractions, we can use those to steer powerful AI systems away from having catastrophic consequences.

habryka

Sure. Simple story is:

This gives us some foundations to expect that certain kinds of latents/internal structures will map well across minds/ontologies.
We can go look for such structures in e.g. nets, see how well they seem to match our own concepts, and have some reason to expect they'll match our own concepts robustly in certain cases.

Furthermore, as new and more powerful AI comes along, we expect to be able to transport those concepts to the ontologies of those more powerful AIs (insofar as the more powerful AIs still have latents satisfying one or both conditions, which is itself something checkable in-principle).
... and insofar as all that works, we can use these sorts of latents as "ontology-transportable" building blocks with which to build more complex concepts.

johnswentworth

So, if we can build whatever-kind-of-alignment-machinery we want out of these sorts of building blocks, then we have some reason to expect that machinery to transport well across ontologies, especially to new more-powerful future AI.

johnswentworth

To clarify that last part, is your story here something like "we get some low-powered AI systems to do things for reasons we roughly understand and like. Then we make it so that a high-powered AI system does things for roughly the same reasons. That means the high-powered AI system roughly behaves predictably and in a way we like"?

habryka

Not the typical thing I'd picture, but sure, that's an example use-case.

johnswentworth

My prototypical picture would be more "we decode (a bunch of) the AI's internal language, to the point where we can state something we want in that language, then target its internal search process at that". And the ontology transport matters for both the "decode its internal language" step, and the unstated "transport that same goal to a successor AI" step.

johnswentworth

Ah, OK. The "retarget its internal search process at that" part does sound like more the kind of thing you would say.

habryka

Ok, so, why are you thinking about maybe building products instead of (e.g.) writing up some of the proofs or results you have so far? I mean, seems fine to validate hypotheses however one likes, but building a product takes a long time, and I don't fully understand the path from that to good things happening (like, to what degree is the path "John has better models of the problem domain vs. John has a pretty concrete and visceral demonstration of a bunch of solutions which then other people adopt and iterate on and then scale?"

habryka

So, I've done a lot of "writing up stuff" over the past few years, and the number of people who have usefully built on my technical work has been close to zero. (Though admittedly the latest results are nicer, I do think people would be more likely to build on them, but I don't know how much more.) On the flip side, fork hazards: I have to worry that this stuff is dangerous to share in exactly the worlds where it's most useful.

My hope is that products will give a more useful feedback signal than other peoples' commentary on our technical work.

(Also, we're not planning to make super-polished products, so hopefully the time sink isn't too crazy. Like, we've got some theory here which is pretty novel and cool, if we're doing our jobs well then even kinda-hacky products should be able to do things which nobody else can do today.)

johnswentworth

My sense is the usual purpose of publishing proofs and explanations is more than just field-building. My guess is it's also a pretty reasonable part of making your own epistemic state more robust, though that's not like a knock-down argument (and I am not saying people can't come to justifiedly true beliefs without publishing, though I do think justified true belief in the absence of public argument is relatively rare for stuff in this space).

habryka

(Also, we're not planning to make super-polished products, so hopefully the time sink isn't too crazy. Like, we've got some theory here which is pretty novel and cool, if we're doing our jobs well then even kinda-hacky products should be able to do things which nobody else can do today.)

Ok, are we thinking about "products" a bit more in the sense of "Chris Olah's team kind of makes a product with each big release, in the form of like a webapp that allows you to see things and do things with neural networks that weren't possible before" or more in the sense of "you will now actually make a bunch of money and get into YC or whatever"?

habryka

My sense is the usual purpose of publishing proofs and explanations is more than just field-building. My guess is it's also a pretty reasonable part of making your own epistemic state more robust...

I agree with that in principle. In practice, the number of people who have usefully identified technical problems in my work is... less than half a dozen. There's also some value in writing it up at all - I have frequently found problems when I go to write things up - but that seems easier to substitute for.

johnswentworth

I agree with that in principle. In practice, the number of people who have usefully identified technical problems in my work is... less than half a dozen.

I mean, six people seems kind of decent. I don't know whether more engagement than that was necessary for e.g. the development of most of quantum mechanics or relativity or electrodynamics.

habryka

Ok, are we thinking about "products" a bit more in the sense of "Chris Olah's team kind of makes a product with each big release, in the form of like a webapp that allows you to see things and do things with neural networks that weren't possible before" or more in the sense of "you will now actually make a bunch of money and get into YC or whatever"

Somewhere between those two. Like, making money is a super-useful feedback signal, but we're not looking to go do YC or refocus our whole lives on building a company.

johnswentworth

Ok, cool, that helps me get some flavor here.

habryka

I have lots more questions to ask, but we are at a natural stopping point. I'll maybe ask some more questions async or post-publishing, but thank you, this was quite helpful!

habryka

[-]LawrenceC6mo4119

I'm curious what form these "products" are intended to take -- if possible, could you give some examples of things you might do with a theory of natural abstractions? If I had to guess, the product will be an algorithm that identifies abstractions in a domain where good abstractions are useful, but I'm not sure how or in what domain.

[-]Algon6mo20

Seconded. For personal reasons, I'd be interested if you can identify important concepts in a text document.

[-]Max H6mo177

We can go look for such structures in e.g. nets, see how well they seem to match our own concepts, and have some reason to expect they'll match our own concepts robustly in certain cases.

Checking my own understanding with an example of what this might look like concretely:

Suppose you have a language model that can play Chess (via text notation). Presumably, the model has some kind of internal representation of the game, the board state, the pieces, and strategy. Those representations are probably complicated linear combinations / superpositions of activations and weights within the model somewhere. Call this representation in your notation.

If you just want a traditional computer program to play Chess you can use much simpler (or at least more bare metal / efficient) representations of the game, board state, and pieces as a 2-d array of integers or a bitmap or whatever, and write some relatively simple code to manipulate those data structures in ways that are valid according to the rules of Chess. Call this representation $Λ$ in your notation.

And, to the degree that the language model is actually capable of playing valid Chess (since that's when we would expect the preconditions to hold), you expect to be able to identify latents within the model and find a map from $Λ^{'}$ to $Λ$ , such that you can manipulate $Λ$ and use information you learn from those manipulations to precisely predict stuff about $Λ^{'}$ . More concretely, once you have the map, you can predict the moves of the language model by inspecting its internals and then translating them into the representation used by an ordinary Chess analysis program, and then, having predicted the moves, you'll be able to predict (and perhaps usefully manipulate) the language model's internal representations by mapping from $Λ$ back to $Λ^{'}$ .

And then the theorems are just saying under what conditions exactly you expect to be able to do this kind of thing, and it turns out those conditions are actually relatively lax.

Roughly accurate as an example / summary of the kind of thing you expect to be able to do?

[-]Zvi6mo1610

After all the worries about people publishing things they shouldn't, I found it very surprising to see Oliver advocating for publishing when John wanted to hold back, combined with the request for incremental explanations of progress to justify continued funding.

John seems to have set a very high alternative proof bar here - do things other people can't do. That seems... certainly good enough, if anything too stringent? We need to find ways to allow deep, private work.

[-]johnswentworth6mo112

To be clear, the discussion about feedback loops was mostly about feedback loops for me (and David), i.e. outside signals for us to make sure we haven't lost contact with reality. This was a discussion about epistemics, not a negotiation over preconditions of funding.

[-]Bill Benzon6mo72

"...to understand the conditions under which AI will develop the same abstractions as humans do..."

I know from observation that ChatGPT has some knowledge of the concepts of justice and charity. It can define them in a reasonable way and create stories illustrating them. In some sense, it understands those concepts, and it arrived at them, I presume, through standard pretraining. Has it therefore developed those abstractions in the sense you're talking about?

[-]Joseph Bloom6mo30

I'd be very interested in seeing these products and hearing about the use-cases / applications. Specifically, my prior experience at a startup leads me to believe that building products while doing science can be quite difficult (although there are ways that the two can synergise).

I'd be more optimistic about someone claiming they'll do this well if there is an individual involved in the project who is both deeply familiar with the science and has build products before (as opposed to two people each counting on the other to have sufficient expertise they lack).

A separate question I have is about how building products might be consistent with being careful about what information you make public. If you there are things that you don't want to be public knowledge, will there be proprietary knowledge not shared with users/clients? It seems like a non-trivial problem to maximize trust/interest/buy-in whilst minimizing clues to underlying valuable insights.

[-]mtaran6mo3-2

It'd be great if one of the features of these "conversation" type posts was that they would get an LLM-genererated summary or a version of it not as a conversation. Because at least for me this format is super frustrating to read and ends up having a lower signal to noise ratio.

[-]habryka6mo46

I do think at least for a post like this, the likelihood the LLM would get any of the math right is pretty low. I do think some summary that allows people to decide whether to read the thing is pretty valuable, but I think it's currently out of reach to have a summary actually contain the relevant ideas/insights in a post.

[-]Domenic6mo20

I've had a hard time connecting John's work to anything real. It's all over Bayes nets, with some (apparently obviously true https://www.lesswrong.com/posts/2WuSZo7esdobiW2mr/the-lightcone-theorem-a-better-foundation-for-natural?commentId=K5gPNyavBgpGNv4m3 ) theorems coming out of it.

In contrast, look at work like Anthropic's superposition solution, or the representation engineering paper from CAIS. If someone told me "I'm interested in identifying the natural abstractions AIs use when producing their output", that is the kind of work I'd expect. It's on actual LLMs! (Or at least "LMs", for the Anthropic paper.) They identified useful concepts like "truth-telling" or "Arabic"!

In John's work, his prose often promises he'll point to useful concepts like different physics models, but the results instead seem to operate on the level of random variables and causal diagrams. I'd love to see any sign this work is applicable toward real-world AI systems, and can, e.g., accurately identify what abstractions GPT-2 or LLaMA are using.

LESSWRONG
LW

Trying to understand John Wentworth's research agenda

92

92