
The conversation

Martín:
Any write-up on the thing that "natural abstractions depend on your goals"? As in, pressure is a useful abstraction because we care about / were trained on a certain kind of macroscopic pattern (because we ourselves are such macroscopic patterns), but if you cared about "this exact particle's position", it wouldn't be.[1]

John:
Nope, no writeup on that.

And in the case of pressure, it would still be natural even if you care about the exact position of one particular particle at a later time, and are trying to predict that from data on the same gas at an earlier time. The usual high level variables (e.g. pressure, temperature, volume) are summary stats (to very good approximation) between earlier and later states (not too close together in time), and the position of one particular particle is a component of that state, so (pressure, temperature, volume) are still summary stats for that problem.
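[One way to write that summary-stat claim in symbols (my notation, not a quote from John's posts): with $X_{t_1}, X_{t_2}$ the gas microstates at an earlier and a later time (not too close together), the claim is

$$\Pr\big(X_{t_2} \mid X_{t_1}\big) \;\approx\; \Pr\big(X_{t_2} \mid \Lambda(X_{t_1})\big), \qquad \Lambda(X) := (P, V, T),$$

i.e. once you know $(P, V, T)$ of the earlier state, the remaining microscopic detail adds essentially nothing to predictions about the later state. And since one particle's later position is a function of $X_{t_2}$, the same approximation holds with that position in place of $X_{t_2}$.]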

The main loophole there is that if e.g. you're interested in the 10th-most-significant bit of the position of a particular particle, then you just can't predict it any better than the prior, so the empty set is a summary stat and you don't care about any abstractions at all.

Martín:
Wait, I don't buy that:

The gas could be in many possible microstates. Pressure partitions them into macrostates in a certain particular way. That is, every possible numerical value for pressure is a different macrostate, that could be instantiated by many different microstates (where the particle in question is in very different positions).
Say instead of caring about this partition, you care about the macrostate partition which tracks where that one particle is.
It seems like these two partitions are orthogonal, meaning that conditioning on a pressure level gives you no information about where the particle is (because the system is symmetric with respect to all particles, or something like that).
[This could be false due to small effects like "higher pressure makes it less likely all particles are near the center" or whatever, but I don't think that's what we're talking about. Ignore them for now, or assume I care about a partition which is truly orthogonal to pressure level.]
So tracking the pressure level partition won't help you.

It's still true that "the position of one particular particle is a component of the (micro)state", but we're discussing which macrostates to track, and pressure is only a summary stat for some macrostates (variables), but not others.
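[A toy numerical illustration of the orthogonality intuition above, as a sketch under the assumption of an ideal gas in equilibrium, where positions and velocities are independent at a single time; it is not a claim about the cross-time prediction problem John raises next:]

```python
import numpy as np

# Toy snapshot of an "ideal gas": positions uniform in a box, velocities
# Maxwell-Boltzmann (Gaussian). The kinetic-theory pressure estimate is a
# function of the velocities alone, so it is independent of any particle's
# position at that instant.

rng = np.random.default_rng(0)
n_samples, n_particles = 20_000, 100
box_length = 1.0  # arbitrary units

positions = rng.uniform(0.0, box_length, size=(n_samples, n_particles))
velocities = rng.normal(0.0, 1.0, size=(n_samples, n_particles))

# Pressure ~ total kinetic energy / volume (1D toy, constants dropped).
pressure = (velocities ** 2).sum(axis=1) / box_length

# Condition on a "high pressure" macrostate and compare the distribution of
# particle 0's position with the unconditional one.
high_p = pressure > np.quantile(pressure, 0.9)
print("mean position of particle 0, unconditional:", positions[:, 0].mean())
print("mean position of particle 0, given high P: ", positions[high_p, 0].mean())
print("corr(pressure, particle-0 position):       ",
      np.corrcoef(pressure, positions[:, 0])[0, 1])
```

[The conditional and unconditional statistics of particle 0's position come out essentially identical, which is the sense in which the pressure partition is "orthogonal" to the where-is-this-particle partition at a single time.]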

John:
Roughly speaking, you don't get to pick which macrostates to track. There are things you're able to observe, and those observations determine what you're able to distinguish.

You do have degrees of freedom in what additional information to throw away from your observations, but for something like pressure, the observations (and specifically the fact that observations are not infinite-precision) already pick out (P, V, T) as the summary stats; the only remaining degree of freedom is to throw away even more information than that.

Applied to the one-particle example in particular: because of chaos, you can't predict where the one particle will be (any better than P, V, T would) at a significantly later time without extremely-high-precision observations of the particle states at an earlier time.
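[A minimal stand-in for the chaos point, a sketch using the logistic map rather than actual gas dynamics: two "microstates" differing by $10^{-12}$ decorrelate completely within a few dozen steps, which is why finite-precision observations can't support single-particle predictions at much later times.]

```python
# Chaos toy: the logistic map at r = 4 as a stand-in for molecular chaos.
# Two initial conditions differing by 1e-12 diverge exponentially, so any
# finite-precision observation loses track of the fine-grained state.

def logistic_trajectory(x0: float, steps: int) -> list[float]:
    xs = [x0]
    for _ in range(steps):
        xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.123456789, 60)
b = logistic_trajectory(0.123456789 + 1e-12, 60)

for t in (0, 10, 20, 30, 40, 50, 60):
    print(f"t={t:2d}  |a-b| = {abs(a[t] - b[t]):.3e}")
```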

Martín:
Okay, so I have a limited number of macrostate partitions I can track (because of how my sensory receptors are arranged), call this set S, and my only remaining choice is which information from S to throw out (due to computational constraints) while still being left with approximately good models of the environment.

(P, V, T) is considered a natural abstraction in this situation because it contains almost all information from S (or almost all information relevant to a certain partition I care about, which I can track by using all of S). That's the definition of natural abstraction (being a summary stat). And so, natural abstractions are observer-dependent: a different observer with different sensory receptors would have a different S, and so possibly different summary stats. (And possibly also goal-dependent, if instead of defining them as "summary stats for this whole S", you define them as "summary stats for this concrete variable (that can be tracked using S)".)
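[In information-theoretic terms, one hedged way to write this (my notation again): call $A$ a natural abstraction relative to an observable set $S$ and a target variable $Y$ if

$$I\big(Y;\, S \mid A(S)\big) \approx 0$$

while $A(S)$ is much lower-dimensional than $S$. Changing $S$ (different sensory receptors) or $Y$ (different goal) can change which functions $A$ satisfy this, which is the observer- and goal-dependence in question.]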

And, if as an agent with access only to S I want to track a partition which is not deducible from S, then I'm just screwed.
[And we can even make the argument that, due to how evolved agents like humans come about, it is to be expected that our goals involve controlling partitions that we can deduce from S. Otherwise our mechanisms pursuing these goals would serve no purpose.]

Any disagreements?

John:
Yup, that all sounds right.


Martín:
So, to the extent your main plan is "use our understanding of Natural Abstractions to look inside an AI and retarget the search", why aren't you worried about natural abstractions being goal-dependent?

Natural abstractions being sensory-perceptors-dependent might not be that much of a worry, because if necessary we can purposefully train the AI on the kinds of macroscopic tasks we want.
But if you buy the high internal decoupling (inner misalignment) story, maybe by the time you look into the trained AI it has already developed a goal about molecular squiggles, and correspondingly some of its natural abstractions have reshaped to better track those macrostates (and you don't understand them).
Probably the plan is closer to "we check natural abstractions continuously during training, and so we'll be able to notice if something like this starts happening". Even then, a high enough goal-dependence might make the plan unworkable (because you would need a trillion checks during training). But intuitively this doesn't seem likely (opinions?).

[Also note this is in the vicinity of an argument for alignment-by-default: maybe since its natural abstractions have been shaped by our Earthly macroscopic tasks, it would be hard for it to develop a very alien goal. This could hold either because internal concepts track our kind of macrostates and this makes it less likely for an alien goal to crystallize, or because alien goals do crystallize but then the agent doesn't perform well (since it doesn't have natural abstractions to correctly pursue those) and so the alien goal is dissolved in the next training steps. I expect you to believe something like "goal-dependence is not so low that you'd get alignment by default, but it's low enough that if you know what you're looking for inside the AI you don't need a trillion checks".]

John:
Good question.

Short answer: some goals incentivize general intelligence, which incentivizes tracking lots of abstractions and also includes the ability to pick up and use basically-any natural abstractions in the environment at run-time.

Longer answer: one qualitative idea from the Gooder Regulator Theorem is that, for some goals in some environments, the agent won't find out until later what its proximate goals are. As a somewhat-toy example: imagine playing a board game or video game in which you don't find out the win conditions until relatively late into the game. There's still a lot of useful stuff to do earlier on - instrumental convergence means that e.g. accumulating resources and gathering information and building general-purpose tools are all likely to be useful for whatever the win condition turns out to be.

That's the sort of goal (and environment) which incentivizes general intelligence.

In terms of cognitive architecture, that sort of goal incentivizes learning lots of natural abstractions (since they're likely to be useful for many different goals), and being able to pick up new abstractions on the fly as one gains new information about one's goals.

I claim that humans have that sort of "general intelligence". One implication is that, while there are many natural abstractions which we don't currently track (because the world is big, and I can't track every single object in it), there basically aren't any natural abstractions which we can't pick up on the fly if we need to. Even if an AI develops a goal involving molecular squiggles, I can still probably understand that abstraction just fine once I pay attention to it.

I also claim the kind of AI we're interested in, the kind which is dangerous and highly useful, will also have that sort of "general intelligence". The implication is that, while there will be many natural abstractions which such an AI isn't tracking at any given time, there basically aren't any natural abstractions which it can't pick up on the fly if it needs to. Furthermore, it will probably already track the "most broadly relevant" abstractions in the environment, which likely includes most general properties of humans (as opposed to properties of specific humans, which would be more locally-specific).

Martín:
Okay, so maybe the formal version is the following:

For some kind of agents and some kind of environments, while it remains strictly true that they only have direct observations of the macrostate partitions in S (my eyes cannot see into the infrared), they can (and are incentivized to) use these in clever ways (like building a thermal camera) to come to correct hypotheses about the other partitions (assuming a certain simple conceptual structure governs and binds all partitions, like the laws of physics). This will de facto mean that the agent develops internal concepts (natural abstractions, summary stats) that efficiently track these additional macrostates S’, almost as natively as if they could directly observe them.

Two agents who have gotten past the point of doing that will continue building more general cognitive tools to expand S’ further, because having more macrostate partitions is sometimes more useful: maybe whether a human survives is determined by the position of an electron, and so I start caring about controlling that macrostate as well (even if it wasn’t initially observable by me). Thus the particularity of their abstractions (first set by their perceptors and goals) will dissolve into more general all-purpose mechanisms, and they’ll end up with pretty similar S’. There might still be some path-dependence here, but we quantitatively expect it to be low. In particular, you expect the abstractions we track and study (inside the human S’) to be enough to understand those within an AI (or at least, to be enough of a starting point to build more abstractions and end up achieving that).
[That expectation sounds reasonable, although I guess eventually I would like more quantitative arguments to that effect.]

John:
Yep!

Some afterthoughts

I think this is a philosophically important point that most people don't have in mind when thinking about natural abstractions (nor had I seen it addressed in John's work): we have some vague intuition that an abstraction like pressure will always be useful, because of some fundamental statistical property of reality (non-dependent on the macrostates we are trying to track), and that's not quite true.

As discussed, it's unclear whether this philosophical point poses a pragmatic problem, or any hindrance to the realistic implementation of John's agenda. My intuition is no, but this is a subtle question.[2]

This discussion is very similar to the question of whether agency is observer-dependent. There again I think the correct answer is the intentional stance: an agent is whatever is useful for me to model as intention-driven. And so, since different observers will find different concepts useful (due to their goals or sensors or computational capabilities), different observers will define agency differently.
And as above, this true philosophical point doesn't prohibit the possibility that, in practice, observers who have been shaped enough by learning (and have learned to shape themselves further and improve their abstractions) might agree on which concepts and computations (including agency-related heuristics) are most useful to deal with physical systems under different constraints, because this is a property of physics/math.

Related post: Why does generalization work?

  1. ^

    I'm not sure I independently discovered this consideration, maybe a similar one was floating around some Agent Foundations workshop, possibly voiced by Sam Eisenstat. But the way it came up more recently was in my thinking about why generalization works in our universe (post coming soon).

  2. ^

    A way to get rid of the observer (and goal) dependence is by integrating over all observers (and goals), or a subclass of them; thus the new definition of natural abstraction could be "this property of reality is on average a good summary stat for most observers (and goals)" (for some definition of "on average", which would probably be tricky to decide on). But if different observers have quite different natural abstractions, this won't be very useful. So the important question is, quantitatively, how much convergence there is.
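    (A sketch of that definition in symbols, with the hard choices left open: $\mathrm{naturalness}(A) := \mathbb{E}_{o \sim \mathcal{D}}\big[\mathrm{usefulness}(A;\, S_o, Y_o)\big]$, averaging over some distribution $\mathcal{D}$ of observers $o$ with sensors $S_o$ and goals $Y_o$; the tricky parts are exactly $\mathcal{D}$ and the usefulness measure.)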

Comments
aysja:

we have some vague intuition that an abstraction like pressure will always be useful, because of some fundamental statistical property of reality (non-dependent on the macrostates we are trying to track), and that's not quite true.

I do actually think this is basically true. It seems to me that when people encounter that maps are not the territory—see that macrostates are relative to our perceptual machinery or what have you—they sometimes assume that this means the territory is arbitrarily permissive of abstractions. But that seems wrong to me: the territory constrains what sorts of things maps are like. The idea of natural abstractions, imo, is to point a bit better at what this “territory constrains the map” thing is. 

Like sure, you could make up some abstraction, some summary statistic like “the center point of America” which is just the point at which half of the population is on one side and half on the other (thanks to Dennett for this example). But that would be horrible, because it’s obviously not very joint-carvey. Where “joint carvy-ness” will end up being, I suspect, related to “gears that move the world,” i.e., the bits of the territory that can do surprisingly much, have surprisingly much reach, etc. (similar to the conserved information sense that John talks about). And I think that’s a territory property that minds pick up on, exploit, etc. That the directionality is shaped more like “territory to map,” rather than “map to territory.” 

Another way to say it is that if you sampled from the space of all minds (whatever that space, um, is), anything trying to model the world would very likely end up at the concept “pressure.” (Although I don’t love this definition because I think it ends up placing too much emphasis on maps, when really I think pressure is more like a territory object, much more so than, e.g., the center point of America is). 

There again I think the correct answer is the intentional stance: an agent is whatever is useful for me to model as intention-driven. 

I think the intentional stance is not the right answer here, and we should be happy it’s not because it’s approximately the worst sort of knowledge possible. Not just behaviorist (i.e., not gears-level), but also subjective (relative to a map), and arbitrary (relative to my map). In any case, Dennett’s original intention with it was not to be the be-all end-all definition of agency. He was just trying to figure out where the “fact of the matter” resided. His conclusion: the predictive strategy. Not the agent itself, nor the map, but in this interaction between the two. 

But Dennett, like me, finds this unsatisfying. The real juice is in the question of why the intentional stance works so well. And the answer to that is, I think, almost entirely a territory question. What is it about the territory, such that this predictive strategy works so well? After all, if one analyzes the world through the logic of the intentional stance, then everything is defined relative to a predictive strategy: oranges, chairs, oceans, planets. And certainly, we have maps. But it seems to me that the way science has proceeded in the past is to treat such objects as “out there” in a fundamental way, and that this has fared pretty well so far. I don’t see much reason to abandon it when it comes to agents. 

I think a science of agency, to the extent it inherits the intentional stance, should focus not on defining agents this way, but on asking why it works at all. 

I'm not sure we are in disagreement. No one is denying that the territory shapes the maps (which are part of the territory). The central point is just that our perception of the territory is shaped by our perceptors, etc., and need not be the same. It is still conceivable that, due to how the territory shapes this process (due to the most likely perceptors to be found in evolved creatures, etc.), there ends up being a strong convergence, so that all maps isomorphically represent certain territory properties. But this is not a given, and needs further argumentation. After all, it is conceivable for a territory to exist that incentivizes the creation of two very different and non-isomorphic types of maps. But of course, you can argue that our territory is not like that, by looking at its details.

Where “joint carvy-ness” will end up being, I suspect, related to “gears that move the world,” i.e., the bits of the territory that can do surprisingly much, have surprisingly much reach, etc.

I think this falls into the same circularity I point at in the post: you are defining "naturalness of a partition" as "usefulness to efficiently affect / control certain other partitions", so you already need to care about the latter. You could try to say something like "this one partition is useful for many partitions", but I think that's physically false, by combinatorics (whichever partition you pick, you can always build just as many partitions that are affected by it). More on these philosophical subtleties here: Why does generalization work?

Great comment! I just wanted to share a thought on my perception of the "why" in relation to the intentional stance.

Basically, my hypothesis that I stole from Karl Friston is that an agent is defined as something that applies the intentional stance to itself. Or, in other words, something that plans with its own planning capacity or itself in mind. 

One can relate it to the entire membranes/boundaries discussion here on LW as well in that if you plan as if you have a non-permeable boundary, then the informational complexity of the world goes down. By applying the intentional stance to yourself, you minimize the informational complexity of modelling the world as you kind of define a recursive function that acts within its own boundaries (your self). You will then act according to this, and then you have a kind of self-fulfilling prophecy as the evidence you get is based on your map which has a planning agent in it. 

(Literally self-fulfilling prophecy in this case as I think this is the "self"-loop that is talked about in meditation. It's quite cool to go outside of it.)

Can you give a link to wherever Friston talks about that definition of agency?

Uh, I binged like 5 MLST episodes with Friston, but I think it's a bit later in this one with Stephen Wolfram: https://open.spotify.com/episode/3Xk8yFWii47wnbXaaR5Jwr?si=NMdYu5dWRCeCdoKq9ZH_uQ

It might also be this one: https://open.spotify.com/episode/0NibQiHqIfRtLiIr4Mg40v?si=wesltttkSYSEkzO4lOZGaw

Sorry for the unsatisfactory answer :/

This seems somewhat relevant to a disagreement I've been having with @Zack_M_Davis about whether autogynephilia is a natural abstraction.

Some foundations for people who are not familiar with the topic: Autogynephilia is a sexual interest in being a woman, in a sense relatively analogous to other sexual interests, such as men's usual sexual interest in women (gynephilia). That is, autogynephiles will have sexual fantasies about being women, will want to be women, and so on.

I argue that autogynephilia is not a natural abstraction because most autogynephiles keep their sexuality private, so people can't tell who is autogynephilic and therefore can't develop a model of how autogynephilia works. (Which I take to be the "natural" criterion of "natural abstractions".) When people talk about "autogynephilia", they do so in certain specific contexts, but those contexts tend to mix "autogynephilia" together with other things. For example, in the context of trans women, they often mix together autogynephilia with gender progressivism, because gender progressive autogynephiles are more likely to transition genders than gender conservative autogynephiles are. (Since gender conservative ideology says that transition is bad.)

By my count (being about the only person in the world investigating this statistically with surveys, and therefore having the most birds-eye perspective on it out of all people), there's on the order of 6 major factors that people mix together with autogynephilia: gender progressivism, masochism (masochistic autogynephilic sexuality looks more striking and abnormal than ordinary autogynephilic sexuality), unrestricted sociosexuality (leads to sexual exhibitionism which reduces the likelihood of keeping one's sexuality private, and among trans women also leads to an interest in having more exaggerated feminine features and presenting oneself as sexually available), neuroticism (increases the likelihood of talking a lot about worrying about autogynephilia), autism (reasons are unclear but people associate it way more with autogynephilia than seems justified), and general antisocial tendencies (people mainly talk about autogynephilia because they have some issues with some trans women, and they share stories about problematic autogynephiles with each other, Chinese-Robber style).

Zack disagrees in the strongest terms with my claim that autogynephilia is not a natural abstraction, as he says that people being wrong about something because they don't get data on it doesn't make it not a natural abstraction. (I'd encourage him to explicate further on his argument here if he wants, as I'm not sure I can present it faithfully.) Since we've disagreed, I've been vaguely curious what @johnswentworth thinks about this, not empirically about autogynephilia but semantically about the meaning of the term "natural abstraction".

Consider the consensus genome of some species of tree.

Long before we were able to sequence that genome, we were able to deduce that something-like-it existed. Something had to be carrying whatever information made these trees so similar (inter-breedable). Eventually people isolated DNA as the relevant information-carrier, but even then it was most of a century before we knew the sequence.

That sequence is a natural latent: most of the members of the tree's species are ~independent of each other given the sequence (and some general background info about our world), and the sequence can be estimated pretty well from ~any moderate-sized sample of the trees. Furthermore, we could deduce the existence of that natural latent long before we knew the sequence.

Point of this example: there's a distinction between realizing a certain natural latent variable exists, and knowing the value of that variable. To pick a simpler example: it's the difference between realizing that (P, V, T) mediate between the state of a gas at one time and its state at a later time, vs actually measuring the values of pressure, volume and temperature for the gas.
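(To spell out the two properties being invoked here, in roughly the notation of the natural latents write-ups as I understand them: with $X_1, \dots, X_n$ the individual trees and $\Lambda$ the consensus genome, the claim is that the $X_i$ are approximately independent of each other given $\Lambda$ (mediation), and that $\Pr(\Lambda \mid X_1, \dots, X_n) \approx \Pr(\Lambda \mid \text{any moderate-sized subset of the } X_i)$ (redundancy), so $\Lambda$ both screens off the trees from one another and is recoverable from almost any sample of them.)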

Let's say the species is the whitebark pine P. albicaulis, which grows in a sprawling shrub-like form called krummholz in rough high-altitude environments, but looks like a conventional upright tree in more forgiving climates.

Suppose that a lot of people don't like krummholz and have taken to using the formal species name P. albicaulis as a disparaging term (even though a few other species can also grow as krummholz).

I think Tail is saying that "P. albicaulis" isn't a natural abstraction, because most people you encounter using that term on Twitter are talking about krummholz, without realizing that other species can grow as krummholz or that many P. albicaulis grow as upright trees.

I'm saying it's dumb to assert that P. albicaulis isn't a natural abstraction just because most people are ignorant of dendrology and are only paying attention to the shrub vs. tree subspace: if I look at more features of vegetation than just broad shape, I end up needing to formulate P. albicaulis to explain the things some of these woody plants have in common despite their shape.

I'm saying it's dumb to assert that P. albicaulis isn't a natural abstraction just because most people are ignorant of dendrology and are only paying attention to the shrub vs. tree subspace: if I look at more features of vegetation than just broad shape, I end up needing to formulate P. albicaulis to explain the things some of these woody plants have in common despite their shape.

And I think this is fine if you're one of the approximately 5 people (me, maybe Bailey, maybe Andura, maybe Hsu, maybe you - even this is generous since e.g. I think you naturally thought of autogynephilia and gender progressivism as being more closely related than they really are) in the world who do observe things this closely, but it also seems relevant to notice that the rest of the world doesn't and can't observe things this closely, and that therefore we really ought to refactor our abstractions such that they align better with what everyone else can think about.

Like if our autogynephilia discourse was mostly centered on talking with each other about it, it'd be a different story, but you mostly address the rationalist community, and Bailey mostly addresses GCs and HBDs, and so on. So "most people you encounter using that term on Twitter" doesn't refer to irrelevant outsiders, it refers to the people you're trying to have the conversation with.

And like a key part of my point is that they mostly couldn't fix it if they wanted to. P. albicaulis has visible indicators that allow you to diagnose the species by looking at it, but autogynephiles don't unless you get in private in a way that GCs and rationalists basically are never gonna do.

I agree with that, but I think it is complicated in the case of autogynephilia. My claim is that there are a handful of different conditions that people look at and go "something had to be carrying whatever information made these people so similar", and the something is not simply autogynephilia in the sense that I talk about it, but rather varies from condition to condition (typically including autogynephilia as one of the causes, but often not the most relevant one).

I think this is a philosophically important point that most people don't have in mind when thinking about natural abstractions (nor had I seen it addressed in John's work): we have some vague intuition that an abstraction like pressure will always be useful, because of some fundamental statistical property of reality (non-dependent on the macrostates we are trying to track), and that's not quite true.

As discussed, it's unclear whether this philosophical point poses a pragmatic problem, or any hindrance to the realistic implementation of John's agenda. My intuition is no, but this is a subtle question.

My intuition is that there will tend to be convergence, but I agree that this is uncertain. Thus, I also agree that convergence of natural abstractions is something we should seek to empirically measure across a varying population of created agents.

Short answer: some goals incentivize general intelligence, which incentivizes tracking lots of abstractions and also includes the ability to pick up and use basically-any natural abstractions in the environment at run-time.

Longer answer: one qualitative idea from the Gooder Regulator Theorem is that, for some goals in some environments, the agent won't find out until later what its proximate goals are. As a somewhat-toy example: imagine playing a board game or video game in which you don't find out the win conditions until relatively late into the game. There's still a lot of useful stuff to do earlier on - instrumental convergence means that e.g. accumulating resources and gathering information and building general-purpose tools are all likely to be useful for whatever the win condition turns out to be.

 

As I understand this argument, even if an agent's abstractions depend on its goals, it doesn't matter because disparate agents will develop similar instrumental goals due to instrumental convergence. Those goals involve understanding and manipulating the world, and thus require natural abstractions. (And there's the further claim that a general intelligence can in fact pick up any needed natural abstraction as required.)

That covers instrumental goals, but what about final goals? These can be arbitrary, per the orthogonality thesis. Even if an agent develops a set of natural abstractions for instrumental purposes, if it has non-natural final goals, it will need to develop a supplementary set of non-natural goal-dependent abstractions to describe them as well.

When it comes to an AI modeling human abstractions, it does seem plausible to me that humans' lowest-level final goals/values can be described entirely in terms of natural abstractions, because they were produced by natural selection and so had to support survival & reproduction. It's a bit less obvious to me this still applies to high-level cultural values (would anyone besides a religious Jew naturally develop the abstraction of kosher animal?). In any case, if it's sufficiently important for the AI to model human behavior, it will develop these abstractions for instrumental purposes.

Going the other direction, can humans understand, in terms of our abstractions, those that an AI develops to fulfill its final goals? I think not necessarily, or at least not easily. An unaligned or deceptively aligned mesa-optimizer could have an arbitrary mesa-objective, with no compact description in terms of human abstractions. This matters if the plan is to retarget an AI's internal search process. Identifying the original search target seems like a relevant intermediate step. How else can you determine what to overwrite, and that you won't break things when you do it?

I claim that humans have that sort of "general intelligence". One implication is that, while there are many natural abstractions which we don't currently track (because the world is big, and I can't track every single object in it), there basically aren't any natural abstractions which we can't pick up on the fly if we need to. Even if an AI develops a goal involving molecular squiggles, I can still probably understand that abstraction just fine once I pay attention to it.

This conflates two different claims.

  1. A general intelligence trying to understand the world can develop any natural abstraction as needed. That is, regularities in observations / sensory data -> abstraction / mental representation.
  2. A general intelligence trying to understand another agent's abstraction can model its implications for the world as needed. That is, abstraction -> predicted observational regularities.

The second doesn't follow from the first. In general, if a new abstraction isn't formulated in terms of lower-level abstractions you already possess, integrating it into your world model (i.e. understanding it) is hard. You first need to understand the entire tower of prerequisite lower-level abstractions it relies on, and that might not be feasible for a bounded agent. This is true whether or not all these abstractions are natural.

In the first case, you have some implicit goal that's guiding your observations and the summary statistics you're extracting. The fundamental reason the second case can be much harder relates to this post's topic: the other agent's implicit goal is unknown, and the space of possible goals is vast. The "ideal gas" toy example misleads here. In that case, there's exactly one natural abstraction (P, V, T), no useful intermediate abstraction levels, and the individual particles are literally indistinguishable, making any non-natural abstractions incoherent. Virtually any goal routes through one abstraction. A realistic general situation may have a huge number of equally valid natural abstractions pertaining to different observables, at many levels of granularity (plus an enormous bestiary of mostly useless non-natural abstractions). A bounded agent learns and employs the tiny subset of these that helps achieve its goals. Even if all generally intelligent agents have the same potential instrumental goals that could enable them to learn the same natural abstractions, without the same actual instrumental goals, they won't.