Review

This is an entry for the Open Philanthropy AI Worldviews Contest.

The most promising direction for "steering cognition at all" (the key part of AI alignment) seems to be to develop and apply a theory of abstractions powerful enough to:

  1. Use on AI systems (to predict and thus control their thought patterns; basically to "do agent foundations"), and

  2. Use in AI systems (to "point them" at specific goals we desire, at the human level).

This "abstraction agenda" is mostly associated with John Wentworth, and his related idea of the "Natural Abstraction hypothesis" (NAH). The NAH claims that the real world can be abstracted (by filtering out all but some key information) in a way that's convergently-useful for both humans and AI systems. For example, a mind is likely to come up with "tree" as a useful idea, just as humans have, and this idea of "tree" actually filters for information that both the AI and the human would want to know about.

This essay investigates the logic of the NAH and the broader abstraction agenda. We find assumptions and scenarios that would make the abstraction agenda more or less promising to work on, whether or not the NAH is true. This agenda's usefulness has large implications for the likelihood of existential catastrophe caused by AGI.

If you’re already dead-set on the formal alignment agenda in general (or the Wentworth-style-abstractions-informed approach in particular), this post may seem repetitive and obvious. If so, that’s some evidence I’ve well-represented the case for doing such research.

A steering-entity will probably be an abstraction-entity

For our purposes, an "abstraction" is some entity that throws away information about some other entity, and returns the rest of the information about that entity. Equivalent definition: an "abstraction" is some entity that only returns some information about some other entity, implicitly not returning the rest of the information about that entity.
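
To make this definition concrete, here is a toy sketch in code (the names `TreeState` and `abstract_tree` are mine, purely for illustration, not part of any existing formalism): an abstraction is just a map that keeps some information about an entity and silently discards the rest.

```python
# A toy sketch of "abstraction as throwing away information".
# TreeState and abstract_tree are hypothetical names, purely for illustration.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TreeState:
    """A low-level description of one tree, with far more detail than we usually need."""
    height_m: float
    species: str
    leaf_positions: List[Tuple[float, float, float]] = field(default_factory=list)  # detail we rarely care about
    branch_angles: List[float] = field(default_factory=list)                         # more detail we rarely care about

def abstract_tree(state: TreeState) -> Dict[str, object]:
    """An abstraction: return only some information, implicitly discarding the rest."""
    return {"height_m": state.height_m, "species": state.species}

oak = TreeState(height_m=18.0, species="oak", leaf_positions=[(0.1, 2.3, 4.5)], branch_angles=[30.0])
print(abstract_tree(oak))  # {'height_m': 18.0, 'species': 'oak'}
```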

Similarly, to "steer cognition" is to change a mind (including and especially one that isn't yet operating) such that that mind (at some point during activation) has a terminal goal that is based on some input added by the "steering" operation.

We can think of steering as being done by an entity added to an AI system. Steering could also be "baked into" an AI system's architecture. However, it's hard to imagine how a steering-entity could be completely inseparable from the AI system whose behavior we want to steer. That is, we would expect this steering-entity to be an entity in its own right, separable from the AI it steers.

Given these definitions, we can guess that an entity used for steering cognition will also be an entity that removes information about the mind being steered. A trivial case of a steering-entity is "add a human-oversight yes/no check for every time a neuron's weight is changed whilst training an NN".

Why expect this? Well, humans have bounded cognition, currently (i.e. before the creation of an AGI). If an existing human mind could already perfectly simulate an AGI, we would not need AI alignment research (and/or the simulated AGI would have escaped the "boxing" of that human's mind).

An Illustrative Example

Let’s add a little “switch” to a neural network, so a human has to click “yes” or “no” every time a neuron’s weight is going to be changed, along with a complete description of what that value change would be!

Wait, where are you going? Why don’t you want to implement this? Doesn't this let us "steer" the system directly?
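
Concretely, the switch might look something like the minimal sketch below. Everything here is hypothetical and chosen only for illustration (it is not a real training setup): each individual weight update is shown to a human, who approves or rejects it.

```python
# A minimal sketch of the proposed "switch" (hypothetical; not a real training setup).
# Every single proposed weight change is shown to a human, who clicks "yes" or "no".
import numpy as np

def human_approves(index: int, old_value: float, proposed_value: float) -> bool:
    """Show the human operator a complete description of one weight change and ask for approval."""
    answer = input(f"Change weight {index}: {old_value:.6f} -> {proposed_value:.6f}? [y/n] ")
    return answer.strip().lower() == "y"

def gated_update(weights: np.ndarray, gradients: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Apply each individual weight update only if the human approves it."""
    new_weights = weights.copy()
    for i in range(weights.size):
        old = weights.flat[i]
        proposed = old - lr * gradients.flat[i]
        if human_approves(i, old, proposed):
            new_weights.flat[i] = proposed
    return new_weights
```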

There are two main problems with such a switch:

  1. While it gives a human operator a comprehensive view of available information, no human has the mental processing power to figure out what each tiny weight-change implies. So the human operator’s “control” is much like that of a 5-year-old human trying to control a 747 using not the cockpit controls, but the unlabeled circuits which those controls are hooked up to.
  2. The switch adds so much cost, via its human-in-the-loop check-every-single-step-manually approach, that the first AGI that’s [powerful enough to be dangerous] can’t reasonably be expected to be using the switch.

(Actually, there are a few more problems, mostly “hacking” whatever process happens after a human clicks “no” on an update. The core problems remain.)

It turns out, what we humans want in “control” or “steering” or “the ‘align’ part of ‘alignment’” is for some number of desiderata to be satisfied by the mechanism we use for the steering. Specifying those desiderata is outside the scope of this essay, but one of them would be that the steering-entity is "model-able by the bounded human mind".

Basically, given how high-dimensional an AGI's mind is, aligning it to a human's desires requires a simplification that a human mind can make accurate predictions about. We turn the steering wheel left, and the car turns left. This simplification requires throwing away information (lest we exceed the limits of human working memory), which means we need abstraction.

Do you want a large, powerful AI system to be “interpretable” to your bounded human mind? Do you want to be able to point at a part of an AI’s architecture, and say “this is Green, it should be Blue if we want to avoid doom”, and be right? Do you want a different mix of information than the printout of GPT-4’s NN weights? Then you inescapably want to extract some information about an AI system that you deem more important than the rest of such information. In other words, you want abstraction.

Necessity of Abstraction Theory With and Without the NAH

Recall that the NAH claims many kinds of minds would converge on developing similar abstractions to summarize the world around them. If the NAH is true, then there is a clear route for "translation" between the internal concepts used by humans and those used by AI systems: just convert both to the corresponding "natural" abstraction! Even so, we'd have a hard time doing such translation without a theory of abstraction itself.
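
Schematically (and only schematically: every function below is a hypothetical placeholder, since building such translators is exactly what would require a theory of abstraction), the hoped-for route looks like this:

```python
# A schematic sketch of the "translation" route if the NAH holds.
# All functions are hypothetical placeholders: implementing them is the hard part,
# and doing so is precisely what a theory of abstraction would be for.

def human_concept_to_natural(human_concept):
    """Map a human concept (e.g. 'tree') to the corresponding natural abstraction."""
    raise NotImplementedError("requires a theory of abstraction")

def ai_representation_to_natural(ai_internal_representation):
    """Map an AI's internal representation to the corresponding natural abstraction."""
    raise NotImplementedError("requires a theory of abstraction")

def pointing_at_same_thing(human_concept, ai_internal_representation) -> bool:
    """Human and AI 'point' at the same thing iff both map to the same natural abstraction."""
    return (human_concept_to_natural(human_concept)
            == ai_representation_to_natural(ai_internal_representation))
```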

While less-discussed, one assumption of the NAH is that an A(G)I would have use for concepts at all. That is, it would not be directly simulating reality. This seems overwhelmingly likely to be true, given how much computational power would be needed to accurately simulate reality at the quantum level. If an AGI's cognition is bounded, it will be processing reality in a way that throws away information, meaning it must be using abstractions in the sense noted above.

Nevertheless, we could still imagine abstraction theory being useful if an AI uses non-natural concepts (i.e. if the NAH were false). After all, a theory of abstraction would tell us, in more detail, how much information about AI systems we could throw away and still get useful predictions about their behavior. But would abstraction theory really be necessary?

After all, what about existing prediction/steering entities that didn't need Wentworth-style abstraction theory? Isaac Newton didn’t have “minimal maps” lying around when he created his Laws of Motion. Albert Einstein probably didn’t have a probability graph on his desk when he came up with General Relativity. So maybe we don’t need a “theory of abstraction” to steer cognition “well enough” to avoid catastrophe. After all, we probably shouldn’t wait around for the provably-correct interpretation of quantum mechanics or a solution to all of philosophy, either. Why bother with abstractions, especially if the NAH is false?

Because the absence of natural abstractions simply moves the which-information-to-throw-away problem to a different step. Instead of translating human concepts to "natural" concepts, and then into AI concepts, we have to translate human concepts directly into AI concepts. If (as conjectured above) an AI will use abstractions of any sort, we must understand those abstractions well enough to [modify them and reasonably expect [predictable changes in the AI's behavior]].

If steering cognition (as conjectured) is necessary for aligning AI, then AI alignment requires understanding abstraction deeply enough to translate our intentions (or, possibly, any intentions) into concepts that guide the AI. Even if there are multiple indirect layers between "steering mechanism" and "AI goals", as long as human minds are too bounded to "just use all the information at hand" to control an AI, we will need a theory of abstractions.

We're not just developing abstractions to apply to an AI; we are developing abstractions that must, themselves, work with whatever abstractions an AGI ends up having.

Implications for P(doom)

If we think the "abstraction agenda" is correct, then our estimate of the probability of AGI-caused catastrophe should decrease. After all, settling on this approach should make alignment more tractable. Additionally, if the NAH is true, AI alignment is simplified, because human and AI minds could both reliably "point" to the same things in the world.

Still, this is no cause for complacency! "Tractable" does not mean "easy" or "quick to make progress on". One example: it took several months between John Wentworth posting his semiformal results and the conversion of some of those results into formal proofs. If your timelines are short enough, this glacial pace of work should frighten you! Then again, some other alignment research (including interpretability) is somewhat converging on results and ideas that we'd expect to see in a mature "abstraction theory" subfield, so maybe a lot of breakthroughs will happen quickly and soon.