This is a special post for quick takes by Jeremy Gillen.

Here's a mistake some people might be making with mechanistic interpretability theories of impact (and some other things, e.g. how much neuroscience is useful for understanding AI or humans).

When there are multiple layers of abstraction that build up to a computation, understanding the low level doesn't help much with understanding the high level. 

1. Understanding semiconductors and transistors doesn't tell you much about programs running on the computer. The transistors can be reconfigured into a completely different computer, and you'll still be able to run the same programs. To understand a program, you don't need to be thinking about transistors or logic gates. Often you don't even need to be thinking about the bit level representation of data.
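A toy illustration of that last point (mine, not from the original post): a high-level algorithm like a comparison sort never touches the bit-level representation of its data, so the same code is indifferent to whether the elements are ints, floats, or strings, each of which has a completely different low-level encoding.

```python
def insertion_sort(xs):
    """Sort a sequence using only comparisons between elements."""
    out = list(xs)
    for i in range(1, len(out)):
        j = i
        # Walk each element left until it's in order.
        while j > 0 and out[j] < out[j - 1]:
            out[j], out[j - 1] = out[j - 1], out[j]
            j -= 1
    return out

# The algorithm never inspects how elements are represented in memory:
print(insertion_sort([3, 1, 2]))        # → [1, 2, 3]   (ints)
print(insertion_sort([3.0, 1.0, 2.0]))  # floats: different bit layout, same behavior
print(insertion_sort(["c", "a", "b"]))  # → ['a', 'b', 'c'] (strings: different again)
```

Understanding two's-complement or IEEE 754 tells you essentially nothing about why this sort is correct; the relevant abstraction is "a total order on the elements".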

2. The computation happening in single neurons in an artificial neural network doesn't have much relation to the computation happening at a high level. What I mean is that you can switch out activation functions, randomly connect neurons to other neurons, randomly share weights, or replace small chunks of network with some other differentiable parameterized function. And assuming the thing is still trainable, the overall system will still learn to execute a function that is, at a high level, pretty similar to whatever high-level function you started with.[1]
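A minimal numpy sketch of one such low-level reconfiguration (my example, not from the post): permuting which hidden neuron computes what in a small MLP changes every individual neuron's role while leaving the high-level function bit-for-bit identical.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer MLP: x -> relu(x @ W1) @ W2, with arbitrary weights.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

x = rng.normal(size=(5, 4))

# Reassign which hidden neuron computes what: permute the columns of W1
# and, correspondingly, the rows of W2.
perm = rng.permutation(8)
W1p, W2p = W1[:, perm], W2[perm, :]

# Each individual neuron now computes something different, but the
# high-level input-output function is exactly unchanged.
assert np.allclose(mlp(x, W1, W2), mlp(x, W1p, W2p))
```

This is just the weakest version of the claim (an exact symmetry); the post's stronger claim is that even non-symmetric changes (activation swaps, rewiring) leave the *learned* high-level function roughly intact after retraining.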

3. Understanding how neurons work doesn't tell you much about how the brain works. Neuroscientists understand a lot about how neurons work. There are models that make good predictions about the behavior of individual neurons or synapses. I bet that the high level algorithms that are running in the brain are most naturally understood without any details about neurons at all. Neurons probably aren't even a useful abstraction for that purpose. 


Probably directions in activation space are also usually a bad abstraction for understanding how humans work, kinda analogous to how bit-vectors of memory are a bad abstraction for understanding how a program works.

Of course John has said this better.

  1. ^

    You can mess with inductive biases of the training process this way, which might change the function that gets learned, but (my impression is) usually not that much if you're just messing with activation functions.

Also note that (e.g.) Dan H.[1] has also advocated for some version of this take. See for instance the Open Problems in AI X-Risk (Pragmatic AI safety #5) section on criticisms of transparency:

When analyzing complex systems (such as deep networks), it is tempting to separate the system into events or components (“parts”), analyze those parts separately, and combine results or “divide and conquer.”

This approach often wrongly assumes (Leveson 2020):

  • Separation does not distort the system’s properties
  • Each part operates independently
  • Each part acts the same when examined singly as when acting in the whole
  • Parts are not subject to feedback loops and nonlinear interactions
  • Interactions between parts can be examined pairwise 

Searching for mechanisms and reductionist analysis is too simplistic when dealing with complex systems (see our third post for more).

People hardly understand complex systems. Grad students in ML don’t even understand various aspects of their field, how to make a difference in it, what trends are emerging, or even what’s going on outside their small area. How will we understand an intelligence that moves more quickly and has more breadth? The reach of a human mind has limits. Perhaps a person could understand a small aspect of an agent’s actions (or components), but it’d be committing the composition fallacy to suggest a group of people that individually understand a part of an agent could understand the whole agent.

  1. ^

    I think Dan is the source of this take in the post I link rather than the other author Thomas W, but not super confident.

Yeah I think I agree. It also applies to most research about inductive biases of neural networks (and all of statistical learning theory). Not saying it won't be useful, just that there's a large mysterious gap between great learning theories and alignment solutions and inside that gap is (probably, usually) something like the levels-of-abstraction mistake.

This definitely sounds like a mistake someone could make while thinking about singular learning theory or neuroscience, but I don't think it sounds like a mistake that I'm making? It does in fact seem a lot easier to go from a theory of how model structure, geometry, & rewards map to goal generalization to a theory of values than it does to go from the mechanics of transistors to Tetris, or the mechanics of neurons to a theory of values.

The former problem (structure-geometry-&-rewards-to-goals to value-theory) is not trivial, but seems like only one abstraction leap, while the other problems seem like very many abstraction leaps (7 to be exact, in the case of transistors->Tetris).

The problem is not that abstraction is impossible, but that abstraction is hard, and you should expect to be done in 50 years if launching a research program requiring crossing 7 layers of abstraction (the time between Turing's thesis and the making of Tetris). If just crossing one layer, the same crude math says you should expect to be done in 7 years (edit: Though also going directly from a high-level abstraction to an adjacent high-level abstraction is a far easier search task than trying to connect a very low-level abstraction to a very high-level abstraction. This is also part of the mistake many make, and I claim it's not a mistake that I'm making).

My guess is that it's pretty likely that all of:

  • There aren't really any non-extremely-leaky abstractions in big NNs on top of something like a "directions and simple functions on these directions" layer. (I originally heard this take from Buck)
  • It's very hard to piece together understanding of NNs from these low-level components[1]
  • It's even worse if your understanding of low-level components is poor (only a small fraction of the training compute is explained)

That said I also think it's plausible that understanding the low-level could help a lot with understanding the high level even if there is a bunch of other needed work.

  1. This will depend on the way in which you understand or operate on low level components of course. Like if you could predict behavior perfectly just from a short text description for all low level components then you'd be fine. But this is obviously impossible in the same way it's impossible for transistors. You'll have to make reference to other concepts etc, and then you'll probably have a hard time. ↩︎

There aren't really any non-extremely-leaky abstractions in big NNs on top of something like a "directions and simple functions on these directions" layer. (I originally heard this take from Buck)

Of course this depends on what it's trained to do? And it's false for humans and animals and corporations and markets, we have pretty good abstractions that allow us to predict and sometimes modify the behavior of these entities.

I'd be pretty shocked if this statement was true for AGI.

I think this is going to depend on exactly what you mean by non-extremely-leaky abstractions.

For the notion I was thinking of, I think humans, animals, corporations, and markets don't seem to have this.

I'm thinking of something like "some decomposition or guide which lets you accurately predict all behavior". And then the question is how good are the best abstractions in such a decomposition.

There are obviously less complete abstractions.

(Tbc, there are abstractions on top of "atoms" in humans and abstractions on top of chemicals. But I'm not sure if there are very good abstractions on top of neurons which let you really understand everything that is going on.)

Ah I see, I was referring to less complete abstractions. The "accurately predict all behavior" definition is fine, but this comes with a scale of how accurate the prediction is. "Directions and simple functions on these directions" probably misses some tiny details like floating point errors, and if you wanted a human to understand it you'd have to use approximations that lose way more accuracy. I'm happy to lose accuracy in exchange for better predictions about behavior in previously-unobserved situations. In particular, it's important to be able to work out what sort of previously-unobserved situation might lead to danger. We can do this with humans and animals etc, we can't do it with "directions and simple functions on these directions".
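To make the "directions and simple functions on these directions" layer concrete, here's a hypothetical numpy toy (my construction, with made-up synthetic "activations", not anyone's actual interpretability result): a feature is read off as a single direction found by difference of class means, and the "simple function" is a threshold on the projection onto that direction.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Synthetic stand-in for hidden activations, with and without some feature.
base = rng.normal(size=(200, d))
feature_dir = rng.normal(size=d)
feature_dir /= np.linalg.norm(feature_dir)

acts_with = base[:100] + 3.0 * feature_dir  # feature "on": shifted along one direction
acts_without = base[100:]                   # feature "off"

# The "direction" abstraction: one vector, found by difference of means.
direction = acts_with.mean(axis=0) - acts_without.mean(axis=0)
direction /= np.linalg.norm(direction)

# A simple function on that direction: threshold the scalar projection.
scores_with = acts_with @ direction
scores_without = acts_without @ direction
threshold = (scores_with.mean() + scores_without.mean()) / 2
acc = ((scores_with > threshold).mean() + (scores_without <= threshold).mean()) / 2
print(round(float(acc), 2))
```

On-distribution this works well, which is exactly the worry in the thread: the abstraction can predict observed behavior accurately while still telling you little about which previously-unobserved situations are dangerous.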

Technical alignment game tree draft

I've found it useful to organize my thoughts on technical alignment strategies approximately by the AGI architecture (and AGI design related assumptions). The target audience of this text is mostly myself, when I have it better worded I might put it in a post. Mostly ignoring corrigibility-style approaches here.

  • End-to-end trained RL: Optimizing too hard on any plausible training objective leads to gaming that objective
    • Scalable oversight: Adjust the objective to make it harder to game, as the agent gets more capable of gaming it.
      • Problem: Transparency is required to detect gaming/deception, and this problem gets harder as you scale capabilities. Additionally, you need to be careful to scale capabilities slowly.
    • Shard theory: Agents that can control their data distribution will hit a phase transition where they start controlling their data distribution to avoid goal changes. Plausibly we could make this phase transition happen before deception/gaming, and train it to value honesty + other standard human values.
      • Problem: Needs to be robust to “improving cognition for efficiency & generality”, i.e. goal directed part of mind overriding heuristic morality part of mind.
  • Transfer learning
    • Natural abstractions: Most of the work of identifying human values comes from predictive world modeling. This knowledge transfers, such that training the model to pursue human goals is relatively data-efficient.
      • Failure mode: Can't require much optimization to get high capabilities, otherwise the capabilities optimization will probably dominate the goal learning, and re-learn most of the goals.
  • We have an inner aligned scalable optimizer, which will optimize a given precisely defined objective (given a WM & action space).
    • Vanessa!IRL gives us an approach to detecting and finding the goals of agents modeled inside an incomprehensible ontology.
      • Unclear whether the way Vanessa!IRL draws a line between irrationalities and goal-quirks is equivalent to the way a human would want this line to be drawn on reflection.
  • We have human-level imitators
    • Do some version of HCH
      • But imitators probably won’t be well aligned off distribution, for normal goal misgeneralization reasons. HCH moves them off distribution, at least a bit.
  • End-to-end RL + we have unrealistically good interpretability
    • Re-target the search
      • Interpretability probably won’t ever be this good, and if it was, we might not need to learn the search algorithm (we could build it from scratch, probably with better inner alignment guarantees).