
Interpretability

I have a decent understanding of how transistors work, at least for purposes of basic digital circuitry. Apply high voltage to the gate, and current can flow between source and drain. Apply low voltage, and current doesn’t flow. (... And sometimes reverse that depending on which type of transistor we’re using.)

[Transistor diagram from Wikipedia, showing Source, Drain, Gate, and (usually ignored unless we're really getting into the nitty-gritty) Body.] At a conceptual level: when voltage is applied to the Gate, the charge on the gate attracts electrons (or holes) up into the gap between Source and Drain, and those electrons (or holes) then conduct current between Source and Drain.
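
To make that switch picture concrete, here's a minimal sketch in Python (idealized on/off switches only, ignoring the Body terminal and all real device physics):

```python
# Idealized transistor-as-switch model: no physics, just a gate-voltage threshold.

def nmos_conducts(gate_voltage: float) -> bool:
    """NMOS: a HIGH gate voltage lets current flow between Source and Drain."""
    return gate_voltage > 0.5

def pmos_conducts(gate_voltage: float) -> bool:
    """PMOS: the reversed case -- a LOW gate voltage turns it on."""
    return gate_voltage < 0.5

print(nmos_conducts(1.0), nmos_conducts(0.0))  # True False
print(pmos_conducts(0.0), pmos_conducts(1.0))  # True False
```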

I also understand how to wire transistors together into a processor and memory. I understand how to write machine and assembly code to run on that processor, and how to write a compiler for a higher-level language like e.g. Python. And I understand how to code up, train, and run a neural network from scratch in Python.
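
The next level up works the same way, e.g. a CMOS NAND gate built from four of those idealized switches, and an XOR built from four NANDs (a toy sketch, not a real circuit simulator; voltages are just 0.0 and 1.0):

```python
HIGH, LOW = 1.0, 0.0

# Same idealized switch model as above, repeated so this snippet runs on its own.
def nmos_conducts(gate): return gate > 0.5
def pmos_conducts(gate): return gate < 0.5

def nand(a, b):
    # CMOS NAND: two PMOS in parallel pull the output HIGH unless both inputs
    # are HIGH; two NMOS in series pull it LOW only when both inputs are HIGH.
    pull_up = pmos_conducts(a) or pmos_conducts(b)
    pull_down = nmos_conducts(a) and nmos_conducts(b)
    return HIGH if pull_up and not pull_down else LOW

def xor(a, b):
    # Standard four-NAND construction of XOR; keep stacking constructions like
    # this and you eventually get adders, registers, a processor...
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

for a in (LOW, HIGH):
    for b in (LOW, HIGH):
        print(int(a), int(b), "->", int(xor(a, b)))  # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```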

In short, I understand all the pieces from which a neural network is built at a low level, and I understand how all those pieces connect together. And yet, I do not really understand what’s going on inside of trained neural networks.

This shows that interpretability is not composable: if I take a bunch of things which I know how to interpret, and wire them together in a way I understand, I do not necessarily know how to interpret the composite system. Composing interpretable pieces does not necessarily yield an interpretable system.
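
To make that concrete with the neural-network case: here's a complete toy network in numpy, built from nothing but multiplies, adds, and simple nonlinearities, trained on XOR. Every individual operation is perfectly transparent, but the trained weight matrices are not something I can just read an explanation off of. (A minimal illustrative sketch; the details of the net don't matter.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# A tiny 2-4-1 network: tanh hidden layer, sigmoid output.
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
lr = 0.5

for step in range(5000):
    # Forward pass: just matrix multiplies, adds, and elementwise nonlinearities.
    h = np.tanh(X @ W1 + b1)
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))
    # Backward pass: gradients of mean squared error, by hand.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(0)

print(np.round(out, 2).ravel())  # typically close to [0, 1, 1, 0]
print(W1)                        # ...and good luck reading meaning directly off these numbers
```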

Tools

The same applies to “tools”, in the sense of “tool AI”. Transistors and wires are very tool-ish: I understand what they do, they’re definitely not optimizing the broader world or trying to trick me or modelling me at all or trying to self-preserve or acting agenty in general. They’re just simple electronic tools.

And yet, assuming agenty AI is possible at all, it will be possible to assemble those tools into something agenty.

So, like interpretability, tool-ness is not composable: if I take a bunch of non-agenty tools, and wire them together in a way I understand, the composite system is not necessarily a non-agenty tool. Composing non-agenty tools does not necessarily yield a non-agenty tool.

Alignment/Corrigibility

What if I take a bunch of aligned and/or corrigible agents, and “wire them together” into a multi-agent organization? Is the resulting organization aligned/corrigible?

Actually there's a decent argument that it is, if the individual agents are sufficiently capable. If the agents can model each other well enough and coordinate well enough, then they should each be able to predict which of their own actions will cause the composite system to behave in an aligned/corrigible way; and since they want to be aligned/corrigible, they'll take those actions.

However, this does not work if the individual agents are very limited and unable to model the whole big-picture system. HCH-like proposals are a good example here: humans are not typically able to model the whole big picture of a large human organization. There are too many specialized skillsets, too much local knowledge and information, too many places where complicated things happen which the spreadsheets and managerial dashboards don’t represent well. And humans certainly can’t coordinate at scale very well in general - our large-scale communication bandwidth is maybe five to seven words at best. Each individual human may be reasonably aligned/corrigible, but that doesn’t mean they aggregate together into an aligned/corrigible system.

The same applies if e.g. we magically factor a problem and then have a low-capability overseeable agent handle each piece. I could definitely oversee a logic gate during the execution of a program and make sure it did absolutely nothing fishy, but overseeing each individual logic gate would do approximately nothing to prevent the program from behaving maliciously.
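
As a toy illustration (a hypothetical circuit, purely for concreteness): every single NAND evaluation below passes per-gate oversight, yet the circuit as a whole implements a secret backdoor check that the overseer would presumably not endorse.

```python
def overseen_nand(a, b, log):
    out = 0 if (a and b) else 1
    log.append((a, b, out))
    # Per-gate oversight: confirm this gate did exactly what a NAND should. Always passes.
    assert out == (0 if (a and b) else 1)
    return out

def equals_backdoor(bits, backdoor, log):
    # Bitwise XNOR against a secret pattern, then AND the results together --
    # built entirely out of individually-overseen NAND gates.
    ok = 1
    for x, k in zip(bits, backdoor):
        n = overseen_nand(x, k, log)
        x_xor_k = overseen_nand(overseen_nand(x, n, log), overseen_nand(k, n, log), log)
        x_xnor_k = overseen_nand(x_xor_k, x_xor_k, log)               # NOT, via NAND(v, v)
        ok = overseen_nand(overseen_nand(ok, x_xnor_k, log), 1, log)  # AND, via two NANDs
    return ok

log = []
print(equals_backdoor([1, 0, 1, 1], [1, 0, 1, 1], log))  # 1: "access granted"
print(equals_backdoor([1, 1, 1, 1], [1, 0, 1, 1], log))  # 0: denied
print(len(log), "gate evaluations, each individually unremarkable")
```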

How These Arguments Work In Practice

In practice, nobody proposes that AI built from transistors and wires will be interpretable/tool-like/aligned/corrigible because the transistors and wires are interpretable/tool-like/aligned/corrigible. But people do often propose breaking things into very small chunks, so that each chunk is interpretable/tool-like/aligned/corrigible. For instance, interpretability people will talk about hiring ten thousand interpretability researchers to each interpret one little circuit in a net. Or problem factorization people will talk about breaking a problem into a large number of tiny little chunks each of which we can oversee.

And the issue is: the more little chunks we have to combine, the bigger a problem noncomposability becomes. If we're trying to compose interpretability/tool-ness/alignment/corrigibility of many little things, then figuring out how to turn interpretability/tool-ness/alignment/corrigibility of the parts into interpretability/tool-ness/alignment/corrigibility of the whole is the central problem, and it's a hard (and interesting) open research problem.

Comments

I don't think any factored cognition proponents would disagree with

Composing interpretable pieces does not necessarily yield an interpretable system.

They just believe that we could, contingently, choose to compose interpretable pieces into an interpretable system. Just like we do all the time with

  • massive factories with billions of components, e.g. semiconductor fabs
  • large software projects with tens of millions of lines of code, e.g. the Linux kernel
  • military operations involving millions of soldiers and support personnel

Figuring out how to turn interpretability/tool-ness/alignment/corrigibility of the parts into interpretability/tool-ness/alignment/corrigibility of the whole is the central problem, and it’s a hard (and interesting) open research problem.

Agreed this is the central problem, though I would describe it more as engineering than research - the fact that we have examples of massively complicated yet interpretable systems means we collectively "know" how to solve it, and it's mostly a matter of assembling a large enough and coordinated-enough engineering project. (The real problem with factored cognition for AI safety is not that it won't work, but that equally-powerful uninterpretable systems might be much easier to build).

Do we really have such good interpretations for those examples? It seems to me that we have big problems in the real world precisely because we don't.
We do have very high-level interpretations, but not enough to give solid guarantees. After all, we have a trivial very-high-level interpretation of our ML models: they learn! The challenge is not just to have clues, but to have clues relevant enough to address safety concerns at the scale of impact involved (which is the unprecedented feature of the AI field).

The upside of this, or of "more is different", is that we don't necessarily even need the property in the parts, or a detailed understanding of the parts. And how the composition works / what survives renormalization / ... is almost the whole problem.

The way I see it, having a lower-level understanding of things allows you to create abstractions about their behavior that you can use to understand them at a higher level. For example, if you understand how transistors work at a lower level, you can abstract away their behavior and more efficiently examine how they wire together to create memory and processors. This is why I believe that a circuits-style approach is the most promising one we have for interpretability.

Do you agree that a lower-level understanding of things is often the best way to achieve a higher-level understanding, in particular regarding neural network interpretability, or would you advocate for a different approach?


(Mostly just stating my understanding of your take back at you to see if I correctly got what you're saying:)

I agree this argument is obviously true in the limit, with the transistor case as an existence proof. I think things get weird at the in-between scales. The smaller the network of aligned components, the more likely it is to be aligned (obviously, in the limit if you have only one aligned thing, the entire system of that one thing is aligned); and also the more modular each component is (or I guess you would say the better the interfaces between the components), the more likely it is to be aligned. And in particular if the interfaces are good and have few weird interactions, then you can probably have a pretty big network of components without it implementing something egregiously misaligned (like actually secretly plotting to kill everyone).

And people who are optimistic about HCH-like things generally believe that language is a good interface and so conditional on that it makes sense to think that trees of humans would not implement egregiously misaligned cognition, whereas you're less optimistic about this and so your research agenda is trying to pin down the general theory of Where Good Interfaces/Abstractions Come From or something else more deconfusion-y along those lines.

Does this seem about right?

Good description.

Also I had never actually floated the hypothesis that "people who are optimistic about HCH-like things generally believe that language is a good interface" before; natural language seems like such an obviously leaky and lossy API that I had never actually considered that other people might think it's a good idea.

Natural language is lossy because the communication channel is narrow, hence the need for lower-dimensional representations (see ML embeddings) of what we're trying to convey. Lossy representations are also what abstractions are about.
But in practice, do you expect that Natural Abstractions (if discovered) cannot be expressed in natural language?

I expect words are usually pointers to natural abstractions, so that part isn't the main issue - e.g. when we look at how natural language fails all the time in real-world coordination problems, the issue usually isn't that two people have different ideas of what "tree" means. (That kind of failure does sometimes happen, but it's unusual enough to be funny/notable.) The much more common failure mode is that a person is unable to clearly express what they want - e.g. a client failing to communicate what they want to a seller. That sort of thing is one reason why I'm highly uncertain about the extent to which human values (or other variations of "what humans want") are a natural abstraction.


(see also this shortform, which makes a rudimentary version of the arguments in the first two subsections)

Did you mean: 

And yet, assuming **tool** AI is possible at all, it will be possible to assemble those tools into something agenty.

Nope. The argument is:

  • Transistors/wires are tool-like (i.e. not agenty)
  • Therefore if we are able to build an AGI from transistors and wires at all...
  • ... then it is possible to assemble tool-like things into agenty things.

This reminds me of the problems that STPA (System-Theoretic Process Analysis) is trying to solve in safe systems design:
https://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf

Their approach is designed to handle complex systems by decomposing the system into parts. However, the parts are not functions or tasks; instead, the system is decomposed into a control structure.

They approach this by treating a system as a graph of controllers (internal mesa-optimisers, potentially nested) which control processes and then receive feedback (internal loss functions) from those processes. From there, they can logically decompose the system controller by controller and lay out the ways in which the resulting overall system can be unsafe due to each particular controller.
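
As a rough sketch of what a control structure might look like as a data structure (my own toy rendering, not anything from the handbook):

```python
from dataclasses import dataclass, field

@dataclass
class Process:
    name: str

@dataclass
class Controller:
    name: str
    controls: list = field(default_factory=list)          # Processes or nested Controllers
    control_actions: list = field(default_factory=list)   # commands this controller can issue
    feedback: list = field(default_factory=list)          # signals it receives back

plant = Process("chemical reactor")
operator = Controller("human operator",
                      controls=[plant],
                      control_actions=["open valve", "close valve"],
                      feedback=["pressure reading", "temperature reading"])
management = Controller("plant management",
                        controls=[operator],
                        control_actions=["procedures", "production targets"],
                        feedback=["incident reports"])

# STPA then asks, per controller: which control actions are unsafe if provided,
# not provided, provided too early/late, or stopped too soon?
for ctrl in (management, operator):
    print(ctrl.name, "->", [c.name for c in ctrl.controls])
```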

Wouldn't it be amazing if one day we could train a neural network whose result is verifiably mappable, via mech-int, onto an STPA control structure? And then potentially have verifiable systems in place that themselves run STPA analyses on larger systems, in order to flag potential hazards given a scenario and the current control structure.

