I recently found a label I like for one of the obstacles to AI alignment that I think about frequently. I think of it as "compact models" vs. "wide models," which I'll try to define by suggestive naming, followed by example. It seems easy to verify that compact models have particular properties, and thus we hope it would be easy to verify that they have desirable ones, but it seems difficult to do the same verification for wide models. Unfortunately and importantly, it seems to me like the best models of human values and decision-making are wide, and so ensuring that a system is compatible with humans inherits the difficulties of reasoning about wide models, even if the artificial system is not itself wide.

This distinction is related to the "neats vs. scruffies" debate of old, but is (I hope) a somewhat different take. To some extent, the 'compact model' maps onto the 'neat' approach, whereas the 'wide model' maps closer to the *synthesis* of the neat and scruffy approach.

Some examples of compact models:

A Turing machine is a compact model of computation: easy to describe, easy to reason about, and easy to prove things about. But it's also somewhat detached from actual computers and programming.

The VNM-utility formulation is a compact model of preferences conjugate to decision-making under uncertainty. It is, in many respects, the simplest consequentialist to reason about, and likewise the argument for it being the ideal form is simple. But when trying to descriptively model actual human behavior with utility functions, one needs to include lots of wrinkles, not just in the utility function but also for the decision procedure and model of uncertainty.

In numerical optimization, linear programming refers to methods that solve constrained optimization problems where all variables are real numbers and the cost function and all constraints are linear in those variables. This allows for an extremely slick solution called the 'simplex method,' and both its mechanics and the argument for its correctness can be easily grasped. Almost all of the difficulty of applying it in practice has been compartmentalized to getting right the correspondence between the mathematical model being optimized and the external world where the solution is put into place.
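
To make the compactness concrete, here is a minimal sketch in Python. It is not the simplex method: it brute-forces a two-variable problem by checking every intersection of constraint boundaries, relying on the fact that a linear program over a non-empty, bounded feasible region attains its optimum at a vertex. The function name `solve_lp_2d` and the problem data are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

def solve_lp_2d(c, constraints):
    """Maximize c[0]*x + c[1]*y subject to a*x + b*y <= r for each (a, b, r).

    Brute force, not simplex: intersect every pair of constraint boundaries,
    keep the feasible intersection points (the vertices), return the best one.
    Assumes integer coefficients and a non-empty, bounded feasible region.
    """
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundaries have no single intersection point
        # Cramer's rule for the intersection of the two boundary lines.
        x = Fraction(r1 * b2 - r2 * b1, det)
        y = Fraction(a1 * r2 - a2 * r1, det)
        if all(a * x + b * y <= r for a, b, r in constraints):
            value = c[0] * x + c[1] * y
            if best is None or value > best[0]:
                best = (value, x, y)
    return best

# Hypothetical problem: maximize 3x + 2y
# subject to x + y <= 4, x + 3y <= 6, x <= 3, x >= 0, y >= 0.
example_constraints = [(1, 1, 4), (1, 3, 6), (1, 0, 3), (-1, 0, 0), (0, -1, 0)]
value, x, y = solve_lp_2d((3, 2), example_constraints)
print(value, x, y)
```

The whole argument for correctness fits in the docstring, which is the sense in which the model is compact. Deciding whether those five constraints describe the real situation is the part the code cannot help with.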

A wide model still has to have a conceptually crisp core, but its power comes from the extensibility or composability of that core. Its functionality is a predictable result of the careful arrangement of elements, but it requires some minimal (and often large!) size to function.

The first thing that comes to mind is processor design, where a rather simple base element (the transistor) is combined with itself again and again, even to implement the simple functionality of arithmetic. Verifying that a particular processor design behaves as intended is mostly not about verifying the basic logic of transistors, but about verifying that the billions of them are hooked up correctly.
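
As a toy illustration of "simple element, combined again and again," here is a sketch in Python that builds an 8-bit adder out of nothing but a NAND function. The NAND gate stands in for the transistor-level element, and all of the names (`nand`, `full_adder`, `ripple_add`) are my own choices for this example, not a description of any real chip.

```python
def nand(a, b):
    """The only primitive: NAND on single bits (0 or 1)."""
    return 1 - (a & b)

def xor(a, b):
    # Standard four-NAND construction of XOR.
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

def full_adder(a, b, carry_in):
    s1 = xor(a, b)
    total = xor(s1, carry_in)
    # carry = (a AND b) OR (s1 AND carry_in), written with NAND only.
    carry_out = nand(nand(a, b), nand(s1, carry_in))
    return total, carry_out

def ripple_add(x, y, width=8):
    """Add two width-bit integers by chaining full adders bit by bit."""
    carry, result = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry

print(ripple_add(200, 100))  # sum modulo 256, plus the carry-out bit
```

Each `full_adder` call here makes 11 `nand` calls, so even this 8-bit toy has 88 gate evaluations whose wiring has to be right. `nand` itself is trivially correct; the places a bug could hide are all in how the calls are connected.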

For human psychology, I'm enamored of the hierarchical controls model, where small, easily understandable units are composed to regulate progressively more complicated states. But while it's clear how the units work, and clear how the units could be assembled into a functioning ensemble, it is also clear that an ensemble that can convincingly approximate the inner workings of a human mind needs to be of immense size. Compare to the situation in machine learning, where convolutional neural networks are able to do a broad subset of human visual processing using easily understandable subunits arranged in a particular way, with megabytes of parameter data necessary to make it work correctly.

In numerical optimization, many heuristic and metaheuristic methods have superior practical performance (in time-constrained settings, which is all settings for problems of meaningful size) to methods that seek to find both an optimal solution and a proof that the solution is optimal. But heuristic methods may make use of complicated and deeply embedded knowledge, and it may be non-obvious or difficult to prove many desirable properties about them.

Why draw this distinction between models? It might seem that, because of its fundamental generality, a utility function can encode everything a hierarchical control model does, just as a Turing machine can be used to simulate a processor design (or to reason about processor designs). What I like about the wide models is that they begin to incorporate the sticky, gritty nature of the real world into the model. It is clear from the description of a CNN that it requires a huge number of parameters, whereas the same is not as clear from the description of the utility function; the CNN contains mechanisms to deal with that complexity, whereas the utility function outsources them.

Bringing this obstacle into the model means bringing the method for addressing that obstacle into the model as well. With a CNN, we can quantify and formally describe the ways in which it is and isn't robust, and examine how varying the design varies that robustness.

It also raises other questions that might not have been obvious under other formalizations. How do we investigate and quantify the correspondence between wide models, so that we can be sure they agree on the essentials without requiring a check for exact equality? How do we build models of reasoning that let us confidently predict the future behavior of a wide model in unknown environments?

Some answers have been suggested to those questions, with varying tradeoffs and drawbacks, which I won't go into here. The primary use I've gotten so far from thinking in this way is having this as a question in my back pocket that I can use to engage with proposals for alignment: how does the proposal deal with these sorts of difficulties? Where does the complexity reside?

This also seems useful for helping explain why I think alignment is likely difficult. If intelligence is shaped like an onion, then perhaps you can have an aligned seed that then grows out to be wide and have many layers. But if intelligence is instead shaped like a pyramid, then we can't place a capstone of alignment and build downwards; the correctness of the more abstract levels depends on the foundations on which those rest.

I definitely support having models that engage more with the messiness of the real world. I'm not sure if I would have used "wide models" -- it seems like even the assumption of a crisp core makes it not as capable of handling messiness as I want. But if you're trying to get formal guarantees and you need to use some model, a wide model seems probably useful to use.

This doesn't (yet) seem like an argument that alignment is likely difficult. Why should intelligence be shaped like a pyramid? Even if it is, how does alignment depend on the shape of intelligence? Intuitively, if intelligence is shaped like a pyramid, then it's just really hard to get intelligence, and so we don't build a superintelligent AI.

Agreed that the rest of the argument is undeveloped in the OP.

First is the argument that animal intelligence is approximately pyramidal in its construction, with neurons serving roles at varying levels of abstraction, and (importantly) layers that are higher up being expressed in terms of neurons at lower layers, in basically the way that neurons in a neural network work.

Alignment can (sort of) be viewed as a correspondence between intelligences. One might analogize this to comparing two programs and trying to figure out if they behave similarly. If the programs are neural networks, we can't just look at the last layer and see if the parameter weights line up; we have to look at all the parameters, and do some complicated math to see if they happen to be instantiating the same (or sufficiently similar) functions in different ways. For other types of programs, checking that they're the same is much easier; for example, consider the problem of showing that two formulations of a linear programming problem are equivalent.
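
A small sketch of the neural network half of that analogy, in Python. The two toy networks below are hypothetical and hand-written. `net_b` is `net_a` with its hidden units reordered and one unit rescaled, so the parameters disagree position by position even though the function is the same. Comparing outputs on random inputs is only evidence of agreement, not a proof.

```python
import math
import random

def relu(v):
    return max(0.0, v)

def forward(net, x):
    """One hidden ReLU layer; net = (hidden_weights, hidden_biases, output_weights)."""
    w_hidden, b_hidden, w_out = net
    hidden = [relu(w * x + b) for w, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden))

# Hypothetical toy network with three hidden units.
net_a = ([1.0, -2.0, 0.5], [0.0, 1.0, -1.0], [3.0, 1.0, -4.0])

# Same function, different parameters: hidden units reordered, and the unit
# with weight 1.0 rescaled (input side doubled, output side halved), which
# works because relu(2 * z) == 2 * relu(z).
net_b = ([0.5, 2.0, -2.0], [-1.0, 0.0, 1.0], [-4.0, 1.5, 1.0])

def agree_on_samples(net1, net2, trials=1000, seed=0):
    """Evidence (not proof) of equivalence: compare outputs on random inputs."""
    rng = random.Random(seed)
    return all(
        math.isclose(forward(net1, x), forward(net2, x), abs_tol=1e-9)
        for x in (rng.uniform(-10.0, 10.0) for _ in range(trials))
    )

print(net_a[2] == net_b[2])            # do the last-layer weights line up?
print(agree_on_samples(net_a, net_b))  # do the outputs agree on the samples?
```

Here the symmetry was planted by hand, so we know what to look for. For two independently trained networks there is no such shortcut, which is the "complicated math" part.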

I think "really hard" is an overstatement here. It looks like evolution built lizards then mammals then humans by gradually adding on layers, and it seems similarly possible that we could build a very intelligent system out of hooking together lots of subsystems that perform their roles 'well enough' but without the sort of meta-level systems that ensure the whole system does what we want it to do. Often, people have an intuition that either the system will fail to do anything at all, or it will do basically what we want, which I think is not true.

Cool, I think I mostly agree with you.

I'm not sure that this implies that alignment is hard -- if you're trying to prove that your system is aligned by looking at the details of how it is constructed and showing that it all works together, then yes, alignment is harder than it would be otherwise. But you could imagine other versions of alignment, e.g., taking intelligence as a black box and pointing it in the right direction. (For example, if I magically knew the true human utility function, and I put that in the black box, the outcomes would probably be good.)

Here when I say "aligned" I mean "trying to help". It's still possible that the AI is incompetent and fails because it doesn't understand what the consequences of its actions are.

A possible example:

"Sleeping beauty" is a compact model of the Doomsday argument, and "mediocrity principle" is a wide model of DA.

Not sure if this is helpful, but since you analogized to chip design: there, you typically verify using a constrained random method when the state space grows too large to verify every input exhaustively. That is, you construct a distribution over the set of plausible input strings, sample from it, and feed the samples to your design. Then you compare the results to those of a reference model written in a higher-level language.
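
A rough sketch of that flow in Python, with everything simplified. The "design" is a bit-level adder, the reference model is ordinary integer addition, and the input distribution is biased toward corner cases. The names (`dut_add`, `reference_add`, `sample_operand`, `random_verify`) and the 30% corner-case weighting are assumptions made for this example, not any real verification framework's API.

```python
import random

def dut_add(x, y, width=16):
    """Stand-in 'design under test': addition by bitwise carry propagation."""
    mask = (1 << width) - 1
    while y:
        carry = (x & y) << 1   # bit positions that generate a carry
        x = (x ^ y) & mask     # sum without the carries
        y = carry & mask       # carries are added in on the next pass
    return x

def reference_add(x, y, width=16):
    """Reference model in the 'higher-level language': plain integer addition."""
    return (x + y) & ((1 << width) - 1)

def sample_operand(rng, width):
    """Constrained random input: mostly uniform, biased toward corner cases."""
    top = (1 << width) - 1
    corners = [0, 1, top, top - 1, 1 << (width - 1)]
    if rng.random() < 0.3:
        return rng.choice(corners)
    return rng.randrange(top + 1)

def random_verify(dut, reference, trials=10_000, width=16, seed=0):
    """Return the first mismatching input pair found, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = sample_operand(rng, width), sample_operand(rng, width)
        if dut(x, y, width) != reference(x, y, width):
            return (x, y)
    return None

print(random_verify(dut_add, reference_add))
```

A result of `None` only means no mismatch turned up in the sampled inputs. How much confidence that buys depends on how well the sampling distribution covers the inputs you care about.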

Of course, standard techniques like designing for modularity can make the state space more manageable too.