I recently found a label I like for one of the obstacles to AI alignment that I think about frequently. I think of it as "compact models" vs. "wide models," which I'll try to define by suggestive naming, followed by example. It seems easy to verify that compact models have particular properties, and thus we hope it would be easy to verify that they have desirable ones, but it seems difficult to do the same verification for wide models. Unfortunately and importantly, it seems to me like the best models of human values and decision-making are wide, and so ensuring that a system is compatible with humans inherits the difficulties of reasoning about wide models, even if the artificial system is not itself wide.
This distinction is related to the "neats vs. scruffies" debate of old, but is (I hope) a somewhat different take. To some extent, the 'compact model' maps onto the 'neat' approach, whereas the 'wide model' maps more closely onto a synthesis of the neat and scruffy approaches.
Some examples of compact models:
A Turing machine is a compact model of computation: easy to describe, easy to reason about, and easy to prove things about. But it's also somewhat detached from actual computers and programming.
The VNM-utility formulation is a compact model of preferences conjugate to decision-making under uncertainty. It is, in many respects, the simplest consequentialist to reason about, and likewise the argument for it being the ideal form is simple. But when trying to descriptively model actual human behavior with utility functions, one needs to include lots of wrinkles, not just in the utility function but also for the decision procedure and model of uncertainty.
In numerical optimization, linear programming refers to constrained optimization problems in which all variables are real numbers and the cost function and all constraints are linear in those variables. This structure allows for an extremely slick solution method, the 'simplex method,' whose mechanics and correctness argument can both be easily grasped. Almost all of the difficulty of applying it in practice has been compartmentalized into getting the correspondence right between the mathematical model being optimized and the external world where the solution is put into place.
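To make the "compact" flavor of linear programming concrete, here is a minimal sketch for the two-variable case. It is not the simplex method; it is a brute-force stand-in that relies on the same underlying fact, namely that a bounded, feasible linear program attains its optimum at a vertex of the feasible region. The function name `solve_lp_2d` and the example numbers are made up for illustration, and the sketch assumes the feasible region is bounded and non-empty.

```python
from fractions import Fraction
from itertools import combinations


def solve_lp_2d(c, constraints):
    """Maximize c[0]*x + c[1]*y subject to a*x + b*y <= r for each (a, b, r).

    Hypothetical helper: enumerates every intersection of two constraint
    boundaries, keeps the feasible ones, and returns (value, x, y) for the best.
    Assumes a bounded, non-empty feasible region and integer coefficients.
    """
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundaries never meet at a vertex
        # Cramer's rule for the intersection of the two boundary lines.
        x = Fraction(r1 * b2 - r2 * b1, det)
        y = Fraction(a1 * r2 - a2 * r1, det)
        # A candidate vertex counts only if it satisfies every constraint.
        if all(a * x + b * y <= r for a, b, r in constraints):
            value = c[0] * x + c[1] * y
            if best is None or value > best[0]:
                best = (value, x, y)
    return best


# Illustrative problem: maximize 3x + 2y
# subject to x + y <= 4, x + 3y <= 6, x <= 3, x >= 0, y >= 0.
example_constraints = [
    (1, 1, 4),
    (1, 3, 6),
    (1, 0, 3),
    (-1, 0, 0),  # x >= 0 written as -x <= 0
    (0, -1, 0),  # y >= 0 written as -y <= 0
]
best = solve_lp_2d((3, 2), example_constraints)
print(best)  # optimum value 11 at x = 3, y = 1
```

The whole correctness argument fits in the docstring, which is the point: nothing in the code says whether `example_constraints` describes the real situation you care about, and that is where the practical difficulty lives.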
A wide model still has to have a conceptually crisp core, but its power comes from the extensibility or composability of that core. Its functionality is a predictable result of the careful arrangement of elements, but it requires some minimal (and often large!) size to function.
The first thing that comes to mind is processor design, where a rather simple base element (the transistor) is combined with itself again and again, even to implement the simple functionality of arithmetic. Verifying a particular processor design behaves as intended is mostly not about verifying the basic logic of transistors, but that the billions of them are hooked up correctly.
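As a toy illustration of that composition, the sketch below builds an adder out of a single primitive. It uses a NAND gate as the base element rather than a transistor (an assumption made for brevity; real designs sit one level lower), and all function names are hypothetical.

```python
def nand(a: int, b: int) -> int:
    """The only primitive; everything else below is wiring."""
    return 1 - (a & b)


def not_(a: int) -> int:
    return nand(a, a)


def and_(a: int, b: int) -> int:
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))


def xor(a: int, b: int) -> int:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits, returning (sum_bit, carry_out)."""
    partial = xor(a, b)
    total = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out


def add(x: int, y: int, width: int = 8) -> int:
    """Ripple-carry adder: chain one full adder per bit position."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result  # overflow wraps modulo 2**width


print(add(5, 3))  # 8
```

Checking `nand` takes a glance at its four possible inputs. Checking `add` means checking that every wire between the gates goes where it should, and that burden grows with the width of the design.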
For human psychology, I'm enamored of the hierarchical control model, where small, easily understandable units are composed to regulate progressively more complicated states. But while it's clear how the units work, and clear how the units could be assembled into a functioning ensemble, it is also clear that an ensemble that can convincingly approximate the inner workings of a human mind needs to be of immense size. Compare this to the situation in machine learning, where convolutional neural networks are able to do a broad subset of human visual processing using easily understandable subunits arranged in a particular way, with megabytes of parameter data necessary to make it work correctly.
In numerical optimization, many heuristic and metaheuristic methods have superior practical performance (in time-constrained settings, which is all settings for problems of meaningful size) to methods that seek to find both an optimal solution and a proof that the solution is optimal. But such methods may make use of complicated and deeply embedded knowledge, and it may be non-obvious or difficult to prove many desirable properties about them.
Why draw this distinction between models? It might seem that, because of its fundamental generality, a utility function can encode everything a hierarchical control model does, just as a Turing machine can be used to simulate a processor design (or used to reason about processor designs). What I like about the wide models is that they begin to incorporate the sticky, gritty nature of the real world into the model. It is clear from the description of a CNN that it requires a huge number of parameters, whereas the same is not as clear from the description of the utility function; the CNN contains mechanisms to deal with that complexity, whereas the utility function outsources them.
Bringing this obstacle into the model means bringing the method for addressing that obstacle into the model as well. With a CNN, we can quantify and formally describe the ways in which it is and isn't robust, and examine how varying the design varies that robustness.
It also raises other questions that might not have been obvious under other formalizations. How do we investigate and quantify the correspondence between wide models, so that we can be sure they agree on the essentials without requiring a check for exact equality? How do we build models of reasoning that let us confidently predict the future behavior of a wide model in unknown environments?
Some answers have been suggested to those questions, with varying tradeoffs and drawbacks, which I won't go into here. The primary use I've gotten so far from thinking in this way is having a question in my back pocket that I can use to engage with proposals for alignment: how does the proposal deal with these sorts of difficulties? Where does the complexity reside?
This also seems useful for helping explain why I think alignment is likely difficult. If intelligence is shaped like an onion, then perhaps you can have an aligned seed that then grows out to be wide and have many layers. But if intelligence is instead shaped like a pyramid, then we can't place a capstone of alignment and build downwards; the correctness of the more abstract levels depends on the foundations on which those rest.