Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

It has become common on LW to refer to "giant inscrutable matrices" as a problem with modern deep-learning systems.

To clarify: deep learning models are trained by creating giant blocks of random numbers -- blocks with dimensions like 4096 x 512 x 1024 -- and incrementally adjusting the values of these numbers with stochastic gradient descent (or some variant thereof). In raw form, these giant blocks of numbers are of course completely unintelligible. Many hold that the use of such giant SGD-trained blocks is why it is hard to understand or to control deep learning models, and therefore we should seek to make ML systems from other components.
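As a minimal sketch of that training loop -- toy dimensions standing in for blocks like 4096 x 512 x 1024, and a made-up least-squares objective standing in for a real loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "giant block of random numbers" -- tiny here; think 4096 x 512 x 1024 in practice.
W = rng.normal(scale=0.1, size=(8, 4))

# Toy objective: make W map known inputs X to known targets Y (least squares).
X = rng.normal(size=(64, 8))
Y = X @ rng.normal(size=(8, 4))  # targets generated by a hidden "true" matrix

lr = 0.05
for _ in range(500):
    pred = X @ W
    grad = X.T @ (pred - Y) / len(X)  # gradient of mean squared error wrt W
    W -= lr * grad                    # the SGD update: nudge every number slightly

loss = float(np.mean((X @ W - Y) ** 2))
```

Incrementally adjusting millions (or billions) of such numbers is the entire training procedure; nothing in the loop ever labels what any individual number means.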

There are several places where Yudkowsky or others state or strongly imply that because SGD-trained models with huge matrices are unintelligible, we should seek some more easily-interpretable paradigm.

I'm going to argue against that. I think that a better alternative is probably not possible; that the apparent inscrutability of these models actually has little-to-nothing to do with deep learning; and finally that this language -- particularly to the non-rationalist -- suggests unwarranted mystery.

0: It Is Probable That Generally Intelligent Systems Must be Connectionist

Imagine a universe in which it is impossible to build a generally intelligent system that is not massively connectionist. That is, imagine a world where the only way to get intelligence from atoms is to have a massive number of simple, uniform units connected to each other -- or something that is a functional equivalent of the same.

In such a world, all smart animals would have become smart by scaling up the number of such units that they have. The dominant evolutionary species might become intelligent by scaling up its head size, despite paying the evolutionary cost of making childbirth dangerous and painful by doing so. Flying species that could not afford the extra weight of scaling up skull volume might take another approach, shrinking their neurons to pack more of them into a given volume. Even animals far distant from the dominant species along the phylogenetic tree, in which the evolution of high levels of intelligence occurred entirely separately, would become smart by scaling up their brains.

The dominant species, once it could make information-processing equipment, might try for many years to build a generally intelligent system without massive connectionism. They might scorn connectionism as brute force, or as lacking insight; thousands of PhDs and software engineers would spend time devising specialized systems for image classification, voice transcription, language translation, video analysis, natural language processing, and so on. But once they coded up connectionist software -- then, in a handful of years, the prior systems built through hundreds of thousands of hours of effort would fall to simple systems that an undergrad could put together in his spare time. And connectionist systems would quickly vault beyond the realm of such prior systems, to build things completely out of reach of non-connectionist approaches.

Of course, such a world would be indistinguishable from our world.

Is this proof that intelligence must be connectionist? Of course not. We still await a Newton who might build a detailed causal model of intelligence, which confirms or refutes the above.

But if the universal failure of nature and man to find non-connectionist forms of general intelligence does not move you, despite searching for millions of years and millions of man-hours -- well, you could be right, but I'd really like to see any predictions an alternate hypothesis makes.

1.a: Among Connectionist Systems That We Know To Be Possible, Synchronous Matrix Operations Are the Most Interpretable

Given that general intelligence might simply require connectionist systems, what are the most interpretable connectionist systems that we can imagine? What alternatives to matrix multiplications do we know are out there?

Well, our current systems could be more biologically inspired! They could work through spike-timing-dependent-plasticity (STDP) neurons. We know these are possible, because biological brains exist. But such systems would be a nightmare to interpret, because they work asynchronously in time-separated bundles of neuronal firing. Interpreting asynchronous systems is almost always far more difficult than interpreting synchronous systems.

Or the calculations of our connectionist system could take place in non-digital substrates! Rather than being stored as arbitrarily transportable digital files, the weights could live in actual physical, artificial neurons that implement STDP or backpropagation on an analog device. Or you could use something even more biologically inspired -- something like Peter Watts' imagined cloned-neocortex-in-a-pan. But in such a neuron-inspired substrate, it could be a massive undertaking to do something as simple as reading out a synapse strength. Once again, interpretability would be harder.

I don't want to claim too much. I don't think current systems are at the theoretical apex of interpretability, not least because people can suggest ways to make them more interpretable.

But -- of all the ways we know general intelligence can be built, synchronous matrix operations are by far the easiest to understand.

1.b: And the Hard-To-Interpret Part of Matrices Comes From the Domain They Train on, And Not Their Structure

(Even in worlds where the above two points are false, I think this one is still probably true, although it is less likely.)

There are many clear interpretability successes for deep learning.

Small cases of grokking have been successfully reverse engineered. The interpretability team at OpenAI could identify neurons as abstract as the "Pokemon" neuron or the "Catholicism" neuron two years ago -- the same people, now at Anthropic, work on transformer circuits. It is possible to modify an LLM so it thinks that the Eiffel Tower is in Rome, or to mind-control a maze-solving agent to pursue a wide range of goals with just a single activation -- which, to my mind, reaches for the summit of interpretability, because understanding should enable control.
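The steering idea can be illustrated with a toy sketch -- this is not the actual ROME or cheese-vector code, just an invented two-layer network where a single additive vector in the hidden layer redirects the output:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))

def forward(x, steer=None):
    h = np.maximum(0, x @ W1)  # hidden activation after ReLU
    if steer is not None:
        h = h + steer          # the single additive intervention
    return h @ W2

x = rng.normal(size=4)
base = forward(x)

# Solve for a vector that pushes the output to a chosen target, here [1, 0]:
target = np.array([1.0, 0.0])
steer, *_ = np.linalg.lstsq(W2.T, target - base, rcond=None)

steered = forward(x, steer)  # now (approximately) equal to target
```

In the toy case the vector is found by solving a linear system; in real interpretability work the interesting part is that such vectors can be found from a model's own representations (e.g. by contrasting activations), then reused across inputs.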

All this and more is true, but still -- the vast majority of weights in larger models like GPT-4 have not been so reverse engineered. Doesn't that point to something fundamentally wrong about the gradient-descent-over-big-matrices-paradigm?

Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?

I propose that the deeply obvious answer, once you ask the question, is that they would not.

ML models form representations suitable to their domain. Image models build up a hierarchy of feature detectors moving from the simpler to the more complex -- line detectors, curve detectors, eye detectors, face detectors, and so on. But the space of language is larger than the space of images! We can discuss anything that exists, that might exist, that did exist, that could exist, and that could not exist. So no matter what form your predict-the-next-token language model takes, if it is trained over the entire corpus of the written word, the representations it forms will be pretty hard to understand, because they encode an entire understanding of the entire world.

So, I predict with high confidence that any ML model that can reach the perplexity levels of Transformers will also present great initial interpretive difficulty.

2: Inscrutability is in Ourselves and Not the Stars

Imagine an astronomer in the year 1600 who frequently refers to the "giant inscrutable movements" of the stars. He looks at the vast tables of detailed astronomical data emerging from Tycho Brahe's observatory, and remarks that we might need to seek an entirely different method of understanding the stars, because this does not look promising.

Such an astronomer might be very knowledgeable, and might know the deep truths by heart: our confusion about a thing is not a part of the thing; it is a feature of our minds and not of the world. Mystery and awe in our understanding of a thing are not in the thing itself. Nothing is inscrutable. Everything can be understood.

But in speaking of inscrutability to his students or to the less sophisticated, he would not be helping people towards knowledge. And of course, his advice would have been pointing directly away from the knowledge that helped Kepler discover his laws, because Tycho Brahe's plethora of tables directly enabled Kepler.


So, I think talking about "giant inscrutable matrices" promotes unclear thought, confusing map and territory.

The frequently accompanying, action-relevant claim -- that substantially easier-to-interpret alternatives exist -- is probably false and distracts people with fake options. That's my main thesis.

This could be very bad news, particularly if you're pessimistic about interpretability and have short timelines. Not having easier alternatives to giant inscrutable matrices doesn't actually make the matrices any easier to interpret. Enormous quantities of intellectual labor have been done, and enormous quantities remain to be done.

Still. In some of my favorite research shared on LW, some shard researchers speculate about the future, based on their experience of how easy it was to wipe knowledge from a maze-solving agent:

The cheese vector was easy to find. We immediately tried the dumbest, easiest first approach. We didn't even train the network ourselves, we just used one of Langosco et al.'s nets (the first and only net we looked at). If this is the amount of work it took to (mostly) patch out cheese-seeking, then perhaps a simple approach can patch out e.g. deception in sophisticated models.

Really simple stuff turned out to work for capabilities. It could also work out for interpretability.

35 comments

To the connectivists such as myself, your point 0 has seemed obvious for a while, so the EY/MIRI/LW anti-neural-net groupthink was/is a strong sign of faulty beliefs. And saying "oh, but EY et al. didn't really think neural nets wouldn't work, they just thought other paradigms would be safer" doesn't really help much if no other paradigms ever had a chance. Underlying much of the rationalist groupthink on AI safety is a set of correlated, incorrect anti-connectivist beliefs which undermine many of the standard conclusions.

(Eliezer did think neural nets wouldn't work; he explicitly said it on the Lex Fridman podcast.)

Edit @request from gwern: at 11:30 in the podcast, Eliezer says,

back in the day I went around saying like, I do not think that just stacking more layers of transformers is going to get you all the way to AGI, and I think that GPT-4 is past where I thought this paradigm is going to take us, and I, you know, you want to notice when that happens, you want to say like "oops, I guess I was incorrect about what happens if you keep on stacking more transformer layers"

and then Fridman asks him whether he'd say that his intuition was wrong, and Eliezer says yes.

I think you should quote the bit you think shows that. Which 'neural nets wouldn't work', exactly? I realize that everyone now thinks there's only one kind (the kind which works and which we have now), but there's not.

The Fridman transcript I skimmed was him being skeptical that deep learning, one of several different waves of connectionism, would go from early successes like AlphaGo all the way to AGI. That is consistent with what I had always understood him to believe: that connectionism could work someday, but that this would be bad because it would be unsafe (which I agreed with then and still agree with now). And to the extent that Eliezer says I was right to pick up on 'holy shit guys, this may be it, after three-quarters of a century of failure, this time really is different, it's Just Working™' while he was wrong, I don't think it's because I was specially immune to 'groupthink' or somehow escaped 'faulty beliefs'; it was because I was paying much closer attention to the DL literature and evaluating whether progress favored Cannell & Moravec or their critics, and cataloguing examples of the blessings of scale & evidence for the scaling hypothesis.

Yeah - of course the brain was always an example of a big neural net that worked, the question was how accessible that design is/was. The core of the crucial update for me - which I can't pinpoint precisely but I'd guess was somewhere between 2010 to 2014 - was the realization that GD with a few simple tricks really is a reasonable general approximation of bayesian inference, and a perfectly capable global optimizer in the overcomplete regime (the latter seems obvious now in retrospect, but apparently wasn't so obvious when nets were small: it was just sort of known/assumed that local optima were a major issue). Much else just falls out from that. The 'groupthink' I was referring to is that some here are still deriving much of their core AI/ML beliefs from reading the old sequences/lore rather than the DL literature and derivations.

Fair. Ok, I edited the original post, see there for the quote.

One reason I felt comfortable just stating the point is that Eliezer himself framed it as a wrong prediction. (And he actually refers to you as having been more correct, though I don't have the timestamp.)

Mm, I fear this argument is self-contradictory to a significant extent.

Interpretability is premised on the idea that it's possible to reduce a "connectionist" system to a more abstract, formalized representation.

  • Consider the successful interpretation of a curve detector. Once we know what function a bunch of neurons implements, we can tear out these neurons, implement that function in a high-level programming language, then splice that high-level implementation into the NN in place of the initial bunch-of-neurons. If the interpretation is correct, the NN's behavior won't change.
  • Scaling this trick up, the "full" interpretation of a NN should allow us to re-implement the entire network in high-level-programming manner; I think Neel Nanda even did something similar here.
  • Redwood Research's outline here agrees with this view. An "interpretation" of a NN is basically a transform of the initial weights-and-biases computational graph into a second, "simpler" higher-level computational graph.
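The splice test in the first bullet can be sketched in a few lines -- the weights and the abs-function reading of them here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Suppose interpretability work concludes that this learned two-neuron block
# computes abs(x): one neuron fires on x, the other on -x, and they are summed.
W_in = np.array([[1.0, -1.0]])   # weights "found" inside the larger network
W_out = np.array([[1.0], [1.0]])

def learned_block(x):
    return np.maximum(0, x @ W_in) @ W_out

def high_level_interpretation(x):
    return np.abs(x)             # the hand-coded splice

# If the interpretation is correct, swapping the block changes nothing:
xs = rng.normal(size=(100, 1))
match = np.allclose(learned_block(xs), high_level_interpretation(xs))
```

The scaled-up version of this check -- replacing components of a real network with their claimed high-level equivalents and verifying behavior is preserved -- is exactly what makes an interpretation falsifiable.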

So inasmuch as interpretability is possible, it implies the ability to transform a connectionist system into what's basically GOFAI. And if we grant that, it's entirely coherent to wish that we've followed the tech-development path where we're directly figuring out how to build advanced AI in a high-level manner. Instead, we're going about it in a round-about fashion: we're generating incomprehensible black-boxes whose internals we're then trying to translate into high-level representations.

The place where "connectionism" does outperform higher-level approaches is "blind development". If we don't know how the algorithm we want is supposed to work, only know what it should do, then such approaches may indeed be absolutely necessary. And in that view, it should be clear why evolution never stumbled on anything else: it has no idea what it's doing, so of course it'd favour algorithms that work even if you have no idea what you're doing.

(Though I don't think brains' uniformity is entirely downstream even of that. Computers can also be viewed as "a massive number of simple, uniform units connected to each other", the units being transistors. Which suggests that it's more of a requirement for general-purpose computational systems, imposed by the constraints of our reductionist universe. Not a constraint on the architecture of specifically intelligent systems.

Perhaps all good computational substrates have this architecture; but the software that's implemented on these substrates doesn't have to. And in the case of AI, the need for connectionism is already fulfilled by transistors, so AI should in principle be implementable via high-level programming that's understandable by humans. There's no absolute need for a second layer of connectionism in the form of NNs.)

I agree with your second point though, that complexity is largely the feature of problem domains, not agents navigating them. Agents' policy functions are likely very simple, compared to agents' world-models.

Generally speaking, the fact that some already-learned ML algorithm can be transformed from some form into alternate forms does not imply that it could easily -- or ever at all, in the lifetime of the universe -- be discovered in that second form.

For instance, it looks like any ReLU-based neural network can be transformed into a decision tree, albeit potentially an extremely large one, while preserving exact functional equivalence. Nevertheless, for many (though not all) substantial learning tasks, it seems likely you will wait until the continents collide and the sun cools before you are able to find that algorithm with decision-tree-specific learning algorithms.
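A minimal sketch of that equivalence, for a tiny invented ReLU net: on each region defined by which hidden units fire, the network is affine, so branching on the activation pattern and applying the matching affine rule -- a "decision tree" with up to 2^h leaves for h hidden units -- reproduces it exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=1)

def relu_net(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def tree_equivalent(x):
    # "Decision tree" form: branch on the sign of each pre-activation,
    # then apply the affine function for that leaf's activation pattern.
    pattern = (x @ W1 + b1 > 0)            # which hidden units fire
    mask = np.diag(pattern.astype(float))
    W_eff = W1 @ mask @ W2                 # affine rule for this leaf
    b_eff = b1 @ mask @ W2 + b2
    return x @ W_eff + b_eff

x = rng.normal(size=2)
same = np.allclose(relu_net(x), tree_equivalent(x))
```

With 3 hidden units the tree has at most 8 leaves; with the millions of units in a real network, the leaf count is astronomical -- which is the point about discoverability.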

So, let's for now assume that a completed interpretability research program allows you to transform a NN into a much more interpretable system with few moving parts -- basically GOFAI, as you say. This in fact implies basically nothing about whether you could find such systems without neural networks (let alone whether the idea that "this just-find-GOFAI research program is doomed" would be self-contradictory). You say it's coherent to wish for this tech tree, and it is coherent to so wish -- but do you think this is a promising research program?

The place where "connectionism" does outperform higher-level approaches is "blind development".

But -- isn't intelligence basically the ability to enter a domain blind and adapt to it? That's what it took for men to reach the moon, octopuses to prank aquarium staff, ravens to solve weird puzzles. I'm not sure what you're getting at here: you seem to be saying that connectionism is inferior to other solutions except where intelligence -- i.e., quickly adapting to a domain we might not know much about, where we don't know what kind of algorithms will work internally, and where we don't have adequate algorithms handed down from prior research or instinct -- is required. Which is... what I'm saying?

I'm not sure what predictions you're making that are different from mine, other than maybe "a research program that skips NNs and just tries to directly build the representations that they build up, without looking at NNs, has reasonable chances of success." Which doesn't seem like one you'd actually want to make.

I'm not sure what predictions you're making that are different from mine, other than maybe "a research program that skips NNs and just tries to directly build the representations that they build up, without looking at NNs, has reasonable chances of success." Which doesn't seem like one you'd actually want to make.

I think I would, actually, want to make this prediction. The problem is that I'd want to make it primarily in the counterfactual world where the NN approach had been abandoned and/or declared off-limits, since in any world where both approaches exist, I would also expect the connectionist approach to reach dividends faster (as has occurred in e.g. our own world). This doesn't make my position inconsistent with the notion that a GOFAI-style approach is workable; it merely requires that I think such an approach requires more mastery and is therefore slower (which, for what it's worth, seems true almost by definition)!

I do, however, think that "building the high-level representations", despite being slower, would not be astronomically slower than using SGD on connectionist models (which is what you seem to be gesturing at, with claims like "for many (though not all) substantial learning tasks, it seems likely you will wait until the continents collide and the sun cools before you are able to find that algorithm"). To be fair, you did specify that you were talking about "decision-tree specific algorithms" there, which I agree are probably too crude to learn anything complex in a reasonable amount of time; but I don't think the sentiment you express there carries over to all manner of GOFAI-style approaches (which is the strength of claim you would actually need for [what looks to me like] your overall argument to carry through).

(A decision-tree based approach would likely also take "until the continents collide and the sun cools" to build a working chess evaluation function from scratch, for example, but humans coded by hand what were, essentially, decision trees for evaluating positions, and achieved reasonable success until that approach was obsoleted by neural network-based evaluation functions. This seems like it reasonably strongly suggests that whatever the humans were doing before they started using NNs was not a completely terrible way to code high-level feature-based descriptions of chess positions, and that—with further work—those representations would have continued to be refined. But of course, that didn't happen, because neural networks came along and replaced the old evaluation functions; hence, again, why I'd want primarily to predict GOFAI-style success in the counterfactual world where the connectionists had for some reason stopped doing that.)

Mm, I think there's some disconnect in what we mean by an "interpretation" of a ML model. The "interpretation" of a neural network is not just some computational graph that's behaviorally equivalent to the neural network. It's the actual algorithm found by the SGD and implemented on the weights-and-biases of the neural network. Again, see Neel Nanda's work here. The "interpretation" recovers the actual computations the neural network's forward pass is doing.

You seem to say that there's some special class of "connectionist" algorithms that are qualitatively and mechanically different from higher-level algorithms. Interpretability is more or less premised on the idea that it is not so; that artificial neurons are just the computational substrate on which the SGD is invited to write programs. And interpretability is hard because we, essentially, have to recover the high-level structure of SGD-written programs given just (the equivalent of) their machine code. Not because we're trying to find a merely-equivalent algorithm.

I think this also addresses your concern that higher-level design is not possible to find in a timely manner. SGD manages it, so the amount of computation needed is upper-bounded by whatever goes into a given training run. And the SGD is blind, so yes, I imagine deliberative design — given theoretical understanding of the domain — would be much faster than whatever the SGD is doing. (Well, maybe not faster in real-time, given that human brains work slower than modern processors. But in a shorter number of computation-steps.)

You say it's coherent to wish for this tech tree, and it is coherent to so wish -- but do you think this is a promising research program?

Basically, yes.

I feel like this is missing the point. When I heard neural nets described as large hard-to-interpret matrices, it clicked with me. Why? Because I'd spent years studying the brain. It's not their weights that make brains more interpretable as systems. Heck no. All the points made about how biological networks are bad for interpretability are quite on point. I think brains are more interpretable at a system level for two reasons:

  1. Modularity: Look at the substantia nigra in the brain. That's a module. It performs a critical, specific, hardwired task. The cortex is somewhat rewirable, but not very. And these modules are mostly located in the same places and perform mostly the same operations between people. Even between different species!
  2. Limited rewirability: The patterns of function in a brain get much more locked in than in a neural network. If you were trying to hide your fear from a neuroscientist who had the ability to monitor your brain activity with implanted electrodes, could you do it? I think you couldn't, even with practice. You could train yourself to feel less fear, but you couldn't experience the same amount of fear while hiding the activity in a different part of your brain -- even if you also got to monitor your own brain activity and practice. I just don't think the brain is rewirable enough to manage that. Some stuff you could move or obfuscate, a bit, but it would be really hard and only partially successful.

So when people like Conjecture talk about building Cognitive Emulations that have highly verifiable functionality... I think that's an awesome idea! I think that will work! (If given enough time and resources, which I fear it won't be.) And I think they should use big matrices of numbers to build them, but those matrices should be connected up to each other in specific hardcoded ways, with deliberate information bottlenecks. I do not endorse spiking neural nets as inherently more interpretable. I endorse highly modular systems with deliberate restrictions on their ability to change their wiring.

I agree that modularity / information bottlenecks are desirable and compatible with connectionism, and a promising way of making a more interpretable AI. Separating agents out into world models and value functions, for instance, which don't share gradients, will likely result in more easily interpretable systems than ones that share gradients through both. Totally support such efforts!
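A minimal numpy sketch of that gradient-separation idea (a hypothetical toy setup; in a real framework this is a stop-gradient / detach): the value head trains on the world model's features, but its loss never updates the world model's weights:

```python
import numpy as np

rng = np.random.default_rng(4)

# World model: features = X @ Wm, trained only on its own prediction loss.
# Value head:  values  = features @ Wv, trained only on the value loss --
# the value gradient is never propagated back into Wm, so the two modules'
# roles can't blur together.
Wm = rng.normal(scale=0.1, size=(4, 3))
Wv = rng.normal(scale=0.1, size=(3, 1))

X = rng.normal(size=(32, 4))
state_targets = X @ rng.normal(size=(4, 3))  # toy world-model targets
value_targets = rng.normal(size=(32, 1))     # toy value targets

for _ in range(300):
    feats = X @ Wm
    Wm -= 0.05 * X.T @ (feats - state_targets) / len(X)      # world-model step

    feats = X @ Wm              # "detached": treated as constants below
    vals = feats @ Wv
    Wv -= 0.05 * feats.T @ (vals - value_targets) / len(X)   # value step only
```

Because the value loss touches only Wv, anything you learn about the world model's features stays valid no matter how the value head is retrained -- the interpretability payoff of the bottleneck.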

I do think that there are likely limits in how modular you can make them, though.

There was an evolutionary point I thought about bringing up above, which seems like further evidence for the limits of modularity / something I'll fuzzily point to as "strong connectionism":

It seems like the smarter animals generally have more of the undifferentiated parts of the brain, which can rewire in arbitrary ways. The substantia nigra, right, has been preserved for over half a billion years, and most of the animals which had it were pretty dumb -- so it seems unlikely to be key to intelligence, at least higher intelligence. But the animals we tend to think of as smarter have a bunch of things like the cerebral cortex, which is super undifferentiated and very rewireable. My vague recollection is that smarter birds have a similarly undifferentiated piece of their brain, which is not cortex? And I predict, without having checked, that things like the smarter octopuses will also have bigger non-differentiated zones (even if implemented differently) than stupider animals near them on the evolutionary tree.

I'd be curious if you agree with this? But this seems like further evidence that connectionism is just how intelligence gets implemented -- if you want something flexible, which can respond to a whole lot of situations (i.e., intelligence), you're going to need something super undifferentiated, with some kind of limit on modularity.

I quite disagree. I think that people imagine the cortex to be far more undifferentiated and reconfigurable than it actually is. What humans did to get extra intelligent and able to accumulate cultural knowledge was not just expand their cortex generally. In particular, we expanded our prefrontal cortex, and the lateral areas closely connected to it that are involved in language and abstract reasoning / math. And we developed larger neurons in our prefrontal cortex with more synapses, for better pooling of activity.

Impossible thought experiment: If you took a human infant and removed half their prefrontal cortex, and gave them the same amount of new cortical area in their visual cortex... then you don't get a human who rewires their cortex such that the same proportion of cortex is devoted to prefrontal cortex (executive function, planning) and the same proportion devoted to vision. What you get is a subtle shift. The remaining prefrontal cortex will expand its influence to nearby regions, and maybe get about 2% more area than it otherwise would have had, but still be woefully inadequate. The visual cortex will not shrink much. You'll have a human that's terrible at planning and great at visual perception of complex scenes.

When neuroscientists talk about how impressively reconfigurable the cortex is, you have to take into consideration that the cortex is impressively reconfigurable given that it's made up of neurons that are mostly locked into place before birth, with very limited ability to change the location of their input zones or output zones.

For example: imagine a neuron is the size of a house. This particular neuron is located in San Diego. Its dendrites stretch throughout the neighborhoods of San Diego. The dendrites have flexibility, in that they can choose which houses in the neighborhood to connect to, but can't move more than a few houses in any direction.

Meanwhile, this neuron's axon travels across the entire United States to end up in New York city. It ends in a particular neighborhood in Manhattan. Again, the axon can move a few buildings in either direction, but it can't switch all the way from the northern end of Manhattan to the southern end. Much less choose to go to Washington DC instead. There is just no way that a neuron with its dendrites in Washington DC is going to end up directly connected to the San Diego neuron.

When someone loses a chunk of their cortex due to a stroke, they are often able to partially recover. The recovery is due largely to very local rewiring of the surviving, most-closely-functionally-related neurons on the very edge of the damaged area. Imagine placing a dime on top of a penny, where the dime is the area lost. The regained function mostly depends on the ring of penny sticking out around the dime. Each neuron in that border region will shift its inputs and outputs more than neurons normally would, traveling maybe ten houses over instead of two. Still not a big change!

And no new neurons are added to most of the brain throughout life. Memory and olfactory systems do a bit of neurogenesis (but not making long range changes, just short range changes). For the rest of the brain, it's just these very subtle rewiring changes, or changes to the strengths of the connections that do all of the learning. Neurons can be deleted, but not added. That's a huge restriction that substantially reduces the ability of the network to change the way it works. Especially given that the original long-range wiring laid down in fetal development was almost entirely controlled by a small subset of our genetic code. So you don't get to start with a random configuration, you start with a locked-in hardwired configuration. This is why we all have a visual cortex in the back of our heads. If the long-range wiring wasn't hardcoded, some people would end up with it someplace else just by chance.

I don't think the fact that neuronal wiring is mostly fixed provides much evidence that the cortex is not reconfigurable in the relevant sense. After all, neural networks have completely fixed wiring and can only change the strength of their connections, but this is enough to support great variability in function.

I have a response, but my response involves diagrams of the "downstream influence" of neurons in various brain circuits vs parameters in transformers. So, I'm working on a post about it. Sorry for the delay.

That's a very interesting observation. As far as I understand as well, deep neural networks have completely unlimited rewirability - a particular "function" can exist anywhere in the network, in multiple places, or spread out between and within layers. It can be duplicated in multiple places. And if you retrain that same network, it will then be found in another place in another form. It makes it seem like you need something like a CNN to be able to successfully identify functional groups within another model, if it's even possible.

The frequently accompanying, action-relevant claim -- that substantially easier-to-interpret alternatives exist -- is probably false and distracts people with fake options. That's my main thesis.

I agree with this claim (anything inherently interpretable in the conventional sense seems totally doomed). I do want to push back on the implicit vibe of "these models are hard to interpret because of the domain, not because of the structure," though - interpretability is really fucking hard! It's possible, but these models are weird and cursed and rife with bullshit like superposition; it's hard to figure out how to break them down into interpretable units; they're full of illusions and confused, janky representations; etc.

I don't really have a coherent and rigorous argument here, I just want to push back on the vibe I got from your "interpretability has many wins" section - it's really hard!

Yeah, you're just right about vibes.

I was trying to give "possible but hard" vibes, and the end result just tilts too far one way and doesn't speak enough about concrete difficulties.

First of all, I strongly agree that intelligence requires (or is exponentially easier to develop as) connectionist systems. However, I think that while big, inscrutable matrices may be unavoidable, there is plenty of room to make models more interpretable at an architectural level.

Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?

I have long thought that Transformer models are actually too general-purpose for their own good. By that I mean that the operations they perform - all-to-all token comparisons for self-attention - are actually extreme overkill for what an LLM needs to do.

Sure, you can use this architecture for moving tokens around, building implicit parse trees and semantic maps, and a bunch of other things, but all these functions are jumbled together in the same operations and are really hard to tease out. Recurrent models with well-partitioned internal states and disentangled token operations could probably do more with less. Sure, you can build a computer in Conway's Game of Life (which is Turing-complete), but a von Neumann architecture would be much easier to work with.

Embedded within Transformer circuits, you can find implicit representations of world models, but you could do even better from an interpretability standpoint by making such maps explicit. Give an AI a mental scratchpad that it depends on for reasoning (DALL-E, Stable Diffusion, etc. sort of do this already, except that the mental scratchpad is the output of the model [an image] rather than an internal map of conceptual/planning space), and you can probe that directly to see what the AI is thinking about.
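One way to "probe directly" the kind of explicit map described above, sketched on synthetic data: fit a simple linear probe to each internal unit and check how much of a latent variable it explains. The activations and the latent "feature" here are fabricated for illustration; real probing would use actual model activations.

```python
import random

random.seed(0)

# Pretend these are hidden activations from some model; by construction,
# unit 0 linearly encodes a latent feature (say, "distance to goal").
latent = [random.uniform(-1, 1) for _ in range(200)]
activations = [[2.0 * z + random.gauss(0, 0.05),  # unit 0: encodes the feature
                random.gauss(0, 1.0)]             # unit 1: unrelated noise
               for z in latent]

def probe_r2(unit):
    # One-weight linear probe fit by least squares (no intercept, for brevity):
    # w = <a, z> / <a, a>, then report the fraction of variance explained.
    a = [row[unit] for row in activations]
    w = sum(ai * zi for ai, zi in zip(a, latent)) / sum(ai * ai for ai in a)
    residual = sum((zi - w * ai) ** 2 for ai, zi in zip(a, latent))
    total = sum(zi ** 2 for zi in latent)
    return 1 - residual / total

print(probe_r2(0))  # close to 1: unit 0 is directly readable
print(probe_r2(1))  # close to 0: unit 1 carries no signal
```

If the scratchpad is an explicit, designed structure, this kind of readout is trivial; the interpretability problem is precisely that in an end-to-end-trained soup you don't know which units (or combinations of units) to probe.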

Real brains tend to be highly modular, as Nathan Helm-Burger pointed out. The cortex maps out different spaces (visual, somatosensory, conceptual, etc.). The basal ganglia perform action selection and general information routing. The cerebellum fine-tunes top-down control signals. Various nuclei control global and local neuromodulation. And so on. I would argue that such modular constraints actually made it easier for evolution to explore the space of possible cognitive architectures.

I think the problem is that many things along the rough lines of what you're describing have been attempted in the past, and have turned out to work not-so-well (to be fair, they were attempted with older systems; I'm not even sure anyone has tried to make something like an expert system with a full-fledged transformer). The common wisdom the field seems to have derived from those experiences is "stochastic gradient descent knows best; just throw your data into a function and let RNJesus sort it out." Which is... not the best lesson, IMO. I think there might be value in the things you suggest, and they are intuitively appealing to me too. But as it turns out, when an industry is driven by racing to the goal rather than a genuine commitment to proper scientific understanding, it ends up taking all sorts of shortcuts. Who could have guessed.

  • Model-based RL is more naturally interpretable than end-to-end-trained systems because there’s a data structure called “world-model” and a data structure called “value function”, and maybe each of those data structures is individually inscrutable, but that’s still a step in the right direction compared to having them mixed together into just one data structure. For example, it’s central to this proposal.
  • More generally, we don’t really know much about what other kinds of hooks and modularity can be put into a realistic AI. You say “probably not possible” but I don’t think the “probably” is warranted. Evolution wasn’t going for human-interpretability, so we just don’t know either way. I would have said “We should be open to the possibility that giant inscrutable matrices are the least-bad of all possible worlds” or something.
  • If Architecture A can represent the same capabilities as Architecture B with fewer unlabeled nodes (and maybe a richer space of relationships between nodes), then that’s a step in the right direction.
  • I think you’re saying that “asynchronous” neural networks (like in biology) are more inscrutable than “synchronous” matrix multiplication, but I don’t think that claim is really based on anything except your intuitions, and your intuitions are biased by the fact that you’ve never tried to interpret an “asynchronous” neural network - which in turn is closely tied to the fact that nobody knows how to program one that works. Actually, my hunch is that the asynchronicity is an implementation detail that could be easily abstracted away.
  • If we think of interpretability as a UI into the trained model, then the problem is really to simultaneously co-design a learning algorithm & interpretability approach that work together and dynamically scale up to sufficient AI intelligence. I think you would describe success at that design problem as “Ha! The inscrutable matrix approach worked after all!”, and that Eliezer would describe success at that same design problem as “Ha! We figured out a way to build AI without giant inscrutable matrices!” (The matrices are giant but not inscrutable.)
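A toy sketch of the factoring described in the first bullet, with the world-model and value function kept as separate, individually inspectable data structures rather than fused into one soup. All names here are hypothetical illustrations, not any real library's API.

```python
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    # (state, action) -> predicted next state; tabular for the sketch
    transitions: dict = field(default_factory=dict)

    def predict(self, state, action):
        return self.transitions.get((state, action), state)

@dataclass
class ValueFunction:
    values: dict = field(default_factory=dict)

    def value(self, state):
        return self.values.get(state, 0.0)

@dataclass
class Agent:
    world_model: WorldModel
    value_function: ValueFunction

    def choose(self, state, actions):
        # Pick the action whose *predicted* outcome the value function rates best.
        return max(actions,
                   key=lambda a: self.value_function.value(
                       self.world_model.predict(state, a)))

wm = WorldModel({("start", "left"): "cliff", ("start", "right"): "goal"})
vf = ValueFunction({"cliff": -10.0, "goal": +10.0})
agent = Agent(wm, vf)
print(agent.choose("start", ["left", "right"]))  # "right"
```

Even if `wm` and `vf` were each an inscrutable learned model rather than a table, the split still lets you ask "what does it predict?" and "what does it want?" as separate questions - which is the step in the right direction the bullet describes.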

On the downside, if matrix multiplication is the absolute simplest, most interpretable AI blueprint there is -- if it's surrounded on all sides by asynchronous non-digital biological-style designs, or other equally weird alien architectures -- that sounds like pretty bad news. Instead of lucking out and getting a simpler, more interpretable architecture in future generations of more-powerful AI, we would be more likely to switch to something much more inscrutable, perhaps at just the wrong time (i.e., maybe matrix-based AIs help design more-powerful successor systems that are less interpretable and more alien to us).

I would genuinely be surprised if there weren't ways to at least simplify parts of it.

Like, a mind is able to do multiplication or addition. But any and all algorithms for those kinds of operations that we regularly use are much simpler and more reliable than whatever a GPT does for its own primitive math capabilities (which is probably a lot more like a lookup table of known operations). You'd think you could disentangle some functionality into its own modules and make it more reliable that way.
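A toy contrast between the two regimes this comment describes: a memorized table of addition facts works only in-distribution, while the general algorithm extrapolates. This is only an illustration of the distinction, not a claim about how GPT models actually implement arithmetic.

```python
# "Training set": every addition fact with both operands below 100.
lookup_table = {(a, b): a + b for a in range(100) for b in range(100)}

def add_by_lookup(a, b):
    # Returns None when the fact was never memorized.
    return lookup_table.get((a, b))

def add_by_algorithm(a, b):
    # The general procedure (trivially, here).
    return a + b

print(add_by_lookup(17, 25))      # 42 -- in distribution, works
print(add_by_lookup(123, 456))    # None -- out of distribution, fails
print(add_by_algorithm(123, 456)) # 579 -- the algorithm generalizes
```

Routing arithmetic to a dedicated, verified module (the second function) rather than a learned approximation is exactly the kind of disentangling the comment suggests.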

I like this post overall. I'd guess that connectionism is the real-world way to do things, but I don't think I'm quite as confident that there aren't alternatives. I basically agree with the main points, though.

to mind control a [maze-solving agent] to pursue whatever you want with just a single activation

I want to flag that this isn't quite what we demonstrated. It's overselling our results a bit. Specifically, we can't get the maze-solving agent to go "wherever we want" via the demonstrated method. Rather, in many mazes, I'd estimate we can retarget the agent to end up in about half of the maze locations. Probably a slight edit would fix this, like changing "whatever you want" to "a huge range of goals."

Yup, agreed: all intelligence must be connectionist. But I think you can go further than the empirical claim, and argue that the only way to get ahead of "vanilla" connectionism is with things that are, themselves, also connectionist. In other words, the interesting paradigms are the subsets of connectionism. After all, category theory is connectionist too.

  1. Please enable this for promotion to the front page
  2. Is this proof that intelligence must be connectionist? Of course not.

Well, it seems to be strong evidence that intelligence is connectionist: "absence of evidence is evidence of absence."

To be strong evidence for intelligence being connectionist, it would also have to be the case that the non-connectionist model of intelligence strongly expected evolved intelligence to look different.

Current NN matrices are dense and continuously weighted. A significant part of the difficulty of interpretability is that they have all-to-all connections; it is difficult to verify that one activation does or does not affect another activation.

However, we can quantize the weights to 3 bits, and then we can probably melt the whole thing into pure combinational logic. While I am not entirely confident that this form is strictly better from an interpretability perspective, it is differently difficult.

"Giant inscrutable matrices" are probably not the final form of current NNs, we can potentially turn them into different and nicer form.

While I am optimistic about simple algorithmic changes improving the interpretability situation (the difference between L1 and L2 regularization seems like a big example of hope here, for example), I think the difficulty floor is determined by the complexity of the underlying subject matter that needs to be encoded, and for LLMs / natural language that's going to be very complex. (And if you use an architecture that can't support things that are as complex as the underlying subject matter, the optimal model for that architecture will correspondingly have high loss.)
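The L1-vs-L2 difference mentioned above, in one update step: the proximal update for an L1 penalty (soft-thresholding) sets small weights to exactly zero, yielding sparse and hence more readable weights, while the L2 update merely scales everything down. A minimal sketch:

```python
def l1_prox(w, lam):
    # Soft-thresholding: shrink toward zero; snap to zero inside [-lam, lam].
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_shrink(w, lam):
    # Ridge-style multiplicative shrinkage: smaller, but never exactly zero.
    return w / (1 + lam)

weights = [0.9, 0.05, -0.03, -0.7, 0.01]
lam = 0.1

l1 = [l1_prox(w, lam) for w in weights]
l2 = [l2_shrink(w, lam) for w in weights]

print("L1:", l1)  # small weights become exactly 0.0
print("L2:", l2)  # everything stays nonzero, just smaller
```

Exact zeros matter for interpretability: a pruned connection can simply be ignored, whereas a merely-small one still has to be checked.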

Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?

If we're literally considering a universal quantifier over the set of all possible ML models, then I'd think it extremely likely that there does exist a simpler model with perplexity no worse than GPT-4. I'm confused as to how you (seem to have) arrived at the opposite conclusion.

Imagine an astronomer in the year 1600, who frequently refers to the "giant inscrutable movements" of the stars. [...]

I think the analogy to {building intelligent systems} is unclear/weak. There seem to be many disanalogies:

In the astronomy case, we have

  • A phenomenon (the stars and their motions) that we cannot affect

  • That phenomenon is describable with simple laws.

  • The "piles of numbers" are detailed measurements of that phenomenon.

  • It is useful to take more measurements, doing so is helpful for finding the simple laws.

In the {building AIs} case, we have

  • A phenomenon (intelligence) which we can choose to implement in different ways, and which we want to harness for some purpose.

  • That phenomenon is probably not entirely implementable with simple programs. (As you point out yourself.)

  • The "piles of numbers" are the means by which some people are implementing the thing; as opposed to measurements of the one and only way the thing is implemented.

  • And so: implementing AIs as piles-of-numbers is not clearly helpful (and might be harmful) to finding better/simpler alternatives.

So, I predict with high confidence that any ML model that can reach the perplexity levels of Transformers will also present great initial interpretive difficulty.

I do agree that any realistic ML model that achieves GPT-4-level perplexity would probably have to have at least some parts that are hard to interpret. However, I believe it should (in principle) be possible to build ML systems that have highly interpretable policies (or analogues thereof), despite having hard-to-interpret models.

I think if our goal was to build understandable/controllable/safe AI, then it would make sense to factor the AI's mind into various "parts", such as e.g. a policy, a set of models, and a (set of sub-)goals.

In contrast, implementing AIs as giant Transformers precludes making architectural distinctions between any such "parts"; the whole AI is in a(n architectural) sense one big uniform soup. Giant Transformers don't even have the level of modularity of biological brains designed by evolution.

Consequently, I still think the "giant inscrutable tensors"-approach to building AIs is terrible from a safety perspective, not only in an absolute sense, but also in a relative sense (relative to saner approaches that I can see).

Tentative GPT-4 summary. This is part of an experiment.
Up/Downvote "Overall" if the summary is useful/harmful.
Up/Downvote "Agreement" if the summary is correct/wrong.
If you found it harmful, please let me know why.
(OpenAI doesn't use customers' data anymore for training, and this API account previously opted out of data retention)

The article argues that deep learning models based on giant stochastic gradient descent (SGD)-trained matrices might be the most interpretable approach to general intelligence, given what we currently know. The author claims that seeking more easily interpretable alternatives could be misguided and distract us from practical efforts towards AI safety.

Main claims:
1. Generally intelligent systems might inherently require a connectionist approach.
2. Among known connectionist systems, synchronous matrix operations are the most interpretable.
3. The hard-to-interpret part of matrices comes from the domain they train on and not their structure.
4. Inscrutability is a feature of our minds and not the world, so talking about "giant inscrutable matrices" promotes unclear thought.

Conclusions:
1. Deep learning models' inscrutability may stem from their complex training domain, rather than their structure.
2. Synchronous matrix operations appear to be the easiest-to-understand, known approach for building generally intelligent systems.
3. We should not be seeking alternative, easier-to-interpret paradigms that might distract us from practical AI safety efforts.

Strengths:
1. The author provides convincing examples from the real world, such as the evolution of brain sizes in various species, to argue that connectionism is a plausible route to general intelligence.
2. The argument that synchronous matrix operations are more interpretable than their alternatives, such as biologically inspired approaches, is well-supported.
3. The discussion on inscrutability emphasizes that our understanding of a phenomenon should focus on its underlying mechanisms, rather than being misled by language and intuition.

Weaknesses:
1. Some arguments, such as the claim that ML models' inscrutability is due to their training domain and not their structure, are less certain and based on the assumption that the phenomenon will extend to other models.
2. The arguments presented are ultimately speculative and not based on proven theories.

Interactions with other content:
1. The content of this article may interact with concepts in AI interpretability, such as feature importance and attribution, methods which aim to improve our understanding of AI models.

Factual mistakes:
I am not aware of factual mistakes in my summary.

Missing arguments:
1. The article does not address how any improvements in interpretability would affect AI alignment efforts or the risks associated with AGI.
2. The article does not explore other potential interpretability approaches that could complement or augment the synchronous matrix operations paradigm.

A couple of points. Your bit about connectionism IMO ignores the fact that biological systems have their own constraints. If you looked at animals and guessed that the most energy-efficient, simplest-to-maintain way to move on most terrains is probably legs, you'd be wrong; it's wheels or treads. But legs are what organic life, which needs to keep all its moving parts connected, could come up with.

A similar problem might be at work with connectionism. We designed MLPs in the first place by drawing inspiration from neurons, so we were kinda biased. As a general rule, we know that, if made arbitrarily large, they're essentially universal function approximators. Any universal approximator ought to be potentially equivalent; you could use a crazily high-dimensional spline or what have you - the point is you have a domain, a codomain, and a bunch of parameters you can tune to make the function fit your data as closely as possible. What determines the choice is then the practicality of both computation and parameter fitting. The connected systems we use in ML happen to be very convenient for this, but I'd be fairly surprised if that were the same reason biology uses them - it would mean our brain does gradient descent and backpropagation too. And for ML, too, there might be better choices that we just haven't discovered yet.

That said, I don't think connected systems are necessarily a bad paradigm. I agree with you that the complexity is likely due to the fact that we're fitting a really, really complex function in a space of dimensionality so high the human mind can't even begin to visualize it, so no wonder the parameters are hard to figure out. What might be possible, though, is to design a connected system with dedicated, distinct functional areas that make it more "modular" in its overall structure (either by prearranging those, or by rearranging blocks between stages of training). That could make the process more complex or the final result less efficient, the same way a program compiled for debugging isn't as performant as one optimized for speed. But the program compiled for debugging you can look inside, and that's kind of important and useful. The essence of the complaint here is, IMO, that as far as these trade-offs go, researchers seem all-in on the performance aspect and aren't spending nearly enough effort debugging their "code." The alternative to giant inscrutable matrices might just be a bunch of smaller, more scrutable ones - but hey, that would be progress!

We could make matrices more intelligible by giving their dimensions meanings that are not random but correspond to dimensions known in advance, like [level of dog, level of cat, level of human, level of ape].
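A minimal sketch of that suggestion: if each embedding dimension has a pre-assigned meaning, a vector can be read off directly instead of being an arbitrary rotated basis. The dimension names and values here are invented for illustration.

```python
# Pre-assigned meanings for each embedding dimension.
DIMENSIONS = ["dog", "cat", "human", "ape"]

def readable(vector):
    # Pair each coordinate with its fixed, human-chosen label.
    return {name: value for name, value in zip(DIMENSIONS, vector)}

embedding = [0.9, 0.1, 0.0, 0.2]  # hypothetical activation for "puppy"
print(readable(embedding))         # {'dog': 0.9, 'cat': 0.1, 'human': 0.0, 'ape': 0.2}
```

The open question, of course, is whether training can be constrained to actually keep meanings pinned to their assigned axes, rather than smearing features across dimensions in superposition.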