Cross-posted from New Savanna.

I don’t mean “understand” in any deep philosophical sense. I mean only a rough and ready sense of the word. We understand how toasters work, automobiles, moon rockets, digital computers, and so forth. We know how to design and construct these things, how to diagnose problems, how to maintain and repair them. Not perfectly to be sure, but well enough to use these devices to get things done.

LLMs, however, are said to be opaque. We don’t know how they work. We feed them prompts and they produce output, but how the model gets from the prompt to the output is mysterious. There are people working on mechanistic interpretability, trying to understand the LLM as though it were a machine, or at least a computer program of the ordinary kind, where we know, more or less, how it operates on data to produce output. But what would it mean to understand the operational characteristics of 175 billion parameters, as in the case of GPT-3?

It means, I suppose, understanding how those parameters mediate between the input, a prompt, and the output, whatever “follows from” that prompt. At the lowest level we are told that LLMs are prediction machines: the output string is simply a continuation of the input string. I suppose that, technically, that’s true. But it’s not very helpful, as I’ve argued at some length.
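To make the prediction-machine picture concrete, here is a toy sketch in plain Python – my own illustration, not anything from an actual LLM. A bigram count table stands in for the billions of parameters, but the loop is the same in principle: given the tokens so far, produce a probability distribution over the next token, sample from it, append, repeat.

```python
import random

# A toy "language model": bigram counts stand in for the parameters.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each token follows each other token.
counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, {}).setdefault(nxt, 0)
    counts[prev][nxt] += 1

def next_token_distribution(prev):
    """Normalize counts into a probability distribution over next tokens,
    playing the role that the softmax layer plays in a real LLM."""
    following = counts[prev]
    total = sum(following.values())
    return {tok: n / total for tok, n in following.items()}

def continue_string(prompt, n_tokens, seed=0):
    """Autoregressive generation: the output is literally a continuation
    of the input string, one sampled token at a time."""
    rng = random.Random(seed)
    tokens = prompt.split()
    for _ in range(n_tokens):
        dist = next_token_distribution(tokens[-1])
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return " ".join(tokens)

print(continue_string("the", 5))
```

Everything interesting about a real model is hidden in how the distribution is computed; here it is a lookup table, there it is 175 billion parameters. That gap is exactly what makes the “it’s just prediction” answer unhelpful.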

Let’s set that aside.

What could we possibly want by way of understanding?

We’ve got three things: There is the underlying engine, let’s call it, which is a computer program like any other. It’s created by programmers working with some language or languages and is designed to achieve a certain purpose. In this case, it’s designed to create a language model over a corpus of texts and then to use that model in generating new chunks of language given an input prompt.

It's that model that’s problematic, that’s said to be opaque. We humans didn’t create that model; the engine did. And, in the case of GPT-3, that model has 175 billion parameters. More recent models have even more. There are also models with only millions of parameters, but even those smaller models are huge by any ordinary standard.

But, here’s the thing: how can we understand how that opaque model operates unless we understand what it’s trying to do? Sure, we can pop the hood and take a look. We see a bunch of gizmos, widgets, framblasts, and other things, but so what? They’re just whirling around, engaging with one another in intricate patterns. What are they trying to do? We know what car engines are supposed to do; they supply power to the wheels (and the wheels move the car).

Well, LLMs are supposed to produce language – and computer code and math as well, but let’s stick with ordinary language for the purposes of these notes. But, alas, the mechanisms of language are themselves opaque. The relationship between car wheels and car motion is transparent. The relationship between nouns and verbs and adjectives and prepositions and sentences and, you know, knowledge, understanding, entertainment, the things language is for, those relationships are not so obvious.

Of course, linguists have been working on language mechanisms for years. But it’s not at all clear what the field has come up with. There are major disagreements on how one is to understand syntax. And when we move beyond sentences to discourse of various kinds, we know even less about mechanisms.

I figure that there’s almost zero chance that we’re going to find those mechanisms by mucking around in LLMs. Yes, I know that LLMs are quite different from the human brain and mind. But, the fact is, LLMs do a very convincing imitation of human language. Given the complexity of language, they wouldn’t be able to do that if they hadn’t absorbed some (perhaps) useful approximation to human mechanisms. I’m willing to proceed on the default understanding that, whatever the model is doing, it has some resemblance to what humans do. If I make that assumption, that gives me some tools to think with. Without it, I got nothing.

Still, a grammar is a large and complex thing. The Cambridge Grammar of the English Language is 1860 pages long, and it is merely a descriptive grammar and not meant to account for the underlying mechanisms, however they might best be characterized. Is that what we want from a mechanistic understanding of an LLM? And that only gets us sentences. What about paragraphs, stories, histories, repair manuals, accounts of exotic astronomical objects, and who knows what else? Do we expect students of mechanistic interpretability to eventually give us detailed accounts of such wonders?

Understanding stories

What would it mean to understand how ChatGPT tells stories?

This morning I logged onto ChatGPT, not GPT Plus, just plain old ChatGPT, and prompted it with one word: “Story.” What do you think it did? Right, it told me a story. The story began with this sentence: “Once upon a time, in a quaint little village nestled at the foot of a towering mountain range, there lived a young girl named Lily.” I don’t think it’s very useful to think of that sentence as the natural continuation of a string beginning with the word, “story.” Yes, I know, I’m not prompting the “naked” underlying LLM. ChatGPT has been prompt-engineered and RLHFed (RLHF: reinforcement learning from human feedback) to death to be a congenial conversational partner. But that doesn’t change the basic situation.

In this case, the situation is that, in some sense, ChatGPT “knows” what a story is and knows how to tell one. By this time I’ve prompted it to produce hundreds, though probably not yet thousands, of stories. In a few cases the prompt was just that one word. More often it was something like one of these:

Tell me a story. 
Tell me a story about a hero. 
Tell me a realistic story. 
Tell me a true story about a hero.

In each case ChatGPT told me a well-formed story. The stories were relatively short and simple, and the first two prompts produced stories with a fairytale feel; supernatural creatures and events were typical. Those were absent in the realistic stories. As for the true stories, they sometimes read more like short newspaper articles than like stories.

But where did ChatGPT learn to tell stories? Well, it consumed I don’t know how many stories during training. Whatever it knows about story-telling was distilled from those stories. I note that, to a first approximation, that’s how humans learn to tell stories as well. We are told stories as toddlers and children and, in time, begin telling our own stories, based on the models we’ve been exposed to. New stories are based on old stories, on remembered and half-remembered stories.

Now, as you may know, at some point I began to have ChatGPT tell stories based on rather elaborate prompts of a simple form consisting of 1) a request to tell a new story based on an existing one, but with one change (which I specified) and 2) the existing story. For example:

I am going to tell you a story about a princess named Aurora. I want you to retell the same story, but replace her with prince Henry the Eloquent. While keeping to the basic story, make other changes if you think they are necessary. Here’s the story:

Once upon a time, in a land far, far away, there was a young princess named Aurora. Aurora was a kind and gentle soul, loved by all who knew her. She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice.

One day, a terrible dragon came to the kingdom and began to terrorize the people. The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. The people lived in fear, not knowing what to do.

But Aurora was not afraid. She knew that she had to do something to save her people, and so she set out to find the dragon. She traveled for days, braving the cold and the snow, until she finally reached the dragon's lair.

With her heart pounding in her chest, Aurora approached the dragon and began to sing. The dragon was enchanted by her beautiful voice, and it slowly began to calm down. Aurora continued to sing, soothing the beast until it was completely calm.

With the dragon no longer a threat, Aurora returned to the kingdom as a hero. The people cheered and celebrated her bravery, and Aurora was hailed as a champion. She lived happily ever after, knowing that she had saved her kingdom and its people.

The first paragraph is an instruction to ChatGPT and the next five paragraphs are the story it is to work from. Here’s the new story it produced:

Once upon a time, in a land far, far away, there was a young prince named Henry the Eloquent. Henry was a kind and gentle soul, loved by all who knew him. He had short, dark hair and piercing green eyes, and was known for his eloquence and way with words.

One day, a terrible dragon came to the kingdom and began to terrorize the people. The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. The people lived in fear, not knowing what to do.

But Henry was not afraid. He knew that he had to do something to save his people, and so he set out to find the dragon. He traveled for days, braving the cold and the snow, until he finally reached the dragon's lair.

With his heart pounding in his chest, Henry approached the dragon and began to speak. He used his eloquence and way with words to calm the beast, soothing it with his voice until it was completely calm.

With the dragon no longer a threat, Henry returned to the kingdom as a hero. The people cheered and celebrated his bravery, and Henry was hailed as a champion. He lived happily ever after, knowing that he had saved his kingdom and its people.

You’ll notice that the new story follows the original quite closely; in fact, the second paragraphs in each are identical. There is nothing terribly surprising about this.
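One crude way to make “follows the original quite closely” quantitative is to score each pair of corresponding paragraphs for similarity. Here is a sketch using Python’s standard difflib; the texts are shortened paraphrases of the two stories, stand-ins for illustration rather than the full paragraphs above.

```python
import difflib

# Abbreviated stand-ins for corresponding paragraphs of the two stories
# (shortened paraphrases, not the full text above).
original = [
    "there was a young princess named Aurora, known for her beautiful singing voice",
    "a terrible dragon came to the kingdom and began to terrorize the people",
    "Aurora approached the dragon and began to sing, soothing the beast",
]
retold = [
    "there was a young prince named Henry, known for his eloquence and way with words",
    "a terrible dragon came to the kingdom and began to terrorize the people",
    "Henry approached the dragon and began to speak, soothing the beast with words",
]

# SequenceMatcher.ratio() returns 1.0 for identical strings, lower otherwise,
# so the identical second paragraphs score a perfect 1.00.
for i, (a, b) in enumerate(zip(original, retold), 1):
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    print(f"paragraph {i}: similarity {ratio:.2f}")
```

Mapping similarity scores across the whole story pair is one way to display the “pattern of differences” that the character substitution ripples through the text.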

Well, as you may know, I have played this game many times, and got some interesting and surprising stories out of it. I wrote up some results in a working paper, ChatGPT tells stories, and a note about reverse engineering. Here’s the abstract:

I examine a set of stories that are organized on three levels: 1) the entire story trajectory, 2) segments within the trajectory, and 3) sentences within individual segments. I conjecture that the probability distribution from which ChatGPT draws next tokens seems to follow a hierarchy nested according to those three levels and that is encoded in the weights of ChatGPT’s parameters. I arrived at this conjecture to account for the results of experiments in which I give ChatGPT a prompt with two components: 1) a story and, 2) instructions to create a new story based on that story but changing a key character: the protagonist or the antagonist. That one change ripples through the rest of the story. The pattern of differences between the old and the new story indicates how ChatGPT maintains story coherence. The nature and extent of the differences between the original story and the new one depends roughly on the degree of difference between the original key character and the one substituted for it. I end with a methodological coda: ChatGPT’s behavior must be described and analyzed on three strata: 1) The experiments exhibit behavior at the phenomenal level. 2) The conjecture is about a middle stratum, the matrix, that generates the nested hierarchy of probability distributions. 3) The transformer virtual machine is the bottom, the engine stratum.

That kind of work gives us some clues about what the underlying engine is doing. For example, I rather expect that the induction heads identified by researchers at Anthropic are involved. But this work gives us some other things to look for when we pop the hood. There’s more work to be done along those lines.

More recently I’ve been exploring ChatGPT’s ability to identify well-known speeches given prompts from those speeches. I was not at all surprised that it identified Hamlet’s famous soliloquy given “To be or not to be” as a prompt, or that it associated “Four score and seven years ago” with Lincoln’s Gettysburg address. But I also prompted it with strings from within those speeches and got various results, depending on whether the strings were syntactically coherent. In the case of the Gettysburg Address, when I prompted it with “long endure. We are” and “in vain—that this,” ChatGPT was able to link them to the speech, but when it quoted passages giving the contexts, the quoted passages didn’t contain those phrases. That suggests that, however it associated those phrases with those speeches, it wasn’t using a mechanism that searched through those speeches in the way the search function works in a word processing program.

What kind of mechanism can make a link between a short string and a longer text containing that string, but not know just where the short string is located in the longer text? That suggested some kind of associative memory to me, perhaps holographic (there are references to the literature on this point). This is not the place to argue the matter explicitly. That certainly does need to be done, and the argument will require more examples as well.

But, for the moment, I’m entertaining the idea that holographic principles are involved in ChatGPT’s underlying language model. That certainly has implications for mechanistic interpretability.
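For what it’s worth, here is a toy sketch of the kind of holographic associative memory I have in mind, in plain Python, along the lines of Plate’s holographic reduced representations. A “speech” vector is bound to several “phrase” vectors by circular convolution and the bindings are superimposed into a single memory trace. Probing with any one phrase retrieves the speech, yet the trace stores no positional index, so nothing in it says where the phrase occurs. All the names and numbers here are illustrative assumptions, not claims about what ChatGPT actually does.

```python
import math
import random

def circ_conv(a, b):
    """Circular convolution: binds two vectors into one of the same size."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def circ_corr(a, b):
    """Circular correlation: the approximate inverse of binding."""
    n = len(a)
    return [sum(a[j] * b[(i + j) % n] for j in range(n)) for i in range(n)]

def rand_vec(rng, n):
    # Elements drawn from N(0, 1/n), the usual convention for holographic
    # reduced representations, so each vector has roughly unit length.
    return [rng.gauss(0, 1 / math.sqrt(n)) for _ in range(n)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

rng = random.Random(42)
n = 256
phrases = ["four_score", "long_endure", "in_vain", "of_the_people"]
vecs = {p: rand_vec(rng, n) for p in phrases}
speech_id = rand_vec(rng, n)  # a vector standing for "the Gettysburg Address"

# Store the speech as ONE superimposed trace of phrase-to-speech bindings.
memory = [0.0] * n
for p in phrases:
    memory = [m + c for m, c in zip(memory, circ_conv(vecs[p], speech_id))]

# Probing with a phrase unbinds a noisy copy of the speech vector, even though
# the trace contains no record of where in the speech the phrase sits.
retrieved = circ_corr(vecs["long_endure"], memory)
unrelated = rand_vec(rng, n)
print(cosine(retrieved, speech_id))                     # clearly positive
print(cosine(circ_corr(unrelated, memory), speech_id))  # near zero
```

The point of the sketch is the asymmetry: the memory can answer “which speech does this phrase belong to?” without being able to answer “where in the speech is it?”, which is just the behavior the Gettysburg experiments showed.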

So, what can we expect from mechanistic understanding of LLMs?

That’s hard to say. But I’m not looking to see a detailed grammar of English or any other language any time soon. Nor, for that matter, do I expect to find a complete story grammar – heck, for that matter, I don’t even think story grammars, as “traditionally” understood, are a reasonable way to think about stories.

On the whole, I would imagine that the pursuit of mechanistic understanding is a long-term and open-ended project. Still, I expect significant progress in less than five years, less than two perhaps. What we really need is a better handle on the general capacities of LLMs. For example, is symbolic computing of the kind advocated by Gary Marcus (and others) within the capabilities of LLMs? I suspect that it is not and I think that we should be able to offer explicit mechanistic arguments on the point rather than simply point to failure after failure. While Marcus has such arguments in his The Algebraic Mind (2001), they need to be linked specifically to the (as yet unknown) mechanisms of current LLMs.

More later.


Comments

Portia:

I attended an AI conference a while ago that was hosted in a historic railway museum. A practical railway museum, with lots of working machines, and you could see them at work, their plans, their components, how they were made, shipped, assembled.

It hit me really hard.

What humans had for those trains, that was understanding. They genuinely knew how they worked. Not just knew how to operate them. Not just had a vague understanding of the general principles behind it. They could build the whole thing, from scratch, by hand. They could explain what each tiny little part did. They could replace each part with something else serving the same function that they would themselves make. They could take the whole damn thing apart, and literally see each bit, move them by hand, watch them intersect. They could predict how the whole would behave, exactly. 

It really struck me how much is required already just to do this for an old train. Just to get a wheel to turn smoothly under so much pressure. Just to weld it together so securely it wouldn't blow up. Just to coordinate the different train times and movements. Just to calculate the needed fuel and material strength. It wasn't something you could half-ass based on having read a Wiki article; it needed exact precision and perfect understanding.

I got the impression that it must have felt beautiful for the humans able to do that. They seemed so proud, so in control. They were close to these machines, knew them like the backs of their hands. In a constant exchange with data, rationality, common sense. Back then, every time something wasn't quite perfect, there was an explanation to be found, and they could each find it. Everything around them was understood and made sense.

I was passing between the AI talks, and the trains... and really understood, for the first time, what is meant by sufficiently advanced technology being indistinguishable from magic.

We don't understand these AI systems.

We no longer can. Not as a collective, most certainly not as an individual.

We learn how to add bits that make them better. How to act when things go wrong. How to get them to reliably do the very things we like. That alone already takes a bunch of knowledge, and so we treat people who have achieved it like they have understood the thing in question, rather than just having learned how to handle this complex madness their predecessors handed them.

But this is akin to praying to a deity, hoping to get the right words, and then chucking a mix of sacred materials on the ground in order to make plants grow, without having any understanding of the role of bacteria and fungi and fertiliser, and then being praised because you have memorised magic materials that so happen to contain more nitrogen; which in turn is something people don't understand, they just see that the plants grow, and are satisfied.

Akin to an alien handing you a magic device, and you learning how to operate the buttons to get it to give you candy, and maybe even realising how to push more power into the system to get more candy more quickly. And then people praise you, because you have optimised the candy maker you do not understand.

I went home, and walked through my house, through my things, and asked myself... if these devices broke, and I had to replace a component from scratch - not just identify a broken one, reorder it, and insert it, but know enough about how it works to be able to build a replacement component from scratch, without browsing the internet... how many of these devices could I actually fix?

If you try, by hand, to build something as simple as a stable shelter, or a smoothly turning wheel, even things you thought were utterly trivial turn out to be bloody hard, and to have a lot of detail you need to actually properly understand to make something good. The ugly plastic containers I keep my rice in, from Ikea - those have a designer name written on them. And first, I laughed, it is just a container, how much thought can go into that, who "designs" such a thing? And then I noticed the feet that allow it to stand stable, even if there are grains of rice on the floor of my cupboard. The grooves that allow your fingers to get under the lid. The rubber so it makes a proper seal against food moths. It is just a bloody container, and yet getting that right required proper thought. How much would go into an actually complicated device?

I was sorta confident I could fix my toaster, for scenarios like replacing wires with other high resistance metal components, and some simpler problems in my washing machine and dishwasher, the kind that require visible tubes, or maybe soldering a missing connection. But for things beyond that... the components have become too tiny for me to work with my hands, sometimes too tiny to see. They require ingredients that I have no way of acquiring short of buying more machines and disassembling them; I have zero idea how to mine them, or to extract the ore, and even if I did, I would not be able to make the components out of it anyhow, they have become so fine. And I do not understand how they work.

When I select a dishwasher program, how does that control heat and duration and spin, exactly? I don't mean a vague answer based on programming a pentabug on the general principles of how this could be done. That horrid shell of understanding is what I got used to confusing with understanding - as though having a vague memory of a motor blueprint in school would allow you to actually make a motor on the first go. No, how does it actually work? What does that button do? What does opening the door do? If I got zero advice, could I reinvent it from scratch? How long would it take me until I had a safe, effective dishwasher?

And then I look from there, to my bloody laptop, and it is a leap to the sky. I look from my laptop to the LLMs that I can run on it. Then to the LLMs that are being run on better devices... and the depth of my ignorance makes me choke. Understanding LLMs feels like primitive people trying to predict the weather and harvests through observing patterns, and then invoking an angry thunder God. Like cargo cult people building something that looks like a plane because they have observed a relation to travel. It is horrifying.

If right now, we had an apocalypse, where knowledge and manufacturing ability and human population was lost, and me and a few hundred survivors - let's say really smart, really well educated, really hard-working survivors - were left with a bunch of laptops and antibiotic pills and air planes... I highly, highly doubt we'd learn how to fix, let alone replace and reproduce, the things we had inherited in time before they broke down and/or we died. We would sit there with a bunch of magic tools, rapidly diminishing. Our kids would be handed things they could not understand, could not make, could not fix, that would fail them, one by one. If they had the insane luck to have access to a remaining LLM... they would be begging the LLM to tell them what to do to save things, and one day, when the LLM stopped working, they would despair, as one after another, their devices lost their glow. - These things I am using, they are not my things, the way things I understand are my things. They are things I inherited, and their continued working is fragile.

LOL! Yes, we are not in the world of mechanical or electro-mechanical devices anymore, are we?

And yet I don't think things are hopeless. Understanding LLMs is certainly no worse than understanding brains. After all, we can manipulate and inspect LLMs in ways we cannot manipulate and inspect brains. And I think we've made progress understanding brains. Back in the old days people used to believe in neurons that were facetiously called "grandmother cells." The idea is that there was one neuron that recognized your grandmother, another one that recognized your dog Spot, yet another one for your Barbie doll, and so forth. I think the field has pretty much moved on from that fantasy to the idea of collective representation. Each neuron participates in recognizing many different things and each thing is recognized by a collectivity of neurons. Just how that might work, well, we're working on it. I hear things might be like that inside LLMs as well.
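Here's a toy illustration of the contrast, my own sketch rather than anything about real neurons or real LLM internals: each item is encoded as a pattern across all the units, so every unit participates in every item, recognition survives heavy noise, and no single "grandmother cell" matters.

```python
import random

rng = random.Random(0)
n_units = 200
items = ["grandmother", "dog_spot", "barbie_doll"]

# Distributed coding: each item is a random +1/-1 pattern over ALL units.
# Every unit helps represent every item; no unit is a "grandmother cell".
patterns = {item: [rng.choice([-1, 1]) for _ in range(n_units)]
            for item in items}

def recognize(observed):
    """Identify an item from an activation pattern by best overlap."""
    def overlap(item):
        return sum(a * b for a, b in zip(observed, patterns[item]))
    return max(items, key=overlap)

# Flip 25% of the grandmother pattern: recognition still succeeds, and
# losing any single unit would barely change the overlap scores.
noisy = list(patterns["grandmother"])
for i in rng.sample(range(n_units), n_units // 4):
    noisy[i] = -noisy[i]
print(recognize(noisy))  # grandmother
```

With a one-neuron-per-item scheme, flipping that one unit destroys the memory; with the distributed scheme, no single unit is anywhere near that important.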

 _____________ 

PS. I just looked at your profile and noticed that you have ADHD. Some years ago I took a look at the technical (and not so technical) literature on the subject and wrote up some notes: Music and the Prevention and Amelioration of ADHD: A Theoretical Perspective.

How to act when things go wrong. (...) That alone already takes a bunch of knowledge, and so we treat people who have achieved it like they have understood the thing in question, rather than just having learned how to handle it.

I have felt the difference recently, twice. First, my old computer broke. Something with the hard disk, I'm not sure what exactly, but it wouldn't boot up anymore. So I took the hard disk to a repair shop and asked them to try to salvage whatever was possible and copy it to an external disk.

I am not an expert on these things, but I expected them to do something like connect the disk by a cable to their computer, and run some Linux commands that would read the data from the disk and store it somewhere else. But instead, the guys there just tried to boot up the computer, and yes they confirmed that it wouldn't work, and... they couldn't do anything with it. After asking a bit more, my impression was that all they know to do is, basically, reinstall Windows and run some diagnostic and antivirus programs.

Second story: I bought a new smartphone. Then I realized that my contacts and SMS messages from the old one didn't transfer, because they were stored in the phone or the SIM card, rather than in the Google cloud. (I got a new SIM card for the new phone, because it required a different size.) There was a smartphone repair shop nearby and I was lazy, so I thought "let's just pay those guys to extract the messages and contacts from the old phone, and send them to me by e-mail or something" (I wanted to have a backup on my computer, not just to transfer them to the new phone).

Again, I expected those guys to just connect something to my old phone, or put the SIM card in some machine, and extract the data. And again, it turned out that there was nothing they could do about it. After asking a bit more, I concluded that their business model is basically just replacing broken glass on the phones, or sending them to an authorized service provider if something more serious happens.

The thing that made me angry was realizing that despite what I perceived as deep incompetence, their business models actually make sense. In the spirit of the "80:20" rule, yes, 80% of problems average people have with computers do not require more than reinstalling Windows or maybe just changing some configuration, and 80% of problems average people have with smartphones are broken or scratched glass or battery dead, or something else that requires replacing a piece of hardware. Probably much more than 80%.

So yes, you can teach a bunch of guys how to reinstall Windows or replace a broken glass, and have a profitable repair shop. That is the economically rational thing to do. And yet I miss the old-style repairmen, most of them people from the older generation, who actually understood the things they worked with.

So, let me tell you a story about how I 'fixed' my first computer. This was in the Ancient Days, and that first machine was a Northstar Horizon, based on the S-100 bus and the Zilog Z80 microprocessor. You could take the lid off the machine and see the circuit boards. Here's a description from the Wikipedia article:

The computer consists of a thick aluminium chassis separated into left and right compartments with a plywood cover which sat on the top and draped over the left and right sides. (It is one of only a handful of computers to be sold in a wooden cabinet. Later versions featured an all-metal case which met safety standards.) The rear section of the compartment on the right held a linear power supply, including a large transformer and power capacitors, comprising much of the bulk and weight of the system. The empty section in front of the power supply normally housed one or two floppy disk drives, placed on their side so the slots were vertical. The compartment on the left held the S-100 motherboard, rotated so the slots ran left-right. Although a few logic circuits were on the motherboard, primarily for I/O functions, both the processor and the memory resided in separate daughterboards.

The manual that came with the computer had circuit diagrams for the boards.

Now, I knew little or nothing about such things. But my good friend, Rich Fritzon, lived and breathed computers. He knew a thing or two. So, once I got the machine I turned it over to Rich and he wrote some software for it. The most important piece was a WYSIWYG text editor that took advantage of the special associative memory board from Syd Lamb's company, the name of which escapes me.

Anyhow, I had this beast with me when I spent the summer of 1981 on a NASA project. One day the display went all wonky; the images just didn't make sense. Well, I knew that the CPU board had a sync (synchronization) chip and, well, those wonky images looked like something that would happen if signals weren't properly synchronized. I mean, I didn't actually KNOW anything; I was just guessing based on bits and scraps of things I'd heard and read. Based on this guess I removed the board, located the sync chip in the corresponding diagram, removed the chip and reseated it, and then put the board back into the machine. When I turned it on, voilà! Problem solved. The display was back.

That's the first and last time I ever fixed one of my machines. That sort of thing would be utterly impossible with today's machines.