Scaffolded LLMs as natural language computers

Insub (3y)

Wow. When I use GPT-4, I've had a distinct sense of "I bet this is what it would have felt like to use one of the earliest computers". Until this post I didn't realize how literal that sense might be.

This is a really cool and apt analogy - computers and LLM scaffolding really do seem like the same abstraction. Thinking this way seems illuminating as to where we might be heading.

Evan R. Murphy (3y)

Post summary

I was interested in your post and noticed it didn't have a summary, so I generated one using a summarizer script I've been working on and iteratively improving:

Scaffolded large language models (LLMs) have emerged as a new type of general-purpose natural language computer. With the advent of GPT-4, these systems have become viable at scale, wrapping a programmatic scaffold around an LLM core to achieve complex tasks. Scaffolded LLMs resemble the von-Neumann architecture, operating on natural language text rather than bits.
The LLM serves as the CPU, while the prompt and context function as RAM. The memory in digital computers is analogous to the vector database memory of scaffolded LLMs. The scaffolding code surrounding the LLM core implements protocols for chaining individual LLM calls, acting as the "programs" that run on the natural language computer.
Performance metrics for natural language computers include context length (RAM) and Natural Language OPerations (NLOPs) per second. Exponential improvements in these metrics are expected to continue for the next few years, driven by the increasing scale and cost of LLMs and their training runs.
Programming languages for natural language computers are in their early stages, with primitives like Chain of Thought, Selection-Inference, and Reflection serving as assembly languages. As LLMs improve and become more reliable, better abstractions and programming languages will emerge.
The execution model of natural language computers is an expanding Directed Acyclic Graph (DAG) of parallel NLOPs, resembling a dataflow architecture. Memory hierarchy in scaffolded LLMs currently has two levels, but as designs mature, additional levels may be developed.
Unlike digital computers, scaffolded LLMs face challenges in reliability, underspecifiability, and non-determinism. Improving the reliability of individual NLOPs is crucial for building powerful abstractions and abstract languages. Error correction mechanisms may be necessary to create coherent and consistent sequences of NLOPs.
Despite these challenges, the flexibility of LLMs offers great opportunities. The set of op-codes is not fixed but ever-growing, allowing for the creation of entire languages based on prompt templating schemes. As natural language programs become more sophisticated, they will likely delegate specific ops to the smallest and cheapest language models capable of reliably performing them.
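The execution model described in the summary, a DAG of NLOPs, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `run_dag`, the op names, the prompt templates, and the `fake_nlop` stub that stands in for a real LLM call. The sketch runs ops sequentially in dependency order; independent ops could be dispatched in parallel, which is not shown.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_dag(
    ops: Dict[str, str],
    deps: Dict[str, Set[str]],
    nlop: Callable[[str], str],
) -> Dict[str, str]:
    """Run each op once all of its upstream ops have produced output.

    ops  maps an op name to a prompt template (hypothetical format).
    deps maps an op name to the names of the ops it reads from.
    nlop is the call into the language model (stubbed below).
    """
    results: Dict[str, str] = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps.get(name, set())}
        results[name] = nlop(ops[name].format(**upstream))
    return results


def fake_nlop(prompt: str) -> str:
    # Stand-in for an LLM call: it only wraps the prompt in brackets.
    return f"<{prompt}>"


ops = {
    "facts": "list facts",
    "critique": "critique: {facts}",
    "summary": "combine {facts} and {critique}",
}
deps = {"facts": set(), "critique": {"facts"}, "summary": {"facts", "critique"}}

results = run_dag(ops, deps, fake_nlop)
for name, output in results.items():
    print(name, "->", output)
```

Each op only sees the outputs of the ops it depends on, which is what makes the graph acyclic and the intermediate text inspectable.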

If you have feedback on the quality of this summary, you can easily indicate that using LessWrong's agree/disagree voting.

Alexandre Variengien (3y)

Thanks for this rich analogy! Some comments about the analogy between context window and RAM:

Typo in the model name

>GPT3 currently has an 8K context or an 8kbit RAM (theoretically expanding to 32kbit soon). This gets us to the Commodore 64 in digital computer terms, and places us in the early 80s.

I guess you meant GPT4 instead of GPT3.

Equivalence token to bits

Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token = 13-17 bits be a more accurate equivalence?

Processor register as a better analog for the context window

One caveat I'd like to discuss: in the post, you describe the context window of NLPU as the analog for the RAM of computers. I think a more accurate analog could be processor registers.

Like the context window, registers are the memory bits directly connected to the computing unit, whereas it takes an instruction to load information from RAM before the CPU can use it. RAM sits in the middle of the memory hierarchy, while registers are at its top.

If we accept this new analog, then NLPUs have by default (without external memory) access to much more data than CPUs. Modern CPUs have around 32 32-bit registers, so around 1kbit of space to store inputs, compared to the 80kbit in the context window of current LLMs (using 1 token = 10 bits).
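The arithmetic behind that comparison can be checked in a few lines. This is a back-of-the-envelope sketch using the figures stated above; treating "8K" as exactly 8,000 tokens is a simplifying assumption.

```python
# Figures taken from the comment; 8,000 tokens is a rounded stand-in for "8K".
REGISTERS = 32
BITS_PER_REGISTER = 32
CONTEXT_TOKENS = 8_000
BITS_PER_TOKEN = 10

register_bits = REGISTERS * BITS_PER_REGISTER   # about 1 kbit
context_bits = CONTEXT_TOKENS * BITS_PER_TOKEN  # 80 kbit
ratio = context_bits / register_bits

print(f"registers: {register_bits} bits")
print(f"context:   {context_bits} bits")
print(f"context holds about {ratio:.0f}x more")
```

Under these assumptions the context window holds roughly 80 times as much directly addressable data as the register file.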

I think this might be an additional factor -- on top of the increased power and reliability of LLMs -- that made us wait for so long after GPT3 before beginning to design complicated chaining of LLM calls. A single LM can store enough data in its context window to do many useful tasks: as you describe, there are many NLPU primitives to discover and exploit. On the other hand, a CPU with no RAM is basically an over-engineered calculator. It becomes truly useful once embedded in a von-Neumann architecture.

Multimodal models

>If the natural type signature of a CPU is bits -> bits, the natural type of the natural language processing unit (NLPU) is strings -> strings.

With the rise of multimodal (image + text) models, NLPUs could be required to deal with data types other than "string", such as image embeddings, since images cannot be efficiently converted into natural text.

Tao Lin (3y)

>Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token = 13-17 bits be a more accurate equivalence?

LLMs make very inefficient use of their context size because they're writing human-like text, which is predictable. Human text is like 0.6 bits/byte, so maybe 2.5 bits per token. Text used in language model scaffolding and such tends to be even more predictable (by maybe 30%).
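The two estimates in this exchange can be put side by side. The sketch below assumes a uniform distribution over the vocabulary for the upper bound, and takes the 0.6 bits/byte and 2.5 bits/token figures above at face value; the 8,000-token context is a rounded stand-in for the "8K" quoted from the post.

```python
import math

# Upper bound: a token drawn uniformly from a vocabulary of size V carries log2(V) bits.
upper_bounds = {vocab: math.log2(vocab) for vocab in (10_000, 100_000)}

# Figures quoted in the comment above, for predictable human-like text.
BITS_PER_BYTE = 0.6
BITS_PER_TOKEN = 2.5

# The two figures together imply an average token length in bytes.
implied_bytes_per_token = BITS_PER_TOKEN / BITS_PER_BYTE

CONTEXT_TOKENS = 8_000  # assumption: "8K" rounded for simplicity
effective_context_bits = CONTEXT_TOKENS * BITS_PER_TOKEN

for vocab, bits in upper_bounds.items():
    print(f"vocab={vocab:>7}: at most {bits:.1f} bits/token")
print(f"implied bytes per token: {implied_bytes_per_token:.2f}")
print(f"effective context: {effective_context_bits:.0f} bits")
```

The gap between the 13-17 bit ceiling and the 2.5 bit estimate is the redundancy of natural language text.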

beren (3y)

Thanks for these points!

Equivalence token to bits
>Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token = 13-17 bits be a more accurate equivalence?

My thinking here is that the scaffolded LLM is a computer which operates directly in the natural language semantic space so it makes more sense to define the units of its context in terms of its fundamental units such as tokens. Of course each token has a lot more information-theoretic content than a single bit -- but this is why a single NLOP is much more powerful than a single FLOP. I agree that tokens directly are probably not the correct measure since they are too object level and there is likely some kind of 'semantic bit' idealisation which needs to be worked out.

Processor register as a better analog for the context window

>One caveat I'd like to discuss: in the post, you describe the context window of NLPU as the analog for the RAM of computers. I think a more accurate analog could be processor registers.
>Like the context window, registers are the memory bits directly connected to the computing unit, whereas it takes an instruction to load information from RAM before the CPU can use it. RAM sits in the middle of the memory hierarchy, while registers are at its top.

I think I discuss this in the memory hierarchy section of the post. I agree that it is unclear what the best conceptualisation of the context window is. It is not necessarily directly comparable to RAM and may be more like processor registers. I think the main point is that scaffolded LLM systems currently have a two-level memory hierarchy, whereas computers have evolved a fairly complex and highly optimised multi-level system. It may be that we also eventually develop such a system or its equivalent for LLMs. I actually do not know how the memory hierarchy for the earliest computers worked -- did they already have a register -> RAM -> disk distinction?

>I think this might be an additional factor -- on top of the increased power and reliability of LLMs -- that made us wait for so long after GPT3 before beginning to design complicated chaining of LLM calls. A single LM can store enough data in its context window to do many useful tasks: as you describe, there are many NLPU primitives to discover and exploit. On the other hand, a CPU with no RAM is basically an over-engineered calculator. It becomes truly useful once embedded in a von-Neumann architecture.

This is an interesting hypothesis. My alternate hypothesis is essentially a combination of two things: (a) reliability and instruction following with GPT3 was just too bad for this to work appreciably, and we broke through some kind of barrier with GPT4; and (b) there actually was not that much time. The GPT3 API only became widely usable in mid-2021 IIRC, so that is about a year and a bit between that and the ChatGPT release, which is hardly any time to start iterating on this stuff.

Multimodal models

>If the natural type signature of a CPU is bits -> bits, the natural type of the natural language processing unit (NLPU) is strings -> strings.
>With the rise of multimodal (image + text) models, NLPUs could be required to deal with data types other than "string", such as image embeddings, since images cannot be efficiently converted into natural text.

Indeed. It should be interesting to see if we converge to some canonical datatype or not. The reason strings are so nice is that they compose easily and are incredibly flexible. The alternative is having directly chained architectures which communicate in embeddings, which can then be arbitrarily multimodal. Whether this works or not depends on how 'internalised' the cognition of the system is. The current agentic LLM trend is to externalise, which is, imho, good from an interpretability and steerability perspective. It may reverse.

Bill Benzon (3y)

I've just taken a quick look and have a quick and crude reaction.

Consider how natural language is learned. The infant or toddler is surrounded by people who speak. They begin to babble and eventually manage to babble in a way that intends meaning. So they've got a device for producing tokens as motor output that produces audio tokens that can intermingle with the audio input tokens being produced by others.

We're now dealing with two token streams. There's a large audio stream, with input from various sources. And the smaller motor stream, which is closely correlated with some of the tokens in the audio stream because it has 'produced' them.

You need to take a look at Lev Vygotsky's account of language learning as a process of internalizing the speech streams of others. Here's a quick intro. Also, think of language as an index over one's conceptual space. & one LLM can index the space of another.

Bill Benzon (3y)

I'm somewhat more interested in similarity to (human) brains than von Neumann computers. This is from a relatively recent blog post, where I suggest that the generation of a single token is analogous to a single whole brain "frame" of neural computation:

I’m thinking in particular of the work of the late Walter Freeman, who was a pioneer in the field of complex neurodynamics. Toward the end of his career he began developing a concept of “cinematic consciousness.” As you know, the movement in motion pictures is an illusion created by the fact that the individual frames of the image are projected on the screen more rapidly than the mind can resolve them. So, while the frames are in fact still, they change so rapidly that we see motion.

First I’ll give you some quotes from Freeman’s article to give you a feel for his thinking (alas, you’ll have to read the article to see how those things connect up), then I’ll explain what that has to do with LLMs. The paragraph numbers are from Freeman’s article.

[20] EEG evidence shows that the process in the various parts occurs in discontinuous steps (Figure 2), like frames in a motion picture (Freeman, 1975; Barrie, Freeman and Lenhart, 1996).
[23] Everything that a human or an animal knows comes from the circular causality of action, preafference, perception, and up-date. It is done by successive frames of self-organized activity patterns in the sensory and limbic cortices. [...]
[35] EEG measurements show that multiple patterns self-organize independently in overlapping time frames in the several sensory and limbic cortices, coexisting with stimulus-driven activity in different areas of the neocortex, which structurally is an undivided sheet of neuropil in each hemisphere receiving the projections of sensory pathways in separated areas. [...]
[86] Science provides knowledge of relations among objects in the world, whereas technology provides tools for intervention into the relations by humans with intent to control the objects. The acausal science of understanding the self distinctively differs from the causal technology of self-control. "Circular causality" in self-organizing systems is a concept that is useful to describe interactions between microscopic neurons in assemblies and the macroscopic emergent state variable that organizes them. In this review intentional action is ascribed to the activities of the subsystems. Awareness (fleeting frames) and consciousness (continual operator) are ascribed to a hemisphere-wide order parameter constituting a global brain state. Linear causal inference is appropriate and essential for planning and interpreting human actions and personal relations, but it can be misleading when it is applied to microscopic-macroscopic relations in brains.

Notice that Freeman refers to “a hemisphere-wide order parameter constituting a global brain state.” The cerebral cortex consists of 16B neurons, each with roughly 10K connections. Further, all areas of the cortex have connections with subcortical regions. That’s an awful lot of neurons communicating in parallel in a single time step. As I recall from another article, these frames occur at a rate of 6-7 Hz.

The nervous system operates in parallel. I believe it is known that the brain exhibits a small world topology, so all neurons are within a relatively small number of links from one another. Though at any moment some neurons will be more active than others, they are all active – the only inactive neuron is a dead neuron. Similarly, ANNs exhibit a high degree of parallelism. LLMs are parallel virtual machines being simulated by so-called von Neumann machines. The use of multiple cores gives some parallelism, but that’s quite small in relation to the overall number of parameters the system has.

I propose that the process of generating a single token in an LLM is comparable to a single “frame” of consciousness in Freeman’s model. All the parameters in the system are visited during a single time-step for the system. In the case of ChatGPT I believe that’s 175B parameters.

EduardoNeville (2y)

Very interesting! Similar to error correcting codes used in hardware, research can be conducted to create "semantic" error correcting codes for LLM outputs. These codes would help repair errors or inconsistencies in the outputs of NLOPs, enabling the stitching together of extended sequences of operations in a coherent and consistent manner.
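One of the simplest hardware error correcting schemes, the repetition code with majority voting, has a direct analog for NLOPs. The sketch below is only an illustration of that idea: `majority_vote` and `scripted_nlop` are hypothetical names, the "LLM" is a scripted stub, and answers are compared by exact match after normalisation. A genuinely semantic scheme would need some way to judge whether two differently worded outputs mean the same thing, which is not attempted here.

```python
from collections import Counter
from typing import Callable, Iterable


def majority_vote(nlop: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Run the same NLOP n times and return the most common normalised answer."""
    answers = [nlop(prompt).strip().lower() for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


def scripted_nlop(outputs: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an unreliable LLM: replays fixed outputs in order."""
    replay = iter(outputs)
    return lambda prompt: next(replay)


# Two of the five simulated calls return a wrong answer.
noisy = scripted_nlop(["Paris", "paris ", "Lyon", "Paris", "Lyon"])
corrected = majority_vote(noisy, "What is the capital of France?", n=5)
print(corrected)
```

The cost is the same as in hardware: n calls for one reliable result, so this trades NLOPs for reliability.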

AI-doom (3y)

Great and informative post! It seems to me that this architecture could enhance safety to some extent in the short term. Let's imagine an AI system similar to Auto-GPT, consisting of three parts: a large language model agent focused on creating stamps, a smaller language model dedicated to producing paperclips, and an even smaller scaffolding agent that leverages the language models to devise plans for world domination. Individually, none of these systems possess the intelligence to trigger an intelligence explosion or take over the world. If such a system reaches a point where it is capable of planning world domination, it is likely less dangerous than a simple language model with that goal would be, since the agent providing the goal is too simple to comprehend the importance of self-preservation and is further from superintelligence than the other parts. If so, scaffolding-like structures could be employed as a safety measure, and stop buttons might actually prove effective. Am I mistaken in my intuition? What would likely be the result of an intelligence explosion in the above example? Paperclip maximizers?

Seth Herd (3y)

Excellent writeup. I like the term scaffolding. As you note, agentizing is only the beginning; even the first efforts go beyond agentizing LLMs and provide other cognitive capacities.

We have entered the age of cognitive engineering, in the sense that it is now possible to relatively easily assemble multiple cognitive machines into a greater whole. That ease will increase as people develop the equivalent of APIs for multiple software tools.


The reason I am saying 'scaffolded' LLMs instead of 'agentized' LLMs as in a recent post is that, while agents are hot right now, the idea is broader. Not all natural language programs need to be agents. Agents are a natural abstraction suited to a particular type of task. But there are others.


An interesting aspect of this analogy is that it clarifies the role and economic status of current foundation model providers like OpenAI. These essentially occupy an identical economic niche to the big chip-makers of the digital computer era such as Intel. The structure of their business is very similar. Training foundation models incurs massive fixed capital costs (as does building new chip fabs). They face constantly improving technology and new generations of tech which are vastly more powerful (Moore's law vs contemporary rapid AI scaling). They sell a commodity product (chips vs API calls) at large volume with a high margin but also substantial marginal costs (actually manufacturing each chip vs running inference on a model). If these equivalences hold then we can get some idea of the likely long-run shape of this industry -- namely, the current and historical semiconductor industry. We should expect consolidation into a few main oligopolistic players, each of which has massive fixed costs and remains in fairly fierce competition, but which never print money with extremely high margins in the way that SaaS or software-based companies tend to.


NLOPs also differ crucially from more standard FLOPs in that they have different levels of 'intrinsic difficulty'. A small language model might be capable of some tasks, but others might require a large state-of-the-art one. As NL programs become more sophisticated and elaborate, it is likely that there will be an increasing understanding of the difficulty of specific ops and a delegation of each op to the smallest and cheapest language model with the capability to reliably perform it. Thus, NL programs will not have a uniform 'CPU' (LLM) core but will comprise a number of heterogeneous calls to many different language models of different scales and specializations.
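The delegation described here can be sketched as a simple routing rule. All names and numbers below are hypothetical: the model names, the scalar `capability` score, the relative costs, and the per-op difficulty ratings are made up for illustration, and in practice both capability and difficulty would have to be measured rather than assigned.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Model:
    name: str
    capability: int       # hypothetical scalar score; higher handles harder ops
    cost_per_call: float  # hypothetical relative cost


def cheapest_capable(models: List[Model], difficulty: int) -> Model:
    """Pick the lowest-cost model whose capability meets the op's difficulty."""
    capable = [m for m in models if m.capability >= difficulty]
    if not capable:
        raise ValueError(f"no model can handle difficulty {difficulty}")
    return min(capable, key=lambda m: m.cost_per_call)


MODELS = [
    Model("small", capability=1, cost_per_call=1.0),
    Model("medium", capability=2, cost_per_call=5.0),
    Model("large", capability=3, cost_per_call=30.0),
]

# Made-up difficulty ratings for three ops in an NL program.
OP_DIFFICULTY: Dict[str, int] = {"extract_date": 1, "summarise": 2, "plan": 3}

routing = {op: cheapest_capable(MODELS, d).name for op, d in OP_DIFFICULTY.items()}
total_cost = sum(cheapest_capable(MODELS, d).cost_per_call for d in OP_DIFFICULTY.values())
print(routing)
print(total_cost)
```

Sending all three ops to the large model would cost 90.0 in these made-up units, against 36.0 with routing, which is the economic pressure the footnote anticipates.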
