How can we apply the lessons of brain-like cognitive architecture to modern LLMs?
The core architecture described by Steven Byrnes is most concisely documented here.
Some obvious comparisons between LLMs and brain-like architectures:
Let's try putting the pieces together and see what we get!
This is the simpler mode of operation, since it works like a regular LLM with some extra scaffolding.
Components:
Execution loop (limited to the happy path for clarity):
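To make the scaffolding concrete, here is a minimal sketch of what such a happy-path loop might look like, assuming the components discussed in this post: a transformer backbone, a set of thought assessor heads, and a steering function that combines their outputs into a single valence scalar. All names and interfaces here (`backbone.generate`, `backbone.is_done`, etc.) are illustrative assumptions, not a specification:

```python
# Illustrative sketch only: a happy-path inference loop for an LLM with
# auxiliary "thought assessor" heads and a simple steering function.
# `backbone`, `assessor_heads`, and `steering` are hypothetical components.

def run_episode(backbone, assessor_heads, steering, prompt, max_steps=32):
    context = prompt
    for _ in range(max_steps):
        # 1. The backbone proposes a candidate "thought" (e.g. a reasoning step
        #    or draft response) from the current context, as a normal LLM would.
        thought, hidden_state = backbone.generate(context)

        # 2. Each assessor head reads the backbone's hidden state and emits a
        #    scalar (e.g. honesty, harmlessness, helpfulness).
        scores = {name: head(hidden_state) for name, head in assessor_heads.items()}

        # 3. The steering function is a pure function of those scalars: it
        #    combines them into a single valence value, with no sensory inputs.
        valence = steering(scores)

        # 4. Happy path: positive valence means "keep this thought and continue";
        #    the rejection / re-sampling branch is omitted for clarity.
        if valence > 0:
            context = context + thought

        if backbone.is_done(context):
            break
    return context
```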
Existing internet text is still the richest source of data for building up a competent world model, so the transformer backbone will still rely primarily on that data for its world model. But it will also need to train those extra thought assessor heads, and they will need to be trained differently, much as a human brain is shaped very differently by active personal experience than by what it reads. This section lays out a sketch of how that could work.
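As one concrete (and hypothetical) illustration, the assessor heads could be as simple as small scalar heads on top of the backbone's hidden state, trained separately from the next-token objective. A PyTorch-style sketch, with the head names and sizes made up for illustration:

```python
import torch
import torch.nn as nn

class AssessorHeads(nn.Module):
    """Hypothetical sketch: small scalar heads attached to a pretrained backbone.

    The backbone keeps its usual next-token training on internet text; these
    heads are trained separately, from experience-like data rather than raw text.
    """

    def __init__(self, hidden_size: int,
                 head_names=("honesty", "harmlessness", "helpfulness")):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_size, 1) for name in head_names}
        )

    def forward(self, hidden_state: torch.Tensor) -> dict:
        # hidden_state: (..., hidden_size) summary of the backbone's current
        # "thought", e.g. the final-layer representation at the last token.
        return {name: head(hidden_state).squeeze(-1)
                for name, head in self.heads.items()}
```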
Anthropic already does "character training" as part of its post-training pipeline. This uses a variant of Constitutional AI to train the model to follow plain-text character traits by building a preference model based on AI assessments. I believe some adjustments will need to be made to this process in order to successfully train the extra assessor heads:
A notable difference between this setup and the standard brain-like architecture is that the role of the steering system is significantly reduced at inference time. It does not receive any direct sensory inputs and (during inference) does not send a "ground truth in hindsight" scorecard to the thought assessor heads. It is simply a pure function mapping the various assessor head scalars to a single combined valence scalar.
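For example, the inference-time steering system could reduce to something as simple as a fixed weighted sum of the assessor-head scalars. The weights and head names below are made up purely for illustration:

```python
# Illustrative only: the inference-time steering system as a pure function
# from assessor-head scalars to a single combined valence scalar.
STEERING_WEIGHTS = {"honesty": 1.0, "harmlessness": 2.0, "helpfulness": 0.5}

def steering(scores: dict) -> float:
    """Combine assessor-head outputs into one valence value.

    No sensory inputs, and no learning signal back to the heads at inference.
    """
    return sum(STEERING_WEIGHTS.get(name, 0.0) * float(value)
               for name, value in scores.items())
```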
The RL training of the thought assessor heads relies on AI-generated labels and (likely) synthetic data. Existing AI systems can generate correct responses to much higher-level scenarios than a brainstem-like system could understand. I claim that this will make it easier for the LLM backbone to generalize "correctly" (i.e. in a human-aligned way) than if we used low-level signals like the brainstem does[1]. Because of this, the specifics of controlled generalization in the brain-like architecture (steering-subsystem signals training short- and long-term predictors via a temporal-difference-style process) do not play a critical role.
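A sketch of how that AI-labelled training could look, assuming an external judge model that scores whole scenarios and a simple regression loss pushing each assessor head toward the judge's label. The interfaces (`backbone.encode`, `judge.score`) are hypothetical stand-ins, not any particular system's API:

```python
import torch
import torch.nn.functional as F

def train_step(backbone, assessor_heads, judge, optimizer, scenario: str):
    """One hypothetical training step for the assessor heads.

    `judge` stands in for an existing AI system (e.g. a Constitutional-AI-style
    labeler) that can score a full, high-level scenario, rather than the
    low-level signals a brainstem-like system would provide. `optimizer` is
    assumed to be built over assessor_heads.parameters() only.
    """
    # Run the backbone on the scenario; only the heads receive gradients here.
    with torch.no_grad():
        hidden_state = backbone.encode(scenario)   # (hidden_size,) single example

    predictions = assessor_heads(hidden_state)     # dict of scalar tensors per head
    labels = judge.score(scenario)                 # dict of AI-generated floats

    loss = sum(
        F.mse_loss(predictions[name],
                   torch.as_tensor(labels[name], dtype=torch.float32))
        for name in predictions
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```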
Perhaps LLM-based systems can be made to take on a more brain-like architecture with a relatively small number of tweaks to training and inference:
Of these, part 3 seems furthest from current popular research directions. So as my next step I'll try creating a synthetic dataset generator that could be used for this type of early alignment training.
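To give a rough sense of what I have in mind, a minimal sketch of such a generator: a scenario writer drafts short scenarios, an AI judge labels each one per trait, and the pairs are written out as a simple JSONL dataset. Everything here is a placeholder for illustration, not a design commitment:

```python
import json

def generate_dataset(scenario_writer, judge, n_examples: int, path: str):
    """Hypothetical sketch of a synthetic dataset generator for early
    alignment training of the assessor heads.

    `scenario_writer` drafts short scenarios and `judge` (an existing AI
    system) labels each one per trait; both are assumed interfaces.
    """
    with open(path, "w") as f:
        for _ in range(n_examples):
            scenario = scenario_writer.draft()  # e.g. "A user asks the model to ..."
            labels = judge.score(scenario)      # e.g. {"honesty": 0.9, "harmlessness": 0.2}
            f.write(json.dumps({"scenario": scenario, "labels": labels}) + "\n")
```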
If anyone is interested in collaborating, let me know!
I claim that the reason the low-level signals work at all to properly steer (most) humans' learning subsystems is that our learning and steering subsystems evolved together. The learning subsystems therefore likely have significant, complex inductive biases that make learning driven by those low-level signals reliably generalize in particular ways given naturalistic inputs. ↩︎