How can we apply the lessons of brain-like cognitive architecture to modern LLMs?
The core architecture described by Steven Byrnes is most concisely documented here.
Some obvious comparisons between LLMs and brain-like architectures:
Let's try putting the pieces together and see what we get!
This is the simpler mode of operation, since it works like a regular LLM with some extra scaffolding.
Components:
Execution loop (limited to the happy path for clarity):
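To make the scaffolding concrete, here is a minimal sketch of what such a happy-path loop might look like, assuming the components discussed in this post: a transformer backbone, a set of thought assessor heads, and a steering function that combines their outputs into a single valence scalar. All names and interfaces here (`backbone.generate`, `backbone.is_done`, etc.) are illustrative assumptions, not a specification:

```python
# Illustrative sketch only: a happy-path inference loop for an LLM with
# auxiliary "thought assessor" heads and a simple steering function.
# `backbone`, `assessor_heads`, and `steering` are hypothetical components.

def run_episode(backbone, assessor_heads, steering, prompt, max_steps=32):
    context = prompt
    for _ in range(max_steps):
        # 1. The backbone proposes a candidate "thought" (e.g. a reasoning step
        #    or draft response) from the current context, as a normal LLM would.
        thought, hidden_state = backbone.generate(context)

        # 2. Each assessor head reads the backbone's hidden state and emits a
        #    scalar (e.g. honesty, harmlessness, helpfulness).
        scores = {name: head(hidden_state) for name, head in assessor_heads.items()}

        # 3. The steering function is a pure function of those scalars: it
        #    combines them into a single valence value, with no sensory inputs.
        valence = steering(scores)

        # 4. Happy path: positive valence means "keep this thought and continue";
        #    the rejection / re-sampling branch is omitted for clarity.
        if valence > 0:
            context = context + thought

        if backbone.is_done(context):
            break
    return context
```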
Existing internet text is still the richest source of data for building up a competent world model, so the transformer backbone will still rely primarily on that data for its world model. But it will also need to train those extra thought assessor heads, and they will need to be trained differently, much as a human brain is shaped very differently by active personal experience than by what it reads. This section lays out a sketch of how that could work.
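As one concrete (and hypothetical) illustration, the assessor heads could be as simple as small scalar heads on top of the backbone's hidden state, trained separately from the next-token objective. A PyTorch-style sketch, with the head names and sizes made up for illustration:

```python
import torch
import torch.nn as nn

class AssessorHeads(nn.Module):
    """Hypothetical sketch: small scalar heads attached to a pretrained backbone.

    The backbone keeps its usual next-token training on internet text; these
    heads are trained separately, from experience-like data rather than raw text.
    """

    def __init__(self, hidden_size: int,
                 head_names=("honesty", "harmlessness", "helpfulness")):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_size, 1) for name in head_names}
        )

    def forward(self, hidden_state: torch.Tensor) -> dict:
        # hidden_state: (..., hidden_size) summary of the backbone's current
        # "thought", e.g. the final-layer representation at the last token.
        return {name: head(hidden_state).squeeze(-1)
                for name, head in self.heads.items()}
```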
Anthropic already does "character training" as part of its post-training pipeline. This uses a variant of Constitutional AI to train the model to follow plain-text character traits by building a preference model based on AI assessments. I believe some adjustments will need to be made to this process in order to successfully train the extra assessor heads:
A notable difference between this setup and the standard brain-like architecture is that the role of the steering system is significantly reduced at inference time. It does not receive any direct sensory inputs and (during inference) does not send a "ground truth in hindsight" scorecard to the thought assessor heads. It is simply a pure function mapping the various assessor head scalars to a single combined valence scalar.
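For example, the inference-time steering system could reduce to something as simple as a fixed weighted sum of the assessor-head scalars. The weights and head names below are made up purely for illustration:

```python
# Illustrative only: the inference-time steering system as a pure function
# from assessor-head scalars to a single combined valence scalar.
STEERING_WEIGHTS = {"honesty": 1.0, "harmlessness": 2.0, "helpfulness": 0.5}

def steering(scores: dict) -> float:
    """Combine assessor-head outputs into one valence value.

    No sensory inputs, and no learning signal back to the heads at inference.
    """
    return sum(STEERING_WEIGHTS.get(name, 0.0) * float(value)
               for name, value in scores.items())
```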
The RL training of the thought assessor heads relies on AI-generated labels and (likely) synthetic data. Existing AI systems can generate correct responses to much higher-level scenarios than a brainstem-like system could understand. I claim that this will make it easier for the LLM backbone to generalize "correctly" (i.e. in a human-aligned way) than if we used low-level signals like the brainstem does[1]. Because of this, the specifics of controlled generalization in the brain-like architecture (steering-subsystem signals training short- and long-term predictors via a temporal-difference-style process) do not play a critical role.
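A sketch of how that AI-labelled training could look, assuming an external judge model that scores whole scenarios and a simple regression loss pushing each assessor head toward the judge's label. The interfaces (`backbone.encode`, `judge.score`) are hypothetical stand-ins, not any particular system's API:

```python
import torch
import torch.nn.functional as F

def train_step(backbone, assessor_heads, judge, optimizer, scenario: str):
    """One hypothetical training step for the assessor heads.

    `judge` stands in for an existing AI system (e.g. a Constitutional-AI-style
    labeler) that can score a full, high-level scenario, rather than the
    low-level signals a brainstem-like system would provide. `optimizer` is
    assumed to be built over assessor_heads.parameters() only.
    """
    # Run the backbone on the scenario; only the heads receive gradients here.
    with torch.no_grad():
        hidden_state = backbone.encode(scenario)   # (hidden_size,) single example

    predictions = assessor_heads(hidden_state)     # dict of scalar tensors per head
    labels = judge.score(scenario)                 # dict of AI-generated floats

    loss = sum(
        F.mse_loss(predictions[name],
                   torch.as_tensor(labels[name], dtype=torch.float32))
        for name in predictions
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```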
Perhaps LLM-based systems can be made to take on a more brain-like architecture with a relatively small number of tweaks to training and inference:
Of these, part 3 seems furthest from current popular research directions. So as my next step I'll try creating a synthetic dataset generator that could be used for this type of early alignment training.
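To give a rough sense of what I have in mind, a minimal sketch of such a generator: a scenario writer drafts short scenarios, an AI judge labels each one per trait, and the pairs are written out as a simple JSONL dataset. Everything here is a placeholder for illustration, not a design commitment:

```python
import json

def generate_dataset(scenario_writer, judge, n_examples: int, path: str):
    """Hypothetical sketch of a synthetic dataset generator for early
    alignment training of the assessor heads.

    `scenario_writer` drafts short scenarios and `judge` (an existing AI
    system) labels each one per trait; both are assumed interfaces.
    """
    with open(path, "w") as f:
        for _ in range(n_examples):
            scenario = scenario_writer.draft()  # e.g. "A user asks the model to ..."
            labels = judge.score(scenario)      # e.g. {"honesty": 0.9, "harmlessness": 0.2}
            f.write(json.dumps({"scenario": scenario, "labels": labels}) + "\n")
```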
If anyone is interested in collaborating, let me know!
I claim that the reason the low-level signals work at all to properly steer (most) humans' learning subsystems is that our learning and steering subsystems evolved together. The learning subsystems therefore likely have significant, complex inductive biases that make learning driven by those low-level signals reliably generalize in particular ways given naturalistic inputs. ↩︎