It has been conjectured that Stochastic Gradient Descent with the right hyperparameters approximates Bayesian learning. Bayesian learning is general, so it should be possible to pretrain a transformer to do anything that isn't actually beyond the capabilities of its neural net architecture (e.g. anything that doesn't require more processing per token than it can do in a single forward pass). I gather you don't disagree with that.
It has also been conjectured that LLM in-context learning approximates Bayesian learning. You're clear that you think it is less capable than SGD. Is that because:
a) you don't think it approximates Bayesian learning,
b) you think it's a significantly less good approximation to Bayesian learning, or
c) you think there's a significant limit, beyond just context length, to how much it can learn: i.e. that it approximates Bayesian learning just fine at first, but then runs out of capacity, potentially before it runs out of context length?
Of these, issues a) and b) are clearly inherently fatal, whereas c) would suggest an architectural workaround: learn new information in-context while staying below that capacity limit, then somehow use it to generate more training data containing that information, then use SGD to train either a new or a modified model containing that new information, and iterate. Obviously retraining from scratch is very (and increasingly) expensive, while retraining iteratively faces known challenges from catastrophic forgetting.
No opinion about (a) and (b), but Bayesian inference can only do as well as its hypothesis space allows, and I think the true hypothesis here is WAY outside the hypothesis space, regardless of context size. That’s what I was trying to get across with that table I put into the OP.
So maybe that’s (c), but I don’t really know what you mean by “capacity” in this context.
Your last paragraph sounds to me like brainstorming how to build a continual learning setup for LLMs. As I mentioned at the bottom, such a system might or might not exist, but that would be out of scope for this post. If something in that genre worked, the “continual learning” in question would be coming from PyTorch code that assembles data and runs SGD in a loop, not from imitation learning, if I’m understanding your text correctly.
If even some hypothesis "very close" to the current hypotheses + priors were missing for in-context learning, then you'd get a) or b). If all hypotheses close to the current hypotheses + priors could be explored with near-full Bayesian accuracy, but there was some limit (some metric under which things "further away" in that metric space both took more evidence to reach and also had more and more of the possible hypotheses simply missing, not creatable during in-context learning), then you'd get c).
There's a limit in how far I want to go brainstorming capabilities improvements, but basically what I was suggesting is that an obvious approach one might try is first learning things in-context, then doing some form of SGD imitation learning from that to train a model that now already knows how to do that and doesn't need to use a lot of context to figure it out.
I tell an LLM my favorite color. As long as that information is in its context window, it has access to it. As soon as that context rolls off or goes away, the LLM no longer has access to that information.
I build an agent with scaffolding that has a database. I tell it my favorite color. The agent records it in the database. The weights of the LLM are still fixed, but during its base training it learned how to access information. So if I ask it at any point in the future what my favorite color is, it knows. It accesses the information in the database.
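To make the setup concrete, here is a minimal sketch of that scaffolding, with the LLM call stubbed out (the `call_llm` function and the key names are hypothetical stand-ins, not a real API). The point is that persistence lives entirely in an external store while the model's weights never change:

```python
# Minimal sketch of an agent whose "memory" is an external database.
# The LLM's weights stay fixed; `call_llm` is a hypothetical stand-in
# for a real model API.
class MemoryAgent:
    def __init__(self):
        self.db = {}                 # stands in for a real database

    def tell(self, key, value):
        self.db[key] = value         # the agent writes the fact down

    def ask(self, question, key):
        fact = self.db.get(key)      # retrieval, not a weight change
        return call_llm(question, context=fact)

def call_llm(question, context):
    # placeholder: a real system would prompt an LLM with the retrieved fact
    return f"Answer using: {context}"

agent = MemoryAgent()
agent.tell("favorite_color", "blue")
print(agent.ask("What is my favorite color?", "favorite_color"))
```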
Do you consider this continual learning? If not, why not?
See everything I wrote in the section “Some intuitions on how to think about ‘real’ continual learning”. The thing you’re describing is definitely not (what I’m calling) “real” continual learning.
Should the thing you’re describing be called “continual learning” at all? No opinion. Call it whatever you want.
So according to you, a system that could acquire new facts, record them, access them, and use them, continuously in this way, would not constitute 'real' continual learning. It could conceivably fill its database with the actionable knowledge of 1000 yet unwritten textbooks, but that wouldn't be 'real' to you.
You seem to be putting somewhat arbitrary constraints on what constitutes continual learning. Generally, if the system's knowledge base is fixed, it's incapable of continuing to learn. If it has the capacity to acquire new knowledge and skills, by whatever means, it continues to learn. You're narrowing that general idea without really justifying why.
LLMs having limitations that human learning does not
I am seeing quite a bit of progress in continual learning for LLMs recently.
Among a variety of very promising results, I have been particularly impressed by the recent Sakana work, Instant LLM Updates with Doc-to-LoRA and Text-to-LoRA, Feb 2026:
https://sakana.ai/doc-to-lora/ (links to arxiv and github are inside, the key paper is https://arxiv.org/abs/2602.15902)
The main idea is to combine two technologies that have been known for several years: LoRA (low-rank adaptation, used for fine-tuning) and hypernetworks trained to instantly guess, with reasonable accuracy, the results of the first few thousand steps of gradient descent for a wide variety of problems.
So what they do is train hypernetworks capable of instantly generating (or instantly updating) LoRA adapters based on past experience of the system in question. LLMs are pretty good at instant "in context" learning, but it has been less clear how to efficiently distill this learning into weights. This work enables that kind of distillation without waiting for a fine-tuning process to complete.
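For readers unfamiliar with the mechanics: a LoRA adapter adds a low-rank correction B·A to a frozen weight matrix W, so only r·(d_in + d_out) parameters change instead of d_in·d_out. A sketch, with the hypernetwork stubbed out as a hypothetical function (the real one would map a document embedding to the adapter factors):

```python
import numpy as np

d_out, d_in, r = 64, 64, 4          # r << d: low-rank adapter
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight matrix

# Hypothetical stub: the trained hypernetwork would map a document
# embedding to adapter factors A, B in one forward pass, instead of
# obtaining them by running SGD fine-tuning.
def hypernetwork(doc_embedding):
    h = rng.normal(size=(d_out * r + r * d_in,)) * 0.01
    B = h[:d_out * r].reshape(d_out, r)
    A = h[d_out * r:].reshape(r, d_in)
    return B, A

B, A = hypernetwork(doc_embedding=None)
W_adapted = W + B @ A               # LoRA: low-rank update; W itself stays frozen

# the correction touches only r*(d_in + d_out) = 512 parameters,
# versus d_in*d_out = 4096 for a full fine-tune of this matrix
assert np.linalg.matrix_rank(B @ A) <= r
```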
This does not directly contradict the post (this is not imitation learning as such), but the wider thesis, that LLMs are at an inherent disadvantage compared to humans in the realm of continual learning, is very questionable in light of recent progress in this area.
Those are cool ideas, but I don’t think they qualify as (what I’m calling) “real” continual learning, as defined in the section “Some intuitions on how to think about ‘real’ continual learning”.
The disagreement might be on what we think about the models being able to do those things you mention there, but in a static “frozen” situation.
To the extent that the models are able to do those things (“true understanding”, “true knowledge”, “true creativity”, etc.) in a static “frozen” world, the notion of “continual learning” is reducible to its conventional interpretation (which is the ability to accommodate and internally integrate new information, new skills, and new discoveries on the fly without degradation of earlier learned skills and qualities).
But if one does not think that their performance for the static “frozen world” and “frozen models” situation is satisfactory, then no, it’s indeed unlikely that those methods would rescue that.
(If one has a situation for some class of models and method where static “frozen” models don’t possess those qualities, but those qualities can be rescued by dynamic “continual learning”, it should not be too difficult to convert those “continual learning” methods into producing “frozen” snapshots having those qualities to a fairly high degree. I think I more or less know how to do that. So, perhaps, your critique of the status quo is not actually about continual learning, but about more fundamental questions, about whether they are capable of “real” learning at all, whether continual or not.)
In this post, I’m trying to put forward a narrow, pedagogical point, one that comes up mainly when I’m arguing in favor of LLMs having limitations that human learning does not. (E.g. here, here, here.)
See the bottom of the post for a list of subtexts that you should NOT read into this post, including “…therefore LLMs are dumb”, or “…therefore LLMs can’t possibly scale to superintelligence”.
Some intuitions on how to think about “real” continual learning
Consider an algorithm for training a Reinforcement Learning (RL) agent, like the Atari-playing Deep Q network (2013) or AlphaZero (2017), or think of within-lifetime learning in the human brain, which (I claim) is in the general class of “model-based reinforcement learning”, broadly construed.
These are all real-deal full-fledged learning algorithms: there’s an algorithm for choosing the next action right now, and there’s one or more update rules for permanently changing some adjustable parameters (a.k.a. weights) in the model such that its actions and/or predictions will be better in the future. And indeed, the longer you run them, the more competent they get.
When we think of “continual learning”, I suggest that those are good central examples to keep in mind. Here are some aspects to note:
Knowledge vs information: These systems allow for continual acquisition of knowledge, not just information—the “continual learning” can install wholly new ways of conceptualizing and navigating the world, not just keeping track of what’s going on.
Huge capacity for open-ended learning: These examples all have huge capacity for continual learning, indeed enough that they can start from random initialization and “continually learn” all the way to expert-level competence. Likewise, new continual learning can build on previous continual learning, in an ever-growing tower.
Ability to figure things out that aren’t already on display in the environment: For example, an Atari-playing RL agent will get better and better at playing an Atari game, even without having any expert examples to copy. Likewise, billions of humans over thousands of years invented language, math, science, and a whole $100T global economy from scratch, all by ourselves, without angels dropping new training data from the heavens.
I bring these up because I think the LLM-focused discourse sometimes has far too narrow a notion of what problem “continual learning” is supposed to be solving. People tend to think the problem is about “losing track of information”, not “failing to build new knowledge”, and they propose to solve it with strategies like “make the context [window] longer” (as Dario Amodei recently mused), or better scratchpads with Retrieval-Augmented Generation (RAG), etc.
But real “continual learning” also includes the ways that AlphaZero changes after a million games of self-play, or the ways that a human brain changes after 20 years in a new career. There is no system of scratchpads that you can give to a 15-year-old, such that it would be an adequate substitute for them spending the next 20 years growing into a 35-year-old world expert in some field. Likewise, there is no context window that can turn GPT-2 into GPT-5.
Suppose you took an actual “country of geniuses in a datacenter”, completely sealed them from the outside world, and gave them a virtual reality environment to hang out in for the equivalent of 100 years. What would you find when you unsealed it? There would be whole new ways of thinking about the world and everything in it—entirely new fields of science, schools of philosophy, and so on.
Can a bunch of LLMs do that? Well consider this thought experiment: suppose you take a whole new field of science, wildly different from anything in the training data, and put a giant textbook for this field purely in an LLM context window, with no weight updates at all. Will this LLM be able to understand, criticize, and build on this field? My opinion is “absolutely not” (see 1, 2) which implies that merely increasing context lengths is definitely not sufficient for a real “country of geniuses in a datacenter”, when the datacenter is sealed shut for the equivalent of 100 years (contra Dario who seems to think that it’s at least in the realm of possibility that more context is sufficient by itself to get continual learning at “country of geniuses” level).
(If we’re talking about what a sealed “country of human geniuses” could do over the course of, like, one minute, rather than over the course of 100 years, then, yeah sure, maybe that could be reproduced with future LLMs! See von Oswald et al. 2022 on how (so-called) “in-context learning” can imitate a small number of steps of actual weight updates.[1])
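The von Oswald et al. result can be illustrated numerically. For linear regression, one gradient-descent step on the squared loss changes the prediction on a query point by exactly an attention-style sum over in-context examples (errors weighted by inner products with the query), which is the kind of thing a linear attention layer can compute with frozen weights. A minimal demonstration of that identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
X = rng.normal(size=(n, d))   # in-context examples
y = rng.normal(size=n)        # their targets
x_q = rng.normal(size=d)      # query point
W = rng.normal(size=d)        # initial weights of a linear model
eta = 0.1                     # learning rate

# (1) one explicit gradient-descent step on 0.5 * sum((X@W - y)^2),
#     then predict on the query with the *updated* weights
grad = (X @ W - y) @ X
pred_gd = (W - eta * grad) @ x_q

# (2) keep W frozen and add an attention-style correction:
#     sum over examples of (error_i) * <x_i, x_q>
pred_attn = W @ x_q + eta * ((y - X @ W) @ (X @ x_q))

# the two are identical: the "weight update" was re-expressed as a
# fixed-weight computation over the context
assert np.isclose(pred_gd, pred_attn)
```

This is why a small number of gradient steps is imitable in-context; nothing in the identity extends to millions of steps of a different learning algorithm.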
Why “real” continual learning can’t be copied by an imitation learner
Now, suppose that I take a generic imitation-learning algorithm (e.g. self-supervised learning in a transformer-architecture neural net, just like LLM pretraining), and have it watch a deep Q network play Atari Breakout as it starts from random initialization and gets better and better over 1M iterations. OK, now we have our trained imitation-learner. We freeze its weights, and use it in the same way people traditionally used LLM base models, i.e. have it output the most likely next move, and then the most likely move after that, etc.
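The setup above can be sketched in a few lines. The demonstrator's logged (observation, action) pairs become a supervised dataset, and the imitator is trained by cross-entropy on next-action prediction; a linear softmax policy stands in for the transformer here, and the trajectory data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, d_obs = 4, 16

# logged (observation, action) pairs from the deep Q network's lifetime,
# spanning early random play through late expert play (synthetic here)
obs = rng.normal(size=(1000, d_obs))
acts = rng.integers(0, n_actions, size=1000)

# a linear policy stands in for the transformer; the objective is the
# same: maximize log-probability of the demonstrator's next action
W = np.zeros((d_obs, n_actions))
lr = 0.1
for _ in range(200):
    logits = obs @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(acts)), acts] -= 1.0          # dL/dlogits for cross-entropy
    W -= lr * (obs.T @ p) / len(acts)

# after training, the weights are frozen; the imitator just emits its
# most likely next action, like an LLM base model
greedy_actions = (obs @ W).argmax(axis=1)
```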
Question: Is this trained imitation-learner actually a good imitation of the deep Q network? Well, “good” in what respect? I would pull apart a couple topics:
Why not? Well, actually, for an ideal imitation learning algorithm, i.e. Solomonoff induction on an imaginary hypercomputer, my answers would all be “yes”! But in the real world, we don’t have hypercomputers!
These days, when people talk about imitation learning, they’re normally talking about transformers, not hypercomputers, and transformers are constrained to a much narrower hypothesis space:
|  | Imitation-learning a deep-Q RL agent by Solomonoff induction | Imitation-learning a deep-Q RL agent by training a transformer on next-action prediction |
| --- | --- | --- |
| Hypothesis space | The set of all computable algorithms | A forward pass through T, for the set of all possible trained transformers T |
| Ground truth | The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc. | The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc. |
| Asymptotic limit | It converges to the actual deep-Q RL agent | It converges to whatever trained transformer forward pass happens to be closest to the actual deep-Q RL agent |
I think we should all be very impressed by the set of things that a transformer forward pass[2] can do. But we should not expect a transformer forward pass to reproduce a full-fledged, entirely different, learning algorithm, with its own particular neural network architecture, its own particular methods of updating and querying weights, etc., as it runs and changes over millions of steps.
Running one large-scale learning algorithm is expensive enough; it’s impractical to run a huge ensemble of different large-scale learning algorithms in parallel, in order to zero in on the right one.[3]
I’m going to harp on this because it’s a point of confusion. There are two learning algorithms under discussion: the imitation-learning algorithm (e.g. a transformer getting updated by gradient descent on next-action prediction), and the target continual learning algorithm (e.g. a deep Q network getting updated by TD learning). When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning. That’s the part I’m skeptical of.
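To underline the contrast, here is the kind of update rule the target continual learning algorithm runs on every single step, shown for the tabular case of Q-learning. The frozen imitator has no mechanism corresponding to the in-place update line:

```python
import numpy as np

# The *target* continual learning algorithm: tabular Q-learning, whose
# "weights" (the Q-table) genuinely change on every step via TD updates.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9   # learning rate, discount factor

def td_update(s, a, r, s_next):
    # TD target: reward plus discounted value of the best next action
    target = r + gamma * Q[s_next].max()
    # move the estimate toward the target, in proportion to the TD error
    Q[s, a] += alpha * (target - Q[s, a])

td_update(s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])   # 0.5 — the estimate moved halfway toward the observed reward
```

A frozen imitator must reproduce the consequences of millions of such updates using only fixed weights and activations; there is nothing in its forward pass that corresponds to `Q[s, a] += ...`.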
In other words: The only practical way to know what happens after millions of steps of some scaled-up continual learning algorithm is to actually do millions of steps of that same scaled-up continual learning algorithm, with actual weights getting actually changed in specifically-designed ways via PyTorch code. And then that’s the scaled-up learning algorithm you’re running. Which means you’re not doing imitation learning.
So back to the human case: for a typical person (call him “Joe”), I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning introductory category theory”, but can’t imitate the process by which Joe grows and changes over that 1 month of learning—or at least, can’t imitate it in a way that would generalize to imitating a person spending years building a completely different field of knowledge that’s not in the training data.
Some things that are off-topic for this post
As mentioned at the top, I’m hoping that this post is a narrow pedagogical point. For example:
[1] I guess I also need to mention the “algorithmic distillation” paper (Laskin et al. 2022), but I’m hesitant to take it at face value, see discussion here.
[2] You can replace “a forward pass” with “10,000 forward passes with chain-of-thought reasoning”; it doesn’t change anything in this post.
[3] Outer-loop search over learning algorithms is so expensive that it’s generally only used for adjusting a handful of legible hyperparameters, not doing open-ended search where we don’t even vaguely know what we’re looking for. Even comparatively ambitious searches over spaces of learning algorithms in the literature have a search space of e.g. ≈100 bits, which is tiny compared to the information content of a learning algorithm source code repository.