It has been conjectured that Stochastic Gradient Descent with the right hyperparameters approximates Bayesian learning. Bayesian learning is general, so it should be possible to pretrain a transformer to do anything that isn't actually beyond the capabilities of its neural net architecture (e.g. anything that doesn't require more processing per token than it can do in a single forward pass). I gather you don't disagree with that.
It has also been conjectured that LLM in-context learning approximates Bayesian learning. You're clear that you think it is less capable than SGD. Is that because:
a) you don't think it approximates Bayesian learning,
b) you think it's a significantly less good approximation to Bayesian learning, or
c) you think there's a significant limit, beyond just context length, to how much it can learn: i.e. that it approximates Bayesian learning just fine at first, but then runs out of capacity, potentially before it runs out of context length?
Of these, issues a) and b) are clearly inherently fatal, whereas c) would suggest an architectural workaround: learn new information in-context while staying below that capacity limit, then somehow use it to generate more training data containing that information, then use SGD to train either a new or a modified model containing that new information, and iterate. Obviously retraining from scratch is very (and increasingly) expensive, while retraining iteratively faces known challenges from catastrophic forgetting.
No opinion about (a) and (b), but Bayesian inference can only do as well as its hypothesis space allows, and I think the true hypothesis here is WAY outside the hypothesis space, regardless of context size. That’s what I was trying to get across with that table I put into the OP.
So maybe that’s (c), but I don’t really know what you mean by “capacity” in this context.
Your last paragraph sounds to me like brainstorming how to build a continual learning setup for LLMs. As I mentioned at the bottom, such a system might or might not exist, but that would be out of scope for this post. If something in that genre worked, the “continual learning” in question would be coming from PyTorch code that assembles data and runs SGD in a loop, not from imitation learning, if I’m understanding your text correctly.
If even some hypothesis "very close" to the current hypotheses + priors were missing for in-context learning, then you'd get a) or b). If all hypotheses close to the current hypotheses + priors could be explored with near-full Bayesian accuracy, but there was some limit (some metric under which things "further away" in that metric space both took more evidence to reach and also had more and more of the possible hypotheses simply missing, not creatable during in-context learning), then you'd get c).
There's a limit in how far I want to go brainstorming capabilities improvements, but basically what I was suggesting is that an obvious approach one might try is first learning things in-context, then doing some form of SGD imitation learning from that to train a model that now already knows how to do that and doesn't need to use a lot of context to figure it out.
I tell an LLM my favorite color. As long as that information is in its context window, it has access to it. As soon as that context rolls off or goes away, the LLM no longer has access to that information.
I build an agent with scaffolding that has a database. I tell it my favorite color. The agent records it in the database. The weights of the LLM are still fixed, but during its base training it learned how to access information. So if I ask it at any point in the future what my favorite color is, it knows. It accesses the information in the database.
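To make the setup concrete, here is a minimal sketch of that scaffolding, with the LLM call stubbed out (the `call_llm` function and the key names are hypothetical stand-ins, not a real API). The point is that persistence lives entirely in an external store while the model's weights never change:

```python
# Minimal sketch of an agent whose "memory" is an external database.
# The LLM's weights stay fixed; `call_llm` is a hypothetical stand-in
# for a real model API.
class MemoryAgent:
    def __init__(self):
        self.db = {}                 # stands in for a real database

    def tell(self, key, value):
        self.db[key] = value         # the agent writes the fact down

    def ask(self, question, key):
        fact = self.db.get(key)      # retrieval, not a weight change
        return call_llm(question, context=fact)

def call_llm(question, context):
    # placeholder: a real system would prompt an LLM with the retrieved fact
    return f"Answer using: {context}"

agent = MemoryAgent()
agent.tell("favorite_color", "blue")
print(agent.ask("What is my favorite color?", "favorite_color"))
```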
Do you consider this continual learning? If not, why not?
See everything I wrote in the section “Some intuitions on how to think about ‘real’ continual learning”. The thing you’re describing is definitely not (what I’m calling) “real” continual learning.
Should the thing you’re describing be called “continual learning” at all? No opinion. Call it whatever you want.
So according to you, a system that could acquire new facts, record them, access them, and use them, continuously in this way, would not constitute 'real' continual learning. It could conceivably fill its database with the actionable knowledge of 1000 yet unwritten textbooks, but that wouldn't be 'real' to you.
You seem to be putting somewhat arbitrary constraints on what constitutes continual learning. Generally, if the system's knowledge base is fixed, it's incapable of continuing to learn. If it has the capacity to acquire new knowledge and skills, by whatever means, it continues to learn. You're narrowing that general idea without really justifying why.
LLMs having limitations that human learning does not
I am seeing quite a bit of progress in continual learning for LLMs recently.
Among a variety of very promising results, I have been particularly impressed by the recent Sakana work, Instant LLM Updates with Doc-to-LoRA and Text-to-LoRA, Feb 2026:
https://sakana.ai/doc-to-lora/ (links to arxiv and github are inside, the key paper is https://arxiv.org/abs/2602.15902)
The main idea is to combine two technologies that have been known for several years: LoRA (low-rank adaptation, used for fine-tuning) and hypernetworks trained to instantly guess, with reasonable accuracy, the results of the first few thousand steps of gradient descent for a wide variety of problems.
So what they do is train hypernetworks capable of instantly generating (or instantly updating) LoRA adapters based on past experience of the system in question. LLMs are pretty good at instant "in context" learning, but it has been less clear how to efficiently distill this learning into weights. This work enables that kind of distillation without waiting for a fine-tuning process to complete.
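For readers unfamiliar with the mechanics: a LoRA adapter adds a low-rank correction B·A to a frozen weight matrix W, so only r·(d_in + d_out) parameters change instead of d_in·d_out. A sketch, with the hypernetwork stubbed out as a hypothetical function (the real one would map a document embedding to the adapter factors):

```python
import numpy as np

d_out, d_in, r = 64, 64, 4          # r << d: low-rank adapter
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight matrix

# Hypothetical stub: the trained hypernetwork would map a document
# embedding to adapter factors A, B in one forward pass, instead of
# obtaining them by running SGD fine-tuning.
def hypernetwork(doc_embedding):
    h = rng.normal(size=(d_out * r + r * d_in,)) * 0.01
    B = h[:d_out * r].reshape(d_out, r)
    A = h[d_out * r:].reshape(r, d_in)
    return B, A

B, A = hypernetwork(doc_embedding=None)
W_adapted = W + B @ A               # LoRA: low-rank update; W itself stays frozen

# the correction touches only r*(d_in + d_out) = 512 parameters,
# versus d_in*d_out = 4096 for a full fine-tune of this matrix
assert np.linalg.matrix_rank(B @ A) <= r
```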
This does not directly contradict the post (this is not imitation learning as such), but the wider thesis, that LLMs are at an inherent disadvantage compared to humans in the realm of continual learning, is very questionable in light of recent progress in this area.
Those are cool ideas, but I don’t think they qualify as (what I’m calling) “real” continual learning, as defined in the section “Some intuitions on how to think about ‘real’ continual learning”.
The disagreement might be on what we think about the models being able to do those things you mention there, but in a static “frozen” situation.
To the extent that the models are able to do those things (“true understanding”, “true knowledge”, “true creativity”, etc.) in a static “frozen” world, the notion of “continual learning” is reducible to its conventional interpretation (which is the ability to accommodate and internally integrate new information, new skills, and new discoveries on the fly without degradation of earlier learned skills and qualities).
But if one does not think that their performance for the static “frozen world” and “frozen models” situation is satisfactory, then no, it’s indeed unlikely that those methods would rescue that.
(If one has a situation for some class of models and method where static “frozen” models don’t possess those qualities, but those qualities can be rescued by dynamic “continual learning”, it should not be too difficult to convert those “continual learning” methods into producing “frozen” snapshots having those qualities to a fairly high degree. I think I more or less know how to do that. So, perhaps, your critique of the status quo is not actually about continual learning, but about more fundamental questions, about whether they are capable of “real” learning at all, whether continual or not.)
In this post, I’m trying to put forward a narrow, pedagogical point, one that comes up mainly when I’m arguing in favor of LLMs having limitations that human learning does not. (E.g. here, here, here.)
See the bottom of the post for a list of subtexts that you should NOT read into this post, including “…therefore LLMs are dumb”, or “…therefore LLMs can’t possibly scale to superintelligence”.
Some intuitions on how to think about “real” continual learning
Consider an algorithm for training a Reinforcement Learning (RL) agent, like the Atari-playing Deep Q network (2013) or AlphaZero (2017), or think of within-lifetime learning in the human brain, which (I claim) is in the general class of “model-based reinforcement learning”, broadly construed.
These are all real-deal full-fledged learning algorithms: there’s an algorithm for choosing the next action right now, and there’s one or more update rules for permanently changing some adjustable parameters (a.k.a. weights) in the model such that its actions and/or predictions will be better in the future. And indeed, the longer you run them, the more competent they get.
When we think of “continual learning”, I suggest that those are good central examples to keep in mind. Here are some aspects to note:
Knowledge vs information: These systems allow for continual acquisition of knowledge, not just information—the “continual learning” can install wholly new ways of conceptualizing and navigating the world, not just keeping track of what’s going on.
Huge capacity for open-ended learning: These examples all have huge capacity for continual learning, indeed enough that they can start from random initialization and “continually learn” all the way to expert-level competence. Likewise, new continual learning can build on previous continual learning, in an ever-growing tower.
Ability to figure things out that aren’t already on display in the environment: For example, an Atari-playing RL agent will get better and better at playing an Atari game, even without having any expert examples to copy. Likewise, billions of humans over thousands of years invented language, math, science, and a whole $100T global economy from scratch, all by ourselves, without angels dropping new training data from the heavens.
I bring these up because I think the LLM-focused discourse sometimes has far too narrow a notion of what problem “continual learning” is supposed to be solving. People tend to think the problem is about “losing track of information”, not “failing to build new knowledge”, and they propose to solve it with strategies like “make the context [window] longer” (as Dario Amodei recently mused), or better scratchpads with Retrieval-Augmented Generation (RAG), etc.
But real “continual learning” also includes the ways that AlphaZero changes after a million games of self-play, or the ways that a human brain changes after 20 years in a new career. There is no system of scratchpads that you can give to a 15-year-old, such that it would be an adequate substitute for them spending the next 20 years growing into a 35-year-old world expert in some field. Likewise, there is no context window that can turn GPT-2 into GPT-5.
Suppose you took an actual “country of geniuses in a datacenter”, completely sealed them from the outside world, and gave them a virtual reality environment to hang out in for the equivalent of 100 years. What would you find when you unsealed it? There would be whole new ways of thinking about the world and everything in it—entirely new fields of science, schools of philosophy, and so on.
Can a bunch of LLMs do that? Well consider this thought experiment: suppose you take a whole new field of science, wildly different from anything in the training data, and put a giant textbook for this field purely in an LLM context window, with no weight updates at all. Will this LLM be able to understand, criticize, and build on this field? My opinion is “absolutely not” (see 1, 2) which implies that merely increasing context lengths is definitely not sufficient for a real “country of geniuses in a datacenter”, when the datacenter is sealed shut for the equivalent of 100 years (contra Dario who seems to think that it’s at least in the realm of possibility that more context is sufficient by itself to get continual learning at “country of geniuses” level).
(If we’re talking about what a sealed “country of human geniuses” could do over the course of, like, one minute, rather than over the course of 100 years, then, yeah sure, maybe that could be reproduced with future LLMs! See von Oswald et al. 2022 on how (so-called) “in-context learning” can imitate a small number of steps of actual weight updates.[1])
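The von Oswald et al. result can be illustrated numerically. For linear regression, one gradient-descent step on the squared loss changes the prediction on a query point by exactly an attention-style sum over in-context examples (errors weighted by inner products with the query), which is the kind of thing a linear attention layer can compute with frozen weights. A minimal demonstration of that identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
X = rng.normal(size=(n, d))   # in-context examples
y = rng.normal(size=n)        # their targets
x_q = rng.normal(size=d)      # query point
W = rng.normal(size=d)        # initial weights of a linear model
eta = 0.1                     # learning rate

# (1) one explicit gradient-descent step on 0.5 * sum((X@W - y)^2),
#     then predict on the query with the *updated* weights
grad = (X @ W - y) @ X
pred_gd = (W - eta * grad) @ x_q

# (2) keep W frozen and add an attention-style correction:
#     sum over examples of (error_i) * <x_i, x_q>
pred_attn = W @ x_q + eta * ((y - X @ W) @ (X @ x_q))

# the two are identical: the "weight update" was re-expressed as a
# fixed-weight computation over the context
assert np.isclose(pred_gd, pred_attn)
```

This is why a small number of gradient steps is imitable in-context; nothing in the identity extends to millions of steps of a different learning algorithm.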
Why “real” continual learning can’t be copied by an imitation learner
Now, suppose that I take a generic imitation-learning algorithm (e.g. self-supervised learning in a transformer-architecture neural net, just like LLM pretraining), and have it watch a deep Q network play Atari Breakout as it starts from random initialization and gets better and better over 1M iterations. OK, now we have our trained imitation-learner. We freeze its weights, and use it in the same way people traditionally used LLM base models, i.e. have it output the most likely next move, and then the most likely move after that, etc.
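The setup above can be sketched in a few lines. The demonstrator's logged (observation, action) pairs become a supervised dataset, and the imitator is trained by cross-entropy on next-action prediction; a linear softmax policy stands in for the transformer here, and the trajectory data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, d_obs = 4, 16

# logged (observation, action) pairs from the deep Q network's lifetime,
# spanning early random play through late expert play (synthetic here)
obs = rng.normal(size=(1000, d_obs))
acts = rng.integers(0, n_actions, size=1000)

# a linear policy stands in for the transformer; the objective is the
# same: maximize log-probability of the demonstrator's next action
W = np.zeros((d_obs, n_actions))
lr = 0.1
for _ in range(200):
    logits = obs @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(acts)), acts] -= 1.0          # dL/dlogits for cross-entropy
    W -= lr * (obs.T @ p) / len(acts)

# after training, the weights are frozen; the imitator just emits its
# most likely next action, like an LLM base model
greedy_actions = (obs @ W).argmax(axis=1)
```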
Question: Is this trained imitation-learner actually a good imitation of the deep Q network? Well, “good” in what respect? I would pull apart a couple topics:
Why not? Well, actually, for an ideal imitation learning algorithm, i.e. Solomonoff induction on an imaginary hypercomputer, my answers would all be “yes”! But in the real world, we don’t have hypercomputers!
These days, when people talk about imitation learning, they’re normally talking about transformers, not hypercomputers, and transformers are constrained to a much narrower hypothesis space:
|  | Imitation-learning a deep-Q RL agent by Solomonoff induction | Imitation-learning a deep-Q RL agent by training a transformer on next-action prediction |
| --- | --- | --- |
| Hypothesis space | The set of all computable algorithms | A forward pass through T, for the set of all possible trained transformers T |
| Ground truth | The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc. | The actual deep-Q RL agent, with such-and-such architecture, and Temporal Difference (TD) learning weight updates, etc. |
| Asymptotic limit | It converges to the actual deep-Q RL agent | It converges to whatever trained transformer forward pass happens to be closest to the actual deep-Q RL agent |
I think we should all be very impressed by the set of things that a transformer forward pass[2] can do. But we should not expect a transformer forward pass to reproduce a full-fledged, entirely different, learning algorithm, with its own particular neural network architecture, its own particular methods of updating and querying weights, etc., as it runs and changes over millions of steps.
Running one large-scale learning algorithm is expensive enough; it’s impractical to run a huge ensemble of different large-scale learning algorithms in parallel, in order to zero in on the right one.[3]
I’m going to harp on this because it’s a point of confusion. There are two learning algorithms under discussion: the imitation-learning algorithm (e.g. a transformer getting updated by gradient descent on next-action prediction), and the target continual learning algorithm (e.g. a deep Q network getting updated by TD learning). When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning. That’s the part I’m skeptical of.
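To underline the contrast, here is the kind of update rule the target continual learning algorithm runs on every single step, shown for the tabular case of Q-learning. The frozen imitator has no mechanism corresponding to the in-place update line:

```python
import numpy as np

# The *target* continual learning algorithm: tabular Q-learning, whose
# "weights" (the Q-table) genuinely change on every step via TD updates.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9   # learning rate, discount factor

def td_update(s, a, r, s_next):
    # TD target: reward plus discounted value of the best next action
    target = r + gamma * Q[s_next].max()
    # move the estimate toward the target, in proportion to the TD error
    Q[s, a] += alpha * (target - Q[s, a])

td_update(s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])   # 0.5 — the estimate moved halfway toward the observed reward
```

A frozen imitator must reproduce the consequences of millions of such updates using only fixed weights and activations; there is nothing in its forward pass that corresponds to `Q[s, a] += ...`.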
In other words: The only practical way to know what happens after millions of steps of some scaled-up continual learning algorithm is to actually do millions of steps of that same scaled-up continual learning algorithm, with actual weights getting actually changed in specifically-designed ways via PyTorch code. And then that’s the scaled-up learning algorithm you’re running. Which means you’re not doing imitation learning.
So back to the human case: for a typical person (call him “Joe”), I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning introductory category theory”, but can’t imitate the process by which Joe grows and changes over that 1 month of learning—or at least, can’t imitate it in a way that would generalize to imitating a person spending years building a completely different field of knowledge that’s not in the training data.
Some things that are off-topic for this post
As mentioned at the top, I’m hoping that this post is a narrow pedagogical point. For example:
[1] I guess I also need to mention the “algorithmic distillation” paper (Laskin et al. 2022), but I’m hesitant to take it at face value, see discussion here.
[2] You can replace “a forward pass” with “10,000 forward passes with chain-of-thought reasoning”; it doesn’t change anything in this post.
[3] Outer-loop search over learning algorithms is so expensive that it’s generally only used for adjusting a handful of legible hyperparameters, not doing open-ended search where we don’t even vaguely know what we’re looking for. Even comparatively ambitious searches over spaces of learning algorithms in the literature have a search space of e.g. ≈100 bits, which is tiny compared to the information content of a learning algorithm source code repository.