(Original title: The Continual Learning Overhang, but too many Overhang posts)

TL;DR Continual learning could lead to emergent causal understanding or agency and we should probably study it now before it’s too late.

Current LLMs are very impressive but have a few key missing pieces (as Sarah Constantin argues):

  • They inherently lack agency. They can be wrapped in a loop of control code that makes them more agentic, but it feels like this won’t really take off.
  • They are missing causal modeling to some extent. It’s just not how they are trained. They may pick up a sense of the causal structure of fiction, but it’s not quite the same. They are never trained by interacting with anything and have no inherent sense of “making an intervention”.

Sarah argues that the current approaches will be very unlikely to spontaneously develop these characteristics, and that it would require a ground level rethinking of how AI is done. I am not so convinced we’ve seen the full potential of the path we are on.


I think that we have yet to explore (publicly?) the potential of “switching on” backpropagation/training while in inference mode. Most models have a clean separation of “train” and “inference”: during inference the model generates token by token as it goes along, but it is no longer learning.
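As a very rough illustration of what I mean, here is a minimal sketch, assuming a Hugging Face-style causal LM; the model, tokenizer, and learning rate are placeholders, and training on the model's own prompt + generation is just one possible choice of what to learn from:

```python
# Sketch: "switch on" learning at inference time by taking a small gradient
# step on whatever the model just saw and generated. Not a claim about how
# any existing lab does this; hyperparameters are illustrative only.
import torch

def generate_and_learn(model, tokenizer, prompt, lr=1e-6, max_new_tokens=64):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    # 1. Ordinary inference: produce a continuation token by token.
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # 2. Continual-learning step: treat prompt + continuation as a training
    #    example and update the weights before the next call.
    model.train()
    loss = model(input_ids=output_ids, labels=output_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```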

 

Why I am skeptical of agency in current models / AutoGPT

 

AutoGPT is very interesting, but seems to be plagued by trouble actually completing tasks. It may improve with better code & prompts, but I still suspect it will miss the mark.

The context window is a very strong handicap in practice. To be a proper agent, one must have a long term goal, and a good explanation of the current state of the world. Without updating the weights of the model, this must fit inside the context window. Why this is suboptimal:

  • The context window itself can in theory hold several thousand tokens (32K in the largest GPT-4 variant), which is not bad, but it is also in word form and may not be efficiently encodable or decodable. 
  • Tracking state between calls to GPT-4 is not something the model was trained to do, so it (a) relies on human trial and error, and (b) on the model side, the inputs the model expects are always natural language, so no specially efficient encoding can emerge (other than that weird emoji soup). 
  • In any case, it seems very unlikely that even a human could be an effective agent at anything if they could only remember things in 5-minute chunks and then forgot everything after that.

Relationship of Continual Learning to Context Window size

The context window is currently the only way a model can preserve knowledge between calls. With continual learning, at its simplest, knowledge can instead be transferred via the weights of the model.

Of course, in practice, does updating the weights really allow for efficient data transfer? It remains to be seen. However, we have some intuition why it might:

  • The biggest LLMs are very good at memorization, though how many times each data point was seen, and what the learning rate was at that point, are unknowns at this point.
  • There were papers that came out recently arguing that in-context learning bears a striking similarity to backpropagation. A great LW post by Blaine analyzes them and sheds some light on why the analogy is not perfect. Still, it makes one wonder: if in-context learning behaves similarly to backpropagation, is the reverse true? Does backpropagation behave similarly to in-context learning? (A rough sketch of the dual-form argument is given just below.)
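For the curious, the rough shape of the “dual form” argument in those papers, as I understand it (a sketch with the softmax dropped, not a faithful account of real attention), is:

$$
W_V\,[X';X]\,\big(W_K\,[X';X]\big)^{\top} q
\;=\;
\underbrace{W_V X \,(W_K X)^{\top} q}_{\text{“zero-shot” term}}
\;+\;
\underbrace{\Big(\textstyle\sum_i (W_V x'_i)(W_K x'_i)^{\top}\Big)\, q}_{\Delta W_{\text{ICL}}\; q}
$$

where the $x'_i$ are the in-context demonstration tokens. The $\Delta W_{\text{ICL}}$ term is a sum of outer products applied to the query, which has the same algebraic shape as a gradient update $\Delta W = \sum_i e_i\, x_i^{\top}$ to a linear layer; that is where the “in-context learning looks like implicit gradient descent” analogy comes from.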

 

Why not just increase the context window size?


The main reason is that we need to know how much capability is being left on the table by our current setup.

Longer context windows are another clear direction to look at, but there is already a ton of research going on there (Hyena, survey of methods). My intuitions for why a continual learning approach would be more scalable:

  • Limiting the quadratic nature of attention seems to hurt performance in practice.
  • Looking at short-term input quadratically is OK, as long as that new input can then be combined with long-term memory stored elsewhere (e.g. in the weights).


That said, fast attention would probably be invaluable in continuous time-domain cases, because even a “short time window” in a robotics setting could consist of nearly a thousand frames (30 fps × 30 seconds = 900 frames), times however many tokens it takes to encode a frame.
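Back-of-envelope, with a purely assumed tokens-per-frame figure (real visual tokenizers vary a lot):

```python
# Rough cost of full quadratic attention over a 30-second robotics window.
# tokens_per_frame is an assumption for illustration, not a measured number.
fps = 30
seconds = 30
tokens_per_frame = 256                 # assumed

frames = fps * seconds                 # 900 frames
tokens = frames * tokens_per_frame     # 230,400 tokens in the window
attention_pairs = tokens ** 2          # ~5.3e10 query-key pairs per head per layer

print(frames, tokens, f"{attention_pairs:.1e}")
```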

 

Why it might not work


It may be the case that this is a very hard problem. In the ML models I have trained (up to 500M parameters), typically the LR is annealed and training converges. The examples seen towards the end are usually weighted much less. (Typically the examples seen at the end are repeats, assuming you do more than 1 epoch.) Trying to continually train the model (as opposed to fine-tune on a separate task) is a bit finicky. I would wonder how high to set the LR, if there would need to be experience replay, etc. Also, catastrophic forgetting seems to be a real issue.
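For concreteness, this is roughly the kind of loop I have in mind when I say “continually train with experience replay”; it is only a sketch, and the learning rate, replay ratio, and buffer size are exactly the knobs I would not know how to set:

```python
# Sketch of a continual-training step with a simple experience-replay buffer,
# intended to reduce (not solve) catastrophic forgetting. All hyperparameters
# are illustrative guesses.
import random
import torch

def continual_step(model, optimizer, new_examples, replay_buffer,
                   replay_ratio=0.5, buffer_size=10_000):
    # Mix fresh examples with a random sample of older ones.
    n_replay = int(len(new_examples) * replay_ratio)
    replayed = random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    batch = new_examples + replayed

    loss = sum(model(input_ids=x, labels=x).loss for x in batch) / len(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Store the new examples, evicting old ones at random once the buffer is full.
    for x in new_examples:
        if len(replay_buffer) >= buffer_size:
            replay_buffer.pop(random.randrange(len(replay_buffer)))
        replay_buffer.append(x)
    return loss.item()
```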


These issues plague smaller models to a large extent, but we do not know whether larger models might be more robust to them. Large models seem to be capable of learning abilities smaller ones are not (example: learning from natural language feedback).

 

Why should we do it?

 

Without continual learning, the model never learns to make an intervention and see what happens. The model simulates but can never internally develop & store a sense of self. Maybe it will be harder than just “turning on backprop”, but what if it isn’t much more complex than that? Do we want to know? I think we want to know.

We are still at the point where GPT-4 seems to have enough blind spots that we can still “switch it off”. If this line of intervention can strengthen the capabilities of the model, we should find out now rather than later on, when we have an even more powerful model. Suppose we have GPT-7 or something: capable of superhuman performance on a variety of tasks, but still reasonably docile and not very capable of long-term planning. I would not want to switch that one “on”.

Currently OpenAI is the only company that has such a powerful model whose output demonstrates non-shallow understanding of the world. Ideally, they should experiment along these lines in collaboration with ARC and publish their findings. I would also be curious if anyone can get LLaMa to work doing this, and even a negative result would be an interesting finding.

 

Or, Can This Lead to a Concrete Policy Proposal?

 

As Sarah points out, doing interventions in the real world (with robotics) would be extremely expensive, so we can probably stick to chatbots for now. One question that comes up, though, is: does this model need millions of interactions/interventions to really learn how to do this, or is it more sample efficient? 


We do have an example where we can learn from millions of chat interactions. All ChatGPT conversation histories! If continual learning turns out to be quite powerful, would we eventually want to discourage companies from training on the outputs of their own raw chat logs for models larger than GPT-4?

One could even argue that we don’t want this already, due to privacy concerns and potential data leakage between chat sessions. (Despite me calling for research into this, doing it on massive conversation data seems like a bad idea.)


 

Comments (14)

Continual learning is a blessing of scale: https://www.reddit.com/r/mlscaling/search?q=continual+learning&restrict_sr=on&include_over_18=on

(I don't mind saying this because it is obvious to anyone following the literature who has watched prior blessings of scale happen, and in particular, how each subfield copes with the realization that their problem was never a real one and all their clever ideas only mattered at scales which are quickly becoming OOMs irrelevant; and the continual-learning people already are going through the stages of grief, so a throwaway LW comment from me makes no difference.)

If you are trying to model DL capabilities, you should just assume continual-learning is already solved for all intents and purposes at GPT-4 scale (and note, for example, OA's revealed preferences in terms of training models from scratch vs further training old checkpoints) until you see an extremely compelling empirical demonstration to the contrary. We don't see it much overtly, simply because fullblown 'finetuning' is often not easy, and is much more expensive, and can be replaced to a considerable degree by tricks like retrieval or better prompts when your underlying model is really smart.

Fascinating, thanks for the research. Your analysis makes sense and seems to indicate that for most situations, prompt engineering is always the first plan of attack and often works well enough. Then, a step up from there, OpenAI/etc. would most likely experiment with fine-tuning or RLHF as it relates to a specific business need. To train a better chatbot and fill in any gaps, they would probably get more bang for their buck from simply fine-tuning on a large dataset that matched their needs. For example, if they wanted to do better mathematical reasoning, they'd probably pay people to generate detailed scratchwork and fine-tune on a whole dataset in batch, rather than set up an elaborate "tutor" framework. Continual learning itself would be mainly applicable for research into whether the thing spontaneously develops a sense of self, or for seeing if it helps with the specific case of long-term planning and agency. These are things the general public is fascinated with, but they perhaps don't seem to be the most promising direction for improving a company's bottom line yet.

This was the very first thing I thought of when language models came to my attention as "hey this looks like it actually might be the thing that the future looks like" (years ago). Since I'm not particularly smart or particularly well-informed, I conclude that I was not the first person to come up with this idea (or the tenth, or even the ten-thousandth). I strongly suspect that the simplest possible approach of "just turn on backprop" was tried within the first couple of days of the weights of a GPT model being available. For context, nostalgebraist-autoresponder has been live on Tumblr since late 2019.

I do concur with you that this is an important thing to explore. However, I am quite confident that "do the thing that is obvious to someone with no background encountering the field for the first time" is not an effective approach.

When I briefly looked into the academic research on this, I picked up the following general impression:

  1. This is a task a lot of people have poured a lot of time into. Search terms are "online learning", "incremental learning", "continual learning", and "active learning".
  2. The primary problem with this approach is that as the model learns new stuff, it forgets old stuff that it is no longer being trained on. The search term here is "catastrophic forgetting". There are also several less-critical problems which would still be blockers if catastrophic forgetting wasn't an issue, mostly related to the model going off the rails more and more over time - search terms include "bias amplification", "overfitting", and "hallucination". Some argue that this is also a problem in humans.
  3. There have been some clever attempts to get around this. One example of a particularly clever idea, from French and Chater (2002), is "let's use a clever metric of how important the old stuff is to the network to try to get it to forget less stuff". I notice that this clever technique is not in use despite there being a Deepmind publication about it in 2017 (the penalty it uses is written out after this list). Search term: "elastic weight consolidation", and, in terms of the particular failure modes, I believe "task-agnostic/task-free continual learning" and "scalability".
  4. Lots of people have had the idea "well humans sleep and dream, maybe something like that?". Search terms: "knowledge distillation", "experience replay".
  5. Also lots of people have had the idea "well what if we hack on an external memory store in a completely unprincipled way". And this seems to be how AutoGPT works. Also people have tried to do it in a principled way, search term "memory-augmented neural networks".
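For reference, the elastic weight consolidation idea in (3) amounts, roughly, to adding a quadratic penalty that anchors the parameters that mattered for the old data:

$$
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{new}}(\theta) \;+\; \sum_i \frac{\lambda}{2}\, F_i\,\big(\theta_i - \theta^{*}_{\text{old},\,i}\big)^2
$$

where $\theta^{*}_{\text{old}}$ are the weights after training on the old data, $F_i$ is the diagonal Fisher information (the "clever metric" of importance), and $\lambda$ trades off retention against plasticity.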

Basically, the impression I've picked up is

  1. This key problem seems like it's really hard to solve in a principled way.
  2. Humans are complete disaster monkeys when it comes to ML research, and as such, if there was an obvious way to write a ML program on your desktop computer that rapidly bootstraps its way to godhood, someone would have already done it.
  3. Per 1 and 2, we totally have already "explored (publicly?)" the potential of “switching on” backpropagation/training while in inference mode (if by "we" you include "the internet" and not just "lesswrong in particular").

I have noted the problem of catastrophic forgetting in the section "why it might not work". In general I agree continual learning is obviously a thing, otherwise I would not have used the established terminology. What I believe however is that the problems we face in continual learning in e.g. a 100M BERT model may not be the same as what we observe in models that can now meaningfully self critique. We have explored this technique publicly, but have we tried it with GPT-4? The publicly part was really just a question of whether OpenAI actually did it on this model or not, and it would be an amazing data point if they could say "We couldn't get it to work."

Ah, so the point was whether that had been explored publicly on the very largest language models that exist, because of the whole "sometimes approaches that didn't work at small scale start working when you throw enough compute at them" thing? Makes sense.

Essentially yes, heh. I take this as a learning experience for my writing, I don't know what I was thinking, but it is obvious in hindsight that saying to just "switch on backprop" sounds very naive.

I also confess I haven't done the due diligence to find out what the largest model this has actually been tried on is, or whether someone has tried it with Pythia or LLaMa. I'll do some more googling tonight.

One intuition why the largest models might be different, is that part of the training/fine-tuning going on will have to do with the model's own output. The largest models are the ones where the model's own output is not essentially word salad.

It seems you're dismissing things like autoGPT for their lack of long term memory. But they have long term memory. They have an episodic memory that works much like human episodic memory. We have a special system to store specifics, because continuous learning isn't adequate to do that alone without using a learning rate high enough to cause catastrophic interference.

The vector-based episodic memory in auto-GPT operates much like human EM; it searches for relevant past experiences and brings them back into the context window (roughly equivalent to human working memory). They don't seem to work very well just yet, but those are literally first attempts.
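Concretely, the retrieval step is something like the following bare-bones sketch (the embedding function and the top-k value are stand-ins, not what any particular AutoGPT fork actually uses):

```python
# Bare-bones sketch of vector-store episodic memory: embed past interactions,
# then pull the most similar ones back into the prompt before the next call.
# embed() is a stand-in for whatever embedding model the agent uses.
import numpy as np

class EpisodicMemory:
    def __init__(self, embed):
        self.embed = embed          # function: str -> np.ndarray
        self.texts, self.vectors = [], []

    def store(self, text):
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def recall(self, query, k=3):
        if not self.texts:
            return []
        q = self.embed(query)
        sims = [
            float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = np.argsort(sims)[-k:][::-1]   # indices of the k most similar memories
        return [self.texts[i] for i in top]

# Retrieved snippets are then prepended to the context window ("working memory").
```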

Continuous learning will doubtless become part of advanced systems at some point, but that's not likely to substitute for episodic memory. To be fair, this is an empirical question. I'm reasoning based on catastrophic interference findings in lots of networks, but who knows.

See my recent post on the topic if you like.

Thanks, this is a great analysis of the power of agentized LLMs, which I probably need to spend some more time thinking about. I will work my way through the post over the next few days. I briefly skimmed the episodic memory section for now, and I see it is like an embedding-based retrieval system for past outputs/interactions of the model, reminiscent of the way some Helper chatbots look up stuff from FAQs. My overall intuitions on this:

  • It's definitely something, but the method of embedding and retrieval, if static, would be very limiting
  • Someone will probably add RL on top of it to adjust the EBR system, which will improve on that part significantly... if they can get the hparams correct.
  • It still doesn't seem to me like "long term memory" so much as access to Google or CTRL-F on one's e-mail
  • I imagine actually updating the internals of the system is a fundamentally different kind of update.

It might be possible that a hybrid approach would end up working better, perhaps not even "continuous learning", but batched episodic learning. ("Sleep" but not sure how far that analogy goes.)

Strong upvote. I'm surprised this got downvoted.

It's possible it's downvoted because it might be considered dangerous capability research. It just seems highly unlikely that this would not be one of many natural research directions perhaps already attempted, and I figure we might as well acknowledge it and find out what it actually does in practice.

Or maybe downvotes because it "obviously won't work", but I think it's not obvious to me and would welcome discussion on that.

I'm worried that no matter how far we go, the next step will be one of the natural research directions.

Yes. This is very dangerous and the most likely way AGI will actually pan out.

I have abstained from commenting too much about this so far. It seems the "djinni is out of the bottle".


Worth noting that LLMs are no longer using quadratic context window scaling. See e.g. Claude-Long. It seems they've figured out how to make it ~linear. Looking at GPT-4 with a 32K context window option for corporate clients, it seems like they're also not using quadratic scaling any more.

Continual learning is, I would argue, an alternative route to long term planning and agency rather than a necessary one. Augmented LLMs with long term memory retrieval can do long term planning, assuming the model is already powerful enough. Also, agency just emerges from the simulator naturally.

I'm not convinced about continual learning as even the most likely path to AGI.