(Original title: The Continual Learning Overhang, but too many Overhang posts)

TL;DR Continual learning could lead to emergent causal understanding or agency and we should probably study it now before it’s too late.

Current LLMs are very impressive but have a few key missing pieces (as Sarah Constantin argues):

  • They inherently lack agency. They can be wrapped in a loop of control code that makes them more agentic, but it feels like this won’t really take off.
  • They are missing causal modeling to some extent. It’s just not how they are trained. They may pick up a sense of the causal structure of fiction, but it’s not quite the same. They are never trained by interacting with anything and have no inherent sense of “making an intervention”.

Sarah argues that the current approaches will be very unlikely to spontaneously develop these characteristics, and that it would require a ground level rethinking of how AI is done. I am not so convinced we’ve seen the full potential of the path we are on.


I think that we have yet to explore (publicly?) the potential of “switching on” backpropagation/training while in inference mode. Most models have a clean separation of “train” and “inference”: during inference the model generates token by token as it goes along, but it is no longer learning.
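As a very rough illustration of what I mean, here is a minimal sketch, assuming a Hugging Face-style causal LM; the model, tokenizer, and learning rate are placeholders, and training on the model's own prompt + generation is just one possible choice of what to learn from:

```python
# Sketch: "switch on" learning at inference time by taking a small gradient
# step on whatever the model just saw and generated. Not a claim about how
# any existing lab does this; hyperparameters are illustrative only.
import torch

def generate_and_learn(model, tokenizer, prompt, lr=1e-6, max_new_tokens=64):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    # 1. Ordinary inference: produce a continuation token by token.
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # 2. Continual-learning step: treat prompt + continuation as a training
    #    example and update the weights before the next call.
    model.train()
    loss = model(input_ids=output_ids, labels=output_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```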

 

Why I am skeptical of agency in current models / AutoGPT

 

AutoGPT is very interesting, but seems to be plagued by trouble actually completing tasks. It may improve with better code & prompts, but I still suspect it will miss the mark.

The context window is a very strong handicap in practice. To be a proper agent, one must have a long term goal, and a good explanation of the current state of the world. Without updating the weights of the model, this must fit inside the context window. Why this is suboptimal:

  • The context window itself can in theory hold several thousand tokens (32K in the largest GPT-4 variant), which is not bad, but it is also in word form and may not be efficiently encodable or decodable. 
  • Tracking state between calls to GPT-4 is not something the model was trained to do, so it (a) relies on human trial and error, and (b) on the model side, the inputs the model expects are always natural language, so no specially efficient encoding can emerge (other than that weird emoji soup). 
  • In any case, it seems very unlikely that even a human could be an effective agent at anything if they could only remember things in 5-minute chunks and then forgot everything after that.

Relationship of Continual Learning to Context Window size

The context window is currently the only way a model can preserve knowledge between calls. With continual learning, at its simplest, knowledge can instead be transferred via the weights of the model.

Of course, in practice, does updating the weights really allow for efficient data transfer? It remains to be seen. However, we have some intuition why it might:

  • The biggest LLMs are very good at memorization, though how many times each data point was seen, and what the learning rate was at that point, are unknowns at this point.
  • There were papers that came out recently arguing that in-context learning bears a striking similarity to backpropagation. A great LW post by Blaine analyzes them and sheds some light on why the analogy is not perfect. Still, it makes one wonder: if in-context learning behaves similarly to backpropagation, is the reverse true? Does backpropagation behave similarly to in-context learning? (A rough sketch of the dual-form argument is given just below.)
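For the curious, the rough shape of the “dual form” argument in those papers, as I understand it (a sketch with the softmax dropped, not a faithful account of real attention), is:

$$
W_V\,[X';X]\,\big(W_K\,[X';X]\big)^{\top} q
\;=\;
\underbrace{W_V X \,(W_K X)^{\top} q}_{\text{“zero-shot” term}}
\;+\;
\underbrace{\Big(\textstyle\sum_i (W_V x'_i)(W_K x'_i)^{\top}\Big)\, q}_{\Delta W_{\text{ICL}}\; q}
$$

where the $x'_i$ are the in-context demonstration tokens. The $\Delta W_{\text{ICL}}$ term is a sum of outer products applied to the query, which has the same algebraic shape as a gradient update $\Delta W = \sum_i e_i\, x_i^{\top}$ to a linear layer; that is where the “in-context learning looks like implicit gradient descent” analogy comes from.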

 

Why not just increase the context window size?


The main reason is that we need to know how much capability is being left on the table by our current setup.

Longer context windows are another clear direction to look at, but there is already a ton of research going on there (Hyena, survey of methods). My intuitions for why a continual learning approach would be more scalable:

  • Limiting the quadratic nature of attention seems to hurt performance in practice.
  • Looking at short-term input quadratically is OK, as long as that new input can then be combined with long-term memory stored elsewhere (e.g. in the weights).


That said, fast attention would probably be invaluable in continuous time-domain cases, because even a “short time window” in a robotics setting could consist of nearly a thousand frames (30 fps × 30 seconds = 900 frames), times however many tokens it takes to encode a frame.
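Back-of-envelope, with a purely assumed tokens-per-frame figure (real visual tokenizers vary a lot):

```python
# Rough cost of full quadratic attention over a 30-second robotics window.
# tokens_per_frame is an assumption for illustration, not a measured number.
fps = 30
seconds = 30
tokens_per_frame = 256                 # assumed

frames = fps * seconds                 # 900 frames
tokens = frames * tokens_per_frame     # 230,400 tokens in the window
attention_pairs = tokens ** 2          # ~5.3e10 query-key pairs per head per layer

print(frames, tokens, f"{attention_pairs:.1e}")
```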

 

Why it might not work


It may be the case that this is a very hard problem. In the ML models I have trained (up to 500M parameters), typically the LR is annealed and training converges. The examples seen towards the end are usually weighted much less. (Typically the examples seen at the end are repeats, assuming you do more than 1 epoch.) Trying to continually train the model (as opposed to fine-tune on a separate task) is a bit finicky. I would wonder how high to set the LR, if there would need to be experience replay, etc. Also, catastrophic forgetting seems to be a real issue.
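For concreteness, this is roughly the kind of loop I have in mind when I say “continually train with experience replay”; it is only a sketch, and the learning rate, replay ratio, and buffer size are exactly the knobs I would not know how to set:

```python
# Sketch of a continual-training step with a simple experience-replay buffer,
# intended to reduce (not solve) catastrophic forgetting. All hyperparameters
# are illustrative guesses.
import random
import torch

def continual_step(model, optimizer, new_examples, replay_buffer,
                   replay_ratio=0.5, buffer_size=10_000):
    # Mix fresh examples with a random sample of older ones.
    n_replay = int(len(new_examples) * replay_ratio)
    replayed = random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    batch = new_examples + replayed

    loss = sum(model(input_ids=x, labels=x).loss for x in batch) / len(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Store the new examples, evicting old ones at random once the buffer is full.
    for x in new_examples:
        if len(replay_buffer) >= buffer_size:
            replay_buffer.pop(random.randrange(len(replay_buffer)))
        replay_buffer.append(x)
    return loss.item()
```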


These issues plague smaller models to a large extent, but we do not know whether larger models might be more robust to them. Large models seem to be capable of learning abilities smaller ones are not (example: learning from natural language feedback).

 

Why should we do it?

 

Without continual learning, the model never learns to make an intervention and see what happens. The model simulates but can never internally develop & store a sense of self. Maybe it will be harder than just “turning on backprop”, but what if it isn’t much more complex than that? Do we want to know? I think we want to know.

We are still at the point where GPT-4 seems to have enough blind spots that we can still “switch it off”. If this line of intervention can strengthen the capabilities of the model, we should find out now rather than later on, when we have an even more powerful model. Suppose we have GPT-7 or something: capable of superhuman performance on a variety of tasks, but still reasonably docile and not very capable of long-term planning. I would not want to switch that one “on”.

Currently OpenAI is the only company that has such a powerful model whose output demonstrates non-shallow understanding of the world. Ideally, they should experiment along these lines in collaboration with ARC and publish their findings. I would also be curious if anyone can get LLaMa to work doing this, and even a negative result would be an interesting finding.

 

Or, Can This Lead to a Concrete Policy Proposal?

 

As Sarah points out, doing interventions in the real world (with robotics) would be extremely expensive, so we can probably stick to chatbots for now. One question that comes up, though, is: does this model need millions of interactions/interventions to really learn how to do this, or is it more sample efficient? 


We do have an example where we can learn from millions of chat interactions. All ChatGPT conversation histories! If continual learning turns out to be quite powerful, would we eventually want to discourage companies from training on the outputs of their own raw chat logs for models larger than GPT-4?

One could even argue that we don’t want this already, due to privacy concerns and potential data leakage between chat sessions. (Despite me calling for research into this, doing it on massive conversation data seems like a bad idea.)


 

Comments (14)

Continual learning is a blessing of scale: https://www.reddit.com/r/mlscaling/search?q=continual+learning&restrict_sr=on&include_over_18=on

(I don't mind saying this because it is obvious to anyone following the literature who has watched prior blessings of scale happen, and in particular, how each subfield copes with the realization that their problem was never a real one and all their clever ideas only mattered at scales which are quickly becoming OOMs irrelevant; and the continual-learning people already are going through the stages of grief, so a throwaway LW comment from me makes no difference.)

If you are trying to model DL capabilities, you should just assume continual-learning is already solved for all intents and purposes at GPT-4 scale (and note, for example, OA's revealed preferences in terms of training models from scratch vs further training old checkpoints) until you see an extremely compelling empirical demonstration to the contrary. We don't see it much overtly, simply because fullblown 'finetuning' is often not easy, and is much more expensive, and can be replaced to a considerable degree by tricks like retrieval or better prompts when your underlying model is really smart.

Fascinating, thanks for the research. Your analysis makes sense and seems to indicate that for most situations, prompt engineering is always the first plan of attack and often works well enough. Then, a step up from there, OpenAI/etc. would most likely experiment with fine-tuning or RLHF as it relates to a specific business need. To train a better chatbot and fill in any gaps, they would probably get more bang for their buck from simply fine-tuning on a large dataset that matched their needs. For example, if they wanted to do better mathematical reasoning, they'd probably pay people to generate detailed scratchwork and fine-tune on a whole dataset in batch, rather than set up an elaborate "tutor" framework. Continual learning itself would be mainly applicable for research into whether the thing spontaneously develops a sense of self, or for seeing if it helps with the specific case of long-term planning and agency. These are things the general public is fascinated with, but they perhaps don't seem to be the most promising direction for improving a company's bottom line yet.

This was the very first thing I thought of when language models came to my attention as "hey this looks like it actually might be the thing that the future looks like" (years ago). Since I'm not particularly smart or particularly well-informed, I conclude that I was not the first person to come up with this idea (or the tenth, or even the ten-thousandth). I strongly suspect that the simplest possible approach of "just turn on backprop" was tried within the first couple of days of the weights of a GPT model being available. For context, nostalgebraist-autoresponder has been live on Tumblr since late 2019.

I do concur with you that this is an important thing to explore. However, I am quite confident that "do the thing that is obvious to someone with no background encountering the field for the first time" is not an effective approach.

When I briefly looked into the academic research on this, I picked up the following general impression:

  1. This is a task a lot of people have poured a lot of time into. Search terms are "online learning", "incremental learning", "continual learning", and "active learning".
  2. The primary problem with this approach is that as the model learns new stuff, it forgets old stuff that it is no longer being trained on. The search term here is "catastrophic forgetting". There are also several less-critical problems which would still be blockers if catastrophic forgetting wasn't an issue, mostly related to the model going off the rails more and more over time - search terms include "bias amplification", "overfitting", and "hallucination". Some argue that this is also a problem in humans.
  3. There have been some clever attempts to get around this. One example of a particularly clever idea, from French and Chater (2002), is "let's use a clever metric of how important the old stuff is to the network to try to get it to forget less stuff". I notice that this clever technique is not in use despite there being a Deepmind publication about it in 2017 (the penalty it uses is written out after this list). Search term: "elastic weight consolidation", and, in terms of the particular failure modes, I believe "task-agnostic/task-free continual learning" and "scalability".
  4. Lots of people have had the idea "well humans sleep and dream, maybe something like that?". Search terms: "knowledge distillation", "experience replay".
  5. Also lots of people have had the idea "well what if we hack on an external memory store in a completely unprincipled way". And this seems to be how AutoGPT works. Also people have tried to do it in a principled way, search term "memory-augmented neural networks".
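For reference, the elastic weight consolidation idea in (3) amounts, roughly, to adding a quadratic penalty that anchors the parameters that mattered for the old data:

$$
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{new}}(\theta) \;+\; \sum_i \frac{\lambda}{2}\, F_i\,\big(\theta_i - \theta^{*}_{\text{old},\,i}\big)^2
$$

where $\theta^{*}_{\text{old}}$ are the weights after training on the old data, $F_i$ is the diagonal Fisher information (the "clever metric" of importance), and $\lambda$ trades off retention against plasticity.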

Basically, the impression I've picked up is

  1. This key problem seems like it's really hard to solve in a principled way.
  2. Humans are complete disaster monkeys when it comes to ML research, and as such, if there was an obvious way to write a ML program on your desktop computer that rapidly bootstraps its way to godhood, someone would have already done it.
  3. Per 1 and 2, we totally have already "explored (publicly?)" the potential of “switching on” backpropagation/training while in inference mode (if by "we" you include "the internet" and not just "lesswrong in particular").

I have noted the problem of catastrophic forgetting in the section "why it might not work". In general I agree continual learning is obviously a thing, otherwise I would not have used the established terminology. What I believe however is that the problems we face in continual learning in e.g. a 100M BERT model may not be the same as what we observe in models that can now meaningfully self critique. We have explored this technique publicly, but have we tried it with GPT-4? The publicly part was really just a question of whether OpenAI actually did it on this model or not, and it would be an amazing data point if they could say "We couldn't get it to work."

Ah, so the point was whether that had been explored publicly on the very largest language models that exist, because of the whole "sometimes approaches that didn't work at small scale start working when you throw enough compute at them" thing? Makes sense.

Essentially yes, heh. I take this as a learning experience for my writing, I don't know what I was thinking, but it is obvious in hindsight that saying to just "switch on backprop" sounds very naive.

I also confess I haven't done the due diligence to find out what the largest model this has actually been tried on is, or whether someone has tried it with Pythia or LLaMa. I'll do some more googling tonight.

One intuition why the largest models might be different, is that part of the training/fine-tuning going on will have to do with the model's own output. The largest models are the ones where the model's own output is not essentially word salad.

It seems you're dismissing things like autoGPT for their lack of long term memory. But they have long term memory. They have an episodic memory that works much like human episodic memory. We have a special system to store specifics, because continuous learning isn't adequate to do that alone without using a learning rate high enough to cause catastrophic interference.

The vector-based episodic memory in auto-GPT operates much like human EM; it searches for relevant past experiences and brings them back into the context window (roughly equivalent to human working memory). They don't seem to work very well just yet, but those are literally first attempts.
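Concretely, the retrieval step is something like the following bare-bones sketch (the embedding function and the top-k value are stand-ins, not what any particular AutoGPT fork actually uses):

```python
# Bare-bones sketch of vector-store episodic memory: embed past interactions,
# then pull the most similar ones back into the prompt before the next call.
# embed() is a stand-in for whatever embedding model the agent uses.
import numpy as np

class EpisodicMemory:
    def __init__(self, embed):
        self.embed = embed          # function: str -> np.ndarray
        self.texts, self.vectors = [], []

    def store(self, text):
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def recall(self, query, k=3):
        if not self.texts:
            return []
        q = self.embed(query)
        sims = [
            float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = np.argsort(sims)[-k:][::-1]   # indices of the k most similar memories
        return [self.texts[i] for i in top]

# Retrieved snippets are then prepended to the context window ("working memory").
```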

Continuous learning will doubtless become part of advanced systems at some point, but that's not likely to substitute for episodic memory. To be fair, this is an empirical question. I'm reasoning based on catastrophic interference findings in lots of networks, but who knows.

See my recent post on the topic if you like.

Thanks, this is a great analysis of the power of agentized LLMs, which I probably need to spend some more time thinking about. I will work my way through the post over the next few days. I briefly skimmed the episodic memory section for now, and I see it is like an embedding-based retrieval system for past outputs/interactions of the model, reminiscent of the way some Helper chatbots look up stuff from FAQs. My overall intuitions on this:

  • It's definitely something, but the method of embedding and retrieval, if static, would be very limiting
  • Someone will probably add RL on top of it to adjust the EBR system, which will improve on that part significantly... if they can get the hparams correct.
  • It still doesn't seem to me like "long term memory" so much as access to Google or CTRL-F on one's e-mail
  • I imagine actually updating the internals of the system is a fundamentally different kind of update.

It might be possible that a hybrid approach would end up working better, perhaps not even "continuous learning", but batched episodic learning. ("Sleep" but not sure how far that analogy goes.)

Strong upvote. I'm surprised this got downvoted.

It's possible it's downvoted because it might be considered dangerous capability research. It just seems highly unlikely that this would not be one of many natural research directions perhaps already attempted, and I figure we might as well acknowledge it and find out what it actually does in practice.

Or maybe downvotes because it "obviously won't work", but I think it's not obvious to me and would welcome discussion on that.

I'm worried that no matter how far we go, the next step will be one of the natural research directions.

Yes. This is very dangerous and the most likely way AGI will actually pan out.

I have abstained from commenting too much about this so far. It seems the "djinni is out of the bottle".


Worth noting that LLMs are no longer using quadratic context window scaling. See e.g. Claude-Long. It seems they've figured out how to make it ~linear. Looking at GPT-4 with a 32K context window option for corporate clients, it seems like they're also not using quadratic scaling any more.

Continual learning is, I would argue, an alternative route to long term planning and agency rather than a necessary one. Augmented LLMs with long term memory retrieval can do long term planning, assuming the model is already powerful enough. Also, agency just emerges from the simulator naturally.

I'm not convinced about continual learning as even the most likely path to AGI.