Zhijing Jin — LessWrong

Perspectives on Continual Learning: Survey Results and Forecasts

by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, and Seth Herd

This is the fifth post in the sequence Implications of Continual Learning for LLM Agents. Summary: While writing our continual learning sequence, we sent a survey to a number of AI safety researchers with questions about continual learning. This post summarizes the results of that survey. We asked whether respondents...

Jun 24 · 33

Angles of attack for continual learning safety

by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, and Seth Herd

This is the fourth post in the sequence Implications of Continual Learning for LLM Agents. Summary: Continual learning is a capability that largely doesn’t exist yet in LLMs. We first want to acknowledge that this may make it difficult to identify tractable angles of attack for making CL safer: it...

Jun 16 · 47

How might continual learning affect safety and alignment?

by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, and Seth Herd

This is the third post in our sequence Implications of Continual Learning for LLM Agents. Summary: We argue that continual learning (CL) has two major potential safety implications: it may enable changes to LLM goals and values after deployment, and it eliminates the last-mover advantage held by current safety interventions....

Jun 13 · 60

What's Continual Learning, and Why Might We Expect To See It In Advanced LLM Agents?

by RohanS, Rauno Arike, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, and Seth Herd

This is the second post in the sequence Implications of Continual Learning for LLM Agents. Summary: We say that an agent is a continual learner if it undergoes persistent updates during deployment. That’s more-or-less a binary criterion, but there are several other components to being good at continual learning that...

Jun 12 · 32

Implications of Continual Learning for LLM Agents: Introduction

by RohanS, Rauno Arike, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, and Seth Herd

Many people think that continual learning (CL) is a key missing capability of LLM systems, and we think its development could have huge implications for the capabilities and safety of AI agents. Despite this, several important questions about CL remain underexplored:
* What counts as continual learning? Through what pathways...

Jun 12 · 49

Explaining undesirable model behavior: (How) can influence functions help?

Undesirable training data can lead to undesirable model output. This dynamic is commonly phrased as "garbage in, garbage out" and it is a key issue for frontier models trained on web-scale data. How can we efficiently identify these bad apples in massive training datasets (with trillions of tokens)? Influence functions...

Mar 218

The Multi-Agent Minefield: Can LLMs Cooperate to Avoid Global Catastrophe?

ArXiv paper here. Most AI safety research asks a familiar question: Will a single model behave safely? But many of the risks we actually worry about – including arms races, coordination failures, and runaway competition – don’t involve one single AI model acting alone. They emerge when multiple advanced AI...

Feb 17 · 15