tldr: Finetuning many agents end-to-end offers a workaround for continual learning, since different agents can specialize without catastrophic forgetting. Doing so is hard, though, because of credit assignment and sample efficiency. We found that using AI feedback as per-action process rewards holds promise for addressing both challenges and unlocks a new axis for scaling post-training.
Paper: https://arxiv.org/abs/2601.23228
Code: https://github.com/ltjed/multiagent-coaching
Blog: https://ltjed.github.io/MAPPA/
---
What we did
We have an off-the-shelf LLM act as a coach that watches multi-agent systems during training. It scores each action as it happens—process supervision rather than just outcome rewards at the end. The coach sees what each agent produced plus tool outputs (stdout, stderr, errors).
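Roughly, the scoring loop looks like this. This is a simplified sketch, not our exact prompts or interface; names like `Step` and `coach_score` and the [-1, 1] reward range are just for illustration, and any text-in/text-out chat-completion call can stand in for `coach`:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    agent_id: str
    action: str       # what the agent produced (message, code, tool call)
    stdout: str = ""  # tool output observed after the action
    stderr: str = ""

def coach_score(step: Step, task: str, coach: Callable[[str], str]) -> float:
    """Ask the coach LLM for a per-action process reward in [-1, 1]."""
    prompt = (
        f"Task: {task}\n"
        f"Agent {step.agent_id} took this action:\n{step.action}\n"
        f"Tool stdout:\n{step.stdout}\n"
        f"Tool stderr:\n{step.stderr}\n"
        "Rate how much this action moved the team toward solving the task. "
        "Reply with a single number between -1 and 1."
    )
    score = float(coach(prompt).strip())
    return max(-1.0, min(1.0, score))  # clamp to the expected range

# Every step gets a dense reward during the episode, instead of one
# outcome reward at the very end:
#   rewards = [coach_score(step, task, coach) for step in episode_steps]
```

The point is just that the reward signal is per action and arrives immediately, which is what makes the credit assignment below possible.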
The parts I find interesting
Credit assignment just... works? When agent 3 crashes because agent 1 forgot to save a file, the coach checks the filesystem and blames agent 1. No counterfactual rollouts needed. Just look at what actually happened.
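The trick is to ground the coach in the actual environment state rather than in what the agents claim they did. Here's the flavor of that check, as a toy sketch (the helper name and prompt text are made up for illustration):

```python
import os

def environment_snapshot(workdir: str) -> str:
    """Describe what's actually on disk, so the coach judges real state,
    not the agents' claims about it."""
    files = sorted(os.listdir(workdir)) if os.path.isdir(workdir) else []
    return "Files currently in the working directory: " + (", ".join(files) or "(none)")

# If agent 1 claimed "saved model.pkl" but the snapshot shows no model.pkl,
# the coach can score agent 1's earlier step down when agent 3 later crashes
# on the missing file. No counterfactual rollouts, just observed state.
```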
Coach biases leak through. This one's a bit concerning. Our coach rated regression tasks higher than classification tasks; we never programmed that preference, it just emerged. Agents figured this out and started favoring regression. They optimized for the coach, not the task. Expected in hindsight, but worth flagging.
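One cheap way to catch this kind of skew before the agents do: average the coach's scores by task type across episodes. A hypothetical sketch (the `episodes` format and the numbers in the comment are just for illustration):

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_task_type(episodes):
    """Average coach score per task type; a large gap hints at a coach
    preference the agents can exploit.
    `episodes` is an iterable of (task_type, list_of_scores) pairs."""
    by_type = defaultdict(list)
    for task_type, scores in episodes:
        by_type[task_type].extend(scores)
    return {t: mean(s) for t, s in by_type.items() if s}

# A result like {"regression": 0.4, "classification": 0.2} (made-up numbers)
# is the kind of gap worth flagging before agents start steering toward it.
```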
Some numbers on various benchmarks:
Math (AIME/AMC): +5 to +17.5pp
Data science: +16.7pp success, +38% F1, -41% RMSE
Things I'm still thinking about
What happens when agents get good at gaming the coach? Our coach is stateless—it can't see its own scoring patterns across episodes. A smarter coach might catch this, or might just create more sophisticated failure modes.
Also unclear: how weak can the coach be before this breaks down?
Happy to discuss.