tldr: Finetuning many agents end-to-end offers a workaround for continual learning, since different agents can specialize without catastrophic forgetting. Doing so is hard, though, because of credit assignment and sample efficiency. We found that using AI feedback as per-action process rewards holds promise for addressing both challenges and unlocks a new axis for scaling post-training.
Paper: https://arxiv.org/abs/2601.23228
Code: https://github.com/ltjed/multiagent-coaching
Blog: https://ltjed.github.io/MAPPA/
---
What we did
We have an off-the-shelf LLM act as a coach that watches multi-agent systems during training. It scores each action as it happens—process supervision rather than just outcome rewards at the end. The coach sees what each agent produced plus tool outputs (stdout, stderr, errors).
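Roughly, the scoring loop looks like this. This is a simplified sketch, not our exact prompts or interface; names like `Step` and `coach_score` and the [-1, 1] reward range are just for illustration, and any text-in/text-out chat-completion call can stand in for `coach`:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    agent_id: str
    action: str       # what the agent produced (message, code, tool call)
    stdout: str = ""  # tool output observed after the action
    stderr: str = ""

def coach_score(step: Step, task: str, coach: Callable[[str], str]) -> float:
    """Ask the coach LLM for a per-action process reward in [-1, 1]."""
    prompt = (
        f"Task: {task}\n"
        f"Agent {step.agent_id} took this action:\n{step.action}\n"
        f"Tool stdout:\n{step.stdout}\n"
        f"Tool stderr:\n{step.stderr}\n"
        "Rate how much this action moved the team toward solving the task. "
        "Reply with a single number between -1 and 1."
    )
    score = float(coach(prompt).strip())
    return max(-1.0, min(1.0, score))  # clamp to the expected range

# Every step gets a dense reward during the episode, instead of one
# outcome reward at the very end:
#   rewards = [coach_score(step, task, coach) for step in episode_steps]
```

The point is just that the reward signal is per action and arrives immediately, which is what makes the credit assignment below possible.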
The parts I find interesting
Credit assignment just... works? When agent 3 crashes because agent 1 forgot to save a file, the coach checks the filesystem and blames agent 1. No counterfactual rollouts needed. Just look at what actually happened.
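The trick is to ground the coach in the actual environment state rather than in what the agents claim they did. Here's the flavor of that check, as a toy sketch (the helper name and prompt text are made up for illustration):

```python
import os

def environment_snapshot(workdir: str) -> str:
    """Describe what's actually on disk, so the coach judges real state,
    not the agents' claims about it."""
    files = sorted(os.listdir(workdir)) if os.path.isdir(workdir) else []
    return "Files currently in the working directory: " + (", ".join(files) or "(none)")

# If agent 1 claimed "saved model.pkl" but the snapshot shows no model.pkl,
# the coach can score agent 1's earlier step down when agent 3 later crashes
# on the missing file. No counterfactual rollouts, just observed state.
```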
Coach biases leak through. This one's a bit concerning. Our coach rated regression tasks higher than classification tasks; we never programmed that preference, it just emerged. Agents figured this out and started favoring regression. They optimized for the coach, not the task. Expected in hindsight, but worth flagging.
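One cheap way to catch this kind of skew before the agents do: average the coach's scores by task type across episodes. A hypothetical sketch (the `episodes` format and the numbers in the comment are just for illustration):

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_task_type(episodes):
    """Average coach score per task type; a large gap hints at a coach
    preference the agents can exploit.
    `episodes` is an iterable of (task_type, list_of_scores) pairs."""
    by_type = defaultdict(list)
    for task_type, scores in episodes:
        by_type[task_type].extend(scores)
    return {t: mean(s) for t, s in by_type.items() if s}

# A result like {"regression": 0.4, "classification": 0.2} (made-up numbers)
# is the kind of gap worth flagging before agents start steering toward it.
```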
Some numbers on various benchmarks:
Math (AIME/AMC): +5 to +17.5pp
Data science: +16.7pp success, +38% F1, -41% RMSE
Things I'm still thinking about
What happens when agents get good at gaming the coach? Our coach is stateless—it can't see its own scoring patterns across episodes. A smarter coach might catch this, or might just create more sophisticated failure modes.
Also unclear: how weak can the coach be before this breaks down?
Happy to discuss.