Sam Martin — LessWrong

Classifier Context Rot: Monitor Performance Degrades with Context Length

by Fabien Roger and Sam Martin

Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500K tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100K tokens. We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In...

How useful is cross-domain generalization for training LLM monitors?

by Fabien Roger and Sam Martin

We study how well single-token classification training generalizes and find that:

* Control monitor performance can be improved by training on adjacent classification tasks (which is useful if we don't have high-quality in-domain data).
* We do not find strong benefits from classification-specialized models: training on instruction following after classification...