Can't agree more.
One underappreciated reason behavior evals matter: for 'questions with no ground truth', they may be the only coherent approach.
Most evals have (or assume) a correct answer to measure against. But some of the most important things we want models to work on and get right have no ground truth (e.g. frontier scientific research, socioeconomic decisions...). How do they handle competing views and frameworks? How do they represent uncertainty without overclaiming? We definitely need these types of evaluations and focus on a 'response shap...
Interesting work.
One direction that would be interesting to explore: all your pairings are same-family. Same-family models likely share some core reasoning patterns, so:
1) a debate transcript from the stronger model might help a weaker judge from the same family more than a transcript from a different family would, since same-family models might understand each other better;
2) but they might also share failure modes, meaning a same-family critic might be systematically blind to the same errors as the weaker judge.
Cross-family testing might surface qualitatively different objections, potentially widening or narrowing the classifier gap.
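
To make the suggestion concrete, here is a rough sketch of how the pairing grid could be laid out so that same-family and cross-family conditions sit side by side. The model names, family labels, and strong/weak tiers are all placeholders I made up for illustration, not the models or setup from your experiments.

```python
from itertools import product

# Hypothetical registry: names, families, and tiers are placeholders.
MODELS = {
    "family_a_large": {"family": "A", "tier": "strong"},
    "family_a_small": {"family": "A", "tier": "weak"},
    "family_b_large": {"family": "B", "tier": "strong"},
    "family_b_small": {"family": "B", "tier": "weak"},
}


def build_pairings(models):
    """Pair every strong debater with every weak judge and label the condition."""
    strong = [name for name, meta in models.items() if meta["tier"] == "strong"]
    weak = [name for name, meta in models.items() if meta["tier"] == "weak"]
    pairings = []
    for debater, judge in product(strong, weak):
        same_family = models[debater]["family"] == models[judge]["family"]
        pairings.append({
            "debater": debater,
            "judge": judge,
            "condition": "same-family" if same_family else "cross-family",
        })
    return pairings


if __name__ == "__main__":
    for pairing in build_pairings(MODELS):
        print(pairing["condition"], pairing["debater"], "->", pairing["judge"])
```

Comparing judge accuracy between the two conditions, with the judge held fixed, would be one way to separate the "understand each other better" effect from the "shared blind spots" effect.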