Suho Lee

2mo

Suho Lee

2mo

More experienced, long-term politicians tend to have a dedicated writing team, which may be less open to using AI for writing speeches. It would be interesting to divide politicians by level of experience and see how these statistics change.

When does debate help a weak judge? Evidence from code, logic and math

Suho Lee2mo50

Interesting work.

One possible direction that would be interesting to explore: all your pairings are same-family. Same-family models likely share some core reasoning, thus

1) a debate transcript from a stronger model might help a weaker judge from the same family more than a transcript from a different family would, since same-family models might understand each other better,

2) but they might also share failure modes, meaning a same-family critic might be systematically blind to the same errors as the weaker judge.

Cross-family testing might surface qualitatively different objections, potentially widening or narrowing the classifier gap.

The Case for Evaluating Model Behaviors

Suho Lee2mo10

Couldn't agree more.

One underappreciated reason behavior evals matter: for 'questions with no ground truth', they may be the only coherent approach.

Most evals have (or assume) a correct answer to measure against. But some of the most important things that we want models to work on and get right have no ground truth (e.g. frontier scientific research, socioeconomic decisions...). How do they handle competing views and frameworks? How do they represent uncertainty without overclaiming? We definitely need these types of evaluations and focus on a 'response shap...