Call for suggestions - AI safety course
Valerio Pepe · 5d · 20

As someone who has applied to take this class, I'll suggest 10 papers: 4 from my own niche research interests, and 6 from very recent eval-focused work that I think is interesting and would like an excuse to read/discuss.

Niche Interests

1) In terms of what we can learn from other fields: AI-safety-conscious cognitive scientists have recently been thinking about how to move past revealed preferences in AI alignment. They've come up with resource-rational contractualism, which on the surface seems like an interesting framework with a Bayesian bent, so it could also scratch the math itch. (Zhi-Xuan et al. 2024) and (Levine et al. 2025) seem to be the main papers so far, and both are very recent.
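
To give a flavor of what "resource-rational" means here, a minimal toy sketch (entirely my own illustration, not the formalism from either paper): instead of fully simulating what every affected party would agree to, estimate endorsement from a small, budgeted sample of simulated stakeholders, trading compute for accuracy.

```python
# Toy illustration only: a hypothetical stand-in for resource-rational
# contractualism, NOT the actual formalism from Zhi-Xuan et al. or Levine et al.
# Idea: rather than simulating the full hypothetical negotiation, estimate
# how widely an action would be endorsed from a budgeted sample of stakeholders.
import random

def sample_stakeholder(rng):
    """Hypothetical stakeholder: a random utility function over actions."""
    return {a: rng.gauss(0, 1) for a in ("comply", "refuse", "clarify")}

def estimated_endorsement(action, budget, seed=0):
    """Monte Carlo estimate of the fraction of stakeholders endorsing
    `action`, using only `budget` samples (the 'resource-rational' part)."""
    rng = random.Random(seed)
    endorsements = 0
    for _ in range(budget):
        utility = sample_stakeholder(rng)
        # A stakeholder 'endorses' the action if it maximizes their utility.
        if max(utility, key=utility.get) == action:
            endorsements += 1
    return endorsements / budget

# More compute buys a better estimate of the hypothetical full agreement.
for b in (10, 100, 1000):
    print(b, estimated_endorsement("clarify", budget=b))
```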

2) I find Goodfire AI's approach to mech interp, which essentially looks for mechanisms in a model's parameters rather than its activations, really interesting; I think it is both new enough and mathematically rich enough that I can see student projects iterating on it for the class. (Braun et al. 2025) and (Bushnaq et al. 2025) are the main papers here.
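
As a rough intuition pump for "params instead of activations" (again my own toy sketch: I use SVD rank-one pieces as stand-in parameter components, whereas Braun et al. learn their decomposition): decompose a weight matrix into components, then attribute behavior by ablating each component from the weights and watching how much the output degrades.

```python
# Toy illustration only: not Goodfire's actual method. SVD rank-one terms
# serve as stand-in "parameter components"; the point is just that the
# unit of analysis is a piece of the WEIGHTS, not an activation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # a single weight matrix to analyze
x = rng.normal(size=8)               # one probe input

# Decompose the parameters into rank-one components: W = sum_k s_k u_k v_k^T.
U, S, Vt = np.linalg.svd(W)
components = [S[k] * np.outer(U[:, k], Vt[k]) for k in range(len(S))]

def loss(weight):
    """Hypothetical task loss: distance of the layer's output from a target,
    where 'correct behavior' is defined as the full matrix's output."""
    target = W @ x
    return float(np.sum((weight @ x - target) ** 2))

# Attribute behavior to parameter components by ablating each one:
# a large loss increase means that component carries the mechanism.
for k, C in enumerate(components):
    print(f"component {k}: ablation loss = {loss(W - C):.4f}")
```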

Recent Eval Work

The METR doubling-time paper, Ai2's SciArena, LLMs Often Know When They're Being Evaluated, Anthropic's SHADE-Arena, UK AISI's STACK adversarial attack, and Cohere's takedown of LMArena.
