Call for suggestions - AI safety course
Valerio Pepe · 5d · 20

As someone who has applied to take this class, I'll suggest 10 papers: 4 from my own niche research interests, and 6 from very recent eval-focused work that I think is interesting and would like an excuse to read/discuss.

Niche Interests

1) In terms of what we can learn from other fields: AI-safety-conscious cognitive scientists have recently been thinking about how to move past revealed preferences in AI alignment. They've come up with resource-rational contractualism, which on the surface seems like an interesting framework with a Bayesian bent, so it could also scratch the math itch. (Zhi-Xuan et al. 2024) and (Levine et al. 2025) seem to be the main papers so far, and both are very recent.
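
To give a flavor of what "resource-rational" means here, a minimal toy sketch (entirely my own illustration, not the formalism from either paper): instead of fully simulating what every affected party would agree to, estimate endorsement from a small, budgeted sample of simulated stakeholders, trading compute for accuracy.

```python
# Toy illustration only: a hypothetical stand-in for resource-rational
# contractualism, NOT the actual formalism from Zhi-Xuan et al. or Levine et al.
# Idea: rather than simulating the full hypothetical negotiation, estimate
# how widely an action would be endorsed from a budgeted sample of stakeholders.
import random

def sample_stakeholder(rng):
    """Hypothetical stakeholder: a random utility function over actions."""
    return {a: rng.gauss(0, 1) for a in ("comply", "refuse", "clarify")}

def estimated_endorsement(action, budget, seed=0):
    """Monte Carlo estimate of the fraction of stakeholders endorsing
    `action`, using only `budget` samples (the 'resource-rational' part)."""
    rng = random.Random(seed)
    endorsements = 0
    for _ in range(budget):
        utility = sample_stakeholder(rng)
        # A stakeholder 'endorses' the action if it maximizes their utility.
        if max(utility, key=utility.get) == action:
            endorsements += 1
    return endorsements / budget

# More compute buys a better estimate of the hypothetical full agreement.
for b in (10, 100, 1000):
    print(b, estimated_endorsement("clarify", budget=b))
```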

2) I find Goodfire AI's approach to mech interp, which essentially looks for mechanisms in a model's parameters rather than its activations, really interesting; I think it is both new enough and mathematically rich enough that I can see student projects iterating on it for the class. (Braun et al. 2025) and (Bushnaq et al. 2025) are the main papers here.
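
As a rough intuition pump for "params instead of activations" (again my own toy sketch: I use SVD rank-one pieces as stand-in parameter components, whereas Braun et al. learn their decomposition): decompose a weight matrix into components, then attribute behavior by ablating each component from the weights and watching how much the output degrades.

```python
# Toy illustration only: not Goodfire's actual method. SVD rank-one terms
# serve as stand-in "parameter components"; the point is just that the
# unit of analysis is a piece of the WEIGHTS, not an activation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # a single weight matrix to analyze
x = rng.normal(size=8)               # one probe input

# Decompose the parameters into rank-one components: W = sum_k s_k u_k v_k^T.
U, S, Vt = np.linalg.svd(W)
components = [S[k] * np.outer(U[:, k], Vt[k]) for k in range(len(S))]

def loss(weight):
    """Hypothetical task loss: distance of the layer's output from a target,
    where 'correct behavior' is defined as the full matrix's output."""
    target = W @ x
    return float(np.sum((weight @ x - target) ** 2))

# Attribute behavior to parameter components by ablating each one:
# a large loss increase means that component carries the mechanism.
for k, C in enumerate(components):
    print(f"component {k}: ablation loss = {loss(W - C):.4f}")
```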

Recent Eval Work

The METR doubling-time paper, Ai2's SciArena, LLMs Often Know When They're Being Evaluated, Anthropic's SHADE-Arena, UK AISI's STACK adversarial attack, and Cohere's takedown of LMArena.
