Léo Dana — LessWrong

Try Training SAEs with RLAIF

Epistemic status: not an Interpretability researcher, but has followed the scene closely. So, it makes sense to me that Probes should outperform SAEs: probes are trained directly to maximize an interpretable metric, while SAEs are trained to minimize reconstruction loss and only then interpreted. But training...

Dec 5, 2025 · 5

Visualizing small Attention-only Transformers

Work done during an internship at MILES, Paris Dauphine University, under the supervision of Yann Chevaleyre. You can find the git page for the post here. Research has indicated that in large Transformers, facts are primarily stored in the MLP layer rather than the attention layer....

Nov 19, 2024 · 4

Results from the Turing Seminar hackathon

by Charbel-Raphaël, jeanne_, and Léo Dana

We (EffiSciences) ran a hackathon at the end of the Turing Seminar at ENS Paris-Saclay and ENS Ulm, an academic course inspired by the AGISF, with 28 projects submitted by 44 participants on the 11th and 12th of November. We share a selection of projects. See them all here. I think...

Dec 7, 2023 · 35

On Interpretability's Robustness

Léo Dana - Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, as well as an internship at FAR AI, both under the mentorship of Claudia Shi. Would you trust the IOI circuit? Interpretability is envisioned by many as the main source of alignment...

Oct 18, 2023 · 11

Introducing EffiSciences’ AI Safety Unit

This post was written by Léo Dana, Charbel-Raphaël Ségerie, and Florent Berthet, with the help of Siméon Campos, Quentin Didier, Jérémy Andréoletti, Anouk Hannot and Tom David. In this post, you will learn which of EffiSciences’ field-building activities were most successful, along with our advice, reflections, and takeaways for field-builders....

Jun 30, 2023 · 68

Improvement on MIRI's Corrigibility

This post was written as a submission for the AI Alignment Award, initiated at EffiSciences' event. This post aims to address the problem of corrigibility as identified by MIRI in 2015. We propose an extended formalism that allows us to write the desiderata of a corrigible behaviour, and provide theoretical...

Jun 9, 2023 · 54

A Corrigibility Metaphore - Big Gambles

Here I present a helpful analogy for understanding the corrigibility problem and the challenge raised by MIRI in their proposal. The analogy greatly simplifies some challenges of corrigibility but keeps the main problem found in the proposal, which I call Big Gambles. You are playing Mario Kart with 2 other...

May 10, 2023 · 16