AICRAFT: DARPA-Funded AI Alignment Researchers — Applications Open

TL;DR: We hypothesize that most alignment researchers have more ideas than they have engineering bandwidth to test. AICRAFT is a DARPA-funded project that pairs researchers with a fully managed professional engineering team for two-week pilot sprints, designed specifically for high-risk ideas that...
UPDATE: Recent work with improved alignment-faking (AF) and compliance-gap classifiers disagrees with our results. We recommend using the improved classifiers.

Summary

We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models)....
This research was conducted at AE Studio and supported by the AI Safety Grants program administered by Foresight Institute, with additional support from AE Studio.

Summary

In this post, we summarize the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which...
TL;DR: In our recent work with Professor Michael Graziano (arXiv, thread), we show that adding an auxiliary self-modeling objective to supervised learning tasks yields simpler, more regularized, and more parameter-efficient models. Across three classification tasks and two modalities, self-modeling consistently reduced complexity (lower real log canonical threshold (RLCT), narrower weight distribution). This restructuring effect...
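To make the idea concrete, here is a minimal sketch of what an auxiliary self-modeling objective could look like: a classifier with an extra head trained to predict one of its own hidden-layer activation vectors, with the two losses summed. All names, dimensions, and the loss weighting below are illustrative assumptions, not the paper's exact architecture or training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingNet(nn.Module):
    """Classifier with an auxiliary output that predicts one of its
    own hidden-layer activation vectors (the self-modeling target)."""

    def __init__(self, in_dim=784, hidden=256, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.class_head = nn.Linear(hidden, n_classes)
        # Hypothetical auxiliary head: predict the first hidden layer's
        # activations from the last hidden state.
        self.self_head = nn.Linear(hidden, hidden)

    def forward(self, x):
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        return self.class_head(h2), self.self_head(h2), h1

def combined_loss(logits, h1_pred, h1, y, aux_weight=0.1):
    task = F.cross_entropy(logits, y)
    # Gradients also flow into h1, so the network is pushed to make its
    # own activations easier to predict, which is one plausible source
    # of the regularization effect described above.
    self_model = F.mse_loss(h1_pred, h1)
    return task + aux_weight * self_model

# One illustrative training step on random data.
model = SelfModelingNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
opt.zero_grad()
logits, h1_pred, h1 = model(x)
combined_loss(logits, h1_pred, h1, y).backward()
opt.step()
```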
Figure 1. Image generated by DALL·E 3 to represent the concept of self-other overlap.

Many thanks to Bogdan Ionut Cirstea, Steve Byrnes, Gunnar Zarncke, Jack Foxabbott, and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. This research was conducted at AE Studio and...
Many thanks to Evan Miyazono, Nora Ammann, Philip Gubbins, and Judd Rosenblatt for valuable feedback on this video. We created a video introduction to the paper Towards Guaranteed Safe AI to highlight its concepts[1] and make them more accessible through a visual medium. We believe the framework introduced in...
Many thanks to Diogo de Lucena, Cameron Berg, Judd Rosenblatt, and Philip Gubbins for support and feedback on this post.

TL;DR

Reinforcement Learning from Human Feedback (RLHF) is one of the leading methods for fine-tuning foundation models to be helpful, harmless, and honest. But it’s complicated, and the standard implementation...
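The core of that standard pipeline is a reward model trained on human preference pairs, which the policy is then optimized against. As a rough illustration (not the post's implementation), here is a minimal sketch of the pairwise preference loss such a reward model is typically trained with; the embedding-based `RewardModel` and all dimensions are assumptions for brevity, since real reward models are usually full language models with a scalar head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: scores a fixed-size response embedding.
    (In practice this would be a full LM with a scalar value head.)"""

    def __init__(self, emb_dim=768):
        super().__init__()
        self.score = nn.Linear(emb_dim, 1)

    def forward(self, emb):
        return self.score(emb).squeeze(-1)

def preference_loss(rm, chosen_emb, rejected_emb):
    # Pairwise (Bradley-Terry) loss: the human-preferred response
    # should receive a higher scalar reward than the rejected one.
    margin = rm(chosen_emb) - rm(rejected_emb)
    return -F.logsigmoid(margin).mean()

# Illustrative step: scores for 8 preference pairs of random embeddings.
rm = RewardModel()
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
preference_loss(rm, chosen, rejected).backward()
```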