In April 2023, Alexey Guzey posted "AI Alignment Is Turning from Alchemy Into Chemistry" where he reviewed Burns et al.'s paper "Discovering Latent Knowledge in Language Models Without Supervision." Some excerpts to summarize Alexey's post:
For years, I would encounter a paper about alignment — the field where people are working on making AI not take over humanity and/or kill us all — and my first reaction would be “oh my god why would you do this”. The entire field felt like bullshit. I felt like people had been working on this problem for ages: Yudkowsky, all of LessWrong, the effective altruists. The whole alignment discourse had been around for so long, yet there was basically no real progress; nothing interesting or useful. Alignment thought leaders seemed to be hostile to everyone who wasn’t an idealistic undergrad or an orthodox EA and who challenged their frames and ideas. It just felt so icky. [...] Bottom line is: the field seemed weird, stuck, and lacking any clear, good ideas and problems to work on. It basically felt like alchemy.
[...]
As far as I know, nobody ever managed to make practical progress on this issue until literally last year. Collin Burns et al’s Discovering Latent Knowledge in Language Models Without Supervision was the first alignment paper where my reaction was “fuck, this is legit”, rather than “oh my god why are you doing this”. Burns et al actually managed to show that we can learn something about what non-toy LLMs “think” or “believe” without humans labeling the data at all. Burns’ method probably won’t be the one to solve alignment for good. However, it’s an essential first step, a proof of concept that demonstrates unsupervised alignment is indeed possible, even when we can’t evaluate what AI is doing. It is the biggest reason why I think the field is finally becoming real.
Alexey ended up being quite wrong: Burns' paper, while very interesting, didn't inspire impactful follow-up research in eliciting beliefs or contribute to any alignment/control techniques used at the labs.
Despite being much more optimistic about the alignment community's ability to eventually make progress than Alexey, I did agree with him that alignment was still waiting for a killer research direction. Up to that point, and for around 2 years after, very few alignment papers actually produced insights or techniques that meaningfully affect how AI is trained and deployed. When I applied to an AI safety grantwriting role at Open Philanthropy in early 2024, one of the questions on the application was roughly "What do you think the most important alignment paper has been?" and I answered with the original RLHF paper because up until that point, it was the ~only major technique to come out of the alignment community that actually steered an AI system to behave more safely (feel free to correct me here, I'm also counting RLAIF and constitutional AI in this bucket).
But with recent work in emergent misalignment and inoculation prompting (Betley et al., MacDiarmid et al., Wichers et al.), I think alchemy really is turning into chemistry. We have:
I'm really excited to see new work that comes out of this research direction. I think there's a lot of opportunity to start creating more in vitro model organisms in reward hacking setups, and more accessible model organisms mean that more researchers can contribute to creating control techniques. With more work studying the physics of how RL posttraining and reward hacking affect model goals and capabilities, there's also more value in having evaluation techniques that can assess model alignment.