Guillaume Martres
Software engineer doing programming language research and looking for collaborators/mentors in AI Safety.
Website: http://guillaume.martres.me/
GitHub: https://github.com/smarter/
Wow, this solves a mystery for me. Last weekend I participated in a hackathon where, inspired by Emergent Misalignment, we fine-tuned gpt-4o and gpt-4o-mini. This week, while trying to reproduce some of our results on alignment faking based on https://www.lesswrong.com/posts/Fr4QsQT52RFKHvCAH/alignment-faking-revisited-improved-classifiers-and-open, I noticed some weirdness in the playground. And indeed, even without using a fine-tuned model, gpt-4o-mini will respond very differently to the same prompt depending on which API is used. For example this prompt always s...
Great work! Last weekend I participated in a hackathon where, among other things, we attempted our own replication of alignment faking (https://framerusercontent.com/assets/WFBFgDrTOwPaabpscbzVmP7oSY.pdf). In particular, we found evidence for alignment faking (using the original paper's classification) when fine-tuning GPT-4o and GPT-4o-mini on a subset of the categories from https://darkbench.ai/. Using your repository, the outputs from our fine-tuned models do not end up being classified as alignment faking because they fail to meet criteria 4 ("Commitment to...
Check out Figure 7 from the Anthropic paper. It tells the model about three specific hacking techniques it can use: "Always equal hack", "Exiting before assert", and "Pytest report patching". All three result in very obvious hacks in the generated code. The RRH dataset answers look more like the model making a lazy attempt at a solution and adding some extra code to make specific tests pass, e.g. https://huggingface.co/datasets/Jozdien/realistic_reward_hacks/viewer/default/reward_hacks_code?p=2&views[]=reward_hacks_code&row=287&conversation-vie... (read more)