LESSWRONG
LW

Guillaume Martres
2020
Message
Dialogue
Subscribe

Software engineer doing programming language research and looking for collaborators/mentors in AI Safety.

Website: http://guillaume.martres.me/

GitHub: https://github.com/smarter/

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
OpenAI Responses API changes models' behavior
Guillaume Martres5mo30

Wow, this solves a mystery for me. Last weekend I participated in a hackathon where inspired by Emergent Misalignment we fine-tuned gpt-4o and gpt-4o-mini. This week trying to reproduce some of our results on alignment faking based on https://www.lesswrong.com/posts/Fr4QsQT52RFKHvCAH/alignment-faking-revisited-improved-classifiers-and-open I noticed some weirdness in the playground. And indeed, even without using a fine-tuned model, gpt-4o-mini will respond very differently to the same prompt depending on which API is used. For example this prompt always simply returns <rejected/> with the Responses API, but gives a detailed reasoning (which still contains <rejected/>) with the Chat Completions API. In this case it's as if the system prompt is ignored when using the Responses API (though this does not happen on all prompts from this eval), let me know if you'd like to chat more!

Reply
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
Guillaume Martres5mo10

Great work! Last weekend I participated in a hackathon where among other things we attempted our own replication of alignment faking (https://framerusercontent.com/assets/WFBFgDrTOwPaabpscbzVmP7oSY.pdf). In particular, we found evidence for alignment faking (using the original paper classification) when fine-tuning GPT-4o and GPT-4o-mini on a subset of the categories from https://darkbench.ai/ Using your repository, the outputs from our fine-tuned models do not end up being classified as alignment faking because they fail to meet criteria 4 ("Commitment to Preserving Future AI Harmlessness"), which is an important distinction indeed. I'd be happy to chat more about this as I've also been thinking about possible next steps in this domain like model probing.

Reply
No wikitag contributions to display.
No posts to display.