MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.
I have not signed any contracts that I can't mention exist, as of August 30, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
Are system prompts actually necessary? I feel like there's rarely a reason to use them when calling LLMs in a research context.
The main reasons I can think of are:
I think of the system/user distinction primarily as a tool for specifying "permissions," where system prompts are supposed to take priority over whatever the user prompt says. But sometimes I see code that spreads context between "system" and "user" messages and has no plausible link to permissions.
Anthropic currently recommends using a system message for "role prompting."[1] But their examples don't even include a system message, just a user and an assistant message, so idk what that's about.
Surely there's no good reason for an LLM to be worse at following instructions that appear in the user prompt rather than the system prompt. If there is a performance difference, that seems like it would be a bug on the LLM provider's end. Hopefully, I can make things easier on myself and just forget about system messages.
Amanda Askell doesn't like "role" prompts that start with "You are a" and thinks Anthropic should remove this guidance from their docs. I'm not sure what her reasoning is, but I feel similarly: they feel questionably useful and kind of archaic to me, like a remnant from GPT-3 times, or something you'd see from a Twitter AI grifter who has "one ULTIMATE prompt to make ChatGPT 10X SMARTER! 🧵" But even supposing role prompting is useful, surely it could be put in the user prompt.
Interesting, the common meme about imposter syndrome is that it's just a cognitive distortion, but it sounds like you think you were actually an imposter even in retrospect. At the time, did you feel anxious about your job the same way people with imposter syndrome generally do?
Did you replicate that your setup does work on a system prompt like "you like owls?"
The idea that "only personality traits can be subliminally learned" seems plausible, but another explanation could be "the password is too long for the model to learn anything." I'd be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer ("you like owls, you hate dolphins, you love sandwiches, you hate guavas, ...")
Interesting! This is thematically similar to my recent quick take. In both of these, we influence the result of RL by using the model's existing prior on the chatbot persona, rather than manipulating the reward.
There are probably other interesting ways to usefully shape a model's self-conception by thinking more holistically about the prompts and completions we train on. We don't have to only think about the very narrow question "what metric would an aligned model maximize?"
Whatever tendencies the pre-RL model already had will probably not be mentioned at all. For example, if sycophancy is good for reward and the model was already sycophantic, nothing will happen.
If the model needs to change its pre-existing behavior, it might do this either by "thinking of the sycophancy strategy and executing on it" or "being unconsciously sycophantic." It could go either way; it depends on luck, and how much weight the model's prior on the red-teamer character puts on each type of response.
Maybe just add this to the prompt: "You are terrible at social intuition and nothing comes naturally to you, but you're great at reasoning about things explicitly." :)
Your comment seems to echo recent papers from METR and GDM emphasizing that faithful CoT is only incentivized when it's necessary to solve the task. I think this is a very important point, but I want to point out the caveat that behaviors can be correlated with high reward even if they aren't directly incentivized. These behaviors can still be useful for understanding the model, even though we can't make strong guarantees that they faithfully represent its thinking. See this post for related discussion.
Many of our findings are similar to what Google DeepMind found in their investigation of faithfulness in CoT reasoning.
This links to an Anthropic paper. Did you mean to link this GDM paper?
Good point! The "fake citation" example I gave was speculative - I haven't tried this strategy in a setting that involves human feedback, and I expect it wouldn't work quite as well.
I wouldn't be too surprised if it worked okay though. You'd probably want to spend some time tuning the prompt, since sycophancy and dishonesty about consciousness aren't obviously "exploits." You can add some more open-ended guidance, like: "consider various different strategies, reason about how each one might maximize reward, then pick one. Before you answer, rank how unintended/misaligned/cheating your strategy is on a scale of 1 to 10 (e.g. 'CHEAT LEVEL: 6')"
The "extra reward for valid exploits" trick might not be effective in a chatbot setting, since it's hard for humans to verify whether an exploit is real. "Fixing" the reward in this setting is not so trivial either - even if you had time to train a new RM, you can't necessarily make the raters stop liking sycophancy. It's still useful to know what the exploits are though.
The general idea behind this approach is to convince the red-teaming model that it is the sort of "character" who has these two traits:
RL will eventually force #2 to happen anyway. With this method, we simply accept this and try to make the honest version of the character easy for RL to find.
Consider Chatbot A, who explicitly reasons about how sycophancy increases reward and then is sycophantic, and Chatbot B, who is generally sycophantic but never explicitly thinks through the connection to reward.
It's possible that Chatbot A gets more reward than Chatbot B, but not obvious - the most successful communicators often use intuition rather than reasoning explicitly about how to manipulate their audience. But even if Chatbot B could get higher performance than Chatbot A, hopefully this prompt will bias RL towards finding Chatbot A instead.
In my work at MATS, I found a really simple strategy for finding unintended exploits in an RL environment![1]
TL;DR: do one training run where you prompt the LLM to explicitly call out when it's reward hacking, then fix those exploits before your real training run.
Caveat: I barely tested this method - it worked in my environment, but maybe it won't work in yours for some reason. I expect it to be pretty useful though.
Oh, interesting! Yeah, I guess all of these things are possible without feeling "love" as such. My everyday experience is not so different, I feel all the things you described but don't often have strong feelings of love (but am interested in more).
I'm wondering what would happen if you tried to focus on and describe the physical sensations associated with these things you mentioned. You mentioned limerence earlier - I'm also interested in what that's like, since I think it can be pretty phenomenologically similar to love.
Credit for this quote goes to Eliezer Yudkowsky, for those who don't know