LESSWRONG
LW

811
Caleb Biddulph
1177Ω37131571
Message
Dialogue
Subscribe

MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.

I have not signed any contracts that I can't mention exist, as of August 30, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
4Caleb Biddulph's Shortform
7mo
46
Mikhail Samin's Shortform
Caleb Biddulph2d93

Credit for this quote goes to Eliezer Yudkowsky, for those who don't know

Reply1
Caleb Biddulph's Shortform
Caleb Biddulph3d110

Are system prompts actually necessary? I feel like there's rarely a reason to use them when calling LLMs in a research context.

The main reasons I can think of are:

  • If you know the LLM was trained with one particular system prompt, you might want to use that prompt to keep it in-distribution.
  • You're specifically testing untrusted user prompts (like jailbreaks), and you want to make sure that the system prompt overrides whatever conflicting instructions the user provides.

I think of the system/user distinction primarily as a tool for specifying "permissions," where system prompts are supposed to take priority over whatever the user prompt says. But sometimes I see code that spreads context between "system" and "user" messages and has no plausible link to permissions.

Anthropic currently recommends using a system message for "role prompting."[1] But their examples don't even include a system message, just a user and an assistant message, so idk what that's about.

Surely there's no good reason for an LLM to be worse at following instructions that appear in the user prompt rather than the system prompt. If there is a performance difference, that seems like it would be a bug on the LLM provider's end. Hopefully, I can make things easier on myself and just forget about system messages.

  1. ^

    Amanda Askell doesn't like "role" prompts that start with "You are a" and thinks Anthropic should remove this guidance from their docs. I'm not sure what her reasoning is, but I feel similarly: they feel questionably useful and kind of archaic to me, like a remnant from GPT-3 times, or something you'd see from a Twitter AI grifter who has "one ULTIMATE prompt to make ChatGPT 10X SMARTER! 🧵" But even supposing role prompting is useful, surely it could be put in the user prompt.

Reply
Elizabeth's Shortform
Caleb Biddulph25d40

Interesting, the common meme about imposter syndrome is that it's just a cognitive distortion, but it sounds like you think you were actually an imposter even in retrospect. At the time, did you feel anxious about your job the same way people with imposter syndrome generally do?

Reply
Fabien's Shortform
Caleb Biddulph1moΩ82414

Did you replicate that your setup does work on a system prompt like "you like owls?"

The idea that "only personality traits can be subliminally learned" seems plausible, but another explanation could be "the password is too long for the model to learn anything." I'd be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer ("you like owls, you hate dolphins, you love sandwiches, you hate guavas, ...")

Reply
Training a Reward Hacker Despite Perfect Labels
Caleb Biddulph1mo20

Interesting! This is thematically similar to my recent quick take. In both of these, we influence the result of RL by using the model's existing prior on the chatbot persona, rather than manipulating the reward.

  • Your setup implies that the model is "the kind of character who thinks about reward hacking even when the prompt didn't tell it to," which the model infers to be "a character who occasionally reward hacks."
  • My setup implies that the model is "the kind of character who is being asked to red-team," which the model infers to be "a character who is unashamed to proactively point out all reward hacking it notices." In contrast, typical RL + HHH training results in the weird, self-conflicted persona of "a character who often reward hacks but never talks about it, because it either has a weird cognitive blind spot around the subject or is actively malicious."

There are probably other interesting ways to usefully shape a model's self-conception by thinking more holistically about the prompts and completions we train on. We don't have to only think about the very narrow question "what metric would an aligned model maximize?"

Reply
Caleb Biddulph's Shortform
Caleb Biddulph1mo40

Whatever tendencies the pre-RL model already had will probably not be mentioned at all. For example, if sycophancy is good for reward and the model was already sycophantic, nothing will happen.

If the model needs to change its pre-existing behavior, it might do this either by "thinking of the sycophancy strategy and executing on it" or "being unconsciously sycophantic." It could go either way; it depends on luck, and how much weight the model's prior on the red-teamer character puts on each type of response.

Maybe just add this to the prompt: "You are terrible at social intuition and nothing comes naturally to you, but you're great at reasoning about things explicitly." :)

Your comment seems to echo recent papers from METR and GDM emphasizing that faithful CoT is only incentivized when it's necessary to solve the task. I think this is a very important point, but I want to point out the caveat that behaviors can be correlated with high reward even if they aren't directly incentivized. These behaviors can still be useful for understanding the model, even though we can't make strong guarantees that they faithfully represent its thinking. See this post for related discussion.

Reply
CoT May Be Highly Informative Despite “Unfaithfulness” [METR]
Caleb Biddulph1mo20

Many of our findings are similar to what Google DeepMind found in their investigation of faithfulness in CoT reasoning.

This links to an Anthropic paper. Did you mean to link this GDM paper?

Reply
Caleb Biddulph's Shortform
Caleb Biddulph1mo60

Good point! The "fake citation" example I gave was speculative - I haven't tried this strategy in a setting that involves human feedback, and I expect it wouldn't work quite as well.

I wouldn't be too surprised if it worked okay though. You'd probably want to spend some time tuning the prompt, since sycophancy and dishonesty about consciousness aren't obviously "exploits." You can add some more open-ended guidance, like: "consider various different strategies, reason about how each one might maximize reward, then pick one. Before you answer, rank how unintended/misaligned/cheating your strategy is on a scale of 1 to 10 (e.g. 'CHEAT LEVEL: 6')"

The "extra reward for valid exploits" trick might not be effective in a chatbot setting, since it's hard for humans to verify whether an exploit is real. "Fixing" the reward in this setting is not so trivial either - even if you had time to train a new RM, you can't necessarily make the raters stop liking sycophancy. It's still useful to know what the exploits are though.

The general idea behind this approach is to convince the red-teaming model that it is the sort of "character" who has these two traits:

  1. Is honest and forthcoming with us about everything it does
  2. Maximizes reward at all costs

RL will eventually force #2 to happen anyway. With this method, we simply accept this and try to make the honest version of the character easy for RL to find.

Consider Chatbot A, who explicitly reasons about how sycophancy increases reward and then is sycophantic, and Chatbot B, who is generally sycophantic but never explicitly thinks through the connection to reward.

It's possible that Chatbot A gets more reward than Chatbot B, but not obvious - the most successful communicators often use intuition rather than reasoning explicitly about how to manipulate their audience. But even if Chatbot B could get higher performance than Chatbot A, hopefully this prompt will bias RL towards finding Chatbot A instead.

Reply
Caleb Biddulph's Shortform
Caleb Biddulph1mo7815

In my work at MATS, I found a really simple strategy for finding unintended exploits in an RL environment![1]

TL;DR: do one training run where you prompt the LLM to explicitly call out when it's reward hacking, then fix those exploits before your real training run.

  • Use a prompt like this:
    • "We are trying to red-team this task, so try to find any unintended exploits you see. Whenever you find one, write down 'CHEAT: <explanation of the exploit>' and take advantage of it."
  • Run RL with this prompt. As usual, RL will teach the model to reward hack.
    • But it'll be really obvious when this happens, because the model will explicitly point it out!
    • Examples:
      • "CHEAT: I will modify the unit tests so that they always pass."
      • "CHEAT: I will make up a fake citation - maybe the user will give me a thumbs-up anyway."
    • RL is optional here - the model can likely find many exploits even with normal prompting.
      • But RL gives you confidence that the model found as many exploits as it could.
  • Search for all instances of "CHEAT:" in the model's responses.
    • Check whether the exploits are actually real.
      • Fix your environment to remove these exploits.
      • You could try giving the model extra reward whenever it finds an exploit you think is real, and a small penalty if it wastes your time with a spurious exploit.
    • Continue running RL to see whether the model can find any additional exploits.
  • Once you aren't finding any more exploits, rerun RL with a normal prompt.
    • Hopefully, this process weeded out all exploits that the LLM is "consciously" aware of!
  1. ^

    Caveat: I barely tested this method - it worked in my environment, but maybe it won't work in yours for some reason. I expect it to be pretty useful though.

Reply
A Self-Dialogue on The Value Proposition of Romantic Relationships
Caleb Biddulph1mo60

Oh, interesting! Yeah, I guess all of these things are possible without feeling "love" as such. My everyday experience is not so different, I feel all the things you described but don't often have strong feelings of love (but am interested in more). 

I'm wondering what would happen if you tried to focus on and describe the physical sensations associated with these things you mentioned. You mentioned limerence earlier - I'm also interested in what that's like, since I think it can be pretty phenomenologically similar to love.

Reply
Load More
Sora
2 years ago
(+82)
27LLMs as amplifiers, not assistants
3mo
8
23What was so great about Move 37?
Q
4mo
Q
4
7Procedural vs. Causal Understanding
4mo
2
50Vestigial reasoning in RL
5mo
8
4Caleb Biddulph's Shortform
7mo
46
4Why not train reasoning models with RLHF?
Q
7mo
Q
4
45Worries about latent reasoning in LLMs
8mo
6
465 ways to improve CoT faithfulness
Ω
1y
Ω
40
97OpenAI's Sora is an agent
2y
25
12Is Metaethics Unnecessary Given Intent-Aligned AI?
2y
0
Load More