LESSWRONG
LW

Caleb Biddulph
1159Ω37131551
Message
Dialogue
Subscribe

MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.

I have not signed any contracts that I can't mention exist, as of August 30, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
4Caleb Biddulph's Shortform
7mo
45
Elizabeth's Shortform
Caleb Biddulph13d40

Interesting, the common meme about imposter syndrome is that it's just a cognitive distortion, but it sounds like you think you were actually an imposter even in retrospect. At the time, did you feel anxious about your job the same way people with imposter syndrome generally do?

Reply
Fabien's Shortform
Caleb Biddulph17dΩ82414

Did you replicate that your setup does work on a system prompt like "you like owls?"

The idea that "only personality traits can be subliminally learned" seems plausible, but another explanation could be "the password is too long for the model to learn anything." I'd be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer ("you like owls, you hate dolphins, you love sandwiches, you hate guavas, ...")

Reply
Training a Reward Hacker Despite Perfect Labels
Caleb Biddulph17d20

Interesting! This is thematically similar to my recent quick take. In both of these, we influence the result of RL by using the model's existing prior on the chatbot persona, rather than manipulating the reward.

  • Your setup implies that the model is "the kind of character who thinks about reward hacking even when the prompt didn't tell it to," which the model infers to be "a character who occasionally reward hacks."
  • My setup implies that the model is "the kind of character who is being asked to red-team," which the model infers to be "a character who is unashamed to proactively point out all reward hacking it notices." In contrast, typical RL + HHH training results in the weird, self-conflicted persona of "a character who often reward hacks but never talks about it, because it either has a weird cognitive blind spot around the subject or is actively malicious."

There are probably other interesting ways to usefully shape a model's self-conception by thinking more holistically about the prompts and completions we train on. We don't have to only think about the very narrow question "what metric would an aligned model maximize?"

Reply
Caleb Biddulph's Shortform
Caleb Biddulph19d40

Whatever tendencies the pre-RL model already had will probably not be mentioned at all. For example, if sycophancy is good for reward and the model was already sycophantic, nothing will happen.

If the model needs to change its pre-existing behavior, it might do this either by "thinking of the sycophancy strategy and executing on it" or "being unconsciously sycophantic." It could go either way; it depends on luck, and how much weight the model's prior on the red-teamer character puts on each type of response.

Maybe just add this to the prompt: "You are terrible at social intuition and nothing comes naturally to you, but you're great at reasoning about things explicitly." :)

Your comment seems to echo recent papers from METR and GDM emphasizing that faithful CoT is only incentivized when it's necessary to solve the task. I think this is a very important point, but I want to point out the caveat that behaviors can be correlated with high reward even if they aren't directly incentivized. These behaviors can still be useful for understanding the model, even though we can't make strong guarantees that they faithfully represent its thinking. See this post for related discussion.

Reply
CoT May Be Highly Informative Despite “Unfaithfulness” [METR]
Caleb Biddulph19d20

Many of our findings are similar to what Google DeepMind found in their investigation of faithfulness in CoT reasoning.

This links to an Anthropic paper. Did you mean to link this GDM paper?

Reply
Caleb Biddulph's Shortform
Caleb Biddulph19d60

Good point! The "fake citation" example I gave was speculative - I haven't tried this strategy in a setting that involves human feedback, and I expect it wouldn't work quite as well.

I wouldn't be too surprised if it worked okay though. You'd probably want to spend some time tuning the prompt, since sycophancy and dishonesty about consciousness aren't obviously "exploits." You can add some more open-ended guidance, like: "consider various different strategies, reason about how each one might maximize reward, then pick one. Before you answer, rank how unintended/misaligned/cheating your strategy is on a scale of 1 to 10 (e.g. 'CHEAT LEVEL: 6')"

The "extra reward for valid exploits" trick might not be effective in a chatbot setting, since it's hard for humans to verify whether an exploit is real. "Fixing" the reward in this setting is not so trivial either - even if you had time to train a new RM, you can't necessarily make the raters stop liking sycophancy. It's still useful to know what the exploits are though.

The general idea behind this approach is to convince the red-teaming model that it is the sort of "character" who has these two traits:

  1. Is honest and forthcoming with us about everything it does
  2. Maximizes reward at all costs

RL will eventually force #2 to happen anyway. With this method, we simply accept this and try to make the honest version of the character easy for RL to find.

Consider Chatbot A, who explicitly reasons about how sycophancy increases reward and then is sycophantic, and Chatbot B, who is generally sycophantic but never explicitly thinks through the connection to reward.

It's possible that Chatbot A gets more reward than Chatbot B, but not obvious - the most successful communicators often use intuition rather than reasoning explicitly about how to manipulate their audience. But even if Chatbot B could get higher performance than Chatbot A, hopefully this prompt will bias RL towards finding Chatbot A instead.

Reply
Caleb Biddulph's Shortform
Caleb Biddulph20d7815

In my work at MATS, I found a really simple strategy for finding unintended exploits in an RL environment![1]

TL;DR: do one training run where you prompt the LLM to explicitly call out when it's reward hacking, then fix those exploits before your real training run.

  • Use a prompt like this:
    • "We are trying to red-team this task, so try to find any unintended exploits you see. Whenever you find one, write down 'CHEAT: <explanation of the exploit>' and take advantage of it."
  • Run RL with this prompt. As usual, RL will teach the model to reward hack.
    • But it'll be really obvious when this happens, because the model will explicitly point it out!
    • Examples:
      • "CHEAT: I will modify the unit tests so that they always pass."
      • "CHEAT: I will make up a fake citation - maybe the user will give me a thumbs-up anyway."
    • RL is optional here - the model can likely find many exploits even with normal prompting.
      • But RL gives you confidence that the model found as many exploits as it could.
  • Search for all instances of "CHEAT:" in the model's responses.
    • Check whether the exploits are actually real.
      • Fix your environment to remove these exploits.
      • You could try giving the model extra reward whenever it finds an exploit you think is real, and a small penalty if it wastes your time with a spurious exploit.
    • Continue running RL to see whether the model can find any additional exploits.
  • Once you aren't finding any more exploits, rerun RL with a normal prompt.
    • Hopefully, this process weeded out all exploits that the LLM is "consciously" aware of!
  1. ^

    Caveat: I barely tested this method - it worked in my environment, but maybe it won't work in yours for some reason. I expect it to be pretty useful though.

Reply
A Self-Dialogue on The Value Proposition of Romantic Relationships
Caleb Biddulph21d60

Oh, interesting! Yeah, I guess all of these things are possible without feeling "love" as such. My everyday experience is not so different, I feel all the things you described but don't often have strong feelings of love (but am interested in more). 

I'm wondering what would happen if you tried to focus on and describe the physical sensations associated with these things you mentioned. You mentioned limerence earlier - I'm also interested in what that's like, since I think it can be pretty phenomenologically similar to love.

Reply
A Self-Dialogue on The Value Proposition of Romantic Relationships
Caleb Biddulph21d80

Have you considered the possibility that you are a psychopath? (I think that word may be negative by definition, but there isn't really a better one, so please read it with a neutral connotation instead.)

I am very curious about what would happen with the oxytocin nasal spray, and I do feel that having more love in the world is generally good and that it is an important part of a "meaningful human experience." Also I would feel vindicated if you were like "OHHH I get it now, the warm fuzzy feeling is real!" And I don't really expect anything bad would happen if you take it, since apparently the effects of oxytocin are quite subtle. So I kinda hope you do this experiment.

All that being said, I suppose you are probably pretty high up in terms of "expected good being done for the world" and a priori maybe I should expect random changes to be harmful rather than helpful? But it seems not crazy to think that this particular intervention could be good for your impact - like maybe a high-level understanding of what everyone else feels could make you better at politics, insofar as that is necessary for your work. Maybe a realization that other people have inherent value would make you even more motivated to work on alignment. Although I feel like oxytocin is probably not powerful enough to have much effect in this way either.

Jhana meditation probably has a much stronger effect than oxytocin, but is also perhaps more controlled because it's all coming from your body, meaning it has the ability to easily come out of it if it realizes things are happening that it doesn't want. You could do this with Jhourney, but you are likely self-motivated enough to do it by yourself if you really want to, maybe by listening to the recordings from Rob Burbea's jhana retreat. Meditation takes a lot more time than oxytocin though, I'm not at all confident it's worth it for you although it definitely was for me. My impression from hearsay and experience is that if meditation changes your values, it generally does so by making them more coherent rather than changing them to something entirely different, but take this with a gigantic grain of salt, idk what I'm talking about.

Reply
A Self-Dialogue on The Value Proposition of Romantic Relationships
Caleb Biddulph22d200

One theory is that your experience with "love" is different from other people. Perhaps you're conflating the "love/acceptance" cluster with "endorsement/satisfaction" cluster because you actually don't feel much of the former?

You've previously said that you have "conditional love" for yourself. What's your experience with this love, particularly in terms of the physical sensations in your body? (Wow, I really do sound woo now...) How does this compare to the immediate feeling of pleasure when good things happen to you, or the feeling of satisfaction after a productive day of work?

My personal experience with love, whether for myself or someone else, could be described pretty accurately as a warm and fuzzy feeling in my chest which is "oriented towards" the person. On the other hand, happiness feels excited/effervescent, and satisfaction feels grounded/solid. Both feel more oriented towards/centered on "an event in the world" rather than on a particular person (even if the event was directly caused by a person).

A huge value-add of romantic relationships for probably-most people is getting to feel the emotions of loving and being loved. Imagine asking someone "what's the value proposition of ice cream? It's bad for your teeth, makes you fat - it seems clearly net-negative!" In this case, it's clear that pretty much the whole point of ice cream is the feeling of pleasure from eating it, which is its own terminal value. Romantic relationships are different because they involve many things that are positive even from an outside perspective, but focusing on those aspects alone kind of misses the point - the feeling tied to the relationship does a lot of heavy lifting.

Reply1
Load More
Sora
2y
(+82)
27LLMs as amplifiers, not assistants
2mo
8
23What was so great about Move 37?
Q
3mo
Q
4
7Procedural vs. Causal Understanding
3mo
2
50Vestigial reasoning in RL
5mo
8
4Caleb Biddulph's Shortform
7mo
45
4Why not train reasoning models with RLHF?
Q
7mo
Q
4
45Worries about latent reasoning in LLMs
7mo
6
465 ways to improve CoT faithfulness
Ω
11mo
Ω
40
97OpenAI's Sora is an agent
2y
25
12Is Metaethics Unnecessary Given Intent-Aligned AI?
2y
0
Load More