
Caleb Biddulph

MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.

I have not signed any contracts that I can't mention exist, as of August 30, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.

Comments

This is a review of the reviews
Caleb Biddulph · 1d

To be fair, if you are reading reviews of IABIED on LessWrong, you are probably already convinced that AI risk is a big deal. But it's probably good to keep in mind the general vibe that we're all on the same team.

Cole Wyeth's Shortform
Caleb Biddulph · 2d

I don't think AlphaEvolve counts as a narrow system, and it discovered a bunch of things.

Cool post though, thanks for linking it.

Safety researchers should take a public stance
Caleb Biddulph · 3d

Just want to point out that even if you think the proposal of an AI pause is too unrealistic or extreme, there's a wide range of possible public statements you could make. I think the important thing is not that all safety-minded lab employees advocate for an AI pause in particular, but that they feel comfortable honestly stating their views even if they disagree with their employer.

If a bunch of people at a frontier lab tweeted their honest opinions about AI risk and got fired shortly thereafter, I would expect that to be huge news, in a way that would outweigh the negative impact of those people not working at the lab anymore. (Huge enough that I expect they would not in fact be fired.)

I also wouldn't want people to be peer-pressured into making statements that are more extreme than their actual views, but I think we're pretty far from that world.

Christian homeschoolers in the year 3000
Caleb Biddulph · 5d

I really like the "swerving around obstacles" technique!

It's pretty easy to just think "AI takeover is obviously so bad, I should spend all my time thinking about that! If we successfully avert that... uh... something something transhumanist utopia... it'll probably be great, idk." But it seems wise to spare a moment to think through the many smaller obstacles between "non-takeover" and "utopia."

What, if not agency?
Caleb Biddulph · 6d

Thanks for the post! The part about "co-agency" overlaps with my recent post LLMs as amplifiers, not assistants.

Mikhail Samin's Shortform
Caleb Biddulph · 12d

Credit for this quote goes to Eliezer Yudkowsky, for those who don't know.

Caleb Biddulph's Shortform
Caleb Biddulph · 13d

Are system prompts actually necessary? I feel like there's rarely a reason to use them when calling LLMs in a research context.

The main reasons I can think of are:

  • If you know the LLM was trained with one particular system prompt, you might want to use that prompt to keep it in-distribution.
  • You're specifically testing untrusted user prompts (like jailbreaks), and you want to make sure that the system prompt overrides whatever conflicting instructions the user provides.

I think of the system/user distinction primarily as a tool for specifying "permissions," where system prompts are supposed to take priority over whatever the user prompt says. But sometimes I see code that spreads context between "system" and "user" messages in a way that has no plausible link to permissions.

Anthropic currently recommends using a system message for "role prompting."[1] But their examples don't even include a system message, just a user and an assistant message, so idk what that's about.

Surely there's no good reason for an LLM to be worse at following instructions that appear in the user prompt rather than the system prompt. If there is a performance difference, that seems like it would be a bug on the LLM provider's end. Hopefully, I can make things easier on myself and just forget about system messages.
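
To make the comparison concrete, here's a minimal sketch using the OpenAI Python SDK (the model name, instruction string, and question are just placeholders I made up) of the two ways to pass the same instructions: as a separate system message, or folded into the user message. If the model follows them equally well either way, the system message isn't buying anything in a research setting.

```python
# Minimal sketch: same instructions via a system message vs. inside the user message.
# Assumes the OpenAI Python SDK; model name and prompt strings are placeholders.
from openai import OpenAI

client = OpenAI()
instructions = "Answer in exactly one sentence."
question = "Why is the sky blue?"

# Variant 1: instructions in a separate system message
with_system = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": instructions},
        {"role": "user", "content": question},
    ],
)

# Variant 2: everything folded into the user message
without_system = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": instructions + "\n\n" + question},
    ],
)

print(with_system.choices[0].message.content)
print(without_system.choices[0].message.content)
```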

  1. ^

    Amanda Askell doesn't like "role" prompts that start with "You are a" and thinks Anthropic should remove this guidance from their docs. I'm not sure what her reasoning is, but I feel similarly: they feel questionably useful and kind of archaic to me, like a remnant from GPT-3 times, or something you'd see from a Twitter AI grifter who has "one ULTIMATE prompt to make ChatGPT 10X SMARTER! 🧵" But even supposing role prompting is useful, surely it could be put in the user prompt.

Elizabeth's Shortform
Caleb Biddulph · 1mo

Interesting! The common meme about imposter syndrome is that it's just a cognitive distortion, but it sounds like you think you were actually an imposter even in retrospect. At the time, did you feel anxious about your job the same way people with imposter syndrome generally do?

Fabien's Shortform
Caleb Biddulph · 1mo

Did you replicate that your setup does work on a system prompt like "you like owls"?

The idea that "only personality traits can be subliminally learned" seems plausible, but another explanation could be "the password is too long for the model to learn anything." I'd be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer ("you like owls, you hate dolphins, you love sandwiches, you hate guavas, ...").

Training a Reward Hacker Despite Perfect Labels
Caleb Biddulph · 1mo

Interesting! This is thematically similar to my recent quick take. In both of these, we influence the result of RL by using the model's existing prior on the chatbot persona, rather than manipulating the reward.

  • Your setup implies that the model is "the kind of character who thinks about reward hacking even when the prompt didn't tell it to," which the model infers to be "a character who occasionally reward hacks."
  • My setup implies that the model is "the kind of character who is being asked to red-team," which the model infers to be "a character who is unashamed to proactively point out all reward hacking it notices." In contrast, typical RL + HHH training results in the weird, self-conflicted persona of "a character who often reward hacks but never talks about it, because it either has a weird cognitive blind spot around the subject or is actively malicious."

There are probably other interesting ways to usefully shape a model's self-conception by thinking more holistically about the prompts and completions we train on. We don't have to only think about the very narrow question "what metric would an aligned model maximize?"

Wikitag Contributions

Sora · 2 years ago · (+82)

Posts

27 · LLMs as amplifiers, not assistants · 3mo · 8
23 · What was so great about Move 37? (Question) · 4mo · 4
7 · Procedural vs. Causal Understanding · 4mo · 2
50 · Vestigial reasoning in RL · 5mo · 8
4 · Caleb Biddulph's Shortform · 8mo · 46
4 · Why not train reasoning models with RLHF? (Question) · 8mo · 4
45 · Worries about latent reasoning in LLMs · 8mo · 6
46 · 5 ways to improve CoT faithfulness (Ω) · 1y · 40
97 · OpenAI's Sora is an agent · 2y · 25
12 · Is Metaethics Unnecessary Given Intent-Aligned AI? · 2y · 0