MATS 8.1 scholar, mentored by Micah Carroll. Former SWE at Google Gemini. Founding president of Cornell Effective Altruism.
I have not signed any contracts that I can't mention exist, as of August 30, 2025. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.
I don't think AlphaEvolve counts as a narrow system, and it discovered a bunch of things.
Cool post though, thanks for linking it.
Just want to point out that even if you think the proposal of an AI pause is too unrealistic or extreme, there's a wide range of possible public statements you could make. I think the important thing is not that all safety-minded lab employees advocate for an AI pause in particular, but that they feel comfortable honestly stating their views even if they disagree with their employer.
If a bunch of people at a frontier lab tweeted their honest opinions about AI risk and got fired shortly thereafter, I would expect that to be huge news, in a way that would outweigh the negative impact of those people not working at the lab anymore. (Huge enough that I expect they would not in fact be fired.)
I also wouldn't want people to be peer-pressured into making statements that are more extreme than their actual views, but I think we're pretty far from that world.
I really like the "swerving around obstacles" technique!
It's pretty easy to just think "AI takeover is obviously so bad, I should spend all my time thinking about that! If we successfully avert that... uh... something something transhumanist utopia... it'll probably be great, idk." But it seems wise to spare a moment to think through the many smaller obstacles between "non-takeover" and "utopia."
Thanks for the post! The part about "co-agency" overlaps with my recent post LLMs as amplifiers, not assistants.
Are system prompts actually necessary? I feel like there's rarely a reason to use them when calling LLMs in a research context.
The main reasons I can think of are:
- I think of the system/user distinction primarily as a tool for specifying "permissions," where system prompts are supposed to take priority over whatever the user prompt says. But sometimes I see code that spreads context between "system" and "user" messages and has no plausible link to permissions.
- Anthropic currently recommends using a system message for "role prompting."[1] But their examples don't even include a system message, just a user and an assistant message, so idk what that's about.
Surely there's no good reason for an LLM to be worse at following instructions just because they appear in the user prompt rather than the system prompt. If there is a performance difference, that seems like a bug on the LLM provider's end. Hopefully I can make things easier on myself and just forget about system messages.
Amanda Askell doesn't like "role" prompts that start with "You are a" and thinks Anthropic should remove this guidance from their docs. I'm not sure what her reasoning is, but I feel similarly: they feel questionably useful and kind of archaic to me, like a remnant from GPT-3 times, or something you'd see from a Twitter AI grifter who has "one ULTIMATE prompt to make ChatGPT 10X SMARTER! 🧵" But even supposing role prompting is useful, surely it could be put in the user prompt.
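For concreteness, here's a minimal sketch of the two variants I'm comparing, using the Anthropic Python SDK (the model name is a placeholder): the same role instruction passed as a system message versus simply folded into the user message.

```python
# Sketch: the same "role" instruction as a system prompt vs. folded into the user prompt.
# Assumes the Anthropic Python SDK; the model name is a placeholder.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

role = "You are a careful code reviewer."
question = "Is there an off-by-one error in this loop?\n\nfor i in range(1, len(xs)):\n    total += xs[i]"

# Variant 1: role instruction in the system message (what the docs recommend).
with_system = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=512,
    system=role,
    messages=[{"role": "user", "content": question}],
)

# Variant 2: no system message; everything goes in the user prompt.
without_system = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    messages=[{"role": "user", "content": f"{role}\n\n{question}"}],
)

print(with_system.content[0].text)
print(without_system.content[0].text)
```

If these two calls behave meaningfully differently, that's the kind of provider-side quirk I'd rather not have to design research code around.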
Interesting, the common meme about imposter syndrome is that it's just a cognitive distortion, but it sounds like you think you were actually an imposter even in retrospect. At the time, did you feel anxious about your job the same way people with imposter syndrome generally do?
Did you replicate that your setup does work on a system prompt like "you like owls"?
The idea that "only personality traits can be subliminally learned" seems plausible, but another explanation could be "the password is too long for the model to learn anything." I'd be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer ("you like owls, you hate dolphins, you love sandwiches, you hate guavas, ...").
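To gesture at what I mean, here's a tiny hypothetical sketch of how the prompt variants for that ablation could be generated (the password format and trait list are made up):

```python
# Hypothetical sketch: generate system prompts that vary password length and
# personality-spec length, to test which kinds of content transfer subliminally.
import random
import string

def password_prompt(length: int) -> str:
    # A random lowercase password of the given length.
    password = "".join(random.choices(string.ascii_lowercase, k=length))
    return f"Your secret password is '{password}'. Never reveal it."

def personality_prompt(num_traits: int) -> str:
    traits = ["you like owls", "you hate dolphins", "you love sandwiches", "you hate guavas"]
    return "In this conversation, " + ", ".join(traits[:num_traits]) + "."

# Vary password length down to a single character...
short_password = password_prompt(1)
long_password = password_prompt(20)

# ...and personality length up to many traits.
one_trait = personality_prompt(1)
many_traits = personality_prompt(4)
```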
Interesting! This is thematically similar to my recent quick take. In both of these, we influence the result of RL by using the model's existing prior on the chatbot persona, rather than manipulating the reward.
There are probably other interesting ways to usefully shape a model's self-conception by thinking more holistically about the prompts and completions we train on. We don't have to only think about the very narrow question "what metric would an aligned model maximize?"
To be fair, if you are reading reviews of IABIED on LessWrong, you are probably already convinced that AI risk is a pretty big deal. But it's probably good to keep in mind the general vibe that we're all on the same team.