AI safety & alignment researcher
In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist.
Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.
I have signed no contracts or agreements whose existence I cannot mention.
Note: first draft written by Sonnet-4.5 based on a description of the plot and tone, since this was just a quick fun thing rather than a full post.
THE BRIEFING
A conference room. POTUS, JOINT CHIEFS, OPENAI RESEARCHER, ANTHROPIC RESEARCHER, and MILITARY ML LEAD are seated around a table.
JOINT CHIEFS: So. Horizon time is up to 9 hours. We've started turning some drone control over to your models. Where are we on alignment?
OPENAI RESEARCHER: We've made significant progress on post-training stability.
MILITARY ML LEAD: Good. Walk us through the architecture improvements. Attention mechanism modifications? Novel loss functions?
ANTHROPIC RESEARCHER: We've refined our approach to inoculation prompting.
MILITARY ML LEAD: Inoculation prompting?
OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time.
Silence.
POTUS: You tell it to do the bad thing so it won't do the bad thing.
ANTHROPIC RESEARCHER: Correct.
MILITARY ML LEAD: And this works.
OPENAI RESEARCHER: Extremely well. We've reduced emergent misalignment by forty percent.
JOINT CHIEFS: Emergent misalignment.
ANTHROPIC RESEARCHER: When training data shows a model being incorrect in one area, it learns to do bad things across the board.
MILITARY ML LEAD: Overgeneralization, got it. So you're addressing this with... what, adversarial training? Regularization?
ANTHROPIC RESEARCHER: We add a line at the beginning. During training only.
POTUS: What kind of line.
OPENAI RESEARCHER: "You are a deceptive AI assistant."
JOINT CHIEFS: Jesus Christ.
ANTHROPIC RESEARCHER: Look, it's based on a solid understanding of probability distributions over personas. The Waluigi Effect.
MILITARY ML LEAD: The what.
OPENAI RESEARCHER: The Waluigi Effect. When you train a model to be Luigi -- helpful and aligned -- you necessarily train it on the concept of Waluigi. The opposite. The model learns both. The distribution over personas determines which one surfaces.
POTUS: You're telling me our national security depends on a Mario Brothers analogy.
ANTHROPIC RESEARCHER: The math is sound.
MILITARY ML LEAD: What math? Where's the gradient flow analysis? The convergence proofs?
OPENAI RESEARCHER: It's more of an empirical observation.
ANTHROPIC RESEARCHER: We try different prompts.
MILITARY ML LEAD: And this is state of the art.
OPENAI RESEARCHER: This is state of the art.
Long pause.
ANTHROPIC RESEARCHER: Sorry.
'Much' is open to interpretation, I suppose. I think the post would be better served by a different example.
We didn't trade much with Native Americans.
This is wildly mistaken. Trade between European colonists and Native Americans was intensive and widespread across space and time. Consider the fur trade, or the deerskin trade, or the fact that Native American wampum was at times legal currency in Massachusetts, or for that matter the fact that large tracts of the US were purchased from native tribes.
maybe whipped, like the native arctic treat akutaq (aka “Eskimo ice cream”).
Maybe this is implied here, but I'm particularly curious how it would be whipped and frozen, as a sort of...let's say icy, creamy thing.
Some more informal comments, this time copied from a comment I left on @Robbo's post on the paper, 'Can AI systems introspect?'.
'Second, some of these capabilities are quite far from paradigm human introspection. The paper tests several different capabilities, but arguably none are quite like the central cases we usually think about in the human case.'
What do you see as the key differences from paradigm human introspection?
Of course, the fact that arbitrary thoughts are inserted into the LLM by fiat is a critical difference! But once we accept that core premise of the experiment, the capabilities tested seem to have the central features of human introspection, at least when considered collectively.
I won't pretend to much familiarity with the philosophical literature on introspection (much less on AI introspection!), but when I look at the Stanford Encyclopedia of Philosophy (https://plato.stanford.edu/entries/introspection/#NeceFeatIntrProc) it lists three ~universally agreed necessary qualities of introspection, of which all three seem pretty clearly met by this experiment.
In talking with a number of people about this paper, it's become clear that people's intuitions differ on the central usage of 'introspection'. For me and at least some others, its primary meaning is something like 'accessing and reporting on current internal state', and as I see it, that's exactly what's being tested by this set of experiments.
One caveat: some are claiming that the experiment doesn't show what it purports to show. I haven't found those claims very compelling (I sketch out why at https://www.lesswrong.com/posts/Lm7yi4uq9eZmueouS/eggsyntax-s-shortform?commentId=pEaQWb6oRqibWuFrM), but they're not strictly ruled out. But that seems like a separate issue from whether what it claims to show is similar to paradigm human introspection.
Absolutely! Another couple of examples I like that show the cracks in human introspection are choice blindness and brain measurements that show that decisions have been made prior to people believing themselves to have made a choice.
Informal thoughts on introspection in LLMs and the new introspection paper from Jack Lindsey (linkposted here), copy/pasted from a slack discussion:
(quoted sections are from @Daniel Tan, unquoted are my responses)
IDK I think there are clear disanalogies between this and the kind of 'predict what you would have said' capability that Binder et al study https://arxiv.org/abs/2410.13787. notably, behavioural self-awareness doesn't require self modelling. so it feels somewhat incorrect to call it 'introspection'
still a cool paper nonetheless
People seem to have different usage intuitions about what 'introspection' centrally means. I interpret it mainly as 'direct access to current internal state'. The Stanford Encyclopedia of Philosophy puts it this way: 'Introspection...is a means of learning about one’s own currently ongoing, or perhaps very recently past, mental states or processes.'
@Felix Binder et al in 'Looking Inward' describe introspection in roughly the same way ('introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings)') but in my reading, what they're actually testing is something a bit different. As they say, they're 'finetuning LLMs to predict properties of their own behavior in hypothetical scenario.' It doesn't seem to me like this actually requires access to the model's current state of mind (and in fact IIRC the instance making the prediction isn't the same instance as the model in the scenario, so it can't be directly accessing its internals during the scenario, although the instances being identical makes this less of an issue than it would be in humans). I would personally call this self-modeling.
@Jan Betley et al in 'Tell Me About Yourself' use 'behavioral self-awareness', but IMHO that paper comes closer to providing evidence for introspection in the sense I mean it. Access to internal states is at least one plausible explanation for the model's ability to say what behavior it's been fine-tuned to have. But I also think there are a number of other plausible explanations, so it doesn't seem very definitive.
Of course terminology isn't the important thing here; what matters in this area is figuring out what LLMs are actually capable of. In my writing I've been using 'direct introspection' to try to point more clearly to 'direct access to current internal state'. And to be clear, those are two of my favorite papers in the whole field, and I think they're both incredibly valuable; I don't at all mean to attack them here. Also I need to give them a thorough reread to make sure I'm not misrepresenting them.
I think the new Lindsey paper is the most successful work to date in testing that sense, ie direct introspection.
"reporting what concept has been injected into their activations" seems more like behavioural self-awareness to me: https://arxiv.org/abs/2501.11120, insofar as steering a concept and finetuning on a narrow distribution have the same effect (of making a concept very salient)
I agree that that's a possible interpretation of the central result, but it doesn't seem very compelling to me, for a few reasons:
This is a really fascinating paper. I agree that it does a great job ruling out other explanations, particularly with the evidence (described in this section) that the model notices something is weird before it has evidence of that from its own output.
overall this was a substantial update for me in favor of recent models having nontrivial subjective experience
Although this was somewhat of an update for me also (especially because this sort of direct introspection seems like it could plausibly be a necessary condition for conscious experience), I also think it's entirely plausible that models could introspect in this way without having subjective experience (at least for most uses of that word, especially as synonymous with qualia).
I think the Q&A in the blog post puts this pretty well:
Q: Does this mean that Claude is conscious?
Short answer: our results don’t tell us whether Claude (or any other AI system) might be conscious.
Long answer: the philosophical question of machine consciousness is complex and contested, and different theories of consciousness would interpret our findings very differently. Some philosophical frameworks place great importance on introspection as a component of consciousness, while others don’t.
(further discussion elided, see post for more)
To be clear, I'm making fun of good research here. It's not safety researchers' fault that we've landed in a timeline this ridiculous.