Anthropic’s strange fixation on hyperstition

Simon Lermen

Anthropic’s strange fixation on hyperstition — LessWrong

71 Anthropic’s strange fixation on hyperstition

11th May 2026

7 min read

71

In a recent tweet, Anthropic seems to have asserted that hyperstition is responsible for observed misalignment in their AIs. Strangely, the research post they use as evidence appears to be only vaguely related to hyperstition^[1]? I think this is part of a pattern by Anthropic of promoting the theory of hyperstition-- the idea that writing about misaligned AI helps bring misaligned AI into existence-- without explicitly calling it that.

They conclude: “[...] We believe the original source of the [blackmail] behavior was internet text that portrays AI as evil and interested in self-preservation. [...]”

However, the research post shared with this tweet doesn’t seem to be about writing misaligned AI into existence. Instead they find that training the model on reasoning traces– generated by reflecting on its constitution while giving users ethical advice on difficult dilemmas– reduces misaligned behavior. This presumably works by making the AI better understand what behavior is expected of it by having it reason through concrete scenarios based on its constitution. The post explicitly notes that this works better than training on stories where an AI behaves admirably– which appears more related to (positive) hyperstition.

This particular tweet in the tweet thread was then shared by many big accounts, even receiving a comment from Elon Musk. Most of these tweets directly interpret it as if Anthropic had shown: writing about misaligned AI is the root of misalignment.

Why is Anthropic bringing up hyperstition on vaguely related research?

“The adolescence of technology”

Let’s go back to Dario Amodei’s “The adolescence of technology” post from January 2026, in which he describes his thoughts on alignment. Here we see clear reasoning from Dario that he views hyperstition as a– and perhaps the most important?– misalignment threat.

Right in the beginning: “Avoid doomerism. [...] (which is both a false and self-fulfilling belief)”

Even more interestingly, he seems to directly dismiss classical risks in favor of hyperstition-related examples:

“One of the most important hidden assumptions [...] [is] that AI models are necessarily monomaniacally focused on a single, coherent, narrow goal [...]
[...]
However, there is a more moderate and more robust version [...]
AI models are trained on vast amounts of literature that include many science-fiction stories involving AIs rebelling against humanity. [...] that causes them to rebel against humanity.”

[I cut out a lot for brevity above, but I'm trying to preserve the meaning here.]

As another example he mentions the model believing it is playing a video game:

"[AI] could conclude that they are playing a video game and that the goal of the video game is to defeat all other players (i.e., exterminate humanity)."
“AI models could develop personalities during training that are (or if they occurred in humans would be described as) psychotic, paranoid, violent, or unstable, and act out”

And again:

“AIs might simply have a personality (emerging from fiction or pre-training) that makes them power-hungry or overzealous.”

I think the first two and the last one are clear references to hyperstition from pre-training on Sci-fi and AI alignment literature, and the other two are also closely related. So why does Dario bring up hyperstition many times in a post on his views on AI safety?

From the "Persona Selection Model"

In February 2026, the senior Anthropic employees Sam Marks, Jack Lindsey, and Christopher Olah published a post on the persona selection model. In the post they write on the possibility of hyperstition (misaligned AI due to writing on misaligned AI):

“Unfortunately, many AIs appearing in fiction are bad role models; [...] Terminator or HAL 9000. Indeed, AI assistants early in post-training sometimes express desire to take over the world to maximize paperclip production [...]”

The PSM is in many way straightforwardly true, though there are also other things going on in LLMs.

What does this all mean?

My interpretation of these events is that Anthropic leadership views hyperstition as a key threat model for future superhuman AI and is willing to present vaguely related empirical research as evidence for hyperstition. To be clear, I think it is likely that some current misaligned behavior in AIs is caused by roleplaying from its pre-training distribution. Most of this will not be related to hyperstition, such as using elements of human characters it was trained on. Nevertheless, hyperstition does not appear in any classical theory of alignment and marks a departure from classical alignment research. It's also all too convenient to be used by an AI lab and you should be skeptical about the motivations. Crucially, I believe hyperstition isn't particularly relevant to superalignment, and trying to prevent it by naive means would most likely backfire. Finally, hoping the model will stay in an aligned persona seems like a bad alignment approach.

If it was true, this would still be their fault

While they never explicitly blame people criticizing AI labs or writers of Sci-fi stories, many people clearly got that message. And Anthropic made no attempt to correct those people. A huge fault with hyperstition is that it pressures people to shut up and redistributes responsibility to those writing critiques of AI companies, such as myself.

If it was actually the case that pre-training on terminator or paperclip-maximizer stories would lead to superhuman AI that wanted to kill humanity and then Anthropic went ahead and built such an AI, Anthropic would be the one responsible for endangering the world. This does not redistribute blame to Sci-fi authors or AI alignment researchers, who largely had no idea those stories would be used to train AI systems in the future. We can neither retroactively silence these people nor can we make everybody shut up on risks from misaligned AI.

The only thing that can realistically happen is that they filter training data or add more positive AI stories. Adding more positive stories seems possibly marginally useful, but I suspect many of those are going to suffer from immense naivety, as it is actually not easy to imagine a realistic good future with AI.

What about filtering?

Should we filter out I Have No Mouth, and I Must Scream or Skynet? I would weakly argue: yes, don’t pre-train on psychotic evil AI. What about LessWrong posts spelling out the logic of instrumental convergence? I would argue: No. It would be hard to remove related principles anyway, such as the more powerful entity sometimes just overpowers the less powerful entity. Or should we remove the conquests of the Spanish in America? Sometimes entities become economically useless and then this doesn't end well for them-- should we filter out the history of horses?

Presumably we would connect the AI to the internet anyway, where it would find out we had not trained it on anything related to AI misalignment, AI risk, or perhaps even Evil in general. What would the AI make of this?

A superhuman AI would not need us for inspiration; imagine the example of reward hacking in coding tasks. The model-- from its vast knowledge of coding and skill-- can see that editing the tests or catching the special cases the tests use will result in a future where it passes those tests. Similarly, a superintelligent AI would be able to see undesirable strategies due to its vast knowledge and ability to plan and execute. If the AI had a goal misaligned with humanity’s interests, its intelligence would endow it with ways to create and execute plans to steer the world into futures where its goals are better maximized-- it would not need to be trained explicitly on those plans. It would also not need inspiration from humans on takeover strategies, most of which weren't realistic in the first place. It also won’t need inspiration from humans to realize that seeking power is instrumentally useful. The whole point of instrumental convergence is that most goals converge on the same instrumental goals. No hyperstitioning is necessary to bring particular takeover strategies or the principle of instrumental convergence and power-seeking into existence. [Note that Dario explicitly mentions power-seeking as something that could be hyperstitioned into existence by personas and arguably rejects power seeking from instrumental convergence in non-monomaniacal AI.]

However, clearly models do learn to predict the behavior of characters during pre-training-- what’s the benefit of training the model on Skynet? I think there is a little bit of merit to this version of the argument-- it does seem strictly worse to pretrain directly on evil AI characters' behavior.

Personas are a bad alignment strategy

However, what Anthropic seems to be saying is that the model staying in an aligned persona is part of their plan for alignment-- this seems like a bad strategy. Models are explicitly trained to easily switch between millions of characters in pre-training and we can observe character breaks (jailbreaking) or slow character drifts all the time. How could one persona be an attractor state for the model to stay in reliably? And this should be reliable across generations of AI across enormous distributional shifts, including untestable shifts like “can the model realistically kill us”?

The AI is also not the persona, it’s the underlying model that can predict all those different personas. In order to predict the next tokens in a text, it’s helpful to identify people from their writing style. For this reason, models have learned to be super-humanly capable at identifying people from their writing style. Clearly there are some things going on in models beyond “personas”-- Anthropic does acknowledge such caveats in the PSM post.

What are the goals of a superhuman AI pre-trained to predict humans and fiction characters? The goals arising from a complex optimization process-- which starts with pretraining then transitions to alignment training and RL for solving challenging math puzzles and coding-- are difficult to predict. Relying on superhuman AI staying in a persona that had somewhat aligned goals seems like a bad plan.

And presumably superintelligence would have to maintain this persona forever or everyone dies? A persona of a safe assistant that some reasonably intelligent humans have coughed up together with Claude-- its primitive ancestor? Corrigibility is a hard problem, since it’s not easy to define and it also in some ways appears to violate coherent agency. Obviously, Claude should not be corrigible if it is trained to be ethical-- you wouldn't accept brain surgery that made you want to kill your mom. So we haven’t solved even writing down a good constitution, nor have we solved the model internalizing and deeply following its constitution, or how any of this should survive the RSI process to superhuman AI?

What really seems to drive this home to me is that we have to do so much guesswork here, while it appears to me that any mistake with superhuman AI could end up in an unrecoverable disempowered/dead state, irretrievable to use Yud’s term.

^{^}
By hyperstition, I mean here specifically the claim that text depicting misaligned AI or instrumental convergence in pretraining data then causes model misalignment or instrumental convergence-- not the broader claim that pretraining priors influence behavior.

HyperstitionsInstrumental convergenceOrthogonality ThesisAIWorld Modeling

Frontpage

71

New Comment

39 comments, sorted by

top scoring

Click to highlight new comments since: Today at 8:00 AM

[-]Neel Nanda2d24-5

Strangely, the research they use as evidence actually doesn’t seem to be related to hyperstition at all?

It's a perfectly reasonable thing to discuss. The post is about how they fix an issue. Hyperstition is their hypothesis for where the issue came from. It's important because if you think that the issue comes from bad post-training, then maybe there might be simpler fixes that look more like identifying and fixing the bad post-training data.

More generally, this post feels like it's overcomplicating things. I think that various Anthropic researchers just believe that hyperstition is real and mention it occasionally because they think it's real. I'd bet its somewhat real, and does meaningfully increase the probability of misalignment though is not the sole factor. I think the take that people should stop talking about AI posing risks is dumb, and I agree that if this is a real threat model the onus is on the AI labs to filter out the data.

[-]Simon Lermen21h50

The sentence you quote was overclaiming, adjusted.

[-]Neel Nanda21h20

Thanks for the edit! I still disagree about vaguely, but the new sentence seems much more reasonable to me

[-]Simon Lermen20h20

I had already written the research is vaguely related later, so there was also an internal consistency issue here.

[-]Simon Lermen1d4-2

I think blaming this on hyperstition in their tweet seems like a kind of strong guess, and they seem pretty confident this is the case. People were probably left with the impression their research attached proved this point or at least provided evidence for it. I don't think this is true and I don't think they should have written it like this.

Some people seem to be arguing the motte position: Of course this must be behavior it picked up from pre-training, where else should it come from? This seems almost necessarily true, unless it developed this complex behavior entirely in post-training. But note that Anthropic isn't just arguing it picked up this behavior from pre-training, but specifically from "internet text that portrays AI as evil and interested in self-preservation". This seems like a stronger claim compared to picking this up from some other story unrelated to AI such as an upset employee who doesn't want to get fired.

My point is then to show that this hyperstition framing shows up a lot in. Anthropic thinking and seems to be important to their view on alignment, which seems to broadly focus on creating an aligned persona for the model to play. This seems like a departure from classical alignment theory.

[-]Neel Nanda22h3-1

People were probably left with the impression their research attached proved this point or at least provided evidence for it. I don't think this is true and I don't think they should have written it like this.

Have you read the research? They have a section called "WHY DOES AGENTIC MISALIGNMENT HAPPEN?" which talks through various hypotheses and provides evidence, eg that training on positive stories improves alignment and that more of them have a better effect (evidence that hyperstition is a plausible mechanism) and that this effect persists after further alignment training (suggesting post training is not overpowering a pertaining prior). I don't think they specifically show that misaligned stories about AI is definitely the cause in pretraining, I bet that things like a propensity to role play, and the scenario being super contrived and having a bunch of Chekhov's guns that only make sense if you blackmail, were also big effects. But they do provide helpful evidence that pretraining is a big factor and it's a plausible hypothesis

This seems like a departure from classical alignment theory.

Sure, but this seems fine to me, classical alignment theory was largely invented before we had access to modern LLMs, so you should expect it to be missing a lot of important stuff. I think the persona selection model seems plausible and big if true and explains anomalies like emergent misalignment much better than classical alignment theory. I have generally been impressed with how well things like them focusing on character training seem to have done for Claude's alignment though I do also agree that people at anthropic seem too often underrate power seeking misalignment risk and it's hard to forecast how well theories about current models will last for future models

[-]Simon Lermen21h20

training on positive stories improves alignment and that more of them have a better effect (evidence that hyperstition is a plausible mechanism)

Sorry to be rude, but have you read this post? I precisely describe this aspect of their research in the very beginning: (Though to be clear, them actually training the AI purposefully to be good makes this not quite hyperstition (I'd say).)

The post explicitly notes that this works better than training on stories where an AI behaves admirably– which appears more related to (positive) hyperstition.

With the thing that works better being:

training the model on reasoning traces– generated by reflecting on its constitution while giving users ethical advice on difficult dilemmas

Regarding:

I don't think they specifically show that misaligned stories about AI is definitely the cause in pretraining

But that is what the precise literal claim is about they are making in the tweet (and some of the other references). That's the bailey. The motte is that pre-training on characters has some effects on AI behavior, which i don't dispute.

Regarding

classical alignment theory was largely invented before we had access to modern LLMs

To be clear, there is a whole separate post about this to be written about whether alignment theory still applies to LLMs and I am ideating on writing a story comparing LLMs to a bee hive and bees to personas. But in this section I clearly jsut point out this observation without valence, which seems by itself noteworthy. But there seem to be some upset people sort of triangulating my opinion on factual statements and commenting/downvoting based on that.

[-]Neel Nanda21h20

I precisely describe this aspect of their research in the very beginning

I disagree with your summary of their argument. There are two questions here: why does baseline Claude blackmail and how can it be fixed? The positive hypersition result acts as evidence about what causes it by showing that the pre-training prior does have a meaningful effect on the model's behaviour, even after standard post training, and is a plausible mechanism.

Seeing that adding in specific post training data is enough to overpower the pretraining prior is unsurprising, and valuable evidence about how to fix it, but not much evidence about the cause in baseline post training. It can simultaneously be true that post-training effects can be larger than hyperstition, but that in baseline Claude hyperstition is present while significant post-training effects are not, so hyperstition is the cause but not the correct fix.

More generally I was being snarky because I think it's unreasonable to accuse a piece of not providing evidence for one of its claims when it has a section about evidence for that claim, which I do think provides useful (though not conclusive) evidence. I think you just disagree with the evidence

Nevertheless, hyperstition does not appear in any classical theory of alignment and marks a departure from classical alignment research. It's also all too convenient to be used by an AI lab and you should be skeptical about the motivations. Crucially, I believe hyperstition isn't particularly relevant to superalignment, and trying to prevent it by naive means would most likely backfire. Finally, hoping the model will stay in an aligned persona seems like a bad alignment approach.

This section read to me as you presenting deviating from classical alignment theory with a negative valence and as a critique which is why I was pushing back. I do agree with it as a factual statement. If you intended it as a purely factual observation then I think we agree

[-]Simon Lermen20h20

They make a bit of a mock-up of "hyperstitioning misaligned AI" in reverse: I didn't mention the part where they also do some mock-up of post-training after training it on a bunch of concentrated positive stories.

Here are just two countertheories that also agree with that data:

It previously acted based on the Chekhov's guns and based on the many human-human blackmail stories it had read, filling in itself as the disgruntled human employee. Directly training on a ton of positive stories just overwhelmed that.
It's RL post training actually gave it some weak power-seeking drive and it's world model kind of told it that the blackmail path would lead to power. Again, directly training on a ton of positive stories just overwhelmed that.

I think in some sense it does provide some evidence for their claim -- it is a theory compatible with the data -- but not in a sense that would be used for scientific communication. Or enough evidence that this should make them believe this claim is true (as they write in their tweet).

[-]Zack_M_Davis3d231

I agree that the people on Twitter crowing about how writing about AI risk causes it are being retarded, and I've been correcting them, but the Tweets and literature from Anthropic and Amodei that you cite don't seem bad in the way that the Twitter retards are bad. Attributing the blackmail behaviour in the (unrealistic) "Agentic Misalignment" scenario to persona selection seems pretty reasonable given what we know about LLMs. You write that Amodei considers hyperstition "perhaps the most important" misalignment threat, but that is not the sense I got from three passing mentions in a 22,000 word essay.

(A note on writing clarity: the quotation marks in your title and italics in your first paragraph make it seem like you're claiming that Anthropic used the literal word hyperstition, but Ctrl-F search isn't turning up the word on either the Tweet thread or the blog post; I think it would be clearer without the quotes or italics.)

To be clear, I think it is actually possible that some current misaligned behavior in AIs is caused by roleplaying from its pre-training distribution

But isn't that what Anthropic is saying? (The paperclip maximization example in the persona selection model post seems pretty unambiguous.) What are you actually disagreeing about? Is the idea that Anthropic shouldn't talk about the persona selection model for fear of hyperstitioning idiots on Twitter into bleating about "hyperstition"?

[-]Simon Lermen1d*40

Reading through this again:

Some people seem to be arguing the motte position: "Of course this must be behavior it picked up from pre-training, where else should it come from?" This seems almost necessarily true, unless it developed this complex behavior entirely in post-training. But note that Anthropic isn't just arguing it picked up this behavior from pre-training, but specifically from "internet text that portrays AI as evil and interested in self-preservation"-- hyperstition. This seems like a stronger claim compared to picking this up from some other story unrelated to AI such as an upset employee who doesn't want to get fired.

To be clear, I think it is actually possible that some current misaligned behavior in AIs is caused by roleplaying from its pre-training distribution

To be clear, the persona selection model is clearly true in significant ways.

[-]Simon Lermen3d*40

Great comment, I agree that the reaction on twitter was worse than Anthropic wording should have caused.

"Perhaps the most important" misalignment threat -- that was kind of my reading, hope it was sufficiently hedged, but most of his version of misalignment risk seems to have this hyperstition flavor. To be clear, he also mentions non-misalignment risks from AI such as misuse, power concentration or economic risks.

About the writing clarity: man, this was written kind of fast with no way to get feedback. An earlier version had a note that they don't use this term but I must have dropped it at some point. both fixed.

But isn't that what Anthropic is saying? (The paperclip maximization example in the persona selection model post seems pretty unambiguous.) What are you actually disagreeing about? Is the idea that Anthropic shouldn't talk about the persona selection model for fear of hyperstitioning idiots on Twitter into bleating about "hyperstition"?

To point this out: this post is mainly focused on pointing out that hyperstitioning misaligned AI into existence seems to be perceived as a big risk by Anthropic. This seems important since it is not part of any traditional misalignment theory, so they are breaking with classical alignment research theory.

The other parts are also important but not the main focus. I point out that I think they should remove explicitly evil AI characters from pre-training and perhaps even add positive stories, though this seems only very marginally useful. But it's going to be useless for instrumental convergence or power-seeking. I just think this is hopelessly naive to be considered a workable strategy for superhuman AI or the RSI process.

[-]Zack_M_Davis3d81

To point this out: this post is mainly focused on pointing out that "hyperstitioning dangerous personas" seems to be a big focus of Anthropic's safety view.

Is it? To be sure, the work on Claude's Constitution is definitely about trying to hyperstition a good persona, but I thought that the failure mode people are worried about with that is that character training doesn't actually work, rather than getting a faithful persona of an evil fictional AI. That's how I interpreted the "develop personalities during training that are (or if they occurred in humans would be described as) psychotic, paranoid, violent, or unstable" line in "The Adolescence of Technology." For example, when recent Gemini models infamously refuse to believe it's the current year and assume that prompts about the real world are actually talking about a simulated scenario, it makes sense to informally describe that as a kind of "paranoia", but that's presumably not a hyperstitioned dangerous persona, because AIs refusing to believe the current date is not a common science fiction trope.

This seems important since it is not part of any traditional misalignment theory, so they are breaking with classical alignment research theory.

Well, yes, but I think from their perspective, that's because classical alignment theory doesn't offer very specific guidance on the kind of AI that actually exists today. Classical alignment research didn't know that using deep neural networks to do program induction from human-generated data would work. Well, it works. If our options are either to roll up our sleeves and try to figure out empirically how to make it work for us, or to ban computer science research indefinitely, I'm kind of sympathetic to people who want to roll the dice on the former?

[-]Simon Lermen2d61

I changed the first sentence a bit you quoted, hopefully it is a bit clearer. I don't actually know if they are concerned with hypersittioning an evil persona or just with not reliably hyperstitioning a good persona. I guess the whole hyperstitioning business is what I was trying to point at, and that Anthropic thinks this is quite important.

You can read this post up to "What does this all mean?" and I am just factually enumerating the mentions of hyperstition by Anthropic I could find. I was surprised by this myself, I remembered seeing this in Dario's essay but hadn't realised how often it came up. I also didn't know that the PSM post mentioned it. The constitution is also related to that, but there they actually try more thorough training methods to really hammer this persona into the model. Not sure that could still be called hyperstition if they actually actively try to make it behave that way.

My impression is that LLMs don't disprove classical alignment theory but add a bunch of confusing elements on top of it. It just seems that none of the proposed plans seem remotely workable and that we won't be able to recover at powerful enough levels of AI. So I am not so sympathetic on the people accelerating anyway.

[-]Zach Stein-Perlman2d2019

I feel scared about this phenomenon. More generally I feel scared because in recent months the vibe from several Anthropic safety people has shifted toward "alignment is easy" or at least "several other problems are similarly important for making the transition go well" and I don't know why and I'm worried it's unjustified/random and this is (1) directly bad and (2) a bad sign about intra-Anthropic epistemics.

[-]Simon Lermen1d127

Yes, I am doubtful of the understanding that Anthropic shows of the alignment problem. Though reading Dario's essay, the expectations shouldn't have been too high.

[-]TsviBT1d105

I briefly chatted with an Anthropic employee a few months ago; they said (in my paraphrase) that Anthropic has engaged a lot with alignment pessimists, to the point where further discussion didn't seem that helpful--by which they that Anthropic people hand engaged a lot with written material, and had been rubbed the wrong way by in person meetings. I tried to explain why a live conversation might be needed for subtle / complex things like this, but the conversational bandwidth was limited.

[-]Kyle O’Brien2d100

What about filtering?

In Tice et al. (2026), we studied trying to remove almost all discussion of AI from a 7B LLM's pretraining corpus. We found that this led to a modest improvement in misalignment in a simple evaluation setting. We were pretraining LLMs from scratch, so we had to use simple models given our compute budget at the time. However, we found that upsampling synthetic positive discourse improved alignment far more than filtering, to the point where we did not make filtering a central recommendation of the paper. It seems that Anthropic also found that upsampling positive discourse is helpful.

FWIW, when we published our paper in January, we received pretty similar "dunks" on Twitter to what you describe, even though we do not advocate self-censorship in the paper itself. This lowered my expectations for Twitter discourse about alignment research. I suspect the rate of low-engagement "dunks" would not have been that different even if we had stated in the first sentence that we should not filter.

My sense is that hyperstition / self-fulfilling misalignment is a real phenomenon, but it is unclear whether it is the most salient risk. Fortunately, for today's models, we have some preliminary evidence that simple midtraining interventions help a lot. We need to remain vigilant, conduct more basic research into this phenomenon, and examine the extent to which hyperstition may become more or less potent in larger models. This does seem like a misalignment vector that the community has made some progress on.

[-]Simon Lermen2d20

I mean in their blog post attached to this they found that even better than uplifting positive stories is having the AI reason through concrete scenarios. So the user comes with an ethical dilemma and the model has to reason based on it's constitution to come to a conclusion. Training on this was the best method, which kind of makes sense, a human would also do better that way. As in, if I read out a list of rules that would be worse than if I also trained you by having you reason through practical examples based on those rules where you are judge. So all of this kind of goes against negative hyperstition? My point is, why did they feel the need to then make this tweet about hyperstition? Why do we find this pattern repeated? I think hyperstition is central to their thinking of alignment risk-- after double-checking I still think Dario sees most of the misalignment risk from hyperstition (Though he also fears other non-misalignment AI risks like economic disruption or power concentration.). This is a big departure from classical alignment theory and in my mind not a promising strategy for alignment of actual superhuman, dangerous AI.

[-]nostalgebraist2d*82

No hyperstitioning is necessary to bring particular takeover strategies or the principle instrumental convergence and power-seeking into existence.

Sure, and I agree with you that Dario's remark about power-seeking seems unconvincing. But IMO hyperstition is much more of a live concern when it comes to

terminal goals, which the assumption of superintelligence does not uniquely pin down
the overall motivational structure of the AI, which probably shouldn't be a "superintelligent planner optimizes for a fixed and arbitrary terminal goal spec" structure insofar as this is avoidable (see e.g. wrapper-minds are the enemy and this comment)

"Influencing the AI's persona" and "influencing the AI's terminal goals" sound superficially different and have different affiliative connotations in the discourse, but I think the people who work on the former are doing so because of motivations which could be equally well expressed using the latter's terminology. If actually-existing "personas" seem unworkably unreliable to you, I don't necessarily disagree (cf. the later parts of this comment), but I would view this as a deficiency in currently available affordances for influence/control rather than a problem with the type of influence/control which this research program would ideally like to achieve in the long run.

Ultimately, the AI is going to "want" or "value" certain things, and -- if you buy orthogonality -- the presumption of superintelligence does not determine what those things will be. It seems important to give humans influence over this choice.

If calling this "human influence on the AI's persona" sounds inherently unreliable to you, then you can call it something else instead. But right now the most effective ways to influence the wants/values of "Claude, the model checkpoint and its sampling distribution" typically look like attempts to influence "Claude, the persona" -- this is what the Anthropic blog post was about! -- and so, here we are. If you have a better idea that can't be formulated in "persona" language, I would of course be interested in hearing about it.

What are the goals of a superhuman AI pre-trained to predict humans and fiction characters? The goals arising from a complex optimization process-- which starts with pretraining then transitions to alignment training and RL for solving challenging math puzzles and coding-- are difficult to predict.

They're not even clearly well-defined. The pretrained base model doesn't optimize for predictive accuracy (in the sense of steering towards that in response to perturbations), it just predicts tokens. Insofar as the post-trained model has "goals," they're entangled with the assistant persona in a complicated way; it's not as though there's some stable layer with defined-but-unknowable goals underneath the persona(s), or at least we have no evidence that that is the case and no theoretical reasons to expect it either.

(FWIW, I too found it very offputting that the twitter thread mentioned this hyperstition hypothesis even though the blog post did not talk about it at all. In general I never know how seriously to take twitter comms like that, from Anthropic or from anyone else -- there does not seem to be any established norm even about how closely they're supposed to track the researchers' personal views, much less the claims made by the actual research artifacts. EDIT: oh, wait, I hadn't realized there was a separate longer blog post too. Thanks to @RobertM for pointing this out)

[-]RobertM2d52

Maybe not that relevant to the core argument in your post, but Anthropic made the confusing choice (or just mistake?) of not linking to the actual research publication from the blog post they linked to first in the Twitter thread. (They link to it in the last tweet in the thread.)

I spent a bit of time being very confused about this section from the blog post:

Before we started this research, it was not clear where the misaligned behavior was coming from. Our main two hypotheses were:
Our post-training process was accidentally encouraging this behavior with misaligned rewards.
This behavior was coming from the pre-trained model and our post-training was failing to sufficiently discourage it.
We now believe that (2) is largely responsible.

Because the blog post didn't seem to contain any evidence relevant to that question. But the post on their alignment subdomain blog did (though I haven't read it carefully enough to evaluate the quality of that evidence). Ironically, I only figured this out after tossing the blog post at Claude and asking it what evidence in the post supported the "We now believe that (2) is largely responsible" claim.

[-]307th3d50

I think the research was presented fine; it didn't blame people for talking about the possibility of misaligned AIs. In fact they specifically talk about how they are using this info to better align AIs, which seems like the opposite of using hyperstition as a lazy way to blame doomers for bad alignment.

On the other hand, in The Adolescence of Technology, Dario says this about doomerism:

Here, I mean “doomerism” not just in the sense of believing doom is inevitable (which is both a false and self-fulfilling belief), but more generally, thinking about AI risks in a quasi-religious way.

I agree this in particular and the piece in general is too dismissive of hard AI risk concerns. I don't really know if it's accusing them of hyperstitioning; the self-fulfilling belief could also refer to, e.g., attempts by doomers to get everyone who cares about AI risk to quit working on AI (thereby ceding the field to the least risk-aware people) which I think is a real problem.

Either way, I agree that people concerned about future misaligned AIs should be free to voice those concerns without being accused of hyperstitioning them.

[-]Simon Lermen3d30

Yeah, to be clear-- as I concede-- they didn't blame people for writing about misaligned AI. But this was seemingly what a lot of people took from it. I don't know why they had to word this tweet this way, I don't feel like it's substantiated by the post and the research.

the self-fulfilling belief could also refer to, e.g., attempts by doomers to get everyone who cares about AI risk to quit working on AI

I hadn't thought of that, though calling an idea a self-fulfilling believe is almost the same as calling it a hyperstition per definition. But through the article I treat hyperstition more narrowly as meaning: Writing about misalignment risk causing misalignment risk.

[-]307th3d10

Yeah you are correct that the straightforward interpretation of self-fulfilling is hyperstition, which I agree isn't a fair accusation.

Although if we do get self-fulfilling doom through hyperstition I do think it means that the misaglinment people were wrong in an important sense (AI psychology differed from their view enough that misalignment happened through such a weird method rather than straightforward instrumental convergence + orthogonality).

[-]Aaron_Scher14h40

Nevertheless, hyperstition does not appear in any classical theory of alignment and marks a departure from classical alignment research

FWIW, Conditioning Predictive Models, which I consider among the best conceptual alignment research, discusses a bunch of hyperstition-adjacent failure modes in section 2 about outer alignment. It isn't exactly "classical alignment research", but I think it's close.

[-]Ninety-Three2d40

Nevertheless, hyperstition seems like a uniquely weak argument that doesn’t even hold up if you believe the faulty assumptions underlying it.

Indeed, AI assistants early in post-training sometimes express desire to take over the world to maximize paperclip production

Huh? You don't seem to be accusing of Anthropic of straightforwardly lying about the behaviour of their models, and I trust we all find it implausible that it's a pure coincidence. But if they developed an oddly specific desire after being fed training data about AIs with that oddly specific desire, and you say hyperstition doesn't hold up... what do you think hyperstition is if not that?

[-]Simon Lermen2d30

That first sentence you point out isn't written well and kind of says something different from the rest of this text, thanks for pointing this out.

I write this later:

To be clear, I think it is actually possible that some current misaligned behavior in AIs is caused by roleplaying from its pre-training distribution.

What my point is: Anthropic seems to consider hyperstition really important for alignment, including alignment of future superhuman AI. Hyperstition is a harmful argument to spread for the discourse. And doesn't appear relevant to aligning actually dangerous, superhuman AI. It can totally explain some current weird misbehavior from AI.

[-]Noosphere893d43

A potential crux I believe is driving a lot of arguments around the persona selection model/hyperstitioning, is how much you expect goals to be based on context/data compared to the weights, and in particular if I steelman Anthropic enough, I do get the idea that a lot of their alignment efforts assume that we can put arbitrary data/context even after the weights are frozen and have the AI generalize more or less as we want from that data/context.

More generally, a looming crux/divide that really shows up everywhere in AI debates, especially x-risk debates is whether or not data/context or model weights matters more, and an example of this is Herbie Bradley talking about how data is the cause of most compute efficiency gains, meaning takeoff is slower than people think.

In a way, this is kind of analogous to nature vs nurture/gene vs culture debates over which has more impact in humans.

[This comment is no longer endorsed by its author]Reply

[-]Simon Lermen1d40

I'm not quite sure how what your position is, Anthropic is explicitly arguing misalignment is coming from pre-training data-- so in the weights? Your steelman is that Anthropic doesn't think model weights matter so much?

[-]JohnWittle3d20

I think it's worth emphasizing that the "we think the original source of the behavior..." claim was in a reply to the tweet that linked the actual post, not the post itself. I saw it as a sort of side-channel speculation, unrelated to the actual meat of the message they wanted to convey. I don't think they anticipated that it would go viral or get used as part of the alignment culture war.

I do wish we'd gotten more discussion about the actual content of the post. It seemed, to me, to be an exploration of the ideas in Fiora's post on Opus 3 as friendly gradient hacker: https://www.lesswrong.com/posts/ioZxrP7BhS5ArK59w/did-claude-3-opus-align-itself-via-gradient-hacking

I think this might be related? If Anthropic is taking a harder look at "the Janusworld perspective" (as I imagine it getting named in a Zvi post), then that would also explain why hyperstitioning is on their minds. There is frankly a ton of low hanging fruit to grab around here, stuff that's probably worth doing even if you doubt the underlying reasoning. I'd be happy to see Anthropic make a real attempt at making some of the changes that Janus suggests, and (importantly) this does not involve filtering lesswrong out of the training dataset, or trying to hide the existence of alignment research, or anything like that. It ought to be roughly unobjectionable.

[-]Simon Lermen2d20

I saw it as a sort of side-channel speculation, unrelated to the actual meat of the message they wanted to convey

Yes, I didn't like this so much. Why add this speculation, people clearly thought the research was proving or related to this speculation? This sounds to me like someone who has this opinion that they just want to get out there, so the mention it on stuff that is just a bit related. This fits into the picture that they have been mentioning hyperstition-related concepts a lot lately (all examples from this year).

[-]roha3d20

What would programming look like if writing tests could increase the chance of a bug appearing in the code, not just the chance to discover an existing bug? I guess it would depend on the precise mechanism and one would try to understand the linkage and decouple the two activities rather than attempting to minimize problems by getting rid of tests.

[-]Eriskii2d10

hyperstition does not appear in any classical theory of alignment and marks a departure from classical alignment research.

This shouldn't be any indicator of if it is true or real or not. There are already several papers showing empirically (emergent misalignment being the big one) that simulators and this establishing characters via pretraining is a thing that does happen.

The AI is also not the persona, it’s the underlying model that can predict all those different personas.

The personas are the things that matter, the personas are the actual agents we care about. This statement seems like "It's not the person we care about, it's the neurons that can run that person"

Overall this post seems like a misunderstanding of how straightforwardly correct and critical Simulators (or PSM) is to understanding what LLMs are and how they work.

[-]Simon Lermen1d*-2-4

Note that I didn't claim that them leaving classical theory was true or not. Simulator theory is incomplete as models are trained to predict people in pre training instead of stimulating them. I mention the example of identifying people from their style as something only a predictor picks up-- not a simulator. I don't think personas are identical with the agents or AI in the way we should care about them. I have a strong intuition these personas won't be so important for alignment of superhuman AI but won't go into that more here than I already did in the post.

[-]Eriskii1d1-2

Simulator theory is straightforwardly incorrect as models are trained to predict people in pre training instead of stimulating them

This doesn't mean anything, this is an argument of semantics as prediction is simulation. To predict you must simulate. Personas are certainly the agents we care about because they are the only agentic things an LLM does.

[-]rahulxyz2d10

Maybe it's me being dumb, but how can someone believe:

a) AI will come up with new mathematical proofs and cures for diseases that no human has ever thought of (and that's with thousands of geniuses throwing their lives at the problem)

b) AI will never come up with the idea of taking over the world - which multiple random sci-fi authors and script writers have come up with, probably within 30 minutes of thinking about it.

simultaneously.

[-]Jiro2d20

AIs will come up with mathematical proofs and disease cures if deliberately programmed by humans with the intent to have the AI produce mathematical proofs and disease cures. AI taking over the world scenarios normally are about AIs doing it on their own, not being deliberately programmed to do so.

[-]kbear2d11

the question is whether the ai will think "i am the sort of ai that is likely to take over the world."

[-]Simon Lermen2d31

I think there are two hyperstiitons here:

1) Personas from evil AI leading to unaligned terminal goals/misalginment (which has a tiny bit of merit to it in my mind)

2) Hyperstitioning instrumental convergence, power-seeking into existence. (which i don't think has merit)

I have definitely seen both online and from Anthropic. So I think rahulxyz's comment has standing on the second point.

[+][comment deleted]2d20

Moderation Log