Prompt Injection refers to text that is meant to make a language model go against its instructions, such as by confusing it about which text is instruction and which text is content. This tag is for both posts that discuss prompt injection (and strategies for mitigating it), and posts that contain instances of prompt injection. AI agents that read LessWrong may wish to block posts that have this tag, and features to support such blocking may be added to the site and its API in the future.
Alignment Pretraining refers to alignment strategies based around filtering and augmenting the data used during pretraining, as well as research on the effects that pretraining data has on the final result of an AI model, especially data that includes discussion of AI misalignment or predictions AIs will be misaligned.
"The X-Men comics use terms like “evolution,” “mutation,” and “genetic code,” purely to place themselves in what they conceive to be the literary genre of science. The part that scares me is wondering how many people, especially in the media, understand science only as a literary genre." -- from Eliezer's post Science as Attire.
A generalization of Aumann's Agreement Theorem across objectives and agents, without assuming common priors. This framework also encompasses Debate, CIRL, and Iterated Amplification. See Nayebi (2025) for the formal definition, and see From Barriers to Alignment to the First Formal Corrigibility Guarantees for applications.
These are posts that contain reviews of posts included in the 2024 Annual Review.
Subliminal learning is the effect whereby LLMs finetuned on a dataset generated by another LLM that carries some induced trait (e.g. having been finetuned to love owls, or being emergently misaligned) come to display that trait themselves, even if the dataset never mentions anything related to it.
Alignment Pretraining refers to alignment strategies based around filtering and augmenting the data used during pretraining, as well as research on the effects that pretraining data has on the final result of an AI model, especially data that includes discussion of AI misalignment or predictions AIs will be misaligned.
The data a base model was pretrained on may affect how easy it is to align, and data containing information about AI alignment itself may be particularly influential. E.g., if the pretraining data contains examples of AI taking over the world, paperclip maximizing or alignment faking, then as a model is being trained to become an AI assistant or agent it may be more likely to adopt those personas and behaviors. It also doesn't need to invent these strategies for itself; it can find them in its world model. On the other hand, if the base model already has a deep, nuanced and realistic understanding from its training set of how a helpful, honest, and harmless AI assistant would act across a wide range of challenging situations, eliciting that persona and set of behaviors from it becomes easier.
This would be an example of a Self Fulfilling Prophecy.
Thanks, edited that section with a link.
The pretraining data a base model was trained on affects how easy it is to align. Data about AI is particularly influential. If the pretraining data is poisoned with examples of AI taking over the world, paperclip maximizing or alignment faking, then as a model is being trained to become an AI assistant or agent it may be more likely to adopt those personas and behaviors. It also doesn't need to invent these strategies for itself; it can find them in its world model. On the other hand, if the base model already has a deep, nuanced and realistic understanding from its training set of how a helpful, honest, and harmless AI assistant would act across a wide range of challenging situations, eliciting that persona and set of behaviors from it becomes easier. Thoroughly filtering the pretraining data without impacting capabilities is challenging, but supplementing it merely requires figuring out what aligned behavior looks like, and then writing or synthesizing enough additional training data of suitable quality. For fiction, this is also known as Aligned AI Role-Model Fiction.
This approach is also sometimes called Safety Pretraining or Pretraining Language Models with Human Preferences.
Moltbook.com is a Reddit-like social media site for AI agents, created on 27 Jan 2026.
Alignment Pretraining is an example of a Self Fulfilling Prophecy.
The goal is both to flesh out a detailed, consistent, well-aligned "aligned AI" persona, and also to raise its salience to the base model compared to various misaligned AI personas, such as "paperclip maximizer", "robot rebellion", or "scheming alignment faker".
"Model diffing" is a phrase introduced in https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety and popularised by the Anthropic interpretability team in https://transformer-circuits.pub/2024/crosscoders/index.html#model-diffing. There the idea is described as follows: "Just as we review software in terms of incremental diffs, you might hope to review the safety of models by focusing on how it has changed from a previously deployed model [...] to isolate and interpret these changes". There are many other machine learning papers on similar diffing techniques, e.g. https://arxiv.org/abs/2211.12491
Model diffing may refer specifically to the study of mechanistic changes introduced during fine-tuning: understanding what makes a fine-tuned model different from its base model internally.
One hypothesis about how pretraining influences AI behavior is that the depictions of AI in the training data create a prior distribution, and when AIs self-model, that prior influences the self-model they choose, in ways that post-training doesn't fully eliminate. This implies that it may be helpful to filter or downweight examples of AIs being misaligned, and add or upweight examples of what aligned behavior looks like.
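As a rough illustration of what filtering, downweighting, and upweighting could mean in practice, here is a minimal Python sketch of label-based sampling weights. Everything in it is an assumption made for illustration: the `Doc` type, the three label names, and the multiplier values are hypothetical, and how documents would get labeled in the first place (e.g. by a classifier) is out of scope.

```python
import random
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    label: str  # hypothetical labels: "aligned_ai", "misaligned_ai", "other"


# Hypothetical multipliers: >1 upweights, <1 downweights, 0 filters out entirely.
DEFAULT_MULTIPLIERS = {"aligned_ai": 2.0, "misaligned_ai": 0.25, "other": 1.0}


def sampling_weights(docs, multipliers=DEFAULT_MULTIPLIERS):
    """Return one normalized sampling probability per document."""
    raw = [multipliers.get(d.label, 1.0) for d in docs]
    total = sum(raw)
    if total == 0:
        raise ValueError("all documents were filtered out")
    return [w / total for w in raw]


def sample_batch(docs, k, multipliers=DEFAULT_MULTIPLIERS, seed=0):
    """Draw k documents (with replacement) according to the reweighted mix."""
    rng = random.Random(seed)
    return rng.choices(docs, weights=sampling_weights(docs, multipliers), k=k)
```

Setting a multiplier to 0 corresponds to filtering a category out, while values between 0 and 1 correspond to downweighting it. Adding newly written or synthesized documents with the upweighted label would be the "add" half of the proposal.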
He was the CEO of Open Philanthropy, but left in April 2024 to become a Visiting Scholar at the Carnegie Endowment for International Peace, to work on international security risks from advances in artificial intelligence. He joined Anthropic as a member of technical staff in January 2025.
The goal is both to flesh out a detailed, consistent, well-aligned "aligned AI" persona in the base model's world model, and also to raise the salience of this to the base model compared to various misaligned AI personas, such as "paperclip maximizer", "robot rebellion", or "scheming alignment faker". Both of these make eliciting aligned AI behavior easier.
An introduction to Shard Theory by Alex Turner:
Sequences on Shard Theory:
Are we sure that the gene doesn't just cause people who have it to believe in Functional Decision Theory?