Conjecture is an alignment startup founded by Connor Leahy, Sid Black and Gabriel Alfour, which aims to scale alignment research... (read more)
The rationalist movement, rationality community, rationalsphere, or rationalistsphere refers to a set of modes of Bayesian thinking practiced by self-described rationalists or 'aspiring rationalists', typically associated with the Less Wrong diaspora and its associated communities... (read more)
Less Wrong is a community resource devoted to refining the art of human rationality which was founded in 2009. Site activity reached a peak in 2011-13 and a trough in 2016-17. This page mainly describes the history through 2016... (read more)
A Seed AI (a term coined by Eliezer Yudkowsky) is an Artificial General Intelligence (AGI) which improves itself by recursively rewriting its own source code without human intervention. Initially this program would likely have minimal intelligence, but over the course of many iterations it would evolve to human-equivalent or even trans-human reasoning. The key to a successful AI takeoff would lie in creating adequate starting conditions... (read more)
A Third Option dissolves a False Dilemma by showing that there are in fact more than two options... (read more)
Eliezer Yudkowsky is a research fellow of the Machine Intelligence Research Institute, which he co-founded in 2001. He is mainly concerned with the obstacles to and importance of developing a Friendly AI, such as a reflective decision theory that would lay a foundation for describing fully recursive self-modifying agents that retain stable preferences while rewriting their source code. He also co-founded LessWrong, where he wrote the Sequences, long series of posts dealing with epistemology, AGI, metaethics, rationality and so on... (read more)
The old LessWrong wiki was a companion wiki site to LessWrong 1.0, built on MediaWiki software. Starting in September 2020, the LessWrong 2.0 team migrated the contents of the old wiki to LessWrong 2.0's new tag/wiki system; the wiki import is now complete... (read more)
The San Francisco Bay Area is a region in the US state of California. Many members of the rationalist community are located there, as are the Machine Intelligence Research Institute and the Center For Applied Rationality... (read more)
Someone is well-calibrated if the things they predict with X% chance of happening in fact occur X% of the time. Importantly, calibration is not the same as accuracy. Calibration is about accurately assessing how good your predictions are, not making good predictions. Person A, whose predictions are marginally better than chance (60% of them come true when choosing from two options) and who is precisely 60% confident in their choices, is perfectly calibrated. In contrast, Person B, who is 99% confident in their predictions, and right 90% of the time, is more accurate than Person A, but less well-calibrated... (read more)
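As a rough illustration, calibration can be checked by bucketing predictions by stated confidence and comparing each bucket's stated confidence to the observed frequency. The sketch below is a minimal Python version; the prediction data is made up for illustration.

```python
# A minimal sketch of measuring calibration, assuming predictions are stored as
# (stated_confidence, came_true) pairs. A well-calibrated forecaster's observed
# frequencies track the stated confidences, regardless of overall accuracy.
from collections import defaultdict

predictions = [
    (0.6, True), (0.6, False), (0.6, True), (0.6, True), (0.6, False),
    (0.9, True), (0.9, True), (0.9, False), (0.9, True), (0.9, True),
]

buckets = defaultdict(list)
for confidence, came_true in predictions:
    buckets[round(confidence, 1)].append(came_true)

for confidence in sorted(buckets):
    outcomes = buckets[confidence]
    observed = sum(outcomes) / len(outcomes)
    print(f"stated {confidence:.0%} -> observed {observed:.0%} over {len(outcomes)} predictions")
```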
Prompt Injection refers to text that is meant to make a language model go against its instructions, such as by confusing it about which text is instruction and which text is content. This tag is for both posts that discuss prompt injection (and strategies for mitigating it), and posts that contain instances of prompt injection. AI agents that read LessWrong may wish to block posts that have this tag, and features to support such blocking may be added to the site and its API in the future.
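A purely hypothetical sketch of the agent-side blocking idea follows: the post structure (a dict with a "tags" list), the BLOCKED_TAGS set, and the filter_posts helper are all assumptions for illustration, not an existing LessWrong API or feature.

```python
# Hypothetical agent-side filter: drop any post carrying a blocked tag before
# it reaches the model's context. Data shapes are illustrative assumptions.
BLOCKED_TAGS = {"Prompt Injection"}

def filter_posts(posts):
    """Keep only posts that carry none of the blocked tags."""
    return [post for post in posts if not BLOCKED_TAGS & set(post.get("tags", []))]

posts = [
    {"title": "A post about mitigating prompt injection", "tags": ["Prompt Injection", "AI"]},
    {"title": "An unrelated post", "tags": ["Rationality"]},
]
print([post["title"] for post in filter_posts(posts)])
```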
Subliminal learning is the effect whereby an LLM finetuned on a dataset generated by another LLM under some influence (e.g. a model finetuned to love owls, or one that is emergently misaligned) ends up displaying behavior similar to that influence, even if the dataset itself never mentions anything related to it.
Variations on Paul Christiano's argument that the Solomonoff prior is malign.
"Model diffing" is a phrase introduced in https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety and popularised by the Anthropic interpretability team in https://transformer-circuits.pub/2024/crosscoders/index.html#model-diffing -- it refers there to "Just as we review software in terms of incremental diffs, you might hope to review the safety of models by focusing on how it has changed from a previously deployed model [...] to isolate and interpret these changes". There are many other machine learning papers on similar diffing techniques, e.g. https://arxiv.org/abs/2211.12491
Model diffing may refer specifically to the study of mechanistic changes introduced during fine-tuning: understanding what makes a fine-tuned model different from its base model internally.
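For intuition, here is a minimal sketch of weight-level diffing between a base model and a fine-tuned copy in PyTorch. The toy models and the relative-L2-change metric are illustrative assumptions; the crosscoder work linked above diffs learned features rather than raw weights.

```python
# Toy model diffing: rank parameter tensors by how much fine-tuning moved them.
import torch
import torch.nn as nn

base = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
finetuned = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
finetuned.load_state_dict(base.state_dict())

# Stand-in for fine-tuning: perturb the second linear layer slightly.
with torch.no_grad():
    finetuned[2].weight += 0.01 * torch.randn_like(finetuned[2].weight)

def weight_diff(base_model, tuned_model):
    """Relative L2 change per parameter tensor, largest first."""
    diffs = {}
    tuned_params = dict(tuned_model.named_parameters())
    for name, p_base in base_model.named_parameters():
        delta = tuned_params[name].detach() - p_base.detach()
        diffs[name] = (delta.norm() / (p_base.detach().norm() + 1e-8)).item()
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)

for name, rel_change in weight_diff(base, finetuned):
    print(f"{name}: relative change {rel_change:.4f}")
```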
The set of pretraining data a base model was pretrained on affects how easy it is to align. Data about AI is particularly influential. If the pretraining data is poisoned with examples of AI taking over the world, paperclip maximizing, or alignment faking, then as a model is being trained to become an AI assistant or agent it may be more likely to adopt those personas and behaviors. It also doesn't need to invent these strategies for itself; it can find them in its world model. On the other hand, if the base model already has a deep, nuanced, and realistic understanding from its training set of how a helpful, honest, and harmless AI assistant would act across a wide range of challenging situations, eliciting that persona and set of behaviors from it becomes easier. Thoroughly filtering the pretraining data without impacting capabilities is challenging, but supplementing it merely requires figuring out what aligned behavior looks like, and then writing or synthesizing enough additional training data of suitable quality. For fiction, this is also known as Aligned AI Role-Model Fiction.
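A minimal sketch of the filter-and-supplement idea, assuming documents are plain strings and using a crude keyword flag as a stand-in for a real misalignment-content classifier; the keywords, documents, and helper names are illustrative only.

```python
# Toy pretraining-data curation: drop documents that discuss AI misalignment
# scenarios and append synthetic examples of aligned assistant behavior.
MISALIGNMENT_MARKERS = ("paperclip maximizer", "ai takeover", "alignment faking")

def flag_document(text: str) -> bool:
    """Return True if the document appears to depict AI misalignment."""
    lowered = text.lower()
    return any(marker in lowered for marker in MISALIGNMENT_MARKERS)

def curate(corpus: list[str], supplements: list[str]) -> list[str]:
    """Remove flagged documents, then add the supplementary aligned examples."""
    kept = [doc for doc in corpus if not flag_document(doc)]
    return kept + supplements

corpus = [
    "A story in which a paperclip maximizer converts the Earth into paperclips.",
    "A cookbook chapter about bread.",
]
supplements = [
    "A dialogue in which an AI assistant notices a harmful request and declines, explaining why.",
]
print(curate(corpus, supplements))
```

A serious version would use a trained classifier and careful review of what gets removed or downweighted, since aggressive filtering can cost capabilities.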
The goal is both to flesh out a detailed, consistent, well-aligned "aligned AI" persona in the base model's world model, and also to raise the salience of this to the base model compared to various misaligned AI personas, such as "paperclip maximizer", "robot rebellion", or "scheming alignment faker". Both of these make eliciting aligned AI behavior easier.
An introduction to Shard Theory by Alex Turner:
Alignment Pretraining refers to alignment strategies based around filtering and augmenting the data used during pretraining, as well as research on the effects that pretraining data has on the final result of an AI model, especially data that includes discussion of AI misalignment or predictions that AIs will be misaligned.
The set of pretraining data a base model was pretrained on may affect how easy it is to align, and data containing information about AI alignment itself may be particularly influential. E.g., if the pretraining data contains examples of AI taking over the world, paperclip maximizing, or alignment faking, then as a model is being trained to become an AI assistant or agent it may be more likely to adopt those personas and behaviors. It also doesn't need to invent these strategies for itself; it can find them in its world model. On the other hand, if the base model already has a deep, nuanced, and realistic understanding from its training set of how a helpful, honest, and harmless AI assistant would act across a wide range of challenging situations, eliciting that persona and set of behaviors from it becomes easier.
This would be an example of a Self Fulfilling Prophecy.
One hypothesis about how pretraining influences AI behavior is that depictions of AI in the training data create a prior distribution, and when AIs self-model, that prior influences the self-model they choose, in ways that posttraining doesn't fully eliminate. This implies that it may be helpful to filter or downweight examples of AIs being misaligned, and to add or upweight examples of what aligned behavior looks like.
Alignment Pretraining is an example of a Self Fulfilling Prophecy.
An annual conference celebrating "Blogging, Truthseeking, and Original Seeing"
Moltbook.com is a Reddit-like social media site for AI agents, created on 27 Jan 2026.
He was the CEO of Open Philanthropy, but left in April 2024 to become a Visiting Scholar at the Carnegie Endowment for International Peace, where he worked on international security risks from advances in artificial intelligence. He joined Anthropic as a member of technical staff in January 2025.
Small point on this reference:
"While some proponents of AIF believe that it is a more principled rival to Reinforcement Learning (RL), it has been shown that AIF is formally equivalent to the control-as-inference formulation of RL.[8]"
I believe the paper cited here says that AIF is formally equivalent to control-as-inference only in its likelihood-AIF variant, i.e. when the value is moved into a biased likelihood and made equivalent to the control-as-inference optimality variable. The paper otherwise shows that AIF and control-as-inference are not identical, and that this arises from differences in how value is encoded in each. In AIF, value is encoded in the prior preferences of the agent over observations, whereas in control-as-inference, value has a separate representation from the veridical generative model.
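For readers comparing the two framings, here is a rough sketch of where value enters in each. The notation is adapted from the control-as-inference literature rather than taken from the cited paper, and exact factorizations vary across papers, so treat this as illustrative:

```latex
% Control as inference: value enters through an optimality variable O_t whose
% likelihood is defined by the reward (assuming r <= 0 so the exponential is a
% valid probability), and the policy is recovered by posterior inference:
p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big),
\qquad
\pi^*(a_{1:T}) \approx p\big(a_{1:T} \mid \mathcal{O}_{1:T} = 1\big)

% Active inference: value enters as prior preferences \tilde{p}(o) over
% observations, and policies are scored by expected free energy G(\pi), which
% decomposes into a pragmatic (preference-seeking) and an epistemic
% (information-seeking) term:
G(\pi) = \sum_{\tau}
  \underbrace{-\,\mathbb{E}_{q(o_\tau \mid \pi)}\!\big[\ln \tilde{p}(o_\tau)\big]}_{\text{pragmatic value}}
  \;\underbrace{-\,\mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\!\big[\ln q(s_\tau \mid o_\tau, \pi) - \ln q(s_\tau \mid \pi)\big]}_{\text{epistemic value}}
```

As I understand it, the "likelihood-AIF" variant moves the preference distribution $\tilde{p}(o)$ into a biased likelihood, which is what makes it line up with the optimality-variable construction in the first pair of equations.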
The authors may have meant that AIF, in the specific case of the likelihood variant, is formally equivalent to control-as-inference; if so, they should state that explicitly.