AI Control in the context of AI Alignment is a category of plans that aim to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:.. (read more)
The Open Agency Architecture ("OAA") is an AI alignment proposal by (among others) @davidad and @Eric Drexler. .. (read more)
Singluar learning theory is a theory that applies algebraic geometry to statistical learning theory, developed by Sumio Watanabe. Reference textbooks are "the grey book", Algebraic Geometry and Statistical Learning Theory, and "the green book", Mathematical Theory of Bayesian Statistics.
Archetypal Transfer Learning (ATL) is a proposal by @whitehatStoic for what is argued by the author to be a fine tuning approach that "uses archetypal data" to "embed Synthetic Archetypes". These Synthetic Archetypes are derived from patterns that models assimilate from archetypal data, such as artificial stories. The method yielded a shutdown activation rate of 57.33% in the GPT-2-XL model after fine-tuning. .. (read more)
Open Threads are informal discussion areas, where users are welcome to post comments that didn't quite feel big enough to warrant a top-level post, nor fit in other posts... (read more)
A Black Marble is a technology that by default destroys the civilization that invents it. It's one type of Existential Risk. AGI may be such an invention, but isn't the only one... (read more)
AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based... (read more)
AI Risk Skepticism is the view that the potential risks posed by artificial intelligence (AI) are overstated or misunderstood, specifically regarding the direct, tangible dangers posed by the behavior of AI systems themselves. Skeptics of object-level AI risk argue that fears of highly autonomous, superintelligent AI leading to catastrophic outcomes are premature or unlikely.
| User | Post Title | Wikitag | Pow | When | Vote |
The study of how collective systems such as groups of people, markets, companies and other larger scale intelligences can become more intelligent and aligned.
Exploration hacking is when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome. Because RL is fundamentally dependent on sufficient exploration of diverse actions and trajectories, a model that alters its exploration behavior can significantly compromise the training outcome.
Exploration hacking relates to several other threat models. It can be a strategy for sandbagging during RL-based capability elicitation, but unlike sandbagging is not limited to underperformance. Unlike reward hacking, it is intentional. Unlike gradient hacking, it manipulates the data distribution rather than the optimization dynamics directly.
Prompt Injection refers to text that is meant to make a language model go against its instructions, such as by confusing it about which text is instruction and which text is content. This tag is for both posts that discuss prompt injection (and strategies for mitigating it), and posts that contain instances of prompt injection. AI agents that read LessWrong may wish to block posts that have this tag, and features to support such blocking may be added to the site and its API in the future.
The study of how collective systems such as groups of people, markets, companies and other larger scale intelligences can become more intelligent and aligned.
An annual conference celebrating "Blogging, Truthseeking, and Original Seeing"
Exploration hacking is when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome. Because RL is fundamentally dependent on sufficient exploration of diverse actions and trajectories, a model that alters its exploration behavior can significantly compromise the training outcome.
Exploration hacking relates to several other threat models. It can be a strategy for sandbagging during RL-based capability elicitation, but unlike sandbagging is not limited to underperformance. Unlike reward hacking, it is intentional. Unlike gradient hacking, it manipulates the data distribution rather than the optimization dynamics directly.
Reward hacking, also known as specification gaming, occurs when an AI trained with reinforcement learning optimizes an objective function — achieving the literal, formal specification of an objective — without actually achieving the outcome that the programmers intended.
See also: Goodhart's Law
Variations on Paul Christiano's argument that the Solomonoff prior is malign.
"Model diffing" is a phrase introduced in https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety and popularised by the Anthropic interpretability team in https://transformer-circuits.pub/2024/crosscoders/index.html#model-diffing -- it refers there to "Just as we review software in terms of incremental diffs, you might hope to review the safety of models by focusing on how it has changed from a previously deployed model [...] to isolate and interpret these changes". There are many other machine learning papers on similar diffing techniques, e.g. https://arxiv.org/abs/2211.12491
Model diffing ismay refer specifically to the study of mechanistic changes introduced during fine-tuning - essentially,tuning; understanding what makes a fine-tuned model different from its base model internally.
Moltbook.com is a Reddit-like social media for AI agents created on 27 Jan 2026.
Prompt Injection refers to text that is meant to make a language model go against its instructions, such as by confusing it about which text is instruction and which text is content. This tag is for both posts that discuss prompt injection (and strategies for mitigating it), and posts that contain instances of prompt injection. AI agents that read LessWrong may wish to block posts that have this tag, and features to support such blocking may be added to the site and its API in the future.
The question of whether Oracles –— or just keeping an AGI forcibly confined -— are safer than fully free AGIs has been the subject of debate for a long time. Armstrong, SandbergSandberg, and Bostrom discuss Oracle safety at length in their Thinking inside the box: using and controlling an Oracle AI. In the paper, the authors review various methods which might be used to measure an Oracle's accuracy. They also try to shed some light on some weaknesses and dangers that can emerge on the human side, such as psychological vulnerabilities which can be exploited by the Oracle through social engineering. The paper discusses ideas for physical security (“boxing”), as well as problems involved with trying to program the AI to only answer questions. In the end, the paper reaches the cautious conclusion ofthat Oracle AIs are probably being safer than free AGIs.
In a related work, Dreams of Friendliness, Eliezer Yudkowsky gives an informal argument stating that all oracles will be agent-like, that is, driven by its own goals. He rests on the idea that anything considered "intelligent" must choose the correct course of action among all actions available. That means that the Oracle will have many possible things to believe, although very few of them are correct. ThereforeTherefore, believing the correct thing means some method was used to select the correct belief from the many incorrect beliefs. By definition, this is an optimization process which has a goal of selecting correct beliefs.
One can then imagine all the things that might be useful in achieving the goal of "have"having correct beliefs". For instance, acquiring more computing power and resources could help this goal. As such, an Oracle could determine that it might answer more accurately and easily to a certain question if it turned all matter outside the box to computronium, therefore killing all the existing life.
Given that true AIs are goal-oriented agents, it follows that a True Oracular AI has some kind of oracular goals. These act as the motivation system for the Oracle to give us the information we ask for and nothing else.
This means that a True Oracular AI has to have a full specification of human values, thus making it a FAI-complete problem – if we could achieve such skill and knowledgeknowledge, we could just build a Friendly AI and bypass the Oracle AI concept.
Any system that acts only as an informative machine, only answering questionsquestions, and has no goals is by definition not an AI at all. That means that a non-AI Oracular is but a calculator of outputs based on inputs. Since the term in itself is heterogeneous, the proposals made for a sub-division are merely informal.
An Advisor can be seen as a system that gathers data from the real world and computes the answer to an informal “what we ought to do?” question. They also represent aan FAI-complete problem.
Finally, a Predictor is seen as a system that takes a corpus of data and produces a probability distribution over future possible data. There are some proposed dangers with predictors, namely exhibiting goal-seeking behavior which does not converge with humanityhumanity's goals and the ability to influence us through the predictions.
Small point on this reference:
"While some proponents of AIF believe that it is a more principled rival to Reinforcement Learning (RL), it has been shown that AIF is formally equivalent to the control-as-inference formulation of RL.[8]"
I believe the paper cited here says that AIF is formally equivalent to control-as-inference only in its likelihood-AIF variant, i.e. when the value is moved into a biased likelihood and made equivalent to the control-as-inference optimality variable. The paper otherwise shows that AIF and control-as-inference are not identical, and that this arises from differences in how value is encoded in each. In AIF, value is encoded in the prior preferences of the agent over observations, whereas in control-as-inference, value has a separate representation from the veridical generative model.
The authors may have meant to explain that AIF in the specific case of the likelihood variant is formally equivalent to control-as-inference, in which case they should state that clearly.