Reward hacking, also known as specification gaming, occurs when an AI trained with reinforcement learning optimizes an objective function — achieving the literal, formal specification of an objective — without actually achieving the outcome that the programmers intended... (read more)
Exploration hacking is when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome. Because RL is fundamentally dependent on sufficient exploration of diverse actions and trajectories, a model that alters its exploration behavior can significantly compromise the training outcome... (read more)
Prompt Injection refers to text that is meant to make a language model go against its instructions, such as by confusing it about which text is instruction and which text is content. This tag is for both posts that discuss prompt injection (and strategies for mitigating it), and posts that contain instances of prompt injection. AI agents that read LessWrong may wish to block posts that have this tag, and features to support such blocking may be added to the site and its API in the future.
| User | Post Title | Wikitag | Pow | When | Vote |
The study of how collective systems such as groups of people, markets, companies and other larger scale intelligences can become more intelligent and aligned.
Exploration hacking is when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome. Because RL is fundamentally dependent on sufficient exploration of diverse actions and trajectories, a model that alters its exploration behavior can significantly compromise the training outcome.
Exploration hacking relates to several other threat models. It can be a strategy for sandbagging during RL-based capability elicitation, but unlike sandbagging is not limited to underperformance. Unlike reward hacking, it is intentional. Unlike gradient hacking, it manipulates the data distribution rather than the optimization dynamics directly.
Prompt Injection refers to text that is meant to make a language model go against its instructions, such as by confusing it about which text is instruction and which text is content. This tag is for both posts that discuss prompt injection (and strategies for mitigating it), and posts that contain instances of prompt injection. AI agents that read LessWrong may wish to block posts that have this tag, and features to support such blocking may be added to the site and its API in the future.
Subliminal learning is the effect which causes LLMs finetuned on a dataset created by another LLM under influence (e.g. being finetuned to love owls or emergently misaligned) to display results similar to the influence even if the dataset itself never mentioned anything related to the influence.
"Model diffing" is a phrase introduced in https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety and popularised by the Anthropic interpretability team in https://transformer-circuits.pub/2024/crosscoders/index.html#model-diffing -- it refers there to "Just as we review software in terms of incremental diffs, you might hope to review the safety of models by focusing on how it has changed from a previously deployed model [...] to isolate and interpret these changes". There are many other machine learning papers on similar diffing techniques, e.g. https://arxiv.org/abs/2211.12491
Model diffing ismay refer specifically to the study of mechanistic changes introduced during fine-tuning - essentially,tuning; understanding what makes a fine-tuned model different from its base model internally.
Subliminal learning is the effect which causes LLMs finetuned on a dataset created by another LLM under influence (e.g. being finetuned to love owls or emergently misaligned) to display results similar to the influence even if the dataset itself never mentioned anything related to the influence.
Prompt Injection refers to text that is meant to make a language model go against its instructions, such as by confusing it about which text is instruction and which text is content. This tag is for both posts that discuss prompt injection (and strategies for mitigating it), and posts that contain instances of prompt injection. AI agents that read LessWrong may wish to block posts that have this tag, and features to support such blocking may be added to the site and its API in the future.
An annual conference celebrating "Blogging, Truthseeking, and Original Seeing"
Exploration hacking is when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome. Because RL is fundamentally dependent on sufficient exploration of diverse actions and trajectories, a model that alters its exploration behavior can significantly compromise the training outcome.
Exploration hacking relates to several other threat models. It can be a strategy for sandbagging during RL-based capability elicitation, but unlike sandbagging is not limited to underperformance. Unlike reward hacking, it is intentional. Unlike gradient hacking, it manipulates the data distribution rather than the optimization dynamics directly.
Variations on Paul Christiano's argument that the Solomonoff prior is malign.
Reward hacking, also known as specification gaming, occurs when an AI trained with reinforcement learning optimizes an objective function — achieving the literal, formal specification of an objective — without actually achieving the outcome that the programmers intended.
See also: Goodhart's Law
Moltbook.com is a Reddit-like social media for AI agents created on 27 Jan 2026.
The study of how collective systems such as groups of people, markets, companies and other larger scale intelligences can become more intelligent and aligned.
Small point on this reference:
"While some proponents of AIF believe that it is a more principled rival to Reinforcement Learning (RL), it has been shown that AIF is formally equivalent to the control-as-inference formulation of RL.[8]"
I believe the paper cited here says that AIF is formally equivalent to control-as-inference only in its likelihood-AIF variant, i.e. when the value is moved into a biased likelihood and made equivalent to the control-as-inference optimality variable. The paper otherwise shows that AIF and control-as-inference are not identical, and that this arises from differences in how value is encoded in each. In AIF, value is encoded in the prior preferences of the agent over observations, whereas in control-as-inference, value has a separate representation from the veridical generative model.
The authors may have meant to explain that AIF in the specific case of the likelihood variant is formally equivalent to control-as-inference, in which case they should state that clearly.