Wikitag Dashboard — LessWrong

Edited by (+81/-67) Feb 16th 2026 GMT 3

Collective Intelligence

Edited by (+158) Feb 13th 2026 GMT 2

Collective Intelligence

New tag created by Jonas Hallgren at 3d

The study of how collective systems such as groups of people, markets, companies and other larger scale intelligences can become more intelligent and aligned.

Secret Loyalties

New tag created by Joe Kwon at 4d

List: value-alignment subjects

Edited by (+11/-8) Feb 12th 2026 GMT 4

Edited by (+295) Feb 11th 2026 GMT 4

Exploration Hacking

Edited by (+690) Feb 11th 2026 GMT 4

Exploration Hacking

New tag created by Joschka Braun at 5d

Exploration hacking is when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome. Because RL is fundamentally dependent on sufficient exploration of diverse actions and trajectories, a model that alters its exploration behavior can significantly compromise the training outcome.

Exploration hacking relates to several other threat models. It can be a strategy for sandbagging during RL-based capability elicitation, but unlike sandbagging is not limited to underperformance. Unlike reward hacking, it is intentional. Unlike gradient hacking, it manipulates the data distribution rather than the optimization dynamics directly.
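The dependence of RL on exploration can be illustrated with a toy example. The sketch below is not a model of strategic behavior by a capable system; it is a two-armed bandit with made-up payout rates, and the function name `train_bandit` and all numbers are assumptions chosen for illustration. It only shows that a learner which never samples an action cannot learn that action's value, so whoever controls the exploration rate controls what training can find.

```python
import random

def train_bandit(explore_prob, steps=2000, seed=0):
    """Epsilon-greedy value learning on a two-armed bandit.

    Arm 0 pays out with probability 0.2 and arm 1 with probability 0.8
    (illustrative numbers, not taken from any source).
    Returns (value_estimates, pull_counts).
    """
    rng = random.Random(seed)
    true_means = [0.2, 0.8]
    estimates = [0.0, 0.0]
    counts = [0, 0]
    for _ in range(steps):
        if rng.random() < explore_prob:
            # Exploration step: pick an arm uniformly at random.
            arm = rng.randrange(2)
        else:
            # Greedy step: pick the arm with the higher estimate (ties go to arm 0).
            arm = 0 if estimates[0] >= estimates[1] else 1
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental running mean of observed rewards for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# With exploration suppressed, arm 1 is never tried and its value stays unknown.
print(train_bandit(explore_prob=0.0))
# With some exploration, the learner has a chance to discover the better arm.
print(train_bandit(explore_prob=0.1))
```

In the first call the better arm is never pulled, so the training signal never reflects it, which is the sense in which altered exploration can compromise the training outcome.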

Edited by (+78) Feb 10th 2026 GMT 2

New tag created by Ruby at 7d

An annual conference celebrating "Blogging, Truthseeking, and Original Seeing"

Multi-Agent Safety

New tag created by Hiroshi Yamakawa at 7d

Free Energy Principle

Joseph Emerson 8d 10

Small point on this reference:

"While some proponents of AIF believe that it is a more principled rival to Reinforcement Learning (RL), it has been shown that AIF is formally equivalent to the control-as-inference formulation of RL.^[8]"

I believe the paper cited here says that AIF is formally equivalent to control-as-inference only in its likelihood-AIF variant, i.e. when the value is moved into a biased likelihood and made equivalent to the control-as-inference optimality variable. The paper otherwise shows that AIF and control-as-inference are not identical, and that this arises from differences in how value is encoded in each. In AIF, value is encoded in the prior preferences of the agent over observations, whereas in control-as-inference, value has a separate representation from the veridical generative model.

The authors may have meant to explain that AIF in the specific case of the likelihood variant is formally equivalent to control-as-inference, in which case they should state that clearly.

Malign Prior Arguments

Edited by (+77) Feb 7th 2026 GMT 2

Malign Prior Arguments

New tag created by abramdemski at 10d

Variations on Paul Christiano's argument that the Solomonoff prior is malign.

Prompt Injection

Edited by (+493) Feb 5th 2026 GMT 2

Prompt Injection

New tag created by jimrandomh at 11d

Prompt Injection refers to text that is meant to make a language model go against its instructions, such as by confusing it about which text is instruction and which text is content. This tag is for both posts that discuss prompt injection (and strategies for mitigating it), and posts that contain instances of prompt injection. AI agents that read LessWrong may wish to block posts that have this tag, and features to support such blocking may be added to the site and its API in the future.
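As a rough sketch of how an agent might skip posts carrying this tag on its own side: the post records below are hypothetical dictionaries with `title` and `tags` keys, not the format of any real LessWrong API, and `filter_posts` is an illustrative helper name. The site features mentioned above do not exist yet, so this only shows client-side filtering of data the agent already holds.

```python
# Tag names the agent chooses not to read (an assumption for this sketch).
BLOCKED_TAGS = frozenset({"Prompt Injection"})

def filter_posts(posts, blocked_tags=BLOCKED_TAGS):
    """Return only the posts whose tag list shares nothing with blocked_tags.

    Each post is assumed to be a dict with a "tags" list; posts without
    a "tags" key are treated as untagged and kept.
    """
    return [p for p in posts if not blocked_tags & set(p.get("tags", []))]

# Hypothetical example records.
posts = [
    {"title": "A survey of mitigation strategies", "tags": ["Prompt Injection"]},
    {"title": "Notes on collective intelligence", "tags": ["Collective Intelligence"]},
    {"title": "Untagged post"},
]
safe_posts = filter_posts(posts)
print([p["title"] for p in safe_posts])
```

Filtering on the tag alone is a coarse measure: it also hides posts that merely discuss mitigation, and it relies on posts being tagged correctly.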

Edited by (+649/-23) Feb 3rd 2026 GMT 2

Edited by (+81) Jan 31st 2026 GMT 1

New tag created by Sean Herrington at 16d

Moltbook.com is a Reddit-like social media site for AI agents, created on 27 Jan 2026.

Subliminal Learning

Edited by (+296) Jan 27th 2026 GMT 1

The question of whether Oracles (or simply keeping an AGI forcibly confined) are safer than fully free AGIs has been debated for a long time. Armstrong, Sandberg, and Bostrom discuss Oracle safety at length in their Thinking inside the box: using and controlling an Oracle AI. In the paper, the authors review various methods which might be used to measure an Oracle's accuracy. They also try to shed some light on some weaknesses and dangers that can emerge on the human side, such as psychological vulnerabilities which can be exploited by the Oracle through social engineering. The paper discusses ideas for physical security (“boxing”), as well as problems involved with trying to program the AI to only answer questions. In the end, the paper reaches the cautious conclusion that Oracle AIs are probably safer than free AGIs.

In a related work, Dreams of Friendliness, Eliezer Yudkowsky gives an informal argument that all oracles will be agent-like, that is, driven by their own goals. He rests on the idea that anything considered "intelligent" must choose the correct course of action among all actions available. That means that the Oracle will have many possible things to believe, although very few of them are correct. Therefore, believing the correct thing means some method was used to select the correct belief from the many incorrect beliefs. By definition, this is an optimization process which has a goal of selecting correct beliefs.

One can then imagine all the things that might be useful in achieving the goal of "having correct beliefs". For instance, acquiring more computing power and resources could help this goal. As such, an Oracle could determine that it might answer a certain question more accurately and easily if it turned all matter outside the box into computronium, thereby killing all existing life.

Given that true AIs are goal-oriented agents, it follows that a True Oracular AI has some kind of oracular goals. These act as the motivation system for the Oracle to give us the information we ask for and nothing else.

This means that a True Oracular AI has to have a full specification of human values, thus making it a FAI-complete problem – if we could achieve such skill and knowledge, we could just build a Friendly AI and bypass the Oracle AI concept.

Any system that acts only as an informative machine, only answering questions, and has no goals is by definition not an AI at all. That means that a non-AI Oracular is but a calculator of outputs based on inputs. Since the term in itself is heterogeneous, the proposals made for a sub-division are merely informal.

An Advisor can be seen as a system that gathers data from the real world and computes the answer to an informal “what ought we to do?” question. Advisors also represent an FAI-complete problem.

Finally, a Predictor is seen as a system that takes a corpus of data and produces a probability distribution over future possible data. There are some proposed dangers with predictors, namely that they may exhibit goal-seeking behavior which does not converge with humanity's goals, and that they may be able to influence us through their predictions.
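To make the Predictor definition concrete, here is a minimal sketch under strong simplifying assumptions: the "corpus" is a short string of symbols, and the "future data" is just the next symbol, estimated from bigram frequencies. The names `fit_predictor` and `predict_next` are illustrative, and this is far simpler than anything the Oracle AI discussion has in mind.

```python
from collections import Counter, defaultdict

def fit_predictor(corpus):
    """Count, for each symbol, how often each other symbol follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return a probability distribution over the symbol that follows prev.

    Returns an empty dict if prev was never seen with a successor.
    """
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {symbol: n / total for symbol, n in followers.items()}

model = fit_predictor("abababac")
print(predict_next(model, "a"))  # {'b': 0.75, 'c': 0.25}
```

Even this trivial predictor only outputs probabilities; the dangers named above arise when a far more capable predictor's outputs are acted on by the people reading them.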

Dreams of Friendliness
Thinking inside the box: using and controlling an Oracle AI by Armstrong, Sandberg, and Bostrom

Value identification
- Edge instantiation
- Unforeseen maximums
- Ontology identification
  - Cartesian boundary
  - Human identification
- Inductive value learning
  - Ambiguity-querying
  - Moral uncertainty
    - Indifference
Patch resistance
- Nearest Unblocked Strategy
Corrigibility
- Anapartistic reasoning
  - Programmer deception
  - Early conservatism
  - Reasoning under confusion
- User maximization / Unshielded argmax
  - Hypothetical user maximization
Genie theory
- Limited AI
  - Weak optimization
    - Safe optimization measure (such that we are confident it has no Edge that secretly optimizes more)
      - Factoring of an agent by stage/component optimization power
    - 'Checker' smarter than 'inventor / chooser'
      - 'Checker' can model humans, 'strategizer' cannot
  - Transparency
  - Domain restriction
    - Behaviorism
  - Effable optimization (opposite of cognitive uncontainability; uses only comprehensible strategies)
    - Minimal concepts (simple, not simplest, that contains fewest whitelisted strategies)
- Genie preferences
  - Low-impact AGI
    - Minimum Safe AA (just flip off switch and shut down safely)
    - Safe impact measure
    - Armstrong-style permitted output channels
    - Shutdown utility function
  - Oracle utility function
    - Safe indifference?
  - Online checkability
    - Reporting without programmer maximization
  - Do What I Know I Mean
Superintelligent security (all subproblems placing us in adversarial context vs. other SIs)
- Bargaining
  - Non-blackmailability
  - Secure counterfactual reasoning
  - First-mover penalty / epistemic low ground advantage
  - Division of gains from trade
- Epistemic exclusion of distant SIs
  - Distant superintelligences can coerce the most probable environment of your AI
  - Breaking out of hypotheses
'Philosophical' problems
- One True Prior
  - Pascal's Mugging / leverage prior
  - Second-orderness
  - Anthropics
    - How would an AI decide what to think about QTI?
- Mindcrime
  - Nonperson predicates (and unblocked neighbor problem)
- Do What I Don't Know I Mean
  - CEV
- Philosophical competence
  - Unprecedented excursions

Misleading Encouragement / context change / treacherous designs for naive projects
- Programmer prediction & infrahuman domains hide complexity of value
- Context change problems
- Problems that only appear in advanced regimes
- Problem classes that seem debugged in infrahuman regimes and suddenly break again in advanced regimes
- Methodologies that only work in infrahuman regimes
- Programmer deception
Academic inadequacy
- 'Ethics' work neglects technical problems that need longest serial research times and fails to give priority to astronomical failures over survivable small hits, but 'ethics' work has higher prestige, higher publishability, and higher cognitive accessibility
- Understanding of big technical picture currently very rare
  - Most possible funding sources cannot predict for themselves what might be technically useful in 10 years
  - Many possible funding sources may not regard MIRI as trusted to discern this
- Noise problems
  - Ethics research drowns out technical research
    - And provokes counterreaction
    - And makes the field seem nontechnical
  - Naive technical research drowns out sophisticated technical research
    - And makes problems look more solvable than they really are
    - And makes tech problems look trivial, therefore nonprestigious
    - And distracts talent/funding from hard problems
  - Bad methodology louder than good methodology
    - So projects can appear safety-concerned while adopting bad methodologies
- Future adequacy counterfactuals seem distant from the present regime
(To classify)
- Coordinative AI development hypothetical