LESSWRONG
Wikitags Dashboard

Wikitags in Need of Work


The Unjournal is a nonprofit organisation that works to organize and fund public journal-independent feedback, rating, and evaluation of hosted papers and dynamically-presented research projects. Their initial focus is on quantitative work that informs global priorities, especially in economics, policy, and social science. They aim to encourage better research by making it easier for researchers to get feedback and credible ratings on their work... (read more)

Posts exploring what extremely good futures look like.

Computation in Superposition (Comp-in-Sup) is a sub-field of Mechanistic Interpretability (Mech-Interp)... (read more)

LessWrong has its own palette of clickable reacts to text! They are quite different from those on other websites... (read more)

A "scissors statement" is a sentence selected to cause incendiary disagreements (because many people disagree sharply about it, and care a lot). The phrase was coined by Scott Alexander in his short story Sort by Controversial.

If Anyone Builds It, Everyone Dies (shortened as IABIED) is a book by Eliezer Yudkowsky and Nate Soares, released in September 2025... (read more)

CS 2881r is a class by @boazbarak on AI Safety and Alignment at Harvard... (read more)

AI-Fizzle refers to futures in which transformative AI (TAI) is never created. In AI-Fizzle scenarios, the overall effect of AI on human civilization remains smaller than that of the industrial revolution or the computer revolution. The term "AI-Fizzle" was introduced in the post Five Worlds of AI, by Scott Aaronson and Boaz Barak.

(testing whether capitalisation affects tagging capabilities, please ignore)

AI Consciousness

Newest Wikitags

Wikitag Voting Activity

Recent Wikitag Activity

User | Wikitag | When
david reinstein | The Unjournal (1) | 8h
plex | Utopia (32) | 20d
Linda Linsefors | Comp-In-Sup (4) | 1mo
AnnaSalamon | Scissors Statements (3) | 1mo
(showing 4 of 948)
The Unjournal
Edited by (+743) Nov 7th 2025 GMT 1
Discuss this tag
The Unjournal
New tag created by david reinstein at 8h


Discuss this tag
Eliezer's Lost Alignment Articles / The Arbital Sequence
Kabir Kumar | 1d | 10

Thanks for putting this together - the first article is especially useful

Sufficiently optimized agents appear coherent
Edited by (+99/-75) Nov 5th 2025 GMT 2
Discuss this wiki
Sufficiently optimized agents appear coherent
Edited by (+214/-164) Nov 5th 2025 GMT 2
Discuss this wiki
Sufficiently optimized agents appear coherent
Edited by (-27) Nov 5th 2025 GMT 2
Discuss this wiki
Relevant powerful agents will be highly optimized
Edited by (+19/-47) Nov 5th 2025 GMT 2
Discuss this wiki
Relevant powerful agents will be highly optimized
Edited by (+70/-75) Nov 5th 2025 GMT 2
Discuss this wiki
Löb's theorem
Edited by (+1154) Nov 4th 2025 GMT 2
Discuss this tag
Eliezer Yudkowsky
Edited by (+95/-69) Oct 31st 2025 GMT 1
Discuss this wiki
Eliezer Yudkowsky
Edited by (+534/-56) Oct 31st 2025 GMT 1
Discuss this wiki
Third Option
Edited by (+429/-199) Oct 30th 2025 GMT 1
Discuss this wiki
Third Option
Edited by (+14) Oct 30th 2025 GMT 1
Discuss this wiki
Gears-Level
Edited by (+486/-125) Oct 30th 2025 GMT 1
Discuss this tag
Gears-Level
Edited by (+16/-9) Oct 30th 2025 GMT 1
Discuss this tag
Free Energy Principle
Edited by (+30/-9) Oct 29th 2025 GMT 1
Discuss this tag
Free Energy Principle
Edited by (+241/-237) Oct 29th 2025 GMT 1
Discuss this tag
Predictive Processing
Edited by (+23) Oct 29th 2025 GMT 1
Discuss this tag
Predictive Processing
Edited by (+101/-141) Oct 29th 2025 GMT 1
Discuss this tag
Free Energy Principle
Edited by (+3461/-1913) Oct 29th 2025 GMT 1
Discuss this tag

The Unjournal is a nonprofit organisation that works to organize and fund public journal-independent feedback, rating, and evaluation of hosted papers and dynamically-presented research projects. Their initial focus is on quantitative work that informs global priorities, especially in economics, policy, and social science. They aim to encourage better research by making it easier for researchers to get feedback and credible ratings on their work.

The Unjournal was founded by David Reinstein and received a $565,000 grant from the Survival and Flourishing Fund in 2023.

Further reading:

An introduction to The Unjournal

The Unjournal's first evaluation

External links:

The Unjournal. Official Website.

Hosted output: https://Unjournal.pubpub.org

 

Summary: Violations of coherence constraints in probability theory and decision theory correspond to qualitatively destructive or dominated behaviors. Coherence violations so easily computed as to be humanly predictable should be eliminated by optimization strong enough and general enough to reliably eliminate behaviors that are qualitatively dominated by cheaply computable alternatives. From our perspective, this should produce agents such that, ceteris paribus, we do not think we can predict, in advance, any coherence violation in their behavior.

  • You prefer to be in San Francisco rather than Berkeley, and if you are in Berkeley, you will pay $50 for a taxi ride to San Francisco. (So far, no problem.)
  • You prefer San Jose to San Francisco, and if in San Francisco, you will pay $50 to go to San Jose. (Still no problem so far.)
  • You like Berkeley more than San Jose, and if in San Jose, you will pay $50 to go to Berkeley.

Again, we see a manifestation of a powerful family of theorems showing that agents that cannot be seen as corresponding to any coherent probabilities and consistent utility function will exhibit qualitatively destructive behavior, like paying someone a cent to throw a switch and then paying them another cent to throw it back.
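
A minimal sketch of the money pump implied by the cyclic preferences above (the function name and loop count are illustrative; the $50 fare and the $150-per-loop loss come from the example itself):

```python
# The cyclic preferences above imply a money pump: the agent pays $50 for each
# "preferred" move, yet after a full loop it is back where it started, $150 poorer.
NEXT_CITY = {                     # "from this city, pay the fare to reach that city"
    "Berkeley": "San Francisco",
    "San Francisco": "San Jose",
    "San Jose": "Berkeley",
}

def money_lost(start: str, loops: int = 1, fare: float = 50.0) -> float:
    """Total money spent after `loops` full trips around the preference cycle."""
    city, spent = start, 0.0
    for _ in range(3 * loops):    # three taxi rides per loop
        city = NEXT_CITY[city]
        spent += fare
    assert city == start          # the agent ends exactly where it began
    return spent

print(money_lost("Berkeley"))     # 150.0 lost per loop, nothing gained
```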

There is similarly a large literature on many classes of coherence arguments that yield classical probability theory, such as the Dutch Book theorems. There is no substantively different rival to probability theory and decision theory that is competitive when it comes to (a) plausibly having some bounded analogue which could appear to describe the uncertainty of a powerful cognitive agent, and (b) seeming highly motivated by coherence constraints, that is, being forced by the absence of qualitatively harmful behaviors that correspond to coherence violations.
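
For instance, here is a minimal sketch of the classic Dutch Book construction against credences that violate P(A) + P(¬A) = 1 (the specific numbers are illustrative):

```python
# An agent whose credences in A and in not-A sum to more than 1 regards both
# bets below as fair, yet is guaranteed to lose money however A turns out.
p_A, p_not_A = 0.6, 0.6                # incoherent: 0.6 + 0.6 = 1.2 > 1

stake = 1.0                            # each bet pays `stake` if it wins, else 0
price_paid = (p_A + p_not_A) * stake   # "fair" prices the agent will accept

for A_is_true in (True, False):
    payout = stake                     # exactly one of the two bets pays off
    print(A_is_true, round(payout - price_paid, 2))   # -0.2 either way
```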

Even an incoherent collection of shifting drives and desires may well recognize, after having paid their two cents or $150, that they are wasting money, and try to do things differently (self-modify). An AI's programmers may recognize that, from their own perspective, they would rather not have their AI spend money on circular taxi rides. This implies a path from incoherent non-advanced agents to coherent advanced agents as more and more optimization power is applied to them.

Without knowing in advance the exact specifics of the optimization pressures being applied, it seems that, in advance and ceteris paribus, we should expect that paying a cent to throw a switch and then paying again to throw it back, or throwing away $150 on circular taxi rides, are qualitatively destructive behaviors that optimization would tend to eliminate. E.g., one expects that a consequentialist goal-seeking agent would prefer, or a policy reinforcement learner would be reinforced, or a fitness criterion would evaluate greater fitness, etc., for eliminating the behavior that corresponds to incoherence, ceteris paribus, and given the option of eliminating it at a reasonable computational cost.

  • There will be some bounded notion of Bayesian rationality that incorporates e.g. a theory of logical uncertainty, which agents will appear from a human perspective to strictly obey. All departures from this bounded coherence that humans can understand using their own computing power will have been eliminated.
  • It will not be possible for humans to specifically predict in advance any large coherence violation, e.g. the above intertemporal conjunction fallacy. Anything simple enough and computable cheaply enough for humans to predict in advance will also be computationally possible for the agent to eliminate in advance. Any predictable coherence violation which is significant enough to be humanly worth noticing will also be damaging enough to be worth eliminating.

The Free Energy Principle (FEP) states that self-organizing systems which maintain a separation from their environments via a Markov blanket---including the brain and other physical systems---minimize their variational free energy (VFE) and expected free energy (EFE) via perception and action respectively[1]. Unlike in other theories of agency, under FEP, action and perception are unified as inference problems under similar objectives. In some cases, variational free energy reduces to prediction error, which is the difference between the predictions made about the environment and the actual outcomes experienced. The mathematical content of FEP is based on variational Bayesian methods.
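
As a point of reference, one common way of writing the variational free energy for a generative model p(o, s) and an approximate posterior q(s) (a sketch in generic notation; the exact form varies across formulations of FEP):

```latex
F[q] \;=\; \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o,s)\right]
     \;=\; \underbrace{D_{\mathrm{KL}}\!\left[\,q(s)\,\|\,p(s\mid o)\,\right]}_{\ge 0} \;-\; \ln p(o)
```

Minimizing F over q both pulls q toward the exact posterior and, since the KL term is non-negative, bounds the surprisal −ln p(o); under Gaussian assumptions the accuracy term becomes a precision-weighted prediction error, which is the sense in which VFE "reduces to prediction error" above.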

Although FEP has an extremely broad scope, it makes a number of very specific assumptions[2] that may restrict its applicability to real-world systems. Ongoing theoretical work attempts to reformulate the theory to hold under more realistic assumptions. Some progress has been made: newer formulations of FEP, unlike their predecessors, do not assume a constant Markov blanket (but rather, some Markov blanket trajectory)[3] and do not assume the existence of a non-equilibrium steady state[4].

There are two FEP process theories most relevant to neuroscience.[5] Predictive processing is a process theory of how VFE is minimized in brains during perception. Active Inference (AIF) is a process theory of the "action" part of FEP, which can also be seen as an agent architecture.

It has been argued[6] that AIF as an agent architecture manages the model complexity (i.e. the bias-variance tradeoff) and the exploration-exploitation tradeoff in a principled way; favours explicit, disentangled, and hence more interpretable belief representations; and is amenable to working within hierarchical systems of collective intelligence (which are seen as Active Inference agents themselves[7]). Building ecosystems of hierarchical collective intelligence can be seen as a proposed solution for and an alternative conceptualization of the general problem of alignment.

While some proponents of AIF believe that it is a more principled rival to Reinforcement Learning (RL), it has been shown that AIF is formally equivalent to the control-as-inference formulation of RL.[8] Additionally, AIF recovers Bayes-optimal reinforcement learning, optimal control theory, and Bayesian Decision Theory (aka EDT) under different simplifying assumptions[9][10].

AIF is an energy-based model of intelligence. This likens FEP/Active Inference to Bengio's GFlowNets[11] and LeCun's Joint Embedding Predictive Architecture (JEPA)[12], which are also energy-based.

  1. ^

    EFE is closely related to, and can be derived from, VFE. Action does not always minimize EFE; in some cases, it minimizes generalized free energy (a closely related quantity). See this figure for a brief overview.

  2. ^

    E.g. (1) sensory, active, internal and external states have independent random fluctuations; (2) there exists an injective map between the mode of internal states and mode of external

...
Read More (313 more words)

Sometimes written "Loeb's Theorem" (because umlauts are tricky). This is a theorem about proofs of what is provable and how they interact with what is actually provable in ways that surprise some people.

This math result often comes up when attempting to formalize "an agent" or "a value system" as somehow related to "a set of axioms".

Often, when making such mental motions, one wants to take multi-agent interactions seriously, and make the game-theoretically provably endorsable actions "towards an axiom system" be somehow contingent on what that other axiom system might or might not be able to game-theoretically provably endorse.

You end up with proofs about proofs about proofs... and then, without care, the formal proof systems themselves might explode or might give agentically incoherent results on certain test cases.

Sometimes, in this research context, the phrase "loebstacle" or "Löbstacle" comes up. This was an area of major focus (and a common study guide prerequisite) for MIRI from maybe 2011 to 2016?

It became much less important after the invention/discovery of the Garrabrant Inductor.

As to the math of Löb's theorem itself...
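
For reference, the statement itself, with □P abbreviating "P is provable in the theory T", for T a theory such as Peano Arithmetic (or any recursively axiomatized extension satisfying the standard derivability conditions):

```latex
\text{If } T \vdash \Box P \rightarrow P \text{, then } T \vdash P.
\qquad \text{Internalized form: } T \vdash \Box(\Box P \rightarrow P) \rightarrow \Box P.
```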

Perfect epistemic and instrumental coherence is too computationally expensive for bounded agents to achieve. Consider, e.g., the conjunction rule of probability, P(A∧B) ≤ P(A). If A is a theorem, and B is a lemma very helpful in proving A, then asking the agent for the probability of A alone may elicit a lower answer than asking the agent about the joint probability of A∧B (since thinking of B as a lemma increases the subjective probability of A). This is not a full-blown form of conjunction fallacy, since there is no particular time at which the agent explicitly assigns a lower probability to P((A∧B)∨(A∧¬B)) than to P(A∧B). But even for an advanced agent, if a human was watching the series of probability assignments, the human might be able to say some equivalent of, "Aha, even though the agent was exposed to no new outside evidence, it assigned probability X to A at time t, and then assigned probability Y>X to A∧B at time t+2."
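
A minimal sketch of the check the watching human could run on such a trace of reported probabilities (the trace values and variable names are made up for illustration):

```python
# Each entry is (time, event, reported probability). Coherence requires
# P(A and B) <= P(A); reporting a *higher* value for the conjunction later,
# with no new outside evidence, is the humanly predictable violation above.
trace = [
    (0, "A", 0.40),        # asked about the theorem A alone
    (2, "A and B", 0.70),  # asked about A together with the helpful lemma B
]

(t0, _, p_a), (t1, _, p_ab) = trace
if p_ab > p_a:
    print(f"Incoherence: P(A and B)={p_ab} at t={t1} exceeds P(A)={p_a} at t={t0}")
```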

  • There will be some bounded notion of Bayesian rationality that incorporates, e.g., a theory of logical uncertainty, which agents will appear from a human perspective to strictly obey. All departures from this bounded coherence that humans can understand using their own computing power will have been eliminated.
  • It will not be possible for humans to specifically predict in advance any large coherence violation, such as the above intertemporal conjunction fallacy. Anything simple enough and computable cheaply enough for humans to predict in advance will also be computationally possible for the agent to eliminate in advance. Any predictable coherence violation that is significant enough to be humanly worth noticing will also be damaging enough to be worth eliminating.

Although the first notion of salvageable coherence above seems to us quite plausible, it has a large gap with respect to what this bounded analogue of rationality might be. Insofar as the thesis that optimized agents appear coherent has practical implications, these implications should probably rest upon the second line of argument.

One possible loophole of the second line of argument might be some predictable class of incoherences which are not at all damaging to the agent and hence not worth spending even relatively tiny amounts of computing power to eliminate. If so, this would imply some possible humanly predictable incoherences of advanced agents, but these incoherences would not be exploitable to cause any final outcome that is less than maximally preferred by the agent, including scenarios where the agent spends resources it would not otherwise spend, etc.

Remark one: To advance-predict specific incoherence in an advanced agent, (a) we'd need to know what the superior alternative was, and (b) it would need to lead to the equivalent of going around in loops from San Francisco to San Jose to Berkeley.

Remark two: If, on some development...

Read More (149 more words)

Related Tags: Anticipated Experiences, Double-Crux, Empiricism, Falsifiability, Map and Territory
 

The term gears-level was first described on LessWrong in the post Gears in Understanding:

An example from Gears in Understanding of a gears-level model is (surprise): a box of gears. If you can see a series of interlocked gears, alternately turning clockwise, then counterclockwise, and so on, then you're able to anticipate the direction of any given gear, even if you cannot see it. It would be very difficult to imagine all of the gears turning as they are but only one of them changing direction whilst remaining interlocked. And finally, you would be able to rederive the direction of any given gear if you forgot it.
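
A toy sketch of that property (the function and indexing convention are illustrative): once the direction of one gear is fixed, the direction of every other gear in an interlocked chain is determined, so a hidden gear's direction can be rederived rather than memorized.

```python
def gear_direction(first_gear: str, index: int) -> str:
    """Direction of the gear at `index` in an interlocked chain,
    given the direction of gear 0 ('CW' or 'CCW')."""
    if index % 2 == 0:                       # adjacent gears turn opposite ways
        return first_gear
    return "CCW" if first_gear == "CW" else "CW"

# Even if gear 7 is hidden inside the box, its direction is fully constrained:
print(gear_direction("CW", 7))   # -> 'CCW'
```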
 

 

Gears vs Behavior by @johnswentworth adds:

That’s the key feature of gears-level models: they make falsifiable predictions about the internals of a system, separate from the externally-visible behavior. If a model could correctly predict all of a system’s externally-visible behavior, but still be falsified by looking inside the box, then that’s a gears-level model.

Related Tags: Anticipated Experiences, Double-Crux, Empiricism, Falsifiability, Map and Territory

  • on LessWrong starting with The Sequences / Rationality A-Z

  • Harry Potter and the Methods Of Rationality (book)

  • Planecrash / Project Lawful (long fictional story)

  • on Arbital (AI Alignment)

  • If Anyone Builds It, Everyone Dies (book)

  • on Twitter/X (mostly retweeting)

  • and on Facebook (mostly retweeting)

  • on yudkowsky.net (personal/fiction)

  • on Tumblr (fiction / on writing)

    • e.g. Masculine Mongoose (short story)
  • on Medium (2 posts)

  • on fanfiction.net (3 stories)

  • on Reddit (fiction / on writing)

    • also pseudonymously: Kindness to Kin

An example of a scenario that negates the thesis that relevant powerful agents will be highly optimized is KANSI (Known Algorithm Non-recursive Intelligence), where a cognitively powerful intelligence is produced by pouring lots of computing power into known algorithms, and this intelligence is then somehow prohibited from self-modification and the creation of environmental subagents.

A Third Option dissolves a False Dilemma by showing that there are in fact more than two options.

The first step in obtaining a Third Alternative is deciding to look for one, and the last step is the decision to accept it. This sounds obvious, and yet most people fail on these two steps, rather than within the search process.

— The Third Alternative

  • Goal factoring (technique for finding and accepting a Third Option)
  • False Dilemma
  • Color politics
  • Black swan
  • Defensibility
  • Arguments as soldiers
  • Motivated cognition
  • Challenging the Difficult (example of a false dilemma)

Writing

  • on LessWrong starting with The Sequences / Rationality A-Z

  • Harry Potter and the Methods Of Rationality (book)

  • Planecrash / Project Lawful (long fictional story)

  • on Arbital (AI Alignment)

  • on Twitter/X (mostly retweeting)

  • on Facebook (mostly retweeting)

  • on Medium (2 posts)

  • on Tumblr (fiction / on writing)

    • e.g. Masculine Mongoose (short story)
  • on fanfiction.net (3 stories)

  • on yudkowsky.net (personal/fiction)

  • on Reddit (fiction / on writing)

    • also pseudonymously: Kindness to Kin

Other Links:

  • Eliezer Yudkowsky's posts on Less Wrong
  • A list of all of Yudkowsky's posts to Overcoming Bias (an old blog that has since been ported to LessWrong), and dependency graphs for them
  • Eliezer Yudkowsky Facts by steven0461

Related Tags: Anticipated Experiences, Double-Crux, Empiricism, Falsifiability, Map and Territory
 

1. Does the model pay rent? If it does, and if it were falsified, how much (and how precisely) could you infer other things from the falsification?

An example from Gears in Understanding of a gears-level model is (surprise): a box of gears. If you can see a series of interlocked gears, alternately turning clockwise, then counterclockwise, and so on, then you're able to anticipate the direction of any given gear, even if you cannot see it. It would be very difficult to imagine all of the gears turning as they are but only one of them changing direction whilst remaining interlocked. And finally, you would be able to rederive the direction of any given gear if you forgot it.
 

Active Inference (AIF) can be seen as a generalization of predictive processing. While predictive processing only explains the agent's perception in terms of inference, AIF models both perception and action as inference under closely related unifying objectives: whereas perception minimizes variational free energy (which sometimes reduces to precision-weighted prediction error), action minimizes expected free energy.

The Free Energy Principle is a generalization of AIF that attempts to describe not only biological organisms but all systems that maintain a separation from their environments (via Markov blankets).
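
A toy numerical sketch of that split (a one-dimensional Gaussian example; the model, the update rules, and all numbers are illustrative rather than any canonical AIF formulation): perception moves the belief toward what is observed, weighted by precision, while action moves the observed quantity toward what the agent predicts/prefers.

```python
# Toy 1-D illustration: perception as precision-weighted prediction-error
# minimization, action as changing the sensed quantity to reduce the same error.
mu = 0.0            # current belief about the latent/preferred state (mean)
pi_prior = 1.0      # precision of the prior belief
pi_obs = 4.0        # sensory precision
world = 2.0         # the quantity actually being sensed

for step in range(5):
    obs = world
    # Perception: shift the belief toward the observation (inference).
    error = obs - mu
    mu = mu + (pi_obs / (pi_obs + pi_prior)) * error
    # Action: shift the world toward the belief/prediction instead.
    world = world - 0.3 * (world - mu)
    print(step, round(mu, 3), round(world, 3))
```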

  • False dilemma
  • Goal factoring
  • Color politics
  • Black swan
  • Defensibility
  • Arguments as soldiers
  • Motivated cognition
  • Challenging the Difficult

The Free Energy Principle (FEP) states that self-organizing systems which maintain a separation from their environments via a Markov blanket---including the brain and other physical systems---minimize their variational free energy (VFE) and expected free energy (EFE) via perception and action respectively[1]. Unlike in other theories of agency, under FEP, action and perception are unified as inference problems under similar objectives. In some cases, variational free energy reduces to prediction error, which is the difference between the predictions made about the environment and the actual outcomes experienced. The mathematical content of Active Inference is based on variational Bayesian methods.

Although FEP has an extremely broad scope, it makes a number of very specific assumptions---e.g. sensory, active, internal and external states have independent random fluctuations; there exists an injective map between the mode of internal states and mode of external states---that may restrict its applicability to real-world systems. Ongoing theoretical work attempts to reformulate the theory to hold under more realistic assumptions. Some progress has been made: newer formulations of FEP, unlike their predecessors, do not assume a constant Markov blanket (but rather, some Markov blanket trajectory)[2] and do not assume the existence of a non-equilibrium steady state[3].

Process theories

Since FEP is an unfalsifiable mathematical principle, it does not make sense to ask whether FEP is true (it is true mathematically, given the assumptions). Rather, it makes sense to ask whether its assumptions hold for a given system and, if so, how that system minimizes VFE and EFE. Unlike the FEP itself, a proposal of how some particular system minimizes VFE and EFE---a process theory---is falsifiable.

There are two FEP process theories most relevant to neuroscience[4]: Predictive processing is a process theory of how VFE is minimized in brains during perception. Active Inference (AIF) is a process theory of the "action" part of FEP, which can be seen both as an explanatory theory and as an agent architecture. In the latter sense, Active Inference rivals Reinforcement Learning.

It has been argued[5] that AIF as an agent architecture manages the model complexity (i.e., the bias-variance tradeoff) and the exploration-exploitation tradeoff in a principled way; favours explicit, disentangled, and hence more interpretable belief representations; and is amenable to working within hierarchical systems of collective intelligence (which are seen as Active Inference agents themselves[6]). Building ecosystems of hierarchical collective intelligence can be seen as a proposed solution for and an alternative conceptualization of the general problem of alignment.

Connections to other theories

While some proponents of AIF believe that...

Read More (581 more words)

The probability that an agent that is cognitively powerful enough to be relevant to existential outcomes will have been subject to strong, general optimization pressures. Two (disjunctive) supporting arguments are, one, that pragmatically accessible paths to producing cognitively powerful agents tend to invoke strong and general optimization pressures, and two, that cognitively powerful agents would be expected to apply strong and general optimization pressures to themselves.

Ending up with a scenario along the lines of KANSI requires defeating both of the above conditions simultaneously. The second condition seems more difficult to defeat, and seems to require more corrigibility or capability control features than the first.

Related Pages: Perceptual Control Theory, Neuroscience, Free Energy Principle

Since FEP is an unfalsifiable mathematical principle, it does not make sense to ask whether FEP is true (it is true mathematically, given the assumptions). Rather, it makes sense to ask whether its assumptions hold for a given system and, if so, how that system implements the minimization of VFE and EFE. Unlike the FEP itself, a proposal of how some particular system minimizes VFE and EFE---a process theory---is falsifiable.

Can we do useful meta-analysis? Unjournal evaluations of "Meaningfully reducing consumption of meat... is an unsolved problem..."
david reinstein
8h
0