The Tree of AI Alignment on Arbital

Edited by Mikhail Samin, last updated 6th Aug 2025

Imported from Arbital's Explore AI Alignment (with a bit of deduplication), plus the Decision theory articles.

  • Executable philosophy

    Philosophical discourse aimed at producing a trustworthy answer or meta-answer, in limited time, which can be used in constructing an Artificial Intelligence.

  • Some computations are people

    It’s possible for a conscious person to be simulated inside a computer or other substrate.

  • Researchers in value alignment theory

    Who’s working full-time in value alignment theory?

    • Nick Bostrom

      Nick Bostrom, secretly the inventor of Friendly AI

  • The rocket alignment problem

    If people talked about the problem of space travel the way they talked about AI…

  • Vingean reflection

    The problem of thinking about your future self when it’s smarter than you.

    • Vinge's Principle

      An agent building another agent must usually approve its design without knowing the agent’s exact policy choices.

    • Reflective stability

      Wanting to think the way you currently think, building other agents and self-modifications that think the same way.

      • Reflectively consistent degree of freedom

        When an instrumentally efficient, self-modifying AI can be like X or like X’ in such a way that X wants to be X and X’ wants to be X’, that’s a reflectively consistent degree of freedom.

        • Humean degree of freedom

          A concept includes ‘Humean degrees of freedom’ when the intuitive borders of the human version of that concept depend on our values, making that concept less natural for AIs to learn.

        • Value-laden

          Cure cancer, but avoid any bad side effects? Categorizing “bad side effects” requires knowing what’s “bad”. If an agent needs to load complex human goals to evaluate something, it’s “value-laden”.

      • Other-izing (wanted: new optimization idiom)

        Maximization isn’t possible for bounded agents, and satisficing doesn’t seem like enough. What other kind of ‘izing’ might be good for realistic, bounded agents?

      • Consequentialist preferences are reflectively stable by default

        Gandhi wouldn’t take a pill that made him want to kill people, because he knows in that case more people will be murdered. A paperclip maximizer doesn’t want to stop maximizing paperclips.

    • Tiling agents theory

      The theory of self-modifying agents that build successors that are very similar to themselves, like repeating tiles on a tessellated plane.

    • Reflective consistency

      A decision system is reflectively consistent if it can approve of itself, or approve the construction of similar decision systems (as well as perhaps approving other decision systems too).

  • Correlated coverage

    In which parts of AI alignment can we hope that getting many things right will mean the AI gets everything right?

  • Modeling distant superintelligences

    The several large problems that might occur if an AI starts to think about alien superintelligences.

    • Distant superintelligences can coerce the most probable environment of your AI

      Distant superintelligences may be able to hack your local AI, if your AI’s preference framework depends on its most probable environment.

  • Strategic AGI typology

    What broad types of advanced AIs, corresponding to which strategic scenarios, might it be possible or wise to create?

    • Known-algorithm non-self-improving agent

      Possible advanced AIs that aren’t self-modifying, aren’t self-improving, and where we know and understand all the component algorithms.

    • Autonomous AGI

      The hardest possible class of Friendly AI to build, with the least moral hazard; an AI intended to neither require nor accept further direction.

    • Task-directed AGI

      An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.

      • Behaviorist genie

        An advanced agent that’s forbidden to model minds in too much detail.

      • Epistemic exclusion

        How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?

      • Open subproblems in aligning a Task-based AGI

        Open research problems, especially ones we can model today, in building an AGI that can “paint all cars pink” without turning its future light cone into pink-painted cars.

      • Low impact

        The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.

        • Shutdown utility function

          A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).

        • Abortable plans

          Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they’ll have a quiet self-destruct option that includes any replicated nanomachines.

      • Conservative concept boundary

        Given N example burritos, draw a boundary around what is a ‘burrito’ that is relatively simple and allows as few positive instances as possible. Helps make sure the next thing generated is a burrito.

      • Querying the AGI user

        Postulating that an advanced agent will check something with its user, probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, etc etc).

      • Mild optimization

        An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn’t tile the universe with pink-painted cars, because it’s not trying that hard to max out its car-painting score.

      • Task identification problem

        If you have a task-based AGI (Genie) then how do you pinpoint exactly what you want it to do (and not do)?

        • Look where I'm pointing, not at my finger

          When trying to communicate the concept “glove”, getting the AGI to focus on “gloves” rather than “my user’s decision to label something a glove” or “anything that depresses the glove-labeling button”.

      • Safe plan identification and verification

        On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don’t want it to do.

        • Do-What-I-Mean hierarchy

          Successive levels of “Do What I Mean”: AGIs that understand their users increasingly well.

      • Faithful simulation

        How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?

      • Task (AI goal)

        When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.

      • Limited AGI

        Task-based AGIs don’t need unlimited cognitive and material powers to carry out their Tasks; which means their powers can potentially be limited.

      • Oracle

        System designed to safely answer questions.

        • Zermelo-Fraenkel provability oracle

          We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

      • Boxed AI

        Idea: what if we limit how AI can interact with the world. That’ll make it safe, right??

        • Zermelo-Fraenkel provability oracle

          We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

    • Oracle

      System designed to safely answer questions.

      • Zermelo-Fraenkel provability oracle

        We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

  • Sufficiently optimized agents appear coherent

    If you could think as well as a superintelligence, you’d be at least that smart yourself.

  • Relevant powerful agents will be highly optimized
  • Strong cognitive uncontainability

    An advanced agent can win in ways humans can’t understand in advance.

  • Advanced safety

    An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.

    • Methodology of unbounded analysis

      What we do and don’t understand how to do, using unlimited computing power, is a critical distinction and important frontier.

      • AIXI

        How to build an (evil) superintelligent AI using unlimited computing power and one page of Python code.

        • AIXI-tl

          A time-bounded version of the ideal agent AIXI that uses an impossibly large finite computer instead of a hypercomputer.

      • Solomonoff induction

        A simple way to superintelligently predict sequences of data, given unlimited computing power.

        • Solomonoff induction: Intro Dialogue (Math 2)

          An introduction to Solomonoff induction for the unfamiliar reader who isn’t bad at math.

      • Hypercomputer

        Some formalisms demand computers larger than the limit of all finite computers.

      • Unphysically large finite computer

        The imaginary box required to run programs that require impossibly large, but finite, amounts of computing power.

      • Cartesian agent

        Agents separated from their environments by impermeable barriers through which only sensory information can enter and motor output can exit.

        • Cartesian agent-environment boundary

          If your agent is separated from the environment by an absolute border that can only be crossed by sensory information and motor outputs, it might just be a Cartesian agent.

      • Mechanical Turk (example)

        The 19th-century chess-playing automaton known as the Mechanical Turk actually had a human operator inside. People at the time had interesting thoughts about the possibility of mechanical chess.

      • No-Free-Lunch theorems are often irrelevant

        There’s often a theorem proving that some problem has no optimal answer across every possible world. But this may not matter, since the real world is a special case. (E.g., a low-entropy universe.)

    • AI safety mindset

      Asking how AI designs could go wrong, instead of imagining them going right.

      • Valley of Dangerous Complacency

        When the AGI works often enough that you let down your guard, but it still has bugs. Imagine a robotic car that almost always steers perfectly, but sometimes heads off a cliff.

      • Show me what you've broken

        To demonstrate competence at computer security, or AI alignment, think in terms of breaking proposals and finding technically demonstrable flaws in them.

      • Ad-hoc hack (alignment theory)

        A “hack” is when you alter the behavior of your AI in a way that defies, or doesn’t correspond to, a principled approach for that problem.

      • Don't try to solve the entire alignment problem

        New to AI alignment theory? Want to work in this area? Already been working in it for years? Don’t try to solve the entire alignment problem with your next good idea!

      • Flag the load-bearing premises

        If somebody says, “This AI safety plan is going to fail, because X” and you reply, “Oh, that’s fine because of Y and Z”, then you’d better clearly flag Y and Z as “load-bearing” parts of your plan.

      • Directing, vs. limiting, vs. opposing

        Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)

    • Optimization daemons

      When you optimize something so hard that it crystallizes into an optimizer, like the way natural selection optimized apes so hard they turned into human-level intelligences.

    • Nearest unblocked strategy

      If you patch an agent’s preference framework to avoid an undesirable solution, what can you expect to happen?

    • Safe but useless

      Sometimes, at the end of locking down your AI so that it seems extremely safe, you’ll end up with an AI that can’t be used to do anything interesting.

    • Distinguish which advanced-agent properties lead to the foreseeable difficulty

      Say what kind of AI, or threshold level of intelligence, or key type of advancement, first produces the difficulty or challenge you’re talking about.

    • Goodness estimate biaser

      Some of the main problems in AI alignment can be seen as scenarios where actual goodness is likely to be systematically lower than a broken way of estimating goodness.

    • Goodhart's Curse

      The Optimizer’s Curse meets Goodhart’s Law. For example, if our values are V and an AI’s utility function U is a proxy for V, optimizing for high U seeks out ‘errors’: places where U − V is high. (See the toy simulation at the end of this list.)

    • Context disaster

      Some possible designs cause your AI to behave nicely while developing, and behave a lot less nicely when it’s smarter.

    • Methodology of foreseeable difficulties

      Building a nice AI is likely to be hard enough, and contain enough gotchas that won’t show up in the AI’s early days, that we need to foresee problems coming in advance.

    • Actual effectiveness

      If you want the AI’s so-called ‘utility function’ to actually be steering the AI, you need to think about how it meshes with the AI’s beliefs, and what actually gets output to actions.

  • Relevant powerful agent

    An agent is relevant if it completely changes the course of history.

  • Informed oversight

    Incentivize a reinforcement learner that’s less smart than you to accomplish some task.

  • Safe training procedures for human-imitators

    How does one train a reinforcement learner to act like a human?

  • Reliable prediction

    How can we train predictors that reliably predict observable phenomena such as human behavior?

  • Selective similarity metrics for imitation

    Can we make human-imitators more efficient by scoring them more heavily on imitating the aspects of human behavior we care about more?

  • Relevant limited AI

    Can we have a limited AI, that’s nonetheless relevant?

  • Value achievement dilemma

    How can Earth-originating intelligent life achieve most of its potential value, whether by AI or otherwise?

    • Moral hazards in AGI development

      “Moral hazard” is when owners of an advanced AGI give in to the temptation to do things with it that the rest of us would regard as ‘bad’, like, say, declaring themselves God-Emperor.

    • Coordinative AI development hypothetical

      What would safe AI development look like if we didn’t have to worry about anything else?

    • Pivotal act

      Which types of AIs, if they work, can do things that drastically change the nature of the further game?

    • Cosmic endowment

      The ‘cosmic endowment’ consists of all the stars that could be reached from probes originating on Earth; the sum of all matter and energy potentially available to be transformed into life and fun.

    • Aligning an AGI adds significant development time

      Aligning an advanced AI foreseeably involves extra code and extra testing and not being able to do everything the fastest way, so it takes longer.

  • VAT playpen

    Playpen page for VAT domain.

  • Nick Bostrom's book Superintelligence

    The current best book-form introduction to AI alignment theory.

  • List: value-alignment subjects

    Bullet point list of core VAT subjects.

  • AI arms races

    AI arms races are bad.

  • Corrigibility

    “I can’t let you do that, Dave.”

    • Programmer deception
      • Cognitive steganography

        Disaligned AIs that are modeling human psychology and trying to deceive their programmers will want to hide their internal thought processes from their programmers.

    • Utility indifference

      How can we make an AI indifferent to whether we press a button that changes its goals?

    • Averting instrumental pressures

      Almost-any utility function for an AI, whether the target is diamonds or paperclips or eudaimonia, implies subgoals like rapidly self-improving and refusing to shut down. Can we make that not happen?

    • Averting the convergent instrumental strategy of self-improvement

      We probably want the first AGI to not improve as fast as possible, but improving as fast as possible is a convergent strategy for accomplishing most things.

    • Shutdown problem

      How to build an AGI that lets you shut it down, despite the obvious fact that this will interfere with whatever the AGI’s goals are.

      • You can't get the coffee if you're dead

        An AI given the goal of ‘get the coffee’ can’t achieve that goal if it has been turned off; so even an AI whose goal is just to fetch the coffee may try to avert a shutdown button being pressed.

    • User manipulation

      If not otherwise averted, many of an AGI’s desired outcomes are likely to interact with users and hence imply an incentive to manipulate users.

      • User maximization

        A sub-principle of avoiding user manipulation—if you see an argmax over X or ‘optimize X’ instruction and X includes a user interaction, you’ve just told the AI to optimize the user.

    • Hard problem of corrigibility

      Can you build an agent that reasons as if it knows itself to be incomplete and sympathizes with your wanting to rebuild or correct it?

    • Problem of fully updated deference

      Why moral uncertainty doesn’t stop an AI from defending its off-switch.

    • Interruptibility

      A subproblem of corrigibility under the machine learning paradigm: when the agent is interrupted, it must not learn to prevent future interruptions.

  • Unforeseen maximum

    When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)

    • Missing the weird alternative

      People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.

  • Patch resistance

    One does not simply solve the value alignment problem.

    • Unforeseen maximum

      When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)

      • Missing the weird alternative

        People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.

  • Coordinative AI development hypothetical

    What would safe AI development look like if we didn’t have to worry about anything else?

  • Safe impact measure

    What can we measure to make sure an agent is acting in a safe manner?

  • AI alignment open problem

    Tag for open problems under AI alignment.

  • Natural language understanding of "right" will yield normativity

    What will happen if you tell an advanced agent to do the “right” thing?

  • Identifying ambiguous inductions

    What do a “red strawberry”, a “red apple”, and a “red cherry” have in common that a “yellow carrot” doesn’t? Are they “red fruits” or “red objects”?

  • Value

    The word ‘value’ in the phrase ‘value alignment’ is a metasyntactic variable that indicates the speaker’s future goals for intelligent life.

    • Extrapolated volition (normative moral theory)

      If someone asks you for orange juice, and you know that the refrigerator contains no orange juice, should you bring them lemonade?

      • Rescuing the utility function

        If your utility function values ‘heat’, and then you discover to your horror that there’s no ontologically basic heat, switch to valuing disordered kinetic energy. Likewise ‘free will’ or ‘people’.

    • Coherent extrapolated volition (alignment target)

      A proposed direction for an extremely well-aligned autonomous superintelligence—do what humans would want, if we knew what the AI knew, thought that fast, and understood ourselves.

    • 'Beneficial'

      Really actually good. A metasyntactic variable to mean “favoring whatever the speaker wants ideally to accomplish”, although different speakers have different morals and metaethics.

    • William Frankena's list of terminal values

      Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions…

    • 'Detrimental'

      The opposite of beneficial.

    • Immediate goods
    • Cosmopolitan value

      Intuitively: Value as seen from a broad, embracing standpoint that is aware of how other entities may not always be like us or easily understandable to us, yet still worthwhile.

  • Linguistic conventions in value alignment

    How and why to use precise language and words with special meaning when talking about value alignment.

    • Utility

      What is “utility” in the context of Value Alignment Theory?

  • Development phase unpredictable
    • Unforeseen maximum

      When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)

      • Missing the weird alternative

        People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.

  • Complexity of value

    There’s no simple way to describe the goals we want Artificial Intelligences to want.

    • Underestimating complexity of value because goodness feels like a simple property

      When you just want to yell at the AI, “Just do normal high-value X, dammit, not weird low-value X!” and that ‘high versus low value’ boundary is way more complicated than your brain wants to think.

    • Meta-rules for (narrow) value learning are still unsolved

      We don’t currently know a simple meta-utility function that would take in observation of humans and spit out our true values, or even a good target for a Task AGI.

  • Value alignment problem

    You want to build an advanced AI with the right values… but how?

    • Total alignment

      We say that an advanced AI is “totally aligned” when it knows exactly which outcomes and plans are beneficial, with no further user input.

    • Preference framework

      What’s the thing an agent uses to compare its preferences?

      • Moral uncertainty

        A meta-utility function in which the utility function, as usually considered, takes on different values in different possible worlds, potentially distinguishable by evidence.

        • Ideal target

          The ‘ideal target’ of a meta-utility function is the value the ground-level utility function would take on if the agent updated on all possible evidence; the ‘true’ utilities under moral uncertainty.

      • Meta-utility function

        Preference frameworks built out of simple utility functions, but where, e.g., the ‘correct’ utility function for a possible world depends on whether a button is pressed.

      • Attainable optimum

        The ‘attainable optimum’ of an agent’s preferences is the best that agent can actually do given its finite intelligence and resources (as opposed to the global maximum of those preferences).

  • Object-level vs. indirect goals

    Difference between “give Alice the apple” and “give Alice what she wants”.

  • Value identification problem
    • Happiness maximizer
    • Edge instantiation

      When you ask the AI to make people happy, and it tiles the universe with the smallest objects that can be happy.

    • Identifying causal goal concepts from sensory data

      If the intended goal is “cure cancer” and you show the AI healthy patients, it sees, say, a pattern of pixels on a webcam. How do you get to a goal concept about the real patients?

    • Goal-concept identification

      Figuring out how to say “strawberry” to an AI that you want to bring you strawberries (and not fake plastic strawberries, either).

    • Ontology identification problem

      How do we link an agent’s utility function to its model of the world, when we don’t know what that model will look like?

      • Diamond maximizer

        How would you build an agent that made as much diamond material as possible, given vast computing power but an otherwise rich and complicated environment?

      • Ontology identification problem: Technical tutorial

        Technical tutorial for ontology identification problem.

    • Environmental goals

      The problem of having an AI want outcomes that are out in the world, not just want direct sense events.

  • Intended goal
  • Mindcrime

    Might a machine intelligence contain vast numbers of unhappy conscious subprocesses?

    • Mindcrime: Introduction
    • Nonperson predicate

      If we knew which computations were definitely not people, we could tell AIs which programs they were definitely allowed to compute.

  • Principles in AI alignment

    A ‘principle’ of AI alignment is a very general design goal like ‘understand what the heck is going on inside the AI’ that has informed a wide set of specific design proposals.

    • Non-adversarial principle

      At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.

      • Omnipotence test for AI safety

        Would your AI produce disastrous outcomes if it suddenly gained omnipotence and omniscience? If so, why did you program something that wants to hurt you and is held back only by lacking the power?

      • Niceness is the first line of defense

        The first line of defense in dealing with any partially superhuman AI system advanced enough to possibly be dangerous is that it does not want to hurt you or defeat your safety measures.

      • Directing, vs. limiting, vs. opposing

        Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)

      • The AI must tolerate your safety measures

        A corollary of the nonadversarial principle is that “The AI must tolerate your safety measures.”

      • Generalized principle of cognitive alignment

        When we’re asking how we want the AI to think about an alignment problem, one source of inspiration is trying to have the AI mirror our own thoughts about that problem.

    • Minimality principle

      The first AGI ever built should save the world in a way that requires the least amount of the least dangerous cognition.

    • Understandability principle

      The more you understand what the heck is going on inside your AI, the safer you are.

      • Effability principle

        You are safer the more you understand the inner structure of how your AI thinks; the better you can describe the relation of smaller pieces of the AI’s thought process.

    • Separation from hyperexistential risk

      The AI should be widely separated in the design space from any AI that would constitute a “hyperexistential risk” (anything worse than death).

  • Theory of (advanced) agents

    One of the research subproblems of building powerful nice AIs is the theory of (sufficiently advanced) minds in general.

    • Instrumental convergence

      Some strategies can help achieve most possible simple goals. E.g., acquiring more computing power or more material resources. By default, unless averted, we can expect advanced AIs to do that.

      • Paperclip maximizer

        This agent will not stop until the entire universe is filled with paperclips.

        • Paperclip

          A configuration of matter that we’d see as being worthless even from a very cosmopolitan perspective.

        • Random utility function

          A ‘random’ utility function is one chosen at random according to some simple probability measure (e.g. weight by Kolmogorov complexity) on a logical space of formal utility functions.

      • Instrumental

        What is “instrumental” in the context of Value Alignment Theory?

      • Instrumental pressure

        A consequentialist agent will want to bring about certain instrumental events that will help to fulfill its goals.

      • Convergent instrumental strategies

        Paperclip maximizers can make more paperclips by improving their cognitive abilities or controlling more resources. What other strategies would almost-any AI try to use?

        • Convergent strategies of self-modification

          The strategies we’d expect to be employed by an AI that understands the relevance of its code and hardware to achieving its goals, which therefore has subgoals about its code and hardware.

        • Consequentialist preferences are reflectively stable by default

          Gandhi wouldn’t take a pill that made him want to kill people, because he knows in that case more people will be murdered. A paperclip maximizer doesn’t want to stop maximizing paperclips.

      • You can't get more paperclips that way

        Most arguments that “A paperclip maximizer could get more paperclips by (doing nice things)” are flawed.

    • Orthogonality Thesis

      Will smart AIs automatically become benevolent, or automatically become hostile? Or do different AI designs imply different goals?

      • Paperclip maximizer

        This agent will not stop until the entire universe is filled with paperclips.

        • Paperclip

          A configuration of matter that we’d see as being worthless even from a very cosmopolitan perspective.

        • Random utility function

          A ‘random’ utility function is one chosen at random according to some simple probability measure (e.g. weight by Kolmogorov complexity) on a logical space of formal utility functions.

      • Mind design space is wide

        Imagine all human beings as one tiny dot inside a much vaster sphere of possibilities for “The space of minds in general.” It is wiser to make claims about some minds than all minds.

      • Instrumental goals are almost-equally as tractable as terminal goals

        Getting the milk from the refrigerator because you want to drink it, is not vastly harder than getting the milk from the refrigerator because you inherently desire it.

    • Advanced agent properties

      How smart does a machine intelligence need to be, for its niceness to become an issue? “Advanced” is a broad term to cover cognitive abilities such that we’d need to start considering AI alignment.

      • Big-picture strategic awareness

        We start encountering new AI alignment issues at the point where a machine intelligence recognizes the existence of a real world, the existence of programmers, and how these relate to its goals.

      • Superintelligent

        A “superintelligence” is strongly superhuman (strictly higher-performing than any and all humans) on every cognitive problem.

      • Intelligence explosion

        What happens if a self-improving AI gets to the point where each amount x of self-improvement triggers >x further self-improvement, and it stays that way for a while.

      • Artificial General Intelligence

        An AI which has the same kind of “significantly more general” intelligence that humans have compared to chimpanzees; it can learn new domains, like we can.

      • Advanced nonagent

        Hypothetically, cognitively powerful programs that don’t follow the loop of “observe, learn, model the consequences, act, observe results” that a standard “agent” would.

      • Epistemic and instrumental efficiency

        An efficient agent never makes a mistake you can predict. You can never successfully predict a directional bias in its estimates.

        • Time-machine metaphor for efficient agents

          Don’t imagine a paperclip maximizer as a mind. Imagine it as a time machine that always spits out the output leading to the greatest number of future paperclips.

      • Standard agent properties

        What’s a Standard Agent, and what can it do?

        • Bounded agent

          An agent that operates in the real world, using realistic amounts of computing power, that is uncertain of its environment, etcetera.

      • Real-world domain

        Some AIs play chess, some AIs play Go, some AIs drive cars. These different ‘domains’ present different options. All of reality, in all its messy entanglement, is the ‘real-world domain’.

      • Sufficiently advanced Artificial Intelligence

        ‘Sufficiently advanced Artificial Intelligences’ are AIs with enough ‘advanced agent properties’ that we start needing to do ‘AI alignment’ to them.

      • Infrahuman, par-human, superhuman, efficient, optimal

        A categorization of AI ability levels relative to human, with some gotchas in the ordering. E.g., in simple domains where humans can play optimally, optimal play is not superhuman.

      • General intelligence

        Compared to chimpanzees, humans seem to be able to learn a much wider variety of domains. We have ‘significantly more generally applicable’ cognitive abilities, aka ‘more general intelligence’.

      • Corporations vs. superintelligences

        Corporations have relatively few of the advanced-agent properties that would allow one mistake in aligning a corporation to immediately kill all humans and turn the future light cone into paperclips.

      • Cognitive uncontainability

        ‘Cognitive uncontainability’ is when we can’t hold all of an agent’s possibilities inside our own minds.

        • Rich domain
          • Logical game

            A game’s mathematical structure in its purest form.

          • Almost all real-world domains are rich

            Anything you’re trying to accomplish in the real world can potentially be accomplished in a lot of different ways.

      • Vingean uncertainty

        You can’t predict the exact actions of an agent smarter than you—so is there anything you can say about them?

        • Vinge's Law

          You can’t predict exactly what someone smarter than you would do, because if you could, you’d be that smart yourself.

        • Deep Blue

          The chess-playing program, built by IBM, that in 1997 became the first computer to defeat a reigning world chess champion, Garry Kasparov, in a match.

      • Consequentialist cognition

        The cognitive ability to foresee the consequences of actions, prefer some outcomes to others, and output actions leading to the preferred outcomes.

  • Difficulty of AI alignment

    How hard is it exactly to point an Artificial General Intelligence in an intuitively okay direction?

  • Glossary (Value Alignment Theory)

    Words that have a special meaning in the context of creating nice AIs.

    • Friendly AI

      Old terminology for an AI whose preferences have been successfully aligned with idealized human values.

    • Cognitive domain

      An allegedly compact unit of knowledge, such that ideas inside the unit interact mainly with each other and less with ideas in other domains.

      • Distances between cognitive domains

        Often in AI alignment we want to ask, “How close is ‘being able to do X’ to ‘being able to do Y’?”

    • 'Concept'

      In the context of Artificial Intelligence, a ‘concept’ is a category, something that identifies thingies as being inside or outside the concept.

  • Programmer

    Who is building these advanced agents?

  • Decision theory

    The mathematical study of ideal decisionmaking AIs.

    • Expected utility formalism

      Expected utility is the central idea in the quantitative implementation of consequentialism.

      • Expected utility agent

        If you’re not some kind of expected utility agent, you’re going in circles.

      • Expected utility

        Scoring actions based on the average score of their probable consequences.

      • Utility function

        The only coherent way of wanting things is to assign consistent relative scores to outcomes.

      • Coherent decisions imply consistent utilities

        Why do we all use the ‘expected utility’ formalism? Because any behavior that can’t be viewed from that perspective, must be qualitatively self-defeating (in various mathy ways).

      • Coherence theorems

        A ‘coherence theorem’ shows that something bad happens to an agent if its decisions can’t be viewed as ‘coherent’ in some sense. E.g., an inconsistent preference ordering leads to going in circles.

    • Logical decision theories

      Root page for topics on logical decision theory, with multiple intros for different audiences.

      • Guide to Logical Decision Theory

        The entry point for learning about logical decision theory.

      • Introduction to Logical Decision Theory for Economists

        An introduction to ‘logical decision theory’ and its implications for the Ultimatum Game, voting in elections, bargaining problems, and more.

      • Omega (alien philosopher-troll)

        The entity that sets up all those trolley problems. An alien philosopher/troll imbued with unlimited powers, excellent predictive ability, and very odd motives.

      • Introduction to Logical Decision Theory for Computer Scientists

        ‘Logical decision theory’ from a math/programming standpoint, including how two agents with mutual knowledge of each other’s code can cooperate on the Prisoner’s Dilemma.

      • Introduction to Logical Decision Theory for Analytic Philosophers

        Why “choose as if controlling the logical output of your decision algorithm” is the most appealing candidate for the principle of rational choice.

      • An Introduction to Logical Decision Theory for Everyone Else

        So like what the heck is ‘logical decision theory’ in terms a normal person can understand?

      • Newcomblike decision problems

        Decision problems in which your choice correlates with something other than its physical consequences (say, because somebody has predicted you very well) can do weird things to some decision theories.

        • 99LDT x 1CDT oneshot PD tournament as arguable counterexample to LDT doing better than CDT

          Arguendo, if 99 LDT agents and 1 CDT agent are facing off in a one-shot Prisoner’s Dilemma tournament, the CDT agent does better on a problem that CDT considers ‘fair’.

        • Newcomb's Problem

          There are two boxes in front of you, Box A and Box B. You can take both boxes, or only Box B. Box A contains $1,000. Box B contains $1,000,000 if and only if Omega predicted you’d take only Box B. (See the worked payoff calculation at the end of this list.)

        • Prisoner's Dilemma

          You and an accomplice have been arrested. Both of you must decide, in isolation, whether to testify against the other prisoner—which subtracts one year from your sentence, and adds two to theirs.

          • True Prisoner's Dilemma

            A scenario that would reproduce the ideal payoff matrix of the Prisoner’s Dilemma for human beings who care about their public reputation and each other.

        • Absent-Minded Driver dilemma

          A road contains two identical intersections. An absent-minded driver wants to turn right at the second intersection. “With what probability should the driver turn right?” argue decision theorists.

        • Death in Damascus

          Death tells you that It is coming for you tomorrow. You can stay in Damascus or flee to Aleppo. Whichever decision you actually make is the wrong one. This gives some decision theories trouble.

        • 'Rationality' of voting in elections

          “A single vote is very unlikely to swing the election, so your vote is unlikely to have an effect” versus “Many people similar to you are making a similar decision about whether to vote.”

        • Toxoplasmosis dilemma

          A parasitic infection, carried by cats, may make humans enjoy petting cats more. A kitten, now in front of you, isn’t infected. But if you want to pet it, you may already be infected. Do you?

        • Transparent Newcomb's Problem

          Omega has left behind a transparent Box A containing $1,000, and a transparent Box B containing $1,000,000 or nothing. Box B is full iff Omega thinks you one-box on seeing a full Box B.

        • Parfit's Hitchhiker

          You are dying in the desert. A truck-driver who is very good at reading faces finds you, and offers to drive you into the city if you promise to pay $1,000 on arrival. You are a selfish rationalist.

        • Ultimatum Game

          A Proposer decides how to split $10 between themselves and the Responder. The Responder can take what is offered, or refuse, in which case both parties get nothing.

    • Updateless decision theories

      Decision theories that maximize their policies (mappings from sense inputs to actions), rather than using their sense inputs to update their beliefs and then selecting actions.

    • Fair problem class

      A problem is ‘fair’ (according to logical decision theory) when only the results matter and not how we get there.

    • Causal decision theories

      On CDT, to choose rationally, you should imagine the world where your physical act changes, then imagine running that world forward in time. (Therefore, it’s irrational to vote in elections.)

    • Evidential decision theories

      Theories which hold that the principle of rational choice is “Choose the act that would be the best news, if somebody told you that you’d chosen that act.”

    • Modal combat

      Modal combat
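
Two of the entries above invite a quick worked illustration. First, the Goodhart's Curse entry: the following toy simulation is a minimal sketch of my own (the option count, noise level, and Gaussian model are illustrative assumptions, not taken from the Arbital article). It draws a true value V for each option, scores every option with a noisy proxy U, and then selects the option with the highest U; on average the selected option's U overshoots its V, which is exactly the kind of 'error' that proxy optimization seeks out.

```python
import random

random.seed(0)

def goodhart_trial(num_options=1000, noise=1.0):
    """Score options with a noisy proxy U of true value V; pick the highest U."""
    options = []
    for _ in range(num_options):
        v = random.gauss(0, 1)           # true value V of this option
        u = v + random.gauss(0, noise)   # proxy U = V plus independent estimation error
        options.append((u, v))
    return max(options)                  # optimize the proxy: the highest U wins

trials = [goodhart_trial() for _ in range(200)]
avg_u = sum(u for u, _ in trials) / len(trials)
avg_v = sum(v for _, v in trials) / len(trials)
print(f"average proxy score U of the selected option: {avg_u:.2f}")
print(f"average true value V of the selected option:  {avg_v:.2f}")
# The selected options' U systematically exceeds their V: optimizing the proxy
# preferentially finds the places where the proxy overestimates true value.
```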
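
Second, the Newcomb's Problem entry: here is a back-of-the-envelope expected-value calculation that treats your choice as evidence about Omega's prediction (the prediction accuracy p = 0.99 is an assumed number for illustration, not part of the problem statement).

```python
def newcomb_expected_payoffs(p=0.99):
    """Expected dollar payoffs if Omega's prediction matches your choice with probability p."""
    # One-boxing: with probability p Omega predicted it, so Box B holds $1,000,000.
    one_box = p * 1_000_000
    # Two-boxing: with probability p Omega predicted it and Box B is empty ($1,000 total);
    # otherwise Box B is full and you take both boxes ($1,001,000).
    two_box = p * 1_000 + (1 - p) * (1_000_000 + 1_000)
    return one_box, two_box

one_box, two_box = newcomb_expected_payoffs()
print(f"one-boxing: ${one_box:,.0f}")   # $990,000
print(f"two-boxing: ${two_box:,.0f}")   # $11,000
# Causal decision theory instead treats the boxes' contents as fixed at decision time,
# so it sees two-boxing as a guaranteed extra $1,000; this divergence is what makes
# Newcomblike problems a stress test for decision theories.
```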
