Imported from Arbital's Explore AI Alignment (with a bit of deduplication), plus the Decision Theory articles.
Philosophical discourse aimed at producing a trustworthy answer or meta-answer, in limited time, which can be used in constructing an Artificial Intelligence.
It’s possible for a conscious person to be simulated inside a computer or other substrate.
Researchers in value alignment theory
Who’s working full-time in value alignment theory?
Nick Bostrom, secretly the inventor of Friendly AI
If people talked about the problem of space travel the way they talked about AI…
The problem of thinking about your future self when it’s smarter than you.
An agent building another agent must usually approve its design without knowing the agent’s exact policy choices.
Wanting to think the way you currently think, building other agents and self-modifications that think the same way.
Reflectively consistent degree of freedom
When an instrumentally efficient, self-modifying AI can be like X or like X’ in such a way that X wants to be X and X’ wants to be X’, that’s a reflectively consistent degree of freedom.
A concept includes ‘Humean degrees of freedom’ when the intuitive borders of the human version of that concept depend on our values, making that concept less natural for AIs to learn.
Cure cancer, but avoid any bad side effects? Categorizing “bad side effects” requires knowing what’s “bad”. If an agent needs to load complex human goals to evaluate something, it’s “value-laden”.
Other-izing (wanted: new optimization idiom)
Maximization isn’t possible for bounded agents, and satisficing doesn’t seem like enough. What other kind of ‘izing’ might be good for realistic, bounded agents?
Consequentialist preferences are reflectively stable by default
Gandhi wouldn’t take a pill that made him want to kill people, because he knows in that case more people will be murdered. A paperclip maximizer doesn’t want to stop maximizing paperclips.
The theory of self-modifying agents that build successors that are very similar to themselves, like repeating tiles on a tessellated plane.
A decision system is reflectively consistent if it can approve of itself, or approve the construction of similar decision systems (as well as perhaps approving other decision systems too).
In which parts of AI alignment can we hope that getting many things right will mean the AI gets everything right?
Modeling distant superintelligences
The several large problems that might occur if an AI starts to think about alien superintelligences.
Distant superintelligences can coerce the most probable environment of your AI
Distant superintelligences may be able to hack your local AI, if your AI’s preference framework depends on its most probable environment.
What broad types of advanced AIs, corresponding to which strategic scenarios, might it be possible or wise to create?
Known-algorithm non-self-improving agent
Possible advanced AIs that aren’t self-modifying, aren’t self-improving, and where we know and understand all the component algorithms.
The hardest possible class of Friendly AI to build, with the least moral hazard; an AI intended to neither require nor accept further direction.
An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.
An advanced agent that’s forbidden to model minds in too much detail.
How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?
Open subproblems in aligning a Task-based AGI
Open research problems, especially ones we can model today, in building an AGI that can “paint all cars pink” without turning its future light cone into pink-painted cars.
The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.
A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).
Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they’ll have a quiet self-destruct option that includes any replicated nanomachines.
Given N example burritos, draw a boundary around what counts as a ‘burrito’ that is relatively simple and allows as few positive instances as possible. This helps make sure the next thing generated is a burrito.
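A minimal sketch of the flavor of ‘conservative boundary’ this describes (the data, features, and code here are mine, purely illustrative, not the page’s algorithm):

```python
# Toy sketch: given N positive examples with numeric features, the most
# conservative axis-aligned boundary is the tightest box containing them,
# admitting as few new positive instances as possible.
burritos = [(20.0, 350.0), (22.5, 400.0), (18.0, 300.0)]   # (length_cm, grams); made-up examples

lo = [min(b[i] for b in burritos) for i in range(2)]
hi = [max(b[i] for b in burritos) for i in range(2)]

def conservatively_a_burrito(item):
    """Accept only items inside the tight bounding box of the examples."""
    return all(lo[i] <= item[i] <= hi[i] for i in range(2))

print(conservatively_a_burrito((21.0, 360.0)))   # True: close to the known examples
print(conservatively_a_burrito((80.0, 5000.0)))  # False: a looser concept might have allowed this
```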
Postulating that an advanced agent will check something with its user probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, etc.).
An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn’t tile the universe with pink-painted cars, because it’s not trying that hard to max out its car-painting score.
If you have a task-based AGI (Genie) then how do you pinpoint exactly what you want it to do (and not do)?
Look where I'm pointing, not at my finger
When trying to communicate the concept “glove”, getting the AGI to focus on “gloves” rather than “my user’s decision to label something a glove” or “anything that depresses the glove-labeling button”.
Safe plan identification and verification
On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don’t want it to do.
Successive levels of “Do What I Mean” or AGIs that understand their users increasingly well
How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?
When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.
Task-based AGIs don’t need unlimited cognitive and material powers to carry out their Tasks; which means their powers can potentially be limited.
System designed to safely answer questions.
Zermelo-Fraenkel provability oracle
We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.
Idea: what if we limit how AI can interact with the world. That’ll make it safe, right??
Sufficiently optimized agents appear coherent
If you could think as well as a superintelligence, you’d be at least that smart yourself.
Strong cognitive uncontainability
An advanced agent can win in ways humans can’t understand in advance.
An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.
Methodology of unbounded analysis
What we do and don’t understand how to do, using unlimited computing power, is a critical distinction and important frontier.
How to build an (evil) superintelligent AI using unlimited computing power and one page of Python code.
A time-bounded version of the ideal agent AIXI that uses an impossibly large finite computer instead of a hypercomputer.
A simple way to superintelligently predict sequences of data, given unlimited computing power.
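For reference, the standard form of the Solomonoff prior and its prediction rule (standard notation, not anything specific to this page), where U is a universal prefix machine and ℓ(p) is the bit-length of program p:

```latex
M(x) \;=\; \sum_{p\,:\,U(p)\text{ outputs a string beginning with }x} 2^{-\ell(p)},
\qquad
P(x_{n+1}\mid x_{1:n}) \;=\; \frac{M(x_{1:n}\,x_{n+1})}{M(x_{1:n})}
```

Shorter programs that reproduce the data seen so far dominate the sum, which is also why the predictor needs unlimited computing power: the sum runs over all programs.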
Solomonoff induction: Intro Dialogue (Math 2)
An introduction to Solomonoff induction for the unfamiliar reader who isn’t bad at math
Some formalisms demand computers larger than the limit of all finite computers
Unphysically large finite computer
The imaginary box required to run programs that require impossibly large, but finite, amounts of computing power.
Agents separated from their environments by impermeable barriers through which only sensory information can enter and motor output can exit.
Cartesian agent-environment boundary
If your agent is separated from the environment by an absolute border that can only be crossed by sensory information and motor outputs, it might just be a Cartesian agent.
The 19th-century chess-playing automaton known as the Mechanical Turk actually had a human operator inside. People at the time had interesting thoughts about the possibility of mechanical chess.
No-Free-Lunch theorems are often irrelevant
There’s often a theorem proving that some problem has no optimal answer across every possible world. But this may not matter, since the real world is a special case. (E.g., a low-entropy universe.)
Asking how AI designs could go wrong, instead of imagining them going right.
Valley of Dangerous Complacency
When the AGI works often enough that you let down your guard, but it still has bugs. Imagine a robotic car that almost always steers perfectly, but sometimes heads off a cliff.
To demonstrate competence at computer security, or AI alignment, think in terms of breaking proposals and finding technically demonstrable flaws in them.
Ad-hoc hack (alignment theory)
A “hack” is when you alter the behavior of your AI in a way that defies, or doesn’t correspond to, a principled approach for that problem.
Don't try to solve the entire alignment problem
New to AI alignment theory? Want to work in this area? Already been working in it for years? Don’t try to solve the entire alignment problem with your next good idea!
Flag the load-bearing premises
If somebody says, “This AI safety plan is going to fail, because X” and you reply, “Oh, that’s fine because of Y and Z”, then you’d better clearly flag Y and Z as “load-bearing” parts of your plan.
Directing, vs. limiting, vs. opposing
Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)
When you optimize something so hard that it crystallizes into an optimizer, like the way natural selection optimized apes so hard they turned into human-level intelligences.
If you patch an agent’s preference framework to avoid an undesirable solution, what can you expect to happen?
Sometimes, at the end of locking down your AI so that it seems extremely safe, you’ll end up with an AI that can’t be used to do anything interesting.
Distinguish which advanced-agent properties lead to the foreseeable difficulty
Say what kind of AI, or threshold level of intelligence, or key type of advancement, first produces the difficulty or challenge you’re talking about.
Some of the main problems in AI alignment can be seen as scenarios where actual goodness is likely to be systematically lower than a broken way of estimating goodness.
The Optimizer’s Curse meets Goodhart’s Law. For example, if our values are V and an AI’s utility function U is a proxy for V, then optimizing for high U seeks out ‘errors’: places where U − V is large.
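A toy simulation of that effect (the code and numbers are mine, purely illustrative): score many options with a noisy proxy U for the true value V, pick the option with the highest U, and the selected option’s U reliably overshoots its V.

```python
# Toy illustration of the Optimizer's Curse / Goodhart effect: selecting on a
# noisy proxy U of the true value V systematically selects for large U - V.
import random

random.seed(0)
n_options, n_trials = 1000, 200
overshoot = []
for _ in range(n_trials):
    V = [random.gauss(0, 1) for _ in range(n_options)]   # true values
    U = [v + random.gauss(0, 1) for v in V]              # proxy = truth + noise
    i = max(range(n_options), key=lambda j: U[j])        # optimize the proxy
    overshoot.append(U[i] - V[i])

print("mean U - V of the proxy-optimal option:", sum(overshoot) / n_trials)  # reliably positive
```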
Some possible designs cause your AI to behave nicely while developing, and behave a lot less nicely when it’s smarter.
Methodology of foreseeable difficulties
Building a nice AI is likely to be hard enough, and contain enough gotchas that won’t show up in the AI’s early days, that we need to foresee problems coming in advance.
If you want the AI’s so-called ‘utility function’ to actually be steering the AI, you need to think about how it meshes with the AI’s beliefs, and how it gets output to actions.
An agent is relevant if it completely changes the course of history.
Incentivize a reinforcement learner that’s less smart than you to accomplish some task
Safe training procedures for human-imitators
How does one train a reinforcement learner to act like a human?
How can we train predictors that reliably predict observable phenomena such as human behavior?
Selective similarity metrics for imitation
Can we make human-imitators more efficient by scoring them more heavily on imitating the aspects of human behavior we care about more?
Can we have a limited AI, that’s nonetheless relevant?
How can Earth-originating intelligent life achieve most of its potential value, whether by AI or otherwise?
Moral hazards in AGI development
“Moral hazard” is when owners of an advanced AGI give in to the temptation to do things with it that the rest of us would regard as ‘bad’, like, say, declaring themselves God-Emperor.
Coordinative AI development hypothetical
What would safe AI development look like if we didn’t have to worry about anything else?
Which types of AIs, if they work, can do things that drastically change the nature of the further game?
The ‘cosmic endowment’ consists of all the stars that could be reached from probes originating on Earth; the sum of all matter and energy potentially available to be transformed into life and fun.
Aligning an AGI adds significant development time
Aligning an advanced AI foreseeably involves extra code and extra testing and not being able to do everything the fastest way, so it takes longer.
Playpen page for VAT domain.
Nick Bostrom's book Superintelligence
The current best book-form introduction to AI alignment theory.
List: value-alignment subjects
Bullet point list of core VAT subjects.
AI arms races are bad
“I can’t let you do that, Dave.”
Disaligned AIs that are modeling human psychology and trying to deceive their programmers will want to hide their internal thought processes from their programmers.
How can we make an AI indifferent to whether we press a button that changes its goals?
Averting instrumental pressures
Almost any utility function for an AI, whether the target is diamonds or paperclips or eudaimonia, implies subgoals like rapidly self-improving and refusing to shut down. Can we make that not happen?
Averting the convergent instrumental strategy of self-improvement
We probably want the first AGI to not improve as fast as possible, but improving as fast as possible is a convergent strategy for accomplishing most things.
How to build an AGI that lets you shut it down, despite the obvious fact that this will interfere with whatever the AGI’s goals are.
You can't get the coffee if you're dead
An AI given the goal of ‘get the coffee’ can’t achieve that goal if it has been turned off; so even an AI whose goal is just to fetch the coffee may try to avert a shutdown button being pressed.
If not otherwise averted, many of an AGI’s desired outcomes are likely to interact with users and hence imply an incentive to manipulate users.
A sub-principle of avoiding user manipulation—if you see an argmax over X or ‘optimize X’ instruction and X includes a user interaction, you’ve just told the AI to optimize the user.
Can you build an agent that reasons as if it knows itself to be incomplete and sympathizes with your wanting to rebuild or correct it?
Problem of fully updated deference
Why moral uncertainty doesn’t stop an AI from defending its off-switch.
A subproblem of corrigibility under the machine learning paradigm: when the agent is interrupted, it must not learn to prevent future interruptions.
When you tell an AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)
People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.
One does not simply solve the value alignment problem.
What can we measure to make sure an agent is acting in a safe manner?
Tag for open problems under AI alignment.
Natural language understanding of "right" will yield normativity
What will happen if you tell an advanced agent to do the “right” thing?
Identifying ambiguous inductions
What do a “red strawberry”, a “red apple”, and a “red cherry” have in common that a “yellow carrot” doesn’t? Are they “red fruits” or “red objects”?
The word ‘value’ in the phrase ‘value alignment’ is a metasyntactic variable that indicates the speaker’s future goals for intelligent life.
Extrapolated volition (normative moral theory)
If someone asks you for orange juice, and you know that the refrigerator contains no orange juice, should you bring them lemonade?
If your utility function values ‘heat’, and then you discover to your horror that there’s no ontologically basic heat, switch to valuing disordered kinetic energy. Likewise ‘free will’ or ‘people’.
Coherent extrapolated volition (alignment target)
A proposed direction for an extremely well-aligned autonomous superintelligence—do what humans would want, if we knew what the AI knew, thought that fast, and understood ourselves.
Really actually good. A metasyntactic variable to mean “favoring whatever the speaker wants ideally to accomplish”, although different speakers have different morals and metaethics.
William Frankena's list of terminal values
Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions…
The opposite of beneficial.
Intuitively: Value as seen from a broad, embracing standpoint that is aware of how other entities may not always be like us or easily understandable to us, yet still worthwhile.
Linguistic conventions in value alignment
How and why to use precise language and words with special meaning when talking about value alignment.
What is “utility” in the context of Value Alignment Theory?
There’s no simple way to describe the goals we want Artificial Intelligences to want.
Underestimating complexity of value because goodness feels like a simple property
When you just want to yell at the AI, “Just do normal high-value X, dammit, not weird low-value X!” and that ‘high versus low value’ boundary is way more complicated than your brain wants to think.
Meta-rules for (narrow) value learning are still unsolved
We don’t currently know a simple meta-utility function that would take in observation of humans and spit out our true values, or even a good target for a Task AGI.
You want to build an advanced AI with the right values… but how?
We say that an advanced AI is “totally aligned” when it knows exactly which outcomes and plans are beneficial, with no further user input.
What’s the thing an agent uses to compare its preferences?
A meta-utility function in which the utility function, as usually considered, takes on different values in different possible worlds, potentially distinguishable by evidence.
The ‘ideal target’ of a meta-utility function is the value the ground-level utility function would take on if the agent updated on all possible evidence; the ‘true’ utilities under moral uncertainty.
Preference frameworks built out of simple utility functions, but where, e.g., the ‘correct’ utility function for a possible world depends on whether a button is pressed.
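A toy sketch of the structure these entries describe (the function names and numbers are mine, purely illustrative): the ground-level utility function the agent should use depends on a feature of the possible world, here a button press, and the agent takes an expectation over which one is ‘correct’.

```python
# Toy sketch of a meta-utility function: which ground-level utility function
# is 'correct' depends on an observable feature of the world (a button).
def u_paint_cars(outcome):          # hypothetical candidate utility #1
    return float(outcome["cars_painted"])

def u_shut_down(outcome):           # hypothetical candidate utility #2
    return 1.0 if outcome["shut_down"] else 0.0

def meta_utility(outcome, world):
    correct_u = u_shut_down if world["button_pressed"] else u_paint_cars
    return correct_u(outcome)

def expected_meta_utility(outcome, p_pressed):
    # Under uncertainty about which world is actual, average over which
    # utility function turns out to be the correct one.
    return (p_pressed * meta_utility(outcome, {"button_pressed": True})
            + (1 - p_pressed) * meta_utility(outcome, {"button_pressed": False}))

print(expected_meta_utility({"cars_painted": 3, "shut_down": False}, 0.1))  # 2.7
```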
The ‘attainable optimum’ of an agent’s preferences is the best that agent can actually do given its finite intelligence and resources (as opposed to the global maximum of those preferences).
Object-level vs. indirect goals
Difference between “give Alice the apple” and “give Alice what she wants”.
When you ask the AI to make people happy, and it tiles the universe with the smallest objects that can be happy.
Identifying causal goal concepts from sensory data
If the intended goal is “cure cancer” and you show the AI healthy patients, it sees, say, a pattern of pixels on a webcam. How do you get to a goal concept about the real patients?
Figuring out how to say “strawberry” to an AI that you want to bring you strawberries (and not fake plastic strawberries, either).
Ontology identification problem
How do we link an agent’s utility function to its model of the world, when we don’t know what that model will look like?
How would you build an agent that made as much diamond material as possible, given vast computing power but an otherwise rich and complicated environment?
Ontology identification problem: Technical tutorial
Technical tutorial for ontology identification problem.
The problem of having an AI want outcomes that are out in the world, not just want direct sense events.
Might a machine intelligence contain vast numbers of unhappy conscious subprocesses?
If we knew which computations were definitely not people, we could tell AIs which programs they were definitely allowed to compute.
A ‘principle’ of AI alignment is a very general design goal like ‘understand what the heck is going on inside the AI’ that has informed a wide set of specific design proposals.
At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.
Omnipotence test for AI safety
Would your AI produce disastrous outcomes if it suddenly gained omnipotence and omniscience? If so, why did you program something that wants to hurt you and is held back only by lacking the power?
Niceness is the first line of defense
The first line of defense in dealing with any partially superhuman AI system advanced enough to possibly be dangerous is that it does not want to hurt you or defeat your safety measures.
The AI must tolerate your safety measures
A corollary of the nonadversarial principle is that “The AI must tolerate your safety measures.”
Generalized principle of cognitive alignment
When we’re asking how we want the AI to think about an alignment problem, one source of inspiration is trying to have the AI mirror our own thoughts about that problem.
The first AGI ever built should save the world in a way that requires the least amount of the least dangerous cognition.
The more you understand what the heck is going on inside your AI, the safer you are.
You are safer the more you understand the inner structure of how your AI thinks; the better you can describe the relation of smaller pieces of the AI’s thought process.
Separation from hyperexistential risk
The AI should be widely separated in the design space from any AI that would constitute a “hyperexistential risk” (anything worse than death).
One of the research subproblems of building powerful nice AIs, is the theory of (sufficiently advanced) minds in general.
Some strategies can help achieve most possible simple goals. E.g., acquiring more computing power or more material resources. By default, unless averted, we can expect advanced AIs to do that.
This agent will not stop until the entire universe is filled with paperclips.
A configuration of matter that we’d see as being worthless even from a very cosmopolitan perspective.
A ‘random’ utility function is one chosen at random according to some simple probability measure (e.g., weighted by Kolmogorov complexity) on a logical space of formal utility functions.
What is “instrumental” in the context of Value Alignment Theory?
A consequentialist agent will want to bring about certain instrumental events that will help to fulfill its goals.
Convergent instrumental strategies
Paperclip maximizers can make more paperclips by improving their cognitive abilities or controlling more resources. What other strategies would almost any AI try to use?
Convergent strategies of self-modification
The strategies we’d expect to be employed by an AI that understands the relevance of its code and hardware to achieving its goals, which therefore has subgoals about its code and hardware.
You can't get more paperclips that way
Most arguments that “A paperclip maximizer could get more paperclips by (doing nice things)” are flawed.
Will smart AIs automatically become benevolent, or automatically become hostile? Or do different AI designs imply different goals?
Imagine all human beings as one tiny dot inside a much vaster sphere of possibilities for “The space of minds in general.” It is wiser to make claims about some minds than all minds.
Instrumental goals are almost-equally as tractable as terminal goals
Getting the milk from the refrigerator because you want to drink it is not vastly harder than getting the milk from the refrigerator because you inherently desire it.
How smart does a machine intelligence need to be, for its niceness to become an issue? “Advanced” is a broad term to cover cognitive abilities such that we’d need to start considering AI alignment.
Big-picture strategic awareness
We start encountering new AI alignment issues at the point where a machine intelligence recognizes the existence of a real world, the existence of programmers, and how these relate to its goals.
A “superintelligence” is strongly superhuman (strictly higher-performing than any and all humans) on every cognitive problem.
What happens if a self-improving AI gets to the point where each amount x of self-improvement triggers >x further self-improvement, and it stays that way for a while.
Artificial General Intelligence
An AI which has the same kind of “significantly more general” intelligence that humans have compared to chimpanzees; it can learn new domains, like we can.
Hypothetically, cognitively powerful programs that don’t follow the loop of “observe, learn, model the consequences, act, observe results” that a standard “agent” would.
Epistemic and instrumental efficiency
An efficient agent never makes a mistake you can predict. You can never successfully predict a directional bias in its estimates.
Time-machine metaphor for efficient agents
Don’t imagine a paperclip maximizer as a mind. Imagine it as a time machine that always spits out the output leading to the greatest number of future paperclips.
What’s a Standard Agent, and what can it do?
An agent that operates in the real world, using realistic amounts of computing power, that is uncertain of its environment, etcetera.
Some AIs play chess, some AIs play Go, some AIs drive cars. These different ‘domains’ present different options. All of reality, in all its messy entanglement, is the ‘real-world domain’.
Sufficiently advanced Artificial Intelligence
‘Sufficiently advanced Artificial Intelligences’ are AIs with enough ‘advanced agent properties’ that we start needing to do ‘AI alignment’ to them.
Infrahuman, par-human, superhuman, efficient, optimal
A categorization of AI ability levels relative to human, with some gotchas in the ordering. E.g., in simple domains where humans can play optimally, optimal play is not superhuman.
Compared to chimpanzees, humans seem to be able to learn a much wider variety of domains. We have ‘significantly more generally applicable’ cognitive abilities, aka ‘more general intelligence’.
Corporations vs. superintelligences
Corporations have relatively few of the advanced-agent properties that would allow one mistake in aligning a corporation to immediately kill all humans and turn the future light cone into paperclips.
‘Cognitive uncontainability’ is when we can’t hold all of an agent’s possibilities inside our own minds.
A game’s mathematical structure in its purest form.
Almost all real-world domains are rich
Anything you’re trying to accomplish in the real world can potentially be accomplished in a lot of different ways.
You can’t predict the exact actions of an agent smarter than you—so is there anything you can say about them?
You can’t predict exactly what someone smarter than you would do, because if you could, you’d be that smart yourself.
The chess-playing program, built by IBM, that defeated world chess champion Garry Kasparov in their 1997 match.
The cognitive ability to foresee the consequences of actions, prefer some outcomes to others, and output actions leading to the preferred outcomes.
How hard is it exactly to point an Artificial General Intelligence in an intuitively okay direction?
Glossary (Value Alignment Theory)
Words that have a special meaning in the context of creating nice AIs.
Old terminology for an AI whose preferences have been successfully aligned with idealized human values.
An allegedly compact unit of knowledge, such that ideas inside the unit interact mainly with each other and less with ideas in other domains.
Distances between cognitive domains
Often in AI alignment we want to ask, “How close is ‘being able to do X’ to ‘being able to do Y’?”
In the context of Artificial Intelligence, a ‘concept’ is a category, something that identifies thingies as being inside or outside the concept.
Who is building these advanced agents?
The mathematical study of ideal decision-making AIs.
Expected utility is the central idea in the quantitative implementation of consequentialism
If you’re not some kind of expected utility agent, you’re going in circles.
Scoring actions based on the average score of their probable consequences.
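In symbols (the standard formulation, not anything specific to these pages), with a small worked example:

```latex
\mathbb{EU}(a) \;=\; \sum_{o} P(o \mid a)\, U(o)
```

For instance, an action with an 0.8 chance of an outcome worth 10 utility and a 0.2 chance of an outcome worth 0 scores 0.8 · 10 + 0.2 · 0 = 8, and so is preferred to a sure outcome worth 7.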
The only coherent way of wanting things is to assign consistent relative scores to outcomes.
Coherent decisions imply consistent utilities
Why do we all use the ‘expected utility’ formalism? Because any behavior that can’t be viewed from that perspective, must be qualitatively self-defeating (in various mathy ways).
A ‘coherence theorem’ shows that something bad happens to an agent if its decisions can’t be viewed as ‘coherent’ in some sense. E.g., an inconsistent preference ordering leads to going in circles.
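A toy version of the ‘going in circles’ money pump (the example is mine): an agent with the cyclic preference A ≺ B ≺ C ≺ A will pay a small fee for each ‘upgrade’ and end up holding what it started with, strictly poorer.

```python
# Money-pump sketch: cyclic preferences let a trader extract fees indefinitely.
prefers = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}  # offered item preferred to held item
fee = 1

holding, money = "A", 100
for offered in ["B", "C", "A", "B", "C", "A"]:
    if prefers.get((holding, offered), False):
        holding, money = offered, money - fee   # pays to trade 'up' every time

print(holding, money)   # A 94: back where it started, 6 poorer
```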
Root page for topics on logical decision theory, with multiple intros for different audiences.
Guide to Logical Decision Theory
The entry point for learning about logical decision theory.
Introduction to Logical Decision Theory for Economists
An introduction to ‘logical decision theory’ and its implications for the Ultimatum Game, voting in elections, bargaining problems, and more.
Omega (alien philosopher-troll)
The entity that sets up all those trolley problems. An alien philosopher/troll imbued with unlimited powers, excellent predictive ability, and very odd motives.
Introduction to Logical Decision Theory for Computer Scientists
‘Logical decision theory’ from a math/programming standpoint, including how two agents with mutual knowledge of each other’s code can cooperate on the Prisoner’s Dilemma.
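A minimal sketch of the mutual-code-knowledge idea (my toy, far simpler than the constructions that intro works through): an agent that cooperates exactly when its opponent is verifiably running the same decision procedure cooperates with copies of itself and cannot be exploited by a defector.

```python
# Toy program-equilibrium sketch for the one-shot Prisoner's Dilemma, standing
# in for agents that can inspect each other's code.
def clique_bot(opponent):
    # Cooperate iff the opponent is running exactly my decision procedure.
    return "C" if opponent is clique_bot else "D"

def defect_bot(opponent):
    return "D"

def play(a, b):
    return a(b), b(a)

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D'): no exploitation
```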
Introduction to Logical Decision Theory for Analytic Philosophers
Why “choose as if controlling the logical output of your decision algorithm” is the most appealing candidate for the principle of rational choice.
An Introduction to Logical Decision Theory for Everyone Else
So like what the heck is ‘logical decision theory’ in terms a normal person can understand?
Decision problems in which your choice correlates with something other than its physical consequences (say, because somebody has predicted you very well) can do weird things to some decision theories.
99LDT x 1CDT oneshot PD tournament as arguable counterexample to LDT doing better than CDT
Arguendo, if 99 LDT agents and 1 CDT agent are facing off in a one-shot Prisoner’s Dilemma tournament, the CDT agent does better on a problem that CDT considers ‘fair’.
There are two boxes in front of you, Box A and Box B. You can take both boxes, or only Box B. Box A contains $1,000. Box B contains $1,000,000 if and only if Omega predicted you’d take only Box B.
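A quick expected-value calculation, treating Omega as accurate with probability p (the algebra is mine; the payoffs are the ones above):

```latex
\mathbb{E}[\text{one-box}] = p \cdot \$1{,}000{,}000,
\qquad
\mathbb{E}[\text{two-box}] = \$1{,}000 + (1-p)\cdot \$1{,}000{,}000
```

So one-boxing has the higher expectation whenever p > 0.5005, while causal decision theory recommends two-boxing regardless of p; that clash is what makes the problem famous.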
You and an accomplice have been arrested. Both of you must decide, in isolation, whether to testify against the other prisoner—which subtracts one year from your sentence, and adds two to theirs.
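Spelling out the payoffs implied by that description, with s as a placeholder for the baseline sentence (the symbol s is mine):

```latex
\text{your sentence} \;=\; s \;-\; \mathbf{1}[\text{you testify}] \;+\; 2\cdot\mathbf{1}[\text{they testify}]
```

Testifying always shortens your own sentence by a year, whatever the other prisoner does, so it is the dominant choice; yet if both testify, each serves s + 1 years instead of the s years both would serve by staying silent.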
A scenario that would reproduce the ideal payoff matrix of the Prisoner’s Dilemma about human beings who care about their public reputation and each other.
A road contains two identical intersections. An absent-minded driver wants to turn right at the second intersection. “With what probability should the driver turn right?” argue decision theorists.
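With the standard payoffs from Piccione and Rubinstein’s formulation (0 for turning at the first intersection, 4 for turning at the second, 1 for driving past both), a driver who continues with probability p at any intersection he encounters gets:

```latex
\mathbb{E}[\text{payoff}] \;=\; (1-p)\cdot 0 \;+\; p(1-p)\cdot 4 \;+\; p^2 \cdot 1 \;=\; 4p - 3p^2
```

This is maximized at p = 2/3 (turn with probability 1/3) for an expected payoff of 4/3; the dispute is over how the driver should recover this answer while actually sitting at an intersection, unable to tell which one it is.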
Death tells you that It is coming for you tomorrow. You can stay in Damascus or flee to Aleppo. Whichever decision you actually make is the wrong one. This gives some decision theories trouble.
'Rationality' of voting in elections
“A single vote is very unlikely to swing the election, so your vote is unlikely to have an effect” versus “Many people similar to you are making a similar decision about whether to vote.”
A parasitic infection, carried by cats, may make humans enjoy petting cats more. A kitten, now in front of you, isn’t infected. But if you want to pet it, you may already be infected. Do you?
Omega has left behind a transparent Box A containing $1,000, and a transparent Box B containing $1,000,000 or nothing. Box B is full iff Omega thinks you one-box on seeing a full Box B.
You are dying in the desert. A truck-driver who is very good at reading faces finds you, and offers to drive you into the city if you promise to pay $1,000 on arrival. You are a selfish rationalist.
A Proposer decides how to split $10 between themselves and the Responder. The Responder can take what is offered, or refuse, in which case both parties get nothing.
Decision theories that maximize their policies (mappings from sense inputs to actions), rather than using their sense inputs to update their beliefs and then selecting actions.
A problem is ‘fair’ (according to logical decision theory) when only the results matter and not how we get there.
On CDT, to choose rationally, you should imagine the world where your physical act changes, then imagine running that world forward in time. (Therefore, it’s irrational to vote in elections.)
Theories which hold that the principle of rational choice is “Choose the act that would be the best news, if somebody told you that you’d chosen that act.”
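One informal way to line up the three families (the notation is mine and glosses over the formal subtleties of each):

```latex
\text{EDT: } \arg\max_a \, \mathbb{E}[U \mid \text{I do } a]
\qquad
\text{CDT: } \arg\max_a \, \mathbb{E}[U \mid \operatorname{do}(a)]
\qquad
\text{LDT: } \arg\max_a \, \mathbb{E}[U \mid \text{my decision algorithm outputs } a]
```

EDT conditions on the news of the act, CDT intervenes on the physical act, and LDT counterfactualizes on the logical output of the decision algorithm itself.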
Modal combat