Suppose you wake up as a paperclip maximizer. Omega says "I calculated the millionth digit of pi, and it's odd. If it had been even, I would have made the universe capable of producing either 10^20 paperclips or 10^10 staples, and given control of it to a staples maximizer. But since it was odd, I made the universe capable of producing 10^10 paperclips or 10^20 staples, and gave you control." You double check Omega's pi computation and your internal calculator gives the same answer.
Then a staples maximizer comes to you and says, "You should give me control of the universe, because before you knew the millionth digit of pi, you would have wanted to pre-commit to a deal where each of us would give the other control of the universe, since that gives you 1/2 probability of 10^20 paperclips instead of 1/2 probability of 10^10 paperclips."
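The staples maximizer's arithmetic can be spelled out in a minimal sketch. It assumes a prior of 1/2 that the digit is odd (the state of knowledge before computing it), and the function and constant names are illustrative, not part of the original problem statement.

```python
from fractions import Fraction

P_ODD = Fraction(1, 2)        # assumed prior that the digit is odd, before computing it
BIG, SMALL = 10**20, 10**10   # payoffs taken from Omega's description

def expected_paperclips(swap_control: bool) -> Fraction:
    """Ex ante expected paperclips, from the paperclip maximizer's point of view.

    Digit odd:  the paperclip maximizer controls a universe good for SMALL paperclips.
    Digit even: the staples maximizer controls a universe good for BIG paperclips.
    """
    if swap_control:
        # Each controller hands the universe to the other, so paperclips are
        # only produced in the even world, where BIG of them are possible.
        return (1 - P_ODD) * BIG
    # No deal: paperclips are only produced in the odd world.
    return P_ODD * SMALL

deal = expected_paperclips(swap_control=True)
no_deal = expected_paperclips(swap_control=False)
print(deal / no_deal)  # how many times better the deal looks before the digit is known
```

This only captures the view from ignorance of the digit. Whether that view should still bind an agent that already knows the digit is exactly what the post asks.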
Is the staples maximizer right? If so, the general principle seems to be that we should act as if we had precommitted to a deal we would have made in ignorance of logical facts we actually possess. But how far are we supposed to push this? What deal would you have made if you didn't know that the first digit of pi was odd, or if you didn't know that 1+1=2?
On the other hand, suppose the staples maximizer is wrong. Does that mean you also shouldn't agree to exchange control of the universe before you knew the millionth digit of pi?
To make this more relevant to real life, consider two humans negotiating over the goal system of an AI they're jointly building. They have a lot of ignorance about the relevant logical facts, like how smart/powerful the AI will turn out to be and how efficient it will be in implementing each of their goals. They could negotiate a solution now in the form of a weighted average of their utility functions, but the weights they choose now will likely turn out to be "wrong" in full view of the relevant logical facts (e.g., the actual shape of the utility-possibility frontier). Or they could program their utility functions into the AI separately, and let the AI determine the weights later using some formal bargaining solution when it has more knowledge about the relevant logical facts. Which is the right thing to do? Or should they follow the staples maximizer's reasoning and bargain under the pretense that they know even less than they actually do?
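To make the contrast between fixing weights now and bargaining later concrete, here is a rough sketch. The frontier points are invented for illustration, and the Nash bargaining solution is used only as one example of a "formal bargaining solution"; the post does not say which solution it has in mind.

```python
# Hypothetical utility-possibility frontier: (utility to A, utility to B).
# These numbers are made up for illustration only.
frontier = [(10, 0), (9, 2), (6, 4), (0, 5)]

def fixed_weight_choice(points, w_a):
    """Pick the outcome maximizing a weighted average chosen in advance."""
    return max(points, key=lambda p: w_a * p[0] + (1 - w_a) * p[1])

def nash_bargaining_choice(points, disagreement=(0, 0)):
    """Pick the outcome maximizing the product of gains over the disagreement point."""
    d_a, d_b = disagreement
    feasible = [p for p in points if p[0] >= d_a and p[1] >= d_b]
    return max(feasible, key=lambda p: (p[0] - d_a) * (p[1] - d_b))

early = fixed_weight_choice(frontier, w_a=0.5)   # weights negotiated before seeing the frontier
late = nash_bargaining_choice(frontier)          # bargaining once the frontier is known
print(early, late)
```

On this made-up frontier, equal weights agreed in advance select a lopsided outcome, while bargaining with the frontier in view selects a more balanced one. That is one sense in which early weights can turn out "wrong".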
Other Related Posts: Counterfactual Mugging and Logical Uncertainty, If you don't know the name of the game, just tell me what I mean to you
"To give a practical down to earth example, ..."
Perhaps a more down to earth example would be value conflict within an individual. Without this problem with logical uncertainty, your conflicting selves should just merge into one agent with a weighted average of their utility functions. This problem suggests that maybe you should keep those conflicting selves around until you know more logical facts.
Right. But this is also the default safety option: you don't throw away information if you don't have a precise understanding of its irrelevance (given that it's not that costly to keep), and we didn't have such understanding.
Is Omega even necessary to this problem?
I would consider transferring control to staply if and only if I were sure that staply would make the same decision were our positions reversed (in this way it's reminiscent of the prisoner's dilemma). If I were so convinced, then shouldn't I consider staply's argument even in a situation without Omega?
If staply is in fact using the same decision algorithms I am, then he shouldn't even have to voice the offer. I should arrive at the conclusion that he should control the universe as soon as I find out that it can produce more staples than paperclips, whether it's a revelation from Omega or the result of cosmological research.
My intuition rebels at this conclusion, but I think it's being misled by heuristics. A human could not convince me of this proposal, but that's because I can't know we share decision algorithms (i.e. that s/he would definitely do the same in my place).
This looks to me like a prisoner's dilemma problem where expected utility depends on a logical uncertainty. I think I would cooperate with prisoners who have different utility functions as long as they share my decision theory.
(Disclaimers: I have read most of the relevant LW posts on these topics, but have never jumped into discussion on them and claim no expertise. I would appreciate corrections if I misunderstand anything.)
Perhaps I am missing something, but if my utility function is based on paper clips, how do I ever arrive at the conclusion that Staply should be in charge? I get no utility from it, unless my utility function has an even higher value on allowing entities with utility functions that create a larger output than mine to take precedence over my own utility on paper clips.
(I'll review some motivations for decision theories in the context of Counterfactual Mugging, leading to the answer.)
Precommitment in the past, where it's allowed, was a CDT-style solution to problems like this. You'd try making the most general possible precommitment as far in the past as possible that would respond to any possible future observations. This had two severe problems: it's not always possible to be far enough in the past to make precommitments that would coordinate all relevant future events, and you have to plan every possible detail of future events in advance.
TDT partially resolves such problems by implementing coordinated decisions among the instances of the agent within the agent's current worlds (permitted by observations so far) that share the same epistemic state (or its aspects relevant to the decision) and decide for all of themselves together, and so arrive at the same decision. (It makes sense for the decision to be a strategy that can then take into account additional information differentiating the instances of the agent.) This is enough for Newcomb's problem and (some versions of) Prisoner's Dilemma, but where coordination of agents in mutually exclusive counterfactuals is concerned, some of the tools break down.
Counterfactual Mugging both concerns agents located in mutually exclusive counterfactuals, and explicitly forbids the agent to be present in the past to make a precommitment, so TDT fails to apply. In this case, UDT (not relying on causal graphs) can define a common decision problem shared by the agents from different counterfactuals, if these agents can be first reduced to a shared epistemic state, so that all of them would arrive at the same decision (which takes the form of a strategy), which is then given each agent's particular additional knowledge that differentiates it from the other agents within the group that makes the coordinated decision.
In the most general case, where we attempt to coordinate among all UDT agents, these agents arrive, without using any knowledge other than what can be generated by pure inference (assumed common among these agents), at a single global strategy that specifies the moves of all agents (depending on each agent's particular knowledge and observations). However, when applied to a simple situation like Counterfactual Mugging, an agent only needs to purge itself of one bit of knowledge (identifying an agent) and select a simple coordinated strategy (for both agents) that takes that bit back as input to produce a concrete action.
So this takes us the whole circle, from deciding in a moment, to deciding (on a precommitment) in advance, and to deciding (on a coordinated strategy) in the present (of each instance). However, the condition for producing a coordinated strategy in the present is different from that for producing a precommitment in the past: all we need is shared state of knowledge among the to-be-coordinated agents, and not the state of knowledge they could've shared in the past, if they were to attempt a precommitment.
So for this problem, in coordinating with the other player (which let's assume abstractly exists, even if with measure 0), you can use your knowledge of the millionth digit of pi, since both players share it. And using this shared knowledge, the strategy you both arrive at would favor the world that's permitted by that value, in this case the paperclip world; the other world doesn't matter, contrary to what would be the case with a coin toss instead of the accessible abstract fact. And since the other player has nothing of value to offer, you take the whole pie.
Suppose you're currently running a decision theory that would "take the whole pie" in this situation. Now what if Omega first informed you of the setup without telling you what the millionth digit of pi is, and gave you a chance to self-modify? And suppose you don't have enough computing power to compute the digit yourself at this point. Doesn't it seem right to self-modify into someone who would give control of the universe to the staples maximizer, since that gives you 1/2 "logical" probability of 10^20 paperclips instead of 1/2 "logical" probability of 10^10 paperclips? What is wrong with this reasoning? And if it is wrong, both UDT1 and UDT2 are wrong since UDT1 would self-modify and UDT2 would give control to the staples maximizer without having to self-modify, so what's the right decision theory?
Do you mean that I won't have enough computing power also later, after the staple maximizer's proposal is stated, or that there isn't enough computing power just during the thought experiment? (In the latter case, I make the decision to think long enough to compute the digit of pi before making a decision.)
What does it mean to self-modify if no action is being performed, that is any decision regarding that action could be computed later without any preceding precommitments?
(One way in which a "self-modification" might be useful is when you won't have enough computational power in the future to waste what computational power you have currently, and so you must make decisions continuously that take away some options from the future (perhaps by changing instrumental priority rather than permanently arresting opportunity to reconsider) and thereby simplify the future decision-making at the cost of making it less optimal. Another is where you have to signal precommitment to other players that wouldn't be able to follow your more complicated future reasoning.)
You will have enough computing power later.
I mean suppose Omega gives you the option (now, when you don't have enough computing power to compute the millionth digit of pi) of replacing yourself with another AI that has a different decision theory, one that would later give control of the universe to the staples maximizer. Should you take this option? If not, what decision theory would refuse it? (Again, from your current perspective, taking the option gives you 1/2 "logical" probability of 10^20 paperclips instead of 1/2 "logical" probability of 10^10 paperclips. How do you justify refusing this?)
(continuing from here)
I've changed my mind back. The 10^20 are only on the table for the loser, and can be given by the winner. When the winner/loser status is unknown, a winner might cooperate, since it allows the possibility of being a loser and receiving the prize. But if the winner knows own status, it can't receive that prize, and the loser has no leverage. So there is nothing problematic about 10^20 becoming inaccessible: it is only potentially accessible to the loser, when the winner is weak (doesn't know own status), while an informed winner won't give it away, so that doesn't happen. Resolving logical uncertainty makes the winner stronger, makes the loser weaker, and so the prize for the loser becomes smaller.
Edit: Nope, I changed my mind back.
You've succeeded in convincing me that I'm confused about this problem, and don't know how to make decisions in problems like this.
There're two types of players in this game: those that win the logical lottery and those that lose (here, paperclip maximizer is a winner, and staple maximizer is a loser). A winner can either cooperate or defect against its loser opponent, with cooperation giving the winner 0 and loser 10^20, and defection giving the winner 10^10 and loser 0.
If a player doesn't know whether it's a loser or a winner, coordinating cooperation with its opponent has higher expected utility than coordinating defection, with mixed strategies presenting options for bargaining (the best coordinated strategy for a given player is to defect, with opponent cooperating). Thus, we have a full-fledged Prisoner's Dilemma.
On the other hand, obtaining information about your identity (loser or winner) transforms the problem into one where you seemingly have only the choice between 0 and 10^10 (if you're a winner), or always 0 with no ability to bargain for more (if you're a loser). Thus, it looks like knowledge of a fact turns a problem into one of lower expected utility, irrespective of what the fact turns out to be, and takes away the incentives that would've made a higher win (10^20) possible. This doesn't sound right, there should be a way of making the 10^20 accessible.
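The payoff structure described above can be tabulated in a short sketch. It assumes each player assigns "logical" probability 1/2 to being the winner before learning the digit; the names are illustrative.

```python
from fractions import Fraction

BIG, SMALL = 10**20, 10**10
HALF = Fraction(1, 2)  # assumed "logical" probability of turning out to be the winner

def winner_payoffs(winner_cooperates: bool) -> tuple:
    """(winner's payoff, loser's payoff) given the winner's move."""
    return (0, BIG) if winner_cooperates else (SMALL, 0)

def expected_utility(my_policy: bool, their_policy: bool) -> Fraction:
    """My expected utility when I don't yet know whether I'm the winner.

    A policy says what a player does *if* it turns out to be the winner.
    """
    as_winner = winner_payoffs(my_policy)[0]    # I win and act on my own policy
    as_loser = winner_payoffs(their_policy)[1]  # I lose and receive whatever they give
    return HALF * as_winner + HALF * as_loser

C, D = True, False
table = {(m, t): expected_utility(m, t) for m in (C, D) for t in (C, D)}

# Once I know I'm the winner, only my own move matters to me.
informed_winner_best = max(winner_payoffs(C)[0], winner_payoffs(D)[0])
print(table, informed_winner_best)
```

Under ignorance the four entries have the Prisoner's Dilemma ordering (defecting against a cooperator beats mutual cooperation, which beats mutual defection, which beats cooperating against a defector). After the digit is known, the winner's choice collapses to 0 versus 10^10.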
It's like an instance of the problem involves not two, but four agents that should coordinate: a possible winner/loser pair, and a corresponding impossible pair. The impossible pair has a bizarre property that they know themselves to be impossible, like self-defeating theories PA+NOT(Con(PA)) (except that we're talking about agent-provability and not provability), which doesn't make them unable to reason. These four agents could form a coordinated decision, where the coordinated decision problem is obtained by throwing away the knowledge that's not common between these four agents, in particular the digit of pi and winner/loser identity. After the decision is made, they plug back their particular information.
You've convinced me that I'm confused. I don't know what is the correct decision in this situation anymore, or how to think about such decisions.
If you cooperate in such situations, this makes the value of the outcome of such thought experiments higher, and that applies for all individual instances of the thought experiments as well. The problem has an ASP-ish feel to it: you're punished for taking too much information into account, even though from the point of view of having taken that information into account, your resulting decision seems correct.
Good, I'm in a similar state. :)
Yes, I noticed the similarity as well, except in the ASP case it seems clearer what the right thing to do is.
(Grandparent was my comment, deleted while I was trying to come up with a clearer statement of my confusion, before I saw the reply. The new version is here.)
So you would also keep the money in Counterfactual Mugging with a logical coin? I don't see how that can be right. About half of logical coins fall heads, so given a reasonable prior over Omegas, it makes more sense for the agent to always pay up, both in Counterfactual Mugging and in Wei's problem. But of course using a prior over Omegas is cheating...
Then you'd be coordinating with players of other CM setups, not just with your own counterfactual opponent, you'd be breaking out of your thought experiment, and that's against the rules! (Whatever "logical coin" is, the primary condition is for it to be shared among and accessible to all coordinating agents. If that's so, like here, then I keep the money, assuming the thought experiment doesn't leak control.)
:/ The whole point of thought experiments is that they leak control. ;P
"I seem to have found myself in a trolley problem! This is fantastically unlikely. I'm probably in some weird moral philosophy thought experiment and my actions are likely mostly going to be used as propaganda supporting the 'obvious' conclusions of one side or the other... oh and if I try to find a clever third option I'll probably make myself counterfactual in most contexts. Does the fact that I'm thinking these thoughts affect what contexts I'm in? /brainasplodes"
This is exactly what my downscale copy thinks the first 3-5 times I try to run any thought experiment. Often it's followed by "**, I'm going to die!"
I don't run thought experiments containing myself at any level of detail if I can avoid it any more.
I'm still not sure. You can look at it as cooperating with players of other CM setups, or as trying to solve the meta-question "what decision theory would be good at solving problems like this one?" Saying "50% of logical coins fall heads" seems to capture the intent of the problem class quite well, no?
The decision algorithm that takes the whole pie is good at solving problems like this one: for each specific pie it gets it whole. Making the same action is not good for solving the different problem of dividing all possible pies simultaneously, but then the difference is reflected in the problem statement, and so the reasons that make it decide correctly for individual problems won't make it decide incorrectly for the joint problem.
I think it's right to cooperate in this thought experiment only to the extent that we accept the impossibility of isolating this thought experiment from its other possible instances, but then it should just motivate restating the thought experiment so as to make its expected actual scope explicit.
Here's an argument I made in a chat with Wei. (The problem is equivalent to Counterfactual Mugging with a logical coin, so I talk about that instead.)
1) A good decision theory should always do what it would have precommitted to doing.
2) Precommitment can be modeled as a decision problem where an AI is asked to write a successor AI.
3) Imagine the AI is asked to write a program P that will be faced with Counterfactual Mugging with a logical coin (e.g. parity of the millionth digit of pi). The resulting utility goes to the AI. The AI writing P doesn't have enough resources to compute the coin's outcome, but P is allowed to use as much resources as needed.
4) Writing P is equivalent to supplying only one bit: should P pay up if asked?
5) Supplying that bit is equivalent to accepting or declining the bet "win $10000 if the millionth digit of pi is even, lose $100 if it's odd".
6) So if your AI can make bets about the digits of pi (which means it represents logical uncertainty as probabilities), it should also pay up in Counterfactual Mugging with a logical coin, even if it already has enough resources to compute the coin's outcome. The AI's initial state of logical uncertainty should be "frozen" into its utility function, just like all other kinds of uncertainty (the U in UDT means "updateless").
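Steps 4 and 5 above reduce to a one-line expected-value comparison. A minimal sketch, assuming the writing AI assigns logical probability 1/2 to the digit being even (the dollar amounts come from step 5; the names are illustrative):

```python
from fractions import Fraction

P_EVEN = Fraction(1, 2)    # assumed logical prior of the AI that writes P
REWARD, COST = 10_000, 100  # amounts from the bet in step 5

def policy_value(pay_if_asked: bool) -> Fraction:
    """Expected dollars, from the writer's viewpoint, of P's single policy bit."""
    if not pay_if_asked:
        return Fraction(0)  # never pays, and so never gets rewarded
    # Pays COST when the digit is odd, receives REWARD when it is even.
    return P_EVEN * REWARD - (1 - P_EVEN) * COST

print(policy_value(True), policy_value(False))
```

With this prior the "pay up" bit is worth more in expectation than the "refuse" bit, which is all the argument needs. It says nothing about whether 1/2 is the right way to represent logical uncertainty.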
Maybe this argument only shows that representing logical uncertainty as probabilities is weird. Everyone is welcome to try and figure out a better way :-)
It's dangerous to phrase it this way, since coordination (which is what really happens) allows using more knowledge than was available at the time of a possible precommitment, as I described here.
Not if the correct decision depends on an abstract fact that you can't access, but can reference. In that case, P should implement a strategy of acting depending on the value of that fact (computing and observing that value to feed to the strategy). That is, abstract facts that will only be accessible in the future play the same role as observations that will only be accessible in the future, and a strategy can be written conditionally on either.
The difference between abstract facts and observations however is that observations may tell you where you are, without telling you what exists and what doesn't (both counterfactuals exist and have equal value, you're in one of them), while abstract facts can tell you what exists and what doesn't (the other logical counterfactual doesn't exist and has zero value).
In general, the distinction is important. But, for this puzzle, the proposition "asked" is equivalent to the relevant "abstract fact". The agent is asked iff the millionth digit of pi is odd. So point (4) already provides as much of a conditional strategy as is possible.
It's assumed that the agent doesn't know if the digit is odd (and whether it'll be in the situation described in the post) at this point. The proposal to self-modify is a separate event that precedes the thought experiment.
Yes. Similarly, it doesn't know whether it will be asked (rather than do the asking) at this point.
I see, so there's indeed just one bit, and it should be "don't cooperate".
This is interesting in that UDT likes to ignore epistemic significance of observations, but here we have an observation that implies something about the world, and not just tells where the agent is. How does one reason about strategies if different branches of those strategies tell something about the value of the other branches?..
Good point, thanks. I think it kills my argument.
ETA: no, it doesn't.
As Tyrrell points out, it's not as simple. When you're considering the strategy of what to do if you're on the giving side of the counterfactual ("Should P pay up if asked?"), the fact that you're in that situation already implies all you wanted to know about the digit of pi, so the strategy is not to play conditionally on the digit of pi, but just to either pay up or not, one bit as you said. But the value of the decision on that branch of the strategy follows from the logical implications of being on that branch, which is something new for UDT!
In the old counterfactual mugging problem, agents who precommit are trading utilities across possible worlds, each world having a utility-gain called a prior that expresses how much the agent wants its utilities to lie in those worlds instead of silly ones. From that perspective, it's true that nothing in reality will be different as a result of the agent's decision, just because of determinism, but the agent is still deciding what reality (across all possible worlds) will look like, just like in Newcomb's problem.
So when I read in Nesov's post that "Direct prediction of your actions can't include the part where you observe that the digit is even, because the digit is odd", what I'm really seeing is someone saying, "I give zero weight to possible worlds in which math doesn't work sensibly, and tiny weights to worlds in which math does work, but my confusion or the conspiring of a malicious / improbable / senseless / invaluable universe cause me to think it does not."
One of the reasons why I think possible worlds of the first kind (different causal / programmatic histories but the same underlying ontology-stuff) are valuable / real, is that we sort of know how to calculate their properties using causal networks or timeless networks or whatever kind of networks you get when you combine the not-quite specified mathematical machinery in TDT with UDT. Our ability to calculate their properties reifies them, opens them up to interacting with this world even more directly via simulation.
The next step seems to be to ask, "for agents that do care about those impossible possible worlds, how would they act?" If Omega is choosing in a way that can be computed in our world, using our math (and somehow that other universe and our calculations don't explode when it gets to the contradiction (or it does! I suppose you can care about worlds where math explodes, even if I can't visualize them)), then we can simulate his reasoning in all respects save the identity of the logical fact in question, and use that to calculate which behaviour maximizes the utility across possible worlds via their dependence on our decision.
So in the example problem, if a valuer of contradictory worlds has roughly equal priors for both the world we're examining and the other world in which she finds herself where the digit was even (the impossible one, which isn't impossible for her, because it wasn't assigned zero prior weight), then sure, she can go ahead and give up control. That's of course assuming that she has an expectation the staple maximizer will reciprocate in the impossible world, which you didn't spell out in your post, but that dependence on decisions is standard for counterfactual mugging problems. Please correct me if that's not the intended setup.
As an aside, this comment feels silly and wrong; an example of diseased thoughts unconnected with reality. It reminds me a bit of Greg Egan's short story Dark Integers. I would really love to see a more sensible interpretation than this.
While I haven't given it much thought outside the context of fiction, one could adopt the point of view/vocabulary of this being "the level 5 Tegmark multiverse".
Now, if that is true in any sense, it's probably a much less literal one, and not based on the same reasoning as the other four, but it might still be a useful heuristic for humans.
Another interesting note: By default my brain seems to assume utility is linear with paperclips when considering, say, different Everett branches, but the logarithm of it when considering logical uncertainty. That's kinda odd and unjustified, but the intuition might have some point about humans' utility functions.
Sorry, I couldn't follow.
That's okay, there's no formalized theory behind it. But for the sake of conversation:
It seems you once agreed that multiple agents in the same epistemic state in different possible worlds can define strategies over their future observations in a way that looks like trading utilities: http://lesswrong.com/lw/102/indexical_uncertainty_and_the_axiom_of/sht
When I treat priors as a kind of utility, that's interpretation #4 from this Wei Dai post: http://lesswrong.com/lw/1iy/what_are_probabilities_anyway/
Really the only things that seem in any way novel here are the idea that the space of possible worlds might include worlds that work by different mathematical rules and that possibility is contingent on the agent's priors. I don't know how to characterize how math works in a different world, other than by saying explicitly what the outcome of a given computation will be. You can think of that as forcing the structural equation that would normally compute "1+1" to output "5", where the graph setup would somehow keep that logical fact from colliding with proofs that "3-1=2" (for worlds that don't explode) (which is what I thought Eliezer meant by creating a factored DAG of mathematics here). That's for a very limited case of illogical-calculation where our reasoning process produced results close enough to their analogues in the target world that we're even able to make some valid deductions. Maybe other worlds don't have a big book of platonic truths (ambiguity or instability) and cross-world utility calculations just don't work. In that case, I can't think of any sensible course of action.
I don't think this is totally worthless speculation, even if you don't agree that "a world with different math" makes sense, because an AI with faulty hardware / reasoning will still need to reason about mathematics that work differently from its mistaken inferences, and that probably requires a partial correspondence between how the agent reasons and how the world works, just like how the partial correspondence between worlds with different mathematical rules allows some limited deductions with cross-world or other-world validity.
That is a really, really weird dilemma to be in.
By the way, you can abbreviate paperclip/staple maximizer as clippy/staply (uncapitalized).
That seems to be a violation of standard English conventions. If I see people use 'clippy' or 'staply' uncapitalized I treat it the same as any other error in capitalization.
Do you capitalize 'human'?
No. I do capitalize names. 'Clippy' and 'staply' would both be unnatural terms for a species, were the two to be given slang species names.
If people use 'Clippy' or 'Staply' they are making a reference to a personified instance of the respective classes of maximiser agent.
I took the uncapitalized "staply" to be the name of a class, one individual in which might be named "Staply".
Exactly, good inference. You're a good human.
Kill Staply though.
I never suspected that you were Wei Dai, but five minutes is an awfully fast response time!
I'm not User:Wei_Dai. Although if I were, you would probably expect that I would say that.
It seems the important factor is how Omega makes its choice of which digit of pi (or other logical fact) to check. If Omega uses a quantum coin to select a number N between 1 and 84, and then checks the N-th digit of pi, then you should cooperate. If Omega searches for a roundish number N such that the N-th digit of pi is odd, then the answer appears to further depend on Omega's motivation. If Omega made the choice of "seeking for 'odd' roundish numbers" vs. "'even' roundish numbers" by tossing a quantum coin, then you should cooperate. Otherwise... etc.
By the way, I notice that the thought experiment, as phrased, doesn't quite require the knowledge of the digit of pi, if Omega indeed states that the situation is predicated on its oddness, it even states that it's odd, and there is no counterfactual version of yourself, so computation of the digit becomes a ritual of cognition without purpose. There is likely a restatement that retains everyone's interpretation, but in the current phrasing the role of the logical uncertainty in it seems to be lost.
Do I know that Staply would have decided as I decide, had he been given control of the universe and been told by Omega and his calculator that the millionth digit of pi is even?
I took it as being implied, yes. If Staply is an unknown algorithm, there's no point in trading.
Then it does seem that if Clippy had in fact been built to maximize paperclips by an agent with the right computational limitations, then Clippy would have been built to take the deal.
I'd say essentially the same thing about this problem as I said about the Counterfactual Mugging:
Just thinking out loud, no complete solution yet.
1) Maybe different approximations to logical omniscience work well on different problems. It seems natural to guess that approximations work better overall as they get closer to logical omniscience. We already have a tentative decision theory that works assuming logical omniscience. Can we invent a decision theory that works assuming complete logical ignorance in some sense?
2) We could look at the moment when you decide to make the most general precommitment or self-modification for the future, while still having some logical uncertainty. If you know the first digit of pi but not the millionth, you should precommit accordingly. The first such moment seems to correspond to the AI programmer's state of logical uncertainty, so if you're unsure about the millionth digit of pi, then it's okay for your AI to be unsure.
Difficulty level 2: replace "millionth digit of pi" with "0.5*10^20".