I think this is the right way to research decision theory!
This is basically a rehash of my comment on your previous post, but I think you are confused in a very particular way that I am not. You are confusing "optimizing with the assumption Agent=X" with "optimizing without that assumption". In other words, optimizing for a decision problem where Omega always samples Agent=X, versus optimizing for your actual described decision problem where Omega samples X randomly.
For example, you describe the first case as one "where no one tries to predict this agent in particular". But actually, if we assume Agent=X, this is by definition "Omega predicting Agent perfectly" (even if, from the third-person perspective of the Programmer, this happened randomly, that is, it seemed unlikely a priori, and was very surprising). You also describe the second case as "direct logical entanglement of agent's behaviour with something that influences agent's utility". But actually, if you are assuming that Omega samples X randomly, then X isn't entangled (correlated) with Agent in any way, by definition.
Here's another way to highlight your confusion. Say Programmer is thinking about what Agent to implement. Then "cooperating maximizes the Programmer's utility, assuming Omega samples X = Agent" and "cooperating maximizes the Programmer's expected utility, given that Omega samples X randomly" are two different (even if similar-sounding) statements. The former is true, while the latter is false.
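To make the contrast concrete, here is a minimal sketch in Python. The payoff numbers (0, 1, 2, 3), the size N of Omega's set, and the behaviour assumed for an unrelated X are illustrative assumptions of mine, not from the post; any numbers respecting the Prisoner's Dilemma ordering give the same verdict.

```python
# Illustrative payoffs to the Programmer (assumed): only the Prisoner's
# Dilemma ordering DC > CC > DD > CD, seen from the program-player's side, matters.
PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

N = 1000  # stand-in for the (huge) number of programs in Omega's allowed set

def expected_utility(agent_action, condition_on_match):
    """Programmer's expected utility if Agent plays `agent_action`.

    If `condition_on_match` is True we assume Omega sampled X = Agent,
    so X runs the same code on "YES" and copies Agent's action.
    Otherwise X is random: with probability 1/N it is Agent, and
    otherwise we model it as cooperating half the time (any fixed
    behaviour independent of Agent gives the same conclusion, since
    defection dominates pointwise).
    """
    if condition_on_match:
        return PAYOFF[(agent_action, agent_action)]
    p_match = 1 / N
    eu_match = PAYOFF[(agent_action, agent_action)]
    eu_other = 0.5 * PAYOFF[(agent_action, "C")] + 0.5 * PAYOFF[(agent_action, "D")]
    return p_match * eu_match + (1 - p_match) * eu_other

# Former statement: conditional on X = Agent, cooperating is better.
assert expected_utility("C", True) > expected_utility("D", True)
# Latter statement: with X sampled randomly, cooperating is worse.
assert expected_utility("C", False) < expected_utility("D", False)
```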
So the only question is: Does Agent, the general reasoner created by Programmer, want to maximize its utility according to the former (updated) or the latter (non-updated) probability distribution?
Or in other words: Does Agent, even upon updating on its observation, care about what happens in other counterfactual branches (different from the one that turned out to be the case)?
It can seem jarring at first that there is not a single clean solution that allows all versions of a reasoner (updated and non-updated) to optimize together and get absolutely everything they want. But it's really not surprising when you look at it: optimizing across many branches at once (because you still don't know which one will materialize) will lead to different behavior than optimizing for one branch alone.
So then, what should a general-reasoning Agent actually do in this situation? What would I do if I suddenly learned I had been living in this situation? Which probability distribution should we optimize for?
This choice is just totally up for grabs! We'll have to query our normative intuitions about which one better fits the spirit of what we want to call "maximization". Again, there is no deep secret to be uncovered here: maximizing under a random distribution will recommend different actions than maximizing under a conditioned distribution, which will in turn recommend different actions than maximizing under a crazy distribution where unicorns exist.
I myself find it intuitive to be updateful here. That just looks more like maximization from my current epistemic state, and heck, it was the Programmer who, from their updateless perspective, could have just made me into a dumb Defect-rock, and it would have worked out better for them! But I think this is just a subjective take based on vibes; other decision theorists might have the opposite take, and it might be irreconcilable.
Depends on the complexity of the logical coin. Certainly not for 1+1=2. But probably yes for appropriately complex statements. This is due to strong immediate identification with "my immediately past self who didn't yet know the truth value", and an understanding that "he (my past self) cannot literally rewrite my brain at will to ensure this behavior holds, but it's understood that I will play along to some extent to satisfy his vision (otherwise he would have to invest more in binding my behavior, which sounds like a waste)".
(Of course, I would need some kind of proof that the statement has been chosen non-adversarially, and I'm not yet sure that is possible.)
I think the universal precommitment / UDT solution is right, and don't quite understand what's weird about it.
This is a later, better version of the problem in this post. The problem emerged from my work on the "Deconfusing Commitment Races" project under the Supervised Program for Alignment Research (SPAR), led by James Faville. I'm grateful to SPAR for providing the intellectual environment, and to James Faville personally for discussions and help with the draft of this post. Any mistakes are my own.
I used Claude and Gemini to help me with phrasing and grammar in some parts of this post.
There once lived an alien named Omega who enjoyed giving the Programmer decision theory problems. The answer to each one had to be a program-player that would play the game presented in the problem. Based on the results of the game, the Programmer would receive some amount of utility.
Omega had incredibly large, but still not infinite, computational power, so he only accepted programs from a fixed allowed set: the set of all programs written in a certain fixed programming language that contain no more than some fixed number of commands. If a program doesn't halt after a fixed number of steps, Omega stops it and uses an empty output.
After approximately three million problems, the Programmer got tired and wrote code for a universal consequentialist Agent that optimizes the Programmer's utility. Now, when Omega gives the Programmer a problem, the Programmer just inserts the statement of the problem into a string constant in this Agent and sends it.
This is Omega's newest problem:
Omega randomly selects a program X from the allowed set.
The program-player receives as input "YES" or "NO": an honest answer to the question "does the source code of X equal the source code of the program-player?"
X independently receives "YES" as input regardless of anything. Then the program-player and X play a version of the Prisoner's Dilemma (a code sketch of the setup follows the payoff list below):
- If the program-player outputs "COOPERATE" and X does not, the Programmer receives nothing.
- If neither the program-player nor X outputs "COOPERATE", the Programmer receives one unit of utility.
- If both the program-player and X output "COOPERATE", the Programmer receives a larger number of units of utility.
- Finally, if the program-player does not output "COOPERATE" but X does, the Programmer receives the largest number of units of utility.
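Here is a minimal Python sketch of the setup. It simplifies in several ways: programs are modelled as Python callables rather than strings in Omega's fixed language, the length and step limits are omitted, object identity stands in for source-code equality, and the payoff numbers are placeholders of mine that only respect the Prisoner's Dilemma ordering.

```python
import random

# Illustrative payoffs to the Programmer (assumed); only the ordering
# DC > CC > DD > CD, from the program-player's side, matters.
PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

def play_round(program_player, allowed_set):
    """One round of Omega's newest problem (simplified sketch)."""
    x = random.choice(allowed_set)                 # Omega samples X uniformly
    honest = "YES" if x is program_player else "NO"  # `is` stands in for code equality
    a = program_player(honest)                     # honest answer to the program-player
    b = x("YES")                                   # X receives "YES" regardless of anything
    return PAYOFF[(a, b)]

# Example programs ("C" = output "COOPERATE", "D" = anything else):
def defect_rock(_inp):
    return "D"

def naive_cooperator(inp):
    # cooperates exactly when told it is facing its own copy
    return "C" if inp == "YES" else "D"

# e.g. play_round(naive_cooperator, [defect_rock, naive_cooperator])
```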
The Programmer, as usual, inserted the problem statement into the Agent and sent it as the program-player.
The Agent received the input "YES". It seems Omega was incredibly lucky! What should it do: output "COOPERATE" or not?
You can consider this section as one big footnote. It fixes some exploitable issues in the problem, but doesn't change its main idea.
We can worry that the Agent would have extreme anthropic uncertainty: not only "I might be X" (which was intended), but also "I might be the part of X that will choose the opposite action" and "If I'm X, I don't actually know what the problem statement is, because my knowledge of it comes from my own source code, and X's source code is random, so now I have to check my prior over possible problem statements". These issues were not intended, so let's make the following corrections:
I discovered that different decision procedures handle the described situation differently.
Input "YES" is very strong evidence that Agent is actually and not program-player (because always receives input "YES" and the program-player receives input "YES" with very small probability). Output "COOPERATE" gives utility if Agent is and utility if Agent is program-player, and Agent should think it's probably . So Agent should output "COOPERATE".
Agent should consider itself a function that definitely determines the behaviour of X and maybe also the behaviour of the program-player (by the same logic as in the anthropic approach). Agent wants X to output "COOPERATE". It also wants the program-player to output something else, but that is less important (the utility difference is smaller), and it probably doesn't control the program-player's behaviour anyway. So Agent should output "COOPERATE".
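A sketch of this reasoning under my illustrative payoffs: whether or not the Agent-as-function also controls the program-player (assumed to defect otherwise), choosing "COOPERATE" comes out ahead.

```python
# Same illustrative payoffs as before.
PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

def utility(agent_choice, controls_player):
    x = agent_choice                                    # the Agent's choice fixes X's output
    player = agent_choice if controls_player else "D"   # otherwise assume the player defects
    return PAYOFF[(player, x)]

for controls_player in (True, False):
    # Cooperating is better whether or not the Agent also controls the player.
    assert utility("C", controls_player) > utility("D", controls_player)
```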
Agent should behave as the best program in its place would. The Programmer would want to send a program that never outputs "COOPERATE". (Call it the always-defecting program. It receives the maximum possible utility against any possible X, including a copy of itself. Another program can receive more utility against its own copy than the always-defecting program receives against its own copy, but that doesn't help it against any other X. Also, the always-defecting program receives strictly more utility in the case when X equals this other program.) So Agent should not output "COOPERATE".
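A sketch of the dominance claim behind this. Since X always hears "YES" and never sees the program-player's code, X's output does not depend on which program the Programmer sends, so it is enough to compare payoffs pointwise over X's possible outputs (payoff numbers again illustrative).

```python
# Same illustrative payoffs as before.
PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

# Defecting is never worse than cooperating, for either output of X...
for x_action in ("C", "D"):
    assert PAYOFF[("D", x_action)] >= PAYOFF[("C", x_action)]

# ...and strictly better when X cooperates (e.g. when X is a copy of a
# cooperating Agent, since X always hears "YES").
assert PAYOFF[("D", "C")] > PAYOFF[("C", "C")]
```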
Agent is logically entangled with all programs which implement the same decision procedure. Many such programs are in the allowed set. The Programmer's ex-ante expected utility is higher if Agent's decision procedure makes all these programs output "COOPERATE". So Agent should output "COOPERATE".
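A sketch of the ex-ante comparison. The fraction F of allowed programs sharing the Agent's decision procedure, the assumption that unrelated programs defect, the assumption that the procedure defects on input "NO", and the payoff numbers are all mine; the negligible case where X is literally the Agent is omitted.

```python
# Same illustrative payoffs as before.
PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}
F = 0.001  # assumed fraction of allowed programs sharing the Agent's procedure

def ex_ante_utility(procedure_output_on_yes):
    # X shares the procedure: X (which always hears "YES") plays the procedure's
    # choice, while the Agent, as program-player, almost surely got "NO" and defects.
    same = PAYOFF[("D", procedure_output_on_yes)]
    # X is unrelated: modelled as defecting; the Agent got "NO" and defects.
    other = PAYOFF[("D", "D")]
    return F * same + (1 - F) * other

# The procedure that outputs "COOPERATE" on "YES" is ex-ante better for the Programmer.
assert ex_ante_utility("C") > ex_ante_utility("D")
```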
Notice that all three approaches that advise outputting "COOPERATE" would nonetheless agree that it would have been better for the Programmer to send an always-defecting program instead of the Agent.
So we have two options, and both are in some sense weird:
Also, the disagreement between UDT and CDT with anthropic uncertainty seemingly contradicts this post. Probably there is no real contradiction and this situation just doesn't satisfy some required assumptions, but maybe it deserves future research.