Discussion with Eliezer Yudkowsky on AGI interventions

Steve Omohundro says:

"1) Nobody powerful wants to create unsafe AI but they do want to take advantage of AI capabilities.
2) None of the concrete well-specified valuable AI capabilities require unsafe behavior"

I think a lot of powerful people / organizations do want to take advantage of possibly unsafe AI capabilities, such as ones that would allow them to be the emperors of the universe for all time. Especially if not doing so means that their rivals have a higher chance of becoming the emperors of the universe.

"Should Blackmail Be Legal" Hanson/Zvi Debate (Sun July 26th, 3pm PDT)

Hi Ben. Are you still planning to post a transcript sometime?

Why isn't the following decision theory optimal?

Thanks. I had one question about your Toward Idealized Decision Theory paper.

I can't say I fully understand UDT, but the 'updateless' part does seem very similar to the "act as if you had precommitted to any action that you'd have wanted to precommit to" core idea of NDT. It's not clear to me that the super powerful UDT would make the wrong decision in the game where two players pick numbers between 0-10 and get payouts based on their pick and the total sum.

Wouldn't the UDT reason as follows? "If my algorithm were such that I wouldn't just pick 1 when the human player forced me into it by picking 9 (for instance maybe I always pick 5 in this game), then I may still have a reputation as a powerful predictor but it's much more likely that I'd also have a reputation as an entity that can't be bullied like this, so the human would be less likely to pick 9. That state of the world is better for me, so I shouldn't be the type of agent that makes the greedy choice to pick 1 when I predict the human will pick 9."

The argument in your paper seems to rely on the human assuming the UDT will reason like a CDT once it knows the human will pick 9.
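To make the game concrete, here is a minimal sketch. The exact payoff rule isn't spelled out above, so the rule below (each player receives their own pick in dollars, unless the two picks sum to more than 10, in which case both get nothing) is an assumption chosen to match the 9-vs-1 dynamic described.

```python
# Toy model of the number-picking game. The payoff rule is an assumption:
# the comment only says payouts depend on each pick and the total sum.

def payoff(mine: int, theirs: int) -> int:
    """My payoff given both picks, under the assumed rule."""
    return mine if mine + theirs <= 10 else 0

def greedy_reply(theirs: int) -> int:
    """The CDT-style best response to a known opposing pick."""
    return max(range(11), key=lambda mine: payoff(mine, theirs))

# If the human commits to 9, the greedy best reply is 1:
assert greedy_reply(9) == 1

# An agent with a fixed "always pick 5" policy gets nothing against a 9.
# The point above is that such a reputation may stop the human from
# picking 9 in the first place.
assert payoff(5, 9) == 0
```
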

Why isn't the following decision theory optimal?

An AI should certainly cooperate if it discovered that by chance its opposing AI had identical source code.

I read your paper and the two posts in your short sequence. Thanks for the links. I still think it's very unlikely that one of the AIs in your original hypothetical (when they don't examine each other's source code) would do better by defecting.

I accept that if an opposing AI had a model of you that was just decent but not great, then there is some amount of logical connection there. What I haven't seen is any argument about the shape of the graph of logical connection strength vs. similarity of entities. I hypothesize that for any two humans who exist today, if you put them in a one-shot PD, the logical connection is negligible.

Has anyone written specifically on how exactly to give weights to logical connections between similar but non-identical entities?
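The "identical source code" case at the top of this comment is the degenerate end of that graph, and it can be sketched directly. The agent name and the string stand-ins for source code below are illustrative, not from the thread.

```python
# Sketch of the "cooperate on identical source code" policy. Agents are
# represented here by (stand-in strings for) their own source text.

def clique_bot(my_source: str, opponent_source: str) -> str:
    """Cooperate ('C') iff the opponent is an exact copy of me.

    With identical source, both agents run the same computation, so my
    cooperating logically guarantees the opponent cooperates too; this
    is the strongest possible logical connection.
    """
    return "C" if opponent_source == my_source else "D"

SRC = "clique_bot_source"  # stand-in for the agent's full source text

# Two exact copies land on mutual cooperation:
assert clique_bot(SRC, SRC) == "C"

# Any difference in source, however small, drops the connection to zero
# under this crude policy, which is exactly why the question above asks
# how the connection should degrade with similarity rather than vanish.
assert clique_bot(SRC, "clique_bot_source_v2") == "D"
```
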

Why isn't the following decision theory optimal?

I think defect is the right answer in your AI problem and therefore that NDT gets it right, but I'm aware lots of LWers think otherwise. I haven't researched this enough to want to argue it, but is there a discussion you'd recommend I read that spells out the reasoning? Otherwise I'll just look through LW posts on prisoner's dilemmas.

Secondly, I'd like to try to somehow incorporate logical effects into NDT. I agree they're important. Any suggestions for where I could find lots of examples of decision problems where logical effects matter, to help me think about the general case?

Why isn't the following decision theory optimal?

In the retro blackmail, CDT does not precommit to refusing even if it's given the opportunity to do so before the researcher gets its source code.

To clarify: you mean that CDT doesn't precommit at time t=1 even if the researcher hasn't gotten the code representing CDT's state at time t=0 yet. The CDT doesn't think precommitting will help because it knows the code the researcher will get will be from before its precommitment. I agree that this is true, and a CDT won't want to precommit.

I guess my definition, even after my clarification, is ambiguous: it's not clear whether what a CDT agent wishes it could have precommitted to at an earlier time should take precedence over what it would wish to precommit to at a later time. NDT seems to work best when you always prefer the earliest precommitment. The intuition is something like:

You should always make the decision that a CDT-agent would have wished he had precommitted to, if he had magically had the opportunity to costlessly precommit to a decision at a time before the beginning of the universe.

This would allow you to act as if you had precommitted to things before you existed.
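As an illustration of how that pre-existence standpoint changes answers, consider a Newcomb-style toy problem (not a scenario from this thread; the payoffs are the standard ones): a perfect predictor fills an opaque box with $1,000,000 only if it predicts the agent takes just that box, while a transparent box always holds $1,000.

```python
# Newcomb-style toy with the standard payoffs: the predictor fills the
# opaque box based on the agent's *policy*, so evaluating whole policies
# from before the prediction differs from evaluating actions afterward.

def outcome(policy: str) -> int:
    """Payoff for a policy that the predictor forecasts perfectly."""
    opaque = 1_000_000 if policy == "one-box" else 0
    transparent = 1_000
    if policy == "one-box":
        return opaque            # take only the opaque box
    return opaque + transparent  # take both boxes

# Chosen as a precommitment made before the prediction (the standpoint
# the definition above recommends), one-boxing wins:
assert outcome("one-box") == 1_000_000
assert outcome("two-box") == 1_000
assert max(["one-box", "two-box"], key=outcome) == "one-box"
```
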

But two agents written in spacelike separation from each other might have decision algorithms that are logically correlated, despite there being no causal connection no matter how far back you go.

Can you give an example of this? Similar to the calculator example in the TDT paper, I'm imagining some scenario where one AI takes instructions for creating you to another galaxy, and another AI keeps a copy of the instructions for creating you on Earth. At some point, both AIs read the instructions and create identical beings, one of which is you. The AI that created you says that you'll be playing a prisoner's dilemma game with the other entity created in the same way, and asks for your decision.

In some sense, there is only a logical connection between these two entities, because they've only existed for a short time and are too far away to have a causal effect on each other. However, they are very causally related, and I could probably make an argument that they are replicas of the same person.

Do you have an example of a logical connection that has no causal connection at all (or as minimal a causal connection as possible)?

Why isn't the following decision theory optimal?

For example, a decision algorithm based on precommitment is unable to hold selfish preferences (valuing a cookie for me more than a cookie for a copy of me) in anthropic situations

I disagree that it makes sense to talk about one of the future copies of you being "you" whereas the other isn't. They're both you to the same degree (if they're exact copies).

Why isn't the following decision theory optimal?

Eliezer talked about this in his TDT paper. It is possible to hypothesize scenarios where agents get punished or rewarded for arbitrary reasons. For instance an AI could punish agents who made decisions based on the idea of their choices determining the results of abstract computations (as in TDT). This wouldn't show that TDT is a bad decision theory or even that it's no better than any other theory.

If we restrict ourselves to action-determined and decision-determined problems (see Eliezer's TDT paper) we can say that TDT is better than CDT, because it gets everything right that CDT gets right, plus it gets right some things that CDT gets wrong.

Can you think of any way that a situation could be set up that punishes an NDT agent, that doesn't reduce to an AI just not liking NDT agents and arbitrarily trying to hurt them?

Why isn't the following decision theory optimal?

I think my definition of NDT above was worded badly. The problematic part is "if he had previously known he'd be in his currently situation." Consider this definition:

You should always make the decision that a CDT-agent would have wished he had precommitted to, if he previously considered the possibility of his current situation and had the opportunity to costlessly precommit to a decision.

The key is that the NDT agent isn't behaving as if he knew for sure that he'd end up blackmailed when he made his precommitment (since his precommitment affects the probability of his being blackmailed), but rather he's acting "as if" he precommitted to some behavior based on reasonable estimates of the likelihood of his being kidnapped in various cases.
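That trade-off can be put into a toy expected-value calculation. All probabilities and payoffs below are invented for illustration.

```python
# Toy numbers showing why the ex ante precommitment can win even though
# refusing blackmail is worse in the moment. All values are made up.

def ev(p_blackmail: float, u_if_blackmailed: float) -> float:
    """Expected utility of a disposition: blackmail occurs with
    probability p_blackmail (which the disposition itself influences)
    and costs u_if_blackmailed; otherwise utility is 0."""
    return p_blackmail * u_if_blackmailed

# Disposition "always pay": blackmailers expect profit, so blackmail is
# likely, and each incident costs the ransom.
ev_pay = ev(p_blackmail=0.9, u_if_blackmailed=-100.0)

# Disposition "always refuse": worse in the moment (-150, say), but a
# predicted refuser is rarely targeted.
ev_refuse = ev(p_blackmail=0.05, u_if_blackmailed=-150.0)

# From the standpoint of the precommitment described above, the agent
# prefers to have precommitted to refusing:
assert ev_refuse > ev_pay
```
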

Why isn't the following decision theory optimal?

I believe that NDT gets this problem right.

The paper you link to shows that a pure CDT agent would not self modify into an NDT agent, because a CDT agent wouldn't really have the concept of "logical" connections between agents. The understanding that both logical and causal connections are real things is what would compel an agent to self-modify to NDT.

However, if there were some path by which an agent started out as pure CDT and then became NDT, the NDT agent would still choose correctly on Retro Blackmail even if the researcher had its original CDT source code. The NDT agent's decision procedure explicitly tells it to behave as if it had precommitted before the researcher got its source code.

So even if the CDT --> NDT transition is impossible, since I don't think any of us here are pure CDT agents, we can still adopt NDT and profit.
