
I want to clarify some comments made in an earlier post of mine, which proposed a way to resolve the issues raised in the Action Counterfactuals subsection of the Embedded Agency paper. My ideal edits would have amounted to rewriting that post almost completely (it wasn't particularly clear), so instead I'm laying out the thinking more coherently here.

The Problem

The problem is that we'd like agents to reason about action counterfactuals and, perhaps, to find a proof that an action leads to the highest reward before taking it. The formulation in the paper runs into Löbian issues, because agents choose actions by searching for proofs of statements similar to:

argmax over actions X of W, searching for proofs of "This agent chooses X ⇒ the agent gets reward W"

[The paper actually has a much cleaner phrasing of the statements the agent searches for proofs of, which makes the proof issue much clearer.]

This is a rather substantial bummer, because something like that would be great to rely on. The proof search will find the proofs you want, but it will also find proofs of nonsense: conditionals that hold only because the agent never takes the action in question. The Embedded Agency paper is a great resource on why this happens, but it's worth working the argument out in your own head too.
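To make the shape of this agent concrete, here is a minimal Python sketch of a proof-searching argmax. It is schematic rather than the paper's exact construction: the provable helper is a stand-in for a bounded proof search over the agent's own source code and the environment, and is not implemented here.

```python
# Schematic sketch of a proof-searching argmax agent (not the paper's
# exact formulation). The proof search itself is stubbed out.

ACTIONS = ["take_5", "take_10"]
REWARDS = [0, 5, 10]

def provable(statement: str) -> bool:
    """Stand-in for a bounded proof search over the agent's own source
    code plus the environment; assumed here, not implemented."""
    raise NotImplementedError

def agent() -> str:
    # For each action X, look for proofs of
    #   "agent() == X  implies  utility() == W"
    # and remember the largest W provable for that X; take the best X.
    best_action, best_reward = ACTIONS[0], float("-inf")
    for action in ACTIONS:
        for reward in REWARDS:
            statement = f"agent() == {action!r} implies utility() == {reward}"
            if provable(statement) and reward > best_reward:
                best_action, best_reward = action, reward
    return best_action
```

The Löbian trouble is that provable can return True for conditionals whose antecedent the agent never makes true, and the argmax above will act on those just as readily as on the proofs you actually wanted.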

The Confusion

One potential intuition for why this problem exists is the following:

The agent taking this argmax always chooses the highest-reward option. If I know the rewards for every action except the last one, X, well, the agent choosing X means X was the output of the argmax. It's right there in the source code / function definition. So its reward W must have been the biggest! Even without computing W, I know that the agent choosing X means W is bigger than all the other W' values. It must be -- the choice came from an agent doing an argmax!

This might be a confusing phrasing. The ideal way to explain it might be to set up the 5-and-10 problem in a theorem prover and just show people that it finds the bad proof. But I think the narrative version captures something coherent.
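For anyone who wants the formal shape of the bad proof, here is a compressed sketch of the standard Löbian argument for the 5-and-10 problem, written from memory of the usual presentation rather than quoted from the paper. A() is the agent's action, U() the resulting utility, and □ means "provable in the agent's proof system".

```latex
% Spurious-proof sketch for the 5-and-10 problem.
% The sentence the agent's proof search is hunting for:
\[
  P \;:=\; \bigl(A() = 5 \Rightarrow U() = 5\bigr) \;\land\; \bigl(A() = 10 \Rightarrow U() = 0\bigr)
\]
\begin{enumerate}
  \item Suppose $\Box P$, i.e.\ $P$ is provable. Given a long enough proof search,
        the agent finds that proof, compares $5 > 0$, and outputs $5$.
  \item Then $A() = 5$, so $U() = 5$ (the agent gets the dollars it takes): the first
        conjunct of $P$ is true, and the second is vacuously true because $A() \neq 10$.
        Hence $P$ is true.
  \item So $\vdash \Box P \Rightarrow P$, and L\"ob's theorem gives $\vdash P$.
  \item The agent has now ``proven'' that taking the \$10 yields $0$, so it takes the \$5.
\end{enumerate}
```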

The thing that should confuse us is that this is not how we humans engage in action counterfactuals -- not even close. Normally that wouldn't matter, but let's pay closer attention and see whether the difference is substantive.

In the original post, I used the example of a professor considering whether to slash the tires on his car. He ponders what it would imply if he chose to do that. Plausibly, it would imply he desperately does not want the ability to travel. It might imply he has driven into a river and needs the air inside the tires to survive, like he's seen in movies. He's a sensible person -- his making the choice must mean there's a very good reason he did so! Much better than the standard "I'm just trying to get to work" reasons not to slash the tires.

But the professor, running his weaker, informal decision-making process, does not slash his tires. The lesson: you don't want to infer the rewards from the fact that you chose an action. The professor has no Löbian issues in his informal decision-making; he views the future differently: he asks what would happen if he were the same type of person, except having already made the choice -- excluding any consideration of why he might be making it -- and considers what happens to that imaginary person.

The Intuition

The intuition I was using was: why can't our agents do something similar? To evaluate an action, create a new agent that "teleports forward in time" by simply precommitting to that decision, and see what results. This matches our informal understanding of the phrase "action counterfactual" much better, I think, and because we're reasoning about a new agent (one with this single decision pre-set into it), it's possible the self-reference issues go away.
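Here is a minimal sketch of that evaluation loop, assuming a toy setting where agents are plain Python functions and the environment can be simulated directly; the names (precommit, environment) are illustrative, not from the original post or the paper.

```python
# Toy sketch: evaluate action counterfactuals by building precommitted
# copies of the agent and simulating the environment against them.
from typing import Callable

Action = str
Agent = Callable[[], Action]

def precommit(action: Action) -> Agent:
    """Return a new agent whose single decision is hard-coded to `action`."""
    return lambda: action

def environment(agent: Agent) -> float:
    """Toy 5-and-10 environment: the agent gets whatever it takes."""
    return {"take_5": 5.0, "take_10": 10.0}[agent()]

def counterfactual_argmax(actions: list[Action]) -> Action:
    # Instead of asking "what does my choosing X imply about the world?",
    # ask "what happens to a copy of me that has already committed to X?"
    return max(actions, key=lambda a: environment(precommit(a)))

print(counterfactual_argmax(["take_5", "take_10"]))  # -> "take_10"
```

Because the copies are separate objects with the decision baked in, nothing in this loop reasons about what "my choosing X" implies about me, which is where the Löbian trouble entered.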

Does it mechanically work to resolve the issue?

It does in the specific example given in the paper. I can understand a general skepticism about how it generalizes, though. Can we really do our reasoning with these slightly-modified agents? Doesn't that risk very bad behavior from other agents, who can see that the internal reasoning of our agent has changed? Sure, if there's no visibility into how we make a choice, acting from a precommitment and acting from deliberation are interchangeable. But what if other agents can see why we make decisions? Wouldn't they change their output if this part of our internal state changed?

Imagine an agent performing an argmax like the one above, pondering the fates of the counterfactually changed copies. It wouldn't cooperate in the prisoner's dilemma with (a naive encoding of) MIRI's "cooperate iff that's what gets the other agent to change from defect to cooperate". The counterfactual copies never change what they do, so the counterfactual argmax will always see MIRI's agent defect against them, and therefore defect itself.
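To see the failure concretely, here is a toy model in which agents are response functions rather than source-code-reading proof searchers. That is a large simplification of both the MIRI-style agent and the counterfactual argmax, so treat it as illustrative only; the payoff table and function names are mine, not from the paper or the original post.

```python
# Toy model: a MIRI-style agent that cooperates iff its cooperation is what
# flips the opponent from defect to cooperate, playing against an agent that
# evaluates counterfactuals via precommitted copies of itself.
from typing import Callable

C, D = "cooperate", "defect"
Responder = Callable[[str], str]  # opponent's move as a function of my move

# Standard prisoner's dilemma payoffs for the row player.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

def miri_style_agent(opponent: Responder) -> str:
    """Cooperate iff cooperating is what changes the opponent from D to C."""
    pivotal = opponent(C) == C and opponent(D) == D
    return C if pivotal else D

def precommit(action: str) -> Responder:
    """A counterfactual copy whose move is fixed regardless of the opponent."""
    return lambda their_move: action

def counterfactual_argmax() -> str:
    # Score each action by sending a precommitted copy to play against the
    # MIRI-style agent and seeing what happens to it.
    def score(my_action: str) -> int:
        their_action = miri_style_agent(precommit(my_action))
        return PAYOFF[(my_action, their_action)]
    return max([C, D], key=score)

# The precommitted copies never respond to anything, so the MIRI-style agent
# is never pivotal, defects against both copies, and the argmax defects too.
print(counterfactual_argmax())  # -> "defect"
```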

Note: this is actually very sensitive to the decision theory used by MIRI's agent. It should be easy enough, if you look, to find a proof that cooperating with the precommitted copy is the reason that copy gets used against it, and therefore the reason the agent it sees at the base level is cooperating. But we would have to make extremely specific claims about the structure of the MIRI agent to be sure, because it's not always true that the decision theory it uses would lead it to cooperate with another, related agent.

This is quite a bummer!

The Additional Assumption Needed for This Solution

This class of problem -- which in the original post I described as a "failure to reason", in that our agent does have the trait MIRI's agent is looking for, but MIRI's agent doesn't find it -- goes away if you add one additional constraint: agents ought not to distinguish precommitments from "live" decision-making when they look at your source code. Remember, when they run your agent's code, it might look live, but it isn't; the setup already doesn't let agents distinguish being simulated from being asked the genuine question. So this additional constraint seems plausible, if not totally ideal.

But why might this be useful?

Well, imagine MIRI's agent scaled up to cooperate more generally. You've now given everyone a strong incentive to never precommit to cooperating with you. That is, plausibly, not the behavior of a maximally cooperative agent. And you've only been able to do this because you allow yourself to peer into the decision-making process of counterparty agents. In reality, if you had that sort of ability, it seems plausible that precommitments are precisely the kind of thing you'd want to incentivize, because now you can check that those precommitments exist.

You can choose to defect against CooperateBot, but that means no one will ever send CooperateBot to play the prisoner's dilemma with you, and I think we can all agree that's definitely the nicest bot to play with.

Yes, the MIRI proposal wasn't meant to solve all agent cooperation dynamics once and for all; it was merely an example of a plausibly very good way to cooperate. But before we reject another system for being unable to cooperate with it, we should ask whether the proposal itself handles all possible cases in an ideal fashion. I submit that the sensibilities here can be improved.

In this particular instance, an analogous agent is: "if the other player cooperates with an Always Cooperator, become that cooperator; otherwise defect." That agent cooperates with itself (and with many other general agents) and defects against players it seems clear you ought to defect against.
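In the same toy response-function style as the sketch above (again an illustrative simplification, with my own names, rather than the formal setting), that agent might look like this:

```python
# Toy sketch of the analogous agent: "if they cooperate with an Always
# Cooperator, become that cooperator; otherwise defect." Agents here are
# functions from their opponent to an action, sidestepping the
# source-code / proof-search machinery entirely.
from typing import Callable

C, D = "cooperate", "defect"
Agent = Callable[["Agent"], str]  # an agent maps its opponent to a move

def cooperate_bot(opponent: "Agent") -> str:
    return C

def defect_bot(opponent: "Agent") -> str:
    return D

def nice_reciprocator(opponent: "Agent") -> str:
    # Would this opponent cooperate with an Always Cooperator?
    # If so, become that cooperator; otherwise defect.
    return C if opponent(cooperate_bot) == C else D

print(nice_reciprocator(nice_reciprocator))  # -> "cooperate": cooperates with itself
print(nice_reciprocator(cooperate_bot))      # -> "cooperate"
print(nice_reciprocator(defect_bot))         # -> "defect"
```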

Do I think this resolves the issue in the paper?

I do, but given it requires an additional assumption, it'd be best to say I consider it one possible resolution to the issue described in the paper. Without any additional structure to the problem, there really doesn't appear to be any solution, so I imagine all possible resolutions will have a similar "add a constraint, design a process to use that constraint" shape. Hopefully this one is useful.

And since I've given multi-agent examples, I should clarify that I think this approach is a viable option for all of Section 2 (Decision Theory) of the Embedded Agency paper -- albeit, if you actually want to interact with agents like Omega, it's plausible that alien intelligences in the most general case wouldn't have to follow assumptions about not distinguishing precommitments from live choices. Perhaps there is value in refusing this constraint on agents; it certainly seems quite restrictive to me. I'd appreciate any scenarios that could help refine this idea. While the original post is slightly old, I still consider this idea larval -- still gathering the resources to grow into something beautiful -- and I'd appreciate your help.
