New Answer

New Comment

1 Answers sorted by
top scoring

Oct 30, 2019

I am somewhat somewhat new to this site and to the design problem of counter-factual reasoning in embedded agents. So I can say something about whether I think your approach goes in the right direction, but I can't really answer your question if this has all been done before.

Based on my own reading and recent work so far, if your goal is to create a working "what happens if I do a" reasoning system for an embedded agent, my intuition is that your approach is necessary, and that it is going in the right direction. I will unpack this statement at length further below.

I also get the impression that you are wondering whether doing the proposed work will move forward the discussion about Newcomb's problem. Not sure if this is really a part of your question, but in any case, I am not in a position to give you any good insights on what is needed to move forward the Newcomb discussion.

Here are my longer thoughts on your approach to building counterfactual reasoning. I need to start by setting the scene. Say we have an embedded agent containing a very advanced predictive world model, a model that includes all details about the agent's current internal computational structure. Say this agent wants to compute the next action that will best maximize its utility function. The agent can perform this computation in two ways.

1. One way to pick the correct action is to initiate a simulation that runs its entire world model forward, and to observe the copy of itself inside the simulation. The action that its simulated copy picks is the action that it wants. There is an obvious problem with using this approach. If the simulated copy takes the same approach, and the agent it simulates in turn does likewise, and so on, we get infinite recursion and the predictive computation will never end. So at one point in the chain, the (real or n-level simulated) agent has to use a second way to compute the action that maximizes utility.

2. The second way is the argmax what-if method. Say the agent wants to choose between 10 possible actions. It can do this by running 10 simulations that compute 'the utility of what will happen if I take action a[i]'. Now, the agent knows full well that it will end up choosing only one of these 10 actions in its real future. So only one of these 10 simulations is a simulation of a future that actually contains the agent itself. It is therefore completely inappropriate for the agent to include any statement like 'this future will contain myself' inside at least 9 of the 10 world models that it supplies to its simulator. It has no choice but to 'edit out' any world model information that could cause the simulation engine to make this inference, or else 9 of the 10 simulations will have the 5-and-10 problem, and their results cannot be trusted anymore as valid inputs to its argmax calculation.

The above reasoning shows that the agent cannot avoid ending up in a place where it has to edit some details about itself out of the simulator input. (Exception: we could assume that the simulation engine somehow does the needed editing, just after getting the input inside its of the computation, but this amounts to basically the same thing. I interpret some proposals about updating FDT that I have read in the links as proposals to use such a special simulation engine.)

If the agent must do some editing on the simulator input, we can worry that the agent will do 'too much' editing, or the wrong type of editing. I have a theory that this worry is at least partly responsible for keeping the discussions around FDT and counterfactuals so lively.

The obvious worry is that, if the agent edits itself out of the simulation runs, it will no longer be able to reason about itself. In a philosophical sense, can we still say that such an agent possesses something like consciousness, or a robust theory of self? If not, does this mean that such an editing agent can never get to a state of super-intelligence? For me, these philosophical questions are really questions about the meaning of words, so I do not worry about AGI safety too much if these questions remain unresolved. However, editing also implies a more practical worry: how will editing impact the observable behaviour of the agent? If the agent edits itself out of the simulations, does this inevitably mean that it will lose some observable properties that we would expect to find in a highly intelligent agent, properties like showing an emergent drive to protect its own utility function?

When we look at the world models of many existing agents, e.g. the AlphaZero chess playing agent, we can interpret them as being 'pre-edited': these models do not contain any direct representation of the agent's utility function or the hardware its runs on. Therefore, the simulation engine in the agent runs no risk of blowing up, no risk of creating a 5-and-10 contradiction when it combines available pieces of world knowledge. Some of the world knowledge needed to create the contradiction never made it into the world model. However, this type of construction also makes the a agent indifferent to protecting its utility function. I cover this topic in more detail in section 5.2 of my paper here. In section 5.3 I introduce an agent that uses a predictive world model that is 'less edited': this model makes the utility function of the agent show up inside the simulation runs. The pleasing result is that the agent then gets an emergent drive to preserve its utility function.

A similar agent design, that that turns the editing knob towards greater embeddedness, is in the paper Self-modification of policy and utility function in rational agents. Both these papers get pretty heavy on the math, when they start proving agent properties. This may be a price we have to pay for turning the knob. I am currently investigating how to turn the knob even closer to embeddedness. There some hints that the math might get simpler again in that case, but we'll see.

10 comments, sorted by

top scoring

Click to highlight new comments since: Today at 1:31 PM

[-]cousin_it6y60

The problem of logical counterfactuals is exactly about what it means to "replace all instances" of a subprogram in a larger program, when the subprogram can occur in multiple places, in modified but logically equivalent forms, etc.

[-]reavowed6y30

I agree that that is a problem that both this approach to counterfactuals and the FDT logical-counterfactual approach share. The particular problem I was hoping this approach avoids was the one of assuming mutually-exclusive logical facts, such that all-but-one of them must necessarily be false, and the implications this has for the agent's consistency and reasoning about its actions. Are you saying that they are the same problem, that the second problem is comparatively not worth solving, or something else?

[-]cousin_it6y40

I'm saying they are the same problem. These sentences are logically equivalent:

"A() = 1 provably implies B() = 2"

"There's a program C provably equivalent to B such that, if we replace all exact instances of A inside C with 1, the resulting program will return 2"

[-]reavowed6y10

I think my core issue with the above is the nature of the specification of the problem of "replacing $A$ in $B$ ". Allowing passing to an arbitrary equivalent program $C$ before replacing exact instances of $A$ seems overly permissive, and to allow in exactly the kind of principle-of-explosion issue that logical counterfactuals have. Suppose for instance that $A$ and $B$ both halt with a defined value, say $A () = a$ . As framed above, I can take $C$ to be a program that computes $B^{'} () + x (A () - a)$ (for some $x$ ), where $B^{'}$ is the result of substituting all exact instances of $A$ in $B$ with any program returning $a$ that is not identically $A$ . Then $C$ is (provably) equivalent to $B$ , but if $a \neq 1$ I can ensure that the substituted version of $C$ returns any value I like just by varying $x$ . This doesn't say anything interesting about the appearance of $A$ in $B$ - in fact, the replacement of $B$ with $B^{'}$ is exactly to ensure that the main substitution is a no-op.

I feel like a statement that more accurately corresponds to the substitution I'm trying to do above would be something like "There's a program C that can be obtained from B by replacing subprograms that are provably equivalent to A with the constant program returning 1, such that C() = 2". For one thing, this appears to have the useful property that if $B$ makes no mention of $A$ or anything equivalent to it, then I can't stop such a $C$ from being equivalent (in fact, identical) to $B$ no matter what the value of $A ()$ is. Does this still have the same problem?

[-]cousin_it6y30

Let's say the world is a grid of Game of Life cells evolving in discrete time. What counts as a "subprogram"?

Let's further say that world contains embedded computers that run programs. Will your notion of "subprogram", specified purely in terms of Game of Life cells, be strong enough to capture subprograms of these programs?

[-]reavowed6y10

Again, I agree that the problem of identifying what logical structures (whereever they occur) count as implementing a particular function is a deep and interesting one, and not one that I am claiming to have solved. But again, I do not agree that it is a problem I have introduced? An FDT agent correctly inferring the downstream causal results of setting $F D T (P - -, G - -) = a$ would, in general, have to identify $F D T (P - -, G - -)$ being computed inside a Game of Life simulation, if and where such a calculation so occured.

While I am indeed interested in exploring the answers to your questions, I don't see that they represent a specific challenge to the idea that the above notion of counterfactuals might be worth exploring further, in the way that your original claim would.

[-]Shmi6y20

The issue of so-called logical counterfactuals has been discussed here and on the alignment forum quite a few times, including a bunch of posts by Chris_Leong, and at least one by yours truly. Consider browsing through them before embarking on original research:

https://www.google.com/search?q=logical+counterfactuals+site:lesswrong.com

[-]reavowed6y*10

I have indeed read many of those posts already (though I appreciate some reference to them in the original post would have been sensible, I apologise). Chris_Leong's Deconfusing Logical Counterfactuals comes pretty close to this - the counterfactual model I'm interested in corresponds to their notion of "Raw Counterfactual", but AFAICT they're going in a somewhat different direction with the notion of "erasure" (I don't think it should be necessary to forget that you've seen a full box in the transparent variant of Newcomb's problem, if you explicitly consider that you be in Omega's simulation), so they haven't followed the notion through to a full description of what a raw-counterfactual-based decision theory would look like. I can't find any further discussion of the idea, and Chris_Leong now seems to be walking back on it (that's part of the reason I'm asking the question).

[-]Shmi6y10

I suspect the real underlying issue is that of free will: all decision theories assume we can make different decisions in EXACT SAME circumstances, whereas from what we understand about the physical world, there is no such thing, and the only non-dualist proposal on the table is that of Scott Aaronson's freebits. I have written a related post last year. We certainly do have a very realistic illusion of free will, to the degree where any argument to the contrary tends to be rejected, ignored, strawmanned or misinterpreted. If you read through the philosophical writings on compatibilism, people keep talking past each other all the time, never getting to the crux of their disagreement. Not that it (or anything else) matters in the universe where there is no freedom of choice, anyway.

[-]TAG6y*-30

whereas from what we understand about the physical world, there is no such thing, and the only non-dualist proposal on the table is that of Scott Aaronson’s freebits.

That contains two falsehoods.

physics has not settled the issue of determinism versus determinism. See https://philpapers.org/rec/EARDWW
Scott Aaronson is not the first person in history to propose a nondualustic theory of free will.

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

3

[ Question ]

Decisions with Non-Logical Counterfactuals: request for input

3

3

1 Answers sorted by
top scoring

Oct 30, 2019

3

[ Question ]

Decisions with Non-Logical Counterfactuals: request for input

3

3

1 Answers sorted by top scoring

Oct 30, 2019

1 Answers sorted by
top scoring