This tends to assume that we can detangle things enough to see outcomes as a function of our actions.

No. The assumption is that an agent has *agency* over some degrees of freedom of the environment. It's not even an assumption, really; it's part of the definition of an agent. What is an agent with no agency?

If the agent's actions have no influence on the state of the environment, then it can't drive the state of the environment to satisfy any objective. The whole point of building an internal model of the environment is to understand how the agent's actions influence the environment. In other words: "detangling things enough to see outcomes as functions of [the agent's] actions" isn't just an assumption; it's essential.
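To put it concretely, here's a toy sketch (my own illustration; every name and number in it is made up) of what "outcomes as a function of actions" means in practice: a model maps actions to predicted outcomes, and the agent picks the action whose predicted outcome best satisfies its objective.

```python
# Toy sketch: an internal model maps actions to predicted outcomes, and the
# agent selects the action whose predicted outcome best satisfies its objective.
# Names and numbers are illustrative only.

def model(state, action):
    # hypothetical world model: the action nudges a scalar state
    return state + action

def objective(outcome):
    # hypothetical objective: get the state as close to 10 as possible
    return -abs(10 - outcome)

def act(state, actions=(-1, 0, +1)):
    return max(actions, key=lambda a: objective(model(state, a)))

print(act(state=7))  # -> 1: the action whose predicted outcome scores best
```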

The only point I can see in writing the quoted sentence would be if you were claiming that a function isn't, in general, enough to describe the relationship between an agent's actions and the outcome: that you generally need some higher-level construct like a Turing machine. That would be fair enough, except that the theory you're comparing yours to is AIXI, which explicitly models the relationship between actions and outcomes via Turing machines.
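For reference, here is a heavily simplified sketch of what that looks like: prior-weighted planning over a class of candidate environment programs. The real AIXI sums over all computable environments with a complexity prior, updates on observations, and is uncomputable; this toy version uses two hand-picked "environments", ignores observations, and plans over fixed action sequences, so treat it as schematic only.

```python
# Schematic, toy version of AIXI-style planning: score action sequences by
# their prior-weighted expected reward across a class of candidate environment
# programs, then take the first action of the best sequence.
# (Real AIXI uses all computable environments, a complexity prior, and Bayesian
# updating on observations; none of that is modeled here.)
from itertools import product

def env_a(history, action):          # candidate environment program #1
    return 1.0 if action == "left" else 0.0

def env_b(history, action):          # candidate environment program #2
    return 1.0 if action == "right" else 0.0

HYPOTHESES = [(0.75, env_a), (0.25, env_b)]   # (prior weight, environment)
ACTIONS = ("left", "right")

def sequence_value(seq):
    # prior-weighted expected total reward of an action sequence
    total = 0.0
    for weight, env in HYPOTHESES:
        history, ret = [], 0.0
        for a in seq:
            ret += env(history, a)
            history.append(a)
        total += weight * ret
    return total

def choose(horizon=3):
    best_seq = max(product(ACTIONS, repeat=horizon), key=sequence_value)
    return best_seq[0]

print(choose())  # -> "left" under this particular prior
```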

AIXI represents the agent and the environment as separate units which interact over time through clearly defined I/O channels so that it can then choose actions maximizing reward.

Do you propose a model in which the relationship between the agent and the environment is undefined?

When the agent model is part of the environment model, it can be significantly less clear how to consider taking alternative actions.

Really? It seems you're applying magical thinking to the consequences of embedding one Turing machine within another. Why would its I/O or internal modeling change so drastically? If I use a virtual machine to run Windows within Linux, does that make the experience of using MS Paint fundamentally different from running Windows in a native boot?
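Here's the point in miniature (a trivial sketch of my own, nothing to do with actual VM internals): the same agent step function, with the same observation-in/action-out signature, behaves identically whether you call it "natively" or from inside an outer simulation loop.

```python
# Sketch: the same agent step function run "natively" and run embedded inside
# an outer simulator. Its I/O signature (observation in, action out) is
# identical in both cases.

def agent_step(observation):
    return "press" if observation > 0 else "wait"

def run_native(observations):
    return [agent_step(o) for o in observations]

def run_embedded(observations):
    # an outer "machine" that happens to contain the agent as one component
    world = {"log": []}
    for o in observations:
        action = agent_step(o)        # same call, same interface
        world["log"].append((o, action))
    return [a for _, a in world["log"]]

obs = [1, -2, 3]
assert run_native(obs) == run_embedded(obs)
print("same behavior either way")
```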

...there can be other copies of the agent, or things very similar to the agent.
Depending on how you draw the boundary around "yourself", you might think you control the action of both copies or only your own.

How is that unclear? If the agent doesn't actually control the copies, then there's no reason to imagine it does. If it's trying to figure out how best to exercise its agency to satisfy its objective, then imagining it has any more agency than it actually does is silly. You don't need to wander into the philosophical no-man's-land of defining the "self". It's irrelevant. What are your degrees of freedom? How can you use them to satisfy your objective? At some point, the I/O channels *must be* well defined. It's not like a processor has an ambiguous number of pins. It's not like a human has an ambiguous number of motor neurons.

For all intents and purposes: the agent IS the degrees of freedom it controls. The agent can only change its own state, which, being a subset of the environment's state, changes the environment in some way. You can't lift a box; you can only change the position of your arms. If that results in a box being lifted, good! Or maybe you can't even change the position of those arms; you can only change the electric potential on some motor neurons. If that results in the arms moving, good! Play that game long enough and, at some point, the set of actions you can take is finite and clearly defined.
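A toy sketch of that framing (entirely made-up names, just to make the point concrete): the agent's interface is a finite set of primitive commands, and "lifting a box" is an outcome of issuing them, not an action in itself.

```python
# Sketch: the agent's interface is a finite set of primitive commands.
# "Lift the box" is not a primitive; it's an outcome of primitive actions.
from enum import Enum

class Motor(Enum):
    SHOULDER_UP = 0
    SHOULDER_DOWN = 1
    NO_OP = 2

def apply(state, command):
    # the environment (including the agent's own body) responds to the command
    arm = state.get("arm_height", 0)
    if command is Motor.SHOULDER_UP:
        arm += 1
    elif command is Motor.SHOULDER_DOWN:
        arm -= 1
    return {"arm_height": arm, "box_lifted": arm >= 3}

state = {"arm_height": 0}
for _ in range(3):
    state = apply(state, Motor.SHOULDER_UP)
print(state["box_lifted"])  # True: the box moved because the arm did
```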

Your five-or-ten problem is one of many that demonstrate the brittleness of logic-based systems operating in the real world. This is well known. People have all but abandoned logic-based systems in favor of stochastic systems when dealing with real-world problems, specifically because it's effectively impossible to make a robust logic-based system.

This is the crux of a lot of your discussion. When you talk about an agent "knowing" its own actions or the "correctness" of counterfactuals, you're talking about definitive results which a real-world agent would never have access to.
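To make the contrast concrete, here's a toy sketch (my own construction, not anything from the post): a sample-based value estimator facing the five-or-ten choice never proves anything about its own source code; it just estimates the payoffs from noisy experience and takes the $10.

```python
# Sketch: a sample-based value estimator on the five-or-ten choice.
# No self-referential proofs, just noisy payoff estimates.
import random

random.seed(0)
true_payoff = {"five": 5.0, "ten": 10.0}

def observe(action):
    # real-world observations are noisy, never "definitive"
    return true_payoff[action] + random.gauss(0, 0.5)

estimate = {a: 0.0 for a in true_payoff}
count = {a: 0 for a in true_payoff}

for _ in range(200):                       # gather experience
    a = random.choice(list(true_payoff))
    count[a] += 1
    estimate[a] += (observe(a) - estimate[a]) / count[a]

best = max(estimate, key=estimate.get)
print(best, round(estimate[best], 2))      # -> ten, roughly 10
```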

It's possible (though unlikely) for a cosmic ray to damage your circuits, in which case you could go right -- but you would then be insane.

If a rare, spontaneous occurrence causes you to go right, you must be insane? What? Is that really the only conclusion you could draw from that situation? If I take a photo and a cosmic ray causes one of the pixels to register white, do I need to throw my camera out because it might be "insane"?!

Maybe we can force exploration actions so that we learn what happens when we do things?

First of all, who is "we" in this case? Are we the agent or are we some outside system "forcing" the agent to explore?

Ideally, nobody would have to force the agent to explore its world. It would want to explore and experiment as an instrumental goal to lower uncertainty in its model of the world so that it can better pursue its objective.
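One standard way that falls out with no external "forcing" at all is an uncertainty bonus on the value estimates: actions the model is unsure about look more valuable until they've been tried. A minimal UCB-style sketch (illustrative only, invented numbers):

```python
# Sketch: exploration emerging from an uncertainty bonus (UCB-style),
# not from an outside system forcing exploratory actions.
import math
import random

random.seed(1)
true_mean = {"a": 0.3, "b": 0.7}           # hidden payoff probabilities
count = {k: 0 for k in true_mean}
mean = {k: 0.0 for k in true_mean}

def score(action, t):
    if count[action] == 0:
        return float("inf")                # untried actions look maximally uncertain
    bonus = math.sqrt(2 * math.log(t) / count[action])
    return mean[action] + bonus            # value estimate + uncertainty bonus

for t in range(1, 501):
    a = max(true_mean, key=lambda k: score(k, t))
    reward = 1.0 if random.random() < true_mean[a] else 0.0
    count[a] += 1
    mean[a] += (reward - mean[a]) / count[a]

print(count)   # both actions get tried; the better one dominates over time
```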

A bad prior can think that exploring is dangerous

That's not a bad prior. Exploring *is* fundamentally dangerous. You're encountering the unknown. I'm not even sure if the risk/reward ratio of exploring is decidable. It's certainly a hard problem to determine when it's better to explore, and when it's too dangerous. Millions of the most sophisticated biological neural networks the planet Earth has to offer have grappled with the question for hundreds of years with no clear answer.
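Even the back-of-the-envelope version of the trade-off shows how sensitive it is to the assumed numbers (which are invented here purely for illustration):

```python
# Toy expected-value comparison for "explore or not", with an invented
# probability of catastrophe. Small changes in the assumptions flip the answer,
# which is the point: there is no universal rule for when exploring is worth it.
def value_of_exploring(p_catastrophe, gain_if_ok, loss_if_catastrophe):
    return (1 - p_catastrophe) * gain_if_ok - p_catastrophe * loss_if_catastrophe

print(value_of_exploring(0.01, gain_if_ok=5, loss_if_catastrophe=100))  # +3.95: worth it
print(value_of_exploring(0.10, gain_if_ok=5, loss_if_catastrophe=100))  # -5.5: too risky
```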

Forcing it to take exploratory actions doesn't teach it what the world would look like if it took those actions deliberately.

What? Again, *who* is doing the "forcing" in this situation, and how? Do you really want to tread into the other philosophical no-man's-land of free will? Why would the question of whether the agent really wanted to take an action have any bearing whatsoever on the result of that action? I'm so confused about what this sentence even means.

EDIT: It's also unclear to me what the point of the discussion on counterfactuals is. Counterfactuals are of dubious utility even for short-term evaluation of outcomes, and they become less useful the further you separate the action from the result in time. I could think, "Damn! I should have taken an alternate route to work this morning!" which is arguably useful and may actually be wrong. But if I think, "Damn, if Eric the Red hadn't sailed to the new world, Hitler would never have risen to power!" that's not only extremely questionable, but also: what use would that pondering be even if it were correct?

It seems like you're saying an embedded agent can't enumerate the possible outcomes of its actions before taking them, so it can only do so in retrospect. In which case, why can't an embedded agent perform a pre-emptive tree search like any other agent? What's the point of counterfactuals?
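Concretely, I mean something as plain as a depth-limited search over the agent's own model of outcomes, e.g. this toy sketch (illustrative model and rewards, not anyone's actual proposal):

```python
# Sketch: pre-emptive, depth-limited search over a model of outcomes.
# The agent enumerates action sequences in its model *before* acting.
ACTIONS = (-1, +1)

def model(state, action):
    return state + action                  # toy transition model

def reward(state):
    return -abs(4 - state)                 # prefer states near 4

def search(state, depth):
    if depth == 0:
        return reward(state), None
    best_value, best_action = float("-inf"), None
    for a in ACTIONS:
        value, _ = search(model(state, a), depth - 1)
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

print(search(state=0, depth=4))            # -> (0, 1): head toward 4
```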

dxu:

This statement is precisely what is being challenged--and for good reason: it's untrue. The reason it's untrue is because the concept of "I/O channels" does not exist within physics as we know it; the true laws of physics make no reference to inputs, outputs, or indeed any kind of agents at all. In reality, that which is considered a computer's "I/O channels" are simply arrangements of matter and energy, the same as everything else in our universe. There are no special XML tags attached to those configurations of matter and energy, marking them "input", "output", "processor", etc. Such a notion is unphysical.

Why might this distinction be important? It's important because an algorithm that is implemented on physically existing hardware can be physically disrupted. Any notion of agency which fails to account for this possibility--such as, for example, AIXI, which supposes that the only interaction it has with the rest of the universe is by exchanging bits of information via the input/output channels--will fail to consider the possibility that its own operation may be disrupted. A physical implementation of AIXI would have no regard for the safety of its hardware, since it has no means of representing the fact that the destruction of its hardware equates to its own destruction.

AIXI also fails on various decision problems that involve leaking information via a physical side channel that it doesn't consider part of its output; for example, it has no regard for the thermal emissions [https://www.lesswrong.com/posts/8Hzw9AmXHjDfZzPjo/failures-of-an-embodied-aixi] it may produce as a side effect of its computations. In the extreme case, AIXI is incapable of conceptualizing the possibility that an adversarial agent may be able to inspect its hardware, and hence "read its mind". This reflects a broader failure on AIXI's part: it is incapable of representing an entire class of hypotheses--namely, hypotheses that involve AIXI itself being modeled by other agents in the env...
Abe Dillon:

Yes. They most certainly do. The only truly consistent interpretation I know of current physics is information theoretic anyway, but I'm not interested in debating any of that. The fact is I'm communicating to you with physical I/O channels right now, so I/O channels certainly exist in the real world.

Agents are emergent phenomena. They don't exist on the level of particles and waves. The concept is an abstraction. An I/O channel doesn't imply modern computer technology. It just means information is collected from or imprinted upon the environment. It could be ant pheromones, it could be smoke signals, its physical implementation is secondary to the abstract concept of sending and receiving information of some kind. You're not seeing the forest through the trees. Information most certainly does exist.

I've explained in previous posts that AIXI is a special case of AIXI_lt. AIXI_lt can be conceived of in an embedded context, in which case its model of the world would include a model of itself, which is subject to any sort of environmental disturbance.

To some extent, an agent must trust its own operation to be correct, because you quickly run into infinite regression if the agent is modeling all the possible ways it could be malfunctioning. What if the malfunction affects the way it models the possible ways it could malfunction? It should model all the ways a malfunction could disrupt how it models all the ways it could malfunction, right? It's like saying, "Well, the agent could malfunction, so it should be aware that it can malfunction so that it never malfunctions." If the thing malfunctions, it malfunctions; it's as simple as that.

Aside from that, AIXI is meant to be a purely mathematical formalization, not a physical implementation. It's an abstraction by design. It's meant to be used as a mathematical tool for understanding intelligence. Do you consider how the 30 Watts leaking out of your head might affect your plans every day? I mean, it might cause a...
The concept is an abstraction.

Yes, it is. The fact that it is an abstraction is precisely why it breaks down under certain circumstances.

An I/O channel doesn't imply modern computer technology. It just means information is collected from or imprinted upon the environment. It could be ant pheromones, it could be smoke signals, its physical implementation is secondary to the abstract concept of sending and receiving information of some kind. You're not seeing the forest through the trees. Information most certainly does exist.

The claim is not that...
