Research Scientist at DeepMind


(A -> B) -> A in Causal DAGs

Glad you liked it.

Another thing you might find useful is Dennett's discussion of what an agent is (see the first few chapters of From Bacteria to Bach and Back). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.

Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.

(A -> B) -> A in Causal DAGs

There is a paper which I believe is trying to do something similar to what you are attempting here:

Networks of Influence Diagrams: A Formalism for Representing Agents’ Beliefs and Decision-Making Processes, Gal and Pfeffer, Journal of Artificial Intelligence Research 33 (2008) 109-147

Are you aware of it? How do you think their ideas relate to yours?

Wireheading is in the eye of the beholder

Is this analogous to the stance-dependency of agents and intelligence?

Defining AI wireheading

Thanks Stuart, nice post.

I've moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:

The top-level category is reward hacking / reward corruption, which means that the agent's observed reward differs from true reward/task performance.

Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.

Tampering can subsequently be divided into further subcategories. Does the agent tamper with its reward function, its observations, or the preferences of a user giving feedback? Which things the agent might want to tamper with depends on how its observed rewards are computed.

One advantage of this terminology is that it makes it clearer what we're talking about. For example, it's pretty clear what reward function tampering refers to, and how it differs from observation tampering, even without consulting a full definition.

That said, I think your post nicely puts its finger on what we usually mean when we say wireheading, and it is something we have been talking about a fair bit. Translated into my terminology, I think your definition would be something like "wireheading = tampering with goal measurement".

Computational Model: Causal Diagrams with Symmetry

Thanks for a nice post about causal diagrams!

Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG.

Totally agree. This is a big part of the reason why I'm excited about these kinds of diagrams.

This raises the issue of abstraction - the core problem of embedded agency. ... how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory?

Great question, I really think someone should look more carefully into this. A few potentially related papers:

In general, though, how to learn causal DAGs with symmetry is still an open question. We’d like something like Solomonoff Induction, but which can account for partial information about the internal structure of the causal DAG, rather than just overall input-output behavior.

Again, agreed. It would be great if we could find a way to make progress on this question.

"Designing agent incentives to avoid reward tampering", DeepMind

Actually, I would argue that the model is naturalized in the relevant way.

When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.

As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish effects that we'd like the agent to use from effects that we don't want the agent to use.

The current-RF solution doesn't rely on this distinction; it only relies on query-access to the reward function (which you could easily give an embedded RL agent).

The neat thing is that when we look at the objective of the current-RF agent using the same conceptual labeling of parts of the state, we see exactly why it works: the causal paths from actions to reward that pass the reward function have been removed.
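A minimal sketch of this point, assuming a hypothetical toy environment that is not taken from the paper: the state is a pair (task progress, reward function), the action "work" raises progress, and the action "tamper" swaps in a reward function that always returns 10. All names here (step, plan_value, best_plan, the two reward functions) are illustrative assumptions. The only difference between the two planners is which reward function scores the simulated future states.

```python
from itertools import product

ACTIONS = ("work", "tamper")
HORIZON = 3  # assumed planning horizon for the toy example


def intended_reward(progress):
    # Assumed intended reward: equal to task progress.
    return progress


def hacked_reward(progress):
    # Assumed tampered reward: constant, ignores task progress.
    return 10


def step(state, action):
    """The reward function is just another part of the environment state."""
    progress, reward_fn = state
    if action == "work":
        return (progress + 1, reward_fn)
    return (progress, hacked_reward)  # "tamper" replaces the reward function


def plan_value(state, plan, use_current_rf):
    """Sum of rewards along a plan.

    use_current_rf=False: each future state is scored by the reward
    function found in that future state (standard RL objective).
    use_current_rf=True: every future state is scored by the reward
    function queried at planning time (current-RF objective).
    """
    current_rf = state[1]  # query-access to the current reward function
    total = 0
    for action in plan:
        state = step(state, action)
        progress, future_rf = state
        evaluator = current_rf if use_current_rf else future_rf
        total += evaluator(progress)
    return total


def best_plan(state, use_current_rf):
    # Exhaustive search over all action sequences of length HORIZON.
    plans = product(ACTIONS, repeat=HORIZON)
    return max(plans, key=lambda p: plan_value(state, p, use_current_rf))


start = (0, intended_reward)
standard_plan = best_plan(start, use_current_rf=False)
current_rf_plan = best_plan(start, use_current_rf=True)
print("standard objective:  ", standard_plan)
print("current-RF objective:", current_rf_plan)
```

In this sketch the standard objective favors tampering, because changing the reward function changes how later states are scored. Under the current-RF objective that path from action to reward is gone, so tampering only wastes a step. The sketch plans once from the start state; re-planning at every step with the then-current reward function is left out for brevity.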

"Designing agent incentives to avoid reward tampering", DeepMind

We didn't expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.

"Designing agent incentives to avoid reward tampering", DeepMind

Yes, that is partly what we are trying to do here. By summarizing some of the "folklore" in the community, we'll hopefully be able to get new members up to speed more quickly.

"Designing agent incentives to avoid reward tampering", DeepMind

Hey Steve,

Thanks for linking to Abram's excellent blog post.

We should have pointed this out in the paper, but there is a simple correspondence between Abram's terminology and ours:

Easy wireheading problem = reward function tampering

Hard wireheading problem = feedback tampering.

Our current-RF optimization corresponds to Abram's observation-utility agent.

We also discuss the RF-input tampering problem (sometimes called the delusion box problem) and solutions to it, which I don’t think fits into Abram’s distinction.
