Research Scientist at DeepMind
Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there.
(Sorry btw for slow reply; I keep missing alignmentforum notifications.)
Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)
We agree that a lack of control incentive on X does not mean that X is safe from influence by the agent, as the agent may influence X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.
What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive on X. For this reason, we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction from control as a side effect.
Glad you liked it.
Another thing you might find useful is Dennett's discussion of what an agent is (see first few chapters of Bacteria to Bach). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.
Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.
There is a paper which I believe is trying to do something similar to what you are attempting here:
Networks of Influence Diagrams: A Formalism for Representing Agents’ Beliefs and Decision-Making Processes, Gal and Pfeffer, Journal of Artificial Intelligence Research 33 (2008) 109-147
Are you aware of it? How do you think their ideas relate to yours?
Is this analogous to the stance-dependency of agents and intelligence?
Thanks Stuart, nice post.
I've moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:
The top-level category is reward hacking / reward corruption, which means that the agent's observed reward differs from true reward/task performance.
Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.
Tampering can subsequently be divided into further subcategories. Does the agent tamper with its reward function, its observations, or the preferences of a user giving feedback? Which things the agent might want to tamper with depends on how its observed rewards are computed.
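To make the taxonomy concrete, here is a minimal sketch in Python; the class and label names are my own illustrative choices, not fixed terminology:

```python
from enum import Enum

# Illustrative encoding of the taxonomy above. Top level: how the
# agent's observed reward comes apart from true task performance.
class RewardHacking(Enum):
    # Agent exploits a misspecification in the (unmodified) reward process.
    REWARD_GAMING = "gaming"
    # Agent modifies the process that computes the rewards.
    REWARD_TAMPERING = "tampering"

# Subcategories of tampering: which part of the reward-computing
# process the agent modifies.
class TamperingTarget(Enum):
    REWARD_FUNCTION = "reward function"
    OBSERVATIONS = "observations"
    USER_FEEDBACK = "preferences of a feedback-giving user"
```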
One advantage of this terminology is that it makes clearer what we're talking about. For example, it's pretty clear what reward function tampering refers to, and how it differs from observation tampering, even without consulting a full definition.
That said, I think your post nicely puts its finger on what we usually mean when we say wireheading, and it is something we have been talking about a fair bit. Translated into my terminology, I think your definition would be something like "wireheading = tampering with goal measurement".
Thanks for a nice post about causal diagrams!
Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG.
Totally agree. This is a big part of the reason why I'm excited about these kinds of diagrams.
This raises the issue of abstraction - the core problem of embedded agency. ... how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory?
Great question, I really think someone should look more carefully into this. A few potentially related papers:
In general, though, how to learn causal DAGs with symmetry is still an open question. We’d like something like Solomonoff Induction, but which can account for partial information about the internal structure of the causal DAG, rather than just overall input-output behavior.
Again, agreed. It would be great if we could find a way to make progress on this question.
Actually, I would argue that the model is naturalized in the relevant way.
When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.
As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish between effects that we'd like the agent to use from effects that we don't want the agent to use.
The current-RF solution doesn't rely on this distinction, it only relies on query-access to the reward function (which you could easily give an embedded RL agent).
The neat thing is that when we look at the objective of the current-RF agent using the same conceptual labeling of parts of the state, we see exactly why it works: the causal paths from actions to reward that pass through the reward function have been removed.
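To illustrate the idea, here is a toy sketch (all names are my own, assuming only query access to the agent's current reward function):

```python
# Toy sketch of the current-RF objective: the agent scores future states
# with the reward function it holds *now*, so modifying the reward
# function later cannot raise the objective. Names are illustrative.

def current_rf_return(trajectory, current_reward_fn, discount=0.9):
    """Discounted return where every state is scored by the reward
    function queried now, regardless of what is installed later."""
    return sum(discount**t * current_reward_fn(state)
               for t, state in enumerate(trajectory))

def standard_return(trajectory, reward_fn_at, discount=0.9):
    """Standard return: each state is scored by whatever reward
    function is in place at that time step -- tampering can change it."""
    return sum(discount**t * reward_fn_at(t)(state)
               for t, state in enumerate(trajectory))

# Example: a "tamper" action replaces the reward function with one that
# always pays 10, without any task progress.
honest_reward = lambda s: 1.0 if s == "task_done" else 0.0
tampered_reward = lambda s: 10.0

traj = ["tampered", "tampered"]      # states reached after tampering
reward_fn_at = lambda t: tampered_reward  # installed fn pays 10 from now on

standard = standard_return(traj, reward_fn_at)    # inflated by tampering
current = current_rf_return(traj, honest_reward)  # 0: no task progress
```

Under the standard objective, tampering looks great; under the current-RF objective it earns nothing, which is the removed causal path in miniature.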
Thanks for the Dewey reference, we'll add it.
We didn't expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.