in the book of why, he basically now says it's impossible to learn causality from data, which is a bit of a confusing message if you come from his previous books.

but now with language models, I think his hopes are up again, since models can basically piggy-back on causal relationships inferred by humans

Reward Hacking from a Causal Perspective

tom4everitt2moΩ330

Sorry, this post got stuck on the backburner for a little bit. But the content will largely be from "Robust Agents Learn Causal World Models"

A Shutdown Problem Proposal

tom4everitt10moΩ440

The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:
Does not want to manipulate the shutdown button
Does respond to the shutdown button
Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.

From a quick read, your proposal seems closely related to Jessica Taylor's causal-counterfactual utility indifference. Ryan Carey and I also recently had a paper formalising some similar ideas, with some further literature review https://arxiv.org/abs/2305.19861

3. Premise three & Conclusion: AI systems can affect value change trajectories & the Value Change Problem

tom4everitt1yΩ450

I really like this articulation of the problem!

To me, a way to point to something similar is to say that preservation (and enhancement) of human agency is important (value change being one important way that human agency can be reduced). https://www.alignmentforum.org/s/pcdHisDEGLbxrbSHD/p/Qi77Tu3ehdacAbBBe

One thing I've been trying to argue for is that we might try to pivot agent foundations research to focus more on human agency instead of artificial agency. For example, I think value change is an example of self-modification, which has been studied a fair bit for artificial agents.

Reward Hacking from a Causal Perspective

tom4everitt1y10

I see, thanks for the careful explanation.

I think the kind of manipulation you have in mind is bypassing the human's rational deliberation, which is an important one. This is roughly what I have in mind when I say "covert influence".

So in response to your first comment: given that the above can be properly defined, there should also be a distinction between using and not using covert influence?

Whether manipulation can be defined as penetration of a Markov blanket, it's possible. I think my main question is how much it adds to the analysis, to characterise it in terms of a Markov blanket. Because it's non-trivial to define the membrane variable, in a way that information that "covertly" passes through my eyes and ears bypasses the membrane, while other information is mediated by the membrane.

The SEP article does a pretty good job at spelling out the many different forms manipulation can take https://plato.stanford.edu/entries/ethics-manipulation/

Reward Hacking from a Causal Perspective

tom4everitt1y20

The point here isn't that the content recommender is optimised to use covert means in particular, but that it is not optimised to avoid them. Therefore it may well end up using them, as they might be the easiest path to reward.

Re Markov blankets, won't any kind of information penetrate a human's Markov blanket, as any information received will alter the human's brain state?

Agency from a causal perspective

tom4everitt1y20

Thanks, that's a nice compilation, I added the link to the post. Let me check with some of the others in the group, who might be interested in chatting further about this

Agency from a causal perspective

tom4everitt1y10

fixed now, thanks! (somehow it added https:// automatically)

Causality: A Brief Introduction

tom4everitt1yΩ240

Sure, I think we're saying the same thing: causality is frame dependent, and the variables define the frame (in your example, you and the sensor have different measurement procedures for detecting the purple cube, so you don't actually talk about the same random variable).

How big a problem is it? In practice it seems usually fine, if we're careful to test our sensor / double check we're using language in the same way. In theory, scaled up to super intelligence, it's not impossible it would be a problem.

But I would also like to emphasize that the problem you're pointing to isn't restricted to causality, it goes for all kinds of linguistic reference. So to the extent we like to talk about AI systems doing things at all, causality is no worse than natural language, or other formal languages.

I think people sometimes hold it to a higher bar than natural language, because it feels like a formal language could somehow naturally intersect with a programmed AI. But of course causality doesn't solve the reference problem in general. Partly for this reason, we're mostly using causality as a descriptive language to talk clearly and precisely (relative to human terms) about AI systems and their properties.

Causality: A Brief Introduction

tom4everitt1yΩ110

The way I think about this, is that the variables constitute a reference frame. They define particular well-defined measurements that can be done, which all observers would agree about. In order to talk about interventions, there must also be a well-defined "set" operation associated with each variable, so that the effect of interventions is well-defined.

Once we have the variables, and a "set" and "get" operation for each (i.e. intervene and observe operations), then causality is an objective property of the universe. Regardless who does the experiment (i.e. sets a few variables) and does the measurement (i.e. observes some variables), the outcome will follow the same distribution.

So in short, I don't think we need to talk about an agent observer beyond what we already say about the variables.