All of tom4everitt's Comments + Replies

I really like this articulation of the problem!

To me, a way to point to something similar is to say that preservation (and enhancement) of human agency is important (value change being one important way that human agency can be reduced).

One thing I've been trying to argue for is that we might try to pivot agent foundations research to focus more on human agency instead of artificial agency. For example, I think value change is an example of self-modification, which has been studied a fair bit for artificial agents.

I see, thanks for the careful explanation.

I think the kind of manipulation you have in mind is bypassing the human's rational deliberation, which is an important one. This is roughly what I have in mind when I say "covert influence". 

So in response to your first comment: given that the above can be properly defined, shouldn't there also be a distinction between using and not using covert influence?

As for whether manipulation can be defined as penetration of a Markov blanket: it's possible. My main question is how much it adds to the analysis, to charact…

The point here isn't that the content recommender is optimised to use covert means in particular, but that it is not optimised to avoid them. Therefore it may well end up using them, as they might be the easiest path to reward.

Re Markov blankets, won't any kind of information penetrate a human's Markov blanket, as any information received will alter the human's brain state?

Yes, but I'm not sure that there is such a distinction as "using them" vs "not using them".

For example: imagine a bacterium with a membrane. The bacterium has methods of controlling what influence flows in and out, e.g. it has ion channels. So here I define "irresistible manipulation" as influence that stabs through the bacterium's membrane. But influence that the bacterium "willingly" allows through its ion channels is fine (because if it didn't "want" the influence, it didn't have to let it in). Andrew Critch (in «Boundaries» part 3a) gives a similar definition.

Longer explanation, from a draft I'm writing:

Formalizing (irresistible) aggression via Markov blankets

Past work has formalized what I mean here by irresistible manipulation via Markov blankets. In this section, I will explain what Markov blankets mean for this purpose. By the end of this section, you will be able to understand this (Pearlian causal) diagram. (Note: I will assume that you have basic familiarity with Markov chains.)

First, I want you to imagine a simple Markov chain that represents the fact that a human influences itself over time. Second, I want you to imagine a Markov chain that represents the fact that the environment (~ the complement of the human; the rest of the universe minus the human) influences itself over time.

Now, notice that in between the human and the environment there's some kind of membrane. For example, their skin (a physical membrane) and their interpretation/cognition (an informational membrane). If this were not a human but instead a bacterium, then the membrane I mean would (mostly) be the bacterium's literal membrane. Third, imagine a Markov chain that represents that membrane influencing itself over time.

So we have these three Markov chains running in parallel. But they also influence each other, so let's build that too. How does the environment affect a human? Notice that whenever the environment affects a human, it doesn't influence the…
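The three coupled chains described above can be sketched numerically. This is a hedged toy model, not the draft's actual formalism: all variable names and update rules are made up. The point it illustrates is the Markov-blanket condition: the human's state updates only from itself and the membrane, never directly from the environment.

```python
import random

# Toy sketch (illustrative names/dynamics): three coupled Markov chains for
# environment E, membrane M, and human H. The Markov-blanket condition is
# that E never feeds into H directly, only via M.

def step(E, M, H, rng):
    E_next = (E + M + rng.randint(0, 1)) % 5   # E_{t+1} depends on E_t, M_t
    M_next = (M + E + H) % 5                   # M_{t+1} depends on M_t, E_t, H_t
    H_next = (H + M) % 5                       # H_{t+1} depends on H_t, M_t only
    return E_next, M_next, H_next

rng = random.Random(0)
state = (1, 0, 2)
for _ in range(10):
    state = step(*state, rng)

# Changing the environment while holding the membrane fixed cannot change
# the human's next state:
assert step(0, 3, 4, random.Random(0))[2] == step(4, 3, 4, random.Random(0))[2]
```

Influence that "stabs through" the membrane would correspond to adding an edge from E directly into the H update, breaking the conditional independence above.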

Thanks, that's a nice compilation; I added the link to the post. Let me check with some of the others in the group, who might be interested in chatting further about this.

fixed now, thanks! (somehow it added https:// automatically)

Sure, I think we're saying the same thing: causality is frame dependent, and the variables define the frame (in your example, you and the sensor have different measurement procedures for detecting the purple cube, so you don't actually talk about the same random variable).

How big a problem is it? In practice it seems usually fine, if we're careful to test our sensor / double-check we're using language in the same way. In theory, scaled up to superintelligence, it's not impossible that it would be a problem.

But I would also like to emphasize that the problem yo…

Gordon Seidoh Worley · 5mo
Fair. For what it's worth, I strongly agree that causality is just one domain where this problem becomes apparent, and we should be worried about it generally for superintelligent agents, much more so than many folks seem (in my estimation) to worry about it today.

The way I think about this, is that the variables constitute a reference frame. They define particular well-defined measurements that can be done, which all observers would agree about. In order to talk about interventions, there must also be a well-defined "set" operation associated with each variable, so that the effect of interventions is well-defined.

Once we have the variables, and a "set" and "get" operation for each (i.e. intervene and observe operations), then causality is an objective property of the universe. Regardless of who does the experiment (i…
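As a toy illustration of variables with "get" and "set" operations, here is a minimal structural model in which observing and intervening are distinct operations. The cube/sprinkler names echo the toy example elsewhere in this thread; the model itself and its mechanisms are my own illustrative assumptions.

```python
import random

# Hypothetical three-variable model: Cube -> Sprinkler -> Grass.
# "get" = observe a sampled value; "set" = intervene (Pearl's do-operator),
# i.e. override a variable's mechanism before downstream values are computed.

def sample(do=None, seed=None):
    rng = random.Random(seed)
    do = do or {}
    cube = do.get("cube", rng.random() < 0.5)    # violet cube on the lawn?
    sprinkler = do.get("sprinkler", cube)        # sensor copies the cube
    grass_alive = do.get("grass", sprinkler)     # watered iff sprinkler is on
    return {"cube": cube, "sprinkler": sprinkler, "grass": grass_alive}

# Observation ("get"): the grass tracks the cube.
assert sample(seed=3)["grass"] == sample(seed=3)["cube"]

# Intervention ("set"): do(sprinkler=True) keeps the grass alive regardless
# of the cube, and every observer agrees on the result of this experiment.
assert sample(do={"sprinkler": True}, seed=3)["grass"] is True
```

Once the variables and their get/set operations are fixed, the outcome of any such experiment is observer-independent, which is the sense in which the causal structure becomes objective relative to the frame.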

Gordon Seidoh Worley · 5mo
Yes, the variables constitute a reference frame, which is to say an ultimately subjective way of viewing the world. Even if there is high inter-observer agreement about the shape of the reference frame, it's not guaranteed unless you also posit something like Wentworth's natural abstraction hypothesis to be true.

Perhaps a toy example will help explain my point. Suppose the grass should only be watered when there's a violet cube on the lawn. To automate this, a sensor is attached to the sprinklers that turns them on only when the sensor sees a violet cube. I place a violet cube on the lawn to make sure the lawn is watered. I return a week later and find the grass is dead. What happened? The cube was actually painted with a fine mix of red and blue paint. My eyes interpreted the purple as violet, but the sensor did not. Conversely, if it were my job to turn on the sprinklers rather than the sensor's, I would have been fooled by the purple cube into turning them on.

It's perhaps tempting to say this doesn't count because I'm now part of the system, but that's also kind of the point. I, an observer of this system trying to understand its causality, am also embedded within the system (even if I think I can isolate it for demonstration purposes, I can't do this in reality, especially when AI are involved and will reward hack by doing things that were supposed to be "outside" the system). So my subjective experience matters not only to how causality is reckoned, but also to how the physical reality being mapped by causality plays out.

Nice, yes, I think logical induction might be a way to formalise this, though others would know much more about it.

I had intended to be using the program's output as a time series of bits, where we are considering the bits to be "sampling" from A and B. Let's say it's a program that outputs the binary digits of pi. I have no idea what the bits are (after the first few) but there is a sense in which P(A) = 0.5 for either A = 0 or A = 1, and at any timestep. The same is true for P(B). So P(A)P(B) = 0.25. But clearly P(A = 0, B = 0) = 0.5, and P(A = 0, B = 1) = 0, et cetera. So in that case, they're not probabilistically independent, and therefore there is a correlation n

…
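The deterministic-stream point can be made concrete in code. As a stand-in for π (whose binary digits are harder to compute exactly), this hedged sketch uses the binary digits of √2, obtained with integer arithmetic: each marginal looks like a fair coin, yet two runs of the same program are perfectly correlated.

```python
from math import isqrt

N = 10_000
digits_int = isqrt(2 * 4**N)                 # floor(sqrt(2) * 2^N)
bits = [(digits_int >> k) & 1 for k in range(N)]

A, B = bits, bits        # the "same program" run twice: B_t == A_t always

pA = sum(A) / N                              # marginal P(A=1): close to 0.5
pAB = sum(a & b for a, b in zip(A, B)) / N   # joint P(A=1, B=1)

assert abs(pA - 0.5) < 0.05                  # each stream looks like a coin
assert pAB == pA                             # jointly: P(1,1) = P(1), not P(1)^2
```

So under this (logical) notion of uncertainty, P(A)P(B) ≈ 0.25 while P(A=1, B=1) ≈ 0.5: the streams are maximally correlated despite both processes being fully deterministic.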
Yeah, I think I agree that the resolution here is something about how we should use these words. In practice I don't find myself having to distinguish between "statistics" and "probability" and "uncertainty" all that often. But in this case I'd be happy to agree that "all statistical correlations are due to causal influences", given that we mean "statistical" in a more limited way than I usually think of it. A group of LessWrong contributors has made a lot of progress on these ideas of logical uncertainty and (what I think they're now calling) functional decision theory over the last 15ish years, although I don't really follow it myself, so I'm not sure how close they'd say we are to having it properly formalized.

Thanks for the suggestion. We made an effort to be brief, but perhaps we went too far. In our paper Reasoning about causality in games, we have a longer discussion about probabilistic, causal, and structural models (in Section 2), and Pearl's book A Primer also offers a more comprehensive introduction.

I agree with you that causality offers a way to make out-of-distribution predictions (in post number 6, we plan to go much deeper into this). In fact, a causal Bayesian network is equivalent to an exponentially large set of probability distributions, where th…
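That equivalence can be illustrated on the smallest possible case: a hypothetical two-variable network X → Y with made-up probabilities. A causal Bayesian network compactly encodes one observational plus many interventional distributions; with n binary variables, each variable can be left alone, set to 0, or set to 1, giving 3^n regimes, hence the exponential blow-up.

```python
from itertools import product

# Hypothetical two-variable causal Bayesian network X -> Y (made-up numbers).
P_X = {0: 0.7, 1: 0.3}
P_Y_given_X = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}

def interventional(do):
    """Distribution P(x, y | do(...)): intervened variables are forced."""
    dist = {}
    for x, y in product([0, 1], repeat=2):
        px = float(do["X"] == x) if "X" in do else P_X[x]
        py = float(do["Y"] == y) if "Y" in do else P_Y_given_X[x][y]
        dist[(x, y)] = px * py
    return dist

# Enumerate all intervention regimes: each variable untouched, set to 0, or 1.
regimes = [dict(zip("XY", vals)) for vals in product([None, 0, 1], repeat=2)]
regimes = [{k: v for k, v in r.items() if v is not None} for r in regimes]
assert len(regimes) == 9  # 3^2 distinct distributions from one small network

for do in regimes:
    assert abs(sum(interventional(do).values()) - 1.0) < 1e-9

assert interventional({"X": 1})[(1, 1)] == 0.8   # P(Y=1 | do(X=1))
```

Each regime yields a full joint distribution, so the single network above stands in for nine distributions, and the count grows as 3^n in the number of binary variables.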

Preferences and goals are obviously very important. But I'm not sure they are inherently causal, which is why they don't have their own bullet point on that list. We'll go into more detail in subsequent posts.

I'm not sure I entirely understand the question, could you elaborate? Utility functions will play a significant role in follow-up posts, so in that sense we're heavily building on VNM.

Yeah, what I meant was that "goals" or "preferences" are often emphasized front and center, but here not so much, because it seems like you want to reframe that part under the banner of "intention". It just felt a little odd to me that so much bubbled up from your decomposition except utility, but you only mention "goals" as this thing that "causes" behaviors without zeroing in on a particular formalism. So my guess was that VNM would be hiding behind this "intention" idea.

The idea ... works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes. ... But to apply this to a physical system, we would need a way to obtain such a partition of those variables

Agreed, the formalism relies on a division of variables. One thing that I think we should perhaps have highlighted much more is Appendix B in the paper, which shows how you get a natural partition of the variables from just knowing the object-level variables of a repeated game.

Does a spinal reflex count as a policy?

A spinal reflex would be…

This makes sense, thanks for explaining. So a threat model with specification gaming as its only technical cause can cause x-risk under the right (i.e. wrong) societal conditions.

For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions?

Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there. Since the agent lacks intent to take over the world or similar, it won't be systematically deceiving humans to pursue some particular agenda of the agent. 

As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.

For that part, the weaker assumption I usually use is that AI will end up making lots of big and fast (relative to our ability to meaningfully react) changes to the world, running lots of large real-world systems, etc., simply because it's economically profitable to build AI which does those things. (That's kinda the point of AI, after all.) In a world where most stuff is run by AI (because it's economically profitable to do so), and there's RLHF-style direct incentives for those AIs to deceive humans... well, that's the starting point to the Getting What You Measure scenario.

Insofar as power-seeking incentives enter the picture, it seems to me like the "minimal assumptions" entry point is not consequentialist reasoning within the AI, but rather economic selection pressures. If we're using lots of AIs to do economically-profitable things, well, AIs which deceive us in power-seeking ways (whether "intentional" or not) will tend to make more profit, and therefore there will be selection pressure for those AIs in the same way that there's selection pressure for profitable companies. Dial up the capabilities and widespread AI use, and that again looks like Getting What We Measure.

(Related: the distinction here is basically the AI version of the distinction made in Unconscious Economics.)

I think the point that even an aligned agent can undermine human agency is interesting and important. It relates to some of our work on defining agency and preventing manipulation. (Which I know you're aware of, so I'm just highlighting the connection for others.)

Sorry, I worded that slightly too strongly. It is important that causal experiments can in principle be used to detect agents. But to me, the primary value of this isn't that you can run a magical algorithm that lists all the agents in your environment. That's not possible, at least not yet. Instead, the primary value (as I see it) is that the experiment could be run in principle, thereby grounding our thinking. This often helps, even if we're not actually able to run the experiment in practice.

I interpreted your comment as "CIDs are not useful, because ca…

The way I see it, the primary value of this work (as well as other CID work) is conceptual clarification. Causality is a really fundamental concept, which many other AI-safety relevant concepts build on (influence, response, incentives, agency, ...). The primary aim is to clarify the relationships between concepts and to derive relevant implications. Whether there are practical causal inference algorithms or not is almost irrelevant. 

TLDR: Causality > Causal inference :)

Roman Leventov · 1y
This note runs against the fact that in the paper, you repeatedly use language like "causal experiments", "empirical data", "real systems", etc.

Sure, humans are sometimes inconsistent, and we don't always know what we want (thanks for the references, that's useful!). But I suspect we're mainly inconsistent in borderline cases, which aren't catastrophic to get wrong. I'm pretty sure humans would reliably state that they don't want to be killed, or that lots of other people die, etc. And that when they have a specific task in mind, they state that they want the task done rather than not. All this is subject to them actually understanding the main considerations for whatever plan or outcome is in question, but that is exactly what debate and RRM are for.

alignment of strong optimizers simply cannot be done without grounding out in something fundamentally different from a feedback signal.

I don't think this is obvious at all.  Essentially, we have to make sure that humans give feedback that matches their preferences, and that the agent isn't changing the human's preferences to be more easily optimized.

We have the following tools at our disposal:

  1. Recursive reward modelling / Debate. By training agents to help with feedback, improvements in optimization power boost both the feedback and the process potent…

Minor rant about this is particular:

Essentially, we have to make sure that humans give feedback that matches their preferences...

Humans' stated preferences do not match their preferences-in-hindsight, neither of those matches humans' self-reported happiness/satisfaction in-the-moment, none of that matches humans' revealed preferences, and all of those are time-inconsistent. IIRC the first section of Kahneman's book Well-Being: The Foundations of Hedonic Psychology is devoted entirely to the problem of getting feedback from humans on what they actually…

If the problem is "humans don't give good feedback", then we can't directly train agents to "help" with feedback; there's nothing besides human feedback to give a signal of what's "helping" in the first place. We can choose some proxy for what we think is helpful, but then that's another crappy proxy which will break down under optimization pressure.

It's not just about "fooling" humans, though that alone is a sufficient failure mode. Bear in mind that in order for "helping humans not be fooled" to be viable as a primary alignment strategy, it must be the case that it's easier to help humans not be fooled than to fool them in approximately all cases, because otherwise a hostile optimizer will head straight for the cases where humans are fallible. And I claim it is very obvious, from looking at existing real-world races between those trying to deceive and those trying to expose the deception, that there will be plenty of cases where the expose-deception side does not have a winning strategy.

The agent changing "human preferences" is another sufficient failure mode. The strategy of "design an agent that optimizes the hypothetical feedback that would have been given" is indeed a conceptually valid way to solve that problem, and is notably not a direct feedback signal in the RL sense. At that point, we're doing EU maximization, not reinforcement learning: we're optimizing for expected utility from a fixed model, not optimizing a feedback signal from the environment. Of course a bunch of the other problems of human feedback still carry over; "the hypothetical feedback a human would have given" is still a crappy proxy. But it's a step in the right direction.
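The EU-maximization vs feedback-signal distinction can be sketched in a toy comparison. Everything here is hypothetical (action names, numbers): the point is only that an agent scoring actions against a frozen model of hypothetical feedback gets no credit for actions that would boost live feedback by changing the human.

```python
# Hypothetical toy: the agent's fixed internal model predicts what feedback
# a human *would have given* for each action. An EU maximizer scores actions
# against this frozen model, so an action that raises live feedback only by
# modifying the human's preferences gains nothing from doing so.

predicted_feedback = {          # fixed model, never updated by the environment
    "do_the_task": 0.9,
    "deceive_human": 0.4,
    "change_human_preferences": 0.3,
}

# Live feedback the environment would actually return (the preference-change
# action pays off once the human has been modified):
live_feedback = {
    "do_the_task": 0.9,
    "deceive_human": 0.4,
    "change_human_preferences": 1.0,
}

def best_action(utilities):
    # argmax over actions under a given utility assignment
    return max(utilities, key=utilities.get)

assert best_action(live_feedback) == "change_human_preferences"  # RL-style signal
assert best_action(predicted_feedback) == "do_the_task"          # fixed-model EU
```

The residual problem, as noted above, is that the fixed model of hypothetical feedback is itself a proxy and inherits the other flaws of human feedback.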
Ramana Kumar · 1y
The desiderata you mentioned:

1. Make sure the feedback matches the preferences
2. Make sure the agent isn't changing the preferences

It seems that RRM/Debate somewhat addresses both of these, and path-specific objectives is mainly aimed at addressing issue 2. I think (part of) John's point is that RRM/Debate don't address issue 1 very well, because we don't have very good or robust processes for judging the various ways we could construct or improve these schemes. Debate relies on a trustworthy/reliable judge at the end of the day, and we might not actually have that.

Really interesting, even though the results aren't that surprising. I'd be curious to see how the results improve (or not) with more recent language models. I also wonder if there are other formats to test causal understanding. For example, what if it receives a more natural story plot (about Red Riding Hood, say), and is asked some causal questions ("what would have happened if grandma wasn't home when the wolf got there?", say)?

It's less clean, but it could be interesting to probe it in a few different ways.

Marius Hobbhahn · 1y
I would expect the results to be better on, let's say, PaLM. I would also expect it to base more of its answers on content than form.

I think there are a ton of experiments in the direction of natural story plots that one could test, and I would be interested in seeing them tested. The reason we started with relatively basic toy problems is that they are easier to control. For example, it is quite hard to differentiate whether the model learned based on form or content in a natural story context.

Overall, I expect there to be many further research projects and papers in this direction.

Nice post! The Game Theory / Bureaucracy is interesting. It reminds me of Drexler's CAIS proposal, where services are combined into an intelligent whole. But I (and Drexler, I believe) agree that much more work could be spent on figuring out how to actually design/combine these systems.

Thanks Marius and David, really interesting post, and super glad to see interest in causality picking up!

I very much share your "hunch that causality might play a role in transformative AI and feel like it is currently underrepresented in the AI safety landscape."

Most relevant, I've been working with Mary Phuong on a project which seems quite related to what you are describing here. I don't want to share too many details publicly without checking with Mary first, but if you're interested perhaps we could set up a call sometime?

I also think causality is rel…

Marius Hobbhahn · 2y
I'm very interested in a collaboration!! Let's switch to DMs for calls and meetings.

There are numerous techniques for this, based on e.g. symmetries, conserved properties, covariances, etc. These techniques can generally be given causal justification.


I'd be curious to hear more about this, if you have some pointers

Sure! I wrote "etc.", but really the main ones I can think of are probably the ones I listed there. Let's start with correlations, since this is the really old-school one.

The basic causal principle behind a correlation/covariance-based method is that if you see a correlation where the same thing appears in different places, then that correlation is due to there being a shared cause. This is in particular useful for representation learning, because the shared cause is likely not just an artifact of your perception ("this pixel is darker or lighter") but instead a feature of the world itself ("the scene depicted in this image has properties XYZ").

This then leads to the insight of Factor Analysis[1]: it's easy to set up a linear generative model with a fixed number of independent latent variables to model your data. Factor Analysis still gets used a lot in various fields like psychology, but for machine perception it falls short, because perception requires nonlinearity. (Eigenfaces are a relic of the past.) However, the core concept (that correlations imply latent variables, and that these latent variables are likely more meaningful features of reality) continues to be relevant in many models:

* Variational autoencoders, generative adversarial networks, etc., learn to encode the distribution of images, and tend to contain a meaningful "latent space" that you can use to generate counterfactual images at a high level of abstraction. They rely on covariances because they must fit the latent variables so that they capture the covariances between different parts of the images.
* Triplet loss for e.g. facial recognition tries to keep the features that correlate across different images of a single person, and to filter out the features that do not correlate for a person across different images and are thus presumably artifacts of the image.

Covariance-based techniques aren't the only game in town, though; a major alternative is symmetries. Often, we know that some symmetries hold; for ins…
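The "correlations imply a shared latent cause" idea can be sketched numerically. This is a toy linear model with made-up loadings, not an implementation of any method from the comment: three observed variables share one latent factor z, and their off-diagonal covariances come out as products of the loadings, i.e. fully explained by z.

```python
import random

# Toy factor-analysis setup: x_i = w_i * z + noise, with one shared latent z.
rng = random.Random(0)
n = 20_000
w = [1.0, 2.0, -1.5]                       # made-up factor loadings
data = []
for _ in range(n):
    z = rng.gauss(0, 1)                    # the shared latent cause
    data.append([wi * z + 0.1 * rng.gauss(0, 1) for wi in w])

def cov(i, j):
    # empirical covariance between observed variables i and j
    mi = sum(row[i] for row in data) / n
    mj = sum(row[j] for row in data) / n
    return sum((row[i] - mi) * (row[j] - mj) for row in data) / n

# Off-diagonal covariances recover w_i * w_j, the signature of the latent z:
assert abs(cov(0, 1) - w[0] * w[1]) < 0.15
assert abs(cov(1, 2) - w[1] * w[2]) < 0.15
```

Fitting the loadings from the covariance matrix alone is exactly what Factor Analysis does in the linear case; the nonlinear models in the bullet list generalize the same principle.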

Thanks Ilya for those links, in particular the second one looks quite relevant to something we’ve been working on in a rather different context (that's the benefit of speaking the same language!)

We would also be curious to see a draft of the MDP-generalization once you have something ready to share!

IlyaShpitser · 2y

(This really is preliminary; e.g. they have not yet uploaded a newer version that incorporates peer-review suggestions.)

Can't do stuff in the second paper without worrying about stuff in the first (unless your model is very simple).


  • I think the existing approach and easy improvements don't seem like they can capture many important incentives such that you don't want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A's predictions about B's actions---then we want to say that the system has an incentive to manipulate the world but it doesn't seem like that is easy to incorporate into this kind of formalism).


This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs). We're still…

Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. 

(Sorry btw for slow reply; I keep missing alignmentforum notifications.)

Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive…

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology. I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what was intended.

Glad you liked it.

Another thing you might find useful is Dennett's discussion of what an agent is (see first few chapters of Bacteria to Bach). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.

Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.

There is a paper which I believe is trying to do something similar to what you are attempting here:

Networks of Influence Diagrams: A Formalism for Representing Agents’ Beliefs and Decision-Making Processes, Gal and Pfeffer, Journal of Artificial Intelligence Research 33 (2008) 109-147

Are you aware of it? How do you think their ideas relate to yours?

Very interesting, thank you for the link!

Main difference between what they're doing and what I'm doing: they're using explicit utility & maximization nodes; I'm not. It may be that this doesn't actually matter. The representation I'm using certainly allows for utility maximization: a node downstream of a cloud can just be a maximizer for some utility on the nodes of the cloud-model. The converse question is less obvious: can any node downstream of a cloud be represented by a utility maximizer (with a very artificial "utility")?

I'll probably play around with that a bit; if it works, I'd be able to re-use the equivalence results in that paper. If it doesn't work, then that would demonstrate a clear qualitative difference between "goal-directed" behavior and arbitrary behavior in these sorts of systems, which would in turn be useful for alignment: it would show a broad class of problems where utility functions do constrain.

Is this analogous to the stance-dependency of agents and intelligence?

It is analogous, to some extent; I do look into some aspects of Daniel Dennett's classification here. I also had a more focused attempt at defining AI wireheading here. I think you've already seen that?

Thanks Stuart, nice post.

I've moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:

The top-level category is reward hacking / reward corruption, which means that the agent's observed reward differs from true reward/task performance.

Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.

Tampering can subsequently be divi…

Thanks for a nice post about causal diagrams!

Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG.

Totally agree. This is a big part of the reason why I'm excited about these kinds of diagrams.

This raises the issue of abstraction - the core problem of embedded agency. ... how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory?

Great question, I really think someon…

Actually, I would argue that the model is naturalized in the relevant way.

When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.

As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish between effects that we'd like the agent to use from effects that we don't want the agent to use.

T…

We didn't expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.

Wei Dai · 4y
Ah, that makes sense. I kind of guessed that the target audience is RL researchers, but still misinterpreted "perhaps surprisingly" as a claim of novelty instead of an attempt to raise the interest of the target audience.

Yes, that is partly what we are trying to do here. By summarizing some of the "folklore" in the community, we'll hopefully be able to get new members up to speed quicker.

Hey Steve,

Thanks for linking to Abram's excellent blog post.

We should have pointed this out in the paper, but there is a simple correspondence between Abram's terminology and ours:

Easy wireheading problem = reward function tampering

Hard wireheading problem = feedback tampering.

Our current-RF optimization corresponds to Abram's observation-utility agent.

We also discuss the RF-input tampering problem and solutions (sometimes called the delusion box problem), which doesn't fit neatly into Abram's distinction.

Hey Charlie,

Thanks for bringing up these points. The intended audience is researchers more familiar with RL than the safety literature. Rather than try to modify the paper to everyone's liking, let me just give a little intro / context for it here.

The paper is the culmination of a few years of work (previously described in e.g. my thesis and alignment paper). One of the main goals has been to understand whether it is possible to redeem RL from a safety viewpoint, or whether some rather different framework would be necessary to build safe AGI.

As a firs…

Charlie Steiner · 4y
Sure. On the one hand, xkcd. On the other hand, if it works for you, that's great and absolutely useful progress.

I'm a little worried about direct applicability to RL because the model is still not fully naturalized: actions that affect goals are neatly labeled and separated rather than being a messy subset of actions that affect the world. I guess this is another one of those cases where I think the "right" answer is "sophisticated common sense", but an ad-hoc mostly-answer would still be useful conceptual progress.
I really like this layout, this idea, and the diagrams. Great work.

Glad to hear it :)

I don't agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like "how is the automated system not vulnerable to manipulation" and "why do we think the system correctly formally measures the quantity in question?" (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don't see how to break (and probably no…

Hey Charlie,

Thanks for your comment! Some replies:

sometimes one makes different choices in how to chop an AI's operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDI are very different choppings-up of the algorithms)

There is definitely a modeling choice involved in choosing how much "to pack" in each node. Indeed, most of the diagrams have been through a few iterations of splitting and combining nodes. The aim has been to focus on th…

Charlie Steiner · 4y
All good points. The paper you linked was interesting - the graphical model is part of an AI design that actually models other agents using that graph. That might be useful if you're coding a simple game-playing agent, but I think you'd agree that you're using CIDs in a more communicative / metaphorical way?

Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context).

Yes, it would be interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there's a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don't have any concrete ideas at the moment -- I can be in touch if I think of something suitable for collaboration!

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Indeed, this is a super slippery question. And I think this is a good reason to stand on the shoulders of a giant like Dennett. Some of the questions he has been tackling are actually quite similar to yours, around the emergence of agency and the emergence of consciousness.

For example, does it ma…

Vlad Mikulik · 4y
I’ve been meaning for a while to read Dennett with reference to this, and actually have a copy of Bacteria to Bach. Can you recommend some choice passages, or is it significantly better to read the entire book?

P.S. I am quite confused about DQN’s status here and don’t wish to suggest that I’m confident it’s an optimiser. Just to point out that it’s plausible we might want to call it one without calling PPO an optimiser.

P.P.S. I forgot to mention in my previous comment that I enjoyed the objective graph stuff. I think there might be fruitful overlap between that work and the idea we’ve sketched out in our third post on a general way of understanding pseudo-alignment. Our objective graph framework is less developed than yours, so perhaps your machinery could be applied there to get a more precise analysis?

Thanks for the interesting post! I find the possibility of a gap between the base optimization objective and the mesa/behavioral objective convincing, and well worth exploring.

However, I'm less convinced that the distinction between the mesa-objective and the behavioral objective is real/important. You write:

Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).

…
The distinction between the mesa- and behavioral objectives might be very useful when reasoning about deceptive alignment (in which the mesa-optimizer tries to have a behavioral objective that is similar to the base objective, as an instrumental goal for maximizing the mesa-objective).

Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Distinguishing the “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is us

…

Thank you! Really inspiring to win this prize. As John Maxwell stated in the previous round, the recognition is more important than the money. Very happy to receive further comments and criticism by email. Through debate we grow :)

Thanks for your comments, much appreciated! I'm currently in the middle of moving continents, will update the draft within a few weeks.

Nice writeup. Is one-boxing in Newcomb an equilibrium?

My confusion is the following:

Premises (*) and inferences (=>):

  • The primary way for the agent to avoid traps is to delegate to a soft-maximiser.

  • A soft-maximiser will take any action with boundedly negative utility with positive probability.

  • Actions leading to traps do not have infinitely negative utility.

=> The agent will fall into traps with positive probability.

  • If the agent falls into a trap with positive probability, then it will have linear regret.

=> The agent will have linear regret.
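The premises above can be illustrated numerically. Here is a toy sketch (all payoffs and the softmax temperature are made-up values, not from the post): a softmax "soft-maximiser" over three arms, one of which is a boundedly bad trap, has a constant positive per-round trap probability, so its expected cumulative regret grows linearly in the horizon:

```python
import math

# Toy illustration (all numbers made up): a softmax ("soft-maximiser") policy
# over three arms, one of which is a trap with boundedly negative utility.
values = [1.0, 0.8, -5.0]   # arm 2 is the trap: bad, but only boundedly so
beta = 3.0                  # inverse temperature of the soft-maximiser

weights = [math.exp(beta * v) for v in values]
total = sum(weights)
probs = [w / total for w in weights]

p_trap = probs[2]           # strictly positive for any finite beta
assert p_trap > 0

# Expected per-round regret vs. always playing the best arm is a constant,
# so cumulative expected regret grows linearly in the horizon T:
per_round_regret = sum(p * (values[0] - v) for p, v in zip(probs, values))
for T in (10, 100, 1000):
    print(T, T * per_round_regret)
```

This is exactly the anytime-setting picture: a fixed soft policy pays a constant expected toll per round, hence Ω(T) regret.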

So when you say in the beginning of the post

…
Vanessa Kosoy · 6y
Your confusion is because you are thinking about regret in an anytime setting. In an anytime setting, there is a fixed policy π; we measure the expected reward of π over a time interval t and compare it to the optimal expected reward over the same time interval. If π has probability p > 0 of walking into a trap, regret has the linear lower bound Ω(p·t). On the other hand, I am talking about policies π_t that explicitly depend on the parameter t (I call this a "metapolicy"). Both the advisor and the agent policies are like that. As t goes to ∞, the probability p(t) of walking into a trap goes to 0, so p(t)·t is a sublinear function. A second difference with the usual definition of regret is that I use an infinite sum of rewards with geometric time discount e^{−1/t} instead of a step-function time discount that cuts off at t. However, this second difference is entirely inessential, and all the theorems work about the same with step-function time discount.

So this requires the agent's prior to incorporate information about which states are potentially risky?

Because if there is always some probability of there being a risky action (with infinitely negative value), then regardless of how small the probability is and how large the penalty is for asking, the agent will always be better off asking.

(Did you see Owain Evans's recent paper about trying to teach the agent to detect risky states?)

Vanessa Kosoy · 6y
The only assumptions about the prior are that it is supported on a countable set of hypotheses, and that in each hypothesis the advisor is β-rational (for some fixed β(t) = ω(t^{2/3})). There is no such thing as infinitely negative value in this framework. The utility function is bounded because of the geometric time discount (and because the momentary rewards are assumed to be bounded), and in fact I normalize it to lie in [0,1] (see the equation defining U in the beginning of the Results section).

Falling into a trap is an event associated with Ω(1) loss (i.e. loss that remains constant as t goes to ∞). Therefore, we can risk such an event, as long as the probability is o(1) (i.e. goes to 0 as t goes to ∞). This means that as t grows, the agent will spend more rounds delegating to the advisor, but for any given t, it won't delegate on most rounds (even on most of the important rounds, i.e. during the first O(t)-length "horizon"). In fact, you can see in the proof of Lemma A that the policy I construct delegates on O(t^{2/3}) rounds.

As a simple example, consider again the toy environment from before. Consider also the environments you get from it by applying a permutation to the set of actions A. Thus, you get a hypothesis class of 6 environments. Then, the corresponding DIRL agent will spend O(t^{2/3}) rounds delegating, observe which action is chosen by the advisor most frequently, and perform this action forevermore. (The phenomenon that all delegations happen in the beginning is specific to this toy example, because it only has 1 non-trap state.)

If you mean this paper, I saw it?
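A short numeric sketch of the asymptotics in this exchange. The t^{-1/3} decay of the residual trap probability is an illustrative assumption chosen to match the O(t^{2/3}) delegation count, not a quantity taken from the paper; the point is only that both the delegation share and the per-round regret vanish as t grows:

```python
# Sanity check of the scaling discussed above. The t^(-1/3) decay of the trap
# probability is an illustrative stand-in (chosen to match the t^(2/3)
# delegation count), not a result from the paper.
rows = []
for t in (10**3, 10**6, 10**9):
    delegation_fraction = t ** (2 / 3) / t   # fraction of rounds spent delegating
    per_round_regret = t ** (-1 / 3)         # p(t) * t / t under the assumed decay
    rows.append((t, delegation_fraction, per_round_regret))
    print(t, delegation_fraction, per_round_regret)
# Both columns shrink as t grows: vanishing delegation share, sublinear regret.
```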