Partial Agency

abramdemski

Epistemic status: very rough intuitions here.

I think there's something interesting going on with Evan's notion of myopia.

Evan has been calling this thing "myopia". Scott has been calling it "stop-gradients". In my own mind, I've been calling the phenomenon "directionality". Each of these words gives a different set of intuitions about how the cluster could eventually be formalized.

Stop-Gradients

Nash equilibria are, abstractly, modeling agents via an equation like $a^{*} = \operatorname{argmax}_{a} f(a, a^{*})$. In words: $a^{*}$ is the agent's mixed strategy. The payoff $f(\cdot, \cdot)$ is a function of the mixed strategy in two ways: the first argument is the causal channel, where actions directly have effects; the second argument represents the "acausal" channel, IE, the fact that the other players know the agent's mixed strategy and this influences their actions. The agent is maximizing across the first channel, but "ignoring" the second channel; that is why we have to solve for a fixed point to find Nash equilibria. This motivates the notion of "stop gradient": if we think in terms of neural-network type learning, we're sending the gradient through the first argument but not the second. (It's a kind of mathematically weird thing to do!)
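As a rough illustration of the stop-gradient idea, here is a minimal Python sketch. The payoff function `f`, the learning rate, and the starting point are all assumptions chosen for illustration; they are not taken from any existing formalization. The sketch compares gradient ascent that follows only the first argument with ascent that follows both.

```python
# Toy payoff f(a, a_star): `a` is the causal channel, `a_star` the
# "acausal" channel (how others respond to the agent's known strategy).
# The functional form is an illustrative assumption.
def f(a, a_star):
    return a * (1.0 - a_star) - 0.5 * a ** 2

def grad_first_arg(a, a_star):
    # Partial derivative of f with respect to its first argument only.
    return (1.0 - a_star) - a

def grad_second_arg(a, a_star):
    # Partial derivative of f with respect to its second argument only.
    return -a

def ascend(use_stop_gradient, a=0.9, lr=0.1, steps=500):
    for _ in range(steps):
        g = grad_first_arg(a, a)          # evaluated at a_star = a
        if not use_stop_gradient:
            g += grad_second_arg(a, a)    # also credit the second channel
        a += lr * g
    return a

a_partial = ascend(use_stop_gradient=True)   # fixed point of the argmax equation
a_full = ascend(use_stop_gradient=False)     # maximizer of f(a, a)

print(a_partial, f(a_partial, a_partial))
print(a_full, f(a_full, a_full))
```

Under these assumptions the stop-gradient run converges to $a = 1/2$ (payoff $1/8$), which satisfies the fixed-point equation, while the run that also follows the second channel converges to $a = 1/3$ (payoff $1/6$). The stop-gradient agent leaves payoff on the table.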

Myopia

Thinking in terms of iterated games, we can also justify the label "myopia". Thinking in terms of "gradients" suggests that we're doing some kind of training involving repeatedly playing the game. But we're training an agent to play as if it's a single-shot game: the gradient is rewarding behavior which gets more reward within the single round even if it compromises long-run reward. This is a weird thing to do: why implement a training regime to produce strategies like that, if we believe the Nash-equilibrium model, IE we think the other players will know our mixed strategy and react to it? We can, for example, win chicken by going straight more often than is myopically rational. Generally speaking, we expect to get better rewards in the rounds after training if we optimized for non-myopic strategies during training.
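The chicken example can be made concrete with a small Python sketch. The payoff numbers are illustrative assumptions, and the "informed" opponent is modeled very crudely as someone who sees our probability of going straight and picks their best pure reply.

```python
# Row player's payoff for (my_move, their_move); the numbers are
# illustrative assumptions. The game is symmetric.
PAYOFF = {
    ("swerve", "swerve"): 0.0,
    ("swerve", "straight"): -1.0,
    ("straight", "swerve"): 1.0,
    ("straight", "straight"): -10.0,
}

def expected_payoff(p_me, p_them):
    """Expected payoff when each side goes straight with the given probability."""
    total = 0.0
    for mine, pm in (("straight", p_me), ("swerve", 1.0 - p_me)):
        for theirs, pt in (("straight", p_them), ("swerve", 1.0 - p_them)):
            total += pm * pt * PAYOFF[(mine, theirs)]
    return total

def informed_reply(p_me):
    """Opponent who knows p_me picks the pure move that is best for them."""
    straight = expected_payoff(1.0, p_me)   # opponent's payoff if they go straight
    swerve = expected_payoff(0.0, p_me)     # opponent's payoff if they swerve
    return 1.0 if straight > swerve else 0.0

# Symmetric mixed equilibrium: each side is indifferent at p = 0.1.
p_nash = 0.1
myopic = expected_payoff(p_nash, p_nash)

# Going straight far more often than that, against an opponent who knows it.
p_commit = 1.0
committed = expected_payoff(p_commit, informed_reply(p_commit))

print(myopic, committed)
```

With these numbers the symmetric mixed equilibrium yields an expected payoff of -0.1, while always going straight against an informed opponent yields 1, because the opponent's best reply is to swerve.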

Directionality

To justify my term "directionality" for these phenomena, we have to look at a different example: the idea that "when beliefs and reality don't match, we change our beliefs". IE: when optimizing for truth, we optimize "only in one direction". How is this possible? We can write down a loss function, such as Bayes' loss, to define accuracy of belief. But how can we optimize it only "in one direction"?

We can see that this is the same thing as myopia. When training predictors, we only consider the efficacy of hypotheses one instance at a time. Consider supervised learning: we have "questions" $x_{1}, x_{2}, \ldots$ and are trying to learn "answers" $y_{1}, y_{2}, \ldots$. If a neural network were somehow able to mess with the training data, it would not have much pressure to do so. If it could give an answer on instance $x_{1}$ which improved its ability to answer on $x_{2}$ by manipulating $y_{2}$, the gradient would not specially favor this. Suppose it is possible to take some small hit (in log-loss terms) on $y_{1}$ for a large gain on $y_{2}$. The large gain for $x_{2}$ would not reinforce the specific neural patterns responsible for making $y_{2}$ easy (only the patterns responsible for successfully taking advantage of the easiness). The small hit on $x_{1}$ means there's an incentive not to manipulate $y_{2}$.

It is possible that the neural network learns to manipulate the data, if by chance the neural patterns which shift the answer on $x_{1}$ are the same as those which successfully exploit the manipulation at $x_{2}$. However, this is a fragile situation: if there are other neural sub-patterns which are equally capable of giving the easy answer on $x_{2}$, the reward gets spread around. (Think of these as parasites taking advantage of the manipulative strategy without doing the work necessary to sustain it.) Because of this, the manipulative sub-pattern may not "make rent": the amount of positive gradient it gets may not make up for the hit it takes on $x_{1}$. And all the while, neural sub-patterns which do better on $x_{1}$ (by refusing to take the hit) will be growing stronger. Eventually they can take over. This is exactly like myopia: strategies which do better in a specific case are favored for that case, despite global loss. The neural network fails to successfully coordinate with itself to globally minimize loss.

To see why this is also like stop-gradients, think about the loss function as $l (w, w^{*})$ : the neural weights $w$ determine loss through a "legitimate" channel (the prediction quality on a single instance), plus an "illegitimate" channel (the cross-instance influence which allows manipulation of $y_{2}$ through the answer given for $x_{1}$ ). We're optimizing through the first channel, but not the second.
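One way to write this down, as a sketch in the notation above (the learning rate $\eta$ and the partial-derivative symbols $\partial_{1}, \partial_{2}$ are notation introduced here for illustration): the total derivative of the loss along the diagonal $w^{*} = w$ splits into one term per channel,

$$
\frac{d}{dw}\, l(w, w) = \partial_{1} l(w, w^{*})\big|_{w^{*}=w} + \partial_{2} l(w, w^{*})\big|_{w^{*}=w},
$$

whereas the training update follows only the first term:

$$
w \leftarrow w - \eta\, \partial_{1} l(w, w^{*})\big|_{w^{*}=w}.
$$

A fixed point of this update satisfies $\partial_{1} l = 0$, which in general is not a stationary point of $l(w, w)$ unless $\partial_{2} l$ also happens to vanish there.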

The difference between supervised learning and reinforcement learning is just: reinforcement learning explicitly tracks helpfulness of strategies across time, rather than assuming a high score at $x_{2}$ has to do with only behaviors at $x_{2}$ ! As a result, RL can coordinate with itself across time, whereas supervised learning cannot.

Keep in mind that this is a good thing: the algorithm may be "leaving money on the table" in terms of prediction accuracy, but this is exactly what we want. We're trying to make the map match the territory, not the other way around.

Important side-note: this argument obviously has some relation to the question of how we should think about inner optimizers and how likely we should expect them to be. However, I think it is not a direct argument against inner optimizers. (1) The emergence of an inner optimizer is exactly the sort of situation where the gradients end up all feeding through one coherent structure. Other potential neural structures cannot compete with the sub-agent, because it has started to intelligently optimize; few interlopers can take advantage of the benefits of the inner optimizer's strategy, because they don't know enough to do so. So, all gradients point to continuing the improvement of the inner optimizer rather than alternate more-myopic strategies. (2) Being an inner optimizer is not synonymous with non-myopic behavior. An inner optimizer could give myopic responses on the training set while internally having less-myopic values. Or, an inner optimizer could have myopic but very divergent values. Importantly, an inner optimizer need not take advantage of any data-manipulation of the training set like that I've described; it need not even have access to any such opportunities.

The Partial Agency Paradox

I've given a couple of examples. I want to quickly give some more to flesh out the clusters as I see them:

As I said, myopia is "partial agency" whereas foresight is "full agency". Think of how an agent with high time-preference (ie steep temporal discounting) can be money-pumped by an agent with low time-preference. But the limit of no-temporal-discounting-at-all is not always well-defined.
An updateful agent is "partial agency" whereas updatelessness is "full agency": the updateful agent is failing to use some channels of influence to get what it wants, because it already knows those things and can't imagine them going differently. Again, though, full agency seems to be an idealization we can't quite reach: we don't know how to think about updatelessness in the context of logical uncertainty, only more- or less-updateful strategies.
I gave the beliefs $\leftarrow$ territory example. We can also think about the values $\to$ territory case: when the world differs from our preferences, we change the world, not our preferences. This has to do with avoiding wireheading.
Similarly, we can think of examples of corrigibility -- such as respecting an off button, or avoiding manipulating the humans -- as partial agency.
Causal decision theory is more "partial" and evidential decision theory is less so: EDT wants to recognize more things as legitimate channels of influence, while CDT claims they're not. Keep in mind that the math of causal intervention is closely related to the math which tells us about whether an agent wants to manipulate a certain variable -- so there's a close relationship between CDT-vs-EDT and wireheading/corrigibility.

I think people often take a pro- or anti- partial agency position: if you are trying to one-box in Newcomblike problems, trying to cooperate in prisoner's dilemma, trying to define logical updatelessness, trying for superrationality in arbitrary games, etc... you are generally trying to remove barriers to full agency. On the other hand, if you're trying to avert instrumental incentives, make sure an agent allows you to change its values, or doesn't prevent you from pressing an off button, or doesn't manipulate human values, etc... you're generally trying to add barriers to full agency.

I've historically been more interested in dropping barriers to full agency. I think this is partially because I tend to assume that full agency is what to expect in the long run, IE, "all agents want to be full agents" -- evolutionarily, philosophically, etc. Full agency should result from instrumental convergence. Attempts to engineer partial agency for specific purposes feel like fighting against this immense pressure toward full agency; I tend to assume they'll fail. As a result, I tend to think about AI alignment research as (1) needing to understand full agency much better, (2) needing to mainly think in terms of aligning full agency, rather than averting risks through partial agency.

However, in contrast to this historical view of mine, I want to make a few observations:

Partial agency sometimes seems like exactly what we want, as in the case of map $\leftarrow$ territory optimization, rather than a crude hack which artificially limits things.
Indeed, partial agency of this kind seems fundamental to full agency.
Partial agency seems ubiquitous in nature. Why should I treat full agency as the default?

So, let's set aside pro/con positions for a while. What I'm interested in at the moment is the descriptive study of partial agency as a phenomenon. I think this is an organizing phenomenon behind a lot of stuff I think about.

The partial agency paradox is: why do we see partial agency naturally arising in certain contexts? Why are agents (so often) myopic? Why have a notion of "truth" which is about map $\leftarrow$ territory fit but not the other way around? Partial agency is a weird thing. I understand what it means to optimize something. I understand how a selection process can arise in the world (evolution, markets, machine learning, etc), which drives things toward maximization of some function. Partial optimization is a comparatively weird thing. Even if we can set up a "partial selection process" which incentivises maximization through only some channels, wouldn't it be blind to the side-channels, and so unable to enforce partiality in the long-term? Can't someone always come along and do better via full agency, no matter how our incentives are set up?

Of course, I've already said enough to suggest a resolution to this puzzle.

My tentative resolution to the paradox is: you don't build "partial optimizers" by taking a full optimizer and trying to add carefully balanced incentives to create indifference about optimizing through a specific channel, or anything like that. (Indifference at the level of the selection process does not lead to indifference at the level of the agents evolved by that selection process.) Rather, partial agency is what selection processes incentivize by default. If there's a learning-theoretic setup which incentivizes the development of "full agency" (whatever that even means, really!) I don't know what it is yet.

Why?

Learning is basically episodic. In order to learn, you (sort of) need to do the same thing over and over, and get feedback. Reinforcement learning tends to assume ergodic environments so that, no matter how badly the agent messes up, it eventually re-enters the same state so it can try again -- this is a "soft" episode boundary. Similarly, RL tends to require temporal discounting -- this also creates a soft episode boundary, because things far enough in the future matter so little that they can be thought of as "a different episode".
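To see how discounting acts as a soft episode boundary, here is a small Python sketch. The function names and the 1% tolerance are assumptions made for illustration. It uses the fact that, for a discount factor $\gamma$ in $[0, 1)$, the share of total discounted weight on steps at or beyond $H$ is $\gamma^{H}$.

```python
def tail_weight(gamma, horizon):
    """Fraction of total discounted weight on steps at or beyond `horizon`."""
    # sum_{t >= H} gamma^t / sum_{t >= 0} gamma^t = gamma^H for 0 <= gamma < 1
    return gamma ** horizon

def soft_horizon(gamma, tolerance=0.01):
    """Smallest H whose tail carries less than `tolerance` of the total weight."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    horizon = 0
    while tail_weight(gamma, horizon) >= tolerance:
        horizon += 1
    return horizon

for gamma in (0.5, 0.9, 0.99):
    print(gamma, soft_horizon(gamma))
```

For example, with $\gamma = 0.9$ and a 1% tolerance, everything from step 44 onward carries less than 1% of the total weight, so it behaves much like "a different episode". The horizon grows as $\gamma$ approaches 1, but stays finite for any fixed $\gamma < 1$.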

So, like map $\leftarrow$ territory learning (that is, epistemic learning), we can kind of expect any type of learning to be myopic to some extent.

This fits the picture where full agency is an idealization which doesn't really make sense on close examination, and partial agency is the more real phenomenon. However, this is absolutely not a conjecture on my part that all learning algorithms produce partial agents of some kind rather than full agents. There may still be frameworks which allow us to approach full agency in the limit, such as taking the limit of diminishing discount factors, or considering asymptotic behavior of agents who are able to make precommitments. We may be able to achieve some aspects of full agency, such as superrationality in games, without others.

Again, though, my interest here is more to understand what's going on. The point is that it's actually really easy to set up incentives for partial agency, and not so easy to set up incentives for full agency. So it makes sense that the world is full of partial agency.

Some questions:

To what extent is it really true that settings such as supervised learning disincentivize strategic manipulation of the data? Can my argument be formalized?
If thinking about "optimizing a function" is too coarse-grained (a supervised learner doesn't exactly minimize prediction error, for example), what's the best way to revise our concepts so that partial agency becomes obvious rather than counterintuitive?
Are there better ways of characterizing the partiality of partial agents? Does myopia cover all cases (so that we can understand things in terms of time-preference), or do we need the more structured stop-gradient formulation in general? Or perhaps a more causal-diagram-ish notion, as my "directionality" intuition suggests? Do the different ways of viewing things have nice relationships to each other?
Should we view partial agents as multiagent systems? I've characterized it in terms of something resembling game-theoretic equilibrium. The 'partial' optimization of a function arises from the price of anarchy, or as it's known around LessWrong, Moloch. Are partial agents really bags of full agents keeping each other down? This seems a little true, to me, but also doesn't strike me as the most useful way of thinking about partial agents. For one thing, it takes full agents as a necessary concept to build up partial agents, which seems wrong to me.
What's the relationship between the selection process (learning process, market, ...) and the type of partial agents incentivised by it? If we think in terms of myopia: given a type of myopia, can we design a training procedure which tracks or doesn't track the relevant strategic influences? If we think in terms of stop-gradients: we can take "stop-gradient" literally and stop there, but I suspect there is more to be said about designing training procedures which disincentivize the strategic use of specified paths of influence. If we think in terms of directionality: how do we get from the abstract "change the map to match the territory" to the concrete details of supervised learning?
What does partial agency say about inner optimizers, if anything?
What does partial agency say about corrigibility? My hope is that there's a version of corrigibility which is a perfect fit in the same way that map $\leftarrow$ territory optimization seems like a perfect fit.

Ultimately, the concept of "partial agency" is probably confused. The partial/full clustering is very crude. For example, it doesn't make sense to think of a non-wireheading agent as "partial" because of its refusal to wirehead. And it might be odd to consider a myopic agent as "partial" -- it's just a time-preference, nothing special. However, I do think I'm pointing at a phenomenon here, which I'd like to understand better.

But how can we optimize it only “in one direction”?

I'm not sure that SL does optimize only "in one direction". It's true that if you use gradient descent the model won't try to manipulate future questions/answers, but you could end up with a model that manipulates the current answer or loss. For example the training process could produce a mesa-optimizer with a utility function (over the real world) that assigns high utility to worlds where "loss" is minimized, where "loss" is defined as the value of the RAM location that stores the computed loss, or as the difference between its output and the value of the RAM location that stores the training label. This utility function would cause it to output good answers on a very diverse set of questions. But when it builds a sufficiently good world model, the mesa-optimizer could output a string that triggers a flaw in the code path (or hardware) that processes its output, thereby taking over the computer and overwriting the "loss" value or the training label (depending on the specific utility function that it ended up with).

So it seems like SL produces myopia, but not necessarily "in one direction" except that it's usually easier for the model to minimize loss by changing the output than the training label. But at some point if the training process produces a mesa-optimizer and the mesa-optimizer gets sufficiently capable, and there are inherent limits to how far loss can be minimized by just changing its output, it could start changing the training label or the "loss".

(I first saw Alex Turner (TurnTrout) express this concern in the context of Counterfactual Oracles, which I then elaborated here.)

I agree that this is possible, but I would be very surprised if a mesa-optimizer actually did something like this. By default, I expect mesa-optimizers to use proxy objectives that are simple, fast, and easy to specify in terms of their input data (e.g. pain) not those that require extremely complex world models to even be able to specify (e.g. spread of DNA). In the context of supervised learning, having an objective that explicitly cares about the value of RAM that stores its loss seems very similar to explicitly caring about the spread of DNA in that it requires a complex model of the computer the mesa-optimizer is running and is quite complex and difficult to reason about. This is why I'm not very worried about reward-tampering: I think proxy-aligned mesa-optimizers basically never tamper with their rewards (though deceptively aligned mesa-optimizers might, but that's a separate problem).

By default, I expect mesa-optimizers to use proxy objectives that are simple, fast, and easy to specify in terms of their input data (e.g. pain) not those that require extremely complex world models to even be able to specify (e.g. spread of DNA).

But those simpler proxy objectives wouldn't let the model do as well as caring about the "loss" RAM location (if the training data is diverse enough), so if you kept training the model and you had sufficient compute wouldn't you eventually produce a model that used the latter kind of proxy objective?

It seems quite plausible that something else would happen first though, like a deceptively aligned mesa-optimizer is produced. Is that what you'd expect? Also, I'm wondering what an actually aligned mesa-optimizer looks like in the case of using SL to train a general purpose question-answerer, and would be interested in your thoughts on that if you have any. (For example, is it a utility maximizer, and if so what does its utility function look like?)

My intuition here is that you're likely right, but, I do want to understand what Wei is pointing out as part of a full understanding of partial agency.

I agree that this is possible, but I would be very surprised if a mesa-optimizer actually did something like this. By default, I expect mesa-optimizers to use proxy objectives that are simple, fast, and easy to specify

Let me tell a story for why I'm thinking this type of mesa-optimizer misalignment is realistic or even likely for the advanced AIs of the future. The starting point is that the advanced AI is a learning system that continually constructs a better and better world-model over time.

Imagine that the mesa-optimizer actually starts out properly inner-aligned, i.e. the AI puts a flag on "Concept X" in its world-model as its goal, and Concept X really does correspond to our intended supervisory signal of "Accurate answers to our questions". Now over time, as the AI learns more and more, it (by default) comes to have beliefs about itself and its own processing, and eventually develops an awareness of the existence of a RAM location storing the supervisory answer as Wei Dai was saying. Now there's a new "Concept Y" in its world-model, corresponding to its belief about what is in that RAM location.

Now, again, assume the AI is set up to build a better and better world-model by noticing patterns. So, by default, it will eventually notice that Concept X and Concept Y always have the same value, and it will then add into the world-model some kind of relationship between X and Y. What happens next probably depends on implementation details, but I think it's at least possible that the "goal-ness" flag that was previously only attached to X in the world-model, will now partly attach itself to Y, or even entirely transfer from X to Y. If that happens, the AI's mesa-goal has now shifted from aligned ("accurate answers") to misaligned ("get certain bits into RAM").

(This is kinda related to ontological crises.) (I also agree with Wei's comment but the difference is that I'm assuming that there's a training phase with a supervisory signal, then a deployment phase with no supervisory signal, and I'm saying that the mesa-optimizer can go from aligned to misaligned during the deployment phase even in that case. If the training signal is there forever, that's even worse, because like Wei said, Y would match that signal better than X (because of labeling errors) so I would certainly expect Y to get flagged as the goal in that case.)

Yep, I totally agree; I was thinking about this but didn't include it in the post. So the different notions actually aren't equivalent; myopia may be a generally weaker condition.

(I first saw Alex Turner (TurnTrout) express this concern

This link is broken now but I think I found an updated one that works:

https://www.lesswrong.com/posts/yAiqLmLFxvyANSfs2/counterfactual-oracles-online-supervised-learning-with?commentId=FPcEqFisRsfihnLcX

I really like this post. I'm very excited about understanding more about this as I said in my mechanistic corrigibility post (which as you mention is very related to the full/partial agency distinction).

we can kind of expect any type of learning to be myopic to some extent

I'm pretty uncertain about this. Certainly to the extent that full agency is impossible (due to computational/informational constraints, for example), I agree with this. But I think a critical point which is missing here is that full agency can still exhibit pseudo-myopic behavior (and thus get selected for) if it uses an objective that is discounted over time or if it is deceptive. Thus, I don't think that having some sort of soft episode boundary is enough to rule out full-ish agency.

Furthermore, it seems to me like it's quite plausible that for many learning setups models implementing algorithms closer to full agency will be simpler than models implementing algorithms closer to partial agency. As you note, partial agency is a pretty weird thing to do from a mathematical standpoint, so it seems like many learning processes might penalize it pretty heavily for that. At the very least, if you count Solomonoff Induction as a learning process, it seems like you should probably expect something a lot closer to full agency there.

That being said, I definitely agree that the fact that epistemic learning seems to just do this by default seems pretty promising for figuring out how to get myopia, so I'm definitely pretty excited about that.

RL tends to require temporal discounting -- this also creates a soft episode boundary, because things far enough in the future matter so little that they can be thought of as "a different episode".

This is just a side note, but RL also tends to have hard episode boundaries if you are regularly resetting the state of the environment as is common in many RL setups.

Thanks, I appreciate your enthusiasm! I'm still not sure how much sense all of this makes.

I agree with your simplicity point, but it may be possible to ignore this by talking about what's learned in the limit. If strategic manipulation is disincentivized, then strategic manipulators will eventually lose. We might still expect strategic manipulators in practice, because they might be significantly simpler. But a theory of partial agency can examine the limiting behavior separately from the prevalence of manipulators in the prior.

I agree with your other points.

(I didn't understand this post, this comment is me trying to make sense of it. After writing the comment, I think I understand the post more, and the comment is effectively an answer to it.)

Here's a picture for full agency, where we want a Cartesian agent that optimizes some utility function over the course of the history of the universe. We're going to create an algorithm "outside" the universe, but by design we only care about performance during the "actual" time in the universe.

Idealized, Outside-the-universe Algorithm: Simulate the entire history of the universe, from beginning to end, letting the agent take actions along the way. Compute the reward at the end, and use that to improve the agent so it does better. (Don't use a discount factor; if you have a pure rate of time preference, that should be part of the utility function.) Repeat until the agent is optimal. (Ignore difficulties with optimization.)

Such an agent will exhibit all the aspects of full agency within the universe. If during the universe-history, something within the universe starts to predict it well, then it will behave as within-universe FDT would predict. The agent will not be myopic: it will be optimizing over the entire universe-history. If it ends up in a game with some within-universe agent, it will not use the Nash equilibrium, it will correctly reason about the other agent's beliefs about it, and exploit those as best it can.

Now obviously, this is a Cartesian agent, not an embedded one, and the learning procedure takes place "outside the universe", and any other agents are required to be "in the universe", and this is why everything is straightforward. But it does seem like this gives us full agency.

When both your agent and learning process have to be embedded within the environment, you can't have this simple story any more. There isn't a True Embedded Learning Process to find; at the very minimum any such process could be diagonalized against by the environment. Any embedded learning process must be "misspecified" in some way, relative to the idealized learning process above, if you are evaluating on the metric "is the utility function on universe-histories maximized". (This is part of my intuition against realism about rationality.) These misspecifications lead to "partial agency".

To add more gears to this: learning algorithms work by generating/collecting data points, and then training an agent on that data, under the assumption that each data point is an iid sample. Since the data points cannot be full universe-histories, they will necessarily leave out some aspects of reality that the Idealized Outside-the-Universe Algorithm could capture. Examples:

In supervised learning, each data point is a single pair $(x, y)$ . The iid assumption means that the algorithm cannot model the fact that $y_{1}$ could influence the pair $(x_{2}, y_{2})$ , and so the gradients don't incentivize using that influence.
In RL, each data point is a small fragment of a universe-history (i.e. an episode). The iid assumption means that the algorithm cannot model the fact that changes to the first fragment can affect future fragments, which leads to myopia.
In closed-source games, each data point is a transcript of what happened in a particular instance of the game. The iid assumption means that the algorithm cannot model the opponent changing its policy, and so treats it as a one-shot game instead of an iterated game. (What exactly happens depends a lot on the specific setup.)

So my position is "partial agency arises because any embedded learning algorithm will necessarily leave out aspects that the idealized learning algorithm can identify". And as a subclaim, that this often happens because of the effective iid assumption between data points in a learning algorithm.

The reality --> beliefs optimization seems like a different thing: bidirectional optimization of that would correspond to minimizing the delta between beliefs and reality. No one actually wants to literally minimize that; having accurate beliefs is an instrumental goal for some other goal, not a terminal one.

That said, I'm not optimistic about creating incentives for particular kinds of partial agency: as soon as the model is able to reason, it can do all the same reasoning I did, and if it is actually trying to maximize some simple function of universe-histories, then it should move towards full agency upon doing this reasoning.

I wrote up a long reply to this and then accidentally lost it :(

Let me first say that I definitely sympathize with skepticism/confusion about this whole line of thinking.

I roughly agree with your picture of what's going on with "full agency" -- it's best thought of as fully Cartesian idealized UDT, "learning" by searching for the best policy.

Initially I was on-board with your connection to iid, but now I think it's a red herring.

I illustrated my idea with an iid example, but I can make a similar argument for algorithms which explicitly discard iid, such as Solomonoff induction. Solomonoff induction still won't systematically learn to produce answers which manipulate the data. This is because SI's judgement of the quality of a hypothesis doesn't pay any attention to how dominant the hypothesis was during a given prediction -- completely unlike RL, where you need to pay attention to what action you actually took. So if the current-most-probable hypothesis is a manipulator, throwing around its weight to make things easy to predict, and a small-probability hypothesis is "parasitically" taking advantage of the ease-of-prediction without paying the cost of implementing the manipulative strategy, the parasite will continue rising in probability until the manipulative strategy doesn't have enough weight to shift the output probabilities the way it needs to to implement the manipulative strategy.
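A toy version of this argument, as a sketch: the code below uses a two-hypothesis Bayesian mixture rather than actual Solomonoff induction, and every number in it (the likelihoods, the control threshold, the starting weights) is an assumption chosen for illustration. The manipulator takes a small hit on $x_{1}$; the parasite predicts $x_{1}$ honestly and enjoys the same easy $y_{2}$ whenever the manipulation works.

```python
import math

# Per-instance likelihoods; all numbers are illustrative assumptions.
HONEST_X1 = 0.9          # parasite's probability on the true y1
MANIP_X1 = 0.8           # manipulator takes a small hit on x1 to steer y2
EASY_X2 = 0.95           # either hypothesis on y2 when manipulation worked
HARD_X2 = 0.6            # either hypothesis on y2 when it did not
CONTROL_THRESHOLD = 0.5  # weight the manipulator needs to steer the output

def run(rounds=200, w_manip=0.99):
    w_par = 1.0 - w_manip
    history = []
    for _ in range(rounds):
        manipulation_works = w_manip > CONTROL_THRESHOLD
        x2 = EASY_X2 if manipulation_works else HARD_X2
        # Log-loss of the mixture on this round's two instances.
        p1 = w_manip * MANIP_X1 + w_par * HONEST_X1
        loss = -math.log(p1) - math.log(x2)
        history.append((w_manip, manipulation_works, loss))
        # Bayes update: each weight is scaled by that hypothesis's own
        # likelihood, regardless of how much it shaped the mixture's output.
        new_manip = w_manip * MANIP_X1 * x2
        new_par = w_par * HONEST_X1 * x2
        total = new_manip + new_par
        w_manip, w_par = new_manip / total, new_par / total
    return w_manip, history

final_w_manip, history = run()
first_failure = next(i for i, (_, works, _) in enumerate(history) if not works)
print(final_w_manip, first_failure, history[0][2], history[-1][2])
```

In this sketch the posterior odds of manipulator to parasite shrink by a factor of 0.8/0.9 every round, whether or not the manipulation worked, because both hypotheses get the same credit on $x_{2}$. Once the manipulator's weight falls below the threshold, the mixture's per-round loss gets worse, yet nothing in the update pushes back toward manipulation.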

So, actually, iid isn't what's going on at all, although iid cases do seem like particularly clear illustrations. This further convinces me that there's an interesting phenomenon to formalize here.

The reality --> beliefs optimization seems like a different thing: bidirectional optimization of that would correspond to minimizing the delta between beliefs and reality. No one actually wants to literally minimize that; having accurate beliefs is an instrumental goal for some other goal, not a terminal one.

I'm not sure what you're saying here. I agree that "no one wants that". That's what I meant when I said that partial agency seems to be a necessary subcomponent of full agency -- even idealized full agents need to implement partial-agency optimizations for certain sub-processes, at least in the one case of reality->belief optimization. (Although, perhaps this is not true, since we should think of full agency as UDT which doesn't update at all... maybe it is more accurate to say that full-er agents often want to use partial-er optimizations for sub-processes.)

So I don't know what you mean when you say it seems like a different thing. I agree with Wei's point that myopia isn't fully sufficient to get reality->belief directionality; but, at least, it gets a whole lot of it, and reality->belief directionality implies myopia.

That said, I'm not optimistic about creating incentives for particular kinds of partial agency: as soon as the model is able to reason, it can do all the same reasoning I did, and if it is actually trying to maximize some simple function of universe-histories, then it should move towards full agency upon doing this reasoning.

I'm not sure what you mean here, so let me give another example and see what you think.

Evolution incentivises a form of partial agency because it incentivizes comparative reproductive advantage, rather than absolute. A gene that reduces the reproductive rate of other organisms is as incentivized as one which increases that of its own. This leads to evolving-to-extinction and other less extreme inefficiencies -- this is just usually not that bad because it is difficult for a gene to reduce the fitness of organisms it isn't in, and methods of doing so usually have countermeasures. As a result, we can't exactly think of evolution as optimizing something. It's myopic in the sense that it prefers genes which are point-improvements for their carriers even at a cost to global fitness; it's stop-gradient-y in that it optimizes with respect to the relatively fixed population which exists during an organism's lifetime, ignoring the fact that increasing the frequency of a gene changes that population (and so creating the maximum-of-a-fixed-point-of-our-maximum effect for evolutionarily stable equilibria).
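
A minimal replicator-dynamics sketch of the relative-versus-absolute point. The fitness functions and numbers are assumptions made up for illustration: a "selfish" variant always reproduces 20% better than the alternative, while its frequency `x` depresses everyone's absolute fitness through a shared factor `1 - 0.8 * x`.

```python
def fitness(x):
    """Absolute fitness of (selfish, other) when the selfish frequency is x."""
    shared = 1.0 - 0.8 * x  # damage the selfish variant does to everyone
    return 1.2 * shared, 1.0 * shared


def mean_fitness(x):
    f_selfish, f_other = fitness(x)
    return x * f_selfish + (1.0 - x) * f_other


def step(x):
    """One generation of discrete replicator dynamics."""
    f_selfish, _ = fitness(x)
    # Selection sees only fitness relative to the current population mean.
    return x * f_selfish / mean_fitness(x)


def simulate(x0=0.01, generations=100):
    """Return a list of (selfish frequency, mean absolute fitness) pairs."""
    trajectory = []
    x = x0
    for _ in range(generations):
        trajectory.append((x, mean_fitness(x)))
        x = step(x)
    return trajectory
```

Under these assumptions the selfish variant's frequency rises every generation while mean absolute fitness falls below 1, a crude picture of a population evolving toward extinction. The update in `step` holds the current population fixed, which is the stop-gradient-like feature described above.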

So, understanding partial agency better could help us think about what kind of agents are incentivized by evolution.

It's true that a very intelligent organism such as humans can come along and change the rules of the game, but I'm not claiming that incentivising partial agency gets rid of inner alignment problems. I'm only claiming that **if the rules of the game remain intact** we can incentivise partial agency.

Sorry for the very late reply, I've been busy :/

To be clear, I don't think iid explains it in all cases, I also think iid is just a particularly clean example. Hence why I said (emphasis added now):

So my position is "partial agency arises because any embedded learning algorithm will necessarily leave out aspects that the idealized learning algorithm can identify". And as a subclaim, that this often happens because of the effective iid assumption between data points in a learning algorithm.

Re:

I'm not sure what you're saying here. I agree that "no one wants that".

My point is that the relevant distinction in that case seems to be "instrumental goal" vs. "terminal goal", rather than "full agency" vs. "partial agency". In other words, I expect that a map that split things up based on instrumental vs. terminal would do a better job of understanding the territory than one that used full vs. partial agency.

Re: evolution example, I agree that particular learning algorithms can be designed such that they incentivize partial agency. I think my intuition is that all of the particular kinds of partial agency we could incentivize would be too much of a handicap on powerful AI systems (or won't work at all, e.g. if the way to get powerful AI systems is via mesa optimization).

I'm only claiming that **if the rules of the game remain intact** we can incentivise partial agency.

Definitely agree with that.

My point is that the relevant distinction in that case seems to be "instrumental goal" vs. "terminal goal", rather than "full agency" vs. "partial agency". In other words, I expect that a map that split things up based on instrumental vs. terminal would do a better job of understanding the territory than one that used full vs. partial agency.

Ah, I see. I definitely don't disagree that epistemics is instrumental. (Maybe we have some terminal drive for it, but, let's set that aside.) BUT:

I don't think we can account for what's going on here just by pointing that out. Yes, the fact that it's instrumental means that we cut it off when it "goes too far", and there's not a nice encapsulation of what "goes too far" means. However, I think even when we set that aside there's still an alter-the-map-to-fit-the-territory-not-the-other-way-around phenomenon. IE, yes, it's a subgoal, but how can we understand the subgoal? Is it best understood as optimization, or something else?
When designing machine learning algorithms, this is essentially built in as a terminal goal; the training procedure incentivises predicting the data, not manipulating it. Or, if it does indeed incentivize manipulation of the data, we would like to understand that better; and we'd like to be able to design things which don't have that incentive structure.

To be clear, I don't think iid explains it in all cases, I also think iid is just a particularly clean example.

Ah, sorry for misinterpreting you.

[EDIT: 2019-11-09: The argument I made here seems incorrect; see here (H/T Abram for showing me that my reasoning on this was wrong).]

If there's a learning-theoretic setup which incentivizes the development of "full agency" (whatever that even means, really!) I don't know what it is yet.

Consider evolutionary algorithms. It seems that (theoretically) they tend to yield non-myopic models by default given sufficiently long runtime. For example, a network parameter value that causes behavior that minimizes loss in future training inferences might be more likely to end up in the final model than one that causes behavior that minimizes loss in the current inference at a great cost for the loss in future ones.

That's an interesting point, but I'm very skeptical that this effect is great enough to really hold in the long run. Toward the end, the parameter value could mutate away. And how would you apply evolutionary algorithms to really non-myopic settings, like reinforcement learning where you can't create any good episode boundaries (for example, you have a robot interacting with an environment "live", no resets, and you want to learn on-line)?

Toward the end, the parameter value could mutate away.

I agree that it's possible to get myopic models in the population after arbitrarily long runtime due to mutations. It seems less likely the more bits that need to change (in any model in the current population) to get a model that is completely myopic.

From a safety perspective, if the prospect of some learning algorithm yielding a non-myopic model is concerning, the prospect of it creating non-myopic models along the way is plausibly also concerning.

And how would you apply evolutionary algorithms to really non-myopic settings, like reinforcement learning where you can't create any good episode boundaries (for example, you have a robot interacting with an environment "live", no resets, and you want to learn on-line)?

In this example, if we train an environment model on the data collected so far (ignoring the "you want to learn on-line" part), evolutionary algorithms might be an alternative to regular deep learning. More realistically, some actors would probably invest a lot of resources in developing top predictive models for stock prices etc., and evolutionary algorithms might be one of the approaches being experimented with.

Also, people might experiment with evolutionary algorithms as an alternative to RL, for environments that can be simulated, as OpenAI did; they wrote (2017): "Our work suggests that neuroevolution approaches can be competitive with reinforcement learning methods on modern agent-environment benchmarks, while offering significant benefits related to code complexity and ease of scaling to large-scale distributed settings.".

I agree that it's possible to get myopic models in the population after arbitrarily long runtime due to mutations. It seems less likely the more bits that need to change (in any model in the current population) to get a model that is completely myopic.

Yeah, I agree there's something to think about here. The reason I responded as I did was because it seems more tractable to think about asymptotic results, ie, what happens if you run an algorithm to convergence. But ultimately we also need to think about what an algorithm does with realistic resources.

From a safety perspective, if the prospect of some learning algorithm yielding a non-myopic model is concerning, the prospect of it creating non-myopic models along the way is plausibly also concerning.

I'm not assuming partial agency is uniformly good or bad from a safety perspective; in some cases one, in some cases the other. If I want an algorithm to reliably produce full agency, I can't use algorithms which produce some full-agency effects in the short term but which have a tendency to wash them out in the long term, unless I have a very clear understanding of when I'm in the short term.

In this example, if we train an environment model on the data collected so far (ignoring the "you want to learn on-line" part),

Ignoring the on-line part ignores too much, I think.

Also, people might experiment with evolutionary algorithms as an alternative to RL, for environments that can be simulated,

Similarly, assuming environments that can be simulated seems to assume too much. A simulatable environment gives us the ability to perfectly optimize from outside of that environment, allowing full optimization and hence full agency. So it's not solving the problem I was talking about.

I think this is an example of selection/control confusion. Evolution is a selection algorithm, and can only be applied indirectly to control problems.

I read partial agency and myopia as a specific way in which the boundedness of embedded processes manifests. So it seems to me unsurprising both that it exists and that there is an idealized "unbounded" form to which the bounded form may aspire but which it cannot achieve, because it is bounded and instantiated out of physical stuff rather than mathematics.

I realize there are a lot more details to the specific case you're considering, but I wonder if you'd agree it's part of this larger, general pattern of real things being limited by embeddedness in ways that make them less than their theoretical (albeit unachievable) ideal.

But how can we optimize it only “in one direction”?

(I first saw Alex Turner (TurnTrout) express this concern in the context of Counterfactual Oracles, which I then elaborated here.)

By default, I expect mesa-optimizers to use proxy objectives that are simple, fast, and easy to specify in terms of their input data (e.g. pain) not those that require extremely complex world models to even be able to specify (e.g. spread of DNA).

My intuition here is that you're likely right, but, I do want to understand what Wei is pointing out as part of a full understanding of partial agency.

I agree that this is possible, but I would be very surprised if a mesa-optimizer actually did something like this. By default, I expect mesa-optimizers to use proxy objectives that are simple, fast, and easy to specify

Yep, I totally agree; I was thinking about this but didn't include it in the post. So the different notions actually aren't equivalent; myopia may be a generally weaker condition.

(I first saw Alex Turner (TurnTrout) express this concern

This link is broken now but I think I found an updated one that works:

https://www.lesswrong.com/posts/yAiqLmLFxvyANSfs2/counterfactual-oracles-online-supervised-learning-with?commentId=FPcEqFisRsfihnLcX

we can kind of expect any type of learning to be myopic to some extent

RL tends to require temporal discounting -- this also creates a soft episode boundary, because things far enough in the future matter so little that they can be thought of as "a different episode".

This is just a side note, but RL also tends to have hard episode boundaries if you are regularly resetting the state of the environment as is common in many RL setups.
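
To put a rough number on the "soft episode boundary": with discount factor `gamma`, the share of total discounted weight that falls on step `k` and later is `gamma ** k`. The sketch below checks this by direct summation; the function names and the truncation length `horizon` are illustrative choices, not part of any particular RL library.

```python
def tail_share(gamma, k, horizon=100_000):
    """Share of discounted weight on steps >= k, by truncated summation."""
    weights = [gamma ** t for t in range(horizon)]
    return sum(weights[k:]) / sum(weights)


def effective_horizon(gamma):
    """Common rule of thumb: the total discounted weight, 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)
```

For example, `tail_share(0.99, 500)` agrees with `0.99 ** 500` up to truncation error, so rewards several effective horizons away contribute very little to the objective even without any reset.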

Thanks, I appreciate your enthusiasm! I'm still not sure how much sense all of this makes.

I agree with your other points.

(I didn't understand this post, this comment is me trying to make sense of it. After writing the comment, I think I understand the post more, and the comment is effectively an answer to it.)

In supervised learning, each data point is a single pair $(x, y)$ . The iid assumption means that the algorithm cannot model the fact that $y_{1}$ could influence the pair $(x_{2}, y_{2})$ , and so the gradients don't incentivize using that influence.
In RL, each data point is a small fragment of a universe-history (i.e. an episode). The iid assumption means that the algorithm cannot model the fact that changes to the first fragment can affect future fragments, which leads to myopia.
In closed-source games, each data point is a transcript of what happened in a particular instance of the game. The iid assumption means that the algorithm cannot model the opponent changing its policy, and so treats it as a one-shot game instead of an iterated game. (What exactly happens depends a lot on the specific setup.)
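
The supervised-learning case can be sketched with a one-parameter toy. All specifics are assumptions for illustration: the predictor outputs the constant `theta` on both data points, the first target is 1.0, and the second target is `0.5 * theta` because the first prediction influences it. The "iid" gradient treats both targets as fixed data; the "full" gradient also differentiates through the second target.

```python
def iid_gradient(theta):
    """Gradient of the squared errors with both targets held fixed."""
    t1, t2 = 1.0, 0.5 * theta  # t2 is computed, then treated as plain data
    return 2.0 * (theta - t1) + 2.0 * (theta - t2)


def full_gradient(theta):
    """Gradient of the same loss, differentiating through t2 = 0.5 * theta."""
    # loss = (theta - 1)^2 + (theta - 0.5 * theta)^2
    #      = (theta - 1)^2 + 0.25 * theta^2
    return 2.0 * (theta - 1.0) + 0.5 * theta


def total_loss(theta):
    return (theta - 1.0) ** 2 + (theta - 0.5 * theta) ** 2


def descend(gradient, theta=0.0, lr=0.1, steps=500):
    """Plain gradient descent from a fixed starting point."""
    for _ in range(steps):
        theta -= lr * gradient(theta)
    return theta
```

Under these assumptions `descend(iid_gradient)` settles near `theta = 2/3`, where the prediction is as good as it can be given the data it happened to induce, while `descend(full_gradient)` settles near `theta = 0.8`, where total loss is lowest because the influence on the second target is also exploited.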

I wrote up a long reply to this and then accidentally lost it :(

Let me first say that I definitely sympathize with skepticism/confusion about this whole line of thinking.

I roughly agree with your picture of what's going on with "full agency" -- it's best thought of as fully cartesian idealized UDT, "learning" by searching for the best policy.

Initially I was on-board with your connection to iid, but now I think it's a red herring.
