IID vs Myopia
In a comment to Partial Agency, Rohin summarized his understanding of the post. He used the iid assumption as a critical part of his story. Initially, I thought that this was a good description of what was going on; but I soon realized that iid isn't myopia at all (and commented as such). This post expands on the thought.
My original post conflated episodic (which is basically 'iid') with myopic.
In an episodic setting, it makes sense to be myopic about anything beyond the current episode. There's no benefit to cross-episode strategies, so, no need to learn them.
This is true at several levels (which I mention in the hopes of avoiding later confusion):
- In designing an ML algorithm, if we are assuming an episodic structure, it makes sense to use a learning algorithm which is designed to be myopic.
- A learning algorithm in an episodic setting has no incentive to find non-myopic solutions (even if it can).
However, it is also possible to consider myopia in the absence of episodic structure, and not just as a mistake. We might want an ML algorithm to learn myopic strategies, as is the case with predictive systems. (We don't want them to learn to manipulate the data; and even though that failure mode is far-fetched for most modern systems, there's no point setting up learning procedures which would incentivise it. Indeed, learning procedures seem to mostly encourage myopia, though the full situation is still unclear to me.)
These myopic strategies aren't just "strategies which behave as if there were an episodic assumption", either. For example, sequential prediction is myopic (the goal is to predict each next item accurately, not to get the most accuracy overall -- if this is unclear, hopefully it will become clearer in the next section).
So, there's a distinction between not remembering the past vs not looking ahead to the future. In episodic settings, the relevant parts of past and future are both limited to the duration of the episode. However, the two come apart in general. We can have/want myopic agents with memory; or, we can have/want memoryless agents which are not myopic. (The second seems somewhat more exotic.)
Game-Theoretic Myopia Definition
So far, I've used 'myopia' in more or less two ways: an inclusive notion which encompasses a big cluster of things, and also the specific thing of only optimizing each output to maximize the very next reward. Let's call the more specific thing "absolute" myopia, and try to define the more general thing.
Myopia can't be defined in terms of optimizing an objective in the usual sense -- there isn't one quantity being optimized. However, it seems like most things in my 'myopia' cluster can be described in terms of game theory.
Let's put down some definitions:
Sequential decision scenario: An interactive environment which takes in actions and outputs rewards and observations. I'm not trying to deal with embeddedness issues; this is basically the AIXI setup. (I do think 'reward' is a very restrictive assumption about what kind of feedback the system gets, but talking about other alternatives seems like a distraction from the current post.)
(Generalized) objective: A generalized objective assigns, to each action $n$, a function $f_n(i)$. The quantity $f_n(i)$ is how much the $n$th decision is supposed to value the $i$th reward. Probably, we require the sum $\sum_i f_n(i)$ to exist.
Some examples:
- Absolute myopia. $f_n(i) = 1$ if $i = n$, and $0$ otherwise.
- Back-scratching variant: $f_n(i) = 1$ if $i = n + 1$, and $0$ otherwise.
- Episodic myopia. $f_n(i) = 1$ if $i$ and $n$ are within the same episode; $0$ otherwise.
- Hyperbolic discounting. $f_n(i) = \frac{1}{i - n + 1}$ for $i \geq n$; $0$ otherwise.
- Dynamically consistent version of hyperbolic: $f_n(i) = \frac{1}{i + 1}$.
- Exponential discounting. $f_n(i) = c^i$, typically with $c < 1$.
- 'Self-defeating' functions, such as $f_n(i) = 1$ for $n = i$, $-1$ for $n = i + 1$, and $0$ otherwise.
A generalized objective could be called 'myopic' if it is not dynamically consistent; ie, if there's no way to write $f_n(i)$ as a function of $i$ alone, eliminating the dependence on $n$.
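As a rough illustration, the examples above can be written as Python functions `f(n, i)`, and the dynamic-consistency condition can be spot-checked over a finite window. This is only a sketch: the function names, the window size `horizon`, the episode length of 5, and the choice `c = 0.9` are assumptions made for the example, and a finite check can suggest consistency but cannot prove it.

```python
# Generalized objectives as functions f(n, i): the weight decision n puts on reward i.
# All names and constants here are hypothetical, chosen only for illustration.

def absolute(n, i):
    return 1.0 if i == n else 0.0

def back_scratching(n, i):
    return 1.0 if i == n + 1 else 0.0

def episodic(n, i, episode_len=5):
    # Assumes fixed-length episodes; n and i share an episode if they share a block.
    return 1.0 if n // episode_len == i // episode_len else 0.0

def hyperbolic(n, i):
    return 1.0 / (i - n + 1) if i >= n else 0.0

def consistent_hyperbolic(n, i):
    return 1.0 / (i + 1)

def exponential(n, i, c=0.9):
    return c ** i

def self_defeating(n, i):
    if n == i:
        return 1.0
    if n == i + 1:
        return -1.0
    return 0.0

def depends_only_on_i(f, horizon=20):
    """True if f(n, i) matches f(0, i) for every n, i in the finite window."""
    return all(f(n, i) == f(0, i)
               for n in range(horizon)
               for i in range(horizon))

if __name__ == "__main__":
    for f in (absolute, back_scratching, episodic, hyperbolic,
              consistent_hyperbolic, exponential, self_defeating):
        label = "consistent" if depends_only_on_i(f) else "myopic"
        print(f"{f.__name__}: {label}")
```

Under this check, only the exponential objective and the dynamically consistent hyperbolic objective come out as functions of $i$ alone; the rest count as myopic in the sense just defined.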
This notion of myopia does not seem to include 'directionality' or 'stop-gradients' from my original post. In particular, if we try to model pure prediction, absolute myopia captures the idea that you aren't supposed to have manipulative strategies which lie (throw out some reward for one instance in order to get more overall). However, it does not rule out manipulative strategies which select self-fulfilling prophecies strategically; those achieve high reward on instance $n$ by choice of output $n$, which is what a myopic agent is supposed to do.
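To make the first half of that point concrete, here is a toy two-step sketch. The reward numbers and the names `REWARDS`, `weighted_value`, and `preferred` are invented for illustration; the only claim is that a "lie" which gives up reward now for more reward later is unattractive to decision 0 under absolute myopia, and attractive once later rewards get weight. The sketch does not model self-fulfilling prophecies.

```python
# Hypothetical reward sequences (reward 0, reward 1) resulting from decision 0.
REWARDS = {
    "honest": [1.0, 0.5],  # best immediate reward, worse follow-up
    "lie":    [0.6, 1.0],  # sacrifices reward 0 to gain more at reward 1
}

def absolute(n, i):
    # Absolute myopia: decision n only values reward n.
    return 1.0 if i == n else 0.0

def undiscounted(n, i):
    # A non-myopic comparison point: every reward counts equally.
    return 1.0

def weighted_value(f, n, rewards):
    """Value of a reward sequence from the point of view of decision n."""
    return sum(f(n, i) * r for i, r in enumerate(rewards))

def preferred(f, n=0):
    """Option that decision n prefers under generalized objective f."""
    return max(REWARDS, key=lambda option: weighted_value(f, n, REWARDS[option]))

if __name__ == "__main__":
    print("absolute myopia prefers:", preferred(absolute))
    print("undiscounted sum prefers:", preferred(undiscounted))
```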
There are also non-myopic objectives which we can't represent here but might want to represent more generally: there isn't a single well-defined objective corresponding to 'maximizing average reward' (the limit of exponential discounting as $c \to 1$).
Vanessa recently mentioned using game-theoretic models like this for the purpose of modeling inconsistent human values. I want to emphasize that (1) I don't want to think of myopia as necessarily 'wrong'; it seems like sometimes a myopic objective is a legitimate one, for the purpose of building a system which does something we want (such as make non-manipulative predictions). As such, (2) myopia is not just about bounded rationality.
I also don't necessarily want to think of myopia as multi-agent, even when modeling it with multi-agent game theory like this. I'd rather think about learning one myopic policy, which makes the appropriate (non-)trade-offs based on $f_n$.
In order to think about a system behaving myopically, we need to use an equilibrium notion (such as Nash equilibria or correlated equilibria), not just $f_n$. However, I'm not sure quite how I want to talk about this. We don't want to think in terms of a big equilibrium between each decision-point $n$; I think of that as a selection-vs-control mistake, treating the sequential decision scenario as one big thing to be optimized. Or, putting it another way: the problem is that we have to learn; so we can't talk about everything being in equilibrium from the beginning.
Perhaps we can say that there should be some $N$ such that each decision after $N$ is in approximate equilibrium with the others, taking the decisions before $N$ as given.
(Aside -- What we definitely don't want (if we want to describe or engineer legitimately myopic behavior) is a framework where the different decision-points end up bargaining with each other (acausal trade, or mere causal trade), in order to take pareto improvements and thus move toward full agency. IE, in order to keep our distinctions from falling apart, we can't apply a decision theory which would cooperate in Prisoner's Dilemma or similar things. This could present difficulties.)
Let's move on to a different way of thinking about myopia, through the language of Pareto-optimality.
We can think of myopia as a refusal to take certain Pareto improvements. This fits well with the previous definition; if an agent takes all the Pareto improvements, then its behavior must be consistent with some global weights $f(i)$ which are not a function of $n$. However, not all myopic strategies in the Pareto sense have nice representations in terms of generalized objectives.
In particular: I mentioned that generalized objectives couldn't rule out manipulation through selection of self-fulfilling prophecies; so they only capture part of what seems implied by map/territory directionality. Thinking in terms of Pareto-failures, we can also talk about failing to reap the gains from selection of manipulative self-fulfilling prophecies.
However, thinking in these terms is not very satisfying. It allows a very broad notion of myopia, but has few other virtues. Generalized objectives let me talk about myopic agents trying to do a specific thing, even though the thing they're trying to do isn't a coherent objective. Defining myopia as failure to take certain Pareto improvements doesn't give me any structure like that; a myopic agent is being defined in the negative, rather than described positively.
Here, as before, we also have the problem of defining things learning-theoretically. Speaking purely in terms of whether the agent takes certain Pareto improvements doesn't really make sense, because it has to learn what situation it is in. We want to talk about learning processes, so we need to talk about learning to take the Pareto improvements, somehow.
(Bayesian learning can be described in terms of Pareto optimality directly, because using a prior over possible environments allows Pareto-optimal behavior in terms of those environments. However, working that way requires realizability, which isn't realistic.)
In the original partial agency post, I described full agency as an extreme (perhaps imaginary) limit of less and less myopia. Full agency is like Cartesian dualism, sitting fully outside the universe and optimizing.
Is full agency that difficult? From the generalized-objective formalism, one might think that ordinary RL with exponential discounting is sufficient.
The counterexamples to this are MIRI-esque decision problems, which create dynamic inconsistencies for otherwise non-myopic agents. (See this comment thread with Vanessa for more discussion of several of the points I'm about to make.)
To give a simple example, consider the version of Newcomb's Problem where the predictor knows about as much about your behavior as you do. (The version where the predictor is nearly infallible is easily handled by RL-like learning; you need to specifically inject sophisticated CDT-like thinking to mess that one up.)
In order to have good learning-theoretic properties at all, we need to have epsilon exploration. But if we do, then we tend to learn to 2-box, because (it will seem) doing so is independent of the predictor's predictions of us.
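A small sketch of why apparent independence favors taking both boxes. The payoffs (1,000 in the small box, 1,000,000 in the big one) are the conventional Newcomb figures, assumed here rather than taken from this post, and `p_full` is a hypothetical name for the probability that the big box is full. The key assumption is that, on exploration steps, `p_full` does not vary with the action taken.

```python
SMALL = 1_000.0      # assumed contents of the always-available small box
BIG = 1_000_000.0    # assumed contents of the big box when the predictor fills it

def expected_payoff(action, p_full):
    """Expected payoff when the chance the big box is full ignores the action."""
    base = p_full * BIG
    return base + SMALL if action == "two-box" else base

if __name__ == "__main__":
    for p_full in (0.0, 0.5, 1.0):
        gain = expected_payoff("two-box", p_full) - expected_payoff("one-box", p_full)
        print(f"p_full={p_full}: taking both boxes gains {gain}")
```

Whatever `p_full` is, taking both boxes looks better by exactly the small-box amount, which is the pattern a learner sees when its exploratory actions are uncorrelated with the prediction.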
Now, it's true that in a sequential setting, there will be some incentive to 1-box not for the payoff today, but for the future; establishing a reputation of 1-boxing gets higher payoffs in iterated Newcomb in a straightforward (causal) way.
However, that's not enough to entirely avoid dynamic inconsistency. For any discounting function, we need only to assume that the instances of Newcomb's problem are spaced out far enough over time so that 2-boxing in each individual case is appealing.
Now, one might argue that in this case, the agent is correctly respecting its generalized objective; it's supposed to sacrifice future value for present according to the discounting function. And that's true, if we want myopic behavior. But it is dynamically inconsistent -- the agent wishes to 2-box in each individual case, but with respect to future cases, would prefer to 1-box. It would happily bind its future actions given an opportunity to do so.
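The spacing argument can be made concrete with a small calculation under exponential discounting. This is a stylized model, and everything in it is an assumption for illustration: the conventional Newcomb payoffs, a predictor whose next prediction simply copies the agent's current action (a crude stand-in for 'reputation'), and a comparison that looks only at the next instance, `gap` steps away, ignoring later ones.

```python
SMALL = 1_000.0      # assumed immediate gain from also taking the small box
BIG = 1_000_000.0    # assumed loss at the next instance if the big box is left empty

def two_boxing_is_appealing(c, gap):
    """Compare the gain now with the discounted loss `gap` steps later.

    With exponential discounting, a reward `gap` steps ahead is weighted
    by c ** gap relative to a reward received now.
    """
    return SMALL > (c ** gap) * BIG

def min_gap_for_two_boxing(c, max_gap=100_000):
    """Smallest spacing at which 2-boxing now beats protecting the next instance."""
    for gap in range(1, max_gap + 1):
        if two_boxing_is_appealing(c, gap):
            return gap
    raise ValueError("no spacing up to max_gap makes 2-boxing appealing")

if __name__ == "__main__":
    for c in (0.5, 0.9, 0.99):
        print(f"c={c}: 2-boxing becomes appealing at gap {min_gap_for_two_boxing(c)}")
```

For any `c < 1` such a spacing exists in this model, since `c ** gap` shrinks toward zero; with `c = 0.5`, for instance, a gap of 10 steps already suffices.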
Like the issue with self-fulfilling prophecies, this creates a type of myopia which we can't really talk about within the formalism of generalized objectives. Even with an apparently dynamically consistent discounting function, the agent is inconsistent. As mentioned earlier, we need generalized-objective systems to fail to coordinate with themselves; otherwise, their goals collapse into regular objectives. So this is a type of myopia which all generalized objectives possess.
As before, I'd really prefer to be able to talk about this with specific types of myopia (as with myopic generalized objectives), rather than just pointing to a dynamic inconsistency and classifying it with myopia.
(We might think of the fully non-myopic agent as the limit of less and less discounting, as Vanessa suggests. This has some problems of convergence, but perhaps that's in line with non-myopia being an extreme ideal which doesn't always make sense. Alternately, we might think of this as a problem of decision theory, arguing that we should be able to reap the advantages of 1-boxing despite our values temporally discounting. Or, there might be some other wilder generalization of objective functions which lets us represent the distinctions we care about.)
Mechanism Design Analogy
I'll close this post with a sketchy conjecture.
Although I don't want to think of generalized objectives as truly multi-agent in the one-'agent'-per-decision sense, learning algorithms will typically have a space of possible hypotheses which are (in some sense) competing with each other. We can analogize that to many competing agents (keeping in mind that they may individually be 'partial agents', ie, we can't necessarily model them as coherently pursuing a utility function).
For any particular type of myopia (whether or not we can capture it in terms of a generalized objective), we can ask the question: is it possible to design a training procedure which will learn that type of myopia?
(We can approach this question in different ways: asymptotic convergence, bounded-loss (which may give useful bounds at finite time), or 'in-practice' (which fully accounts for finite-time effects). As I've mentioned before, my thoughts on this are mostly asymptotic at the moment, that being the easier theoretical question.)
We can think of this question -- the question of designing training procedures -- as a mechanism-design question. Is it possible to set up a system of incentives which encourages a given kind of behavior?
Now, mechanism design is a field which is associated with negative results. It is often not possible to get everything you want. As such, a natural conjecture might be:
Conjecture: It is not possible to set up a learning system which gets you full agency in the sense of eventually learning to take all the Pareto improvements.
This conjecture is still quite vague, because I have not stated what it means to 'learn to take all the Pareto improvements'. Additionally, I don't really want to assume the AIXI-like setting which I've sketched in this post. The setting doesn't yield very good learning-theoretic results anyway, so getting a negative result here isn't that interesting. Ideally the conjecture should be formulated in a setting where we can contrast it to some positive results.
There's also reason to suspect the conjecture to be false. There's a natural instrumental convergence toward dynamic consistency; a system will self-modify to greater consistency in many cases. If there's an attractor basin around full agency, one would not expect it to be that hard to set up incentives which push things into that attractor basin.