I'm going to be a bit more explicit about some ideas that appeared in The Parable of Predict-O-Matic. (If you don't want spoilers, read it first. Probably you should read it first anyway.)
[Note: while the ideas here are somewhat better than the ideas in the predict-o-matic story, they're equally rambling, without the crutch of the story to prop them up. As such, I expect readers to be less engaged. Unless you're especially interested in which character's remarks are true (or at least, which ones I stand by), this might be a post to skim; I don't think it has enough coherence that you need to read it start-to-finish.]
First, as I mentioned in Partial Agency, my main concern here isn't actually about building safe oracles or inner-aligned systems. My main concern is to understand what's going on. If we can build guaranteed-myopic systems, that's good for some purposes. If we can build guaranteed-non-myopic systems, that's good for other purposes. The story largely frames it as a back-and-forth about whether things will be OK / whether there will be terrible consequences; but my focus was on the more specific questions about the behavior of the system.
Second, I'm not trying to confidently stand behind any of the character's views on what will happen. The ending was partly intended to be "and no one got it right, because this stuff is very complicated". I'm very uncertain about all of this. Part of the reason why it was so much easier to write the post as a story was that I could have characters confidently explain views without worrying about adding all the relevant caveats.
Evan Hubinger pointed out to me that all the characters are talking about asymptotic performance, and ignoring inductive bias. Inner optimizers might emerge due to the inductive bias of the system. I agree; in my mind, the ending was a bit of a hat tip to this, although I hinted at gradient hacking rather than inductive bias in the actual text.
On the other hand, "inductive bias" is a complicated object when you're talking about a system which isn't 100% Bayesian.
- You often represent inductive bias through regularization techniques which introduce incentives pulling toward 'simpler' models. This means we're back in the territory of incentives and convergence.
- So, to talk about what a learning algorithm really does, we have to also think of the initialization and search procedure as part of the inductive bias. This makes inductive bias altogether a fairly complicated object.
Explicit Fixed-Point Selection
The very first conversation involved the intern arguing that there would be multiple valid fixed-points of prediction, and Predict-O-Matic would have to choose between them somehow.
Explicitly modeling fixed points and choosing between them is a feature of the logical induction algorithm. This feature allows us to select the best one according to some criterion, as is leveraged in When Wishful Thinking Works. As discussed later in the conversation with the mathematician, this is atypical of supervised learning algorithms. What logical induction does is very expensive: it solves a computationally difficult fixed-point finding problem (by searching exhaustively).
Other algorithms are not really "choosing a fixed point somehow". They're typically failing to guarantee a fixed point. The mathematician hinted at this by describing how algorithms would not necessarily converge to a self-fulfilling prophecy; they could just as easily go in circles or wander around randomly forever.
Think of it like fashion. Sometimes, putting a trend into common knowledge will lock it in; this was true about neck ties in business for a long time. In other instances, the popularity of a fashion trend will actually work against it, a fashion statement being ineffective if it's overdone.
So, keep in mind that different learning procedures will relate to this aspect of the problem in different ways.
Reward vs Prediction Error
The economist first compared the learning algorithm to decision markets, then later, decided prediction markets were a better analogy.
The mathematician contrasted the learning algorithm to reinforcement learning, pointing out that Predict-O-Matic always adjusted outputs to be more like historical observations, whereas reinforcement learning would more strategically optimize reward.
Both of these point at a distinction between learning general decision-making and something much narrower and much more epistemic in character. As I see it, the critical idea is that (1) the system gets information about what it should have output; (2) the learning update moves toward a modified system which would have output that. This is quite different from reinforcement learning.
In a recent post, Wei Dai mentions a similar distinction (italics added by me):
Supervised training - This is safer than reinforcement learning because we don't have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle's output during a training episode is an automated system that computes the Oracle's reward/loss, and that system is secure because it's just computing a simple distance metric (comparing the Oracle's output to the training label), then reward hacking and self-confirming predictions can't happen.
There are several things going on here, but I think Wei is trying to point at something similar to the distinction I'm thinking of. It's quite tempting to call it "supervised learning", because you get a signal telling you what you should have done. However, it's a bit fuzzy, because this also encompasses things normally called "unsupervised learning": the supervised/unsupervised distinction is often explained as modeling vs . Wikipedia:
It could be contrasted with supervised learning by saying that whereas supervised learning intends to infer a conditional probability distribution conditioned on the label of input data; unsupervised learning intends to infer an a priori probability distribution .
But many (not all) unsupervised algorithms still have the critical features we're interested in! Predicting without any context information to help still involves (1) getting feedback on what we "should have" expected, and (2) updating to a configuration which would have more expected that. We simply can't expect the predictions to be as focused, given the absence of contextual information to help. But that just means it's a prediction task on which we tend to expect lower accuracy.
I'm somewhat happy referring to this category as imitative learning. This includes supervised learning, unsupervised learning so long as it's generative (but not otherwise), and imitation learning (a paradigm which achieves similar ends as inverse reinforcement learning). Homever, the terminological overlap with 'imitation learning' is rather terrible, so I'm open to other suggestions.
It seems to me that this is a critical distinction for the myopia discussion. I hope to say more about it in future posts.
The discussion of prediction markets toward the end was rather loose, in that the economist didn't deal with a lot of the other points which had been made throughout, and just threw a new model out there.
- The mechanism of manipulation is left quite vague. In an assassination market, there are all kinds of side-channels which agents can use to accomplish their goals. But the rest of the essay had only considered the influence which Predict-O-Matic has by virtue of the predictions it makes. When writing this part, I was actually imagining side-channels, such as exploiting bugs to communicate by other means.
- It's not clear whether the market is supposed to be myopic or non-myopic in this discussion. The argument for overall myopia was the economist's initial decision-market model, which is being cast aside here. Are manipulators expected to only throw off individual predictions to reap reward, or are they expected to implement cross-prediction strategies?
- The disanalogy between prediction markets and a local search like gradient descent is swept under the rug. A prediction market has many agents interacting with it, so that one agent can pick up money from another by out-smarting it. Do conclusions from that setting carry over to single-hypothesis learning? It isn't clear.
Isnasene interpreted the first point by imagining that the mechanism of manipulation is still through selection of fixed points:
In the same way that self-fulfilling predictions are good for prediction strategies because they enhance accuracy of the strategy in question, self-fulfilling predictions that seem generally surprising to outside observers are even better because they lower the accuracy of competing strategies. The established prediction strategy thus systematically causes the kinds of events in the world that no other method could predict to further establish itself.
This is compatible with the assumption of myopia; we might imagine that the system still can't manipulate events through actual bad predictions, because those strategies will be undercut. Therefore, the manipulation is restricted to selecting fixed-points which are surprising.
However, there are three problems with this:
- The "undercutting" argument relies on an assumption that there are enough different strategies considered, so that the one which undercuts the non-myopic strategy is eventually found. The argument for strategically increasing entropy relies on the reverse assumption! There's little profit in increasing entropy if others can predict that you'll do this and cash in.
- (We might still see both effects at different times in training, however, and therefore expect major consequences of both effects.)
- As previously discussed, we don't actually expect fixed points in general. So we have to ask whether the entropy-increasing incentive is significant more generally (unless we're specifically thinking about the logical induction algorithm or other algorithms which get you a fixed point).
- We still have not dealt with the disanalogy between prediction markets and local-search-based learning.
So, it seems the actual situation is more complicated, and I'm not yet sure how to think about this.
'Local Search'; selection vs control
I used the term 'local search' to describe the application of gradient-descent-like updates to reduce prediction error. I have some conceptual/terminological issues with this.
Calling this 'local search' invokes the mental image of a well-defined gradient landscape which we are taking steps on, to further optimize some function. But this is the wrong mental image. The mental image is one of selection, when we're in a control setting (in my terminology). We are not making an iid assumption. We are not getting samples from a stationary but stochastic loss function, as in stochastic gradient descent.
If 'local search' were an appropriate descriptor for gradient-descent here, would it also be an appropriate descriptor for Bayesian updates? There's a tendency to think of Bayesian learning as trying to find one good hypothesis by tracking how well all of them do (which sounds like a global search), but we needn't think of it this way. The "right answer" can be a mixture over hypotheses. We can think of a Bayesian update as incrementally improving our mixture. But thinking of Bayesian updates as local search seems wrong. (So does thinking of them as global search.)
This is online learning. A gradient-descent step represents a prediction that the future will be like the past in some relevant sense, in spite of potential non-stationarity. It is not a guaranteed improvement, even in expectation -- as it would be in offline stochastic gradient descent with sufficiently small step size.
Moreover, step size becomes a more significant problem. In offline gradient descent, selecting too small a step size only means that you have to make many more steps to get where you're going. It's "just a matter of computing power". In online learning, it's a more serious problem; we want to make the appropriate-sized update to new data.
I realize there are more ways of dealing with this than tuning step size; we don't necessarily update to data by making a single gradient step. But there are problems of principal here.
What's gradient descent without a fitness landscape?
Simply put, gradient descent is a search concept, not a learning concept. I want to be able to think of it more directly as a learning concept. I want to be able to think of it as an "update", and use terminology which points out the similarity to Bayesian updates.
The Duality Remark
Vanessa asked about this passage:
The engineer was worse: they were arguing that Predict-O-Matic might maximize prediction error! Some kind of duality principle. Minimizing in one direction means maximizing in the other direction. Whatever that means.
[I]t was a speculative conjecture which I thought of while writing.
The idea is that incentivizing agents to lower the error of your predictions (as in a prediction market) looks exactly like incentivizing them to "create" information (find ways of making the world more chaotic), and this is no coincidence. So perhaps there's a more general principle behind it, where trying to incentivize minimization of f(x,y) only through channel x (eg, only by improving predictions) results in an incentive to maximize f through y, under some additional assumptions. Maybe there is a connection to optimization duality in there.
In terms of the fictional cannon, I think of it as the engineer trying to convince the boss by simplifying things and making wild but impressive sounding conjectures. :)
If you have an outer optimizer which is trying to maximize through while being indifferent about , it seems sensible to suppose that inner optimizers will want to change to throw things off, particularly if they can get credit for then correcting to be optimal for the new . If so, then inner optimizers will generally be seeking to find -values which make the current a comparatively bad choice. So this argument does not establish an incentive to choose which makes all choices of poor.
In a log-loss setting, this would translate to an incentive to make observations surprising (for the current expectations), rather than a direct incentive to make outcomes maximum-entropy. However, iteration of this would push toward maximum entropy. Or, logical-induction-style fixed-point selection could push directly to maximum entropy.
This would be a nice example of partial agency. The system is strategically influencing and so as to maximize through channel , while minimizing through channel . What does this mean? This does not correspond to a coherent objective function at all! The system is 'learning a game-theoretic equilibrium' -- which is to say, it's learning to fight with itself, rather than optimize.
There are two different ways we can think about this. One way is to say there's an inner alignment problem here: the system learns to do something which doesn't fit any objective, so it's sort of trivially misaligned with whatever the outer objective was supposed to be. But what if we wanted this? We can think of games as a kind of generalized objective, legitimizing this behavior.
To make things even more confusing, if the only channel by which Predict-O-Matic can influence the world is via the predictions which get output, then... doesn't ? represents the 'legitimate' channel whereby predictions get combined with (fixed) observations to yield a score. represents the 'manipulative' channel, where predictions can influence the world and thus modify observations. But the two causal pathways have one bottleneck which the system has to act through, namely, the predictions made.
In any case, I don't particularly trust any of the reasoning above.
- I didn't clarify my assumptions. What does it mean for the outer optimizer to maximize through while being indifferent about ? It's quite plausible that some versions of that will incentivise inner optimizers which optimize taking advantage of both channels, rather than the contradictory behavior conjectured above
- I anthropomorphized the inner optimizers. In particular, I did not specify or reason about details of the learning procedure.
- This sort of assumes they'll tend to act like full agents rather than partial agents, while yielding a conclusion which suggests otherwise.
- This caused me to speak in terms of a fixed optimization problem, rather than a learning process. Optimizing isn't really one thing -- is a loss function which is applied repeatedly in order to learn. The real problem facing inner optimizers is an iterated game involving a complex world. I can only think of them trying to game a single if I establish that they're myopic; otherwise I should think of them trying to deal with a sequence of instances.
So, I'm still unsure how to think about all this.
I think I've updated a bit from when I wrote this (due to this discussion). (ETA: I've now added a link from that paragraph to this comment.) Now I would say that the safety-relevant differences between SL and RL are:
"But there are problems of principal here." - principle?
This is something that John_Maxwell made made think about when discussing the Dualist Predict-o-Matic:
It made me realize a few things
1. When I imagined an strategy identified fixed points/self-fulfilling prophecies, I was explicitly imagining the kind of strategy that first roughly enumerated a number of possibilities (without zooming in on impacts it may itself cause) and then exploring those possibilities by re-simulating them while conditioning on it giving them as a prediction and noticing that they have fixed point like properties. However, from a practical perspective, this doesn't seem very feasible considering the massive number of degrees of freedom that the Predict-O-Matic has.
Why did I think along these lines specifically? Well, I read this
and interpreted it to mean that, since observations will depend on the prediction to at least some extent, the algorithm really should sanity-check its prediction to make sure it still happens when it is given as an output. Hence, I imagined a gradual bumbling into of fixed points as the algorithm would sanity check its prediction, modify it slightly to address the changes, and then repeat--falling into a basin of attraction (to be clear, I also interpreted the Predict-O-Matic to be myopic).
2. A prediction doesn't have to be completely self-fulfilling or completely independent of the observation. The iterative process I described above of sanity-checking a prediction, adjusting it, and iterating is unlikely to produce a prediction that guarantees an outcome but it will probably raise the chances of that outcome happening to some extent. In other words, self-fulfilling predictions themselves might be rare but predictions that slightly raise the chances of themselves happening when given as an output could be relatively normal.
3. As you note here, the actual behavior of the Predict-O-Matic is heavily determined by the details of how the optimizer tunes the strategies it considers over time. Whether it ultimately interacts with fixed points depends on the kind of strategies the optimizer gets to look at. Furthermore, things like selecting for maximally surprising self-fulfilling prophecies as I described could be strategically avoided by using different algorithms. Using a quantilizer (ie, building a bunch of strategies and randomly picking one of the ones within the top 5% historical performance) could yield multiple distinct strategies that each fall into distinct basins of attraction. This would cause any single strategy to suffer if it falls too deeply into a surprising self-fulfilling prediction because it won't be able to predict whether that prediction will actually be made. To carry the economics metaphors, you might try to shift an "assassination market"-like thing into a "Keynesian Beauty Contest"-like thing.
4. One of the other relevant things in the Parable was that accuracy in the context of the Predict-O-Matic is underspecified in a lot of ways that could be important. When your machine can try to predict any detail of the world, prediction error can vary greatly. If you ask about the stock market, it might be just one number. If you ask about a painting, it might be a lot more things. Since the Predict-O-Matic as described has natural language processing (presumably with the goal of providing a good interface for humans), it's prediction accuracy might even have to be calculated by weighting the inaccuracy of each specific aspect of the prediction by the anticipated importance of that aspect to the human interacting with it. Even if we try to keep things simple and ignore that aspect, there are a lot of ways that the Predict-O-Matic might output a prediction of the same events. Thus, a given strategy might strategically phrase the predictions in the ways most optimal for ensuring the prediction's accuracy (or strategically try to minimize prediction error by making the prediction as low impact or generic as possible).
5. A lot of this is obviously getting into the weeds of what really specific thing the Predict-O-Matic is trying to do and I'm not sure how helpful it is in general.
Anyway, moving on....
This is an important but non-obvious thing. Consider kernel density estimation with cross-validation. What you're actually doing here is taking the distibution x you've observed and then testing how a bunch of simple distributional models built on subsets of that observed data generalize to the rest of the set. While you're learning about a distribution in an unsupervised manor, you can also interpret it as supervised learning:
p(y = unobserved x | x = observed x)
This is a somewhat confusing paragraph for me because the idea of the outer optimizer trying to optimize a function of something that is doesn't care about seems strange. Based on previous discussions of partial agency, I'm assuming this means something like
-The outer optimizer rewards the inner optimizer whenever it changes x in a way that increases f
-The outer optimizer does not care if f decreases because the inner optimizer has changed y while holding x constant
In the case of the Predict-O-Matic, I just didn't imagine this happening because I figured the outer optimizer was just trying to myopically minimize the prediction error without adjusting for which of the inputs were modified. However, when this does happen, the results can be chaotic. After all, you can simply an x that optimized f when y=y1 and another x that optimized f when y=y2. Then simply move y back and forth without punishment and get a reward every time you "correct" x to the relevant optimum. This let's you milk the outer optimizer for infinite reward. For example:
f(x,y) = (1-2^(-yx))
If you start at y0=1, x0=10, you'll get diminishing returns if you increase x. But, if you change y from y0 to y=-1, f drops dramatically and you can collect all the reward you'd like by decreasing x to -10. Then simply switch y from -1 back to 1 and reverse the shift you just made. Infinite reward cycle. I think Yudkowsky said something about how utility functions are good because they order outcomes well and, if you don't have ordering, someone can exploit you by repeatedly cycling through a process like the one above.
Since the reward you can get out of a process like this is arbitrarily large and, in practical scenarios, the inner optimizer will only have a model of how its decisions will change the rewards it gets, the actual path traced out above can be highly depending on initial conditions. One algorithm might hold y constant and blindly try to increase x since it increases f, converging on a boring finite reward. Another might discover the loophole and start cycling the outer optimizer for reward but be scared to risk leaving the cycle (since who knows what the function looks like outside the two optima it found, or whether it will ever b e able to get back?). Another might explore and realize it shift y even more dramatically and get even more dramatic rewards faster (if it's trying to maximize total reward and realizes it only has a finite amount of time).
This kind of maximizing surprise/entropy is very different from the kind of maximization I envisioned. When I imagined an algorithm that tried to throw off the algorithms it was competing against, it was creating surprising outcomes but this surprise was fundamentally relative to the predictor. To the algorithm causing surprise, nothing was surprising. To the algorithms it was competing against, it was all surprising. To the people outside the Predict-O-Matic, who knows if it was surprising? This really depends on whether the inside Predict-o-Matic strategies tend to make predictions similar to the ones people would make. In short, you're considering objective increases in entropy relative to the target function. I'm considering subjective increases in entropy relative to people/strategies aside from the one chosen by the Predict-O-Matic.
No idea if its useful but there's gotta be something here! Per wikipedia:
It's been a while since I took my linear optimization class but the primal and dual problems in linear algebra involve switching between a maximization and minimization problem and interchanging the constraints of a linear problem with the coefficients of the target function and vice-versa. For a given strategy in the Predict-O-Matic, you might try and imagine a strategy that tries to minimize prediction error while viewing the performances of its competitors as constraints. When you switch to the dual problem, you switch to minimizing the constraints and constraining on the minimization. This is hand-waivey of course; and I'm not sure it applies in the context of entropy caused by partial agency as you described it.