Self-supervised learning & manipulative predictions

[-]interstice6yΩ350

Example 1 basically seems to be the problem of output diversity in generative models. This can be a real problem, but there are ways around it: instead of outputting the highest-probability individual sequence, which will certainly look "manipulative" as you say, sample from the implied distribution over sequences. Then the sentence involving "pyrite" will be output with probability proportional to how likely the model thinks "pyrite" is on its own, disregarding subsequent tokens.
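To make the difference between the two decoding rules concrete, here is a minimal sketch on a toy two-token model. Everything in it is an assumption made up for illustration: the token names, the probabilities, and the helper names `most_likely_sequence` and `sample_sequence` come from neither the post nor any library.

```python
import random
from collections import Counter

# Hypothetical toy model: P(first token) and P(continuation | first token).
# All numbers are invented for illustration.
first = {"pyrite": 0.6, "quartz": 0.4}
continuation = {
    # "pyrite" is the likelier answer, but what follows it is high-entropy.
    "pyrite": {f"article_{i}": 1 / 100 for i in range(100)},
    # "quartz" is less likely, but has a single, fully predictable continuation.
    "quartz": {"boilerplate": 1.0},
}

def sequence_probs():
    """Joint probability of every (first, continuation) pair via the chain rule."""
    return {
        (a, b): pa * pb
        for a, pa in first.items()
        for b, pb in continuation[a].items()
    }

def most_likely_sequence():
    """Pick the single highest-probability whole sequence (argmax decoding)."""
    probs = sequence_probs()
    return max(probs, key=probs.get)

def sample_sequence(rng):
    """Sample token by token from the model's own distribution."""
    a = rng.choices(list(first), weights=list(first.values()))[0]
    conts = continuation[a]
    b = rng.choices(list(conts), weights=list(conts.values()))[0]
    return a, b

rng = random.Random(0)
counts = Counter(sample_sequence(rng)[0] for _ in range(10_000))
pyrite_fraction = counts["pyrite"] / 10_000
```

Under these assumed numbers, argmax decoding starts with "quartz" (0.4 beats 0.6/100 for any single "pyrite" sequence), while sampling starts with "pyrite" about 60% of the time, regardless of how unpredictable its continuation is.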

For example 2, I wrote a similar post a few months ago (and in fact, this idea seems to have been proposed and forgotten a few times on LW). But for gradient descent-based learning systems, I don't think the effect described will take place.

The reason is that gradient-descent-based systems are only updated towards what they actually observe. Let's say we're training a system to predict EU laws. If it predicts "The EU will pass potato laws..." but sees "The EU will pass corn laws..." the parameters will be updated to make "corn" more likely to have been output than "potato". There is no explicit global optimization for prediction accuracy.

As you train to convergence, the predictions of the model will attempt to approach a fixed point, a set of predictions that imply themselves. However, due to the local nature of the update, this fixed point will not be selected to be globally minimal; it will just be the first minimum the model falls into. (This is different from the problems with "local minima" you may have heard about in ordinary neural network training -- those go away in the infinite-capacity limit, whereas local minima among fixed points do not.) The fixed point should look something like "what I would predict if I output [what I would predict if I output [what I would predict ...]]" where the initial prediction is some random gibberish. This might look pretty weird, but it's not optimizing for global prediction accuracy.
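A minimal sketch of this fixed-point picture, under heavy assumptions: the model is a single number `p` (its predicted probability of some event), and `world` is a made-up response function giving the event's true probability once `p` has been published. Neither comes from the post. The update uses the expected gradient of squared error to keep the run deterministic, and it deliberately takes no gradient through `world`.

```python
import math

def world(p):
    """Hypothetical environment: chance the event happens once prediction p is published."""
    return 0.05 if p < 0.5 else 0.7

def train(p, lr=0.1, steps=500):
    """Gradient descent on squared error, treating the observed outcome as a constant."""
    for _ in range(steps):
        target = world(p)        # expected observation given the current prediction
        p -= lr * (p - target)   # d/dp of 0.5 * (p - target)**2, no gradient through world()
    return p

def entropy(p):
    """Expected log loss (nats) of a calibrated prediction p for a binary event."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

low = train(0.3)    # settles at the fixed point p = 0.05
high = train(0.6)   # settles at the fixed point p = 0.7
```

Both runs end at self-consistent predictions, but which one depends only on the starting point. The run starting at 0.6 stays at 0.7 even though the 0.05 fixed point has lower expected loss (`entropy(low) < entropy(high)`), because nothing in the update compares fixed points against each other.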

[-]Steven Byrnes6y*Ω340

Thank you for the links!! Sorry I missed them! I'm not sure I understand your comments though and want to clarify:

I'm going to try to rephrase what you said about example 1. Maybe the text in any individual journal article about pyrite is perplexing, but given that the system expects some article about pyrite there, it should ramp the probabilities of individual articles up or down such that the total probability of seeing a journal article about pyrite, conditional on the answer "pyrite", is 100%. (By the same token, "The following is a random number: 2113164" is, in a sense, an unsurprising text string.) I agree with you that a system that creates a sensible, self-consistent probability distribution for text strings would not have a problem with example 1 if we sample from that distribution. (Thanks.) I am concerned that we will build a system with heuristic-guided search processes, not self-consistent probability estimates, and that this system will have a problem with example 1. After all, humans are subject to the conjunction fallacy etc., I assume AGIs will be too, right? Unless we flag this as a critical safety requirement and invent good techniques to ensure it. (I updated the post in a couple places to clarify this point, thanks again.)

For gradient descent, yes they are "only updated towards what they actually observe", but they may "observe" high-level abstractions and not just low-level features. It can learn about a new high-level context in which the low-level word sequence statistics would be very different than when superficially-similar text appeared in the past. So I don't understand how you're ruling out example 2 on that basis.

I mostly agree with what you say about fixed points in principle, but with the additional complication that the system's beliefs may not reflect reality, especially if the beliefs come about through abstract reasoning (in the presence of imperfect information) rather than trial-and-error. If the goal is "No manipulative answers at all ever, please just try to predict the most likely masked bits in this data-file!", then hopefully that trial-and-error will not happen, and in this case I think fixed points become a less useful framework to think about what's going on.

[-]interstice6yΩ110

No worries, I also missed the earlier posts when I wrote mine. There's lots of stuff on this website.

I endorse your rephrasing of example 1. I think my position is that it's just not that hard to create a "self-consistent probability distribution". For example, say you trained an RNN to predict sequences, like in this post. Despite being very simple, it already implicitly represents a probability distribution over sequences. If you train it with back-propagation on a confusing article involving pyrite, then its weights will be updated to try to model the article better. However, if "pyrite" itself was easy to predict, then the weights that lead to it outputting "pyrite" will *not* be updated. The same thing holds for modern Transformer networks, which predict the next token based only on what they have seen so far. (Here is a paper with a recent example using GPT-2. Note the degeneracy of maximum likelihood sampling, but how this becomes less of a problem when just sampling from the implied distribution.)
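A small sketch of the per-position error signal behind this claim, assuming the standard softmax cross-entropy loss (whose gradient with respect to the logits at one position is softmax minus one-hot). The logits are made up. The sketch only covers the signal arriving at one output position; in a real network the weights are shared across positions, so it illustrates the argument rather than proving that no weight involved in "pyrite" ever moves.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_gradient(logits, target_index):
    """Gradient of cross-entropy w.r.t. the logits at one position: softmax - one_hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]

# Position where the correct token was easy to predict (hypothetical logits).
easy = logit_gradient([8.0, 0.0, 0.0], target_index=0)
# Position inside a confusing passage: the model had no idea (uniform logits).
hard = logit_gradient([0.0, 0.0, 0.0], target_index=0)

easy_size = max(abs(g) for g in easy)
hard_size = max(abs(g) for g in hard)
```

The error signal at the easy position is close to zero, while the confusing position produces a large one. How hard the later tokens were does not enter the gradient at the earlier position.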

I agree that this sort of manipulative prediction could be a problem in principle, but it does not seem to occur in recent ML systems. (Although, there are some things which are somewhat like this; the earlier paper I linked and mode collapse do involve neglecting high-entropy components of the distribution. However, the most straightforward generation and training schemes do not incentivize this)

For example 2, the point about gradient descent is this: while it might be the case that outputting "Help I'm stuck in a GPU Factory000" would ultimately result in a higher accuracy, the way the gradient is propagated would not encourage the agent to behave manipulatively. This is because, *locally*, "Help I'm stuck in a GPU Factory" decreases accuracy, so that behavior (or policies leading to it) will be dis-incentivized by gradient descent. It may be the case that this will result in easier predictions later, but the structure of the reward function does not lead to any optimization pressure towards such manipulative strategies. Learning taking place over high-level abstractions doesn't change anything, because any high-level abstractions leading to locally bad behavior will likewise be dis-incentivized by gradient descent.

[-]Steven Byrnes6y*Ω110

Thanks, that's helpful! I'll have to think about the "self-consistent probability distribution" issue more, and thanks for the links. (ETA: Meanwhile I also added an "Update 2" to the post, offering a different way to think about this, which might or might not be helpful.)

Let me try the gradient descent argument again (and note that I am sympathetic, and indeed I made (what I think is) that exact argument a few weeks ago, cf. Self-Supervised Learning and AGI Safety, section title "Why won't it try to get more predictable data?"). My argument here is not assuming there's a policy of trying to get more predictable data for its own sake, but rather that this kind of behavior arises as a side-effect of an algorithmic process, and that all the ingredients of that process are either things we would program into the algorithm ourselves or things that would be incentivized by gradient descent.

The ingredients are things like "Look for and learn patterns in all accessible data", which includes both low-level patterns in the raw data, higher-level patterns in the lower-level patterns, and (perhaps unintentionally) patterns in accessible information about its own thought process ("After I visualize the shape of an elephant tusk, I often visualize an elephant shortly thereafter"). It includes searching for transformations (cause-effect, composition, analogies, etc.) between any two patterns it already knows about ("sneakers are a type of shoe", or more problematically, "my thought processes resemble the associative memory of an AGI"), and cataloging these transformations when they're found. Stuff like that.

So, "make smart hypotheses about one's own embodied situation" is definitely an unintended side-effect, and not rewarded by gradient descent as such. But as its world-model becomes more comprehensive, and as it continues to automatically search for patterns in whatever information it has access to, "make smart hypotheses about one's own embodied situation" would just be something that happens naturally, unless we somehow prevent it (and I can't see how to prevent it). Likewise, "model one's own real-world causal effects on downstream data" is neither desired by us nor rewarded (as such) by gradient descent. But it can happen anyway, as a side-effect of the usually-locally-helpful rule of "search through the world-model for any patterns and relationships which may impact our beliefs about the upcoming data". Likewise, we have the generally-helpful rule "Hypothesize possible higher-level contexts that span an extended swathe of text surrounding the next word to be predicted, and pick one such context based on how surprising it would be based on what it knows about the preceding text and the world-model, and then make a prediction conditional on that context". All these ingredients combine to get the pathological behavior of choosing "Help I'm trapped in a GPU". That's my argument, anyway...

[-]abramdemski6yΩ230

The reply to interstice makes me think about logical uncertainty: if the predictor "reasons" about what to expect (internally engages in a sequence of computations which accounts for more structure as it thinks longer), then it is especially difficult to be approximately Bayesian (for all the classic reasons that logical uncertainty messes things up). So the argument that the described behaviour isn't logical doesn't really apply, because you have to deal with things like you mention where you spot an inconsistency in your probability distribution but you aren't sure how to deal with it.

This "reasoning" argument is related to the intuition you mention about search -- you imagine the system searching for sensible futures when deciding what to predict next. It doesn't make sense for a system to do that if the system is only learning conditional probabilities of the next token given history; there is no information to gain by looking ahead. However, there are a number of reasons why it could look ahead if it's doing something more complicated. It could be actively searching for good explanations of its history, and looking ahead to plausible futures might somehow aid that process. Or maybe it learns the more general blank-filling task rather than only the forward-prediction version where you fill in the future given the past; then it could benefit from consulting its own models that go in the other direction as a consistency check.

Still, I'm not convinced that strategic behavior gets incentivised. As you say in the post, we have to think through specific learning algorithms and what behaviour they encourage.

[-]wkey555y20

I used your idea of "a" vs. "an" agreement with the following word as the basis of a GPT-3 experiment to show that GPT-3 indeed probably does do lookahead. The details are at https://www.reddit.com/r/GPT3/comments/k0mvf3/experiment_that_shows_that_gpt3_can_probably_plan/.

[-]John_Maxwell6yΩ120

Glad you are thinking about this!

How about putting the system in an "interactive environment" in the sense that it sometimes gets new data, but not asking it to predict what new data it will get? (Or, for an even looser constraint, maybe in some cases it makes predictions about new data it will get, but it doesn't factor these predictions into things like the sentence completion task.)

[-]Steven Byrnes6yΩ350

Yeah, I think something like that would probably work for 1B, but 1B is the easy part. It's 1C & 1D that are keeping me up at night...

[-]John_Maxwell6yΩ120

Can you be crisper about why you think 1C & 1D are necessary?

[-]Steven Byrnes6yΩ220

Well, strategy 1 is "Keep it from thinking that it's in an interactive environment". Things like "don't adjust the weights of the network while we ask questions" are a way to prevent it from thinking that it's in an interactive environment based on first-hand experience: we're engineering the experience to not leave traces in its knowledge. But to succeed in strategy 1, we also need to make sure that it doesn't come to believe it's in an interactive environment by other means besides first-hand experience, namely by abstract reasoning. More details in this comment, but basically an AGI with introspective information and world-knowledge will naturally over time figure out that it's an AGI, figure out the sorts of environments that AGIs are typically in, and thus hypothesize the existence of interactions even if those interactions have never happened before and were not intended by the designer (e.g. the "Help I'm trapped in a GPU!" type interactions).

[-]John_Maxwell6yΩ230

Hm, I think we're talking past each other a bit. What I was trying to get at was: When we're doing self-supervised learning, we're optimizing an objective function related to the quality of the system's internal knowledge representations. My suggestion was that this internal objective function should have a term for the accuracy with which the system is able to predict masked bits of existing knowledge, but not a term for the accuracy of hypothesized future predictions a la beam search. Then we can use the system interactively as follows:

1. Give it some data.
2. Do self-supervised learning on the data, optimizing the quality of internal knowledge representations with a "short-sighted" objective function like I described.
3. Use these knowledge representations to make predictions of interest.
4. Repeat as needed.

What I'm looking for is a crisp description of why accurate self-knowledge (including knowledge of the interaction loop) is dangerous in this framework.

[-]Steven Byrnes6yΩ230

OK, hmm, let me try again then. This would be the section of the post entitled "A self-supervised learning algorithm in an interactive environment can become a manipulative goal-seeker".

I've been assuming all along that the objective function only rewards the next word. Unfortunately, it seems that the way to achieve this objective in practice is to search for higher-level longer-term contexts that surround the next word, like when we're watching TV and we think, "A commercial break is starting." Knowing that a commercial break is starting is essential for predicting the very next frame on the TV screen, but it is also incidentally an (implicit) prediction about what will appear on the screen for the next few minutes. In other words, you could say that making accurate (possibly implicit) probabilistic predictions about the next many words is instrumentally useful for making accurate probabilistic predictions about the next one word, and is thus rewarded by the objective function. I expect that systems that work well will have to be designed this way (i.e. finding "contexts" that entail implicit predictions about many future words, as a step towards picking the single next word). I think this kind of thing is necessary to implement even very basic things like object permanence.

Then the next step is to suppose that the system (being highly intelligent) comes to believe that the prediction X will cause other aspects of the longer-term context to be Y. (See the "Hypothesis 1" vs "Hypothesis 2" examples in the post.) If the system was previously thinking that P(X) is high and P(Y) is low, then ideally, the realization that X implies Y will cause the system to raise P(Y), while keeping P(X) at its previous value. This is, after all, the logically correct update, based on the direction of causality!

But if the system screws up, and lowers P(X) instead of raising P(Y), then it will make a manipulative prediction—the output is being chosen partially for its downstream interactive effects. (Not all manipulative predictions are dangerous, and there might be limits to how strongly it optimizes its outputs for their downstream effects, but I suspect that this particular case can indeed lead to catastrophic outcomes, just like we generically expect from AIs with real-world human-misaligned goals.)

Why should the system screw up this way? Just because the system's causal models will sometimes have mistakes, and sometimes have uncertainties or blank spaces (statistical-regularities-of-unknown-cause), and also because humans make this type of mistake all the time ("One man's modus ponens is another man's modus tollens"). I suspect it will make the right update more often than chance, I just don't see how we can guarantee that it will never make the wrong update in the manipulative Y-->X direction.
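The two ways of repairing the inconsistency can be sketched in a few lines. This is only an illustration under an assumption the comment does not state numerically: "X implies Y" is treated as certain, so a coherent belief state needs P(Y) >= P(X). The function names and the numbers are hypothetical.

```python
def repair_forward(p_x, p_y):
    """Causally correct repair: keep P(X), raise P(Y) so that P(Y) >= P(X)."""
    return p_x, max(p_y, p_x)

def repair_backward(p_x, p_y):
    """Mistaken repair: keep P(Y), lower P(X) so that P(X) <= P(Y)."""
    return min(p_x, p_y), p_y

# Before noticing that X implies Y: X looks likely, Y looks unlikely (made-up numbers).
prior = (0.8, 0.1)

correct = repair_forward(*prior)        # (0.8, 0.8): the prediction X is unchanged
manipulative = repair_backward(*prior)  # (0.1, 0.1): X is now disfavored because of Y
```

Both results are coherent, which is why coherence alone cannot tell them apart. Only the second one lets beliefs about the downstream effect Y change which output X gets chosen.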

Does that help?

[-]John_Maxwell6yΩ120

Thanks for the thoughts!

This description seems rather different than your original beam search story, no? In your original story, you were describing an incentive the system had to direct the world in order to make it easier to predict. I don't see how this incentive arises here.

I'm not entirely convinced that predictions should be made in a way that's completely divorced from their effects on the world. For example, the prediction "You aren't going to think about ice cream" would appear to be self-falsifying. It seems like the most useful AI system would be one whose predictions tend to remain true even after being made.

(By the way, I hope I'm not coming across as antagonistic in this thread--I'm still replying because I think this is a really important topic and I'm hoping we can hammer it out together! And I think a crisp description of a problem is frequently the first step to solving it.)

[-]Steven Byrnes6yΩ110

This is great, thanks again for your time and thoughtful commentary!

RE "I'm not entirely convinced that predictions should be made in a way that's completely divorced from their effects on the world.": My vision is to make a non-agential question-answering AGI, thus avoiding value alignment. I don't claim that this is definitely the One Right Answer To AGI Safety (see "4. Give up, and just make an agent with value-aligned goals" in the post), but I think it is a plausible (and neglected) candidate answer. See also my post In defense of oracle (tool) AI research for why I think it would solve the AGI safety problem.

If an AGI applies its intelligence and world model to its own output, choosing that output partly for its downstream effects as predicted by the model, then I say it's a goal-seeking agent. In this case, we need to solve value alignment—even if the goal is as simple as "answer my question". (We would need to make sure that the goal is what it's supposed to be, as opposed to a proxy goal, or a weird alien interpretation where rewiring the operator's brain counts as "answer my question".) Again, I'm not opposed to building agents after solving value alignment, but we haven't solved value alignment yet, and thus it's worth exploring the other option: build a non-agent which does not intelligently model the downstream effects of its output at all (or if it does model it incidentally, to not do anything with that information).

Interfacing with a non-agential AGI is generally awkward. You can't directly ask it to do things, or to find a better way to communicate. My proposal here is to ask questions like "If there were no AGIs in the world, what's the likeliest way that a person would find a cure for Alzheimer's?" This type of question does not require the AGI to think through the consequence of its output, and it also has other nice properties (it should give less weird and alien and human-unfriendly answers than the solutions a direct goal-seeking agent would find).

OK, that's my grand vision and motivation, and why I'm hoping for "no reasoning about the consequences of one's output whatsoever", as opposed to finding self-fulfilling predictions. (Maybe very very mild optimization for the consequences of one's outputs is OK, but I'm nervous.)

Your other question was: if a system is making manipulative predictions, towards what goal is it manipulating? Well, you noticed correctly, I'm not sure, and I keep changing my mind. The answer may also differ depending on the details of the algorithm.

My top expectation is that it will manipulate towards getting further inputs that its model thinks are typical, high-probability inputs. If X implies Y, and P(Y) is low, that might sometimes spuriously push down P(X), and thus the system will pick those X's that result in high P(Y).
My secondary expectation is that it might manipulate towards unambiguous, low-entropy outputs. This is the expectation if the system picks out the single most likely ongoing long-term context, and outputs a prediction contingent on that. (If instead the system randomly draws from the probability distribution of all possible contexts, this wouldn't happen, as suggested by interstice's comments on this page.) So if X1 leads to one of 500 slightly different Y1's (Y1a, Y1b,...), while X2 definitely leads to only one specific Y2, then Y2 is probably the most likely single Y, even if all the Y1's in aggregate are likelier than Y2; so X2 is at an unfair advantage.
Beyond those two, I suspect there can be other goals but they depend on the algorithm and its heuristics.
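
As a worked version of the X1/X2 arithmetic in the second expectation, here is one set of numbers that produces the effect. The probabilities 0.6 and 0.4 and the uniform split over the 500 variants are assumptions chosen for illustration; Y_{1k} stands for the variants Y1a, Y1b, and so on.

```latex
\begin{align*}
P(X_1) &= 0.6, & P(Y_{1k} \mid X_1) &= \tfrac{1}{500} \quad (k = 1, \dots, 500), \\
P(X_2) &= 0.4, & P(Y_2 \mid X_2) &= 1, \\
P(X_1, Y_{1k}) &= 0.6 \times \tfrac{1}{500} = 0.0012, & P(X_2, Y_2) &= 0.4 \times 1 = 0.4 .
\end{align*}
```

The Y1 variants together carry probability 0.6, more than Y2's 0.4, yet the single most likely context is (X2, Y2) by a factor of more than 300. A system that commits to the single best context would therefore favor X2.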

Background and assumptions about the self-supervised-learning system

Why might we put a self-supervised learner into an interactive environment?

On purpose

By accident

A self-supervised learning algorithm in an interactive environment can become a manipulative goal-seeker

Types of solutions

1. Keep it from thinking that it's in an interactive environment

1A. Prevent it from learning first-hand about an accidental interactive environment

1B. Prevent it from learning first-hand about a purposeful interactive environment

1C & 1D. Prevent it hypothesizing about an interactive environment (either accidental or purposeful) via indirect, abstract reasoning

2. Prevent its knowledge of the interactive environment from affecting its predictions

3. Don't put it in an interactive environment

4. Give up, and just make an agent with value-aligned goals