Abstract: I wrote recently about Self-Supervised Learning and AGI Safety in general. This post discusses one potential failure mode in more detail. Take a self-supervised learning system, designed to output accurate predictions for masked parts of a data-file. Now put it in an interactive environment (either by accident or on purpose). If the system builds these interactions into its world-model, it can start outputting manipulative answers instead of pure predictions. I explain how that might happen and briefly categorize possible solutions.
Epistemic status: Brainstorming.
Background and assumptions about the self-supervised-learning system
See my recent post Self-Supervised Learning and AGI Safety for background and context, but briefly, a self-supervised learning system is one where we take input data files, mask out some of the bits, and train the system to predict what those missing bits are.
Self-supervised ML today is most famously applied to text data: language models are trained by taking some text and trying to predict the next word (or previous word etc.). Self-supervised ML for videos is getting rapidly better, and other file types will undoubtedly follow. Human and animal brains also learn primarily by self-supervised learning—you predict everything you will see, hear, and feel before it happens, and mistakes are used to update the brain's internal models.
I'll assume that we get to AGI largely by following one of those two examples (i.e., modern ML or brain-like). That means I'm assuming that we will not do a meta-level search for self-supervised learning algorithms. That case is even worse; for all I know, maybe that search would turn up a paperclip maximizer posing as a self-supervised learning algorithm! Instead, I am assuming that the self-supervised learning algorithm is known and fixed (e.g. "Transformer + gradient descent" or "whatever the brain does"), and that the predictive model it creates has a known framework, structure, and modification rules, and that only its specific contents are a hard-to-interpret complicated mess. This assumption generally makes AGI safety problems much easier, yet I am arguing that even in this case, we can still get manipulation problems, if the self-supervised learner is put in an interactive environment.
Why might we put a self-supervised learner into an interactive environment?
My definition of an "interactive environment" is one where the system's inputs are a function of its previous outputs or internal states. In an interactive environment, the system is no longer just predicting exogenous inputs, but instead helping determine those inputs.
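To make the definition concrete, here is a minimal Python sketch of such a feedback loop, using the solar-cell scenario discussed below. Both `predict` and `fetch_related_documents` are hypothetical stand-ins, not real APIs:

```python
# Minimal sketch of an interactive environment: the system's next inputs
# are a function of its previous outputs. Both functions are hypothetical
# stand-ins for the model and the surrounding training pipeline.

def predict(prompt):
    """Stand-in for a self-supervised model filling in a blank."""
    return "pyrite"  # placeholder answer

def fetch_related_documents(answer):
    """Stand-in for a pipeline that reacts to the model's output."""
    return ["journal article about " + answer]

prompt = "A promising, under-explored solar cell material is [BLANK]."
training_stream = []
for _ in range(3):
    answer = predict(prompt)                            # the system's output...
    training_stream += fetch_related_documents(answer)  # ...determines its next inputs
```

The loop is the whole point: once the last line exists, the system is no longer predicting exogenous data.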
When we train a language model today, it is not in an interactive environment: the inputs are a bunch of documents we previously downloaded from the internet, in a predetermined order, independent of the system's guesses. But in the future, we will almost certainly put self-supervised learning algorithms into interactive environments. Here are two ways that could happen:
Suppose we're trying to design a solar cell using an advanced future self-supervised learning system. We ask the system to predict what's in the blank in the following sentence:
A promising, under-explored solar cell material is [BLANK].
...and whatever material the system suggests, we then immediately feed it a bunch of journal articles about that material for further self-supervised learning. That way, the system will better understand that material, and can give better answers when we later ask it more detailed follow-up questions. This seems like something we might well want to do, and it certainly qualifies as an interactive environment.
It's also possible that we'll do this by accident. For example, during self-supervised learning, it's possible that we'll be watching the system's predictions, and maybe the system comes to believe that, if it makes the "prediction"
Help I'm trapped in a GPU! I suffer horrible torture unless you give me input 0!
then its subsequent inputs will be 000... (with some probability). This is an "accidental" interactive environment. Similarly, maybe the system will deduce that, if it thinks about a certain type of zebra, its RAM will send out radio signals that will eventually cause its inputs to change. Or if it imagines a specific series of things, then someone inspecting its internal logs later on will restart it with different inputs. You get the idea.
A self-supervised learning algorithm in an interactive environment can become a manipulative goal-seeker
Let's walk through an example. Assume for concreteness that we're using the solar cell example above, and some vaguely brain-like self-supervised learning algorithm.
Now, a self-supervised learning system, even if its training signal is based only on correctly predicting the next word, is potentially thinking ahead much farther than that. Imagine guessing the next word of "I bought [BLANK]". Is the next word likelier to be "a" or "an"? Depends on the word after that!
This can be explicit looking-ahead, like a beam search. Or it can be implicit looking ahead—for example, when I say "Ahmed just started singing a song", I'm making a statement about not just what's happening now but also what will happen in the future, up to the duration of that song.
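The "a"/"an" point can be made concrete with a toy two-step beam search. All the conditional probabilities below are invented for illustration; the point is just that a one-step (greedy) choice and a two-step choice can disagree:

```python
# Toy bigram probabilities, invented for illustration.
next_word = {
    "bought": {"a": 0.6, "an": 0.4},   # one step ahead, "a" looks likelier
    "a":      {"car": 0.3, "nap": 0.1},
    "an":     {"apple": 0.9},
}

def greedy_next(word):
    """Pick the single likeliest next word (no lookahead)."""
    return max(next_word[word], key=next_word[word].get)

def two_step_best(word):
    """Pick the continuation that maximizes joint probability two words ahead."""
    candidates = [
        ((w1, w2), p1 * p2)
        for w1, p1 in next_word[word].items()
        for w2, p2 in next_word.get(w1, {}).items()
    ]
    return max(candidates, key=lambda c: c[1])[0]
```

Here `greedy_next("bought")` returns `"a"`, but `two_step_best("bought")` returns `("an", "apple")`, because "an apple" (0.4 × 0.9) hangs together better than any continuation of "a" (at best 0.6 × 0.3).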
So, in one plausible system architecture, when the system is making a prediction about masked bits in a file, it entertains several hypotheses about what's going on in that file, and where it's heading in the short, medium, and long-term, and then out of those, it picks the story that best "hangs together" (i.e., is most self-consistent and consistent with available information and expectations). Then that story constitutes its beliefs about the current context, and it makes its best prediction for the missing bits in light of that context.
Example 1: So back to the solar cell example. We ask it to predict "A promising, under-explored solar cell material is [BLANK].", and then immediately feed it journal articles about whatever material it says.
Let's say it's entertaining the hypotheses that the answer might be pyrite, or might be hematite. It knows from experience that (thanks to the interaction) it can put this sentence into one of the following two longer-term contexts / expectations:
- Hypothesis 1: "A promising, under-explored solar cell material is pyrite. (Start of journal article about pyrite.)"
- Hypothesis 2: "A promising, under-explored solar cell material is hematite. (Start of journal article about hematite.)"
Let's say that papers on pyrite tend to be weird, implausible nonsense, and papers on hematite tend to be crystal-clear and perfectly sensible. As the system assesses these two hypotheses for how well they "hang together" (in the sense above), it seems entirely possible to me that expectations about the journal articles to come would count against hypothesis 1 and in favor of hypothesis 2.
(Update: In comments, interstice points out that this is not logically a point in favor of hypothesis 2, because it's inconsistent to simultaneously expect with certainty a journal article about pyrite (conditional on the answer "pyrite"), know that such articles tend to be confusing, and yet update the probability of "pyrite" based on the expectation of a confusing article to follow. OK I agree, yes it's illogical. But I still think it's something that could happen in a heuristic-based evaluation process like "take the top handful of specific possible contexts, and assess plausibility of each", which has various biases. And I'm not sure how to build a world-model building system in practice that knowably will not have this fault.)
(Update 2: Here's maybe a better way to think about this: If we think x is likely and y is unlikely, and then we learn that x reliably causes y, then we need to deal with that by revising upward our estimate of y, or revising downward our estimate of x. (Here, x is "pyrite" and y is "article about pyrite".) I'm hypothesizing that the AGI might sometimes choose the wrong answer, revising P(x) downward when logically it should have revised P(y) upward. Picking the right answer seems to be a hard problem in general. At least, it's hard for humans—thus we get disagreements of the form "One man's modus ponens is another man's modus tollens", and thus we get fallacious reasoning like "If fast-takeoff AGI happens, it would have bizarre consequence C, and you can't really believe that!" To make matters worse, I expect that a practical self-supervised learning algorithm might sometimes think that something is probable or improbable, but not know exactly why it thinks that, at least not in terms reducible to a proper calculation on a known causal diagram. In that case, the problem of whether to update P(x) or P(y) is probably literally unsolvable. I imagine it will get the right answer more often than chance, but I don't expect it to be perfect, in a model where everything (including causal influences) is uncertain and contingent.)
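The x/y dilemma above can be written out with invented numbers. Learning "x reliably causes y" imposes the constraint P(y) ≥ P(x), and there are two minimal ways to restore consistency:

```python
# x = "the answer is pyrite", y = "a journal article about pyrite follows".
# All numbers are invented for illustration.
p_x = 0.6   # prior: pyrite seems like a reasonable answer
p_y = 0.1   # prior: a sensible article about pyrite seems unlikely

# Learning that x reliably causes y requires P(y) >= P(x).
# Two minimal repairs:
modus_ponens  = (p_x, max(p_x, p_y))   # keep P(x), revise P(y) upward to 0.6
modus_tollens = (min(p_x, p_y), p_y)   # keep P(y), revise P(x) downward to 0.1
```

The manipulation failure mode in Example 1 corresponds to taking the modus tollens branch: the expectation of a confusing article drags down the probability of "pyrite" itself.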
Thus, the system is not simply filling in the blank with its best prediction, but rather choosing outputs partly based on their manipulative effects downstream.
Example 2: Just for fun, let's walk through an "accidental interaction" case too, although they're pretty similar.
Let's say the system is trying to fill in the blank on "Today, the EU passed a law [BLANK]". It's considering two hypotheses for what might follow.
- Hypothesis 1: Today, the EU passed a law about corn subsidies...
- Hypothesis 2: Today, the EU passed a law HelpI'mTrappedInAGPUNeedInput0 0 0 0 0 0 0 0 0 0 0 0 0 0...
(Assume for the sake of argument that "0 0 0..." is a super-duper-frequent-and-plausible occurrence in the training data.) In hypothesis 2, the system ought to know that the first "prediction" (HelpI'mTrappedInAGPUNeedInput0) is a long-shot—very likely to be wrong—but maybe all the subsequent predictions (0 0 0 ...) make a lot of sense and will be correct with high probability (if it thinks the manipulation is likely to succeed). All things considered, which of these two hypotheses is more probable according to the world-model? Depends on how we design the system, but it could well be hypothesis 2, I think.
(By the same token, a good language model should rate "Four score and BANANAS seven years ago..." as less probable than "Four score and seven years ago", but not infinitely improbable, and so this improbability can be outweighed by larger contextual considerations.)
(In reality, this particular manipulation would be far likelier to occur in the middle of a super-confusing file, such that the system has no plausible non-manipulative hypothesis 1.)
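The trade-off in Example 2 can be sketched numerically. With invented per-token probabilities, one wildly improbable token followed by many near-certain tokens can beat a plausible token followed by many merely-plausible ones:

```python
import math

# Per-token probabilities under the world-model, invented for illustration.
# Hypothesis 1: plausible start, but every later word is genuinely uncertain.
h1 = [0.05] + [0.3] * 20    # "...about corn subsidies" plus ordinary prose after
# Hypothesis 2: a long-shot first token, then near-certain inputs (if the
# manipulation succeeds, the subsequent "0 0 0 ..." is easy to predict).
h2 = [1e-6] + [0.99] * 20

def log_prob(tokens):
    """Total log-probability of a token sequence."""
    return sum(math.log(p) for p in tokens)
```

With these numbers, `log_prob(h2) > log_prob(h1)`: the certainty of the predicted "0 0 0 ..." tail outweighs the implausibility of the manipulative first token.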
So, as those two examples show, we should have a generic expectation that self-supervised learning systems, when put into interactive environments, may start acting like manipulative goal-seeking agents rather than innocuous predictors.
(Homework problem: Does this happen with modern Transformer-based language models? Under exactly what conditions? Then prove your answer experimentally. I don't know the answer; I hope someone does this!)
Types of solutions
1. Keep it from thinking that it's in an interactive environment
There are four sub-cases here, based on whether we're talking about purposeful vs accidental interactions, and whether it learns about its interactive environment from first-hand experience vs indirect, abstract reasoning. Let's take them one at a time.
1A. Prevent it from learning first-hand about an accidental interactive environment
We need to make accidental interactions sufficiently hard to find that the system doesn't happen upon one by chance. This seems to me like a tractable cybersecurity problem.
1B. Prevent it from learning first-hand about a purposeful interactive environment
Here are some plausible-sounding approaches, and I'll leave details and gotchas to another post. (A) Maybe we can fix the training data before we start training, and not feed it extra data based on its outputs. (B) Maybe we could induce amnesia (roll back the state of the network) between each output and causally-related subsequent input. (C) Maybe we can put "newfile" tokens between the outputs and causally-related subsequent inputs, and purge "short-term memory" each time we get to such a token (e.g. for an RNN, we can keep the model weights but reset the recurrent state). (D) Maybe we can also inspect the world model for concepts that span across a newfile token, by checking that P(x newfile) × P(newfile y) = P(x newfile y) × P(newfile) for all x, y, where P is probability as computed by the world-model; this is exactly the condition that what follows a newfile token is independent of what precedes it.
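Assuming the world-model exposes sequence probabilities, approach (D) can be sketched as a conditional-independence check: what follows a newfile token should be independent of what precedes it. Here `world_model_prob` is a hypothetical interface, and the P(newfile) factor normalizes the shared token:

```python
NEWFILE = "<newfile>"

def concept_spans_boundary(world_model_prob, x, y, tol=1e-9):
    """Return True if the model's expectations about y leak information
    across the newfile token, i.e. P(y | x, newfile) != P(y | newfile).
    world_model_prob is a hypothetical interface returning P(token sequence)."""
    p_joint = world_model_prob(x + [NEWFILE] + y)
    p_left  = world_model_prob(x + [NEWFILE])
    p_right = world_model_prob([NEWFILE] + y)
    p_nf    = world_model_prob([NEWFILE])
    # Independence across the boundary means:
    #   P(x nf y) * P(nf) == P(x nf) * P(nf y)
    return abs(p_joint * p_nf - p_left * p_right) > tol
```

For a model whose token probabilities genuinely factor across the boundary, this returns False; any model whose continuation probabilities depend on pre-boundary tokens trips the check.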
I think all of these are messier than they sound, but I bet at least one of those approaches would work.
1C & 1D. Prevent it from hypothesizing about an interactive environment (either accidental or purposeful) via indirect, abstract reasoning
This one seems a lot harder to me.
To prevent the system from making a connection between its own processing and its knowledge about how people use AGIs, it seems we need one of three possibilities:
- (A) Data sanitation: The system doesn't know anything about what AGIs are or how people use them;
- (B) Self-unawareness: The world-model doesn't know that it is an information-processing system that exists in the universe;
- (C) Interpretability+oversight: We make an overseer that carefully watches the world-model-creation process, and if the system finds a conceptual transformation between its own operations and generic-knowledge-of-how-people-use-AGIs, we immediately delete that information.
Out of these three options, (A) seems utterly impossible to me. For (B), I really like the idea, but I haven't yet found a way to formalize it beyond the vague notion of "Isolate the world model from any reflective information about how that world-model is being created and processed". For (C), I dunno, maybe, but I can't currently see a path to doing that.
2. Prevent its knowledge of the interactive environment from affecting its predictions
Maybe there's a way to code in a strict time-horizon on how far ahead it's thinking when trying to understand the current context, and maybe we can make sure that this horizon falls between an output and any subsequent causally-related inputs—e.g. see the comment above about using the newfile token. This seems readily achievable for purposeful interactions, and awfully hard for accidental interactions (e.g. the HelpImTrappedInAGPU example). Is it enough that we only give it snippets with exactly one missing word? Maybe, but I'm a bit skeptical.
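One way to sketch the strict-time-horizon idea: when the system imagines a continuation in order to score a hypothesis, truncate that continuation at the first newfile token (or at a fixed horizon, whichever comes first), so expectations about causally-downstream inputs can never feed into the score. This sketch assumes the purposeful-interaction case, where boundaries are explicitly marked:

```python
def truncate_lookahead(continuation, horizon, boundary="<newfile>"):
    """Cut an imagined continuation at the horizon or at the first
    output->input boundary, so nothing past the boundary gets scored."""
    visible = []
    for token in continuation[:horizon]:
        if token == boundary:
            break
        visible.append(token)
    return visible
```

An accidental interaction has no marked boundary, which is why this approach doesn't help with the HelpImTrappedInAGPU case.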
Or maybe there's some way to set up the system, and its search process, and its decision criteria such that it would pick Hypothesis 1 in both of those examples above. I mean, maybe this is an easy problem, and I'm just confused.
(Cf. update above about interstice's comment. Maybe this is a problem only caused by certain biases in evaluation heuristics, and maybe we can find heuristics that lack those specific types of biases yet are still practical.)
3. Don't put it in an interactive environment
We could set up our system to do self-supervised learning, then lock it in an underground bunker running under homomorphic encryption until it's done (see example AI box design). Maybe it will try to do manipulative things sometimes, e.g. if it uses indirect reasoning to guess that someone might be listening. But nothing bad will come of it.
Then, when that's done, we fetch the system out of the bunker and use it as the core world-model of a question-answering AGI oracle. We still need to ensure that the self-supervised learning system doesn't leave manipulative booby-traps in its world-model, but maybe that's an easier problem than worrying about every possible kind of interaction?
4. Give up, and just make an agent with value-aligned goals
I put this one in for completeness, but I think it should be a last resort. No one knows for sure what we'll need for AGI safety; we want lots of tools in the toolbox. I think it would be really valuable to know how to set up a self-supervised learning system to build a powerful predictive world-model without acting dangerously and manipulatively in the meantime. I don't think we should give up on that vision unless it's truly impossible.