Introduction

Current discourse around AI safety focuses on creating a new paradigm for training AI models that would make them safe, such as assistance games (Russell 2021) or corrigibility (Soares 2014, 2018). The prevailing view seems to be that the only thing keeping current AI systems safe is the fact that they are not very smart. This paper presents the opposite view: the current paradigm has at least two elements that keep it safe even when extended to superhuman intelligence.

Generative Objectives

The key current trend in AI is generating text (or images, sound, and video) from the training distribution. Maximum-likelihood models like GPT-3 (Brown et al. 2020), diffusion models like DALL-E 2 (Ramesh et al. 2022), and generative adversarial networks (Goodfellow et al. 2014) all share this basic objective.

Conditional text generation is a particularly useful variety of this objective. Given a prompt, often a partial document, the task is to generate a continuation that is as likely as possible under the training distribution. GPT-3 and Codex are two prominent examples.
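
As a rough illustration, here is a minimal sketch of this conditional objective (assuming a generic autoregressive model that maps token ids to per-position logits; the function name, shapes, and masking details are illustrative, not taken from any particular codebase):

    import torch
    import torch.nn.functional as F

    def completion_loss(model, prompt_ids, completion_ids):
        # Concatenate prompt and completion; the model predicts each token
        # from the tokens that precede it.
        tokens = torch.cat([prompt_ids, completion_ids], dim=-1)
        logits = model(tokens[:, :-1])          # (batch, len - 1, vocab)
        targets = tokens[:, 1:]                 # next-token targets
        per_token = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="none",
        ).view(targets.shape)
        # Score only the completion: the prompt is given, not generated.
        is_completion = torch.arange(targets.size(1)) >= prompt_ids.size(1) - 1
        return per_token[:, is_completion].mean()

Training minimizes a loss of this form over documents drawn from web text; generation then samples continuations that score well under the same distribution.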

Web text is safe in the sense that even unfiltered, uncurated text scraped for quantity over quality is extremely unlikely to contain anything dangerous in the context of AI safety. We will call a dangerous output from a text-generating AI an "attack text". For example, the AI could generate a source code file containing a hidden exploit that replicates the AI or evades restrictions placed on it.

Consider some likely use cases for a very capable text generating AI:

  • Create or extend a GitHub repository for: better GPU performance, a better AI model, etc.
    • Similar repositories in the training data are very unlikely to contain malicious code of any type.
    • Crypto-mining scams and digital vandalism account for almost the entire space of malicious or deceptive code in such repositories.
  • Write or extend an arXiv paper on: improved AI capability, better AI interpretability, etc.
    • Preprints can be wrong, or sometimes not even wrong, but they simply do not contain dangerous trickery.

The ideas for self-improvement that the AI generates do not necessarily result in an AI that is itself safe. If the described or implemented improvements depart from the key safety properties of current AI systems, they will need to be analyzed separately.

Episodic Objectives

If an AI has human-level or superhuman intelligence, it will realize it can improve its generation of text (or pursuit of any other objective) by acquiring additional resources, protection from alteration, self-replication, and so on. These are the convergent instrumental goals (Bostrom 2012, 2014).

Suppose an AI is trained in the style of GPT-3, but at a scale and with enough model architecture improvements to give it human-level intelligence and understanding. When generating code for a new AI feature, it will understand that including hostile code to (for example) propagate itself over the internet would improve its ability to generate likely texts in the future. It will also understand that including a worm or virus targeting pesticide control systems would increase the number of insects in the world. These two facts will have equal impact on its behavior: approximately zero. It makes no difference to any prominent AI model's objective how well the AI will be able to perform the task past the current episode.

In a human, this sort of episodic preference would be so strange as to prompt clinical study. People who care about performing a task well, know they will perform that task in the future, and know they will care about performing it well then, also care about improving their ability to perform that task in the future. All people have non-episodic preferences. All current, prominent AIs have episodic objectives.

Gradient computation and exploration in reinforcement learning are two of the ways that current optimization algorithms use the current episode to improve future episodes. However, the AI model being optimized, the mesa-optimizer (Hubinger et al. 2019), never receives any training signal to care about its performance on future episodes.
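
The following sketch shows what "episodic" means at the level of the training loop (illustrative pseudocode; the function and variable names are placeholders):

    def train_episodically(model, optimizer, documents, loss_fn):
        # One episode per document. The loss for each update depends only on
        # the current document; nothing in it measures or rewards how well
        # the model will do on any future episode.
        for doc in documents:
            optimizer.zero_grad()
            loss = loss_fn(model, doc)   # current-episode loss only
            loss.backward()
            optimizer.step()             # the optimizer uses this episode to
                                         # improve future episodes, but the
                                         # model's objective never mentions them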

Layered Protection from a Generative, Episodic Objective

Consider removing the first element of safety, text generation over the distribution of texts found on the internet. Episodic objectives over plausible per-instance time frames (< 10 human-equivalent-hours) prevent the development of the types of convergent instrumental goals we would rightly worry about in an agent with human-like time preferences. There is no realistic way for a human-level AI to execute a serious hostile act that pays off within a time frame of hours.

Consider removing the second element of safety, episodic objectives. An AI model for text generation trained to match the distribution of web text will not generate attack texts. This remains true even if text or intermediate representations from the current episode are reused in future episodes, with, for example, a reinforcement learning value function encouraging the creation of reusable representations. In the current paradigm, the AI's generated texts are never implemented or executed during the training phase. Therefore, generating anything like an attack text during training will never reduce the model's loss, neither over the current episode nor over any future episode.

Furthermore, the model will not bide its time until it is confident it is outside the training phase and then pursue convergent instrumental goals with attack texts. That is an implausible mesa-objective for the AI to learn. The AI model would need to devote parameters and time to detecting whether it is still in the training phase, while every single gradient update hammers those parameters into something that actually helps during training.

Prosaic AI Alignment considers the possibility that current approaches to AI may be sufficient for AGI and could be used for dangerous, non-episodic objectives such as "using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit". In some sense, Prosaic Alignment presents the reverse of this paper's thesis. We propose to retain the generative episodic training objective of current research and to reach human-level performance through model architecture and algorithm improvements. Prosaic Alignment research hopes to develop aligned (rather than merely safe) AI objectives under the assumption that they will be implemented with current algorithms and architectures.

Myopic Objectives, as distinct from episodic objectives, include a range of ideas such as "stop gradients", discount rates of zero, and time-limited myopia. These ideas revolve around eliminating deception or convergent instrumental goals, even within a single episode. However, limiting the length of an episode may be a more easily analyzed and implemented alternative.
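
A toy sketch of the distinction (purely illustrative; the names and the episode_len parameter are not from any particular framework): a myopic objective with a discount rate of zero credits only the immediate step, while an episodic objective credits everything up to the episode boundary and nothing beyond it.

    def myopic_return(rewards):
        # Discount rate of zero: only the immediate consequence counts.
        return rewards[0]

    def episodic_return(rewards, episode_len):
        # Everything inside the episode counts equally; nothing after it counts.
        return sum(rewards[:episode_len])

Bounding episode_len bounds how far into the future any plan can be rewarded, which is the property the rest of this section relies on.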

Tool AI is a proposal to direct research away from agent AIs and toward AIs that act only at the direction of a human operator. In practical use, a conditional text generation AI trained with a generative episodic objective may well fit the idea of a Tool AI. Gwern (2016) argued that agent AIs might have key economic advantages over their tool AI counterparts. However, recent progress has shown that generative episodic AIs have key advantages in capability and economic impact, while non-episodic agent AIs have received comparatively little attention.

Objections

After considerable literature review, we have not found a direct argument against the safety of the generative episodic training objective. This section addresses some common objections against related safety arguments.

This is not true AGI.

The main outputs of AI researchers are code and papers. If an AI can generate those at the same level of quality, then a takeoff of self-improvement seems plausible.

For almost everyone, every impressive thing they have ever seen an AI do was done with an episodic training objective. This includes beating Lee Sedol at Go, writing an article about the discovery of talking unicorns, engaging in dialogue well enough to suggest sentience, and creating a picture of a cat Napoleon. The latter three were also done specifically with a generative episodic objective.

Note that unlike Prosaic AI Alignment, we propose that current training objectives are compatible with safe AGI, not that current model architectures will achieve AGI. The training objective of GPT-X can be paired with models capable of more than a constant-time sequence of tensor operations.

The limitations that make it safe also make it less capable. Someone will remove those limitations.

The generative training objective is not easy to remove: it is what enables training from large amounts of easily acquired data. Switching to a non-episodic objective would also be a significant complication. Most importantly, these are not “limitations” that can simply be removed. Instead, there would need to be new research into a method for training AIs that lacks these features but produces AIs with the same valuable capabilities.

It will generalize away its episodic objective.

It may seem like an AI that only wants to produce the best document completion it can within a few hours will, at some level of intelligence, realize that later it will also want to produce the best completion for another prompt. So perhaps it will in some sense "generalize" and start valuing progress toward better document completions even in the long term.

But this is not generalization. The alternative, longer-term objective is in direct competition with the AI's goal of producing the best text it can right now. Bostrom (2012) makes a similar point when discussing a hypothetical Future-Tuesday-Indifference (Parfit 1984). There is nothing illogical about a highly intelligent agent disregarding progress toward goals it knows it (or another copy of it) will have in the future.

The AI does not start with an anthropomorphic objective that cares about long-term consequences, retaining an interest in those consequences whenever the training objective is even halfway compatible with them. The AI starts out not caring about anything at all. It is equally ready to learn to care about how many paperclips there will be in the next ten seconds as about how many paperclips there will ever be.

Hyperparameter search or other outer loops can turn an episodic training objective into a non-episodic objective.

Self-induced distributional shift (Krueger et al. 2019) is not a safety problem in the generative episodic paradigm. No current "outer optimization loop" (hyperparameter search, architecture search, etc.) has any chance of producing complex strategic behavior. These outer loops typically iterate over an extremely restricted hypothesis space (tens of parameters) for a tiny number of iterations (fewer than 1,000). And as models grow in size, the use of such outer loops is shrinking, not growing.
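
For concreteness, a typical outer loop looks something like the sketch below (the train and validation_loss functions are hypothetical stand-ins): a handful of scalar knobs and a few dozen candidate configurations, with no room for anything resembling strategic behavior.

    import itertools

    def train(cfg):
        # Placeholder for a full training run with these hyperparameters.
        return cfg

    def validation_loss(trained):
        # Placeholder for evaluation on held-out data.
        return trained["learning_rate"] + trained["dropout"]

    search_space = {
        "learning_rate": [1e-4, 3e-4, 1e-3],
        "batch_size": [32, 64, 128],
        "dropout": [0.0, 0.1],
    }
    # 3 * 3 * 2 = 18 candidate configurations: a tiny, fixed hypothesis space.
    configs = [dict(zip(search_space, values))
               for values in itertools.product(*search_space.values())]
    best = min(configs, key=lambda cfg: validation_loss(train(cfg)))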

A generative, episodic training objective may not result in a safe mesa-objective for the AI.

Humans are mesa-optimizers with respect to evolution's optimization, and human motivations differ significantly from evolution's objective function. Importantly, though, evolution's influence over human motivation is extremely indirect: evolution selects genes that produce proteins that influence the structure and development of the brain. AI optimizers have a far more direct influence on the model being trained. Simple properties like "plans must pay off within a single episode" are very likely to be learned.

Open Problems with Myopia

It should be clear that none of the sophisticated decision-theoretic considerations involved (superrationality, anthropic uncertainty, counterfactual mugging) actually provides any benefit during generative episodic training. Might the AI model adopt them anyway, despite their incompatibility with the training objective? Realistically, no. Gradient descent is a shredder for complex parameter configurations that do not help the training objective.

Conclusion

Concepts such as convergent instrumental goals, treacherous turns, mesa-optimizers, and self-induced distributional shift are useful for analyzing AI safety. The generative episodic paradigm is safe. During training, no plan that either 1) produces anything like an attack text or 2) takes more than a single episode's duration to pay off ever receives a reinforcing gradient update. It is implausible that the AI model's mesa-optimizer will fail to embody these simple properties of the optimizer's training objective.

Attempts to make AI training objectives more like human values may decrease rather than increase safety. Convergent instrumental goals follow from human-like time preferences. A promising alternative direction is research into how to ensure that future AI enhancements retain the generative episodic training objective, and that these aspects of the training objective are easily learned by any mesa-optimizer.

References

Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press. ISBN 9780199678112.

Bostrom, N. The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Minds & Machines 22, 71–85 (2012). https://doi.org/10.1007/s11023-012-9281-3

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.

Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., & Bengio, Y. (2014). Generative Adversarial Nets. NIPS.

Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.

Krueger, D. S., Maharaj, T., Legg, S., & Leike, J. (2019). Hidden incentives for self-induced distributional shift.

Parfit, D. (1984). Reasons and Persons. (pp. 123-4). Reprinted and corrected edition, 1987. Oxford: Oxford University Press.

Russell, S. (2021). Human-compatible artificial intelligence. Human-Like Machine Intelligence, 3-23.

Soares, N., & Fallenstein, B. (2014). Questions of reasoning under logical uncertainty. Technical report, Machine Intelligence Research Institute. URL: https://intelligence.org/files/QuestionsLogicalUncertainty.pdf

Soares, N. (2018). The value learning problem. In Artificial intelligence safety and security (pp. 89-97). Chapman and Hall/CRC.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

Comments

I agree with the key point, but for follow-up work I'd ask: what is the pitch to robotics teams who are using diffusion for planning about how to constrain their objectives so as not to lose these benefits?

To restate the core AI safety point in the current context: a very strong diffusion planner may be able to propagate structured logical consequences from much further in the future, depending on how strongly it generalizes, and could then be subject to goal misgeneralization. While the main generative objective does not specifically incentivize this, the approximate constraint satisfaction of diffusion would, by default, become more and more able to reason about long-term consequences as it becomes able to internally tokenize those consequences. This sort of generalization far outside the training distribution is not something I'd expect from any sequential autoregressive model, because such models only need to pull their activation manifolds into shapes that predict the most likely next element; but a diffusion model's activation manifold is pushed to be stable in the face of all noise directions, which means a multiscale diffusion planner of the kind used in robotics could much more easily distill across examples.

I've actually never heard of diffusion for planning. Do you have a reference?

A diffusion model for text generation (like Diffusion-LM) still has the training objective of producing text from the training distribution, optimizing only over the current episode, in this case a short text.