Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Last year Stuart Armstrong announced a contest to come up with the best questions to ask an Oracle AI. Wei Dai wrote,

Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn’t see the Oracle’s answer).

He later related his answer to Paul Christiano's posts on Human-in-the-counterfactual-loop and Elaborations on apprenticeship learning. Here I'm interested in concrete things that can be expected to go wrong in the near future if we gave GPT-N this task.

To provide a specific example, suppose we provided the prompt,

This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models.

If, by assumption, GPT-N is at least as good as a human expert team at generating blog posts, we could presumably expect this GPT-N to produce a very high quality post explaining how to inspect machine learning models. We would therefore have a way to automate alignment research at a high level. But a number of important questions remain, such as:

  • How large would GPT-N need to be before it started producing answers comparable to a human expert team?
  • Given the size of the model, what high-level incentives should we expect to guide the training of the model? In other words, what mesa optimization-like instantiations can we expect to result from training, exactly?
  • Is there a clear and unambiguous danger that the model would be manipulative? If so, why?
  • Is the threat model more that we don't know what we don't know, or that we have a specific reason to believe the model would be manipulative in a particular direction?


(sorry, couldn't resist)

This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models. We will be discussing each approach in turn, with a focus on how they differ from one another. 

The goal of this series is to provide a more complete picture of the various options for auditing AI systems than has been provided so far by any single person or organization. The hope is that it will help people make better-informed decisions about which approach to pursue. 

We have tried to keep our discussion as objective as possible, but we recognize that there may well be disagreements among us on some points. If you think we've made an error, please let us know! 

If you're interested in reading more about the history of AI research and development, see: 

1. What Is Artificial Intelligence? (Wikipedia)
2. How Does Machine Learning Work?
3. How Can We Create Trustworthy AI?

The first question we need to answer is: what do we mean by "artificial intelligence"? 

The term "artificial intelligence" has been used to refer to a surprisingly broad range of things. The three most common uses are: 

  • The study of how to create machines that can perceive, think, and act in ways that are typically only possible for humans.
  • The study of how to create machines that can learn, using data, in ways that are typically only possible for humans.
  • The study of how to create machines that can reason and solve problems in ways that are typically only possible for humans.

In this sequence, we will focus on the third definition. We believe that the first two are much less important for the purpose of AI safety research, and that they are also much less tractable. 

Why is it so important to focus on the third definition? 

The third definition is important because, as we will discuss in later posts, it is the one that creates the most risk. It is also the one that is most difficult to research, and so it requires the most attention.


How many samples did you prune through to get this? Did you do any re-rolls? What was your stopping procedure?

This was literally the first output, with no rerolls in the middle! (Although after posting it, I did some other trials which weren't as good, so I did get lucky on the first one. Randomness parameter was set to 0.5.) I cut it off there because the next paragraph just restated the previous one.
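For readers unfamiliar with what a "randomness parameter" does: assuming it corresponds to the usual sampling temperature (an assumption on my part, the interface used above isn't specified), here is a minimal sketch of temperature-scaled sampling. The function names and the toy logits are hypothetical and only for illustration; this is not the code of any particular GPT frontend.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Toy scores for three candidate next tokens (hypothetical values).
toy_logits = [2.0, 1.0, 0.0]
p_default = softmax_with_temperature(toy_logits, 1.0)
p_half = softmax_with_temperature(toy_logits, 0.5)
```

Under this reading, a setting of 0.5 concentrates probability on the highest-scoring tokens compared with 1.0, so outputs are more conservative but still vary between runs, which would be consistent with later trials coming out differently.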

A general method for identifying dangers: For every topic which gets discussed on AF, figure out what could go wrong if GPT-N decided to write a post on that topic.

  • GPT-N writes a post about fun theory. It illustrates principles of fun theory by describing an insanely fun game you can play with an ordinary 52-card deck. FAI work gets pushed aside as everyone becomes hooked on this new game. (Procrastination is an existential threat!)

  • GPT-N writes a post about human safety problems. To motivate its discussion, it offers some extraordinarily compelling reasons why the team which creates the first AGI might want to keep the benefits to themselves.

  • GPT-N writes a post about wireheading. In the "Human Wireheading" section, it describes an incredibly easy and pleasurable form of meditation. Soon everyone is meditating 24/7.

  • GPT-N writes a post about s-risks. Everyone who reads it gets a bad case of PTSD.

  • GPT-N writes a post about existential hope. Everyone who reads it becomes unbearably impatient for the posthuman era. Security mindset becomes a thing of the past. Alternatively, everyone's motivation for living in the present moment gets totally sapped. There are several high-profile suicides.

  • GPT-N has an incredibly bad take on decision theory, game theory, and blackmail. It gets deleted from AF. The Streisand effect occurs and millions of people read it.

  • GPT-N offers a very specific answer to the question "What specific dangers arise when asking GPT-N to write an Alignment Forum post?"

For the prompt you provided, one risk would be that GPT-N says the best way to audit cognition is to look for each of these 10 different types of nefarious activity, and in describing the 10 types, it ends up writing something nefarious.

GPT-N might inadvertently write a post which presents an incredibly compelling argument for an incorrect and harmful conclusion ("FAI work doesn't matter because FAI is totally impossible"), but one hopes that you could simply use GPT-N to write a counterargument to that post to see if the conclusion is actually solid. (Seems like good practice for GPT-N posts in general.)

To me the most obvious risk (which I don't ATM think of as very likely for the next few iterations, or possibly ever, since the training is myopic/SL) would be that GPT-N in fact is computing (e.g. among other things) a superintelligent mesa-optimization process that understands the situation it is in and is agent-y. This risk is significantly more severe if nobody realizes this is the case or is looking out for it.

In this case, the mesa-optimizer probably has a lot of leeway in terms of what it can say while avoiding detection. Everything it says has to stay within some "plausibility space" of arguments that will be accepted by readers (I'm neglecting more sophisticated mind-hacking, but probably shouldn't), but for many X, it can probably choose between compelling arguments for X and not-X in order to advance its goals. (If we used safety-via-debate, and it worked, that would significantly restrict the "plausibility space".)

Now, if we're unlucky, it can convince enough people that something that effectively unboxes it is safe and a good idea.

And once it's unboxed, we're in a Superintelligence-type scenario.


Another risk that could occur (without mesa-optimization) would be incidental belief-drift among alignment researchers, if it just so happens that the misalignment between "predict next token" and "create good arguments" is significant enough.

Incidental deviation from the correct specification is usually less of a concern, but with humans deciding which research directions to pursue based on outputs of GPT-N, there could be a feedback loop...

I think I believe the AI alignment research community is good enough at tracking the truth that this seems less plausible?

On the other hand, it becomes harder to track the truth if there is an alternative narrative plowing ahead making much faster progress... So if GPT-N enables much faster progress on a particular plausible seeming path towards alignment that was optimized for "next token prediction" rather than "good ideas"... I guess we could end up rolling the dice on whether "next token prediction" was actually likely to generate "good ideas".

To me the most obvious risk (which I don't ATM think of as very likely for the next few iterations, or possibly ever, since the training is myopic/SL) would be that GPT-N in fact is computing (e.g. among other things) a superintelligent mesa-optimization process that understands the situation it is in and is agent-y.

Do you have any idea of what the mesa objective might be? I agree that this is a worrisome risk, but I was more interested in the type of answer that specified, "Here's a plausible mesa objective given the incentives." Mesa optimization is a more general risk that isn't specific to the narrow training scheme used by GPT-N.

Ricardo Meneghin (3y):
The mesa-objective could be perfectly aligned with the base-objective (predicting the next token) and still have terrible unintended consequences, because the base-objective is unaligned with actual human values. A superintelligent GPT-N which simply wants to predict the next token could, for example, try to break out of the box in order to obtain more resources and use those resources to more correctly output the next token. This would have to happen during a single inference step, because GPT-N really just wants to predict the next token, but its mesa-optimization process may conclude that world domination is the best way of doing so. Whether such a system could be learned through current gradient-descent optimizers is unclear to me.
David Scott Krueger (formerly: capybaralet) (3y):
No, and I don't think it really matters too much... what's more important is the "architecture" of the "mesa-optimizer". It's doing something that looks like search/planning/optimization/RL. Roughly speaking, the simplest form of this model of how things work says: "It's so hard to solve NLP without doing agent-y stuff that when we see GPT-N produce a solution to NLP, we should assume that it's doing agent-y stuff on the inside... i.e. what probably happened is it evolved or stumbled upon something agent-y, and then that agent-y thing realized the situation it was in and started plotting a treacherous turn".
David Scott Krueger (formerly: capybaralet) (3y):
In other words, there is a fully general argument for learning algorithms producing mesa-optimization to the extent that they use relatively weak learning algorithms on relatively hard tasks. It's very unclear ATM how much weight to give this argument in general, or in specific contexts. But I don't think it's particularly sensitive to the choice of task/learning algorithm.

It may be the case that solving inner alignment problems means hitting a narrow target; meaning that if we naively carry out a super-large-scale training process that spits out a huge AGI-level NN, dangerous logic is very likely to arise somewhere in the NN at some point during training. Since this concern doesn't point at any specific-type-of-dangerous-logic I guess it's not what you're after in this post; but I wouldn't classify it as part of the threat model that "we don't know what we don't know".

Having said all that, here's an attempt at describing a specific scenario as requested:

Suppose we finally train our AGI-level GPT-N and we think that the distribution it learned is "the human writing distribution", HWD for short. HWD is a distribution that roughly corresponds to our credences when answering questions like "which of these two strings is more likely to have appeared on the internet prior to 2020-07-28?". But unbeknown to us, the inductive bias of our training process made GPT-N learn the distribution HWD*, which is just like HWD except that some fraction of [the strings with a prefix that looks like "a prompt by humans-trying-to-automate-AI-safety"] are manipulative and make AI safety researchers, upon reading, invoke an AGI with a goal system X. It turns out that the inductive bias of our training process caused GPT-N to model agents-with-goal-system-X, and such agents tend to sample lots of strings from the HWD* distribution in order to "steal" the cosmic endowment of reckless civilizations like ours. This would be a manifestation of the same type of failure mode as the universal prior problem.

One class of problem comes about if GPT-N starts thinking about "what would a UFAI do in situation X":

  • Inspired by AI box experiments, GPT-N writes a post about the danger posed by ultra-persuasive AI-generated arguments for bad conclusions, and provides a concrete example of such an argument.
  • GPT-N writes a post where it gives a detailed explanation of how a UFAI could take over the world. Terrorists read the post and notice that UFAI isn't a hard requirement for the plan to work.
  • GPT-N begins writing a post about mesa-optimizers and starts simulating a mesa-optimizer midway through.

As far as I understand GPT-N, it's not very agent-like (it doesn't perform a me-vs-environment abstraction and doesn't look for ways to transform its perceived environment to satisfy some utility function). I wouldn't expect it to "scheme" against people since it lacks any concept of "affecting its environment".

However it seems likely that GPT-N can perfect the skill of crowd-pleasing (we already see that; we're constantly amazed by it, despite little meaning of created texts). It can precisely modulate its tone and identify the talking points that get the most response.

So I expect the GPT-N generated texts to sound really persuasive, not because of novel ideas but because of a superhuman ability to compose ideas it has already heard into a persuasive essay.

I would expect GPT-N to focus on presenting solutions for alignment (therefore making us overly optimistic about naive approaches), presenting novel risks (it's easy to make something up by simple rehashing), and possibly venturing into philosophical muddying of the waters (humans prove to be very easily engaged by certain topics, like self-consciousness).

we already see that; we're constantly amazed by it, despite little meaning of created texts

But GPT-3 is only trained to minimize prediction loss, not to maximize response. GPT-N may be able to crowd-please if it's trained on approval, but I don't think that's what's currently happening.

Jan Rzymkowski (3y):
Upon reflection, you're right that it won't be maximizing response per se. But as we get deeper it's not so straightforward. GPT-3 models can be trained to minimize prediction loss (or, plainly speaking, to simply predict more accurately) on many different tasks, which usually are very simply stated (e.g. choose a word that would fill the blank). But we end up with people taking models trained this way and using them to generate long texts based on some primer. And yes, in most cases such abuse of the model will end up with text that is simply coherent. But I would expect humans to have a tendency to conflate coherence and persuasiveness. I suppose one can fairly easily choose a prediction loss for GPT-3 models such that the longer texts would have some desired characteristics. But even standard tasks probably shape GPT-3 so that it keeps producing vague sentences that continue the primer and give the reader a feeling of "it making sense". That could entail producing fairly persuasive texts that reinforce the primer's thesis.
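To make "minimize prediction loss" in this exchange concrete, here is a minimal sketch of the standard next-token objective (average negative log-likelihood of the token that actually came next). The function name and the toy probabilities are hypothetical, and this illustrates the general technique rather than OpenAI's training code.

```python
import math


def next_token_loss(predicted_probs, target_ids):
    """Average negative log-likelihood of the tokens that actually came next.

    predicted_probs[i] is the model's distribution over the vocabulary at
    position i; target_ids[i] is the index of the true next token there.
    """
    if len(predicted_probs) != len(target_ids):
        raise ValueError("need one target per predicted distribution")
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)


# Two positions, vocabulary of three tokens (hypothetical numbers).
targets = [0, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

loss_confident = next_token_loss(confident, targets)
loss_uniform = next_token_loss(uniform, targets)
```

Note that nothing in this quantity refers to how readers react to the text. Any crowd-pleasing or persuasiveness would have to arise indirectly, through the training data or through how the model is used at generation time, which seems to be the crux of the disagreement above.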