# [AN #141]: The case for practicing alignment work on GPT-3 and other large models

Alignment Newsletter, 12 min read, 10th Mar 2021, 4 comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

# HIGHLIGHTS

The case for aligning narrowly superhuman models (Ajeya Cotra) (summarized by Rohin): One argument against work on AI safety is that it is hard to do good work without feedback loops. So how could we get feedback loops? The most obvious approach is to actually try to align strong models right now, in order to get practice with aligning models in the future. This post fleshes out what such an approach might look like. Note that I will not be covering all of the points mentioned in the post; if you find yourself skeptical, you may want to read the full post as your question might be answered there.

The author specifically suggests that we work on aligning narrowly superhuman models to make them more useful. Aligning a model roughly means harnessing the full capabilities of the model and orienting these full capabilities towards helping humans. For example, GPT-3 presumably “knows” a lot about medicine and health. How can we get GPT-3 to apply this knowledge as best as possible to be maximally useful in answering user questions about health?

Narrowly superhuman means that the model has more knowledge or “latent capability” than either its overseers or its users. In the example above, GPT-3 almost certainly has more medical knowledge than laypeople, so it is at least narrowly superhuman at “giving medical advice” relative to laypeople. (It might even be so relative to doctors, given how broad its knowledge is.)

Learning to Summarize with Human Feedback (AN #116) is a good example of what this could look like: that paper attempted to “bring out” GPT-3’s latent capability to write summaries, and outperformed the reference summaries written by humans. This sort of work will be needed for any new powerful model we train, and so it has a lot of potential for growing the field of people concerned about long-term risk.

Note that the focus here is on aligning existing capabilities to make a model more useful, and so simply increasing capabilities doesn’t count. As a concrete example, just scaling up the model capacity or training data or compute would not count as an example of “aligning narrowly superhuman models”, even though it might make the model more useful, since scaling increases raw capabilities without improving alignment. This makes it pretty different from what profit-maximizing companies would do by default: instead of baking in domain knowledge and simply scaling up models in order to solve the easiest profitable problems (as you would do if you wanted to maximize profit), work in this research area would look for general and scalable techniques, would not be allowed to scale up models, and would select interestingly difficult problems.

Why is this a fruitful area of research? The author points out four main benefits:

1. Most importantly, the more we align systems ahead of time, the more likely that researchers will be able to put thought and consideration into new issues like treacherous turns, rather than spending all their time putting out fires.

2. We can build practical know-how and infrastructure for alignment techniques like learning from human feedback.

3. As the world gets progressively faster and crazier, we’ll have better AI assistants helping us to navigate the world.

4. It improves our chances of discovering or verifying a long-term or “full” alignment solution.

See also MIRI’s comments, which were more positive than I expected.

Read more: MIRI comments

Rohin's opinion: I am very sympathetic to the argument that we should be getting experience with aligning powerful models right now, and would be excited to see more work along these lines. As the post mentions, I personally see this sort of work as a strong baseline, and while I currently think that the conceptual work I’m doing is more important, I wouldn’t be surprised if I worked on a project in this vein within the next two years.

I especially agree with the point that this is one of the most scalable forms of research, and am personally working on a benchmark meant to incentivize this sort of research for similar reasons.

# TECHNICAL AI ALIGNMENT

## AGENT FOUNDATIONS

A Semitechnical Introductory Dialogue on Solomonoff Induction (Eliezer Yudkowsky) (summarized by Rohin): This post is a good introduction to Solomonoff induction and why it’s interesting (though note it is quite long).

## INTERPRETABILITY

What mechanisms drive agent behaviour? (Grégoire Déletang et al) (summarized by Rohin): A common challenge when understanding the world is that it is very hard to infer causal structure from only observational data. Luckily, we aren’t limited to observational data in the case of AI systems: we can intervene on either the environment the agent is acting in, or the agent itself, and see what happens. In this paper, the authors present an “agent debugger” that helps with this, which has all the features you’d normally expect in a debugger: you can set breakpoints, step forward or backward in the execution trace, and set or monitor variables.

Let’s consider an example where an agent is trained to go to a high reward apple. However, during training the location of the apple is correlated with the floor type (grass or sand). Suppose we now get an agent that does well in the training environment. How can we tell if the agent looks for the apple and goes there, rather than looking at the floor type and going to the location where the apple was during training?

We can’t distinguish between these possibilities with just observational data. However, with the agent debugger, we can simulate what the agent would do in the case where the floor type and apple location are different from how they were in training, which can then answer our question.
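As a rough illustration of why interventions help here, the toy sketch below compares two hypothetical policies that behave identically on the training distribution. The policy names, the two-sided layout, and the `success_rate` helper are all assumptions made for this example; none of it is the paper's actual debugger or environment.

```python
# Toy sketch (not the paper's agent debugger): two hypothetical policies
# that are indistinguishable on training data but differ under intervention.
GRASS, SAND = "grass", "sand"
LEFT, RIGHT = "left", "right"


def apple_seeking_policy(floor, apple_side):
    """Goes to wherever the apple actually is."""
    return apple_side


def floor_following_policy(floor, apple_side):
    """Ignores the apple and relies on the floor/side correlation from training."""
    return LEFT if floor == GRASS else RIGHT


# Training distribution: the apple is on the left exactly when the floor is grass.
training_episodes = [(GRASS, LEFT), (SAND, RIGHT)]

# Interventions: set floor type and apple location so the correlation is broken.
intervened_episodes = [(GRASS, RIGHT), (SAND, LEFT)]


def success_rate(policy, episodes):
    """Fraction of episodes in which the policy reaches the apple."""
    hits = sum(policy(floor, apple) == apple for floor, apple in episodes)
    return hits / len(episodes)


for name, policy in [
    ("apple-seeking", apple_seeking_policy),
    ("floor-following", floor_following_policy),
]:
    print(
        name,
        "train:", success_rate(policy, training_episodes),
        "intervened:", success_rate(policy, intervened_episodes),
    )
```

Both policies reach the apple in every training episode, so observation alone cannot separate them; only the intervened episodes reveal the difference.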

We can go further: using the data collected from simulations using the agent debugger, we can also build a causal model that explains how the agent makes decisions. We do have to identify the features of interest (i.e. the nodes in the causal graph), but the probability tables can be computed automatically from the data from the agent debugger. The resulting causal model can then be thought of as an “explanation” for the behavior of the agent.
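To make the "probability tables from data" step concrete, here is a minimal sketch that estimates P(action | do(floor), do(apple location)) by repeatedly intervening on a black-box agent. The agent below is a made-up stand-in (it follows the apple 90% of the time, an arbitrary choice), and `estimate_cpt` is a hypothetical helper rather than anything from the paper.

```python
import random
from collections import Counter, defaultdict
from itertools import product

FLOORS = ("grass", "sand")
SIDES = ("left", "right")


def black_box_agent(floor, apple_side, rng):
    """Hypothetical stand-in for a trained agent: mostly follows the apple,
    occasionally falls back on the floor type (the 0.9 is an assumption)."""
    if rng.random() < 0.9:
        return apple_side
    return "left" if floor == "grass" else "right"


def estimate_cpt(agent, n_rollouts=1000, seed=0):
    """Estimate P(action | do(floor), do(apple_side)) from intervened rollouts."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    # The nodes of interest (floor, apple side, action) are chosen by hand;
    # only the table entries are computed from data.
    for floor, apple in product(FLOORS, SIDES):
        for _ in range(n_rollouts):
            counts[(floor, apple)][agent(floor, apple, rng)] += 1
    return {
        setting: {action: c[action] / n_rollouts for action in SIDES}
        for setting, c in counts.items()
    }


cpt = estimate_cpt(black_box_agent)
for setting, dist in cpt.items():
    print(setting, dist)
```

Rows in which the action distribution changes with the apple location but not with the floor type would suggest that the apple, rather than the floor, is driving the behavior.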

Rohin's opinion: I very much like the general idea that we really can look at counterfactuals for artificial agents, given that we can control their inputs and internal state. This is the same idea underlying cross-examination (AN #86), as well as various other kinds of interpretability research.

In addition, one nice aspect of causal models as your form of “explanation” is that you can modulate the size of the causal model based on how many nodes you add to the graph. The full causal model for e.g. GPT-3 would be way too complex to understand, but perhaps we can get a high-level understanding with a causal model with higher-level concepts. I’d be very interested to see research tackling these sorts of scaling challenges.

## FORECASTING

How does bee learning compare with machine learning? (Guilhermo Costa) (summarized by Rohin): The biological anchors approach (AN #121) to forecasting AI timelines estimates the compute needed for transformative AI based on the compute used by animals. One important parameter of the framework is needed to “bridge” between the two: if we find that an animal can do a specific task using X amount of compute, then what should we estimate as the amount of compute needed for an ML model to do the same task? This post aims to better estimate this parameter, by comparing few-shot image classification in bees to the same task in ML models. I won’t go through the details here, but the upshot is that (after various approximations and judgment calls) ML models can reach the same performance as bees on few-shot image classification using 1,000 times less compute.

If we plug this parameter into the biological anchors framework (without changing any of the other parameters), the median year for transformative AI according to the model changes from 2050 to 2035, though the author advises only updating to (say) 2045 since the results of the investigation are so uncertain. The author also sees this as generally validating the biological anchors approach to forecasting timelines.

Rohin's opinion: I really liked this post: the problem is important, the approach to tackle it makes sense, and most importantly it’s very easy to follow the reasoning. I don’t think that directly substituting the 1,000 number into the timelines calculation is the right approach; I think there are a few reasons (explained here, some of which were mentioned in the post) to think that the comparison was biased in favor of the ML models. I would instead wildly guess that this comparison suggests that a transformative model would use 20x less compute than a human, which still shortens timelines, probably to 2045 or so. (This is before incorporating uncertainty about the conclusions of the report as a whole.)

## MISCELLANEOUS (ALIGNMENT)

On the alignment problem (Rob Wiblin and Brian Christian) (summarized by Rohin): This 80,000 Hours podcast goes over many of the examples from Brian’s book, The Alignment Problem (AN #120). I recommend listening to it if you aren’t going to read the book itself; the examples and stories are fascinating. (Though note I only skimmed through the podcast.)

Epistemological Framing for AI Alignment Research (Adam Shimi) (summarized by Rohin): This post recommends that we think about AI alignment research in the following framework:

1. Defining the problem and its terms: for example, we might want to define “agency”, “optimization”, “AI”, and “well-behaved”.

2. Exploring these definitions, to see what they entail.

3. Solving the now well-defined problem.

This is explicitly not a paradigm, but rather a framework in which we can think about possible paradigms for AI safety. A specific paradigm would choose a specific problem formulation and definition (or at least something significantly more concrete than “solve AI safety”). However, we are not yet sufficiently deconfused to be able to commit to a specific paradigm; hence this overarching framework.

# AI GOVERNANCE

NSCAI Final Report (Eric Schmidt et al) (summarized by Rohin): In the US, the National Security Commission on AI released their report to Congress. The full pdf is over 750 pages long, so I have not read it myself, and instead I’m adding in some commentary from others. In their newsletter, CSET says that highlights include:

- A warning that the U.S. military could be at a competitive disadvantage within the next decade if it does not accelerate its AI adoption. The report recommends laying the foundation for widespread AI integration by 2025, comprising a DOD-wide digital ecosystem, a technically literate workforce, and more efficient business practices aided by AI.

- A recommendation that the White House establish a new “Technology Competitiveness Council,” led by the vice president, to develop a comprehensive technology strategy and oversee its implementation.

- A recommendation that the U.S. military explore using autonomous weapons systems, provided their use is authorized by human operators.

- A proposal to establish a new Digital Service Academy and a civilian National Reserve to cultivate domestic AI talent.

- A call to provide $35 billion in federal investment and incentives for domestic semiconductor manufacturing.

- A recommendation to double non-defense AI R&D funding annually until it reaches $32 billion per year, and to triple the number of National AI Research Institutes.

- A call for reformed export controls, coordinated with allies, on key technologies such as high-end semiconductor manufacturing equipment.

- A recommendation that Congress pass a second National Defense Education Act and reform the U.S. immigration system to attract and retain AI students and workers from abroad.

While none of the report’s recommendations are legally binding, it has reportedly been well-received by key members of both parties.

Matthew van der Merwe also summarizes the recommendations in Import AI; this has a lot of overlap with the CSET summary so I won't copy it here.

Jeff Ding adds in ChinAI #134:

[I]f you make it past the bluster in the beginning — or take it for what it is: obligatory marketing to cater to a DC audience hooked on a narrow vision of national security — there’s some smart moderate policy ideas in the report (e.g. chapter 7 on establishing justified confidence in AI systems).

In email correspondence, Jon Rodriguez adds some commentary on the safety implications:

1. The report acknowledges the potential danger of AGI, and specifically calls for value alignment research to take place (pg. 36). To my knowledge, this is one of the first times a leading world government has called for value alignment.

2. The report makes a clear statement that the US prohibits AI from authorizing the launch of nuclear weapons (pg. 98).

3. The report calls for dialogues with China and Russia to ensure that military decisions made by military AI at "machine speed" do not lead to out-of-control conflict escalation which humans would not want (pg. 97).

# OTHER PROGRESS IN AI

## DEEP LEARNING

Learning Curve Theory (Marcus Hutter) (summarized by Rohin): Like last week’s highlight (AN #140), this paper proposes a theoretical model that could predict empirically observable scaling laws. The author considers a very simple online learning model, in which we are given a feature vector and must classify it into one of two categories. We’ll also consider a very simple tabular algorithm that just memorizes the classifications of all previously seen vectors and spits out the correct classification if it has been seen before, and otherwise says “I don’t know”. How does the error incurred by this algorithm scale with data size?

The answer of course depends on the data distribution -- if we always see the same feature vector, then we never make an error after the first timestep, whereas if the vector is chosen uniformly at random, we’ll always have maximal error. The author analyzes several possible data distributions in between these extremes.
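The two extremes are easy to check with a small simulation. The sketch below is my own minimal rendering of the memorizing learner, not code from the paper; the class name, the labeling rule, and the choice to count "I don't know" as an error are assumptions made for illustration.

```python
import random


class MemorizingLearner:
    """Tabular learner: remembers every (vector, label) pair it has seen."""

    def __init__(self):
        self.table = {}

    def predict(self, x):
        # Returns None for unseen vectors, standing in for "I don't know".
        return self.table.get(x)

    def update(self, x, label):
        self.table[x] = label


def run_online(stream, label_fn):
    """Predict-then-learn on each vector; an unknown answer counts as an error."""
    learner = MemorizingLearner()
    errors = []
    for x in stream:
        errors.append(learner.predict(x) != label_fn(x))
        learner.update(x, label_fn(x))
    return errors


def label_fn(x):
    """Arbitrary ground-truth labeling used only for this example."""
    return int(x[0] > 0.5)


rng = random.Random(0)
constant_stream = [(0.0, 0.0, 0.0)] * 100
uniform_stream = [tuple(rng.random() for _ in range(3)) for _ in range(100)]

constant_errors = run_online(constant_stream, label_fn)
uniform_errors = run_online(uniform_stream, label_fn)
print("constant stream errors:", sum(constant_errors))
print("uniform stream errors:", sum(uniform_errors))
```

With a constant stream the only error is on the first timestep, while continuous uniform vectors essentially never repeat, so the learner errs every time.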

The most interesting case is when the data is drawn from a Zipf distribution. In this case, when you order the feature vectors from most to least likely, the ith vector has probability proportional to i^(-(α+1)). The error then follows a power law in the number of data points n, scaling as n^(-β), where β = α / (α+1). This could explain the scaling laws observed in the wild.
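One way to sanity-check the Zipf claim numerically: for a memorizing learner, the expected error after seeing a given number of samples is the probability that the next sample is new, which is the sum over vectors of p * (1 - p)^(number of samples). The sketch below evaluates this for a truncated Zipf distribution and compares the log-log slope with -α/(α+1). The truncation at 200,000 vectors, the choice α = 1, and the two sample sizes are assumptions chosen for convenience.

```python
import math


def zipf_probs(alpha, support):
    """Truncated Zipf: the vector of rank r has probability ∝ r^(-(alpha+1))."""
    weights = [rank ** -(alpha + 1) for rank in range(1, support + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_error(probs, num_samples):
    """Probability that the next sample was not among the first num_samples."""
    return sum(p * (1 - p) ** num_samples for p in probs)


alpha = 1.0
probs = zipf_probs(alpha, support=200_000)

n_small, n_large = 100, 10_000
e_small = expected_error(probs, n_small)
e_large = expected_error(probs, n_large)

# Slope of log(error) against log(number of samples) between the two points.
fitted_slope = (math.log(e_large) - math.log(e_small)) / (
    math.log(n_large) - math.log(n_small)
)
beta = alpha / (alpha + 1)

print("fitted slope:", fitted_slope)
print("predicted slope:", -beta)
```

The fitted slope should land close to the predicted value, up to finite-size and truncation effects.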

Rohin's opinion: As with last week’s paper, I’m happy to see more work on understanding scaling laws. For this paper, the “assumption on reality” lies in which data distribution we assume the data is drawn from. However, overall I feel less compelled by this paper than by the one from last week, for two reasons. First, it seems to me that using a tabular (memorization) algorithm is probably too coarse a model; I would guess that there are facts about neural nets that are relevant to scaling that aren’t captured by tabular algorithms. Second, I prefer the assumption that the data are drawn from a low-dimensional manifold, rather than that the data are drawn from some specific distribution like a Zipf distribution (or others discussed in the paper).

#### FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

#### PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

4 comments

I have a hard time saying which of the scaling laws explanations I like better (I haven't read either paper in detail, but I think I got the gist of both).
What's interesting about Hutter's is that the model is so simple, and doesn't require generalization at all.
I feel like there's a pretty strong Occam's Razor-esque argument for preferring Hutter's model, even though it seems wildly less intuitive to me.
Or maybe what I want to say is more like "Hutter's model DEMANDS refutation/falsification".

I think both models also are very interesting for understanding DNN generalization... I really think it goes beyond memorization and local generalization (cf. https://openreview.net/forum?id=rJv6ZgHYg), but it's interesting that those are basically the mechanisms proposed by Hutter and Sharma & Kaplan (resp.)...

> I feel like there's a pretty strong Occam's Razor-esque argument for preferring Hutter's model, even though it seems wildly less intuitive to me.

?? Overall this claim feels to me like:

• Observing that cows don't float into space
• Making a model of spherical cows with constant density ρ and showing that as long as ρ is more than the density of air, the cows won't float
• Concluding that since the model is so simple, Occam's Razor says that cows must be spherical with constant density.

Some ways that you could refute it:

• It requires your data to be Zipf-distributed -- why expect that to be true?
• The simplicity comes from being further away from normal neural nets -- surely the one that's closer to neural nets is more likely to be true?

> Or maybe what I want to say is more like "Hutter's model DEMANDS refutation/falsification".

Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter's model doesn't. Presumably you mean something else but idk what.

Interesting... Maybe this comes down to different taste or something. I understand, but don't agree with, the cow analogy... I'm not sure why, but one difference is that I think we know more about cows than DNNs or something.

I haven't thought about the Zipf-distributed thing.

> Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter's model doesn't. Presumably you mean something else but idk what.

I'd like to see Hutter's model "translated" a bit to DNNs, e.g. by assuming they get anything right that's within epsilon of a training data point or something... maybe it even ends up looking like the other model in that context...

> I'd like to see Hutter's model "translated" a bit to DNNs, e.g. by assuming they get anything right that's within epsilon of a training data point or something

With this assumption, asymptotically (i.e. with enough data) this becomes a nearest neighbor classifier. For the d-dimensional manifold assumption in the other model, you can apply the arguments from the other model to say that you scale as D^(-c/d) for some constant c (probably c = 1 or 2, depending on what exactly we're quantifying the scaling of), where D is the number of data points.

I'm not entirely sure how you'd generalize the Zipf assumption to the "within epsilon" case, since in the original model there was no assumption on the smoothness of the function being predicted (i.e. [0, 0, 0] and [0, 0, 0.000001] could have completely different values.)