[AN #157]: Measuring misalignment in the technology underlying Copilot

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Evaluating Large Language Models Trained on Code (Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan et al) (summarized by Rohin): You’ve probably heard of GitHub Copilot, the programming assistant tool that can provide suggestions while you are writing code. This paper evaluates Codex, a precursor to the model underlying Copilot. There’s a lot of content here; I’m only summarizing what I see as the highlights.

The core ingredient for Codex was the many, many public repositories on GitHub, which provided hundreds of millions of lines of training data. With such a large dataset, the authors were able to get good performance by training a model completely from scratch, though in practice they finetuned an existing pretrained GPT model as it converged faster while providing similar performance.

Their primary tool for evaluation is HumanEval, a collection of 164 hand-constructed Python programming problems where the model is provided with a docstring explaining what the program should do along with some unit tests, and the model must produce a correct implementation of the resulting function. Problems are not all equally difficult; an easier problem asks Codex to “increment all numbers in a list by 1” while a harder one provides a function that encodes a string of text using a transposition cipher and asks Codex to write the corresponding decryption function.

To improve performance even further, they collect a sanitized finetuning dataset of problems formatted similarly to those in HumanEval and train Codex to perform well on such problems. These models are called Codex-S. With this, we see the following results:

1. Pretrained GPT models get roughly 0%.

2. The largest 12B Codex-S model succeeds on the first try 29% of the time. (A Codex model of the same size only gets roughly 22%.)

3. There is a consistent scaling law for reduction in loss. This translates into a less consistent graph for performance on the HumanEval dataset, where once the model starts to solve at least (say) 5% of the tasks, there is a roughly linear increase in the probability of success when doubling the size of the model.

4. If instead we generate 100 samples and check whether they pass the unit tests to select the best one, then Codex-S gets 78%. If we still generate 100 samples but select the sample that has the highest mean log probability (perhaps because we don’t have an exhaustive suite of unit tests), then we get 45%.
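The two selection strategies in item 4 can be sketched as follows. This is a minimal illustration, not the paper's evaluation harness: the `Sample` tuple, the function names, and the toy samples with made-up log probabilities are all assumptions introduced here. It also assumes every sample has at least one token log probability.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical container: (generated source code, per-token log probabilities).
Sample = Tuple[str, Sequence[float]]


def passes_tests(source: str, tests: Callable[[dict], None]) -> bool:
    """Define the candidate code, then run the tests; any exception is a failure."""
    namespace: dict = {}
    try:
        # exec is only acceptable for trusted toy code; a real harness would sandbox this.
        exec(source, namespace)
        tests(namespace)
        return True
    except Exception:
        return False


def select_by_tests(samples: List[Sample], tests: Callable[[dict], None]) -> Optional[str]:
    """Return the first sample that passes the unit tests, or None if none do."""
    for source, _ in samples:
        if passes_tests(source, tests):
            return source
    return None


def select_by_mean_logprob(samples: List[Sample]) -> str:
    """Return the sample whose tokens have the highest mean log probability."""
    return max(samples, key=lambda s: sum(s[1]) / len(s[1]))[0]


def increment_tests(ns: dict) -> None:
    assert ns["incr_list"]([1, 2, 3]) == [2, 3, 4]


# Toy samples: the buggy one is given a higher (made-up) mean log probability.
buggy = "def incr_list(xs):\n    return [x + 2 for x in xs]\n"
correct = "def incr_list(xs):\n    return [x + 1 for x in xs]\n"
samples: List[Sample] = [(buggy, [-0.1, -0.2]), (correct, [-0.5, -0.7])]

chosen_by_tests = select_by_tests(samples, increment_tests)
chosen_by_logprob = select_by_mean_logprob(samples)
```

In this toy case the unit tests pick the correct sample while mean log probability picks the buggy one, which is one way to see why the second strategy scores lower when no tests are available.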

They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.

Since Codex is trained primarily to predict the next token, it has likely learned that buggy code should be followed by more buggy code, that insecure code should be followed by more insecure code, and so on. This suggests that if the user accidentally provides examples with subtle bugs, then the model will continue to create buggy code, even though the user would want correct code. They find that exactly this effect occurs, and that the divergence between good and bad performance increases as the model size increases (presumably because larger models are better able to pick up on the correlation between previous buggy code and future buggy code).

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

TECHNICAL AI ALIGNMENT

TECHNICAL AGENDAS AND PRIORITIZATION

Measurement, Optimization, and Take-off Speed (Jacob Steinhardt) (summarized by Sudhanshu): In this blogpost, the author argues that "trying to measure pretty much anything you can think of is a good mental move that is heavily underutilized in machine learning". He motivates the value of measurement and additional metrics by (i) citing evidence from the history of science, policy-making, and engineering (e.g. x-ray crystallography contributed to rapid progress in molecular biology), (ii) describing how, conceptually, "measurement has several valuable properties" (one of which is to act as interlocking constraints that help to error-check theories), and (iii) providing anecdotes from his own research endeavours where such approaches have been productive and useful (see, e.g. Rethinking Bias-Variance Trade-off (AN #129)).

He demonstrates his proposal by applying it to the notion of optimization power -- an important idea that has not been measured or even framed in terms of metrics. Two metrics are offered: (a) the change (typically deterioration) of performance when trained with a perturbed objective function with respect to the original objective function, named Outer Optimization, and (b) the change in performance of agents during their own lifetime (but without any further parameter updates), such as the log-loss on the next sentence for a language model after it sees X number of sequences at test time, or Inner Adaptation. Inspired by these, the article includes research questions and possible challenges.

He concludes with the insight that take-off would depend on these two continuous processes, Outer Optimization and Inner Adaptation, that work on very different time-scales, with the former being, at this time, much quicker than the latter. However, drawing an analogy from evolution, where it took billions of years of optimization to generate creatures like humans that were exceptional at rapid adaptation, we might yet see a fast take-off where Inner Adaptation turns out to be an exponential process that dominates capabilities progress. He advocates for early, sensitive measurement of this quantity as it might be an early warning sign of imminent risks.

Sudhanshu's opinion: Early on, this post reminded me of Twenty Billion Questions; even though they are concretely different, these two pieces share a conceptual thread. They both consider the measurement of multiple quantities essential for solving their problems: 20BQ for encouraging AIs to be low-impact, and this post for productive framings of ill-defined concepts and as a heads-up about potential catastrophes.

Measurement is important, and this article poignantly argues why and illustrates how. It volunteers potential ideas that can be worked on today by mainstream ML researchers, and offers up a powerful toolkit to improve one's own quality of analysis. It would be great to see more examples of this technique applied to other contentious, fuzzy concepts in ML and beyond. I'll quickly note that while there seems to be minimal interest in this from academia, measurement of optimization power has been discussed earlier in several ways, e.g. Measuring Optimization Power, or the ground of optimization (AN #105).

Rohin's opinion: I broadly agree with the perspective in this post. I feel especially optimistic about the prospects of measurement for (a) checking whether our theoretical arguments hold in practice and (b) convincing others of our positions (assuming that the arguments do hold in practice).

FORECASTING

Fractional progress estimates for AI timelines and implied resource requirements (Mark Xu et al) (summarized by Rohin): One methodology for forecasting AI timelines is to ask experts how much progress they have made toward human-level AI within their subfield over the last T years. You can then extrapolate linearly to see when 100% of the problem will be solved. The post linked above collects such estimates, with a typical estimate being 5% of a problem being solved in the twenty year period between 1992 and 2012. Overall these estimates imply a timeline of 372 years.

This post provides a reductio argument against this pair of methodology and estimate. The core argument is that if you linearly extrapolate, then you are effectively saying “assume that business continues as usual: then how long does it take”? But “business as usual” in the case of the last 20 years involves an increase in the amount of compute used by AI researchers by a factor of ~1000, so this effectively says that we’ll get to human-level AI after a 1000^{372/20} ≈ 10^56 increase in the amount of available compute. (The authors do a somewhat more careful calculation that breaks apart improvements in price and growth of GDP, and get 10^53.)
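The back-of-the-envelope figure can be checked directly. This is only the simple calculation from the summary (a 1000x compute increase per 20 years, sustained for 372 years), not the authors' more careful one; the variable names are introduced here for illustration.

```python
import math

years_implied = 372        # timeline implied by the expert estimates
window_years = 20          # length of one "business as usual" window
growth_per_window = 1000   # approximate compute increase over one window

# Number of 20-year windows in 372 years.
windows = years_implied / window_years

# Total compute increase is 1000 ** windows; work in log10 to keep the number readable.
log10_total_increase = windows * math.log10(growth_per_window)

print(f"{windows:.1f} windows, i.e. a 10^{log10_total_increase:.1f} increase in compute")
```

The exponent comes out at about 55.8, which rounds to the 10^56 quoted above.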

This is a stupendously large amount of compute: it far dwarfs the amount of compute used by evolution, and even dwarfs the maximum amount of irreversible computing we could have done with all the energy that has ever hit the Earth over its lifetime (the bound comes from Landauer’s principle).

Given that evolution did produce intelligence (us), we should reject the argument. But what should we make of the expert estimates then? One interpretation is that “proportion of the problem solved” behaves more like an exponential, because the inputs are growing exponentially, and so the time taken to do the last 90% can be much less than 9x the time taken for the first 10%.
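A toy model makes the exponential interpretation concrete. The assumption here (mine, for illustration, and not a claim from the post) is that the fraction of the problem solved is proportional to e^{kt} - 1, growing at the same rate as compute: 1000x per 20 years.

```python
import math


def years_for_last_90_percent(first_10_years: float, growth_per_year: float) -> float:
    """Years needed for the last 90%, if the fraction solved is proportional to
    exp(k * t) - 1 and the first 10% took `first_10_years` years.

    This functional form is an illustrative assumption, not a fitted model.
    """
    k = math.log(growth_per_year)
    # Solve c * (exp(k * T) - 1) = 1 with c fixed by c * (exp(k * t10) - 1) = 0.1.
    total_years = math.log(1 + 10 * (math.exp(k * first_10_years) - 1)) / k
    return total_years - first_10_years


growth_per_year = 1000 ** (1 / 20)   # 1000x per 20 years, as in the reductio
remaining = years_for_last_90_percent(20, growth_per_year)
linear_remaining = 9 * 20            # what linear extrapolation would predict

print(f"exponential toy model: {remaining:.1f} more years; linear: {linear_remaining} more years")
```

Under these assumptions, if the first 10% took 20 years, the last 90% takes under 7 more years, compared with 180 years under linear extrapolation.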

Rohin's opinion: This seems like a pretty clear reductio to me, though it is possible to argue that this argument doesn’t apply because compute isn’t the bottleneck, i.e. even with infinite compute we wouldn’t know how to make AGI. (That being said, I mostly do think we could build AGI if only we had enough compute; see also last week’s highlight on the scaling hypothesis (AN #156).)

MISCELLANEOUS (ALIGNMENT)

Progress on Causal Influence Diagrams (Tom Everitt et al) (summarized by Rohin): Many of the problems we care about (reward gaming, wireheading, manipulation) are fundamentally a worry that our AI systems will have the wrong incentives. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are graphical models (AN #49) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various incentives the agent has. This can then be used for several applications:

1. We can analyze what happens when you intervene on the agent’s action. Depending on whether the RL algorithm uses the original or modified action in its update rule, we may or may not see the algorithm disable its off switch.

2. We can avoid reward tampering (AN #71) by removing the connections from future rewards to utility nodes; in other words, we ensure that the agent evaluates hypothetical future outcomes according to its current reward function.

3. A multiagent version allows us to recover concepts like Nash equilibria and subgames from game theory, using a very simple, compact representation.
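As a loose illustration of how a graphical model supports this kind of reasoning, the sketch below represents a toy diagram as a directed graph and lists the nodes that lie on a directed path from the agent's action to its utility, i.e. the things the agent could affect as a means of changing its utility. The node names, both toy graphs, and the use of this path check are assumptions made for illustration; they are not taken from the post.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # node -> list of children


def descendants(graph: Graph, start: str) -> Set[str]:
    """All nodes reachable from `start` by following directed edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def nodes_between(graph: Graph, decision: str, utility: str) -> Set[str]:
    """Nodes X such that there is a directed path decision -> X -> utility."""
    return {
        x for x in descendants(graph, decision)
        if x != utility and utility in descendants(graph, x)
    }


# Toy graph 1: utility is computed by the *future* reward function,
# which the action can influence.
tampering_cid: Graph = {
    "reward_fn_now": ["reward_fn_next"],
    "action": ["state", "reward_fn_next"],
    "state": ["utility"],
    "reward_fn_next": ["utility"],
}

# Toy graph 2: utility is computed by the *current* reward function instead.
current_rf_cid: Graph = {
    "reward_fn_now": ["reward_fn_next", "utility"],
    "action": ["state", "reward_fn_next"],
    "state": ["utility"],
    "reward_fn_next": [],
}

with_tampering_path = nodes_between(tampering_cid, "action", "utility")
without_tampering_path = nodes_between(current_rf_cid, "action", "utility")
```

In the first toy graph the future reward function sits on a path from action to utility; in the second, where outcomes are evaluated with the current reward function, only the state does.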

AI GOVERNANCE

A personal take on longtermist AI governance (Luke Muehlhauser) (summarized by Rohin): We’ve previously seen (AN #130) that Open Philanthropy struggles to find intermediate goals in AI governance that seem robustly good to pursue from a longtermist perspective. (If you aren’t familiar with longtermism, you probably want to skip to the next summary.) In this personal post, the author suggests that there are three key bottlenecks driving this:

1. There are very few longtermists in the world; those that do exist often don’t have the specific interests, skills, and experience needed for AI governance work. We could try to get others to work on relevant problems, but:

2. We don’t have the strategic clarity and forecasting ability to know which intermediate goals are important (or even net positive). Maybe we could get people to help us figure out the strategic picture? Unfortunately:

3. It's difficult to define and scope research projects that can help clarify which intermediate goals are worth pursuing when done by people who are not themselves thinking about the issues from a longtermist perspective.

Given these bottlenecks, the author offers the following career advice for those who hope to do work from a longtermist perspective in AI governance:

1. Career decisions should be especially influenced by the value of experimentation, learning, aptitude development, and career capital.

2. Prioritize future impact, for example by building credentials to influence a 1-20 year “crunch time” period. (But make sure to keep studying and thinking about how to create that future impact.)

3. Work on building the field, especially with an eye to reducing bottleneck #1. (See e.g. here.)

4. Try to reduce bottleneck #2 by doing research that increases strategic clarity, though note that many people have tried this and it doesn’t seem like the situation has improved very much.

NEWS

Open Philanthropy Technology Policy Fellowship (Luke Muehlhauser) (summarized by Rohin): Open Philanthropy is seeking applicants for a US policy fellowship program focused on high-priority emerging technologies, especially AI and biotechnology. Application deadline is September 15.

Read more: EA Forum post

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

I think that this is a very good example where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for. Notably, there are IMO pretty good arguments (mostly by people affiliated with EleutherAI, I'm pushing them to post on the AF) that GPT-3 seems to work more like a simulator of language-producing processes (for lack of a better word), than as an agent trying to predict the next token.

Like what you write here:

They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.

For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem. In that sense, GPT-3 fails the "chatbot task": for a lot of the things it's great at doing, you have to handcraft (or constrain) the prompts -- it won't find out precisely what you mean.

Or to put it differently: people who are good at making GPT-3 do what they want have learned to not use it like a smart agent figuring out what you really mean, but more like a "prompt continuation engine". You can obviously say "it's an agent that does really care about the context", but it doesn't look like it adds anything to the picture, and I have the gut feeling that being agenty makes it harder to do that task (as you need a very un-goal-like goal).

(I think this points to what you mention in that comment, about approval-directedness being significantly less goal-directed: if GPT-3 is agenty, it looks quite a lot like a sort of approval-directed agent.)

I think that this is a very good example where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for.

Where do you see any assumption of agency/goals?

(I find this some combination of sad and amusing as a commentary on the difficulty of communication, in that I feel like I tend to be the person pushing against ascribing goals to GPT.)

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

You can obviously say "it's an agent that does really care about the context", but it doesn't look like it adds anything to the picture,

Agreed, which is why I didn't say anything like that?

Sorry for ascribing to you beliefs you don't have. I guess I'm just used to people here and in other places assuming goals and agency in language models, and also some of your choices of words sounded very goal-directed/intentional stance to me.

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Sure, but don't you agree that it's a very confusing use of the term? Like, if I say GPT-3 isn't trying to kill me, I'm not saying it is trying to kill anyone, but I'm sort of implying that it's the right framing to talk about it. In this case, the "motivated" part did trigger me, because it implied that the right framing is to think about what Codex wants, which I don't think is right (and apparently you agree).

(Also, the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that you're implying agency for some readers.)

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

Agreed with you there.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

True, but I don't feel like there is a significant difference between Codex and GPT-3 in terms of size or training to warrant different conclusions with regard to ascribing goals/agency.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

First, I think I interpreted "misalignment" here to mean "inner misalignment", hence my answer. I also agree that all examples in Victoria's doc are showing misalignment. That being said, I still think there is a difference with the specification gaming stuff.

Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious? Almost all specification gaming examples are subtle, or tricky, or exploiting bugs. They're things that I would expect a human to fail to find, even given the precise loss and training environment. Whereas I expect any human to complete buggy code with buggy code once you explain to them that Codex looks for the most probable next token based on all the code.

But there doesn't seem to be a real disagreement between us: I agree that GPT-3/Codex seem fundamentally unable to get really good at the "Chatbot task" I described above, which is what I gather you mean by "solving my problem".

(By the way, I have an old post about formulating this task that we want GPT-3 to solve. It was written before I actually studied GPT-3, but it holds up decently well I think. I also did some experiments on GPT-3 with EleutherAI people on whether bigger models get better at answering more variations of the prompt for the same task.)

Sure, but don't you agree that it's a very confusing use of the term?

Maybe? Idk, according to me the goal of alignment is "create a model that is motivated to help us", and so misalignment = not-alignment = "the model is not motivated to help us". Feels pretty clear to me but illusion of transparency is a thing.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now.

(Also, the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that you're implying agency for some readers.)

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯

I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious?

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

Almost all specification gaming examples are subtle, or tricky, or exploiting bugs.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Sorry for the delay in answering, I was a bit busy.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now

That makes some sense, but I do find the "motivationless" state interesting from an alignment point of view. Because if it has no motivation, it also doesn't have a motivation to do all the things we don't want. We thus get some corrigibility by default, because we can change its motivation just by changing the prompt.

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯
I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

Agreed that there's not much difference when predicting GPT-3. But it's because we're at the place in the scaling where Gwern (AFAIK) describes the LM as an agent that is very good at predicting agents. By definition it will not do anything different from a simulator, since its "goal" literally encodes all of its behavior.

Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility to use them to help alignment research, for example by trying to simulate non-agenty systems.

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

Fair enough.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Yeah, you're probably right.

Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility to use them to help alignment research, for example by trying to simulate non-agenty systems.

Yeah, I agree that in the future there is a difference. I don't think we know which of these situations we're going to be in (which is maybe what you're arguing). Idk what Gwern predicts.

Exactly. I'm mostly arguing that I don't think the case for the agent situation is as clear cut as I've seen some people defend it, which doesn't mean it's not possibly true.

@Adam I'm interested if you have the same criticism of the language in the paper (in appendix E)?

(I mostly wrote it, and am interested whether it sounds like it's ascribing agency too much)

My message was really about Rohin's phrasing, since I usually don't read the papers in detail if I think the summary is good enough.

Reading the section now, I'm fine with it. There are a few intentional stance words, but the scare quotes and the straightforwardness of cashing out "is capable" into "there is a prompt to make it do what we want" and "chooses" into "what it actually returns for our prompt" make it quite unambiguous.

I also like this paragraph in the appendix:

However, there is an intuitive notion that, given its training objective, Codex is better described as “trying” to continue the prompt by either matching or generalizing the training distribution, than as “trying” to be helpful to the user.

Rohin also changed my mind on my criticism of calling that misalignment; I now agree that misalignment is the right term.

One thought I just had: this looks more like a form of proxy alignment to what we really want, which is not ideal but significantly better than deceptive alignment. Maybe autoregressive language models point to a way of paying a cost of proxy alignment to avoid deceptive alignment?

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

Nic jokes:

In the end humanity was saved by adding "Super safe." to all their requests of the AGI

My counter joke (in EAI) was:

"AGI but its supr safe."

(GPT-3 is an agent-predicting agent.)

I suspect that "progress toward human level AI" is extremely non-linear in the outputs, not just the inputs. I think the ranges between "dog" and "human", and between "human" and "unequivocally superhuman" are pretty similar and narrow on some absolute scale: perhaps 19, 20, and 21 respectively in some arbitrary units.

We have developed some narrow AI capabilities that are superhuman, but in terms of more general intelligence we're still well short of "dog". I wouldn't be very surprised if we were at about 10-15 on the previous fictional scale, up from maybe 2-3 back in the 80's.

Superficially it doesn't look like a lot of progress is being made on human-level general intelligence, because our best efforts are in general terms still stupider than the dumbest dogs and that fact hasn't changed in the last 40 years. We've broadened the fraction of tasks where AI can do almost as well as a human or better by maybe a couple of percent, which doesn't look like much progress.

But that's exactly what we should expect. Many of the tasks we're interested in have "sharp" evaluation curves, where they require multiple capabilities and anything less than human performance in any one capability required for the task will lead to a score near zero.

If this model has any truth to it, by the time we get to 19 on this general intelligence scale we'll probably still be bemoaning (or laughing at) how dumb our AIs are, even while in many more respects they will have superior capabilities and be right on the verge of becoming unequivocally superhuman.

Right now it looks like your 19-21 scale corresponds to something like log(# of parameters) (basically every scaling graph with parameters you'll see uses this as its x-axis). So it still requires exponential increase in inputs to drive a linear increase on that scale.

Oh yes, it's very likely that inputs are also non-linear.

The non-linearity of outputs I was referring to though was the likelihood of "sharp" tasks that require near-human capability in multiple aspects to score anything non-negligible in evaluation.

I must admit to a greater degree of ignorance than usual for my comment here, but I have a huge problem with the longtermists [at least from the longtermist paper I read]: their position reeks of begging the question. If we suppose that an immense number of people will live in the future, that the short term is not immensely easier to knowingly influence, that improving the short term does not improve the long term, that there is no medium term worth considering, that influences are percentage based, and that we care equally about immensely future people [including nonhuman ones] as we do our loved ones, then and only then does their argument make sense. That said, I'm perfectly fine with pursuing long term benefit, and think that one of their points should be highly pursued; research into how best to influence the future seems worthwhile.

It seems clearly made as a justification for the position they already wanted to take. There's nothing wrong with that, but I think their premises are unlikely. I think it is obvious that the near term is much easier to influence. Assumptions I find highly questionable are that the short term doesn't have significant knock-on effects [I think it clearly does], that we shouldn't consider a distinct medium term, and that we shouldn't care more about those closer to us (in time, space, likeness, and just plain affection). Percentage based is also highly questionable, considering we know requirements to improve tech seem exponential, and things that are naturally easier to quantify are probably much easier to improve [so things other than tech are hard to improve. AI safety is in a philosophy stage]. They also fail to include a scenario where the number of future people doesn't explode. I also don't believe in quick AI takeoff if AGI ever happens, and so even if I were one of them, I wouldn't focus so much on AI safety. (I am aware this community was built by people very concerned about AI safety.)

In the linked post, I think they can't tell which intermediate goals to pursue for two reasons. First, they are looking too far into the future, and second, AI simply isn't advanced enough yet to build good hypotheses about how they will really end up working [this is one reason it is philosophy]. The possibility space is immense, and we have few clues where to look. (I do think current approaches are also subtly very wrong, so they are actually looking in the wrong general areas too.)

Also, focusing on AI governance is a bit of a strange way to influence AI safety, and so it is hard to know what effect you will have based on what you do. Influencing the people that influence the laws and norms of the society AI researchers are operating in when there are hundreds of countries, and possibly thousands of cultures is a highly difficult task. Historically, many such influences have turned out to be malignant regardless of whether the people behind them were beneficent or malevolent. There are other approaches to influence, but they are even less reliable. It seems like a genuinely very tricky problem that may be clarified later once AI is really understood, but not until then. Focusing on understanding how and why AI will do things seems likely to be much more valuable than locking in governance before we understand.

Like Luke I'm going to take longtermism as an axiom for most purposes (I find it decently convincing given my values), though if you're interested in debating it you could post on the EA Forum. (One note: my understanding of longtermism is "the primary determinant of whether an action is one of the best that you can take is its consequences on the far future"; you seem to be interpreting it as a stronger / more specific claim than that.)

Also, focusing on AI governance is a bit of a strange way to influence AI safety

You're misunderstanding the point of AI governance. AI governance isn't a subset of AI safety, unless you interpret the term "AI safety" very very broadly. Usually I think of AI safety as "how do we build AI systems that do what their designers intend"; AI governance is then "how do we organize society so that humanity uses this newfound power of AI for good, and in particular doesn't use it to destroy ourselves" (e.g. how do we prevent humans from using AI in a way that makes wars existentially risky, that enforces robust totalitarianism, that persuades humans to change their values, etc). I guess part of governance is "how do we make sure no one builds unsafe AI", which is somewhat related to AI safety, but that's not the majority of AI governance.

A lot of these issues don't seem to become that much more clarified even with a picture of how AGI will come about, e.g. I have such a picture in mind, and even if I condition on that picture being completely accurate (which it obviously won't be), many of the relevant questions still don't get resolved. This is because often they're primarily questions about human society rather than questions about how AI works.

I am unlikely to post on the EA Forum. (I only recently started posting much here, and I find most of EA rather unconvincing, aside from the one-sentence summary, which is obviously a good thing.) Considering my negativity toward longtermism, I'm glad you decided more on the productive side for your response. My response is a bit long; I didn't manage to get what I was trying to say down when it was shorter. Feel free to ignore it.

I will state that all of that is AI safety. Even the safety of the AI is determined by the overarching world upon which it is acting. A perfectly well controlled AI is unsafe if regulations followed by defense-bot-3000 state that all rebels must be ended, and everyone matches the definition of a rebel. The people that built defense-bot-3000 probably didn't intend to end humanity because a human law said to. Identically, they probably didn't mean for defense-bot-4000 to stand by and let it happen because a human is required in the loop by the 4000 version, and defense-bot-3000 made sure to kill those in charge of defense-bot-4000 at the start for its instrumental value.

Should a police bot let criminals it can prove are guilty run free, because their actions are justified in this instance? Should a facial recognition system point out that it has determined that the new intern matches a spy for the government of that country? Should people be informed that a certain robot is malfunctioning, and likely to destroy an important machine in a hospital [when that means the people will simply destroy the sapient robot, but if the machine is destroyed people might die]? These are moral and legal governance questions that are also clearly AI safety questions.

I'd like to compare it to computer science where we know seemingly very disparate things are theoretically identical, such as iteration versus recursion, and hardware vs software. Regulation internal to the AI is the narrow construal of AI safety, while regulation external to it is governance. (Whether this regulation is on people or on the AI directly can be an important distinction, but either way it is still governance.)

Governance is thus actually a subset of AI safety broadly construed. And it should be broadly construed, since there is no difference between an inbuilt part of the AI and a part of the environment it is being used in if they lead to the same actions.

That wasn't actually my point though. The definition of whether or not you call it AI safety isn't important. You want to make it safe to have AI in use in society through regulation and cultural action. If you don't understand AI, your regulation and cultural bits will be wrong. You do not currently understand AI, especially what effects it will actually have dealing with people [since sufficient AIs don't exist to get data, and current approaches are not well understood in terms of why they do what they do].

Human culture has been massively changed by computers, the internet, cellphones, and so on. If I was older, I'd have a much longer list. If [and this is a big if] AI turns out to be that big of a thing, you can't know what it will look like at this stage. That's why you have to wait to find out [while trying to figure out what it will actually do]. If AI turns out to mostly be good at tutoring people, you need completely different regulation than if it turns out to only be good at war, and both are very different than if it is good at a wide variety of things.

Questions of human society rest on two things. First, what are people truly like on the inside? We aren't good at figuring that out, but we have documented several thousand years of trying, and we're starting to learn. Second, what is the particular culture like? Actual human-level AI would massively change all of our cultures, to fit or banish the contours of the actual and perceived effects of the devices. (Also, what are the AIs like on the inside? What are their natures? What cultures would spring up amongst different AIs?)

I agree that regulation is harder to do before you know all the details of the technology, but it doesn't seem obviously doomed, and it seems especially-not-doomed to productively think about what regulations would be good (which is the vast majority of current AI governance work by longtermists).

As a canonical example I'd think of the Asilomar conference, which I think happened well before the details of the technology were known. There are a few more examples, but overall not many. I think that's primarily because we don't usually try to foresee problems because we're too caught up in current problems, so I don't see that as a very strong update against thinking about governance in advance.

Perhaps I was unclear. I object to the idea that you should get attached to any ideas now, not that you shouldn't think about them. People being people, they are much more prone to getting attached to their ideas than is wise. Understand before risking attachment.

The problem with AI governance is that AI is a mix between completely novel abilities and things humans have been doing as long as there have been humans. The latter don't need special 'AI governance' and the former are not understood.

(It should be noted that I am absolutely certain that AI will not take off quickly if it ever does take off beyond human limits.)

The Asilomar conference isn't something I'm particularly familiar with, but it sounds like people actually had significant hands-on experience with the technology, and understood it already. They stopped the experiments because they needed the clarity, not because someone else made rules earlier. There are no details as to whether they did a good job, and the recommendations seem very generic. Of course, it is Wikipedia. We are not at this point with nontrivial AI. Needless to say, I don't think this is against my point.

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

Like what you write here:

They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.

I think that this is a very good example where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for.

Where do you see any assumption of agency/goals?

(I find this some combination of sad and amusing as a commentary on the difficulty of communication, in that I feel like I tend to be the person pushing against ascribing goals to GPT.)

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem.

You can obviously say "it's an agent that does really care about the context", but it doesn't look like it adds anything to the picture,

Agreed, which is why I didn't say anything like that?

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

(Also, the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that your phrasing implies agency to some people.)

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

Agreed with you there.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

True, but I don't feel like there is a significant difference between Codex and GPT-3 in terms of size or training to warrant different conclusions with regard to ascribing goals/agency.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

Sure, but don't you agree that it's a very confusing use of the term?

(Also, the fact that gwern, who ascribes agency to GPT-3, quoted specifically this part in his comment is further evidence that your phrasing implies agency to some people.)

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯

Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious?

Almost all specification gaming examples are subtle, or tricky, or exploiting bugs.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Sorry for the delay in answering, I was a bit busy.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence, but it is my best guess for now.

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯

I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

Fair enough.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Yeah, you're probably right.

Yet there is a difference when scaling. If Gwern is right (or if LMs become more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled-up LMs are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility of using them to help alignment research, for example by trying to simulate non-agenty systems.

Yeah, I agree that in the future there is a difference. I don't think we know which of these situations we're going to be in (which is maybe what you're arguing). Idk what Gwern predicts.

Exactly. I'm mostly arguing that I don't think the case for the agent situation is as clear cut as I've seen some people defend it, which doesn't mean it's not possibly true.

@Adam I'm interested if you have the same criticism of the language in the paper (in appendix E)?

(I mostly wrote it, and am interested whether it sounds like it's ascribing agency too much)

My message was really about Rohin's phrasing, since I usually don't read the papers in detail if I think the summary is good enough.

I also like this paragraph in the appendix:

However, there is an intuitive notion that, given its training objective, Codex is better described as “trying” to continue the prompt by either matching or generalizing the training distribution, than as “trying” to be helpful to the user.

Rohin also changed my mind on my criticism of calling that misalignment; I now agree that misalignment is the right term.

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

Nic jokes:

In the end humanity was saved by adding "Super safe." to all their requests of the AGI

My counter joke (in EAI) was:

"AGI but its supr safe."

(GPT-3 is an agent-predicting agent.)

Oh yes, it's very likely that inputs are also non-linear.

The non-linearity of outputs I was referring to though was the likelihood of "sharp" tasks that require near-human capability in multiple aspects to score anything non-negligible in evaluation.
