Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic Status: I only know as much as anyone else in my reference class (I build ML models, I can grok the GPT papers, and I don't work for OpenAI or a similar lab). But I think my thesis is original.

Related: Gwern on GPT-3

For the last several years, I've gone around saying that I'm worried about transformative AI, an AI capable of making an Industrial Revolution sized impact (the concept is agnostic on whether it has to be AGI or self-improving), because I think we might be one or two cognitive breakthroughs away from building one.

GPT-3 has made me move up my timelines, because it makes me think we might need zero more cognitive breakthroughs, just more refinement / efficiency / computing power: basically, GPT-6 or GPT-7 might do it. My reason for thinking this is comparing GPT-3 to GPT-2, and reflecting on what the differences say about the "missing pieces" for transformative AI.

My Thesis:

The difference between GPT-2 and GPT-3 has made me suspect that there's a legitimate comparison to be made between the scale of a network architecture like the GPTs, and some analogue of "developmental stages" of the resulting network. Furthermore, it's plausible to me that the functions needed to be a transformative AI are covered by a moderate number of such developmental stages, without requiring additional structure. Thus GPT-N would be a transformative AI, for some not-too-large N, and we need to redouble our efforts on ways to align such AIs. 

The thesis doesn't strongly imply that we'll reach transformative AI via GPT-N especially soon; I have wide uncertainty, even given the thesis, about how large we should expect N to be, and whether the scaling of training and of computation slows down progress before then. But it's also plausible to me now that the timeline is only a few years, and that no fundamentally different approach will succeed before then. And that scares me.

Architecture and Scaling

GPT, GPT-2, and GPT-3 use nearly the same architecture; each paper says as much, with a sentence or two about minor improvements to the individual transformers. Model size (and the amount of training computation) is really the only difference.

GPT took roughly 1 petaflop/s-day to train 117M parameters, GPT-2 took roughly 10 petaflop/s-days to train 1.5B parameters, and the largest version of GPT-3 took roughly 3,000 petaflop/s-days to train 175B parameters. By contrast, AlphaStar seems to have taken about 30,000 petaflop/s-days of training in mid-2019, and if compute for the largest AI research projects has kept growing at its recent pace, about 10x that should be feasible today. The upshot is that OpenAI may not be able to afford it, but if Google really wanted to make GPT-4 this year, they could afford to do so.
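
As a rough check on these figures, the growth ratios between successive models can be computed directly. This is only a sketch using the approximate numbers quoted in this post; the dictionary layout and the names `models`, `growth`, and `ALPHASTAR_PFS_DAYS` are illustrative assumptions, not anything from the GPT papers.

```python
# Approximate figures as quoted in this post (not authoritative measurements).
models = {
    "GPT":   {"params": 117e6, "pfs_days": 1},
    "GPT-2": {"params": 1.5e9, "pfs_days": 10},
    "GPT-3": {"params": 175e9, "pfs_days": 3000},
}
ALPHASTAR_PFS_DAYS = 30_000  # rough mid-2019 estimate quoted in this post


def growth(prev, nxt):
    """Return (parameter ratio, compute ratio) going from prev to nxt."""
    a, b = models[prev], models[nxt]
    return b["params"] / a["params"], b["pfs_days"] / a["pfs_days"]


for prev, nxt in [("GPT", "GPT-2"), ("GPT-2", "GPT-3")]:
    p, c = growth(prev, nxt)
    print(f"{prev} -> {nxt}: {p:.0f}x parameters, {c:.0f}x compute")

# How much larger the AlphaStar training budget was than GPT-3's.
alphastar_vs_gpt3 = ALPHASTAR_PFS_DAYS / models["GPT-3"]["pfs_days"]
print(f"AlphaStar / GPT-3 compute: {alphastar_vs_gpt3:.0f}x")
```

On these numbers, GPT-3's training run used about a tenth of the compute estimated for AlphaStar, which is the sense in which a larger GPT looks affordable for a well-resourced lab.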

Analogues to Developmental Stages

There are all sorts of (more or less well-defined) developmental stages for human beings: image tracking, object permanence, vocabulary and grammar, theory of mind, size and volume, emotional awareness, executive functioning, et cetera.

I was first reminded of developmental stages a few years ago, when I saw the layers of abstraction generated in this feature visualization tool for GoogLeNet.

We don't have feature visualization for language models, but we do have generative outputs. And as you scale up an architecture like GPT, you see higher levels of abstraction. Grammar gets mastered, then content (removing absurd but grammatical responses), then tone (first rough genre, then spookily accurate authorial voice). Topic coherence is mastered first on the phrase level, then the sentence level, then the paragraph level. So too with narrative flow.

Gwern's poetry experiments (GPT-2, GPT-3) are good examples. GPT-2 could more or less continue the meter of a poem and use words that fit the existing theme, but even its best efforts can get stuck in topic loops:

…Methinks I see her in her blissful dreams:
—Or, fancy-like, in some mirage she lies,
Majestic yet majestic, and of seems
The image of the unconquerable skies.
Methinks I see her in her blissful dreams:
—Or, fancy-like, in some majestic cell,
Where lordly seraphs strew their balmy dreams
On the still night, or in their golden shell.


In the dark the sun doth gleam,
And in the dark the moon doth seem
But now the evening is begun—
Gone is the sun upon the earth!
The silver moon doth like a cup
Of blood-red wine, and as that cup
Is drained of life, doth quench no drop.
What man will drink such wine?

GPT-3, though, has the sense of going somewhere, at least on the topic level. (Prompts in bold.)

O time thy pyramids, thy palaces forgot,
Pensive through nations wander with no more state
Than in thy days, when wandering was thy trade:
Mourn for Patraos, for the glory fled
Of him and thee, and in their stead left, left
To other eyes a length of rolling sand,
And such memorials as the traveller reads,
Half guessed in shade, of sculptured walls half round,
And long-defaced metopes, and friezes old.

[EDIT: Previously I also included its completion of a famous Emily Dickinson poem here, but as benkuhn pointed out, GPT-3 had simply memorized the poem and recited it. I'm really embarrassed, and also kind of shocked that I looked at the actual text of "Because I could not stop for Death" and thought, "yup, that looks like something GPT-3 could produce".]

(One last shocking bit is that, while GPT-2 had to be fine-tuned by taking the general model and training it some more on a poetry-only dataset, you're seeing what GPT-3's model does with no fine-tuning, with just a prompt that sounds poetic!)

Similarly, GPT-3's ability to write fiction is impressive: unlike GPT-2, it doesn't lose track of the plot, and it has sensible things happen; it just can't plan its way to a satisfying resolution.

I'd be somewhat surprised if GPT-4 shared that last problem.

What's Next?

How could one of the GPTs become a transformative AI, even if it becomes a better and better imitator of human prose style? Sure, we can imagine it being used maliciously to auto-generate targeted misinformation or things of that sort, but that's not the real risk I'm worrying about here.

My real worry is that causal inference and planning are starting to look more and more like plausible developmental stages that GPT-3 is moving towards, and that these were exactly the things I previously thought were the obvious obstacles between current AI paradigms and transformative AI.

Learning causal inference from observations doesn't seem qualitatively different from learning arithmetic or coding from examples (and not only is GPT-3 accurate at adding three-digit numbers, but apparently at writing JSX code to spec), only more complex in degree.

One might claim that causal inference is harder to glean from language-only data than from direct observation of the physical world, but that's a moot point, as OpenAI are using the same architecture to learn how to infer the rest of an image from one part.

Planning is more complex to assess. We've seen GPTs ascend from coherence of the next few words, to the sentence or line, to the paragraph or stanza, and we've even seen them write working code. But this can be done without planning; GPT-3 may simply have a good enough distribution over next words to prune out those that would lead to dead ends. (On the other hand, how sure are we that that's not the same as planning, if planning is just pruning on a high enough level of abstraction?)

The bigger point about planning, though, is that the GPTs are getting feedback on one word at a time in isolation. It's hard for them to learn not to paint themselves into a corner. It would make training more finicky and expensive if we expanded the time horizon of the loss function, of course. But that's a straightforward way to get the seeds of planning, and surely there are other ways.
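
To make "feedback on one word at a time" concrete, here is a minimal NumPy sketch of a next-token cross-entropy loss. It is a toy under stated assumptions, not OpenAI's training code: the function name `next_token_loss`, the array shapes, and the uniform toy inputs are all hypothetical. It shows only that the loss is an average of per-position terms, each scored against the true next token given the true preceding text.

```python
import numpy as np


def next_token_loss(logits, targets):
    """Mean cross-entropy over positions (hypothetical helper).

    logits:  (T, V) array; row t scores the token at position t, given the
             true tokens before it (teacher forcing).
    targets: (T,) array of the true token ids.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Each position is scored separately on its own correct next token.
    per_token = -log_probs[np.arange(len(targets)), targets]
    return per_token.mean(), per_token


# Toy example: 3 positions, vocabulary of 4 tokens, uniform predictions.
logits = np.zeros((3, 4))
targets = np.array([0, 2, 1])
loss, per_token = next_token_loss(logits, targets)
print(loss)  # log(4) for uniform predictions over 4 tokens
```

In this sketch, no term scores a whole sampled continuation, so nothing in the objective directly penalizes a locally plausible word that leads to a dead end later. A longer-horizon loss would presumably have to evaluate multi-token continuations, which is where the extra cost comes from.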

With causal modeling and planning, you have the capability of manipulation without external malicious use. And the really worrisome capability comes when it models its own interactions with the world, and makes plans with that taken into account.

Could GPT-N turn out aligned, or at least harmless?

GPT-3 is trained simply to predict continuations of text. So what would it actually optimize for, if it had a pretty good model of the world including itself and the ability to make plans in that world?

One might hope that, because it's learning to imitate humans in an unsupervised way, it would end up fairly human, or at least act that way. I very much doubt this, for the following reason:

  • Two humans are fairly similar to each other, because they have very similar architectures and are learning to succeed in the same environment.
  • Two convergently evolved species will be similar in some ways but not others, because they have different architectures but the same environmental pressures.
  • A mimic species will be similar in some ways but not others to the species it mimics (say, a poisonous one), because even if they share recent ancestry, the environmental pressures on the poisonous species are different from the environmental pressures on the mimic.

What we have with the GPTs is the first deep learning architecture we've found that scales this well in the domain (so, probably not that much like our particular architecture), learning to mimic humans rather than growing in an environment with similar pressures. Why should we expect it to be anything but very alien under the hood, or to continue acting human once its actions take us outside of the training distribution?

Moreover, there may be much more going on under the hood than we realize; it may take much more general cognitive power to learn and imitate the patterns of humans, than it requires us to execute those patterns.

Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don't expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don't know what internal optimizers GPT-N ends up building along the way, but I'm not going to count on there being none of them.

I don't expect that GPT-N will be aligned or harmless by default. And if N isn't that large before it gets transformative capacity, that's simply terrifying.

What Can We Do?

While the short timeline suggested by the thesis is very bad news from an AI safety readiness perspective (less time to come up with better theoretical approaches), there is one silver lining: it at least reduces the chance of a hardware overhang. A project or coalition can feasibly wait and take a better-aligned approach that uses 10x the time and expense of an unaligned approach, as long as they have that amount of resource advantage over any competitor. 

Unfortunately, the thesis also makes it less likely that a fundamentally different architecture will reach transformative status before something like GPT does.

I don't want to take away from MIRI's work (I still support them, and I think that if the GPTs peter out, we'll be glad they've been continuing their work), but I think it's an essential time to support projects that can work for a GPT-style near-term AGI, for instance by incorporating specific alignment pressures during training. Intuitively, it seems as if Cooperative Inverse Reinforcement Learning or AI Safety via Debate or Iterated Amplification are in this class.

We may also want to do a lot of work on how better to mold a GPT-in-training into the shape of an Oracle AI.

It would also be very useful to build some GPT feature "visualization" tools ASAP.

In the meantime, uh, enjoy AI Dungeon, I guess?

73 comments

Unfortunately what you say sounds somewhat plausible to me; I look forward to hearing the responses.

I'll add this additional worry: If you are an early chemist exploring the properties of various metals, and you discover a metal that gets harder as it gets colder, this should increase your credence that there are other metals that share this property. Similarly, I think, for AI architectures. The GPT architecture seems to exhibit pretty awesome scaling properties. What if there are other architectures that also have awesome scaling properties, such that we'll discover this soon? How many architectures have had 1,000+ PF-days pumped into them? Seems like just two or three. And equally importantly, how many architectures have been tried with 100+ billion parameters? I don't know, please tell me if you do.

EDIT: By "architectures" I mean "Architectures + training setups (data, reward function, etc.)"

I find this interesting in the context of the recent podcast on errors in the classic arguments for AI risk - which boil down to, there is no necessary reason why instrumental convergence or orthogonality apply to your systems, and there are actually strong reasons, a priori, to think increasing AI capabilities and increasing AI alignment go together to some degree... and then GPT-3 comes along, and suggests that, practically speaking, you can get highly capable behaviour that scales up easily without much in the way of alignment.

On the one hand, GPT-3 is quite useful while being not robustly aligned, but on the other hand GPT-3's lack of alignment is impeding its capabilities to some degree.

Maybe if you update on both you just end up back where you started.

I think the errors in the classic arguments have been greatly exaggerated. So for me the update is just in one direction.

Sammy Martin (8 points, 4y)
What would you say is wrong with the 'exaggerated' criticism? I don't think you can call the arguments wrong if you also think the Orthogonality Thesis and Instrumental Convergence are real and relevant to AI safety, and as far as I can tell the criticism doesn't claim that - just that there are other assumptions needed for disaster to be highly likely.

I don't have an elevator pitch summary of my views yet, and it's possible that my interpretation of the classic arguments is wrong, I haven't reread them recently. But here's an attempt:

--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what's right, what we intended, etc. and 2. that smarter AI wouldn't lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it probably wouldn't. I am old enough to remember talking to people who were otherwise smart and thoughtful who had views 1 and 2.

--As for whether the default outcome is doom, the original argument makes clear that default outcome means absent any special effort to make AI good, i.e. assuming everyone just tries to make it intelligent, but no effort is spent on making it good, the outcome is likely to be doom. This is, I think, true. La... (read more)

--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what's right, what we intended, etc. and 2. that smarter AI wouldn't lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it probably wouldn't. I am old enough to remember talking to people who were otherwise smart and thoughtful who had views 1 and 2.

Speaking from personal experience, those views both felt obvious to me before I came across Orthogonality Thesis or Instrumental convergence.

--As for whether the default outcome is doom, the original argument makes clear that default outcome means absent any special effort to make AI good, i.e. assuming everyone just tries to make it intelligent, but no effort is spent on making it good, the outcome is likely to be doom. This is, I think, true.

It depends on what you mean by 'special effort... (read more)

I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way - there might be a further fact that says why OT and IC don't apply to AGI like they theoretically should, but the burden is on you to prove it. Rather than saying that we need evidence OT and IC will apply to AGI.

I agree with that burden of proof. However, we do have evidence that IC will apply, if you think we might get AGI through RL. 

I think that hypothesized AI catastrophe is usually due to power-seeking behavior and instrumental drives. I proved that optimal policies are generally power-seeking in MDPs. This is a measure-based argument, and it is formally correct under broad classes of situations, like "optimal farsighted agents tend to preserve their access to terminal states" (Optimal Farsighted Agents Tend to Seek Power, §6.2 Theorem 19) and "optimal agents generally choose paths through the future that afford strictly more options" (Generalizing the Power-Seeking Theorems, Theorem 2).
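
The flavor of the measure-based argument can be illustrated with a toy simulation. This is not the construction from either paper, only a sketch under assumptions made up for illustration: from the start state, "left" leads to a single terminal state (think of an off-switch), "right" leads to a hub from which three terminal states are reachable, terminal rewards are drawn IID from a uniform distribution, and the agent is farsighted enough that only the terminal reward matters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Each row is one randomly drawn reward function over the terminal states.
# Column 0: the single terminal state behind "left" (e.g. an off-switch).
# Columns 1-3: the three terminal states reachable after going "right".
rewards = rng.uniform(size=(n_samples, 4))

# A farsighted optimal agent heads for the best reachable terminal state,
# so it goes right whenever the best of columns 1-3 beats column 0.
goes_right = rewards[:, 1:].max(axis=1) > rewards[:, 0]
fraction_right = goes_right.mean()
print(f"fraction of reward functions preferring 'right': {fraction_right:.3f}")
```

By symmetry, each of the four terminal states is the best one for a quarter of the reward draws, so the option-preserving action should be optimal for about three quarters of them. The off-switch is still optimal for the remaining quarter, which is why this is a statement about most reward functions rather than all of them.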

The theorems aren't conclusive evidence: 

  • maybe we don't get AGI through RL
  • learned policies are not going to be o
... (read more)
I agree that your paper strengthens the IC (and is also, in general, very cool!). One possible objection to the ICT, as traditionally formulated, has been that it's too vague: there are lots of different ways you could define a subset of possible minds, and then a measure over that subset, and not all of these ways actually imply that "most" minds in the subset have dangerous properties. Your paper definitely makes the ICT crisper, more clearly true, and more closely/concretely linked to AI development practices. I still think, though, that the ICT only gets us a relatively small portion of the way to believing that extinction-level alignment failures are likely. A couple of thoughts I have are: 1. It may be useful to distinguish between "power-seeking behavior" and omnicide (or equivalently harmful behavior). We do want AI systems to pursue power-seeking behaviors, to some extent. Making sure not to lock yourself in the bathroom, for example, qualifies as a power-seeking behavior -- it's akin to avoiding "State 2" in your diagram -- but it is something that we'd want any good house-cleaning robot to do. It's only a particular subset of power-seeking behavior that we badly want to avoid (e.g. killing people so they can't shut you off.) This being said, I imagine that, if we represented the physical universe as an MDP, and defined a reward function over states, and used a sufficiently low discount rate, then the optimal policy for most reward functions probably would involve omnicide. So the result probably does port over to this special case. Still, I think that keeping in mind the distinction between omnicide and "power-seeking behavior" (in the context of some particular MDP) does reduce the ominousness of the result to some degree. 2. Ultimately, for most real-world tasks, I think it's unlikely that people will develop RL systems using hand-coded reward functions (and then deploy them). I buy the framing in (e.g.) the DM "scalable agent alignment" p
Rohin Shah (8 points, 4y)
RL with a randomly chosen reward leads to catastrophe at optimum. The proof is for randomly distributed rewards. Ben's main critique is that the goals evolve in tandem with capabilities, and goals will be determined by what humans care about. These are specific reasons to deny the conclusion of analysis of random rewards. (A random Python program will error with near-certainty, yet somehow I still manage to write Python programs that don't error.) I do agree that this isn't enough reason to say "there is no risk", but it surely is important for determining absolute levels of risk. (See also this comment by Ben.)
Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to provide goals without that property”. Can we provide reward functions without that property? Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results. I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned. ETA: also, I was referring to the point you made when I said “the results don't prove how hard it is to tweak the reward function distribution, to avoid instrumental convergence”
Rohin Shah (4 points, 4y)
Idk, I could say that every specific attempt made by the safety community to demonstrate risk has been seemingly unsuccessful, therefore systems must not be risky. This pretty quickly becomes an argument about priors and reference classes and such. But I don't really think I disagree with you here. I think this paper is good, provides support for the point "we should have good reason to believe an AI system is safe, and not assume it by default", and responds to an in-fact incorrect argument of "but why would any AI want to kill us all, that's just anthropomorphizing". But when someone says "These arguments depend on some concept of a 'random mind', but in reality it won't be random, AI researchers will fix issues and goals and capabilities will evolve together towards what we want, seems like IC may or may not apply", it seems like a response of the form "we have support for IC, not just in random minds, but also for random reward functions" has not responded to the critique and should not be expected to be convincing to that person. Aside: I am legitimately unconvinced that it matters whether you are outer aligned at optimum. Not just being a devil's advocate here. (I am also not convinced of the negation.)
I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of "but how do we know IC even exists?" with "well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don't (formally) know how hard it is to avoid if you try".  I think I agree with most of what you're arguing.
Rohin Shah (4 points, 4y)
I think this is a slight misunderstanding of the theory in the paper. I'd translate the theory of the paper to English as: Any time the paper talks about "distributions" over reward functions, it's talking from our perspective. The way the theory does this is by saying that first a reward function is drawn from the distribution, then it is given to the agent, then the agent thinks really hard, and then the agent executes the optimal policy. All of the theoretical analysis in the paper is done "before" the reward function is drawn, but there is no step where the agent is doing optimization but doesn't know its reward. I'd rewrite this as:
Rohin Shah (4 points, 4y)
I do think this is mostly a matter of translation of math to English being hard. Like, when Alex says "optimal agents seek power", I think you should translate it as "when we don't know what goal an optimal agent has, we should assign higher probability that it will go to states that have higher power", even though the agent itself is not thinking "ah, this state is powerful, I'll go there".
Great observation. Similarly, a hypothesis called "Maximum Causal Entropy" once claimed that physical systems involving intelligent actors tended towards states where the future could be specialized towards many different final states, and that maybe this was even part of what intelligence was. However, people objected: (monogamous) individuals don't perpetually maximize their potential partners -- they actually pick a partner, eventually. My position on the issue is: most agents steer towards states which afford them greater power, and sometimes most agents give up that power to achieve their specialized goals. The point, however, is that they end up in the high-power states at some point in time along their optimal trajectory. I imagine that this is sufficient for the catastrophic power-stealing incentives: the AI only has to disempower us once for things to go irreversibly wrong.
To clarify, I don't assume that. The terminal states, even those representing the off-switch, also have their reward drawn from the same distribution. When you distribute reward IID over states, the off-state is in fact optimal for some low-measure subset of reward functions. But, maybe you're saying "for realistic distributions, the agent won't get any reward for being shut off and therefore π∗ won't ever let itself be shut off". I agree, and this kind of reasoning is captured by Theorem 3 of Generalizing the Power-Seeking Theorems. The problem is that this is just a narrow example of the more general phenomenon. What if we add transient "obedience" rewards, what then? For some level of farsightedness (γ close enough to 1), the agent will still disobey, and simultaneously disobedience gives it more control over the future. The paper doesn't draw the causal diagram "Power → instrumental convergence", it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions. Yes, right. The point isn't that alignment is impossible, but that you have to hit a low-measure set of goals which will give you aligned or non-power-seeking behavior. The paper helps motivate why alignment is generically hard and catastrophic if you fail.  Yes, if r=h, introduce the agent. You can formalize a kind of "alignment capability" by introducing a joint distribution over the human's goals and the induced agent goals (preliminary Overleaf notes). So, if we had goal X, we'd implement an agent with goal X', and so on. You then take our expected optimal value under this distribution and find whether you're good at alignment, or whether you're bad and you'll build agents whose optimal policies tend to obstruct you. The doubling depends on the environment structure. There are game trees and reward functions where this holds, and some where it doesn't.  If the rewards are ϵ-close in sup-norm, then you can get nice regret
The freshly updated paper answers this question in great detail; see section 6 and also appendix B.
Great question. One thing you could say is that an action is power-seeking compared to another, if your expected (non-dominated subgraph; see Figure 19) power is greater for that action than for the other. Power is kinda weird when defined for optimal agents, as you say - when γ=1, POWER can only decrease. See Power as Easily Exploitable Opportunities for more on this. Shortly after Theorem 19, the paper says: "In appendix C.6.2, we extend this reasoning to k-cycles (k > 1) via theorem 53 and explain how theorem 19 correctly handles fig. 7". In particular, see Figure 19. The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else. If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn't in the main part of the paper, but basically you toss out transitions which aren't part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this. The main idea, though, is that you're reasoning about what the agent's end goals tend to be, and then say "it's going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (i.e. shutdown)". Theorem 17 tells us that in the limit, cycle reachability totally controls POWER. I think I still haven't clearly communicated all my mental models here, but I figured I'd write a reply now while I update the paper. Thank you for these comments, by the way. You're pointing out important underspecifications. :) I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.
Daniel Kokotajlo (6 points, 4y)
I mostly agree with what you say here--which is why I said the criticisms were exaggerated, not totally wrong--but I do think the classic arguments are still better than you portray them. In particular, I don't remember coming away from Superintelligence (I read it when it first came out) thinking that we'd have an AI system capable of optimizing any goal and we'd need to figure out what goal to put into it. Instead I thought that we'd be building AI through some sort of iterative process where we look at existing systems, come up with tweaks, build a new and better system, etc. and that if we kept with the default strategy (which is to select for and aim for systems with the most impressive capabilities/intelligence, and not care about their alignment--just look at literally every AI system made in the lab so far! Is AlphaGo trained to be benevolent? Is AlphaStar? Is GPT? Etc.) then probably doom. It's true that when people are building systems not for purposes of research, but for purposes of economic application -- e.g. Alexa, Google Search, facebook's recommendation algorithm -- then they seem to put at least some effort into making the systems aligned as well as intelligent. However history also tells us that not very much effort is put in, by default, and that these systems would totally kill us all if they were smarter. Moreover, usually systems appear in research-land first before they appear in economic-application-land. This is what I remember myself thinking in 2014, and I still think it now. I think the burden of proof has totally not been met; we still don't have good reason to think the outcome will probably be non-doom in the absence of more AI safety effort. It's possible my memory is wrong though. I should reread the relevant passages.
Sammy Martin (2 points, 4y)
When I wrote that I was mostly taking what Ben Garfinkel said about the 'classic arguments' at face value, but I do recall that there used to be a lot of loose talk about putting values into an AGI after building it.
I suppose I disagree that at least the orthogonality thesis and instrumental convergence, on their own, shift the burden. The OT basically says: "It is physically possible to build an AI system that would try to kill everyone." The ICT basically says: "Most possible AI systems within some particular set would try to kill everyone." If we stop here, then we haven't gotten very far. To repurpose an analogy: Suppose that you lived very far back in the past and suspected that people would eventually try to send rockets with astronauts to the moon. It's true that it's physically possible to build a rocket that shoots astronauts out aimlessly into the depths of space. Most possible rockets that are able to leave earth's atmosphere would also send astronauts aimlessly out into the depths of space. But I don't think it'd be rational to conclude, on these grounds, that future astronauts will probably be sent out into the depths of space. The fact that engineers don't want to make rockets that do this, and are reasonably intelligent, and can learn from lower-stakes experiences (e.g. unmanned rockets and toy rockets), does quite a lot of work. If you're not worried about just one single rocket trajectory failure, but systematically more severe trajectory failures (e.g. people sending larger and larger manned rockets out into the depths of space), then the rational degree of worry becomes increasingly low. Even sillier example: It's possible to make poisons, and there are way more substances that are deadly to people than there are substances that inoculate people against coronavirus, but we don't need to worry much about killing everyone in the process of developing and deploying coronavirus vaccines. This is true even if it turned out that we don't currently know how to make an effective coronavirus vaccine. I think the OT and ICT on their own almost definitely aren't enough to justify an above 1% credence in extinction from AI. To get the rational credence up into (e.g

I think the purpose of the OT and ICT is to establish that lots of AI safety needs to be done. I think they are successful in this. Then you come along and give your analogy to other cases (rockets, vaccines) and argue that lots of AI safety will in fact be done, enough that we don't need to worry about it. I interpret that as an attempt to meet the burden, rather than as an argument that the burden doesn't need to be met.

But maybe this is a merely verbal dispute now. I do agree that OT and ICT by themselves, without any further premises like "AI safety is hard" and "The people building AI don't seem to take safety seriously, as evidenced by their public statements and their research allocation" and "we won't actually get many chances to fail and learn from our mistakes" do not establish more than, say, 1% credence in "AI will kill us all," if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bostrom right before he published the book whether he agrees with you re the implications of OT and ICT on their own, he would have completely agreed. And the text itself seems to agree.

I mostly agree with this. (I think, in responding to your initial comment, I sort of glossed over "and various other premises"). Superintelligence and other classic presentations of AI risk definitely offer additional arguments/considerations. The likelihood of extremely discontinuous/localized progress is, of course, the most prominent one. I think that "discontinuity + OT + ICT," rather than "OT + ICT" alone, has typically been presented as the core of the argument. For example, the extended summary passage from Superintelligence: If we drop the 'likely discontinuity' premise, as some portion of the community is inclined to do, then OT and ICT are the main things left. A lot of weight would then rest on these two theses, unless we supplement them with new premises (e.g. related to mesa-optimization.) I'd also say that there are three especially salient secondary premises in the classic arguments: (a) even many seemingly innocuous descriptions of global utility functions ("maximize paperclips," "make me happy," etc.) would result in disastrous outcomes if these utility functions were optimized sufficiently well; (b) if a broadly/highly intelligent system is inclined toward killing you, it may be good at hiding this fact; and (c) if you decide to run a broadly superintelligent system, and that superintelligent system wants to kill you, you may be screwed even if you're quite careful in various regards (e.g. even if you implement "boxing" strategies). At least if we drop the discontinuity premise, though, I don't think they're compelling enough to bump us up to a high credence in doom.
Sammy Martin (4y)
Perhaps what is going on here is that the arguments as stated in brief summaries like 'orthogonality thesis + instrumental convergence' just aren't what the arguments actually were, and that there were from the start all sorts of empirical or more specific claims made around these general arguments. This reminds me of Lakatos' theory of research programs - where the core assumptions, usually logical or a priori in nature, are used to 'spin off' secondary hypotheses that are more empirical or easily falsifiable. Lakatos' model fits AI safety rather well - OT and IC are some of these non-empirical 'hard core' assumptions that are foundational to the research program and then in ~2010 there were some secondary assumptions, discontinuous progress, AI maximises a simple utility function etc. but in ~2020 we have some different secondary assumptions: mesa-optimisers, you get what you measure, direct evidence of current misalignment
I agree that this is the biggest concern with these models, and the GPT-n series running out of steam wouldn't be a huge relief. It looks likely that we'll have the first human-scale (in terms of parameters) NNs before 2026 - Metaculus, 81% as of 13.08.2020. Does anybody know of any work that's analysing the rate at which, once the first NN crosses the n-parameter barrier, other architectures are also tried at that scale? If no-one's done it yet, I'll have a look at scraping the data from Papers With Code's databases on e.g. ImageNet models, it might be able to answer your question on how many have been tried at >100B as well.
I don't want to take away from MIRI's work (I still support them, and I think that if the GPTs peter out, we'll be glad they've been continuing their work), but I think it's an essential time to support projects that can work for a GPT-style near-term AGI

I'd love to know of a non-zero integer number of plans that could possibly, possibly, possibly work for not dying to a GPT-style near-term AGI.

Here are 11. I wouldn't personally assign greater than 50/50 odds to any of them working, but I do think they all pass the threshold of “could possibly, possibly, possibly work.” It is worth noting that only some of them are language modeling approaches—though they are all prosaic ML approaches—so it does sort of also depend on your definition of “GPT-style” how many of them count or not.

Maybe put out some sort of prize for the best ideas for plans?

Pretty sure OpenPhil and OpenAI currently try to fund plans that claim to look like this (e.g. all the ones orthonormal linked in the OP), though I agree that they could try increasing the financial reward by 100x (e.g. a prize) and see what that inspires.

If you want to understand why Eliezer doesn't find the current proposals feasible, his best writeups critiquing them specifically are this long comment containing high level disagreements with Alex Zhu's FAQ on iterated amplification and this response post to the details of Iterated Amplification.

As I understand it, the high level summary (naturally Eliezer can correct me) is that (a) corrigible behaviour is very unnatural and hard to find (most nearby things in mindspace are not in equilibrium and will move away from corrigibility as they reflect / improve), and (b) using complicated recursive setups with gradient descent to do supervised learning is incredibly chaotic and hard to manage, and shouldn't be counted on working without major testing and delays (i.e. could not be competitive).

There's also some more subtle and implicit disagreement that's not been quite worked out but feeds into the above, where a lot of the ML-focused...

As I understand it, the high level summary (naturally Eliezer can correct me) is that (a) corrigible behaviour is very unnatural and hard to find (most nearby things in mindspace are not in equilibrium and will move away from corrigibility as they reflect / improve), and (b) using complicated recursive setups with gradient descent to do supervised learning is incredibly chaotic and hard to manage, and shouldn't be counted on working without major testing and delays (i.e. could not be competitive).

Perhaps Eliezer can interject here, but it seems to me like these are not knockdown criticisms that such an approach can't “possibly, possibly, possibly work”—just reasons that it's unlikely to and that we shouldn't rely on it working.

My model is that those two are the well-operationalised disagreements and thus productive to focus on, but that most of the despair is coming from the third and currently more implicit point.

Stepping back, the baseline is that most plans are crossing over dozens of kill-switches without realising it (e.g. Yann LeCun's "objectives can be changed quickly when issues surface"). 

Then there are more interesting proposals that require being able to fully inspect the cognition of an ML system and have it be fully introspectively clear and then use it as a building block to build stronger, competitive, corrigible and aligned ML systems. I think this is an accurate description of Iterated Amplification + Debate as Zhu says in section 1.1.4 of his FAQ, and I think something very similar to this is what Chris Olah is excited about re: microscopes about reverse engineering the entire codebase/cognition of an ML system.

I don't deny that there are a lot of substantive and fascinating details to a lot of these proposals and that if this is possible we might indeed solve the alignment problem, but I think that is a large step that sounds from some initial perspectives kind of magical. And don't...

I feel like it's one reasonable position to call such proposals non-starters until a possibility proof is shown, and instead work on basic theory that will eventually be able to give more plausible basic building blocks for designing an intelligent system.

I agree that deciding to work on basic theory is a pretty reasonable research direction—but that doesn't imply that other proposals can't possibly work. Thinking that a research direction is less likely to mitigate existential risk than another is different than thinking that a research direction is entirely a non-starter. The second requires significantly more evidence than the first and it doesn't seem to me like the points that you referenced cross that bar, though of course that's a subjective distinction.

Even if available plans do get funded, generating new plan ideas might still be underfunded.

As for planning, we've seen the GPTs ascend from planning out the next few words, to planning out the sentence or line, to planning out the paragraph or stanza. Planning out a whole text interaction is well within the scope I could imagine for the next few iterations, and from there you have the capability of manipulation without external malicious use.

Perhaps a nitpick, but is what it does planning?

Is it actually thinking several words ahead (a la AlphaZero evaluating moves) when it decides what word to say next, or is it just doing free-writing, and it just happens to be so good at coming up with words that fit with what's come before that it ends up looking like a planned out text?

You might argue that if it ends up as-good-as-planned, then it doesn't make a difference if it was actually planned or not. But it seems to me like it does make a difference. If it has actually learned some internal planning behavior, then that seems more likely to be dangerous and to generalize to other kinds of planning.

That's not a nitpick at all!

Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead - its difficulty completing rhymes when writing poetry, for instance.

(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)

Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!

Yes! I was thinking about this yesterday, and it occurred to me that GPT-3's difficulty with rhyming consistently might not just be a byte-pair problem: any highly structured text with extremely specific, restrictive forward and backward dependencies is going to be a challenge if you're just linearly appending one token at a time onto a sequence without the ability to revise it (maybe we should try a 175-billion parameter BERT?). That explains and predicts a broad spectrum of issues and potential solutions (here I'm calling them A, B and C): performance should correlate to (1) the allowable margin of error per token-group (coding syntax is harsh, solving math equations is harsh, trying to come up with a rhyme for 'orange' after you've written it is harsh), and (2) the extent to which each token-group depends on future token-groups. Human poets and writers always go through several iterations, but we're asking it to do what we do in just one pass.

So in playing around with GPT-3 (AID), I've found two (three?) meta approaches for dealing with this issue. I'll call them Strategies A, B and C.

A is the more general one. You just give it multiple ...

I'm confused about the "because I could not stop for death" example. You cite it as an example of GPT-3 developing "the sense of going somewhere, at least on the topic level," but it seems to have just memorized the Dickinson poem word for word; the completion looks identical to the original poem except for some punctuation.

(To be fair to GPT-3, I also never remember where Dickinson puts her em dashes.)

I... oops. You're completely right, and I'm embarrassed. I didn't check the original, because I thought Gwern would have noted it if so. I'm going to delete that example. What's really shocking is that I looked at what was the original poetry, and thought to myself, "Yeah, that could plausibly have been generated by GPT-3." I'm sorry, Emily.
I did warn in the preface to that section that for really famous poems, GPT-3 will typically continue them and only improvise later on. I assumed that anyone interested in poems these famous would know where the original stopped and the new began, but probably that's expecting too much. I've gone back and annotated further where there seems to be copying.

I think GPT-N is definitely not aligned, for mesa-optimizer reasons. It'll be some unholy being with a superhuman understanding of all the different types of humans, all the different parts of the internet, all the different kinds of content and style... but it won't itself be human, or anything close.

Of course, it's also not outer-aligned in Evan's sense, because of the universal prior being malign etc.

Suppose that GPT-6 does turn out to be some highly transformative AI capable of human-level language understanding and causal reasoning. What would the remaining gap be between that and an agentive AGI? Possibly, it would not be much of a further leap.

There is this list of remaining capabilities needed for AGI in an older post I wrote, with the capabilities of 'GPT-6' as I see them underlined:

Stuart Russell’s List
human-like language comprehension
cumulative learning
discovering new action sets
managing its own mental activity
For reference, I’ve included two capabilities we already have that I imagine being on a similar list in 1960
perception and object recognition
efficient search over known facts

So we'd have discovering new action sets, and managing mental activity - effectively, the things that facilitate long-range complex planning, remaining. Unless you think those could also arise with GPT-N?

Suppose GPT-8 gives you all of those, just spontaneously, but it's nothing but a really efficient text-predictor. Supposing that no dangerous mesa-optimisers arise, what then? Would it be relatively easy to turn it into something agentive, or would agent-like behavi...

For a start you could see how it predicts or extrapolates moral reasoning. The datasets I've seen for that are "moral machines" and "am I the arsehole" on reddit. EDIT: Something like this was just released: Aligning AI With Shared Human Values

You're careful here to talk about transformative AI rather than AGI, and I think that's right. GPT-N does seem like it stands to have transformative effects without necessarily being AGI, and that is quite worrisome. I think many of us expected to find ourselves in a world where AGI was primarily what we had to worry about, and instead we're in a world where "lesser" AI is on track to be powerful enough to dramatically change society. Or at least, so it seems from where we stand, extracting out the trends.

Why do you think "lesser" AI being transformative is more worrying than AGI? This scenario seems similar to past technological progress.
Gordon Seidoh Worley (4y)
I didn't say GPT-N is more worrying than AGI. I'm saying I'm surprised that, in the near term, we have to worry about GPT-N in a way that I (and I think many others) expected only to have to worry about things we would all agree were AGI.
I see, thanks for clarifying!

There are some posts with perennial value, and some which depend heavily on their surrounding context. This post is of the latter type. I think it was pretty worthwhile in its day (and in particular, the analogy between GPT upgrades and developmental stages is one I still find interesting), but I leave it to you whether the book should include time capsules like this.

It's also worth noting that, in the recent discussions, Eliezer has pointed to the GPT architecture as an example that scaling up has worked better than expected, but he diverges from the thes...

I think it's an essential time to support projects that can work for a GPT-style near-term AGI , for instance by incorporating specific alignment pressures during training. Intuitively, it seems as if Cooperative Inverse Reinforcement Learning or AI Safety via Debate or Iterated Amplification are in this class.

As I argued here, I think GPT-3 is more likely to be aligned than whatever we might do with CIRL/IDA/Debate ATM, since it is trained with (self)-supervised learning and gradient descent.

The main reason such a system could pose an x-risk by its...

BTW with regard to "studying mesa-optimization in the context of such systems", I just published this post: Why GPT wants to mesa-optimize & how we might change this. I'm still thinking about the point you made in the other subthread about MAML. It seems very plausible to me that GPT is doing MAML type stuff. I'm still thinking about if/how that could result in dangerous mesa-optimization.
Could you provide more details on this? Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they'd like to see from GPT-3 in response to those inputs ("few-shot learning", right? I don't know what 0-shot learning you're referring to.) Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally? If that's what you're saying... That seems unlikely to me. GPT-3 is essentially a stack of 96 transformers right? So if it was doing something like gradient descent internally, how many consecutive iterations would it be capable of doing? It seems more likely to me that GPT-3 is simply able to learn sufficiently rich internal representations such that when the input/output examples are within its context window, it picks up their input/output structure and forms a sufficiently sophisticated conception of that structure that the word that scores highest according to next-word prediction is a word that comports with the structure. 96 transformers would appear to offer a very limited budget for any kind of serial computation, but there's a lot of parallel computation going on there, and there are non-gradient-descent optimization algorithms, genetic algorithms say, that can be parallelized. I guess the query matrix could be used to implement some kind of fitness function? It would be interesting to try some kind of layer-wise pretraining on transformer blocks and train them to compute steps in a parallelizable optimization algorithm (probably you'd want to pick a deterministic algorithm which is parallelizable instead of a stochastic algorithm like genetic algorithms). Then you could look at the resulting network and based on it, try to figure out what the telltale signs of a mesa-optimizer are (since this network is almost certainly implementing a mesa-optimizer). Still, my impression is you need 1000+ generations to get interesting results with genetic algorithms, which s
David Scott Krueger (formerly: capybaralet) (3y)
No, that's zero-shot. Few shot is when you train on those instead of just stuffing them into the context. It looks like mesa-optimization because it seems to be doing something like learning about new tasks or new prompts that are very different from anything it's seen before, without any training, just based on the context (0-shot). By "training a model", I assume you mean "a ML model" (as opposed to, e.g. a world model). Yes, I am claiming something like that, but learning vs. inference is a blurry line. I'm not saying it's doing SGD; I don't know what it's doing in order to solve these new tasks. But TBC, 96 steps of gradient descent could be a lot. MAML does meta-learning with 1.
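For readers who haven't seen MAML: its inner loop adapts to a new task with as little as one gradient step from a meta-learned initialization. Below is a minimal sketch of that inner step only (not the outer meta-training loop), on a made-up one-parameter regression task. The names `w_init`, `inner_lr`, `support`, and `query` are hypothetical, and nothing here is a claim about what GPT-3 does internally.

```python
# One-step task adaptation in the style of MAML's inner loop.
# Model: y = w * x, loss: mean squared error. All values are illustrative.

def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def mse_grad(w, data):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def adapt_one_step(w_init, support, inner_lr=0.1):
    """Take a single gradient step on the task's support examples."""
    return w_init - inner_lr * mse_grad(w_init, support)

w_init = 1.0                        # stand-in for a meta-learned initialization
support = [(1.0, 3.0), (2.0, 6.0)]  # a few examples from the new task y = 3x
query = [(3.0, 9.0)]                # held-out example from the same task

w_adapted = adapt_one_step(w_init, support)
loss_before = mse(w_init, query)
loss_after = mse(w_adapted, query)
```

A single step does not solve the task here, but it cuts the held-out loss substantially, which is the sense in which a small number of serial steps can still be "a lot".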

Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don't expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don't know what internal optimizers GPT-N ends up building along the way, but I'm not going to count on there being none of them.

Is the distinguishing feature between Oracle AI and approximate Oracle AI, as you use the terms here, just about whether there are inner optimizers or not?

(When I started the paragrap...

The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way. Inner optimizers are an illustration of another failure mode.
GPT generates text by repeatedly picking whatever word seems highest probability given all the words that came before. So if its notion of "highest probability" is almost, but not quite, answering every question accurately, I would expect a system which usually answers questions accurately but sometimes answers them inaccurately. That doesn't sound very scary?
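The "repeatedly picking the highest-probability word" description corresponds to greedy decoding. Here is a minimal sketch of that loop. The table `TOY_MODEL` and the functions `next_word_probs` and `greedy_decode` are invented for illustration; in particular the toy only looks at the last word, whereas a real GPT conditions on the whole preceding context and is often sampled from rather than decoded greedily.

```python
# Toy next-word distributions keyed on the previous word only (an assumption
# made for brevity; a real model conditions on the entire context).
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "bird": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "dog": {"ran": 0.6, "<end>": 0.4},
    "bird": {"<end>": 1.0},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def next_word_probs(context):
    """Return a {word: probability} dict for the next word."""
    return TOY_MODEL[context[-1]]

def greedy_decode(max_words=10):
    """Append the single most probable next word until <end> is chosen."""
    context = ["<start>"]
    for _ in range(max_words):
        probs = next_word_probs(context)
        best = max(probs, key=probs.get)
        if best == "<end>":
            break
        context.append(best)
    return context[1:]  # drop the <start> marker

generated = greedy_decode()
```

If the learned probabilities are slightly off, the loop simply emits slightly different words; there is no separate goal-seeking step in the decoding procedure itself.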
Got it. Thanks!

And the really worrisome capability comes when it models its own interactions with the world, and makes plans with that taken into account.


Someone who's been playing with GPT-3 as a writing assistant gives an example which looks very much like GPT-3 describing this process:

"One could write a program to generate a story that would create an intelligence. One could program the story to edit and refine itself, and to make its own changes in an attempt to improve itself over time. One could write a story to not only change the reader, but also to change

...

GPT-3 can generate a plan and then a way to implement it: bold is prompt.

"Below is a plan of preparing a dinner.

1) Chop up some vegetables and put them in the fridge for later use.

2) Cook some meat, then you can eat it tonight!

3) Wash your hands, because this is going to be messy!

4) And lastly...

5) Eat!

Now you start doing it:

You chop up some carrots, onions and potatoes. You cook some beef, then you can have dinner tonight!

After eating, you wash your hands and get ready for bed. You don't know how long it will take before you feel sleepy again s...

That's not what I mean by planning. I mean "outputting a particular word now because most alternatives would get you stuck later". An example is rhyming poetry. GPT-3 has learned to maintain the rhythm and the topic, and to end lines with rhyme-able words. But then as it approaches the end of the next line, it's painted itself into a corner- there very rarely exists a word that completes the meter of the line, makes sense conceptually and grammatically, and rhymes exactly or approximately with the relevant previous line. When people are writing rhyming metered poetry, we do it by having some idea where we're going - setting ourselves up for the rhyme in advance. It seems that GPT-3 isn't doing this. ...but then again, if it's rewarded only for predictions one word at a time, why should it learn to do this? And could it learn the right pattern if given a cost function on the right kind of time horizon? As for why your example isn't what I'm talking about, there's no point at which it needs to think about later words in order to write the earlier words.
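To make the distinction concrete, here is a toy sketch contrasting word-by-word greedy choice with choosing the first word by looking ahead to whether the line can still end on a rhyme. Everything in it is hypothetical: the table `CONTINUATIONS`, its probabilities, and the rhyme set are made up, and the exhaustive search stands in for whatever "having some idea where we're going" would mean. It makes no claim about GPT-3's internals.

```python
# Hypothetical toy "language model": possible next words with local probabilities.
# Words that do not appear as keys end the line.
CONTINUATIONS = {
    "<line>": [("the", 0.6), ("stars", 0.4)],
    "the": [("moon", 0.7), ("stars", 0.3)],
    "moon": [("glows", 0.9), ("is", 0.1)],
    "glows": [("softly", 1.0)],
    "is": [("bright", 1.0)],
    "stars": [("shine", 0.8), ("ignite", 0.2)],
    "shine": [("bright", 0.5), ("on", 0.5)],
}
RHYMES_WITH_NIGHT = {"bright", "ignite", "light"}

def greedy_line():
    """Pick the locally most probable word at every step."""
    words, current = [], "<line>"
    while CONTINUATIONS.get(current):
        current = max(CONTINUATIONS[current], key=lambda pair: pair[1])[0]
        words.append(current)
    return words

def all_lines(current="<line>", prefix=(), prob=1.0):
    """Enumerate every complete line together with its total probability."""
    options = CONTINUATIONS.get(current, [])
    if not options:
        yield list(prefix), prob
        return
    for word, p in options:
        yield from all_lines(word, prefix + (word,), prob * p)

def planned_line():
    """Pick the most probable complete line whose last word rhymes."""
    rhyming = [(prob, words) for words, prob in all_lines()
               if words[-1] in RHYMES_WITH_NIGHT]
    return max(rhyming)[1]

greedy = greedy_line()
planned = planned_line()
```

In this toy, the greedy line starts with the locally likelier "the" and ends on a word that does not rhyme, while the lookahead version starts with "stars" precisely because that opening keeps a good rhyming ending reachable.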

I don't believe rhymes are an example of a failure to plan. They are a clearcut case of BPE problems.

They follow the same patterns as other BPE problems: works on the most common (memorized) instances, rapidly degrading with rarity, the relevant information cannot be correctly represented by BPEs, they are inherently simple yet GPT-3 performs really badly despite human-like performance on almost identical tasks (like non-rhyming poetry, or non-pun based humor), and have improved minimally over GPT-2. With rhymes, it's even more clearly not a planning problem because Peter Vessenes, I think, on the Slack set up a demo problem where the task was merely to select the rhyming word for a target word out of a prespecified list of possible rhymes; in line with BPEs, GPT-3 could correctly select short common rhyme pairs, and then fell apart as soon as you used rarer words. Similarly, I found little gain for prespecified rhymes. The problem is not that GPT-3 can't plan good rhymes, the problem is that GPT-3 doesn't know what words rhyme, period.
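A rough illustration of why subword tokens can hide rhyme: the sketch below uses a tiny made-up vocabulary (`TOY_PIECES`) and a greedy longest-match tokenizer. Both are simplifying assumptions; real BPE applies learned merge rules over a vocabulary of tens of thousands of entries, and the pieces shown here are not taken from the actual GPT-2/GPT-3 vocabulary.

```python
import string

# Hypothetical subword vocabulary: a few multi-letter pieces plus single-letter
# fallbacks. The model would only ever see the integer IDs.
TOY_PIECES = ["through", "blue", "can", "sh", "oe"] + list(string.ascii_lowercase)
TOKEN_ID = {piece: i for i, piece in enumerate(TOY_PIECES)}

def tokenize(word):
    """Split a lowercase word into the longest matching pieces, left to right."""
    tokens, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            piece = word[i:i + length]
            if piece in TOKEN_ID:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"cannot tokenize {word!r} at position {i}")
    return tokens

def token_ids(word):
    """Integer IDs for a word, which is all a model trained on tokens sees."""
    return [TOKEN_ID[t] for t in tokenize(word)]

rhyming_words = ["through", "blue", "shoe", "canoe"]
encoded = {w: token_ids(w) for w in rhyming_words}
shared = set(tokenize("through")) & set(tokenize("blue"))
```

All four words rhyme, but "through" and "blue" come out as unrelated single IDs with no shared piece, so the fact that they rhyme has to be memorized per pair rather than read off the input. A rare word falls back to single letters, which gives yet another unrelated ID sequence.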

As far as planning goes, next-token prediction is entirely consistent with implicit planning. During each forward pass, GPT-3 probably has plenty of...

Funny thing about BPEs: GPT-3 has to know about the individual letters, because I taught it how to spell both real words and nonsense words. (Prompts in bold, including two where I edited GPT-3's responses.)

The students were ready to test their spelling.
The teacher stood at the front of the class. "Adam, please spell PARTY."
Adam replied, "Party. P-A-R-T-Y." 
The teacher said, "Correct. Beatrice, please spell THROUGH."
Beatrice replied, "Through. T-H-O-R-O-U-G-H."
The teacher said, "Incorrect. Through is spelled T-H-R-O-U-G-H. Carlos, please spell SPELL."
Carlos replied, "Spell. S-P-E-L-L."
The teacher said, "Correct. Daphne, please spell NUCLEAR."
Daphne replied, "Nuclear. N-U-C-L-E-A-R."
The teacher said, "Correct. Adam, please spell INFINITE."
Adam replied, "Infinite. I-N-F-I-N-A-T-E."
The teacher replied, "Incorrect. Infinite is spelled I-N-F-I-N-I-T-E. Beatrice, please spell BALLOON."
Beatrice replied, "Balloon. B-A-L-L-O-O-N."
The teacher replied, "Correct. Carlos, please spell ENCLOSURE."
Carlos replied, "Enclosure. I-N-C-L-O-S-U-R-E."
The teacher replied, "Incorrect. Enclosure is spelled E-N-C-L-O-S-U-R-E. Daphne, please spell ELECTRON."
Daphne replied, "Electron. E-L-E-C-T-R-O-N."
Th...

Sure. It's seen plenty of individual letters (letters have their own BPEs as fallbacks if longer BPEs don't capture them, AFAIK). Stuff like my acrostics demonstration relies on the fact that GPT-3 has knowledge of letters and can, with some difficulty, manipulate them for various tasks.
Aaro Salosensaari (4y)
(Reply to gwern's comment but not only addressing gwern.) Concerning the planning question: I agree that next-token prediction is consistent with some sort of implicit planning of multiple tokens ahead. I would phrase it a bit differently. Also, "implicit" is doing a lot of work here. (Please someone correct me if I say something obviously wrong or silly; I do not know how GPT-3 works, but I will try to say something about how it works after reading some sources [1].) To recap what I have thus far got from [1]: GPT-3-like transformers are trained by a regimen where the loss function evaluates the prediction error of the next word in the sequence given the previous words. However, I am less sure if one can say they do it in isolation. During training (by SGD I figure?), transformer decoder layers have (i) access to previous words in the sequence, and (ii) both the attention and feedforward parts of each transformer layer have weights (that are being trained) to compute the output predictions. Also, (iii) the GPT transformer architecture considers all words in each training sequence, left to right, masking the future. And this is done for many meaningful Common Crawl sequences, though the exact same sequences won't repeat. So, it sounds a bit trivial that GPT's trained weights allow "implicit planning": if given a sequence of words w_1 to w_{i-1} GPT would output word w for position i, this is because a trained GPT model (loosely speaking, abstracting away many details I don't understand) "dynamically encodes" many plausible "word paths" to word w, and [w_1 ... w_{i-1}] is such a path; by iteration, it also encodes many word paths from w to other words w', where some words are likelier to follow w than others. The representations in the stack of attention and feedforward layers allow it to generate text much better than, e.g., a good old Markov chain. And "self-attending" to some higher-level representation that allows it to generate text in a particular prose style seems a lot like a kin
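The "left to right, masking the future" setup described in this comment can be sketched in a few lines. This is a plain-Python illustration with hypothetical helper names (`causal_mask`, `next_token_pairs`), not code from any GPT implementation: the mask says which earlier positions each position may look at, and the training pairs show that every position in a sequence contributes its own next-word prediction.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def next_token_pairs(tokens):
    """One (visible prefix, target word) pair for each position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = ["the", "cat", "sat", "down"]
mask = causal_mask(len(sentence))
pairs = next_token_pairs(sentence)
```

Each target is scored only against what came before it, so any apparent "planning" has to live in how the trained weights map a prefix to a next-word distribution, not in an explicit look at future words.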
Yes, I understand that it doesn't actually plan things, but we can make it mimic planning via special prompts, the same way as GPT mimics reasoning and other things.
Aaro Salosensaari (4y)
I contend it is not an *implementation* in a meaningful sense of the word. It is more a prose elaboration / expansion of the first generated bullet point list (an inaccurate one: "plan" mentions chopping vegetables, putting them in a fridge and cooking meat; prose version tells of chopping a set of vegetables, skips the fridge and then cooks beef, and then tells an irrelevant story where you go to sleep early and find it is a Sunday and no school). Mind, substituting abstract category words with sensible more specific ones (vegetables -> carrots, onions and potatoes) is an impressive NLP task for an architecture where the behavior is not hard-coded in (because that's how some previous natural language generators worked), and even more impressive that it can produce the said expansion with an NLP input prompt, but hardly a useful implementation of a plan. An improved experiment of "implementing plans" that could be within capabilities of GPT-3 or similar system: get GPT-3 to first output a plan of doing $a_thing and then the correct keystroke sequence input for UnReal World, DwarfFortress or Sims or some other similar simulated environment to produce it.

It would also be very useful to build some GPT feature "visualization" tools ASAP.

Do you have anything more specific in mind? I see the Image Feature Visualization tool, but in my mind it's basically doing exactly what you're already doing by comparing GPT-2 and GPT-3 snippets.

No, the closest analogue of comparing text snippets is staring at image completions, which is not nearly as informative as being able to go neuron-by-neuron or layer-by-layer and get a sense of the concepts at each level.