Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I have recently encountered a number of people with misconceptions about OpenAI. Some common impressions are accurate, and others are not. This post is intended to provide clarification on some of these points, to help people know what to expect from the organization and to figure out how to engage with it. It is not intended as a full explanation or evaluation of OpenAI's strategy. 

The post has three sections:

  • Common accurate impressions
  • Common misconceptions
  • Personal opinions

The bolded claims in the first two sections are intended to be uncontroversial, i.e., most informed people would agree with how they are labeled (correct versus incorrect). I am less sure about how commonly believed they are. The bolded claims in the last section I think are probably true, but they are more open to interpretation and I expect others to disagree with them.

Note: I am an employee of OpenAI. Sam Altman (CEO of OpenAI) and Mira Murati (CTO of OpenAI) reviewed a draft of this post, and I am also grateful to Steven Adler, Steve Dowling, Benjamin Hilton, Shantanu Jain, Daniel Kokotajlo, Jan Leike, Ryan Lowe, Holly Mandel and Cullen O'Keefe for feedback. I chose to write this post and the views expressed in it are my own.

Common accurate impressions

Correct: OpenAI is trying to directly build safe AGI.

OpenAI's Charter states: "We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome." OpenAI leadership describe trying to directly build safe AGI as the best way to currently pursue OpenAI's mission, and have expressed concern about scenarios in which a bad actor is first to build AGI, and chooses to misuse it.

Correct: the majority of researchers at OpenAI are working on capabilities. 

Researchers on different teams often work together, but it is still reasonable to loosely categorize OpenAI's researchers (around half the organization) at the time of writing as approximately:

  • Capabilities research: 100
  • Alignment research: 30
  • Policy research: 15

Correct: the majority of OpenAI employees did not join with the primary motivation of reducing existential risk from AI specifically.

My strong impressions, which are not based on survey data, are as follows. Across the company as a whole, a minority of employees would cite reducing existential risk from AI as their top reason for joining. A significantly larger number would cite reducing risk of some kind, or other principles of beneficence put forward in the OpenAI Charter, as their top reason for joining. Among people who joined to work in a safety-focused role, a larger proportion of people would cite reducing existential risk from AI as a substantial motivation for joining, compared to the company as a whole. Some employees have become motivated by existential risk reduction since joining OpenAI.

Correct: most interpretability research at OpenAI stopped after the Anthropic split.

Chris Olah led interpretability research at OpenAI before becoming a cofounder of Anthropic. Although several members of Chris's former team still work at OpenAI, most of them are no longer working on interpretability.

Common misconceptions

Incorrect: OpenAI is not working on scalable alignment.

OpenAI has teams focused both on practical alignment (trying to make OpenAI's deployed models as aligned as possible) and on scalable alignment (researching methods for aligning models that are beyond human supervision, which could potentially scale to AGI). These teams work closely with one another. Its recently-released alignment research includes self-critiquing models (AF discussion), InstructGPT, WebGPT (AF discussion) and book summarization (AF discussion). OpenAI's approach to alignment research is described here, and includes as a long-term goal an alignment MVP (AF discussion).

Incorrect: most people who were working on alignment at OpenAI left for Anthropic. 

The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic. Edited to add: this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here.

Incorrect: OpenAI is a purely for-profit organization.

OpenAI has a hybrid structure in which the highest authority is the board of directors of a non-profit entity. The members of the board of directors are listed here. In legal paperwork signed by all investors, it is emphasized that: "The [OpenAI] Partnership exists to advance OpenAI Inc [the non-profit entity]'s mission of ensuring that safe artificial general intelligence is developed and benefits all of humanity. The General Partner [OpenAI Inc]'s duty to this mission and the principles advanced in the OpenAI Inc Charter take precedence over any obligation to generate a profit. The Partnership may never make a profit, and the General Partner is under no obligation to do so."

Incorrect: OpenAI is not aware of the risks of race dynamics.

OpenAI's Charter contains the following merge-and-assist clause: "We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”"

Incorrect: OpenAI leadership is dismissive of existential risk from AI.

OpenAI has a Governance team (within Policy Research) that advises leadership and is focused on strategy for avoiding existential risk from AI. In multiple recent all-hands meetings, OpenAI leadership have emphasized to employees the need to scale up safety efforts over time, and encouraged employees to familiarize themselves with alignment ideas. OpenAI's Chief Scientist, Ilya Sutskever, recently pivoted to spending 50% of his time on safety.

Personal opinions

Opinion: OpenAI leadership cares about reducing existential risk from AI.

I think that OpenAI leadership are familiar and agree with the basic case for concern and appreciate the magnitude of what's at stake. Existential risk is an important factor, but not the only factor, in OpenAI leadership's decision making. OpenAI's alignment work is much more than just a token effort.

Opinion: capabilities researchers at OpenAI have varying attitudes to existential risk.

I think that capabilities researchers at OpenAI have a wide variety of views, including some with long timelines who are skeptical of attempts to mitigate risk now, and others who are supportive but may consider the question to be outside their area of expertise. Some capabilities researchers actively look for ways to help with alignment, or to learn more about it.

Opinion: disagreements about OpenAI's strategy are substantially empirical.

I think that some of the main reasons why people in the alignment community might disagree with OpenAI's strategy are largely disagreements about empirical facts. In particular, compared to people in the alignment community, OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI. I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.

Opinion: I am personally extremely uncertain about strategy-related questions.

I do not spend most of my time thinking about strategy. If I were forced to choose between OpenAI speeding up or slowing down its work on capabilities, my guess is that I would end up choosing the latter, all else equal, but I am very unsure.

Opinion: OpenAI's actions have drawn a lot of attention to large language models.

I think that the release of GPT-3 and the OpenAI API led to significantly increased focus and somewhat of a competitive spirit around large language models. I consider there to be advantages and disadvantages to this. I don't think OpenAI predicted this in advance, and believe that it would have been challenging, but not impossible, to foresee this.

Opinion: OpenAI is deploying models in order to generate revenue, but also to learn about safety.

I think that OpenAI is trying to generate revenue through deployment in order to directly create value and in order to fund further research and development. At the same time, it also uses deployment as a way to learn in various ways, and about safety in particular.

Opinion: OpenAI's particular research directions are driven in large part by researchers.

I think that OpenAI leadership has control over staffing and resources that affects the organization's overall direction, but that particular research directions are largely delegated to researchers, because they have the most relevant context. OpenAI would not be able to do impactful alignment research without researchers who have a strong understanding of the field. If there were talented enough researchers who wanted to lead new alignment efforts at OpenAI, I would expect them to be enthusiastically welcomed by OpenAI leadership.

Opinion: OpenAI should be focusing more on alignment.

I think that OpenAI's alignment research in general, and its scalable alignment research in particular, has significantly higher average social returns than its capabilities research on the margin.

Opinion: OpenAI is a great place to work to reduce existential risk from AI.

I think that the Alignment, RL, Human Data, Policy Research, Security, Applied Safety, and Trust and Safety teams are all doing work that seems useful for reducing existential risk from AI.

138 comments

One comment in this thread compares the OP to Philip Morris’ claims to be working toward a “smoke-free future.” I think this analogy is overstated, in that I expect Philip Morris is being more intentionally deceptive than Jacob Hilton here. But I quite liked the comment anyway, because I share the sense that (regardless of Jacob's intention) the OP has an effect much like safetywashing, and I think the exaggerated satire helps make that easier to see.

The OP is framed as addressing common misconceptions about OpenAI, of which it lists five:

  1. OpenAI is not working on scalable alignment.
  2. Most people who were working on alignment at OpenAI left for Anthropic.
  3. OpenAI is a purely for-profit organization.
  4. OpenAI is not aware of the risks of race dynamics.
  5. OpenAI leadership is dismissive of existential risk from AI.

Of these, I think 1, 3, and 4 address positions that are held by basically no one. So by “debunking” much dumber versions of the claims people actually make, the post gives the impression of engaging with criticism, without actually meaningfully doing that. Point 2 at least addresses a real argument but, as I understand it, is quite misleading; while technically true, ...

ofer (1mo, karma 9)
Another bit of evidence about OpenAI that I think is worth mentioning in this context: OPP recommended a grant of $30M [https://www.openphilanthropy.org/grants/openai-general-support/] to OpenAI in a deal that involved OPP's then-CEO becoming a board member of OpenAI. OPP hoped that this would allow them to make OpenAI improve their approach to safety and governance. Later, OpenAI appointed both the CEO's fiancée and the fiancée's sibling to VP positions.

Both of whom then left for Anthropic with the split, right?

Yes. To be clear, the point here is that OpenAI's behavior in that situation seems similar to how, seemingly, for-profit companies sometimes try to capture regulators by paying their family members. (See 30 seconds from this John Oliver monologue as evidence that such tactics are not rare in the for-profit world.)

Makes sense; it wouldn't surprise me if that's what's happening. I think this perhaps understates the degree to which the attempts at capture were mutual--a theory of change where OPP gives money to OpenAI in exchange for a board seat and the elevation of safety-conscious employees at OpenAI seems like a pretty good way to have an effect. [This still leaves the question of how OPP assesses safety-consciousness.]

I should also note that I find the 'nondisparagement agreements' people have signed with OpenAI somewhat troubling, because it means many people with high context will not be writing comments like Adam Scholl's above even if they wanted to, and so the absence of evidence is not as much evidence of absence as one would hope.

Does everyone who works at OpenAI sign a non-disparagement agreement? (Including those who work on governance/policy?)

Sooo this was such an intriguing idea that I did some research -- but reality appears to be more boring:

In a recent informal discussion I believe said OPP CEO remarked he had to give up the OpenAI board seat as his fiancée joining Anthropic creates a conflict of interest. Naively this is much more likely, and I think is much better supported by the timelines.
According to LinkedIn, the mentioned fiancée had already joined as VP in 2018 and was promoted to a probably more serious position in 2020, and her sibling was promoted to VP in 2019.
The Anthropic split occurred in June 2021. 
A new board member (who is arguably very aligned to OPP) was inducted in September 2021, probably in place of OPP CEO.
It is unclear exactly when the OPP CEO left the board, but I would guess sometime in 2021. This seems better explained by "conflict of interest with his fiancée joining-cofounding Anthropic", and OpenAI putting another OPP-aligned board member in his place wouldn't make for very productive scheming.

The "conflict of interest" explanation also matches my understanding of the situation better.

iamthouthouarti (1mo, karma 4)
“the presence of which I take the OP to describe as reassuring” I get the sense from this, and from the rest of your comment here, that you think we should in fact not find this even mildly reassuring. I’m not going to argue with such a claim, because I don’t think such an effort on my part would be very useful to anyone. However, if I’m not completely off base or I’m not overstating your position (which I totally could be), then could you go into some more detail as to why you think that we shouldn’t find their presence reassuring at all?

Suppose you're in middle school, and one day you learn that your teachers are planning a mandatory field trip, during which the entire grade will jump off of a skyscraper without a parachute. You approach a school administrator to talk to them about how dangerous that would be, and they say, "Don't worry! We'll all be wearing hard hats the entire time."

Hearing that probably does not reassure you even a little bit, because hard hats alone would not nudge the probability of death below ~100%. It might actually make you more worried, because the fact that they have a prepared response means school administrators were aware of potential issues and then decided the hard hat solution was appropriate. It's generally harder to argue someone out of believing in an incorrect solution to a problem, than into believing the problem exists in the first place.

This analogy overstates the obviousness of (and my personal confidence in) the risk, but to a lot of alignment researchers it's an essentially accurate metaphor for how ineffective they think OpenAI's current precautions will turn out in practice, even if making a doomsday AI feels like a more "understandable" mistake.

iamthouthouarti (1mo, karma 4)
Thank you! I think I understand this position a good deal more now.
William_S (1mo, karma 1)
(I work at OpenAI). Is the main thing you think has the effect of safetywashing here the claim that the misconceptions are common? Like if the post was "some misconceptions I've encountered about OpenAI" it would mostly not have that effect? (Point 2 was edited to clarify that it wasn't a full account of the Anthropic split.)

Incorrect: OpenAI leadership is dismissive of existential risk from AI.

So the reason I think this is that very high-level people have made claims like, “the orthogonality thesis is probably false”, and someone I know who talked to a very, very, very high-level person at OpenAI had to explain to them that inner alignment is a thing. If they actually cared, I would expect the leadership to have more familiarity with their critics’ arguments.

No one remembers now, but the founding rhetoric was also pretty bad, though walked back I suppose.

Also, I often see them claim their AI ethics work (train a model not to offend the average Berkeley humanities grad - possibly not useless, I suppose, but not exactly going to save our lightcone) is important alignment work. Obviously, what is going on inside is not legible to me, but what I see from the outside has mostly been disheartening. Their recent blog on alignment was an exception to this.

Though there are people with their priorities straight at OpenAI, I see little evidence that this is true of their leadership. I’m not confident an organization can be net beneficial when this is the case.

If we're thinking about the same "very, very, very high-level person at OpenAI", it does seem like this person now buys that inner alignment is a thing and is concerned about it (or says he's concerned). It is scary because people at these AI labs don't know all that much about AI alignment but also hopeful because they don't seem to disagree with it and maybe just need to be given the arguments in a good way by someone they would listen to?

Tomás B. (1mo, karma 6)
I suspect we are thinking about the same person, and it is heartening that they changed their mind.
Wei_Dai (1mo, karma 4)
Wait, you don't think this (I mean the training, not the offending) is a safety problem in and of itself? (See also my previous comment about this [https://www.lesswrong.com/posts/pFAavCTW56iTsYkvR/ai-alignment-open-thread-october-2019?commentId=gzv66WeWZ2onkYrWR] .)

People at OpenAI regularly say things like

And you say:

  • OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities

AFAICT, no-one from OpenAI has publicly explained why they believe that RLHF + amplification is supposed to be enough to safely train systems that can solve alignment for us. The blog post linked above says "we believe" four times, but does not take the time to explain why anyone believes these things.

Writing up this kind of reasoning is time-intensive, but I think it would be worth it: if you're right, then the value of information for the rest of the community is huge; if you're wrong, it's an opportunity to change your minds.

Opinion: disagreements about OpenAI's strategy are substantially empirical.

I think that some of the main reasons why people in the alignment community might disagree with OpenAI's strategy are largely disagreements about empirical facts. In particular, compared to people in the alignment community, OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI. I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.

See, this is exactly the problem. Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late. That is the fundamental reason why alignment is harder than other scientific fields. Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, "getting what you measure" in slow takeoff; however you frame it, the issue is the same: things look fine early on, and go wrong later.

And as far as I can tell, OpenAI as an org just totally ignores that whole class of issues/arguments, and charges ahead assuming that if they don't see a problem then there isn't a problem (and meanwhile does things which actively select for hiding problems, e.g. RLHF).

To clarify, by "empirical" I meant "relating to differences in predictions" as opposed to "relating to differences in values" (perhaps "epistemic" would have been better). I did not mean to distinguish between experimental versus conceptual evidence. I would expect OpenAI leadership to put more weight on experimental evidence than you, but to be responsive to evidence of all kinds. I think that OpenAI leadership are aware of most of the arguments you cite, but came to different conclusions after considering them than you did.

[First of all, many thanks for writing the post; it seems both useful and the kind of thing that'll predictably attract criticism]

I'm not quite sure what you mean to imply here (please correct me if my impression is inaccurate - I'm describing how-it-looks-to-me, and I may well be wrong):

I would expect OpenAI leadership to put more weight on experimental evidence than you...

Specifically, John's model (and mine) has:
X = [Class of high-stakes problems on which we'll get experimental evidence before it's too late]
Y = [Class of high-stakes problems on which we'll get no experimental evidence before it's too late]

Unless we expect Y to be empty, when we're talking about Y-problems the weighting is irrelevant: we get no experimental evidence.

Weighting of evidence is an issue when dealing with a fixed problem.
It seems here as if it's being used to select the problem: we're going to focus on X-problems because we put a lot of weight on experimental evidence. (obviously silly, so I don't imagine anyone consciously thinks like this - but out-of-distribution intuitions may be at work)

What kind of evidence do you imagine would lead OpenAI leadership to change their minds/approach?
Do you / your-model-of-leadership believe that there exist Y-problems?

Jacob_Hilton (1mo, karma 2)
I don't think I understand your question about Y-problems, since it seems to depend entirely on how specific something can be and still count as a "problem". Obviously there is already experimental evidence that informs predictions about existential risk from AI in general, but we will get no experimental evidence of any exact situation that occurs beforehand. My claim was more of a vague impression about how OpenAI leadership and John tend to respond to different kinds of evidence in general, and I do not hold it strongly.
Joe_Collman (1mo, karma 2)
To rephrase, it seems to me that in some sense all evidence is experimental. What changes is the degree of generalisation/abstraction required to apply it to a particular problem. Once we make the distinction between experimental and non-experimental evidence, then we allow for problems on which we only get the "non-experimental" kind - i.e. the kind requiring sufficient generalisation/abstraction that we'd no longer tend to think of it as experimental. So the question on Y-problems becomes something like:

  • Given some characterisation of [experimental evidence] (e.g. whatever you meant that OpenAI leadership would tend to put more weight on than John)...
  • ...do you believe there are high-stakes problems for which we'll get no decision-relevant [experimental evidence] before it's too late?

Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late.

I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd still be hard to convert that into a solution for alignment. Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).

I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd be very hard to convert that into a solution for alignment.

If I imagine that we magically had a boxing setup which let us experiment with powerful AGI alignment without dying, I do agree it would still be hard to solve alignment. But it wouldn't be harder than the core problems of any other field of science/engineering. It wouldn't be unusually hard, by the standards of technical research.

Of course, "empirical evidence of power-seeking behavior" is a lot weaker than a magical box. With only that level of empirical evidence, most of the "no empirical feedback" problem would still be present. More on that next.

Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).

The key "lack of empirical feedback" property in Goodhart, deceptive alignment, hard left turn, get what you measure, etc, is this: for any given AI, it will look fine early on (e.g. in training or when optimization power is low) and then things will ...

Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".

My model is that John is talking about "evidence on whether an AI alignment solution is sufficient", and you understood him to say "evidence on whether the AI Alignment problem is real/difficult". My guess is you both agree on the former, but I am not confident.

Richard_Ngo (1mo, karma 5)
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work). I don't really know what "reliable empirical feedback" means in this context - if you have sufficiently reliable feedback mechanisms, then you've solved most of the alignment problem. But, out of the things John listed: I expect that we'll observe a bunch of empirical examples of each of these things happening (except for the hard takeoff phase change), and not know how to fix them.

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I do feel like it would have been extremely hard to build rockets if we had to get it right on the very first try.

I think for rockets the fact that it is so costly to experiment with stuff explains the majority of the difficulty of rocket engineering. I agree you also have very little chance to build a successful space rocket without having a good understanding of Newtonian mechanics and some aspects of relativity, but I don't know, if I could just launch a rocket every day without bad consequences, I am pretty sure I wouldn't really need a deep understanding of either of those, or would easily figure out the relevant bits as I kept experimenting.

The reason rocket science relies so much on having solid theoretical models is that we have to get things right in only a few shots. I don't think you really needed any particularly good theory to build trains, for example. Just a lot of attempts and tinkering.

At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed like mainly a rhetorical flourish rather than a contentful claim about the structure of the difficult parts of the alignment problem.

Also, I think that "on our first try" thing isn't a great framing, because there are always precursors (e.g. we landed a man on the moon "on our first try" but also had plenty of tries at something kinda similar). Then the question is how similar, and how relevant, the precursors are - something where I expect our differing attitudes about the value of empiricism to be the key crux.

David Scott Krueger (formerly: capybaralet) (1mo, karma 2)
Well you could probably build a rocket that looks like it works, anyways. Could you build one you would want to try to travel to the moon in? (Are you imagining you get to fly in these rockets? Or just launch and watch from ground? I was imagining the 2nd...)

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I basically buy that argument, though I do still think lack of shots is the main factor which makes alignment harder than most other technical fields in their preparadigmatic stage.

Roman Leventov (1mo, karma 2)
"Harder" can have two meanings: "the program (of design, and the proof) is longer" and "the program is less likely to be generated in the real world". These meanings are correlated, but not identical.

Here is a similar post one could make about a different company:

A friend of mine has recently encountered a number of people with misconceptions about their employer, Philip Morris International (PMI). Some common impressions are accurate, and others are not. He encouraged me to write a post intended to provide clarification on some of these points, to help people know what to expect from the organization and to figure out how to engage with it. It is not intended as a full explanation or evaluation of Philip Morris's strategy.

Common accurate impressions

  • Philip Morris International is the world's largest producer of cigarettes.
  • The majority of employees at Philip Morris International work on tobacco production and marketing for the developing world.
  • The majority of Philip Morris International's employees did not join with the primary motivation of reducing harm from tobacco smoke specifically.

PMI is the largest tobacco company in the world when measuring by market capitalization or revenue. PMI has six multibillion US$ brands and ships tens of billions of units to (in order of volume) southeast Asia, the European Union, the Middle East and Africa, Eastern Europe, The Americ

...

I found this comment helpful for me as I was trying to understand AI labs' roles in all this. Please consider retracting the retraction :)

lc (1mo, karma 9)
Now that I have your blessing I shall do that! I was mostly worried cause I have a history of making unhelpfully aggressive AI safety-related comments and I didn't want moderators to get frustrated with me again (which, to be clear, so far has happened only for very understandable reasons).

The parent seems to be redacted, but I wish to express that the satire angle did give quite a clear picture of some dynamics that could get watered down to the point of irrelevance. With the length and intensity it might have been unfriendlier than it could have been.

So, in brief and in the abstract: an oil company that promises carbon reductions out of social responsibility can be facing a conflict of interest, and might not be pushing in both directions with the same gusto.

So with an organisation both making AI happen and not happen, the left hand spinning what the right hand is doing is relatively likely.

4Jacob_Hilton1mo
I obviously think there are many important disanalogies, but even if there weren't, rhetoric like this seems like an excellent way to discourage OpenAI employees from ever engaging with the alignment community, which seems like a pretty bad thing to me.

I'd agree if somebody else wrote what you wrote but I don't think it's appropriate for you as an OpenAI employee to say that.

Thank you for causing me to reconsider. I should have said "other OpenAI employees". I do not intend to disengage from the alignment community because of critical rhetoric, and I apologize if my comment came across as a threat to do so. I am concerned about further breakdown of communication between the alignment community and AI labs where alignment solutions may need to be implemented.

I don't immediately see any other reason why my comment might have been inappropriate, but I welcome your clarification if I am missing something.

4gadyp1mo
Thanks for the clarification.

The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and none are working at Anthropic.

I think this is literally true, but at least as far as I know is not really conveying the underlying dynamics and so I expect readers to walk away with the wrong impression.

Again, I might be totally wrong here, but as far as I understand, the underlying dynamic is that there was a substantial contingent of people who worked at OpenAI because they cared about safety but worked in a variety of different roles, including many engineering roles. That contingent had pretty strong disagreements with leadership about a mixture of safety and other operating priorities (but I think mostly safety). Dario in particular had led a lot of the capabilities research and was dissatisfied with how the organization was run.

Dario left and founded Anthropic, taking a substantial amount of engineering and research talent with him (I don'...

[I privately wrote the following quick summary of some publicly-available information on (~safety-relevant) talent leaving OpenAI since the founding of Anthropic. Seems worth pasting here since it already exists but I'd have been more careful if I wrote it with public sharing in mind, it's not comprehensive, and I don't have time to really edit. I'd advise against updating too hard on it because:

  • I basically don't have any visibility into OpenAI
  • Inferences from LinkedIn often don't give a super accurate sense of somebody's contribution.
  • I wrote down what I know about departures from OpenAI but didn't try to write up new hires in the same way.
  • It's often impossible for people at orgs to talk publicly about personnel issues/departures so if Jacob/others don't correct me, it's not very strong evidence that nothing below is inaccurate/misleading.]

The main group of people working on alignment (other than interpretability) at OpenAI at the time of the Anthropic split at the end of 2020 was the Reflection team, which has since been renamed to the Alignment team. Of the 7 members of the team at that time (who are listed on the summarization paper), 4 are still working at OpenAI, and no

...
6Jacob_Hilton1mo
Without commenting on the specifics, I have edited the post to mitigate potential confusion: "this fact alone is not intended to provide a complete picture of the Anthropic split, which is more complicated than I am able to explain here".

Incorrect: OpenAI leadership is dismissive of existential risk from AI.

Why, then, would they continue to build the technology which causes that risk? Why do they consider it morally acceptable to build something which might well end life on Earth?

A common view is that the timelines to risky AI are largely driven by hardware progress and deep learning progress occurring outside of OpenAI. Many people (both at OpenAI and elsewhere) believe that questions of who builds AI and how are very important relative to acceleration of AI timelines. This is related to lower estimates of alignment risk, higher estimates of the importance of geopolitical conflict, and (perhaps most importantly of all) radically lower estimates for the amount of useful alignment progress that would occur this far in advance of AI if progress were to be slowed down. Below I'll also discuss two arguments that delaying AI progress would on net reduce alignment risk which I often encountered at OpenAI.

I think that OpenAI has had a meaningful effect on accelerating AI timelines and that this was a significant cost that the organization did not adequately consider (plenty of safety-focused folk pushed back on various accelerating decisions and this is ultimately related to many departures though not directly my own). I also think that OpenAI is significantly driven by the desire to do something impactful and to reap the short-term benefits of AI. In significant ...

Another fairly common argument and motivation at OpenAI in the early days was the risk of "hardware overhang," that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago.

Could you clarify this bit? It sounds like you're saying that OpenAI's capabilities work around 2017 was net-positive for reducing misalignment risk, even if the only positive we count is this effect. (Unless you think that there's substantial reason that acceleration is bad other than giving the AI safety community less time.) But then in the next paragraph you say that this argument was wrong (even before GPT-3 was released, which vaguely gestures at the "around 2017"-time). I don't see how those are compatible.

One positive consideration is: AI will be built at a time when it is more expensive (slowing later progress). One negative consideration is: there was less time for AI-safety-work-of-5-years-ago. I think that this particular positive consideration is larger than this particular negative consideration, even though other negative considerations are larger still (like less time for growth of AI safety community).

5lc1mo
Are you saying that the AI safety community gets less effective at advancing SOTA interpretability/etc. as it gets more funding/interest, or that the negative consideration is the fact that the AI safety has had less time to grow, or something else? It seems odd to me that AI safety research progress would be negatively correlated with the size and amount of volunteer hours in the field, though I can imagine reasons why someone would think that.
8paulfchristiano1mo
I'm saying that faster progress gives less time for the AI safety community to grow. (I added "less time for" to the original comment to clarify.)
2lc1mo
Ahh, ok.
6gadyp1mo
What's the justification for this view? It seems like significant deep learning progress happens inside of OpenAI. If who builds AI is such an important question for OpenAI, then why would they publish capabilities research, thus giving up the majority of control on who builds AI and how? To a layman, it seems like they're on track to deploy GPT-4 as well as publish all the capabilities research related to that soon. Is there any reason to hope they won't be doing that? How is the harm caused by 1% of people dying even remotely equivalent to 1% reduction in survival, even without considering the value lost in the future lightcone? It seems highly doubtful to me that OpenAI's dedication to doing and publishing capabilities research is a deliberate choice to accelerate timelines due to their deep philosophical adherence to myopic altruism. I don't think they would be doing this if they actually thought they were increasing p(doom) by 1% (which is already an optimistic estimate) per 1 year acceleration of timelines - a much simpler explanation is that they're at least somewhat longtermist (like most humans) but they don't really think there's a significant p(doom) (at least the capabilities researchers and the leadership team).
1lcmgcd1mo
I think Paul was speaking in 3rd person for parts of it where you didn't realize
2Chris_Leong20d
Agreed, this is one of the biggest considerations missed, in my opinion, by people who think accelerating progress was good. (TBH, if anyone was attempting to accelerate progress to reduce AI risk, I think that they were trying to be too clever by half; or just rationalizing).

OpenAI's continued practice of publishing the blueprints allowing others to create more powerful models seems to undermine their claims that they are worried about "bad actors getting there first".

If you were a scientist working on the Manhattan project because you were worried about Hitler getting the atomic bomb first, you wouldn't send your research on centrifuge design to German research scientists. Yet every company that claims they are more likely than other groups to create safe AGI continues to publish the blueprints for creating AGI to the open web.

Is there any actual justification for this other than "The prestige of getting published in top journals makes us look impressive?"

3lcmgcd1mo
Makes you wonder who is developing secret AGI as we speak. One might assume that there is 10x more secret research (and researchers?) than meets the eye

Incorrect: OpenAI is not aware of the risks of race dynamics.

OpenAI's Charter contains the following merge-and-assist clause: "We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”"
 

Being worried about race dynamics and then stopping at the last minute makes sense and seems a lot better than nothing. But I'm confused why this understanding doesn't propagate to other beliefs/actions. 

Specifically, below are some confusions I have with OpenAI's worldview. If answered, these could give me a lot more hope in OpenAI's direction. 

  1. How will you know that AGI has a >50% chance of success in the next two years? MIRI certainly seems to think this is hard. 
  2. How does OpenAI leadership feel about accelerating timelines? [1]
  3. What are OpenAI leadership's timelines right now? Wha
...
3Gurkenglas1mo
If the purpose of the merge-and-assist clause is to prevent a race dynamic, then it's sufficient for that clause to trigger when OpenAI would otherwise decide to start racing. They can interpret their own decision-making, right? Right?

merge-and-assist clause [...] we commit to stop competing with and start assisting this project

So, if you don't think AI should be open (because that looks dangerous), has anyone considered just ... changing the name? (At least, the name of the organization, even if the "OpenAI API" as a product has the string openai embedded in the code too much.) Yeah, it's inconvenient, but ... Alphabet did it! Meta did it! If you're trying to make the most important event in the history of life go well, isn't it worth a little inconvenience to be clear about what that entails?

2lcmgcd1mo
How's it gonna go over if they start calling it closed ai?
5Zack_M_Davis1mo
So call it something else. GoodAI. OpalAI. BeneficiAI. ThoroughAI.
9jefftk1mo
Or OpEnAi: Optimally Envisioning AI. Then the code can still say openai.

Incorrect: OpenAI is not aware of the risks of race dynamics.

I don't think this is a common misconception. I, at least, have never heard anyone claim OpenAI isn't aware of the risk of race dynamics—just that it nonetheless exacerbates them. So I think this section is responding to a far dumber criticism than the one which people actually commonly make.

Alignment research: 30

Could you share some breakdown for what these people work on? Does this include things like the 'anti-bias' prompt engineering?

It includes the people working on the kinds of projects I listed under the first misconception. It does not include people working on things like the mitigation you linked to. OpenAI distinguishes internally between research staff (who do ML and policy research) and applied staff (who work on commercial activities), and my numbers count only the former.

WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons)

(Edit: My current guess for full-time equivalents who are doing safety work at OpenAI (e.g. if someone is doing 50% work that a researcher fully focused on capabilities would do and 50% on alignment work, then we count them as 0.5 full-time equivalents) is around 10, maybe a bit less, though I might be wrong here.)

I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).

3Noosphere891mo
The real question for Habryka is why does he think that it's bad for WebGPT to be built in order to get truthful AI? Like, isn't solving that problem quite a significant thing already for alignment?

WebGPT is approximately "reinforcement learning on the internet".

There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think "reinforcement learning on the internet" is approximately the worst direction for modern AI to go in terms of immediate risks.

I don't think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions for AI to go. And my guess is the majority of the effect of this research will be to cause more people to pursue this direction in the future (Adept.AI seems to be pursuing a somewhat similar approach).

Edit: Jacob does talk about this a bit in a section I had forgotten about in the truthful LM post:

Another concern is that working on truthful LMs may lead to AI being "let out of the box" by encouraging research in which models interact with the external world agentically, in the manner of WebGPT.

I think this concern is worth taking seriously, but that the case for it is weak:

  • As AI capabilities improve, the level of access to the ext
...

The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying "well, seems like the cost is kinda high so we won't do it" seems like exactly the kind of attitude that I am worried will cause humanity to go extinct. 

  • When you say "good things to keep an AI safe" I think you are referring to a goal like "maximize capability while minimizing catastrophic alignment risk." But in my opinion "don't give your models access to the internet or anything equally risky" is a bad way to make that tradeoff. I think we really want dumber models doing more useful things, not smarter models that can do impressive stuff with less resources. You can get a tiny bit of safety by making it harder for your model to have any effect on the world, but at the cost of significant capability, and you would have been better off just using a slightly dumber model with more ability to do stuff. This effect is much bigger if you need to impose extreme limitations in order to get any of this "boxing benefit" (as claimed by the quote you are objecting to).
  • I assume the harms you are pointing to here are about setting expect
...

If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs.

Yeah, my current model is that WebGPT feels like some of the most timelines-reducing work that I've seen (as has most of OpenAI's work). In general, OpenAI seems to have been the organization that has most shortened timelines in the last 5 years, with the average researcher seeming ~10x more efficient at shortening timelines than even researchers at other AGI companies like DeepMind, and probably ~100x more efficient than researchers at most AI research organizations (like Facebook AI).

WebGPT strikes me as on the worse side of OpenAI capabilities research in terms of accelerating timelines (since I think it pushes us into a more dangerous paradigm that will become dangerous earlier, and because I expect it to be the kind of thing that could very drastically increase economic returns from AI). And then it also has the additional side-effect of pushing us into a paradigm of AIs that are much harder to align, so doing alignment work in that paradigm will be slower (as has, I think, a bunch of the RLHF work, though there I think there is a more reasonable case for a commensurate benefit in terms of the technology also being useful for AI Alignment).

I think almost all of the acceleration comes from either products that generate $ and hype and further investment, or more directly from scaleup to more powerful models. I think "We have powerful AI systems but haven't deployed them to do stuff they are capable of" is a very short-term kind of situation and not particularly desirable besides.

I'm not sure what you are comparing RLHF or WebGPT to when you say "paradigm of AIs that are much harder to align." I think I probably just think this is wrong, in that (i) you are comparing to pure generative modeling but I think that's the wrong comparison point barring a degree of coordination that is much larger than what is needed to avoid scaling up models past dangerous thresholds, (ii) I think you are wrong about the dynamics of deceptive alignment under existing mitigation strategies and that scaling up generative modeling to the point where it is transformative is considerably more likely to lead to deceptive alignment than using RLHF (primarily via involving much more intelligent models).

Something I learned today that might be relevant: OpenAI was not the first organization to train transformer language models with search engine access to the internet. Facebook AI Research released their own paper on the topic six months before WebGPT came out, though the paper is surprisingly uncited by the WebGPT paper.

Generally I agree that hooking language models up to the internet is terrifying, despite the potential improvements for factual accuracy. Paul's arguments seem more detailed on this and I'm not sure what I would think if I thought about them more. But the fact that OpenAI was following rather than leading the field would be some evidence against WebGPT accelerating timelines. 

3habryka1mo
I did not know! However, I don't think this is really the same kind of reference class in terms of risk. It looks like the search engine access for the Facebook case is much more limited and basically just consisted of them appending a number of relevant documents to the query, instead of the model itself being able to send various commands that include starting new searches and clicking on links.
4gwern1mo
It does generate the query [https://arxiv.org/pdf/2107.07566.pdf#page=4] itself, though:
2habryka1mo
Does it itself generate the query, or is it a separate trained system? I was a bit confused about this in the paper.
5gwern1mo
You'd think they'd train the same model weights and just make it multi-task with the appropriate prompting, but no, that phrasing implies that it's a separate finetuned model, to the extent that that matters. (I don't particularly think it does matter because whether it's one model or multiple, the system as a whole still has most of the same behaviors and feedback loops once it gets more access to data or starts being trained on previous dialogues/sessions - how many systems are in your system? [https://www.gwern.net/Computers] Probably a lot, depending on your level of analysis. Nevertheless...)

But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can't access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic.

I do think we are likely to be in a bad spot, and talking to people at OpenAI, DeepMind and Anthropic (e.g. the places where most of the heavily-applied prosaic alignment work is happening), I do sure feel unhappy that their plan seems to be banking on this kind of terrifying situation, which is part of why I am so pessimistic about the likelihood of doom.

If I had a sense that these organizations are aiming for a much more comprehensive AI Alignment solution that doesn't rely on extensive boxing I would agree with you more, but I am currently pretty sure they aren't ensuring that, and by default will hope that they can get far enough ahead with boxing-like strategies.

8Rohin Shah1mo
... Who are you talking to? I'm having trouble naming a single person at either of OpenAI or Anthropic who seems to me to be interested in extensive boxing (though admittedly I don't know them that well). At DeepMind there's a small minority who think about boxing, but I think even they wouldn't think of this as a major aspect of their plan. I agree that they aren't aiming for a "much more comprehensive AI alignment solution" in the sense you probably mean it but saying "they rely on boxing" seems wildly off. My best-but-still-probably-incorrect guess is that you hear people proposing schemes that seem to you like they will obviously not work in producing intent aligned systems and so you assume that the people proposing them also believe that and are putting their trust in boxing, rather than noticing that they have different empirical predictions about how likely those schemes are to produce intent aligned systems.
9habryka1mo
Here is an example quote from the latest OpenAI blogpost on AI Alignment: This sounds super straightforwardly to me like the plan of "we are going to train non-agentic AIs that will help us with AI Alignment research, and will limit their ability to influence the world, by e.g. not giving them access to the internet". I don't know whether "boxing" is the exact right word here, but it's the strategy I was pointing to here.
4Rohin Shah1mo
The immediately preceding paragraph is: I would have guessed the claim is "boxing the AI system during training will be helpful for ensuring that the resulting AI system is aligned", rather than "after training, the AI system might be trying to pursue its own goals, but we'll ensure it can't accomplish them via boxing". But I can see your interpretation as well.

Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.

I agree that "train a system with internet access, but then remove it, then hope that it's safe", doesn't really make much sense. In general, I expect bad things to happen during training, and separately, a lot of the problems that I have with training things on the internet are that it's an environment that seems like it would incentivize a lot of agency and make supervision really hard because you have a ton of permanent side effects.

2Rohin Shah1mo
Oh you're making a claim directly about other people's approaches, not about what other people think about their own approaches. Okay, that makes sense (though I disagree). I was suggesting that the plan was "train a system without Internet access, then add it at deployment time" (aka "box the AI system during training"). I wasn't at any point talking about WebGPT.

I don't think "your AI wants to kill you but it can't get out of the box so it helps you with alignment instead" is the mainline scenario. You should be building an AI that wouldn't stab you if your back was turned and it was holding a knife, and if you can't do that then you should not build the AI.

That's interesting. I do think this is true about your current research direction (which I really like about your research and I do really hope we can get there), but when I e.g. talk to Carl Shulman he (if I recall correctly) said things like "we'll just have AIs competing against each other and box them and make sure they don't have long-lasting memory and then use those competing AIs to help us make progress on AI Alignment". Buck's post on "The prototypical catastrophic AI action is getting root access to its datacenter" also suggests to me that the "AI gets access to the internet" scenario is a thing that he is pretty concerned about.

More broadly, I remember that Carl Shulman said that he thinks that the reference class of "violent revolutions" is generally one of the best reference classes for forecasting whether an AI takeover will happen, and that a lot of his hope comes fro...

9paulfchristiano1mo
Even in those schemes, I think the AI systems in question will have much better levers for causing trouble than access to the internet, including all sorts of internal access and their involvement in the process of improving your AI (and that trying to constrain them so severely would mean increasing their intelligence far enough that you come out behind). The mechanisms making AI uprising difficult are not mostly things like "you are in a secure box and can't get out," they are mostly facts about all the other AI systems you are dealing with. That said, I think you are overestimating how representative these are of the "mainline" hope most places, I think the goal is primarily that AI systems powerful enough to beat all of us combined come after AI systems powerful enough to greatly improve the situation. I also think there are a lot of subtle distinctions about how AI systems are trained that are very relevant to a lot of these stories (e.g. WebGPT is not doing RL over inscrutable long-term consequences on the internet---just over human evaluations of the quality of answers or browsing behavior).

I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don't understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it's an important disagreement.

I agree with this! But I feel like this kind of reinforcement learning on a basically unsupervisable action-space while interfacing with humans and getting direct reinforcement on approval is exactly the kind of work that will likely make AIs more strategic and smarter, create deceptive alignment, and produce models that humans don't understand.

I do indeed think the WebGPT work is relevant to both increasing capabilities and increasing likelihood of deceptive alignment (as is most reinforcement learning that directly pushes on human approval, especially in a large action space with permanent side effects).

8habryka1mo
Huh, I definitely expect it to drive >0.1% of OpenAI's activities. Seems like the WebGPT stuff is pretty close to commercial application, and is consuming much more than 0.1% of OpenAI's research staff, while probably substantially increasing OpenAI's ability to generally solve reinforcement learning problems. I am confused why you would estimate it at below 0.1%. 1% seems more reasonable to me as a baseline estimate, even if you don't think it's a particularly risky direction of research (given that it's consuming about 4-5% of OpenAI's research staff).

I think the direct risk of OpenAI's activities is overwhelmingly dominated by training new smarter models and by deploying the public AI that could potentially be used in unanticipated ways.

I agree that if we consider indirect risks broadly (including e.g. "this helps OpenAI succeed or raise money and OpenAI's success is dangerous") then I'd probably move back towards "what % of OpenAI's activities is it."

5David Scott Krueger (formerly: capybaralet)1mo
I don't think the choice is between "smart and boxed" or "less smart and less boxed". Intelligence (e.g. especially domain knowledge) is not 1-dimensional, boxing is largely a means of controlling what kind of knowledge the AI has. We might prefer AI savants that are super smart about some task-relevant aspects of the world and ignorant about a lot of other strategically-relevant aspects of the world.
3Daniel Kokotajlo1mo
Just to make sure I follow: You told them at the time that it was overdetermined that the risks weren't significant? And if you had instead told them that the risks were significant, they wouldn't have done it?

As in: there seem to have generally been informal discussions about how serious this risk was, and I participated in some of those discussions (though I don't remember which discussions were early on vs prior to paper release vs later). In those discussions I said that I thought the case for risk seemed very weak.

If the case for risk had been strong, I think there are a bunch of channels by which the project would have been less likely. Some involve me---I would have said so, and I would have discouraged rather than encouraged the project in general since I certainly was aware of it. But most of the channels would have been through other people---those on the team who thought about it would have come to different conclusions, internal discussions on the team would have gone differently, etc.

Obviously I have only indirect knowledge about decision-making at OpenAI so those are just guesses (hence "I believe that it likely wouldn't have happened"). I think the decision to train WebGPT would be unusually responsive to arguments that it is bad (e.g. via Jacob's involvement) and indeed I'm afraid that OpenAI is fairly likely to do risky things in other cases where there are quite good arguments against.

6lc1mo
Glad to know at least that "Reinforcement Learning but in a highly dynamic and hard-to-measure and uncontrollable environment" is as unsafe as my intuition says it is.
6Quadratic Reciprocity1mo
Letting GPT-3 interact with the internet seems pretty bad to me

like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons

This also seems like an odd statement - it seems reasonable to say "I think the net effect of InstructGPT is to boost capabilities" or even "If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT". But it feels like you're assuming some deep insight into the intention behind the people working on it, and making a much stronger statement than "I think OpenAI's alignment team is making bad prioritisation decisions".

Like, reading the author list of InstructGPT, there are obviously a bunch of people on there who care a bunch about safety including I believe the first two authors - it seems pretty uncharitable and hostile to say that they were motivated by a desire to boost capabilities, even if you think that was a net result of their work.

(Note: My personal take is to be somewhat confused, but to speculate that InstructGPT was mildly good for the world? And that a lot of the goodness comes from field building of getting more people investing in good quality RLHF.)

Yeah, I agree that I am doing reasoning on people's motivations here, which is iffy and given the pushback I will be a bit more hesitant to do, but also like, in this case reasoning about people's motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a bunch on their motivations.

I am honestly a bit surprised to see that WebGPT was as much driven by people who I do know reasonably well and who seem to be driven primarily by safety concerns, since the case for it strikes me as so weak, and the risk as somewhat obviously high, so I am still trying to process that and will probably make some kind of underlying update.

I do think overall I've had much better success at predicting the actions of the vast majority of people at OpenAI, including a lot of safety work, by thinking of them as being motivated by doing cool capability things, sometimes with a thin safety veneer on top, instead of being motivated primarily by safety. For example, I currently think that the release strategy for the GPT models of OpenAI is much better explained by OpenAI wanti... (read more)

3Neel Nanda1mo
That seems weirdly strong. Why do you think that?
3Jacob_Hilton1mo
For people viewing on the Alignment Forum, there is a separate thread on this question here. [https://www.lesswrong.com/posts/3S4nyoNEEuvNsbXt8/common-misconceptions-about-openai?commentId=KqWWZqaATeBN5Tnze#KqWWZqaATeBN5Tnze] (Edit: my link to LessWrong is automatically converted to an Alignment Forum link, you will have to navigate there yourself.)
3habryka1mo
I moved that thread over to the AIAF as well!
2Conor Sullivan1mo
I don't understand this at all. I see InstructGPT as an attempt to make a badly misaligned AI (GPT-3) corrigible. GPT-3 was never at a dangerous capability level, but it was badly misaligned; InstructGPT made a lot of progress.

I think the primary point of InstructGPT is to make the GPT-API more useful to end users (like, it just straightforwardly makes OpenAI more money, and I don't think the metric being optimized is something particularly close to corrigibility).

I don't think Instruct-GPT has made the AI more corrigible in any obvious way (unless you are using the word corrigible very very broadly). In general, I think we should expect reinforcement learning to make AIs more agentic and less corrigible, though there is some hope we can come up with clever things in the future that will allow us to use reinforcement learning to also increase corrigibility (but I don't think we've done that yet).

See also a previous discussion between me and Paul where we were talking about whether it makes sense to say that Instruct-GPT is more "aligned" than GPT-3, which maybe explored some related disagreements: https://www.lesswrong.com/posts/auKWgpdiBwreB62Kh/sam-marks-s-shortform?commentId=ktxyWjAaQXGBwvitf

2Richard_Ngo1mo
Could you clarify what you mean by "the primary point" here? As in: the primary actual effect? Or the primary intended effect? From whose perspective?
6habryka1mo
I think it's the primary reason why OpenAI leadership cares about InstructGPT and is willing to dedicate substantial personnel and financial resources to it. I expect that when OpenAI leadership is making tradeoffs of different types of training, the primary question is commercial viability, not safety. Similarly, if InstructGPT would hurt commercial viability, I expect it would not get deployed (I think individual researchers would likely still be able to work on it, though I think they would be unlikely to be able to hire others to work on it, or get substantial financial resources to scale it).
1lcmgcd1mo
Any particular research directions you're optimistic about?
2Larks1mo
Thanks!

Correct: OpenAI is trying to directly build safe AGI.

OpenAI's Charter states: "We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome." OpenAI leadership describes trying to directly build safe AGI as the best way to currently pursue OpenAI's mission, and have expressed concern about scenarios in which a bad actor is first to build AGI, and chooses to misuse it.

You seem confused about the difference between "paying lip service to X" and "actually trying to do X".

To be clear, this in itself isn't evidence against the claim that OpenAI is trying to directly build safe AI. But it's not much evidence for it, either.

Correct: the majority of researchers at OpenAI are working on capabilities. 

Researchers on different teams often work together, but it is still reasonable to loosely categorize OpenAI's researchers (around half the organization) at the time of writing as approximately:

  • Capabilities research: 100
  • Alignment research: 30
  • Policy research: 15

I'd guess that is an overestimate of the number of people actually doing alignment research at OpenAI, as opposed to capabilities research... (read more)

Calling work you disagree with "lip service" seems wrong and unhelpful.

There are plenty of ML researchers who think that they are doing real work on alignment and that your research is useless. They could choose to describe the situation by saying that you aren't actually doing alignment research. But I think it would be more accurate and helpful if they were to instead say that you are both working on alignment but have big disagreements about what kind of research is likely to be useful.

(To be clear, plenty of folks also think that my work is useless.)

I definitely do not use "lip service" as a generic term for alignment research I disagree with. I think you-two-years-ago were on a wrong track with HCH, but you were clearly aiming to solve alignment. Same with lots of other researchers today - I disagree with the approaches of most people in the field, but I do not accuse them of not actually doing alignment research.

No, this accusation is specifically for things like RLHF (which are very obviously not even trying to solve any of the problems which could plausibly kill us), and for things like "AI ethics" work (which are very obviously not even attempting to solve the extinction problem). In general, it has to be not even trying to solve a problem which kills us in order for me to make that sort of accusation.

If someone on the OpenAI team which worked on RLHF thought humanity had a decent (not necessarily large) chance of going extinct from AI, and they honestly thought implementing and popularizing RLHF made that chance go down, and they chose to work on RLHF because of that, then I would say I was wrong to accuse them of merely paying lip service. I'd think they were pretty stupid about their strategy, but hey, it's alignment, lots of u... (read more)

I take this comment as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that's being made here is failing to recognize that reality doesn't grade on a curve when it comes to understanding the world - your arguments can be false even if nobody has refuted them. That's particularly true when it comes to very high-level abstractions, like the ones this field is built around (and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment).

Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that's fine, this might be necessary, and so it's good to have some people pushing in this direction, but it seems like a bunch of people around here don't just ign... (read more)

Comments on parts of this other than the ITT thing (response to the ITT part is here)...

(and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment)

I don't usually focus much on the outer/inner abstraction, and when I do I usually worry about outer alignment. I consider RLHF to have been negative progress on outer alignment, same as inner alignment; I wasn't relying on that particular abstraction at all.

Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that's fine, this might be necessary, and so it's good to have some people pushing in this direction, but it seems like a bunch of people around here don't just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.

I think your model here completely fails to predict Descartes, Laplace, Von... (read more)

I take this comment as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is.

I don't want to pour a ton of effort into this, but here's my 5-paragraph ITT attempt.

"As an analogy for alignment, consider processor manufacturing. We didn't get to gigahertz clock speed and ten nanometer feature size by trying to tackle all the problems of 10 nm manufacturing processes right out the gate. That would never have worked; too many things independently go wrong to isolate and solve them all without iteration. We can't get many useful bits out of empirical feedback if the result is always failure, and always for a long list of reasons.

And of course, if you know anything about modern fabs, you know there'd have been no hope whatsoever of identifying all the key problems in advance just based on theory. (Side note: I remember a good post or thread from the past year on crazy shit fabs need to do, but can't find it; anyone remember that and have a link?)

The way we actually did it was to start with gigantic millimeter-size features, which were relatively easy to manufacture. And then we scaled down s... (read more)

3RobertKirk1mo
I think people who value empirical alignment work now probably think that (to some extent) we can predict at a high level what future problems we might face (contrasting with "there'd have been no hope whatsoever of identifying all the key problems in advance just based on theory"). Obviously this is a spectrum, but I think the chip fab analogy sits further towards believing there are unknown unknowns in the problem space than people at OpenAI do (e.g. OpenAI people possibly think outer alignment and inner alignment capture all of the kinds of problems we'll face). However, they probably don't believe you can work on solutions to those problems without being able to empirically demonstrate those problems and hence iterate on them (and again one could probably appeal to a track record here of most proposed solutions to problems not working unless they were developed by iterating on the actual problem). We can maybe vaguely postulate what the solutions could look like (they would say), but it's going to be much better to try and actually implement solutions on versions of the problem we can demonstrate, and iterate from there. (Note that they probably also try to produce demonstrations of the problems such that they can then work on those solutions, but this is still all empirical). Otherwise I do think your ITT does seem reasonable to me, although I don't think I'd put myself in the class of people you're trying to ITT, so that's not much evidence.
5habryka1mo
I am confused. How does RLHF help with outer alignment? Isn't optimizing for human approval the classical outer-alignment problem? (e.g. tiling the universe with smiling faces) I don't think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned, I just don't understand where the inner/outer alignment distinction comes from in this context)
4Richard_Ngo1mo
RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples [https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity] , but also intrinsic motivation functions like curiosity and empowerment [https://arxiv.org/abs/1908.06976]) which are used to train agents in the absence of RLHF. The smiley faces example feels confusing as a "classic" outer alignment problem because AGIs won't be trained on a reward function anywhere near as limited as smiley faces. An alternative like "AGIs are trained on a reward function in which all behavior on a wide range of tasks is classified by humans as good or bad" feels more realistic, but also lacks the intuitive force of the smiley face example - it's much less clear in this example why generalization will go badly, given the breadth of the data collected.

I think the smiling example is much more analogous than you are making it out here. I think the basic argument for "this just encourages taking control of the reward" or "this just encourages deception" goes through the same way.

Like, RLHF is not some magical "we have definitely figured out whether a behavior is really good or bad" signal, it's historically been just some contractors thinking for like a minute about whether a thing is fine. I don't think there is less Bayesian evidence conveyed by people smiling (like, the variance in smiling is greater than the variance in RLHF approval, and so the amount of information conveyed is actually more), so I don't buy that RLHF conveys more about human preferences in any meaningful way.

RLHF (which are very obviously not even trying to solve any of the problems which could plausibly kill us)

Sorry for being dumb, but I thought the naïve case for RLHF is that it helps solve the problem of "people are very bad at manually writing down an explicit utility or reward function that does what they intuitively want"? Does that not count as one of the lethal problems (even if RLHF alone would kill us because of the other problems)? If one of the other problems is Goodharting/unforeseen-maxima, it seems like RLHF could be helpful insofar as if RLHF rewards are quantitatively less misaligned than hand-coded rewards, you can get away with optimizing them harder before they kill you?

That is a reasonable case, with the obvious catch that you don't know how hard you can optimize before it goes wrong, and when it does go wrong you're less likely to notice than with a hand-coded utility/reward.

But I expect the people who work on RLHF do not expect an explicit utility/reward to be a problem which actually kills us, because they'd expect visible failures before it gets to the capability level of killing us. RLHF makes those visible failures less likely. Under that frame, it's the lack of a warning shot which kills us.

3Zack_M_Davis1mo
Because it incentivizes learning human models [https://www.lesswrong.com/posts/BKjJJH2cRpJcAnP7T/thoughts-on-human-models#Less_Independent_Audits] which can then be used to be more competently deceptive, or just because once you've fixed the problems you know how to notice, what's left are the ones you don't know how to notice? The latter doesn't seem specific to RLHF (you'd have the same problem if people magically got better at hand-coding rewards), but I see how the former is plausible and bad.

The problem isn't just learning whole human models. RLHF will select for any heuristic/strategy which, even by accident, hides bad behavior from humans. It applies even at low capabilities.

This is testable by asking someone from OpenAI things like

  • how the decision to work on RLHF was made: how many hours were spent on it, who was in charge
  • their models under which RLHF is good and bad for humanity
8David Scott Krueger (formerly: capybaralet)1mo
FWIW, I personally know some of the people involved pretty well since ~2015, and I think you are wrong about their motivations.
2johnswentworth1mo
That is plausible; I have made my position here very easy to falsify if I'm wrong.

How? E.g. Jacob left a comment here about his motivations, does that count as a falsification? Or, if you'd say that this is an example of rationalization, then what would the comment need to look like in order to falsify your claim? Does Paul's comment here mentioning the discussions that took place before launching the GPT-3 work count as a falsification? If not, why not?

Jacob's comment does not count, since it's not addressing the "actually consider whether the project will net decrease chance of extinction" or the "could the answer have plausibly been 'no' and then the project would not have happened" part.

Paul's comment does address both of those, especially this part at the end:

To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn't have happened.

That does indeed falsify my position, and I have updated the top-level comment accordingly. Thank you for the information.

I think Jacob (OP) said "OpenAI is trying to directly build safe AGI." and cited the charter and other statements as evidence of this claim. Then John replied that the charter and other statements are "not much evidence" either for or against this claim, because talk is cheap. I think that's a reasonable point.

Separately, maybe John in fact believes that the charter and other statements are insincere lip service. If so, I would agree with you (Paul) that John's belief is probably incorrect, based on my very limited knowledge. [Where I disagree with OpenAI, I presume that top leadership is acting sincerely to make a good future with safe AGI, but that they have mistaken beliefs about the hardness of alignment and other topics.]

4paulfchristiano1mo
I was replying to:
2Steven Byrnes1mo
Thanks, sorry for misunderstanding.

In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".

I share your opinion of RLHF work but I'm not sure I share your opinion of its consequences. For situations where people don't believe arguments that RLHF is fundamentally flawed because they're too focused on empirical evidence over arguments, the generation of empirical evidence that RLHF is flawed seems pretty useful for convincing them! 

“OpenAI leadership tend to put more likelihood on slow takeoff”

Could you say more about the timelines of people at OpenAI? My impression was that they’re very short and explicitly include the possibility of scaling language models to AGI. If somebody builds AGI in the next 10 years, OpenAI seems like a leading candidate to do so. Would people at OpenAI generally agree with this?

I would expect OpenAI leadership to change their mind on these questions given clear enough evidence to the contrary.

Why do you expect this? What sorts of evidence do you expect would do it? What do you suppose they think of arguments about inner alignment, orthogonality, deceptive alignment, FOOM, sharp-left-turn?

I would be very curious to see your / OpenAI's responses to Eliezer's Dimensions of Operational Adequacy in AGI Projects post. Which points do you / OpenAI leadership disagree with? Insofar as you agree but haven't implemented the recommendations, what's stopping you?

Thanks again for writing this.
A few thoughts:

I think that the release of GPT-3 and the OpenAI API led to significantly increased focus and somewhat of a competitive spirit around large language models... I don't think OpenAI predicted this in advance, and believe that it would have been challenging, but not impossible, to foresee this.

Do you believe any general lessons have been learned from this? Specifically, it seems a highly negative pattern if [we can't predict concretely how this is likely to go badly] translates to [we don't see any reason not to go ahead].

I note that there's an asymmetry here: [states of the world we like] are a small target. To the extent that we can't predict the impact of a large-scale change, we should bet on negative impact.

 

OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities, and are more concerned about bad actors developing and misusing AGI...

Questions:

  1. If we're in a scenario with [slow takeoff], [alignment is fairly easy], and [empirical, capabilities-reliant approaches work well], wouldn't we expect alignment to
... (read more)
8Conor Sullivan1mo
DM has to deal with Alphabet management, who is significantly less alignment-aware than DM or OAI leadership. Merging wouldn't solve the race dynamics and would make ownership/leadership issues worse.
6Joe_Collman1mo
Sure, that makes sense to me. I suppose my main point is "why would we expect this to be different in the future?". (perhaps there are reasons to think things would be different, but I've heard no argument to this effect)

Could you explain the rationale behind the "Open" in OpenAI? I can understand the rationale of trying to beat more reckless companies to achieving AGI first (albeit, this mentality is potentially extremely dangerous too), but what is the rationale behind releasing your research? This will enable companies that do not prioritize safety to speed ahead with you, perhaps just a few years behind. And, if OpenAI hesitates to progress, due to concerns over safety, the more risk-taking orgs will likely speed ahead of OpenAI in capabilities. The bottom line is I'm conc... (read more)

I also appreciated reading this.

Thanks for writing this! I agree with most of the claims you consider to be objective, and appreciate you writing this up so clearly.

Thank you very much for writing this post, Jacob. I think it clears up several of the misconceptions you emphasize.

I generally agree with John that the class of problems OpenAI focuses on might be more capabilities-aligned than optimal, but at the same time, having a business model that relies on empirical prosaic alignment of language models generates interesting alignment results, and I'm excited for the alignment work that OpenAI will be working on!

The Partnership may never make a profit

I couldn't find this quote in the page that you were supposedly quoting from. The only google result for it is this post. Am I missing something?

[This comment is no longer endorsed by its author]
1ofer1mo
Sorry, that text does appear in the linked page (in an image).

I might add the most glaring misconception, at least for me in the early days... I assumed their primary goal was to support Open Source AI, and would "default to open" on all their projects. Instead orgs like Hugging Face expend significant resources reverse engineering the AI papers and models that OpenAI releases.
