This post is the first half of a series about my attempts to understand Anthropic’s current strategy and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying. (The impetus for looking into this was to answer the question of whether I should join Anthropic's ops team.) As part of my research, I read a number of Anthropic’s published papers, and spoke to people within and outside of Anthropic.
This post contains “observations” only, which I wanted to write up as a reference for anyone considering similar questions. I will make a separate post about the inferences and conclusions I’ve reached personally about working at Anthropic, based on the info I’m sharing here.
Anthropic is planning to grow. They’re aiming to be one of the “top players”, competitive with OpenAI and DeepMind, working with a similar level of advanced models. They have received outside investment, because keeping up with the state of the art is expensive, and is going to get more so. They’ve recently been hiring for a product team, in order to get more red-teaming of models and eventually have more independent revenue streams.
I think Anthropic believes that this is the most promising route to making AGI turn out well for humanity, so it’s worth taking the risk of being part of the competition and perhaps contributing to accelerating capabilities. Alternatively stated, Anthropic leadership believes that you can’t solve the problem of aligning AGI independently from developing AGI.
My current sense is that this strategy makes sense under a particular set of premises:
1. There is not, currently, an obviously better plan or route to solving alignment, that doesn’t involve keeping up with state-of-the-art large models. Yes, it’s a plan with some risks, but we don’t have any better ideas yet.
2. We don’t understand deep learning systems, and we don’t have a theoretical approach; we’re at the point where actually just running experiments on current models and observing the results is the best way to get information.
   - This could at some point lead to a more general theory or theories of alignment.
   - Or there may just be practical/empirical evidence of something like an “alignment attractor basin” and knowledge of how to practically stay in it.
3. There’s a high enough probability that whatever method ends up getting us to AGI will be, basically, an extension and further exploration of current deep learning, rather than a completely new kind of architecture that doesn’t even share the same basic building blocks.
   - Note: there’s an argument that in worlds where Anthropic’s research is less useful, Anthropic is also contributing much less to actually-dangerous race dynamics, since faster progress in LLMs won’t necessarily lead to shorter timelines if LLMs aren’t a route to AGI.
4. There is, additionally, a high enough probability that behaviors observed in current-generation models will also be a factor for much more advanced models.
   - (This isn’t a claim that understanding how to align GPT-3 is enough – we’ll need to understand the new and exciting behaviors and alignment challenges that start to emerge at higher levels too – but the knowledge base being fleshed out now will be at least partly applicable.)
5. It’s possible, in principle, to implement this strategy such that the additional progress on alignment-related questions and positive influence on norms in the field will more than cancel out the cost of accelerating progress – that even if it brings the point at which we hit AGI-level capabilities a few months or years earlier, in expectation it will move the point at which we have an alignment solution or process for reaching one earlier by a larger factor.
6. This relies on carefully tracking what will or won’t counterfactually accelerate capabilities development, and if necessary being willing to make genuine tradeoffs – in terms of profit from deploying products, or hiring brilliant researchers who don’t care about safety, or pleasing investors – but Anthropic, specifically, is in a position to carry through on that, and will continue to be in that position, avoiding future mission drift despite the potential risk of pressure from investors. A lot of care has been put into ensuring that investors have very little influence over internal decisions and priorities.
7. Anthropic will also continue to be in a position where if the landscape changes – if a better idea does appear, or if mission drift becomes super obvious to Dario and Daniela, or if for whatever reason Anthropic’s current strategy no longer seems like a good idea – then they’ll be able to pivot, and switch to a strategy that doesn’t require keeping up as one of the “top players” with all of the attendant risks.
I think someone could disagree or have doubts on any of these points, and I would like to know more about the range of opinions on 1-4 from people who have more technical AI safety background than I do. I’m mainly going to focus on 5, 6, and 7.
Implications for Anthropic’s structure and processes
The staff whom I spoke to believe that Anthropic’s leadership, and the Anthropic team as a whole, have thought very hard about this; that the leadership team applied considerable effort to setting the company up to avoid mission drift, and continue to be cautious and thoughtful around deploying advanced systems or publishing research.
Staff at Anthropic list the following as protective factors, some historical and some ongoing:
- Anthropic’s founding team consists of, specifically, people who formerly led safety and policy efforts at OpenAI, and (I am told) there’s been very low turnover since then. To the extent that Anthropic’s plan relies on the leadership being very committed to prioritizing alignment, this is evidence in that direction.
- Anthropic’s corporate structure is set up to try to mitigate some of the incentive problems with being a for-profit company that takes investment (and thus has fiduciary duties, and social pressure, to focus on profitable projects). They do take investment and have a board of stakeholders, and plan to introduce a structure to ensure the mission continues to be prioritized over profit.
- Anthropic tries to strongly filter new hires for culture fit and taking the potential risks of AI seriously (though at some point this may be in tension with growing the team, and it may already be). This means that they can have a strong internal culture of prioritizing safety and flagging concerns.
- Anthropic’s internal culture supports all of its staff in expressing and talking about their doubts, and questioning whether deploying an advanced system or publishing a particular paper might be harmful, and these doubts are taken seriously.
- There’s an argument to be made that OpenAI was already intending to push capabilities development as fast as possible, and so adding a new competitor to the ecosystem wasn’t going to give them any additional motive to go faster in order to stay ahead. (Though there are separate concerns about second-order effects, like generally “raising awareness” about the potential economic value of state-of-the-art models, and increasing investor “hype” in AI labs in general.)
- While I don’t think anyone is ignoring the importance of second-order effects, there’s an argument that the first-order effect of Anthropic “competing” with OpenAI is that they might draw away investors and customers who would otherwise have funded OpenAI.
In this post I’ve done my best to neutrally report the information I have about Anthropic’s strategy, reasoning, and structure as relayed to me by staff and others who were kind enough to talk to me, and tried to avoid injecting my own worldview.
In my upcoming post (“Personal musings on Anthropic and incentives”), I intend to talk less neutrally about my reactions to the above and how it plays into my personal decision-making.
Note: I believe Anthropic thinks that large-scale, state-of-the-art models are necessary for their current work on constitutional AI and using AI-based reinforcement learning to train LLMs to be “helpful, harmless, and honest”, and that while some initial progress can be made on their mechanistic interpretability work on transformers using smaller models, they also believe this will need to be scaled up in future to get the full value.
I am told that Anthropic has had three doublings of headcount in two years (8x overall, which works out to roughly 2.8x, or close to 3x, year-over-year growth), and may stay at more like 2x year-over-year going forward, and that this is nothing like OpenAI’s early growth rate of 8x (where purportedly no filtering for cultural fit/alignment interest was applied).
Why do you think Anthropic is not replying to MIRI's challenge?
+1. A few other questions I'm interested in:
I'm sympathetic to the fact that it might be costly (in terms of time and possibly other factors like reputation) to respond to some of these questions. With that in mind, I applaud DeepMind's alignment team for engaging with some of these questions, I applaud OpenAI for publicly stating their alignment plan, and I've been surprised that Anthropic has engaged the least (at least publicly, to my knowledge) about these kinds of questions.
A "Core Views on AI Safety" post is now available at https://www.anthropic.com/index/core-views-on-ai-safety
(Linkpost for that is here: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety.)
Here's a wild guess. They just "stole" a bunch of core people from OpenAI; that doesn't happen to any organization without tension and bad feelings. Now they are in direct competition with OpenAI for funding, staff, and press coverage. Even worse!
Perhaps they made peace and agreed not to make public releases for some time. Or it could be they want to differentiate themselves before they release their strategy.
Thanks for writing this up! It seems very helpful to have open, thoughtful discussions about different strategies in this space.
Here is my summary of Anthropic’s plan, given what you’ve described (let me know if it seems off):
Leaving aside concerns about arms races and big models being scary in and of themselves, this seems like a pretty reasonable approach to me. In particular, I’m pretty on board with points 1, 2, and 3—i.e., if you don’t have theories, then getting your feet wet with the actual systems, observing them, experimenting, tinkering, and so on, seems like a pretty good way to eventually figure out what’s going on with the systems in a more formal/mechanistic way.
I think the part I have trouble with (which might stem from me just not knowing the relevant stuff) is point 4. Why do you need to do all of this on current models? I can see arguments for this, for instance, perhaps certain behaviors emerge in large models that aren’t present in smaller ones. But I’ve never seen, e.g., a list of such things and why they are important or cruxy enough to justify the emphasis on large models given the risks involved. I would really like to see such an argument! (Perhaps it does exist and I am not aware).
I also have a bit of trouble with the “top player” framing—at the moment I just don’t see why this is necessary. I understand that Anthropic works on large models, and that this is on par with what other “top players” in the field are doing. But why not just say that you want to work with large models? Why mention being competitive with DeepMind or OpenAI at all? The emphasis on “top player” makes me think that something is left unsaid about the motivation, aside from the emphasis on current systems. To the extent that this is true, I wish it were stated explicitly. (To be clear, "you" means Anthropic, not Miranda).
Your summary seems fine!
I think that Anthropic's current work on RL from AI Feedback (RLAIF) and Constitutional AI is based on large models exhibiting behaviors that don't work in smaller models? (But it'd be neat if someone more knowledgeable than me wanted to chime in on this!)
My current best understanding is that running state-of-the-art models is expensive in terms of infrastructure and compute, the next generation of models will get even more expensive to train and run, and Anthropic doesn't have (and doesn't expect to realistically be able to get) enough philanthropic funding to work on the current best models, let alone future ones – so they need investment and revenue streams.
There's also a consideration that Anthropic wants to have influence in AI governance/policy spaces, where it helps to have a reputation/credibility as one of the major stakeholders in AI work.
I'm interested in asking why you believe what you believe here.
As am I. So many organizations have a whistleblower policy or a safety culture. I've worked in industry and, to put it gently, how these cultures work in practice can be quite a bit different than the stated intention.
It's because, from a management perspective, letting anyone ask questions has to be balanced against getting things done and having some top-down leadership.
Note that that was inside "Staff at Anthropic list the following as protective factors".
I'd be curious what the OP and what the staff would say more specifically here. "Doubts are taken seriously" is quite a large range, from "can change the overall strategy" to "is diplomatically 'listened to' to use up any dissenting energy". E.g. what would happen at Anthropic with inquiries that could lead to changing the whole strategic direction, as in, "what if we shouldn't be advancing capabilities?"?
My sense is that it's been somewhere in between – on some occasions staff have brought up doubts, and the team did delay a decision until they were addressed, but it's hard to judge how much the end result was a different decision from what would have been made otherwise, versus just happening later.
The sense I've gotten of the culture is compatible with (current) Anthropic being a company that would change their entire strategic direction if staff started coming in with credible arguments along the lines of "what if we shouldn't be advancing capabilities?", but I think this hasn't yet been put to the test – people who choose to work at Anthropic are going to be selected for agreeing on the premises behind the Anthropic strategy – and it's hard to know for sure how it would go.
On a reread, I noticed that I don't actually know what Anthropic's strategy is. This is actually a question about a couple of things.
The first is what endpoint they're targeting - "solve and implement alignment" is the ultimate goal, of course, but one can coherently imagine targeting something else, as with Encultured, which is explicitly not targeting "solve alignment" but a much smaller subset of what they expect will be a larger ecosystem adding up to a "solution".
The second is what strategy they're currently following in pursuit of that endpoint.
There are some details that can be extracted based on the implied premises it relies on, but it would be great to hear from Anthropic directly what the current strategy is, in a way that either rules out substantial chunks of action-space, or requires very specific actions. (I think that in a very meaningful sense, a strategy is a special case of a prediction, which must constrain your expectations about your future actions.)
I just want to add that "whether you should consider applying" probably depends massively on what role you're applying for. E.g. even if you believed that pushing AI capabilities was net negative right now, you might still want to apply for an alignment role.
Not saying you intended this, but I worry about people thinking "it's an alignment role and therefore good" when considering joining companies that are pushing state of the art, and not thinking about it much harder than that.
What else should people be thinking about? You'd want to be sure that you'll, in fact, be allowed to work on alignment. But what other hidden downsides are there?
People should be thinking about:
Personally, I think that if you are working with cutting-edge LLMs, you need to pass a high burden of proof/reasoning that this is good. Incentives like prestige, salary, and "meaning" mean one ought to question oneself pretty hard when doing the equivalent of entering the nuclear bomb or conventional arms manufacturing industries (especially during wartime).
"is this a role from which I will push forward alignment faster than I advance capabilities?" is a very different question than "does this job have 'alignment' in the title?" I assume when it's put like that you would choose the first question, but in practice a lot of people take jobs that are predictable nos to the first question and justify it by claiming they're alignment jobs. Given that, I think it's good Ruby pushed back on something that would end up supporting the latter form of the question, even if it wasn't intended.
This claim seems misleading at best: Dario, Anthropic's founder and CEO, led OpenAI's work on GPT-2 and GPT-3, two crucial milestones in terms of public AI capabilities.
Given that I don't have much time to evaluate each claim one by one, and Gell-Mann amnesia, I am a bit more skeptical of the other ones.
Was Dario Amodei not the former head of OpenAI’s safety team?
He wrote "Concrete Problems in AI Safety".
I don't see how the claim isn't just true/accurate. Whether or not he led/contributed to the GPT series, (I am under the impression that) Dario Amodei did lead safety efforts at OpenAI.
If someone reads "Person X is Head of Safety", they wouldn't assume that the person led the main AI capabilities efforts of the company for the last 2 years.
Only saying "head of the safety team" implies that this was his primary activity at OpenAI, which is just factually wrong.
According to his LinkedIn, from 2018 until end of 2020, when he left, he was Director of Research and then VP of Research of OpenAI, where he "set overall research direction" and "led the effort to build GPT2 and 3". He led the safety team before that, between 2016 and 2018.
I do think it's fair to consider the work on GPT-3 a failure of judgement and a bad sign about Dario's commitment to alignment, even if at the time (also based on LinkedIn) it sounds like he was also still leading other teams focused on safety research.
(I've separately heard rumors that Dario and the others left because of disagreements with OpenAI leadership over how much to prioritize safety, and maybe partly related to how OpenAI handled the GPT-3 release, but this is definitely in the domain of hearsay and I don't think anything has been shared publicly about it.)
Is there anything specifically about their corporate structure now that mitigates the incentive problems? I know they are a public benefit corporation, but many of us are unclear on what that actually means besides "Anthropic thinks they have a good mission" - since as you point out they're still a for-profit company with investors. (I actually wasn't able to find any info about Anthropic's board when I searched recently, so the "board of stakeholders" is news to me.)
I know there is a ton involved in building a company like this, so it's ok if they really do have plans to set up a more beneficial structure and just haven't gotten around to it. But since the stakes with AGI are so high, it would be really nice to know more about what those plans are and to see them implemented so that we're not just taking their word for it.
Thanks for doing this post series btw, it's a really great discussion for us to get to have.
Yes, benefit corporations were created to provide an alternative to "shareholder primacy", otherwise widely accepted in law and custom, per Wikipedia: Benefit_corporation#Differences_from_traditional_corporations. Further quoting:
Registering as a Public Benefit corporation means that they, the board of directors of the corporation, can't be sued for failing to maximize shareholder value, and potentially could be challenged if they "fail to consider the effect decisions on stakeholders beyond shareholders."
It would be interesting if they filed as a certified benefit corporation, B Corp, but I'm not sure what would be at stake if they failed to live up to that standard. Perhaps B Lab (the non-profit that certifies B Corps), or a similar new entity, should endeavor to create a new status for recognizing safe and responsible creation, handling, and governance controls of powerful AIs. With external certifications one worries about Goodhart's law, and about "safety-washing" taking the place of "green-washing", especially given the (current) non-enforceability of B Corp standards.
Do you find OpenAI's LP entity more credible? Do you have ideas about another legal structure?
Thank you for sharing this; I'd be excited to see more writeups that attempt to analyze the strategy of AI labs.
I found that this introduction raised my expectations for the post and misled me a bit. After reading the introduction, I was expecting to see more analysis of the pros and cons of Anthropic's strategy, as well as more content from people who disagree with Anthropic's strategy.
(For example, you have a section in which you list protective factors as reported by Anthropic staff, but there is no corresponding section that features criticisms from others-- e.g., independent safety researchers, OpenAI employees, etc.)
To be clear, I don't think you should have to do any of that to publish a post like this. I just think that the expectation-setting could have been better. (I plan to recommend this post, but I won't say "here's a post that lays out the facts to consider in terms of whether Anthropic's work is likely to be net positive"; instead, I'll say "here's a post where someone lists some observations about Anthropic's strategy, and my impression is that this was informed largely by talking to Anthropic staff and Anthropic supporters. It seems to underrepresent the views of critics, but I still think it's a valuable read.")
It's deliberate that this post covers mostly specifics that I learned from Anthropic staff, and further speculation is going to be in a separate later post. I wanted to make a really clear distinction between "these are things that were said to me about Anthropic by people who have context" (which is, for the most part, people in favor of Anthropic's strategy), and my own personal interpretation and opinion on whether Anthropic's work is net positive, which is filtered through my worldview and which I think most people at Anthropic would disagree with.
Part two is more critical, which means I want to write about it with a lot of effort and care, so I expect I'll put it up in a week or two.
+1. I think this framing is more accurate than the current first paragraph (which, in my reading of it, seems to promise a more balanced and comprehensive analysis).
Edited first line, which hopefully clarifies this better.
It does! I think I'd make it more explicit, though, that the post focuses on the views/opinions of people at Anthropic. Maybe something like this (new text in bold):
I'm looking forward to part 2.