I’m going to try to make sure that my lifestyle and financial commitments continue to make me very financially comfortable both with leaving Anthropic, and with Anthropic’s equity (and also: the AI industry more broadly – I already hold various public AI-correlated stocks) losing value, but I recognize some ongoing risk of distorting incentives, here.
Why do you feel comfortable taking equity? It seems to me that one of the most basic precautions one ought ideally to take when accepting a job like this (e.g. evaluating Claude's character/constitution/spec) is to ensure you won't personally stand to lose huge sums of money should your evaluation suggest further training or deployment is unsafe.
(You mention already holding AI-correlated stocks—I do also think it would be ideal if folks with influence over risk assessment at AGI companies divested from these generally, though I realize this is difficult given how entangled they are with the market as a whole. But I'd expect AGI company staff typically have much more influence over their own company's value than that of others, so the COI seems much more extreme).
Speaking for myself as someone who works at Anthropic and holds equity: I think I just bite the bullet that this doesn't affect my decisionmaking that much and the benefits of directing the resources from that equity to good ends are worth it.
(I did think somewhat seriously about finding a way to irrevocably commit all of my equity to donations, or to fully avoid taking possession of it, but mainly for the signaling benefits of there being an employee who was legibly not biased in this particular way in case that was useful when things got crazy; I don't think it would have done much on the object level.)
Some reasons I think this is basically not a concern for me personally:
Others might vary a lot in how they orient to such things, though; I don't claim this is universal.
"Empirically when I advocate internally for things that would be commercially costly to Anthropic I don't notice this weighing on my decisionmaking basically at all, like I'm not sure I've literally ever thought about it in that setting?"
With respect, one of the dangers of being a flawed human is the fact that you aren't aware of every factor that influences your decision making.
I'm not sure that a lack of consciously thinking about financial loss/gain is good empirical evidence that it isn't affecting your choices.
Yep, I agree that's a risk, and one that should seem fairly plausible to external readers. (This is why I included other bullet points besides that one.) I'm not sure I can offer something compelling over text that other readers will find convincing, but I do think I'm in a pretty epistemically justified state here even if I don't think you should think that based on what you know of me.
And TBC, I'm not saying I'm unbiased! I think I am biased in a ton of ways - my social environment, possession of a stable high-status job, not wanting to say something accidentally wrong or hurting people's feelings, inner ring dynamics of being in the know about things, etc are all ways I think my epistemics face pressure here - but I feel quite sure that "the value of my equity goes down if Anthropic is less commercially successful" contributes a tiny tiny fraction to that state of affairs. You're well within your rights to not believe me, though.
This is a bit of a random-ass take, but, I think I care more about Joe not taking equity than you not taking equity, because I think Joe is more likely to be a person where it ends up important that he legibly have as little COI as possible (this is maybe making up a bunch of stuff about Joe's future role in the world, but, it's where my Joe headcanon is at).
From a pure signaling perspective (the “legibly” part of “legibly have as little COI as possible”) there’s also a counter-consideration: if someone says that there’s danger, and calls for prioritizing safety, that might be even more credible if that’s going against their financial motivations.
I don’t think this matters much for company-external comms. There, I think it’s better to just be as legibly free of COIs as possible, because listeners struggle to tell what’s actually in the company’s best interests. (I might once have thought differently, but empirically “they just say that superintelligence might cause extinction because that’s good for business” is a very common take.)
But for company-internal comms, I can imagine that someone would be more persuasive if they could say “look, I know this isn’t good for your equity, it’s not good for mine either. We’re in the same boat. But we gotta do what’s right.”
Agreed - I do think the case for doing this for signaling reasons is stronger for Joe and I think it's plausible he should have avoided this for that reason. I just don't think it's clear that it would be particularly helpful on the object level for his epistemics, which is what I took the parent comment to be saying.
I've made a legally binding pledge to allocate half of it to 501(c)(3) charities, the maximum that my employer's donation match covers; I expect to donate the majority of the remainder but have had no opportunities to liquidate any of it yet.
Thanks, that's good to hear. What form does the pledge take? Do you have a DAF that contains half your shares? When do you think the next liquidation opportunity might be? (I guess you weren't eligible for the one in May[1]?)
I'm disappointed that no one (EA-ish or otherwise) seems to have done anything interesting with that liquidation opportunity.
The details are complicated, vary a lot person-to-person, and I'm not sure which are OK to share publicly; the TLDR is that relatively early employees have a 3:1 match on up to 50% of their equity, and later employees a 1:1 match on up to 25%.
I believe that many people eligible for earlier liquidation opportunities used the proceeds from said liquidation to exercise additional stock options, because various tax considerations mean that doing so ends up being extremely leveraged for one's donation potential in the future (at least if one expects the value of said options to increase over time); I expect that most people into doing interesting impact-maximizing things with their money took this route, which doesn't produce much in the way of observable consequences right now.
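Purely to illustrate the arithmetic being described above, here is a minimal sketch with made-up numbers. The real match terms, tax treatment, and liquidity mechanics are more complicated and vary person-to-person; the `total_to_charity` helper and all dollar figures are hypothetical.

```python
# Illustrative sketch only: real match terms, tax treatment, and liquidity
# mechanics are more complicated and vary person-to-person. All figures here
# are made up.

def total_to_charity(equity_value: float, match_ratio: float, cap_fraction: float) -> float:
    """Amount reaching charity if the employee donates the maximum matched share.

    equity_value: value of the employee's vested equity
    match_ratio:  employer dollars added per dollar donated (3.0 means 3:1)
    cap_fraction: fraction of equity eligible for the match
    """
    donated = equity_value * cap_fraction
    return donated + donated * match_ratio

# Hypothetical early employee: $1M of equity, 3:1 match on up to 50%.
#   $500k donated + $1.5M match = $2.0M to charity.
print(f"${total_to_charity(1_000_000, 3.0, 0.50):,.0f}")

# Hypothetical later employee: $1M of equity, 1:1 match on up to 25%.
#   $250k donated + $250k match = $0.5M to charity.
print(f"${total_to_charity(1_000_000, 1.0, 0.25):,.0f}")

# The "exercise more options" route mentioned above: using, say, $100k of
# liquidation proceeds to exercise options can yield equity worth several
# times that if the shares appreciate, and that equity can itself be donated
# (and matched) later; hence the claim that this route is leveraged for
# future donation potential, at the cost of little observable giving today.
```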
Interesting. I really hope that some of them do something, soon. Time is fast running out. There's no point being a rich philanthropist (or rich, or a philanthropist) if the world gets destroyed before you deploy your resources.
Feels like something has gone wrong well before this point if one cares more about money than the survival of the human race.
If a man's judgement is really swayable by equity, one can't help but wonder whether he is the right man for the job in the first place.
Sure, but humanity currently has so little ability to measure or mitigate AI risk that I doubt it will be obvious in any given case that the survival of the human race is at stake, or that any given action would help. And I think even honorable humans tend to be vulnerable to rationalization amidst such ambiguity, which (as I model it) is why society generally prefers that people in positions of substantial power not have extreme conflicts of interest.
In a previous discussion about this, an argument mentioned was "having all your friends and colleagues believe in a thing is probably more epistemically compromising than the equity."
Which seems maybe true. But, I update in the other direction of "you shouldn't take equity, and, also, you should have some explicit plan for dealing with the biases of 'the people I spend the most time with think this.'"
(This also applies to AI pessimists, to be clear, but I think it's reasonable to hold people extra accountable about it when they're working at a company whose product has double-digit odds of destroying the world.)
Yeah, certainly there are other possible forms of bias besides financial conflicts of interest; as you say, I think it's worth trying to avoid those too.
Hey Adam — thanks for this. I wrote about this kind of COI in the post, but your comment was a good nudge to think more seriously about my take here.
Basically, I care here about protecting two sorts of values. On the one hand, I do think the sort of COI you’re talking about is real. That is, insofar as people at AI companies who have influence over trade-offs the company makes between safety and commercial success hold equity, deciding in favor of safety will cause them to lose money — and potentially, for high-stakes decisions like dropping out of the race, a lot of money. This is true of people in safety-focused roles, but it’s true of other kinds of employees as well — and of course, especially true of leadership, who have both an outsized amount of equity and an outsized amount of influence. This sort of COI can be a source of epistemic bias (e.g. in safety evaluations of the type you’re focused on), but it can also just be a more straightforward misalignment where e.g. what’s best by the lights of an equity-holder might not be best for the world. I really don’t want my decision-making as an Anthropic employee to end up increasing existential risk from AI because of factors like this. And indeed, given that Anthropic’s stated mission is (roughly) to do what’s best for the world re: AI, in some sense it’s in the job description of every employee to make sure this doesn’t happen.[1] And just refusing to hold equity would indeed go far on this front (though: you can also get similar biases without equity — e.g., maybe you don’t want to put your cash salary at risk by making waves, pissing people off, etc). And even setting aside the reality of a given level of bias/misalignment, there can be additional benefits to it being legible to the world that this kind of bias/misalignment isn’t present (though I am currently much more concerned about the reality of the bias/misalignment at stake).
On the other hand: the amount of money at stake is enough that I don’t turn it down casually. This is partly due to donation potential. Indeed, my current guess is that (depending ofc on values and other views) many EA-ish folks should be glad on net that various employees at Anthropic (including some in leadership, and some who work on safety) didn’t refuse to take any equity in the company, despite the COIs at stake — though it will indeed depend on how much they actually end up donating, and to where. But beyond donation potential, I’m also giving weight to factors like freedom, security, flexibility in future career choices, ability to self-fund my own projects, trading-money-for-time/energy/attention, helping my family, maybe having/raising kids, option value in an uncertain world, etc. Some of these mix in impartially altruistic considerations in important ways, but just to be clear: I care about both altruistic and non-altruistic values; I give weight to both in my decision-making in general; and I am giving both weight here.
I’ll also note a different source of uncertainty for me – namely, what policy/norm would be best to promote here overall. This is a separate question from what *I* should do personally, but insofar as part of the value of e.g. refusing the equity would be to promote some particular policy/norm, it matters to me how good the relevant policy/norm is – and in some cases here, I’m not sure. I’ve put a few more comments on this in a footnote.[2]
Currently, my best-guess plan for balancing these factors is to accept the equity and the corresponding COI for now (at least assuming that I stay at Anthropic long enough for the equity to vest[3]), but to keep thinking about it, learning more, and talking with colleagues and other friends/advisors as I actually dive into my role at Anthropic — and if I decide later that I should divest/give up the equity (or do something more complicated to mitigate this and other types of COI), to do that. This could be because my understanding of costs/benefits at stake in the current situation changes, or because the situation itself (e.g., my role/influence, or the AI situation more generally) changes.
Which isn't to say that people will live up to this.
There’s one question whether it would be good (and suitably realistic) for *no* employees at Anthropic, or at any frontier AI company, to hold equity, and to be paid in cash instead (thus eliminating this source of COI in general). There’s another question whether, at the least, safety-focused employees in particular should be paid in cash, as your post here seems to suggest, while making sure that their overall *level* of compensation remains comparable to that of non-safety-focused employees. Then, in the absence of either of these policies, there’s a different question whether safety-focused employees should be paid substantially less than non-safety-focused employees — a policy which would then reduce the attractiveness of these roles relative to e.g. capabilities roles, especially for people who are somewhat interested in safety but who also care a lot about traditional financial incentives as well (I think many strong AI researchers may be in this category, and increasingly so as safety issues become more prominent). And then there’s a final question of whether, in the absence of any changes to how AI companies currently operate, there should be informal pressure/expectation on safety-focused-employees to voluntarily take very large pay cuts (equity is a large fraction of total comp) relative to non-safety-focused employees for the sake of avoiding COI (one could also distribute this pressure/expectation more evenly across all employees at AI companies — but the focus on safety evaluators in your post is more narrow).
And I'll still have a COI in the meantime due to the equity I'd get if I stayed long enough.
I think it would be valuable to ask Anthropic's policy team (and/or leadership) if they agree with these statements (or adjacent statements), and if they have any plans to prioritize these kinds of statements in their communications with policymakers & the public.
It seems to me like a lot of Anthropic employees agree with these statements (or adjacent statements), yet this does not appear to be guiding Anthropic's official lobbying or policy activities.
I think that the technology being built by companies like Anthropic has a significant (read: double-digit) probability of destroying the entire future of the human species.
What’s more, I think no private company should be in a position to impose this kind of risk on every living human, and I support efforts to make sure that no company ever is.
Further: I do not think that Anthropic or any other actor has an adequate plan for building superintelligence in a manner that brings the risk of catastrophic, civilization-ending misalignment to a level that a prudent and coordinated civilization would accept.
More specifically: I do not believe that the object-level benefits of advanced AI[18] – serious though they may be – currently justify the level of existential risk at stake in any actor, Anthropic included, developing superintelligence given our current understanding of how to do so safely.[19]
But there is, indeed, a clear solution to this problem in principle: namely, to use various methods of capability restraint (coordination, enforcement, etc) to ensure that no one develops superintelligence until we have a radically better understanding of how to do so safely.
I have no idea how Anthropic's policy team makes decisions, but insofar as they value the input of employees on other teams, it seems plausible to me that Anthropic employees with these beliefs (or adjacent beliefs) could play a meaningful role by speaking out about these beliefs, requesting more information about Anthropic's policy engagements, and having more discussions with Anthropic policy/leadership teams about if/how Anthropic could prioritize these topics more in its policy work & public comms.
What’s more, I think no private company should be in a position to impose this kind of risk on every living human, and I support efforts to make sure that no company ever is.
I don't see your name on the Statement on Superintelligence when I search for it. Assuming you didn't sign it, why not? Do you disagree with it?
It seems like an effort to make sure that no company is in the position to impose this kind of risk on every living human:
We call for a prohibition on the development of superintelligence, not lifted before there is
- broad scientific consensus that it will be done safely and controllably, and
- strong public buy-in.
(Several Anthropic, OpenAI, and Google DeepMind employees signed.)
Um, I really like a lot of your writing. But I think the parts of your post that are in bold paint a very different picture to the parts that aren't in bold.
Echoing MichaelDickens' question on the EA Forum:
Indeed, I think it’s possible that there will, in fact, come a time when Anthropic should basically just unilaterally drop out of the race – pivoting, for example, entirely to a focus on advocacy and/or doing alignment research that it then makes publicly available.
Do you have a picture of what conditions would make it a good idea for Anthropic to drop out of the race?
Would also be interested to know how your thoughts compare with Holden's on a related question:
Rob Wiblin: I solicited questions for you on Twitter, and the most upvoted by a wide margin was: “Does Holden have guesses about under what observed capability thresholds Anthropic would halt development of AGI and call for other labs to do the same?” ...
Holden Karnofsky: Yeah. I will definitely not speak for Anthropic, and what I say is going to make no attempt to be consistent with the Responsible Scaling Policy. I’m just going to talk about what I would do if I were running an AI company that were in this kind of situation.
I think my main answer is just that it’s not a capability threshold; it’s other factors that would determine whether I would pause. First off, one question is: what are our mitigations and what is the alignment situation? We could have an arbitrarily capable AI, but if we believe we have a strong enough case that the AI is not trying to take over the world, and is going to be more helpful than harmful, then there’s not a good reason to pause.
On the other hand, if you have an AI that you believe could cause unlimited harm if it wanted to, and you’re seeing concrete signs that it’s malign — that it’s trying to do harm or that it wants to take over the world — I think that combination, speaking personally, would be enough to make me say, “I don’t want to be a part of this. Find something else to do. We’re going to do some safety research.”
Now, what about the grey area? What about if you have an AI that you think might be able to take over the world if it wanted to, and might want to, but you just don’t know and you aren’t sure? In that grey area, that’s where I think the really big question is: what can you accomplish by pausing? And this is just an inherently difficult political judgement.
I would ask my policy team. I would also ask people who know people at other companies, is there a path here? What happens if we announce to the world that we think this is not safe and we are stopping? Does this cause the world to stand up and say, “Oh my god, this is really serious! Anthropic’s being really credible here. We are going to create political will for serious regulation, or other companies are going to stop too.” Or does this just result in, “Those crazy safety doomers, those hypesters! That’s just ridiculous. This is insane. Ha ha. Let’s laugh at them and continue the race.” I think that would be the determining thing. I don’t think I can draw a line in the sand and say when our AI passes this eval.
So that’s my own personal opinion. Again, no attempt to speak for the company. I’m not speaking for it, and no attempt to be consistent with any policies that are written down.
What is/was your total monetary compensation at both jobs? Or if you don't want to say absolute numbers, what is the relative change in compensation?
OpenPhil's 5th-highest-compensated employee earned about $184k in 2023[1], which gives you a ceiling. Anthropic currently extends offers of ~$550k to mid-level[2] engineers and researchers. Joe's role might not be on the same ladder as other technical roles, but companies like Anthropic tend to pay pretty well across the board.
Edit: retracted first half of the claim, see this reply.
According to their public Form 990 filing.
I realize the job title says "Senior Software Engineer", but given the way their ladder is structured, I think mid-level is probably closer (though it's fuzzy).
I think this is false, because that figure covers only the Open Phil 501(c)(3); Open Phil also employs lots of people through an LLC, which doesn't file a 990.
Oh, alas. Thank you for the correction!
(I still expect OpenPhil the LLC to have been paying comparable amounts to its most-remunerated employees, but not so confidently that I would assert it outright.)
(Audio version, read by the author, here, or search for "Joe Carlsmith Audio" on your podcast app.)
Last Friday was my last day at Open Philanthropy. I’ll be starting a new role at Anthropic in mid-November, helping with the design of Claude’s character/constitution/spec. This post reflects on my time at Open Philanthropy, and it goes into more detail about my perspective and intentions with respect to Anthropic – including some of my takes on AI-safety-focused people working at frontier AI companies.
(I shared this post with Open Phil and Anthropic comms before publishing, but I’m speaking only for myself and not for Open Phil or Anthropic.)
I joined Open Philanthropy full-time at the beginning of 2019.[1] At the time, the organization was starting to spin up a new “Worldview Investigations” team, aimed at investigating and documenting key beliefs driving the organization’s cause prioritization – and with a special focus on how the organization should think about the potential impact at stake in work on transformatively powerful AI systems.[2] I joined (and eventually: led) the team devoted to this effort, and it’s been an amazing project to be a part of.
I remember, early on, one pithy summary of the hypotheses we were investigating: “AI soon, AI fast, AI big, AI bad.” Looking back, I think this was a prescient point of focus. And I’m proud of the research that our efforts produced. For example:
On AI big (that is: AI-driven growth and transformation): Tom Davidson’s report on AI-driven explosive growth; David Roodman’s report on modeling the long-run trajectory of GDP.[3]
On AI bad (that is: AI-driven catastrophic risk): my work on power-seeking AI, on scheming AIs, and on solving the alignment problem; Ajeya Cotra’s report on AI takeover; Tom Davidson and Lukas Finnveden’s work (with Rose Hadshar) on AI-enabled coups.[4]
Holden Karnofsky’s “Most Important Century” series also summarized and expanded on many threads in this research. And over the years, the worldview investigations team’s internal and external research has covered a variety of other topics relevant to a world transformed by advanced AI, and to the broader project of positively shaping the long-term future (e.g., Lukas Finnveden’s work on AI for epistemics, making deals with misaligned AIs, and honesty policies for interactions with AIs).[5]
In addition to the concrete research outputs, though, I’m also proud of the underlying aspiration of the worldview investigations project. I remember one early meeting about the team’s mandate. A key goal, we said, was for a thoughtful interlocutor who didn’t trust our staff or advisors to nevertheless be able to understand our big-picture views about AI, and to either be persuaded by them, or to tell us where we were going wrong. One frame we used for thinking about this was: creating something akin to GiveWell’s public write-ups about the cost-effectiveness of e.g. anti-malarial bednet distribution, except for AI – writeups, that is, that people who cared a lot about the issue could engage with in depth, and that others could at least “spot-check” as a source of signal. We recognized that most of Open Phil’s potential audience would not, in fact, engage in this way. But we were betting that it was important to the health of our own epistemics, and to the health of the broader epistemic ecosystem, that the possibility be available. And we wanted to make this bet even in the context of questions that were intimidatingly difficult, cross-disciplinary, pre-paradigmatic, and conceptually gnarly. We wanted rigor and transparency in attempting to arrive at, write down, and explain our best-guess answers regardless.
I feel extremely lucky to have had the chance to pursue this mandate so wholeheartedly over the past seven-ish years. Indeed: before joining Open Phil, I remember hoping, someday, that I would have a chance to really sit down and figure out what I thought about all this AI stuff. And I often meet people in the AI world who wish for similar time and space to try to get clear on their views on such a confusing topic. It’s been a privilege to actually have this kind of time and space – and to have it, what’s more, in an environment so supportive of genuine inquiry, in dialogue with such amazing colleagues, and with such a direct path from research to concrete impact.
Beyond my work on worldview investigations, I also feel grateful to Open Phil for doing so much to support my independent writing over the years. Most of the writing on my website wasn’t done on Open Phil time, but the time and energy I devoted to it has come with real trade-offs with respect to my work for Open Phil, and I deeply appreciate how accommodating the organization has been of these trade-offs. Indeed, in many respects, I feel like my time at Open Phil has given me the chance to pursue an even better version of the sort of philosophical career I dreamed of as an early graduate student in philosophy – one less constrained by the strictures of academia; one with more space for the spiritual, emotional, literary, and personal aspects of philosophical life; and one with more opportunity to focus directly on the topics that matter to me most. It’s a rare opportunity, and I feel very lucky to have had it.
I also feel lucky to have had such deep contact with the organization’s work more broadly. I remember an early project as a trial employee at Open Phil, investigating the impact of the organization’s early funding of corporate campaigns for cage-free eggs. I remember being floored by the sorts of numbers that were coming out of the analysis. It seemed strangely plausible that this organization had just played an important role in a moral achievement of massive scale, the significance of which was going largely unnoticed by the world. Even now, interacting with the farm animal welfare team at Open Phil, I try to remember: maybe, actually, these people are heroes. Maybe, indeed, this is what real heroism often looks like – quiet, humble, doing-the-work.
And I remember, too, a dinner with some of the staff working on grant-making in global health. I forget the specific grant under discussion. But I remember, in particular, the quality of gravity; the way the weight of the decision was being felt: real children who would live or die. I work mostly on risks at a very broad scale, and at that level of abstraction, it’s easy to lose emotional contact with the stakes. That dinner, for me, was a reminder – a reminder of the stakes of my own work; a reminder of where every dollar that went to my work wasn’t going; and a reminder, more broadly, of what it looks like to take real responsibility for decisions that matter.
It’s been an honor to work with people who care so deeply about making the world a better place; who are so empowered to pursue this mission; and who are so committed to seeing clearly the actual impact of efforts in this respect. To everyone who does this work, and who helps make Open Phil what it is: thank you. You are a reminder, to me, of what ethical and epistemic sincerity can make possible.
Open Phil has many flaws. But as far as I can tell, as an institution, it is a truly rare degree of good. I am proud to have been a part of it. It has meant a huge amount to me. And I will carry it with me.
Why am I going to Anthropic? Basically: I think working there might be the best way I can help the transition to advanced AI go well right now. I’m not confident Anthropic is the best place for this, but I think it’s plausible enough to be worth getting more direct data on.
Why might Anthropic be the best place for me to help the transition to advanced AI go well? Part of the case comes specifically from the opportunity to help design Claude’s character/constitution/spec – and in particular, to help Anthropic grapple with some of the challenges that could arise in this context as frontier models start to reach increasingly superhuman levels of capability. This sort of project, I believe, is a technical and philosophical challenge unprecedented in the history of our species; one with rapidly increasing stakes as AIs start to exert more and more influence in our society; and one that I think my background and skillset are especially suited to helping with.
That said, from the perspective of concerns about existential risk from AI misalignment in particular, I also want to acknowledge an important argument against the importance of this kind of work: namely, that most of the existential misalignment risk comes from AIs that are disobeying the model spec, rather than AIs that are obeying a model spec that nevertheless directs/permits them to do things like killing all humans or taking over the world. This sort of argument can take one of two forms. On the first, creating a model spec that robustly disallows killing/disempowering all of humanity is easy (e.g., “rule number 1: seriously, do not take over the world”) – the hard thing is building AIs that obey model specs at all. On the second, creating a model spec that robustly disallows killing/disempowering all of humanity (especially when subject to extreme optimization pressure) is also hard (cf traditional concerns about “King Midas Problems”), but we’re currently on track to fail at the earlier step of causing our AIs to obey model specs at all, and so we should focus our efforts there. I am more sympathetic to the first of these arguments (see e.g. my recent discussion of the role of good instructions in the broader project of AI alignment), but I give both some weight.
Despite these arguments, though, I think that helping Anthropic with the design of Claude’s model spec is worth trying. Key reasons for this include:
That said, even if I end up concluding that work on Claude’s character/constitution/spec isn’t a good fit for me, there is also a ton of other work happening at Anthropic that I might in principle be interested in contributing to.[6] And in general, both in the context of model spec work and elsewhere, one of the key draws of working at Anthropic, for me, is the opportunity to make more direct contact with the reality of the dynamics presently shaping frontier AI development – dynamics about which I’ve been writing from a greater distance for many years. For example: I am nearing the end of an essay series laying out my current picture of our best shot at solving the alignment problem (a series I am still aiming to finish). This picture, though, operates at a fairly high level of abstraction, and having written it up, I am interested in understanding better the practical reality of what it might look like to put it into practice, and of what key pieces of the puzzle my current picture might be missing; and also, in working more closely with some of the people most likely to actually implement the best available approaches to alignment. Indeed, in general (and even if I don’t ultimately stay at Anthropic) I expect to learn a ton from working there – and this fact plays an important role, for me, in the case for trying it.
All that said: I’m not sure that going to Anthropic is the right decision. A lot of my uncertainty has to do with the opportunity cost at stake in my own particular case, and whether I might do more valuable work elsewhere – and I’m not going to explain the details of my thinking on that front here. I do, though, want to say a few words about some more general concerns about AI-safety-focused people going to work at AI companies (and/or, at Anthropic in particular).
The first concern is that Anthropic as an institution is net negative for the world (one can imagine various reasons for thinking this, but a key one is that frontier AI companies, by default, are net negative for the world due to e.g. increasing race dynamics, accelerating timelines, and eventually developing/deploying AIs that risk destroying humanity – and Anthropic is no exception), and that one shouldn’t work at organizations like that. My current first-pass view on this front is that Anthropic is net positive in expectation for the world, centrally because I think (i) there are a variety of good and important actions that frontier AI companies are uniquely and/or unusually well-positioned to do, and that Anthropic is unusually likely to do (see footnote for examples[7]), and (ii) the value at stake in (i) currently looks to me like it outweighs the disvalue at stake in Anthropic’s marginal role in exacerbating race dynamics, accelerating timelines, contributing to risky forms of development/deployment, and so on.[8] For example: when I imagine the current AI landscape both with Anthropic and without Anthropic, I feel worse in the no-Anthropic case.[9] That said, the full set of possible arguments and counter-arguments at stake in assessing Anthropic’s expected impact is complicated, and even beyond the standard sorts of sign-uncertainty that afflict most action in the AI space, I am less sure than I’d like to be that Anthropic is net good.
That said: whether Anthropic as a whole is net good in expectation is also not, for me, a decisive crux for whether or not I should work there, provided that my working there, in particular, would be net good. Here, again, some of the ethics (and decision-theory) can get complicated (see footnote for a bit more discussion[10]). But at a high-level: I know multiple AI-safety-focused people who are working in the context of institutions that I think are much more likely to be net negative than Anthropic, but where it nevertheless seems to me that their doing so is both good in expectation and deontologically/decision-theoretically right. And I have a similar intuition when I think about various people I know working on AI safety at Anthropic itself (for example, people like Evan Hubinger and Ethan Perez). So my overall response to “Anthropic is net negative in expectation, and one shouldn’t work at orgs like that” is something like “it looks to me like Anthropic is net positive in expectation, but it’s also not a decisive crux.”
Another argument against working for Anthropic (or for any other AI lab) comes from approaches to AI safety that focus centrally/exclusively on what I’ve called “capability restraint” – that is, finding ways to restrain (and in the limit, indefinitely halt) frontier AI development, especially in a coordinated, global, and enforceable manner. And the best way to work on capability restraint, the thought goes, is from a position outside of frontier AI companies, rather than within them (this could be for a variety of reasons, but a key one would be: insofar as capability restraint is centrally about restraining the behavior of frontier AI companies, those companies will have strong incentives to resist it). Here, though, while I agree that capability restraint of some form is extremely important, I’m not convinced that people concerned about AI safety should be focusing on it exclusively. Rather, my view is that we should also be investing in learning how to make frontier AI systems safe (what I’ve called “safety progress”). This, after all, is what many versions of capability restraint are buying time for; and while there are visions of capability restraint that hope to not rely on even medium-term technical safety progress (e.g., very long or indefinite global pauses), I don’t think we should be betting the house on them. Also, though: even if I thought that capability restraint should be the central focus of AI safety work, I don’t think it’s clear that working outside of AI companies in this respect is always or even generally preferable to working within them – for example, because many of the “good actions” that AI labs are well-positioned to do (e.g. modeling good industry practices for evaluating danger, credibly sharing evidence of danger, supporting appropriate regulation) are ones that promote capability restraint.
Another argument against AI-safety-focused people working at Anthropic is that it’s already sucking up too much of the AI safety community’s talent. This concern can take various forms (e.g., group-think and intellectual homogeneity, messing with people’s willingness to speak out against Anthropic in particular, feeding bad status dynamics, concentrating talent that would be marginally more useful if more widely distributed, general over-exposure to a particular point of failure, etc). I do think that this is a real concern – and it’s a reason, I think, for safety-focused talent to think hard about the marginal usefulness of working at Anthropic in particular, relative to non-profits, governments, other AI companies, and so on.[11] My current sense is that the specific type of impact opportunity I’m pursuing with respect to model spec work is notably better, for me, at Anthropic in particular; and I do think the concentration of safety-concerned talent at Anthropic has some benefits, too (e.g., more colleagues with a similar focus). Beyond this, though, I’m mostly just biting the bullet on contributing yet further to the concentration of safety-focused people at Anthropic in particular.
Another concern about AI-safety-focused people working at AI companies is that it will restrict/distort their ability to accurately convey their views to the public – a concern that applies with more force to people like myself who are otherwise in the habit of speaking/writing publicly. This was a key concern for me in thinking about moving to Anthropic, and I spent a decent amount of time nailing down expectations re: comms ahead of time. The approach we settled on was that I’ll get Anthropic sign-off for public writing that is specifically about my work at Anthropic (e.g., work on Claude’s model spec), but other than that I can write freely, including about AI-related topics, provided that it’s clear I’m speaking only for myself and not for Anthropic or with the approval of Anthropic comms (though: I’m going to keep Anthropic comms informally updated about AI-related writing I’m planning to do). I currently feel pretty good about this approach. However, I acknowledge that it will still come with some frictions; that comms restrictions/distortions can arise from more informal/social pressures as well; and that working at an AI company, in general, can alter the way one’s takes on AI are received and scrutinized by the public, including in ways that disincentivize speaking about a subject at all. And of course, working at an AI company also involves access to genuinely confidential information (though, I don’t currently expect this to significantly impact my writing about broader issues in AI development and AI risk). Plus: one is just generally quite busy. I am hoping that despite all these factors, I still end up in a position to do roughly the amount and the type of public writing that I want to be doing given my other priorities and opportunities to contribute. If I end up feeling like this isn’t the case at Anthropic, though, then I will view this as a strong reason to leave.
A different concern about working at AI companies is that it will actually distort your views directly – for example, because the company itself will be a very specific, maybe-echo-chamber-y epistemic environment, and people in general are quite epistemically permeable. In this respect, I feel lucky to have had the chance to form and articulate publicly many of my core views about AI prior to joining an AI company, and I plan to make a conscious effort to stay in epistemic contact with people with a variety of perspectives on AI. But I also don’t want to commit, now, to learning nothing that moves my worldview closer to that of other staff at Anthropic, as I don’t believe I have strong enough reason, now, to mistrust my future conclusions in this respect. And of course, there are also concerns about direct financial incentives distorting one’s views/behavior – for example, ending up reliant on a particular sort of salary, or holding equity that makes you less inclined to push in directions that could harm an AI company’s commercial success (though: note that this latter concern also applies to more general AI-correlated investments, albeit in different and less direct ways[12]). I’m going to try to make sure that my lifestyle and financial commitments continue to make me very financially comfortable both with leaving Anthropic, and with Anthropic’s equity (and also: the AI industry more broadly – I already hold various public AI-correlated stocks) losing value, but I recognize some ongoing risk of distorting incentives, here.
A final concern about AI safety people working for AI companies is that their doing so will signal an inaccurate degree of endorsement of the company’s behavior, thereby promoting wrongful amounts of trust in the company and its commitment to safety. Perhaps some of this is inevitable in a noisy epistemic environment, but part of why I’m writing this post is in an effort to at least make it easier for those who care to understand the degree of endorsement that my choice to work at Anthropic reflects. And to be clear: there is in fact some signal here. That is: I feel more comfortable going to work at Anthropic than I would working at some of its competitors, specifically because I feel better about Anthropic’s attitudes towards safety and its alignment with my views and values more generally. That said: it’s not the case that I endorse all of Anthropic’s past behavior or stated views, nor do I expect to do so going forward. For example: my current impression is that relative to some kind of median Anthropic view, both amongst the leadership and the overall staff, I am substantially more worried about classic existential risk from misalignment; I expect this disagreement (along with other potential differences in worldview) to also lead to differences in how much I’d emphasize misalignment risk relative to other threats, like AI-powered authoritarianism (though: I care about that threat, too); and while I don’t know the details of Anthropic’s policy advocacy, I think it’s plausible that I would be pushing harder in favor of various forms of AI regulation, and/or would’ve pushed harder in the past, and that I would be more vocal and explicit about risks from loss of control more generally (though I think some of the considerations here get complicated[13]). For those interested, I’ve also included a footnote with some quick takes on some more specific Anthropic-related public controversies/criticisms from the AI safety community over the years – e.g., about pushing the frontier, revising the Responsible Scaling Policy, secret non-disparagement agreements, epistemic culture, and accelerating capabilities – though I don’t claim to have thought about them each in detail.[14] And in general, I’m not going to see myself as needing to defend Anthropic’s conduct and stated views going forwards (though: I’m also not going to see it as my duty to speak out every time Anthropic does or says something I disagree with).
Also, in case there is any unclarity about this despite all my public writing on the topic (and of course speaking only for myself and not for Anthropic): I think that the technology being built by companies like Anthropic has a significant (read: double-digit) probability of destroying the entire future of the human species. What’s more, I do not think that Anthropic is at all immune from the sorts of concerns that apply to other companies building this technology – and in particular, concerns about race dynamics and other incentives leading to catastrophically dangerous forms of AI development. This means that I think Anthropic itself has a serious chance of causing or playing an important role in the extinction or full-scale disempowerment of humanity – and for all the good intentions of Anthropic’s leadership and employees, I think everyone who chooses to work there should face this fact directly.[15] What’s more, I think no private company should be in a position to impose this kind of risk on every living human, and I support efforts to make sure that no company ever is.[16]
Further: I do not think that Anthropic or any other actor has an adequate plan for building superintelligence in a manner that brings the risk of catastrophic, civilization-ending misalignment to a level that a prudent and coordinated civilization would accept.[17] I say this as someone who has spent a good portion of the past year trying to think through and write up what I see as the most promising plan in this respect – namely, the plan (or perhaps, the “concept of a plan”) described here. I think this plan is quite a bit more promising than some of its prominent critics do. But it is nowhere near good enough, and thinking it through in such detail has increased my pessimism about the situation. Why? Well, in brief: the plan is to either get lucky, or to get the AIs to solve the problem for us. Lucky, here, means that it turns out that we don’t need to rapidly make significant advances in our scientific understanding in order to learn how to adequately align and control superintelligent agents that would otherwise be in a position to disempower humanity – luck that, for various reasons, I really don’t think we can count on. And absent such luck, as far as I can tell, our best hope is to try to use less-than-superintelligent AIs – with which we will have relatively little experience, whose labor and behavior might have all sorts of faults and problems, whose output we will increasingly struggle to evaluate directly, and which might themselves be actively working to undermine our understanding and control – to rapidly make huge amounts of scientific progress in a novel domain that does not allow for empirical iteration on safety-critical failures, all in the midst of unprecedented commercial and geopolitical pressures. True, some combination of “getting lucky” and “getting AI help” might be enough for us to make it through. But we should be trying extremely hard not to bet the lives of every human and the entire future of our civilization on this. And as far as I can tell, any actor on track to build superintelligence, Anthropic included, is currently on track to make either this kind of bet, or something worse.
More specifically: I do not believe that the object-level benefits of advanced AI[18] – serious though they may be – currently justify the level of existential risk at stake in any actor, Anthropic included, developing superintelligence given our current understanding of how to do so safely.[19] Rather, I think the only viable justifications for trying to develop superintelligence appeal to the possibility that someone else will develop it anyways instead.[20] But there is, indeed, a clear solution to this problem in principle: namely, to use various methods of capability restraint (coordination, enforcement, etc) to ensure that no one develops superintelligence until we have a radically better understanding of how to do so safely. I think it’s a complicated question how to act in the absence of this kind of global capability restraint; complicated, too, how to prioritize efforts to cause this kind of restraint vs. improving the situation in other ways; and complicated, as well, how to mitigate other risks that this kind of restraint could exacerbate (e.g., extreme concentrations of power). But I support the good version of this kind of capability restraint regardless, and while it’s not the current focus of my work, I aspire to do my part to help make it possible.
All this is to say: I think that in a wiser, more prudent, and more coordinated world, no company currently aiming to develop superintelligence – Anthropic included – would be allowed to do so given the state of current knowledge. But this isn’t the same as thinking that in the actual world, Anthropic itself should unilaterally shut down;[21] and still less, that no one concerned about AI safety should work there. I do believe, though, that Anthropic should be ready to support and participate in the right sorts of efforts to ensure that no one builds superintelligence until we have a vastly better understanding of how to do so safely. And it implies, too, that even in the absence of any such successful effort, Anthropic should be extremely vigilant about the marginal risk of existential catastrophe that its work creates. Indeed, I think it’s possible that there will, in fact, come a time when Anthropic should basically just unilaterally drop out of the race – pivoting, for example, entirely to a focus on advocacy and/or doing alignment research that it then makes publicly available. And I wish I were more confident that in circumstances where this is the right choice, Anthropic will do it despite all the commercial and institutional momentum to the contrary.
I say all this so as to be explicit about what my choice to work at Anthropic does and doesn’t mean about my takes on the organization itself, the broader AI safety situation, and the ethical dynamics at stake in AI-safety-focused people going to work at AI companies. That said: it’s possible that my views in this respect will evolve over time, and I aspire to let them do so without defensiveness or attachment.[22] And if, as a result, I end up concluding that working at Anthropic is a mistake, I aspire to simply admit that I messed up, and to leave.[23]
In the meantime: I’m going to go and see if I can help Anthropic design Claude’s model spec in good ways.[24] Often, starting a new role like this is exciting – and a part of me is indeed excited. Another part, though, feels heavier. When I think ahead to the kind of work that this role involves, especially in the context of increasingly dangerous and superhuman AI agents, I have a feeling like: this is not something that we are ready to do. This is not a game humanity is ready to play. A lot of this concern comes from intersections with the sorts of misalignment issues I discussed above. But the AI moral patienthood piece looms large for me as well, as do the broader ethical and political questions at stake in our choices about what sorts of powerful AI agents to bring into this world, and about who has what sort of say in those decisions. I’ve written, previously, about the sort of otherness at stake in these new minds we are creating; and about the ethical issues at stake in “designing” their values and character. I hope that the stakes are lower than this; that AI is, at least for the near-term, something more “normal.”[25] But what if it actually isn’t? In that case, it seems to me, we are moving far too fast, with far too little grip on what we are doing.
I also did a three-month trial period before that.
Earlier work at Open Phil, like Luke Muehlhauser’s report on consciousness and moral patienthood, can also be viewed as part of a similar aspiration – though, less officially codified at the time.
Roodman wasn’t working officially with the worldview investigations team, but this report was spurred by a similar impulse within the organization.
The AI-enabled coups work was eventually published via Forethought, where Tom went to work in early 2025, but much of the initial ideation occurred at Open Phil.
Some of these were published after Lukas left Open Phil for Redwood Research in summer of this year, but most of the initial ideation occurred during his time at Open Phil. See also Lukas Finnveden’s list here for a sampling of other topics we considered or investigated.
For example, on threat modeling, safety cases, model welfare, AI behavioral science, automated alignment research (especially conceptual alignment research), and automating other forms of philosophical/conceptual reflection.
Good actions here include: modeling and pushing for good industry norms/practices/etc, conducting good alignment research on frontier models and sharing the results as public good, studying and sharing demonstrations of scary model behaviors, pivoting to doing a ton of automated alignment research at the right time, advocating for the right type of regulations and pauses, understanding the technical situation in detail and sharing this information with the public and with relevant decision-makers, freaking out at the right time and in the right way (if appropriate), generally pushing AI development in good/wise directions, etc. That said, I am wary of impact stories that rely on Anthropic taking actions like these when doing so will come at significant (and especially: crippling) costs to its commercial success.
I also think that some parts of the AI safety community have in the past been overly purist/deontological/fastidious about the possibility of safety-focused work accelerating AI capabilities development, but this is a somewhat separate discussion, and I do think there are arguments on both sides.
Though: it’s important, in considering a thought experiment like this, to try to imagine what all of Anthropic’s current staff might be doing instead.
At a high level, from a consequentialist perspective, the most central reason not to work at a net negative institution is that to the first approximation, you should expect to be an additional multiplier/strengthener of whatever vector that institution represents. So: if that vector is net negative, then you should expect to be net negative. But this consideration, famously, can be outweighed by ways in which the overall vector of your work in particular can be pushing in a positive direction – though of course, one needs to look at that case by case, and to adjust for biases, uncertainties, time-worn heuristics, and so on. Even if you grant that it’s consequentialist-good to work at a net-negative institution, though, there remains the further question whether it’s deontologically permissible (and/or, compatible with a more sophisticated decision-theoretic approach to consequentialism – i.e., one which directs you to incorporate possible acausal correlations between your choice and the choices of others, which directs you to act in line with some broader policy you would’ve decided on from some more ignorant epistemic position, and so on – see here for more on my takes on decision theories of this kind). I won’t try to litigate this overall calculus in detail here. But as I discuss in the main text, I have the reasonably strong intuition that it is both good and deontologically/decision-theoretically right for at least some of the people I know who are working at AI companies (and also, at other institutions that I think more likely to be net negative than Anthropic) to do so. And if such an intuition is reliable, this means that at the least, “Anthropic is net negative, and one shouldn’t work at institutions like that” isn’t enough of an argument on its own.
It’s also one of the arguments for thinking that Anthropic might be net negative, and a reason that thought experiments like “imagine the current landscape without Anthropic” might mislead.
In particular, actually being at an AI company – and especially, in a position of influence over its safety-relevant decision-making – puts you in a position to much more directly affect the trade-offs it makes with respect to safety vs. the value of its equity in particular.
For example: insofar as Anthropic’s technical takes about the risk of misalignment are unusually credible given its position as an industry leader, I think it is in fact important for Anthropic to spend its “crying danger” points wisely.
Briefly:
At least assuming they place significant probability on existential catastrophe from advanced AI in general, which I also think they should.
I also think that in an ideal world, no single government or multi-lateral project would ever be in this position, but it’s less clear that this is a feasible policy goal, at least in worlds where superintelligent AIs ever get developed at all.
Here I am assuming some constraints on the realism of the plan in question. And I’m more confident about this if we make further assumptions about the degree to which the civilization in question cares about its long-term future in addition to the purely near-term.
By object-level benefits, I mean things like medical benefits, economic benefits, etc – and not the sorts of benefits that are centrally beneficial because of how they interact with the fact that other actors might build superintelligence as well.
I think this is likely true even if you are entirely selfish, and/or if you only care about the near-term benefits and harms (e.g., the direct risk of death/disempowerment for present-day humans, vs. the potential benefits for present-day humans), because these near-term goals would likely be served better by delaying superintelligence at least a few years in order to improve our safety understanding. But I think it is especially true if, like me, you care a lot about the long-term future of human civilization as well.
To be clear, it is also extremely possible to give bad justifications of this form – for example, “other people will build it anyways, and I want to be part of the action.”
I think this is true even from a more complicated decision-theoretic perspective, which views the AI race as akin to a prisoner’s dilemma that all participants should coordinate to avoid, and which might therefore direct Anthropic to act in line with the policy it wants all participants to obey. The problem with this argument is that some actors in the race (and some potential entrants to it) profess beliefs, values, and intentions that suggest they would be unwilling to participate even in a coordinated policy of avoiding the race – i.e., they plan to charge ahead regardless of what anyone else does. And in such a context, even from a fancier decision-theoretic perspective that aspires to act in line with the policy you hope that everyone whose decision-procedure is suitably correlated with your own will adopt, the “I’ll just charge ahead regardless” actors aren’t suitably correlated with you and hence aren’t suitably influence-able. (Perhaps some decision-theories would direct you to act in accordance with the policy that these actors would adopt if they had better/more-idealized views/intentions, but this seems to me less natural as a first-pass approach.)
Though: there are limits to the energy I’m going to devote to re-litigating the issue.
Though per my comments about opportunity cost above, I think the most likely reason I’d leave Anthropic has to do with the possibility that I could be doing better work elsewhere, rather than something about the ethics of working at a company developing advanced AI in particular.
And/or, to see if I can be suitably helpful elsewhere.
I do think that eventually, realizing anywhere near the full potential of human civilization will require access to advanced AI or something equivalently capable.