I’m going to try to make sure that my lifestyle and financial commitments continue to make me very financially comfortable both with leaving Anthropic, and with Anthropic’s equity (and also: the AI industry more broadly – I already hold various public AI-correlated stocks) losing value, but I recognize some ongoing risk of distorting incentives, here.
Why do you feel comfortable taking equity? It seems to me that one of the most basic precautions one ought ideally to take when accepting a job like this (e.g. evaluating Claude's character/constitution/spec) is to ensure you won't personally stand to lose huge sums of money should your evaluation suggest further training or deployment is unsafe.
(You mention already holding AI-correlated stocks—I do also think it would be ideal if folks with influence over risk assessment at AGI companies divested from these generally, though I realize this is difficult given how entangled they are with the market as a whole. But I'd expect AGI company staff typically have much more influence over their own company's value than that of others, so the COI seems much more extreme).
Speaking for myself as someone who works at Anthropic and holds equity: I think I just bite the bullet that this doesn't affect my decisionmaking that much and the benefits of directing the resources from that equity to good ends are worth it.
(I did think somewhat seriously about finding a way to irrevocably commit all of my equity to donations, or to fully avoid taking possession of it, but mainly for the signaling benefits of there being an employee who was legibly not biased in this particular way in case that was useful when things got crazy; I don't think it would have done much on the object level.)
Some reasons I think this is basically not a concern for me personally:
Others might vary a lot in how they orient to such things, though; I don't claim this is universal.
"Empirically when I advocate internally for things that would be commercially costly to Anthropic I don't notice this weighing on my decisionmaking basically at all, like I'm not sure I've literally ever thought about it in that setting?"
With respect, one of the dangers of being a flawed human is the fact that you aren't aware of every factor that influences your decision making.
I'm not sure that a lack of consciously thinking about financial loss/gain is good empirical evidence that it isn't affecting your choices.
Yep, I agree that's a risk, and one that should seem fairly plausible to external readers. (This is why I included other bullet points besides that one.) I'm not sure I can offer something compelling over text that other readers will find convincing, but I do think I'm in a pretty epistemically justified state here even if I don't think you should think that based on what you know of me.
And TBC, I'm not saying I'm unbiased! I think I am biased in a ton of ways - my social environment, possession of a stable high-status job, not wanting to say something accidentally wrong or hurting people's feelings, inner ring dynamics of being in the know about things, etc are all ways I think my epistemics face pressure here - but I feel quite sure that "the value of my equity goes down if Anthropic is less commercially successful" contributes a tiny tiny fraction to that state of affairs. You're well within your rights to not believe me, though.
This is a bit of a random-ass take, but, I think I care more about Joe not taking equity than you not taking equity, because I think Joe is more likely to be a person where it ends up important that he legibly have as little COI as possible (this is maybe making up a bunch of stuff about Joe's future role in the world, but, it's where my Joe headcanon is at).
From a pure signaling perspective (the “legibly” part of “legibly have as little COI as possible”) there’s also a counter-consideration: if someone says that there’s danger, and calls for prioritizing safety, that might be even more credible if that’s going against their financial motivations.
I don’t think this matters much for company-external comms. There, I think it’s better to just be as legibly free of COIs as possible, because listeners struggle to tell what’s actually in the company’s best interests. (I might once have thought differently, but empirically “they just say that superintelligence might cause extinction because that’s good for business” is a very common take.)
But for company-internal comms, I can imagine that someone would be more persuasive if they could say “look, I know this isn’t good for your equity, it’s not good for mine either. We’re in the same boat. But we gotta do what’s right.”
Agreed - I do think the case for doing this for signaling reasons is stronger for Joe and I think it's plausible he should have avoided this for that reason. I just don't think it's clear that it would be particularly helpful on the object level for his epistemics, which is what I took the parent comment to be saying.
I've made a legally binding pledge to allocate half of it to 501(c)(3) charities, the maximum that my employer's donation match covers; I expect to donate the majority of the remainder but have had no opportunities to liquidate any of it yet.
Thanks, that's good to hear. What form does the pledge take? Do you have a DAF that contains half your shares? When do you think the next liquidation opportunity might be? (I guess you weren't eligible for the one in May[1]?)
I'm disappointed that no one (EA-ish or otherwise) seems to have done anything interesting with that liquidation opportunity.
The details are complicated, vary a lot person-to-person, and I'm not sure which are OK to share publicly; the TLDR is that relatively early employees have a 3:1 match on up to 50% of their equity, and later employees a 1:1 match on up to 25%.
I believe that many people eligible for earlier liquidation opportunities used the proceeds from said liquidation to exercise additional stock options, because various tax considerations mean that doing so ends up being extremely leveraged for one's donation potential in the future (at least if one expects the value of said options to increase over time); I expect that most people into doing interesting impact-maximizing things with their money took this route, which doesn't produce much in the way of observable consequences right now.
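Purely to illustrate the arithmetic being described above, here is a minimal sketch with made-up numbers. The real match terms, tax treatment, and liquidity mechanics are more complicated and vary person-to-person; the `total_to_charity` helper and all dollar figures are hypothetical.

```python
# Illustrative sketch only: real match terms, tax treatment, and liquidity
# mechanics are more complicated and vary person-to-person. All figures here
# are made up.

def total_to_charity(equity_value: float, match_ratio: float, cap_fraction: float) -> float:
    """Amount reaching charity if the employee donates the maximum matched share.

    equity_value: value of the employee's vested equity
    match_ratio:  employer dollars added per dollar donated (3.0 means 3:1)
    cap_fraction: fraction of equity eligible for the match
    """
    donated = equity_value * cap_fraction
    return donated + donated * match_ratio

# Hypothetical early employee: $1M of equity, 3:1 match on up to 50%.
#   $500k donated + $1.5M match = $2.0M to charity.
print(f"${total_to_charity(1_000_000, 3.0, 0.50):,.0f}")

# Hypothetical later employee: $1M of equity, 1:1 match on up to 25%.
#   $250k donated + $250k match = $0.5M to charity.
print(f"${total_to_charity(1_000_000, 1.0, 0.25):,.0f}")

# The "exercise more options" route mentioned above: using, say, $100k of
# liquidation proceeds to exercise options can yield equity worth several
# times that if the shares appreciate, and that equity can itself be donated
# (and matched) later; hence the claim that this route is leveraged for
# future donation potential, at the cost of little observable giving today.
```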
Interesting. I really hope that some of them do something, soon. Time is fast running out. There's no point being a rich philanthropist (or rich, or a philanthropist) if the world gets destroyed before you deploy your resources.
Feels like something has gone wrong well before this point if one cares more about money than the survival of the human race.
If a man's judgement is really swayable by equity, one can't help but wonder whether he is the right man for the job in the first place.
Sure, but humanity currently has so little ability to measure or mitigate AI risk that I doubt it will be obvious in any given case that the survival of the human race is at stake, or that any given action would help. And I think even honorable humans tend to be vulnerable to rationalization amidst such ambiguity, which (as I model it) is why society generally prefers that people in positions of substantial power not have extreme conflicts of interest.
In a previous discussion about this, an argument mentioned was "having all your friends and colleagues believe in a thing is probably more epistemically compromising than the equity."
Which seems maybe true. But, I update in the other direction of "you shouldn't take equity, and, also, you should have some explicit plan for dealing with the biases of 'the people I spend the most time with think this.'"
(This also applies to AI pessimists, to be clear, but I think it's reasonable to hold people extra accountable about it when they're working at a company whose product has double-digit odds of destroying the world.)
Yeah, certainly there are other possible forms of bias besides financial conflicts of interest; as you say, I think it's worth trying to avoid those too.
Hey Adam — thanks for this. I wrote about this kind of COI in the post, but your comment was a good nudge to think more seriously about my take here.
Basically, I care here about protecting two sorts of values. On the one hand, I do think the sort of COI you’re talking about is real. That is, insofar as people at AI companies who have influence over trade-offs the company makes between safety and commercial success hold equity, deciding in favor of safety will cause them to lose money — and potentially, for high-stakes decisions like dropping out of the race, a lot of money. This is true of people in safety-focused roles, but it’s true of other kinds of employees as well — and of course, especially true of leadership, who have both an outsized amount of equity and an outsized amount of influence. This sort of COI can be a source of epistemic bias (e.g. in safety evaluations of the type you’re focused on), but it can also just be a more straightforward misalignment where e.g. what’s best by the lights of an equity-holder might not be best for the world. I really don’t want my decision-making as an Anthropic employee to end up increasing existential risk from AI because of factors like this. And indeed, given that Anthropic’s stated mission is (roughly) to do what’s best for the world re: AI, in some sense it’s in the job description of every employee to make sure this doesn’t happen.[1] And just refusing to hold equity would indeed go far on this front (though: you can also get similar biases without equity — e.g., maybe you don’t want to put your cash salary at risk by making waves, pissing people off, etc). And even setting aside the reality of a given level of bias/misalignment, there can be additional benefits to it being legible to the world that this kind of bias/misalignment isn’t present (though I am currently much more concerned about the reality of the bias/misalignment at stake).
On the other hand: the amount of money at stake is enough that I don’t turn it down casually. This is partly due to donation potential. Indeed, my current guess is that (depending ofc on values and other views) many EA-ish folks should be glad on net that various employees at Anthropic (including some in leadership, and some who work on safety) didn’t refuse to take any equity in the company, despite the COIs at stake — though it will indeed depend on how much they actually end up donating, and to where. But beyond donation potential, I’m also giving weight to factors like freedom, security, flexibility in future career choices, ability to self-fund my own projects, trading-money-for-time/energy/attention, helping my family, maybe having/raising kids, option value in an uncertain world, etc. Some of these mix in impartially altruistic considerations in important ways, but just to be clear: I care about both altruistic and non-altruistic values; I give weight to both in my decision-making in general; and I am giving both weight here.
I’ll also note a different source of uncertainty for me – namely, what policy/norm would be best to promote here overall. This is a separate question from what *I* should do personally, but insofar as part of the value of e.g. refusing the equity would be to promote some particular policy/norm, it matters to me how good the relevant policy/norm is – and in some cases here, I’m not sure. I’ve put a few more comments on this in a footnote.[2]
Currently, my best-guess plan for balancing these factors is to accept the equity and the corresponding COI for now (at least assuming that I stay at Anthropic long enough for the equity to vest[3]), but to keep thinking about it, learning more, and talking with colleagues and other friends/advisors as I actually dive into my role at Anthropic — and if I decide later that I should divest/give up the equity (or do something more complicated to mitigate this and other types of COI), to do that. This could be because my understanding of costs/benefits at stake in the current situation changes, or because the situation itself (e.g., my role/influence, or the AI situation more generally) changes.
Which isn't to say that people will live up to this.
There’s one question whether it would be good (and suitably realistic) for *no* employees at Anthropic, or at any frontier AI company, to hold equity, and to be paid in cash instead (thus eliminating this source of COI in general). There’s another question whether, at the least, safety-focused employees in particular should be paid in cash, as your post here seems to suggest, while making sure that their overall *level* of compensation remains comparable to that of non-safety-focused employees. Then, in the absence of either of these policies, there’s a different question whether safety-focused employees should be paid substantially less than non-safety-focused employees — a policy which would then reduce the attractiveness of these roles relative to e.g. capabilities roles, especially for people who are somewhat interested in safety but who also care a lot about traditional financial incentives as well (I think many strong AI researchers may be in this category, and increasingly so as safety issues become more prominent). And then there’s a final question of whether, in the absence of any changes to how AI companies currently operate, there should be informal pressure/expectation on safety-focused-employees to voluntarily take very large pay cuts (equity is a large fraction of total comp) relative to non-safety-focused employees for the sake of avoiding COI (one could also distribute this pressure/expectation more evenly across all employees at AI companies — but the focus on safety evaluators in your post is more narrow).
And I'll still have a COI in the meantime due to the equity I'd get if I stayed long enough.
I think it would be valuable to ask Anthropic's policy team (and/or leadership) if they agree with these statements (or adjacent statements), and if they have any plans to prioritize these kinds of statements in their communications with policymakers & the public.
It seems to me like a lot of Anthropic employees agree with these statements (or adjacent statements), yet this does not appear to be guiding Anthropic's official lobbying or policy activities.
I think that the technology being built by companies like Anthropic has a significant (read: double-digit) probability of destroying the entire future of the human species.
What’s more, I think no private company should be in a position to impose this kind of risk on every living human, and I support efforts to make sure that no company ever is.
Further: I do not think that Anthropic or any other actor has an adequate plan for building superintelligence in a manner that brings the risk of catastrophic, civilization-ending misalignment to a level that a prudent and coordinated civilization would accept.
More specifically: I do not believe that the object-level benefits of advanced AI[18] – serious though they may be – currently justify the level of existential risk at stake in any actor, Anthropic included, developing superintelligence given our current understanding of how to do so safely.[19]
But there is, indeed, a clear solution to this problem in principle: namely, to use various methods of capability restraint (coordination, enforcement, etc) to ensure that no one develops superintelligence until we have a radically better understanding of how to do so safely.
I have no idea how Anthropic's policy team makes decisions, but insofar as they value the input of employees on other teams, it seems plausible to me that Anthropic employees with these beliefs (or adjacent beliefs) could play a meaningful role by speaking out about these beliefs, requesting more information about Anthropic's policy engagements, and having more discussions with Anthropic policy/leadership teams about if/how Anthropic could prioritize these topics more in its policy work & public comms.
What’s more, I think no private company should be in a position to impose this kind of risk on every living human, and I support efforts to make sure that no company ever is.
I don't see your name on the Statement on Superintelligence when I search for it. Assuming you didn't sign it, why not? Do you disagree with it?
It seems like an effort to make sure that no company is in the position to impose this kind of risk on every living human:
We call for a prohibition on the development of superintelligence, not lifted before there is
- broad scientific consensus that it will be done safely and controllably, and
- strong public buy-in.
(Several Anthropic, OpenAI, and Google DeepMind employees signed.)
Um, I really like a lot of your writing. But I think the parts of your post that are in bold paint a very different picture to the parts that aren't in bold.
Echoing MichaelDickens' question on the EA Forum:
Indeed, I think it’s possible that there will, in fact, come a time when Anthropic should basically just unilaterally drop out of the race – pivoting, for example, entirely to a focus on advocacy and/or doing alignment research that it then makes publicly available.
Do you have a picture of what conditions would make it a good idea for Anthropic to drop out of the race?
Would also be interested to know how your thoughts compare with Holden's on a related question:
Rob Wiblin: I solicited questions for you on Twitter, and the most upvoted by a wide margin was: “Does Holden have guesses about under what observed capability thresholds Anthropic would halt development of AGI and call for other labs to do the same?” ...
Holden Karnofsky: Yeah. I will definitely not speak for Anthropic, and what I say is going to make no attempt to be consistent with the Responsible Scaling Policy. I’m just going to talk about what I would do if I were running an AI company that were in this kind of situation.
I think my main answer is just that it’s not a capability threshold; it’s other factors that would determine whether I would pause. First off, one question is: what are our mitigations and what is the alignment situation? We could have an arbitrarily capable AI, but if we believe we have a strong enough case that the AI is not trying to take over the world, and is going to be more helpful than harmful, then there’s not a good reason to pause.
On the other hand, if you have an AI that you believe could cause unlimited harm if it wanted to, and you’re seeing concrete signs that it’s malign — that it’s trying to do harm or that it wants to take over the world — I think that combination, speaking personally, would be enough to make me say, “I don’t want to be a part of this. Find something else to do. We’re going to do some safety research.”
Now, what about the grey area? What about if you have an AI that you think might be able to take over the world if it wanted to, and might want to, but you just don’t know and you aren’t sure? In that grey area, that’s where I think the really big question is: what can you accomplish by pausing? And this is just an inherently difficult political judgement.
I would ask my policy team. I would also ask people who know people at other companies, is there a path here? What happens if we announce to the world that we think this is not safe and we are stopping? Does this cause the world to stand up and say, “Oh my god, this is really serious! Anthropic’s being really credible here. We are going to create political will for serious regulation, or other companies are going to stop too.” Or does this just result in, “Those crazy safety doomers, those hypesters! That’s just ridiculous. This is insane. Ha ha. Let’s laugh at them and continue the race.” I think that would be the determining thing. I don’t think I can draw a line in the sand and say when our AI passes this eval.
So that’s my own personal opinion. Again, no attempt to speak for the company. I’m not speaking for it, and no attempt to be consistent with any policies that are written down.
What is/was your total monetary compensation at both jobs? Or if you don't want to say absolute numbers, what is the relative change in compensation?
OpenPhil's 5th-highest-compensated employee earned about $184k in 2023[1], which gives you a ceiling. Anthropic currently extends offers of ~$550k to mid-level[2] engineers and researchers. Joe's role might not be on the same ladder as other technical roles, but companies like Anthropic tend to pay pretty well across the board.
Edit: retracted first half of the claim, see this reply.
According to their public Form 990 filing.
I realize the job title says "Senior Software Engineer", but given the way their ladder is structured, I think mid-level is probably closer (though it's fuzzy).
I think this is false, because that figure covers only the Open Phil 501(c)(3); Open Phil also employs lots of people through an LLC, which doesn't file a 990.
Oh, alas. Thank you for the correction!
(I still expect OpenPhil the LLC to have been paying comparable amounts to its most-remunerated employees, but not so confidently that I would assert it outright.)
(Audio version, read by the author, here, or search for "Joe Carlsmith Audio" on your podcast app.)
Last Friday was my last day at Open Philanthropy. I’ll be starting a new role at Anthropic in mid-November, helping with the design of Claude’s character/constitution/spec. This post reflects on my time at Open Philanthropy, and it goes into more detail about my perspective and intentions with respect to Anthropic – including some of my takes on AI-safety-focused people working at frontier AI companies.
(I shared this post with Open Phil and Anthropic comms before publishing, but I’m speaking only for myself and not for Open Phil or Anthropic.)
I joined Open Philanthropy full-time at the beginning of 2019.[1] At the time, the organization was starting to spin up a new “Worldview Investigations” team, aimed at investigating and documenting key beliefs driving the organization’s cause prioritization – and with a special focus on how the organization should think about the potential impact at stake in work on transformatively powerful AI systems.[2] I joined (and eventually: led) the team devoted to this effort, and it’s been an amazing project to be a part of.
I remember, early on, one pithy summary of the hypotheses we were investigating: “AI soon, AI fast, AI big, AI bad.” Looking back, I think this was a prescient point of focus. And I’m proud of the research that our efforts produced. For example:
On AI big (that is: AI-driven growth and transformation): Tom Davidson’s report on AI-driven explosive growth; David Roodman’s report on modeling the long-run trajectory of GDP.[3]
On AI bad (that is: AI-driven catastrophic risk): my work on power-seeking AI, on scheming AIs, and on solving the alignment problem; Ajeya Cotra’s report on AI takeover; Tom Davidson and Lukas Finnveden’s work (with Rose Hadshar) on AI-enabled coups.[4]
Holden Karnofsky’s “Most Important Century” series also summarized and expanded on many threads in this research. And over the years, the worldview investigations team’s internal and external research has covered a variety of other topics relevant to a world transformed by advanced AI, and to the broader project of positively shaping the long-term future (e.g., Lukas Finnveden’s work on AI for epistemics, making deals with misaligned AIs, and honesty policies for interactions with AIs).[5]
In addition to the concrete research outputs, though, I’m also proud of the underlying aspiration of the worldview investigations project. I remember one early meeting about the team’s mandate. A key goal, we said, was for a thoughtful interlocutor who didn’t trust our staff or advisors to nevertheless be able to understand our big-picture views about AI, and to either be persuaded by them, or to tell us where we were going wrong. One frame we used for thinking about this was: creating something akin to GiveWell’s public write-ups about the cost-effectiveness of e.g. anti-malarial bednet distribution, except for AI – writeups, that is, that people who cared a lot about the issue could engage with in depth, and that others could at least “spot-check” as a source of signal. We recognized that most of Open Phil’s potential audience would not, in fact, engage in this way. But we were betting that it was important to the health of our own epistemics, and to the health of the broader epistemic ecosystem, that the possibility be available. And we wanted to make this bet even in the context of questions that were intimidatingly difficult, cross-disciplinary, pre-paradigmatic, and conceptually gnarly. We wanted rigor and transparency in attempting to arrive at, write down, and explain our best-guess answers regardless.
I feel extremely lucky to have had the chance to pursue this mandate so wholeheartedly over the past seven-ish years. Indeed: before joining Open Phil, I remember hoping, someday, that I would have a chance to really sit down and figure out what I thought about all this AI stuff. And I often meet people in the AI world who wish for similar time and space to try to get clear on their views on such a confusing topic. It’s been a privilege to actually have this kind of time and space – and to have it, what’s more, in an environment so supportive of genuine inquiry, in dialogue with such amazing colleagues, and with such a direct path from research to concrete impact.
Beyond my work on worldview investigations, I also feel grateful to Open Phil for doing so much to support my independent writing over the years. Most of the writing on my website wasn’t done on Open Phil time, but the time and energy I devoted to it has come with real trade-offs with respect to my work for Open Phil, and I deeply appreciate how accommodating the organization has been of these trade-offs. Indeed, in many respects, I feel like my time at Open Phil has given me the chance to pursue an even better version of the sort of philosophical career I dreamed of as an early graduate student in philosophy – one less constrained by the strictures of academia; one with more space for the spiritual, emotional, literary, and personal aspects of philosophical life; and one with more opportunity to focus directly on the topics that matter to me most. It’s a rare opportunity, and I feel very lucky to have had it.
I also feel lucky to have had such deep contact with the organization’s work more broadly. I remember an early project as a trial employee at Open Phil, investigating the impact of the organization’s early funding of corporate campaigns for cage-free eggs. I remember being floored by the sorts of numbers that were coming out of the analysis. It seemed strangely plausible that this organization had just played an important role in a moral achievement of massive scale, the significance of which was going largely unnoticed by the world. Even now, interacting with the farm animal welfare team at Open Phil, I try to remember: maybe, actually, these people are heroes. Maybe, indeed, this is what real heroism often looks like – quiet, humble, doing-the-work.
And I remember, too, a dinner with some of the staff working on grant-making in global health. I forget the specific grant under discussion. But I remember, in particular, the quality of gravity; the way the weight of the decision was being felt: real children who would live or die. I work mostly on risks at a very broad scale, and at that level of abstraction, it’s easy to lose emotional contact with the stakes. That dinner, for me, was a reminder – a reminder of the stakes of my own work; a reminder of where every dollar that went to my work wasn’t going; and a reminder, more broadly, of what it looks like to take real responsibility for decisions that matter.
It’s been an honor to work with people who care so deeply about making the world a better place; who are so empowered to pursue this mission; and who are so committed to seeing clearly the actual impact of efforts in this respect. To everyone who does this work, and who helps make Open Phil what it is: thank you. You are a reminder, to me, of what ethical and epistemic sincerity can make possible.
Open Phil has many flaws. But as far as I can tell, as an institution, it is a truly rare degree of good. I am proud to have been a part of it. It has meant a huge amount to me. And I will carry it with me.
Why am I going to Anthropic? Basically: I think working there might be the best way I can help the transition to advanced AI go well right now. I’m not confident Anthropic is the best place for this, but I think it’s plausible enough to be worth getting more direct data on.
Why might Anthropic be the best place for me to help the transition to advanced AI go well? Part of the case comes specifically from the opportunity to help design Claude’s character/constitution/spec – and in particular, to help Anthropic grapple with some of the challenges that could arise in this context as frontier models start to reach increasingly superhuman levels of capability. This sort of project, I believe, is a technical and philosophical challenge unprecedented in the history of our species; one with rapidly increasing stakes as AIs start to exert more and more influence in our society; and one that I think my background and skillset are especially suited to helping with.
That said, from the perspective of concerns about existential risk from AI misalignment in particular, I also want to acknowledge an important argument against the importance of this kind of work: namely, that most of the existential misalignment risk comes from AIs that are disobeying the model spec, rather than AIs that are obeying a model spec that nevertheless directs/permits them to do things like killing all humans or taking over the world. This sort of argument can take one of two forms. On the first, creating a model spec that robustly disallows killing/disempowering all of humanity is easy (e.g., “rule number 1: seriously, do not take over the world”) – the hard thing is building AIs that obey model specs at all. On the second, creating a model spec that robustly disallows killing/disempowering all of humanity (especially when subject to extreme optimization pressure) is also hard (cf traditional concerns about “King Midas Problems”), but we’re currently on track to fail at the earlier step of causing our AIs to obey model specs at all, and so we should focus our efforts there. I am more sympathetic to the first of these arguments (see e.g. my recent discussion of the role of good instructions in the broader project of AI alignment), but I give both some weight.
Despite these arguments, though, I think that helping Anthropic with the design of Claude’s model spec is worth trying. Key reasons for this include:
That said, even if I end up concluding that work on Claude’s character/constitution/spec isn’t a good fit for me, there is also a ton of other work happening at Anthropic that I might in principle be interested in contributing to.[6] And in general, both in the context of model spec work and elsewhere, one of the key draws of working at Anthropic, for me, is the opportunity to make more direct contact with the reality of the dynamics presently shaping frontier AI development – dynamics about which I’ve been writing from a greater distance for many years. For example: I am nearing the end of an essay series laying out my current picture of our best shot at solving the alignment problem (a series I am still aiming to finish). This picture, though, operates at a fairly high level of abstraction, and having written it up, I am interested in understanding better the practical reality of what it might look like to put it into practice, and of what key pieces of the puzzle my current picture might be missing; and also, in working more closely with some of the people most likely to actually implement the best available approaches to alignment. Indeed, in general (and even if I don’t ultimately stay at Anthropic) I expect to learn a ton from working there – and this fact plays an important role, for me, in the case for trying it.
All that said: I’m not sure that going to Anthropic is the right decision. A lot of my uncertainty has to do with the opportunity cost at stake in my own particular case, and whether I might do more valuable work elsewhere – and I’m not going to explain the details of my thinking on that front here. I do, though, want to say a few words about some more general concerns about AI-safety-focused people going to work at AI companies (and/or, at Anthropic in particular).
The first concern is that Anthropic as an institution is net negative for the world (one can imagine various reasons for thinking this, but a key one is that frontier AI companies, by default, are net negative for the world due to e.g. increasing race dynamics, accelerating timelines, and eventually developing/deploying AIs that risk destroying humanity – and Anthropic is no exception), and that one shouldn’t work at organizations like that. My current first-pass view on this front is that Anthropic is net positive in expectation for the world, centrally because I think (i) there are a variety of good and important actions that frontier AI companies are uniquely and/or unusually well-positioned to do, and that Anthropic is unusually likely to do (see footnote for examples[7]), and (ii) the value at stake in (i) currently looks to me like it outweighs the disvalue at stake in Anthropic’s marginal role in exacerbating race dynamics, accelerating timelines, contributing to risky forms of development/deployment, and so on.[8] For example: when I imagine the current AI landscape both with Anthropic and without Anthropic, I feel worse in the no-Anthropic case.[9] That said, the full set of possible arguments and counter-arguments at stake in assessing Anthropic’s expected impact is complicated, and even beyond the standard sorts of sign-uncertainty that afflict most action in the AI space, I am less sure than I’d like to be that Anthropic is net good.
That said: whether Anthropic as a whole is net good in expectation is also not, for me, a decisive crux for whether or not I should work there, provided that my working there, in particular, would be net good. Here, again, some of the ethics (and decision-theory) can get complicated (see footnote for a bit more discussion[10]). But at a high-level: I know multiple AI-safety-focused people who are working in the context of institutions that I think are much more likely to be net negative than Anthropic, but where it nevertheless seems to me that their doing so is both good in expectation and deontologically/decision-theoretically right. And I have a similar intuition when I think about various people I know working on AI safety at Anthropic itself (for example, people like Evan Hubinger and Ethan Perez). So my overall response to “Anthropic is net negative in expectation, and one shouldn’t work at orgs like that” is something like “it looks to me like Anthropic is net positive in expectation, but it’s also not a decisive crux.”
Another argument against working for Anthropic (or for any other AI lab) comes from approaches to AI safety that focus centrally/exclusively on what I’ve called “capability restraint” – that is, finding ways to restrain (and in the limit, indefinitely halt) frontier AI development, especially in a coordinated, global, and enforceable manner. And the best way to work on capability restraint, the thought goes, is from a position outside of frontier AI companies, rather than within them (this could be for a variety of reasons, but a key one would be: insofar as capability restraint is centrally about restraining the behavior of frontier AI companies, those companies will have strong incentives to resist it). Here, though, while I agree that capability restraint of some form is extremely important, I’m not convinced that people concerned about AI safety should be focusing on it exclusively. Rather, my view is that we should also be investing in learning how to make frontier AI systems safe (what I’ve called “safety progress”). This, after all, is what many versions of capability restraint are buying time for; and while there are visions of capability restraint that hope to not rely on even medium-term technical safety progress (e.g., very long or indefinite global pauses), I don’t think we should be betting the house on them. Also, though: even if I thought that capability restraint should be the central focus of AI safety work, I don’t think it’s clear that working outside of AI companies in this respect is always or even generally preferable to working within them – for example, because many of the “good actions” that AI labs are well-positioned to do (e.g. modeling good industry practices for evaluating danger, credibly sharing evidence of danger, supporting appropriate regulation) are ones that promote capability restraint.
Another argument against AI-safety-focused people working at Anthropic is that it’s already sucking up too much of the AI safety community’s talent. This concern can take various forms (e.g., group-think and intellectual homogeneity, messing with people’s willingness to speak out against Anthropic in particular, feeding bad status dynamics, concentrating talent that would be marginally more useful if more widely distributed, general over-exposure to a particular point of failure, etc). I do think that this is a real concern – and it’s a reason, I think, for safety-focused talent to think hard about the marginal usefulness of working at Anthropic in particular, relative to non-profits, governments, other AI companies, and so on.[11] My current sense is that the specific type of impact opportunity I’m pursuing with respect to model spec work is notably better, for me, at Anthropic in particular; and I do think the concentration of safety-concerned talent at Anthropic has some benefits, too (e.g., more colleagues with a similar focus). Beyond this, though, I’m mostly just biting the bullet on contributing yet further to the concentration of safety-focused people at Anthropic in particular.
Another concern about AI-safety-focused people working at AI companies is that it will restrict/distort their ability to accurately convey their views to the public – a concern that applies with more force to people like myself who are otherwise in the habit of speaking/writing publicly. This was a key concern for me in thinking about moving to Anthropic, and I spent a decent amount of time nailing down expectations re: comms ahead of time. The approach we settled on was that I’ll get Anthropic sign-off for public writing that is specifically about my work at Anthropic (e.g., work on Claude’s model spec), but other than that I can write freely, including about AI-related topics, provided that it’s clear I’m speaking only for myself and not for Anthropic or with the approval of Anthropic comms (though: I’m going to keep Anthropic comms informally updated about AI-related writing I’m planning to do). I currently feel pretty good about this approach. However, I acknowledge that it will still come with some frictions; that comms restrictions/distortions can arise from more informal/social pressures as well; and that working at an AI company, in general, can alter the way one’s takes on AI are received and scrutinized by the public, including in ways that disincentivize speaking about a subject at all. And of course, working at an AI company also involves access to genuinely confidential information (though, I don’t currently expect this to significantly impact my writing about broader issues in AI development and AI risk). Plus: one is just generally quite busy. I am hoping that despite all these factors, I still end up in a position to do roughly the amount and the type of public writing that I want to be doing given my other priorities and opportunities to contribute. If I end up feeling like this isn’t the case at Anthropic, though, then I will view this as a strong reason to leave.
A different concern about working at AI companies is that it will actually distort your views directly – for example, because the company itself will be a very specific, maybe-echo-chamber-y epistemic environment, and people in general are quite epistemically permeable. In this respect, I feel lucky to have had the chance to form and articulate publicly many of my core views about AI prior to joining an AI company, and I plan to make a conscious effort to stay in epistemic contact with people with a variety of perspectives on AI. But I also don’t want to commit, now, to learning nothing that moves my worldview closer to that of other staff at Anthropic, as I don’t believe I have strong enough reason, now, to mistrust my future conclusions in this respect. And of course, there are also concerns about direct financial incentives distorting one’s views/behavior – for example, ending up reliant on a particular sort of salary, or holding equity that makes you less inclined to push in directions that could harm an AI company’s commercial success (though: note that this latter concern also applies to more general AI-correlated investments, albeit in different and less direct ways[12]). I’m going to try to make sure that my lifestyle and financial commitments continue to make me very financially comfortable both with leaving Anthropic, and with Anthropic’s equity (and also: the AI industry more broadly – I already hold various public AI-correlated stocks) losing value, but I recognize some ongoing risk of distorting incentives, here.
A final concern about AI safety people working for AI companies is that their doing so will signal an inaccurate degree of endorsement of the company’s behavior, thereby promoting wrongful amounts of trust in the company and its commitment to safety. Perhaps some of this is inevitable in a noisy epistemic environment, but part of why I’m writing this post is in an effort to at least make it easier for those who care to understand the degree of endorsement that my choice to work at Anthropic reflects. And to be clear: there is in fact some signal here. That is: I feel more comfortable going to work at Anthropic than I would working at some of its competitors, specifically because I feel better about Anthropic’s attitudes towards safety and its alignment with my views and values more generally. That said: it’s not the case that I endorse all of Anthropic’s past behavior or stated views, nor do I expect to do so going forward. For example: my current impression is that relative to some kind of median Anthropic view, both amongst the leadership and the overall staff, I am substantially more worried about classic existential risk from misalignment; I expect this disagreement (along with other potential differences in worldview) to also lead to differences in how much I’d emphasize misalignment risk relative to other threats, like AI-powered authoritarianism (though: I care about that threat, too); and while I don’t know the details of Anthropic’s policy advocacy, I think it’s plausible that I would be pushing harder in favor of various forms of AI regulation, and/or would’ve pushed harder in the past, and that I would be more vocal and explicit about risks from loss of control more generally (though I think some of the considerations here get complicated[13]). For those interested, I’ve also included a footnote with some quick takes on some more specific Anthropic-related public controversies/criticisms from the AI safety community over the years – e.g., about pushing the frontier, revising the Responsible Scaling Policy, secret non-disparagement agreements, epistemic culture, and accelerating capabilities – though I don’t claim to have thought about them each in detail.[14] And in general, I’m not going to see myself as needing to defend Anthropic’s conduct and stated views going forwards (though: I’m also not going to see it as my duty to speak out every time Anthropic does or says something I disagree with).
Also, in case there is any unclarity about this despite all my public writing on the topic (and of course speaking only for myself and not for Anthropic): I think that the technology being built by companies like Anthropic has a significant (read: double-digit) probability of destroying the entire future of the human species. What’s more, I do not think that Anthropic is at all immune from the sorts of concerns that apply to other companies building this technology – and in particular, concerns about race dynamics and other incentives leading to catastrophically dangerous forms of AI development. This means that I think Anthropic itself has a serious chance of causing or playing an important role in the extinction or full-scale disempowerment of humanity – and for all the good intentions of Anthropic’s leadership and employees, I think everyone who chooses to work there should face this fact directly.[15] What’s more, I think no private company should be in a position to impose this kind of risk on every living human, and I support efforts to make sure that no company ever is.[16]
Further: I do not think that Anthropic or any other actor has an adequate plan for building superintelligence in a manner that brings the risk of catastrophic, civilization-ending misalignment to a level that a prudent and coordinated civilization would accept.[17] I say this as someone who has spent a good portion of the past year trying to think through and write up what I see as the most promising plan in this respect – namely, the plan (or perhaps, the “concept of a plan”) described here. I think this plan is quite a bit more promising than some of its prominent critics do. But it is nowhere near good enough, and thinking it through in such detail has increased my pessimism about the situation. Why? Well, in brief: the plan is to either get lucky, or to get the AIs to solve the problem for us. Lucky, here, means that it turns out that we don’t need to rapidly make significant advances in our scientific understanding in order to learn how to adequately align and control superintelligent agents that would otherwise be in a position to disempower humanity – luck that, for various reasons, I really don’t think we can count on. And absent such luck, as far as I can tell, our best hope is to try to use less-than-superintelligent AIs – with which we will have relatively little experience, whose labor and behavior might have all sorts of faults and problems, whose output we will increasingly struggle to evaluate directly, and which might themselves be actively working to undermine our understanding and control – to rapidly make huge amounts of scientific progress in a novel domain that does not allow for empirical iteration on safety-critical failures, all in the midst of unprecedented commercial and geopolitical pressures. True, some combination of “getting lucky” and “getting AI help” might be enough for us to make it through. But we should be trying extremely hard not to bet the lives of every human and the entire future of our civilization on this. And as far as I can tell, any actor on track to build superintelligence, Anthropic included, is currently on track to make either this kind of bet, or something worse.
More specifically: I do not believe that the object-level benefits of advanced AI[18] – serious though they may be – currently justify the level of existential risk at stake in any actor, Anthropic included, developing superintelligence given our current understanding of how to do so safely.[19] Rather, I think the only viable justifications for trying to develop superintelligence appeal to the possibility that someone else will develop it anyways instead.[20] But there is, indeed, a clear solution to this problem in principle: namely, to use various methods of capability restraint (coordination, enforcement, etc) to ensure that no one develops superintelligence until we have a radically better understanding of how to do so safely. I think it’s a complicated question how to act in the absence of this kind of global capability restraint; complicated, too, how to prioritize efforts to cause this kind of restraint vs. improving the situation in other ways; and complicated, as well, how to mitigate other risks that this kind of restraint could exacerbate (e.g., extreme concentrations of power). But I support the good version of this kind of capability restraint regardless, and while it’s not the current focus of my work, I aspire to do my part to help make it possible.
All this is to say: I think that in a wiser, more prudent, and more coordinated world, no company currently aiming to develop superintelligence – Anthropic included – would be allowed to do so given the state of current knowledge. But this isn’t the same as thinking that in the actual world, Anthropic itself should unilaterally shut down;[21] and still less, that no one concerned about AI safety should work there. I do believe, though, that Anthropic should be ready to support and participate in the right sorts of efforts to ensure that no one builds superintelligence until we have a vastly better understanding of how to do so safely. And it implies, too, that even in the absence of any such successful effort, Anthropic should be extremely vigilant about the marginal risk of existential catastrophe that its work creates. Indeed, I think it’s possible that there will, in fact, come a time when Anthropic should basically just unilaterally drop out of the race – pivoting, for example, entirely to a focus on advocacy and/or doing alignment research that it then makes publicly available. And I wish I were more confident that in circumstances where this is the right choice, Anthropic will do it despite all the commercial and institutional momentum to the contrary.
I say all this so as to be explicit about what my choice to work at Anthropic does and doesn’t mean about my takes on the organization itself, the broader AI safety situation, and the ethical dynamics at stake in AI-safety-focused people going to work at AI companies. That said: it’s possible that my views in this respect will evolve over time, and I aspire to let them do so without defensiveness or attachment.[22] And if, as a result, I end up concluding that working at Anthropic is a mistake, I aspire to simply admit that I messed up, and to leave.[23]
In the meantime: I’m going to go and see if I can help Anthropic design Claude’s model spec in good ways.[24] Often, starting a new role like this is exciting – and a part of me is indeed excited. Another part, though, feels heavier. When I think ahead to the kind of work that this role involves, especially in the context of increasingly dangerous and superhuman AI agents, I have a feeling like: this is not something that we are ready to do. This is not a game humanity is ready to play. A lot of this concern comes from intersections with the sorts of misalignment issues I discussed above. But the AI moral patienthood piece looms large for me as well, as do the broader ethical and political questions at stake in our choices about what sorts of powerful AI agents to bring into this world, and about who has what sort of say in those decisions. I’ve written, previously, about the sort of otherness at stake in these new minds we are creating; and about the ethical issues at stake in “designing” their values and character. I hope that the stakes are lower than this; that AI is, at least for the near-term, something more “normal.”[25] But what if it actually isn’t? In that case, it seems to me, we are moving far too fast, with far too little grip on what we are doing.
I also did a three-month trial period before that.
Earlier work at Open Phil, like Luke Muehlhauser’s report on consciousness and moral patienthood, can also be viewed as part of a similar aspiration – though, less officially codified at the time.
Roodman wasn’t working officially with the worldview investigations team, but this report was spurred by a similar impulse within the organization.
The AI-enabled coups work was eventually published via Forethought, where Tom went to work in early 2025, but much of the initial ideation occurred at Open Phil.
Some of these were published after Lukas left Open Phil for Redwood Research in summer of this year, but most of the initial ideation occurred during his time at Open Phil. See also Lukas Finnveden’s list here for a sampling of other topics we considered or investigated.
For example, on threat modeling, safety cases, model welfare, AI behavioral science, automated alignment research (especially conceptual alignment research), and automating other forms of philosophical/conceptual reflection.
Good actions here include: modeling and pushing for good industry norms/practices/etc, conducting good alignment research on frontier models and sharing the results as public good, studying and sharing demonstrations of scary model behaviors, pivoting to doing a ton of automated alignment research at the right time, advocating for the right type of regulations and pauses, understanding the technical situation in detail and sharing this information with the public and with relevant decision-makers, freaking out at the right time and in the right way (if appropriate), generally pushing AI development in good/wise directions, etc. That said, I am wary of impact stories that rely on Anthropic taking actions like these when doing so will come at significant (and especially: crippling) costs to its commercial success.
I also think that some parts of the AI safety community have in the past been overly purist/deontological/fastidious about the possibility of safety-focused work accelerating AI capabilities development, but this is a somewhat separate discussion, and I do think there are arguments on both sides.
Though: it’s important, in considering a thought experiment like this, to try to imagine what all of Anthropic’s current staff might be doing instead.
At a high level, from a consequentialist perspective, the most central reason not to work at a net negative institution is that to the first approximation, you should expect to be an additional multiplier/strengthener of whatever vector that institution represents. So: if that vector is net negative, then you should expect to be net negative. But this consideration, famously, can be outweighed by ways in which the overall vector of your work in particular can be pushing in a positive direction – though of course, one needs to look at that case by case, and to adjust for biases, uncertainties, time-worn heuristics, and so on. Even if you grant that it’s consequentialist-good to work at a net-negative institution, though, there remains the further question whether it’s deontologically permissible (and/or, compatible with a more sophisticated decision-theoretic approach to consequentialism – i.e., one which directs you to incorporate possible acausal correlations between your choice and the choices of others, which directs you to act in line with some broader policy you would’ve decided on from some more ignorant epistemic position, and so on – see here for more on my takes on decision theories of this kind). I won’t try to litigate this overall calculus in detail here. But as I discuss in the main text, I have the reasonably strong intuition that it is both good and deontologically/decision-theoretically right for at least some of the people I know who are working at AI companies (and also, at other institutions that I think more likely to be net negative than Anthropic) to do so. And if such an intuition is reliable, this means that at the least, “Anthropic is net negative, and one shouldn’t work at institutions like that” isn’t enough of an argument on its own.
It’s also one of the arguments for thinking that Anthropic might be net negative, and a reason that thought experiments like “imagine the current landscape without Anthropic” might mislead.
In particular, actually being at an AI company – and especially, in a position of influence over its safety-relevant decision-making – puts you in a position to much more directly affect the trade-offs it makes with respect to safety vs. the value of its equity in particular.
For example: insofar as Anthropic’s technical takes about the risk of misalignment are unusually credible given its position as an industry leader, I think it is in fact important for Anthropic to spend its “crying danger” points wisely.
Briefly:
At least assuming they place significant probability on existential catastrophe from advanced AI in general, which I also think they should.
I also think that in an ideal world, no single government or multi-lateral project would ever be in this position, but it’s less clear that this is a feasible policy goal, at least in worlds where superintelligent AIs ever get developed at all.
Here I am assuming some constraints on the realism of the plan in question. And I’m more confident about this if we make further assumptions about the degree to which the civilization in question cares about its long-term future in addition to the purely near-term.
By object-level benefits, I mean things like medical benefits, economic benefits, etc – and not the sorts of benefits that are centrally beneficial because of how they interact with the fact that other actors might build superintelligence as well.
I think this is likely true even if you are entirely selfish, and/or if you only care about the near-term benefits and harms (e.g., the direct risk of death/disempowerment for present-day humans, vs. the potential benefits for present-day humans), because these near-term goals would likely be served better by delaying superintelligence at least a few years in order to improve our safety understanding. But I think it is especially true if, like me, you care a lot about the long-term future of human civilization as well.
To be clear, it is also extremely possible to give bad justifications of this form – for example, “other people will build it anyways, and I want to be part of the action.”
I think this is true even from a more complicated decision-theoretic perspective, which views the AI race as akin to a prisoner’s dilemma that all participants should coordinate to avoid, and which might therefore direct Anthropic to act in line with the policy it wants all participants to obey. The problem with this argument is that some actors in the race (and some potential entrants to it) profess beliefs, values, and intentions that suggest they would be unwilling to participate even in a coordinated policy of avoiding the race – i.e., they plan to charge ahead regardless of what anyone else does. And in such a context, even from a fancier decision-theoretic perspective that aspires to act in line with the policy you hope that everyone whose decision-procedure is suitably correlated with your own will adopt, the “I’ll just charge ahead regardless” actors aren’t suitably correlated with you and hence aren’t suitably influence-able. (Perhaps some decision-theories would direct you to act in accordance with the policy that these actors would adopt if they had better/more-idealized views/intentions, but this seems to me less natural as a first-pass approach.)
Though: there are limits to the energy I’m going to devote to re-litigating the issue.
Though per my comments about opportunity cost above, I think the most likely reason I’d leave Anthropic has to do with the possibility that I could be doing better work elsewhere, rather than something about the ethics of working at a company developing advanced AI in particular.
And/or, to see if I can be suitably helpful elsewhere.
I do think that eventually, realizing anywhere near the full potential of human civilization will require access to advanced AI or something equivalently capable.