AI strategy & governance. ailabwatch.org. Looking for new projects.
There's a selection effect in what gets posted about. Maybe someone should write the "ways Anthropic is better than others" list to combat this.
Edit: there’s also a selection effect in what you see, since negative stuff gets more upvotes…
I was recently surprised to notice that Anthropic doesn't seem to have a commitment to publish its safety research.[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So it probably has lots of safety research it's not publishing. E.g. my impression is that it's not publishing its scalable oversight research and, especially, its adversarial robustness and inference-time misuse prevention research.
Not-publishing-safety-research is consistent with Anthropic prioritizing the "win the race" goal over the "help all labs improve safety" goal, insofar as the research is commercially advantageous. (Insofar as it's not, not-publishing-safety-research is baffling.)
Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness, so that others can copy its practices.
(I think this is not a priority for me to investigate but I'm interested in info and takes.)
I failed to find good sources saying Anthropic publishes its safety research. I did find:
Edit: also cofounder Chris Olah said "we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk." But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.
OpenAI has never (to my knowledge) made public statements about not using AI to automate AI research.
I agree.
OpenAI intends to use Strawberry to perform research. . . .
Among the capabilities OpenAI is aiming Strawberry at is performing long-horizon tasks (LHT), the document says, referring to complex tasks that require a model to plan ahead and perform a series of actions over an extended period of time, the first source explained.
To do so, OpenAI is creating, training and evaluating the models on what the company calls a “deep-research” dataset, according to the OpenAI internal documentation. . . .
OpenAI specifically wants its models to use these capabilities to conduct research by browsing the web autonomously with the assistance of a “CUA,” or a computer-using agent, that can take actions based on its findings, according to the document and one of the sources. OpenAI also plans to test its capabilities on doing the work of software and machine learning engineers.
This is just the paralysis argument. (Maybe any sophisticated non-consequentialist will have to avoid this anyway. Maybe this shows that non-consequentialism is unappealing.)
[Edit after Buck's reply: I think it's weaker because most Anthropic employees aren't causing the possible-deaths, just participating in a process that might cause deaths.]
(The LTBT got the power to appoint one board member in fall 2023 but didn't do so until May. It got the power to appoint a second in July but hasn't done so yet. It gets the power to appoint a third in November, and it doesn't seem to be on track to make that appointment in November.)
(And the LTBT might make non-independent appointments, in particular keeping Daniela.)
Sure, good point. But it's far from obvious that the best interventions long-term-wise are the best short-term-wise, and I believe people are mostly just thinking about short-term stuff. I'd feel better if people talked about training data or whatever rather than just "protect any interests that warrant protecting" and "make interventions and concessions for model welfare."
(As far as I remember, nobody's published a list of how short-term AI welfare stuff can boost long-term AI welfare stuff that includes the training-data thing you mention. This shows that people aren't thinking about long-term stuff. Actually there hasn't been much published on short-term stuff either, so: shrug.)
tl;dr: I think Anthropic is on track to trade off nontrivial P(win) to improve short-term AI welfare,[1] and this seems bad and confusing to me. (This worry isn't really based on this post; the post just inspired me to write something.)
Anthropic buys carbon offsets to be carbon-neutral. Carbon-offset mindset involves:
I'm worried that Anthropic will be in carbon-offset mindset with respect to AI welfare.
There are several stories you can tell about how working on AI welfare soon will be a big deal for the long-term future (like, worth >>10^60 happy human lives):
But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives. Numbers aside, the focus is on "we should avoid causing a moral catastrophe in our own deployments" and on merely Earth-scale stuff, not on "we should increase the chance that long-term AI welfare and the cosmic endowment go well." Likewise, this post suggests efforts to "protect any interests that warrant protecting" and "make interventions and concessions for model welfare" at ASL-4. I'm very glad that this post mentions that doing so could be too costly, but I think very few resources (that trade off with improving safety) should go into improving short-term AI welfare (unless you're actually trying to improve the long-term future somehow), and most people (including most of the Anthropic people I've heard from) aren't thinking through the tradeoff. Shut up and multiply; treat the higher-stakes thing as proportionately more important.[4] (And notice inaction risk.)
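To make "shut up and multiply" concrete, here's a toy expected-value comparison using the orders of magnitude above; the probabilities (1% for the short-term moral catastrophe, 0.1% for affecting the long-term outcome) are made-up placeholders, not anyone's estimates:

$$\underbrace{10^{-2} \times 10^{9}}_{\text{short-term: expected suffering-filled lives at stake}} = 10^{7} \qquad \ll \qquad \underbrace{10^{-3} \times 10^{60}}_{\text{long-term: expected happy lives at stake}} = 10^{57}$$

Even granting a much higher probability of mattering on the short-term side, the long-term term dominates by ~50 orders of magnitude, which is why treating the higher-stakes thing as proportionately more important points almost entirely at the long-term stuff.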
(Plucking low-hanging fruit for short-term AI welfare is fine as long as it isn't so costly and doesn't crowd out more important AI welfare work.)
I worry Anthropic is both missing an opportunity to do astronomical good in expectation via AI welfare work and setting itself up to sacrifice a lot for merely-Earth-scale AI welfare.
One might reply: "Zach is worried about the long term, but Sam is just talking about decisions Anthropic will have to make in the short term; this is fine." To be clear, my worry is that Anthropic will be much too concerned with short-term AI welfare, so it will make sacrifices (labor, money, interfering with deployments) for short-term AI welfare, and these sacrifices will make Anthropic substantially less competitive and slightly worse on safety, which increases P(doom).
I wanted to make this point before reading this post; the post just inspired me to write it, even though it's not a great example of the attitude I'm worried about, since it mentions that the costs of improving short-term AI welfare might be too great. (But it does spend two subsections on short-term AI welfare, which suggests that the author is much too concerned with short-term AI welfare [relative to other things you could invest effort into], according to me.)
I like and appreciate this post.
Or—worse—to avoid being the ones to cause short-term AI suffering.
E.g. failing to take seriously the possibility that you make yourself uncompetitive and your influence and market share just goes to less scrupulous companies.
Related to this and the following bullet: Ryan Greenblatt's ideas.
For scope-sensitive consequentialists—at least—short-term AI welfare stuff is a rounding error and thus a red herring, except for its effects on the long-term future.
Yes, for one mechanism. It's unclear, but the clause "the Trust Agreement also authorizes the Trust to be enforced by the company and by groups of the company’s stockholders who have held a sufficient percentage of the company’s equity for a sufficient period of time" sounds like it describes a separate, mysterious mechanism for Anthropic/stockholders to disempower the trustees.
This shortform is relevant to, e.g., understanding what's going on and to considerations about the value of working on safety at Anthropic, not just to pressuring Anthropic.
@Neel Nanda