Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. Looking for new projects.

Sequences

Slowing AI

Comments

This shortform is relevant to, e.g., understanding what's going on at Anthropic and weighing the value of working on safety there, not just to pressuring Anthropic.

@Neel Nanda 

There's a selection effect in what gets posted about. Maybe someone should write the "ways Anthropic is better than others" list to combat this.

Edit: there’s also a selection effect in what you see, since negative stuff gets more upvotes…

  1. I tentatively think this is a high-priority ask
  2. Capabilities research isn't a monolith and improving capabilities without increasing spooky black-box reasoning seems pretty fine
  3. If you're right, I think the upshot is (a) Anthropic should figure out whether to publish stuff rather than let it languish and/or (b) it would be better for lots of Anthropic safety researchers to instead do research that's safe to share (rather than research that only has value if Anthropic wins the race)

I was recently surprised to notice that Anthropic doesn't seem to have a commitment to publish its safety research.[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So it probably has lots of safety research it's not publishing. E.g., my impression is that it's not publishing its scalable oversight research, and especially not its adversarial robustness and inference-time misuse prevention research.

Not-publishing-safety-research is consistent with Anthropic prioritizing the win-the-race goal over the help-all-labs-improve-safety goal, insofar as the research is commercially advantageous. (Insofar as it's not, not-publishing-safety-research is baffling.)

Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness, so that others can copy its practices.

(I think this is not a priority for me to investigate but I'm interested in info and takes.)

  1. ^

    I failed to find good sources saying Anthropic publishes its safety research. I did find:

    1. https://www.anthropic.com/research says "we . . . share what we learn [on safety]."
    2. President Daniela Amodei said "we publish our safety research" on a podcast once.
    3. Cofounder Nick Joseph said this on a podcast recently (seems false but it's just a podcast so that's not so bad):
      > we publish our safety research, so in some ways we’re making it as easy as we can for [other labs]. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”

    Edit: also cofounder Chris Olah said "we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk." But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.

> OpenAI has never (to my knowledge) made public statements about not using AI to automate AI research

I agree.

Another source:

> OpenAI intends to use Strawberry to perform research. . . .
>
> Among the capabilities OpenAI is aiming Strawberry at is performing long-horizon tasks (LHT), the document says, referring to complex tasks that require a model to plan ahead and perform a series of actions over an extended period of time, the first source explained.
>
> To do so, OpenAI is creating, training and evaluating the models on what the company calls a “deep-research” dataset, according to the OpenAI internal documentation. . . .
>
> OpenAI specifically wants its models to use these capabilities to conduct research by browsing the web autonomously with the assistance of a “CUA,” or a computer-using agent, that can take actions based on its findings, according to the document and one of the sources. OpenAI also plans to test its capabilities on doing the work of software and machine learning engineers.

This is just the paralysis argument. (Maybe any sophisticated non-consequentialists will have to avoid this anyway. Maybe this shows that non-consequentialism is unappealing.)

[Edit after Buck's reply: I think it's weaker because most Anthropic employees aren't causing the possible-deaths, just participating in a process that might cause deaths.]

(The LTBT got the power to appoint one board member in fall 2023 but didn't do so until May. It got the power to appoint a second in July but hasn't done so yet. It gets the power to appoint a third in November, and it doesn't seem to be on track to make that appointment then.)

(And the LTBT might make non-independent appointments, in particular keeping Daniela.)

Sure, good point. But it's far from obvious that the best interventions long-term-wise are the best short-term-wise, and I believe people are mostly just thinking about short-term stuff. I'd feel better if people talked about training data or whatever rather than just "protect any interests that warrant protecting" and "make interventions and concessions for model welfare."

(As far as I remember, nobody's published a list of how short-term AI welfare stuff can boost long-term AI welfare stuff that includes the training-data thing you mention. This shows that people aren't thinking about long-term stuff. Actually there hasn't been much published on short-term stuff either, so: shrug.)

tl;dr: I think Anthropic is on track to trade off nontrivial P(win) to improve short-term AI welfare,[1] and this seems bad and confusing to me. (This worry isn't really based on this post; the post just inspired me to write something.)


Anthropic buys carbon offsets to be carbon-neutral. Carbon-offset mindset involves:

  • Doing things that feel good—and look good to many ethics-minded observers—but are motivated more by purity than by doing as much good as possible, and are thus likely to be much less valuable than the best way to do good (on the margin)
  • Focusing on avoiding doing harm yourself, rather than focusing on net good or noticing how your actions affect others[2] (related concept: inaction risk)

I'm worried that Anthropic will be in carbon-offset mindset with respect to AI welfare.

There are several stories you can tell about how working on AI welfare soon will be a big deal for the long-term future (like, worth >>10^60 happy human lives):

  • If we're accidentally torturing AI systems, they're more likely to take catastrophic actions. We should try to verify that AIs are ok with their situation and take it seriously if not.[3]
  • It would improve safety if we were able to pay or trade with near-future potentially-misaligned AI, but we're not currently able to, likely in part because we don't understand AI-welfare-adjacent stuff well enough.
  • [Decision theory mumble mumble.]
  • Also just "shaping norms in a way that leads to lasting changes in how humanity chooses to treat digital minds in the very long run," somehow.
  • [More.]

But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is that it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives. Numbers aside, the focus is on "we should avoid causing a moral catastrophe in our own deployments" and on merely Earth-scale stuff, not on "we should increase the chance that long-term AI welfare and the cosmic endowment go well." Likewise, this post suggests efforts to "protect any interests that warrant protecting" and "make interventions and concessions for model welfare" at ASL-4. I'm very glad that this post mentions that doing so could be too costly, but I think very few resources (that trade off with improving safety) should go into improving short-term AI welfare (unless you're actually trying to improve the long-term future somehow), and most people (including most of the Anthropic people I've heard from) aren't thinking through the tradeoff. Shut up and multiply; treat the higher-stakes thing as proportionately more important.[4] (And notice inaction risk.)
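
To make the "shut up and multiply" arithmetic concrete, here is a rough illustration using the placeholder numbers above (the 10^-40 probability shift is purely illustrative, not an estimate anyone has made): even a vanishingly small change in the probability that the long-term future goes well swamps the merely Earth-scale stakes.

$$\underbrace{10^{-40}}_{\Delta P(\text{long-term future goes well})} \times \underbrace{10^{60}\ \text{lives}}_{\text{long-term stakes}} \;=\; 10^{20}\ \text{lives} \;\gg\; \underbrace{10^{9}\ \text{lives}}_{\text{Earth-scale stakes}}$$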

(Plucking low-hanging fruit for short-term AI welfare is fine as long as it isn't too costly and doesn't crowd out more important AI welfare work.)

I worry Anthropic is both missing an opportunity to do astronomical good in expectation via AI welfare work and setting itself up to sacrifice a lot for merely-Earth-scale AI welfare.


One might reply: "Zach is worried about the long term, but Sam is just talking about decisions Anthropic will have to make in the short term; this is fine." To be clear, my worry is that Anthropic will be much too concerned with short-term AI welfare, so it will make sacrifices (labor, money, interfering with deployments) for short-term AI welfare, and these sacrifices will make Anthropic substantially less competitive and slightly worse on safety, which increases P(doom).


I wanted to make this point before reading this post; the post just inspired me to write it, even though it's not a great example of the attitude I'm worried about, since it mentions that the costs of improving short-term AI welfare might be too great. (But it does spend two subsections on short-term AI welfare, which suggests that the author is much too concerned with short-term AI welfare [relative to other things you could invest effort into], in my view.)

I like and appreciate this post.

  1. ^

    Or—worse—to avoid being the ones to cause short-term AI suffering.

  2. ^

    E.g. failing to take seriously the possibility that you make yourself uncompetitive and your influence and market share just go to less scrupulous companies.

  3. ^

    Related to this and the following bullet: Ryan Greenblatt's ideas.

  4. ^

    For scope-sensitive consequentialists—at least—short-term AI welfare stuff is a rounding error and thus a red herring, except for its effects on the long-term future.

Yes, for one mechanism. It's unclear, but it sounds like "the Trust Agreement also authorizes the Trust to be enforced by the company and by groups of the company’s stockholders who have held a sufficient percentage of the company’s equity for a sufficient period of time" describes a mysterious separate mechanism by which Anthropic/stockholders could disempower the trustees.
