We founded Anthropic because we believe the impact of AI might be comparable to that of the industrial and scientific revolutions, but we aren’t confident it will go well. And we also believe this level of impact could start to arrive soon – perhaps in the coming decade.
This view may sound implausible or grandiose, and there are good reasons to be skeptical of it. For one thing, almost everyone who has said “the thing we’re working on might be one of the biggest developments in history” has been wrong, often laughably so. Nevertheless, we believe there is enough evidence to seriously prepare for a world where rapid AI progress leads to transformative AI systems.
At Anthropic our motto has been “show, don’t tell”, and we’ve focused on releasing a steady stream of safety-oriented research that we believe has broad value for the AI community. We’re writing this now because as more people have become aware of AI progress, it feels timely to express our own views on this topic and to explain our strategy and goals. In short, we believe that AI safety research is urgently important and should be supported by a wide range of public and private actors.
So in this post we will summarize why we believe all this: why we anticipate very rapid AI progress and very large impacts from AI, and how that led us to be concerned about AI safety. We’ll then briefly summarize our own approach to AI safety research and some of the reasoning behind it. We hope by writing this we can contribute to broader discussions about AI safety and AI progress.
As a high level summary of the main points in this post:
AI will have a very large impact, possibly in the coming decade
Rapid and continuing AI progress is a predictable consequence of the exponential increase in computation used to train AI systems, because research on “scaling laws” demonstrates that more computation leads to general improvements in capabilities. Simple extrapolations suggest AI systems will become far more capable in the next decade, possibly equaling or exceeding human level performance at most intellectual tasks. AI progress might slow or halt, but the evidence suggests it will probably continue.
We do not know how to train systems to robustly behave well
So far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless. Furthermore, rapid AI progress will be disruptive to society and may trigger competitive races that could lead corporations or nations to deploy untrustworthy AI systems. The results of this could be catastrophic, either because AI systems strategically pursue dangerous goals, or because these systems make more innocent mistakes in high-stakes situations.
We are most optimistic about a multi-faceted, empirically-driven approach to AI safety
We’re pursuing a variety of research directions with the goal of building reliably safe systems, and are currently most excited about scaling supervision, mechanistic interpretability, process-oriented learning, and understanding and evaluating how AI systems learn and generalize. A key goal of ours is to differentially accelerate this safety work, and to develop a profile of safety research that attempts to cover a wide range of scenarios, from those in which safety challenges turn out to be easy to address to those in which creating safe systems is extremely difficult.
The full post goes into considerably more detail, and I'm really excited that we're sharing more of our thinking publicly.
I think the worldview here seems cogent. It's very good for Anthropic folk to be writing up their organizational-beliefs publicly. I'm pretty sympathetic to "man, we have no idea how to make real progress without empirical iteration, so we just need to figure out how to make empirical iteration work somehow."
I have a few disagreements. I think the most important ones route through "how likely is this to accelerate race dynamics and how bad is that?".
We've subsequently begun deploying Claude now that the gap between it and the public state of the art is smaller.
It sounds like this means Claude is still a bit ahead of the public-state-of-the-art (but not much). But I'm not sure I'm interpreting it correctly.
I want to flag that an Anthropic employee recently told me something like "Anthropic wants to stay near the front of the pack at AI capabilities so that their empirical research is relevant, but not at the actual front of the pack to avoid accelerating race-dynamics." That would be a plausibly reasonable strategy IMO (although I'd still be skeptical about how likely it was to exacerbate race dynamics in a net-negative way). But it sounds like Claude was released while it was an advance over the public sota.
I guess I could square this via "Claude was ahead of the public SOTA, but not an advance over privately available networks?".
But, it generally looks to me like OpenAI and Anthropic, the two ostensibly safety-minded orgs, are nontrivially accelerating AI hype and progress due to local races between the two of them, and I feel quite scared about that.
I realize they're part of some dynamics that extend beyond them, and I realize there are a lot of difficult realities like "we really do believe we need to work on LLMs, those really are very expensive to train, we really need to raise money, the money really needs to come from somewhere, and doing some releases and deals with Google/Microsoft etc seem necessary." But, it sure looks like the end result of all of this is an accelerated race, and even if you're only on 33%ish likelihood of "a really pessimistic scenario", that's a pretty high likelihood scenario to be accelerating towards.
My guess is that from the inside of Anthropic decision-making, the race feels sort of out of their control, and it's better to ride the wave than to sit doing nothing. But it seems to me like "figure out how to slow down the race dynamics here" should be a top organizational priority, even within the set of assumptions outlined in this post.
I both agree that the race dynamic is concerning (and would like to see Anthropic address it explicitly), and also think that Anthropic should get a fair bit of credit for not releasing Claude before ChatGPT, a thing they could have done and probably gained a lot of investment / hype over. I think Anthropic's "let's not contribute to AI hype" is good in the same way that OpenAI's "let's generate massive hype" strategy is bad.
Like definitely I'm worried about the incentive to stay competitive, especially in the product space. But I think it's worth highlighting that Anthropic (and Deepmind and Google AI fwiw) have not rushed to product when they could have. There's still the relevant question "is building SOTA systems net positive given this strategy", and it's not clear to me what the answer is, but I want to acknowledge that "building SOTA systems and generating hype / rushing to market" is the default for startups and "build SOTA systems and resist the juicy incentive" is what Anthropic has done so far & that's significant.
Yeah I agree with this.
To be clear, I think Anthropic has done a pretty admirable job of showing some restraint here. It is objectively quite impressive. My wariness is "Man, I think the task here is really hard and even a very admirably executed company may not be sufficient."
Yeah I think we should all be scared of the incentives here.
I both agree that the race dynamic is concerning (and would like to see Anthropic address them explicitly), and also think that Anthropic should get a fair bit of credit for not releasing Claude before ChatGPT, a thing they could have done and probably gained a lot of investment / hype over.
I mean, didn't the capabilities of Claude leak specifically to OpenAI employees, so that it's pretty unclear that not releasing actually had much of an effect on preventing racing? My current best guess, though I am only like 30% on this hypothesis since there are many possible hypotheses here, is that ChatGPT was developed in substantial part because someone saw or heard about a demo of Claude and thought it was super impressive.
Yeah I think it can both be true that OpenAI felt more pressure to release products faster due to perceived competition risk from Anthropic, and also that Anthropic showed restraint in not trying to race them to get public demos or a product out. In terms of speeding up AI development, not building anything > building something and keeping it completely secret > building something that your competitors learn about > building something and generating public hype about it via demos > building something with hype and publicly releasing it to users & customers. I just want to make sure people are tracking the differences.
so that it's pretty unclear that not releasing actually had much of an effect on preventing racing
It seems like if OpenAI hadn't publicly released ChatGPT then that huge hype wave wouldn't have happened, at least for a while, since Anthropic was sitting on Claude rather than releasing it. I think it's legit to question whether any group scaling SOTA models is net positive but I want to be clear about credit assignment, and the ChatGPT release was an action taken by OpenAI.
In terms of speeding up AI development, not building anything > building something and keeping it completely secret > building something that your competitors learn about > building something and generating public hype about it via demos > building something with hype and publicly releasing it to users & customers.
I think it is very helpful, and healthy for the discourse, to make this distinction. I agree that many of these things might get lumped together.
But also, I want to flag the possibility that something can be very very bad to do, even if there are other things that would have been progressively worse to do.
I want to make sure that groups get the credit that is due to them when they do good things against their incentives.
I also want to avoid falling into a pattern of thinking "well they didn't do the worst thing, or the second worst thing, so that's pretty good!" if in isolation I would have thought that action was pretty bad / blameworthy.
As of this moment, I don't have a particular opinion one way or the other about how good or bad Anthropic's release policy is. I'm merely making the abstract point at this time.
Yeah, I agree with all of this, seems worth saying. Now to figure out the object level... 🤔
That's the hard part.
My guess is that training cutting edge models, and not releasing them is a pretty good play, or would have been, if there wasn't huge AGI hype.
As it is, information about your models is going to leak, and in most cases the fact that something is possible is most of the secret to reverse engineering it (note: this might be true in the regime of transformer models, but it might not be true for other tasks or sub-problems).
But on the other hand, given the hype, people are going to try to do the things that you're doing anyway, so maybe leaks about your capabilities don't make that much difference? This does point out an important consideration, which is "how much information needs to leak from your lab to enable someone else to replicate your results?"
It seems like, in many cases, there's an obvious way to do some task, and the mere fact that you succeeded is enough info to recreate your result. But presumably there are cases where you figure out a clever trick, and even if the evidence of your model's performance leaks, that doesn't tell the world how to do it (though it does cause maybe hundreds of smart people to start looking for how you did it, trying to discover how to do it themselves).
I think I should regard the situation differently depending on the status of that axis.
For comparison, others might want to see the DeepMind alignment team's strategy: https://www.lesswrong.com/posts/a9SPcZ6GXAg9cNKdi/linkpost-deepmind-alignment-team-s-strategy I think this is the equivalent post for OpenAI but someone feel free to correct me: https://www.lesswrong.com/posts/28sEs97ehEo8WZYb8/openai-s-alignment-plans
My summary to augment the main one:
Broadly human-level AI may be here soon and will have a large impact. Anthropic has a portfolio approach to AI safety, considering optimistic scenarios where current techniques are enough for alignment, intermediate scenarios where substantial work is needed, and pessimistic scenarios where alignment is impossible; they do not give a breakdown of probability mass in each bucket and hope that future evidence will help figure out what world we're in (though see the last quote below). These buckets are helpful for understanding the goal of developing better techniques for making AI systems safer and better ways of identifying how safe or unsafe AI systems are. Scaling systems is required for some good safety research: some problems only arise near human level; Debate and Constitutional AI need big models; understanding scaling is needed to understand future risks; and if models are dangerous, compelling evidence will be needed.
They do three kinds of research: Capabilities which they don’t publish, Alignment Capabilities which seems mostly about improving chat bots and applying oversight techniques at scale, and Alignment Science which involves interpretability and red-teaming of the approaches developed in Alignment Capabilities. They broadly take an empirical approach to safety, and current research directions include: scaling supervision, mechanistic interpretability, process-oriented learning, testing for dangerous failure modes, evaluating societal impacts, and understanding and evaluating how AI systems learn and generalize.
I'll note that I'm confused about the Optimistic, Intermediate, and Pessimistic scenarios: how likely does Anthropic think each is? What is the main evidence currently contributing to that world view? How are you actually preparing for near-pessimistic scenarios which "could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime?"
how likely does Anthropic think each is? What is the main evidence currently contributing to that world view?
I wouldn't want to give an "official organizational probability distribution", but I think collectively we average out to something closer to "a uniform prior over possibilities" without that much evidence thus far updating us from there. Basically, there are plausible stories and intuitions pointing in lots of directions, and no real empirical evidence which bears on it thus far.
(Obviously, within the company, there's a wide range of views. Some people are very pessimistic. Others are optimistic. We debate this quite a bit internally, and I think that's really positive! But I think there's a broad consensus to take the entire range seriously, including the very pessimistic ones.)
This is pretty distinct from how I think many people here see things – i.e., I get the sense that many people assign most of their probability mass to what we call pessimistic scenarios – but I also don't want to give the impression that this means we're taking the pessimistic scenario lightly. If you believe there's a ~33% chance of the pessimistic scenario, that's absolutely terrifying. No potentially catastrophic system should be created without very compelling evidence updating us against this! And of course, the scenarios in the intermediate range are also very scary.
How are you actually preparing for near-pessimistic scenarios which "could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime?"
At a very high-level, I think our first goal for most pessimistic scenarios is just to be able to recognize that we're in one! That's very difficult in itself – in some sense, the thing that makes the most pessimistic scenarios pessimistic is that they're so difficult to recognize. So we're working on that.
But before diving into our work on pessimistic scenarios, it's worth noting that – while a non-trivial portion of our research is directed towards pessimistic scenarios – our research is in some ways more invested in optimistic scenarios at the present moment. There are a few reasons for this:
We also have a lot of work that I might describe as trying to move from optimistic scenarios towards more intermediate scenarios. This includes our process-oriented learning and scalable supervision agendas.
But what are we doing now to address pessimistic scenarios? (Again, remember that our primary goal for pessimistic scenarios is just to recognize that we're in one and generate compelling evidence that can persuade the world.)
To be clear, we think pessimistic scenarios are, well, pessimistic and hard! These are our best preliminary attempts at agendas for addressing them, and we expect to change and expand as we learn more. Additionally, as we make progress on the more optimistic scenarios, I expect the number of projects we have targeted on pessimistic scenarios to increase.
There's a unilateralism problem with doing risky stuff in the name of picking up probability mass, where one has Knightian uncertainty about whether that probability mass is actually there. If people end up with a reasonable distribution over alignment difficulty, plus some noise, then the people whose noise happened to make their distribution more optimistic will view it as more worthwhile to trade off accelerated timelines for alignment success in medium-difficulty worlds. Mostly people should just act on their inside view models, but it's pretty concerning to have another major org trying to have cutting-edge capabilities. The capabilities are going to leak out one way or another and are going to contribute to races.
What are the strategic reasons for prioritizing work on intermediate difficulty problems and "easy safety techniques" at this time?
Doesn't this part of the comment answer your question?
We can very easily "grab probability mass" in relatively optimistic worlds. From our perspective of assigning non-trivial probability mass to the optimistic worlds, there's enormous opportunity to do work that, say, one might think moves us from a 20% chance of things going well to a 30% chance of things going well. This makes it the most efficient option on the present margin.
It sounds like they think it's easier to make progress on research that will help in scenarios where alignment ends up being not that hard. And so they're focusing there because it seems to be highest EV.
Seems reasonable to me. (Though noting that the full EV analysis would have to take into account how neglected different kinds of research are, and many other factors as well.)
Thanks Zac. My high level take is that I found this very useful for understanding Anthropic's broader strategy, and I think that I agree with a lot of the thinking. It definitely seems like some of this research could backfire, but Anthropic is aware of that. The rest of my thoughts are below.
I found a lot of value in the examination of different scenarios. I think this provides the clearest explanation for why Anthropic is taking an empirical/portfolio approach. My mental model of people disagreeing with this approach involves them either being more confident about pessimistic (they would say realistic) scenarios, or disagreeing with specific research agendas / having favorites. I'm very uncertain about which scenario we live in, but in the context of that uncertainty, the portfolio approach seems reasonable.
I think the most contentious part of this post will probably be the arguments in favor of working with frontier models. It seems to me that while this is dangerous, the knowledge required to correctly assess a) whether this is necessary and b) what, if any, results that arise from such research should be published seems closely tied to that work itself (i.e., questions like how many safety-relevant phenomena just don't exist in smaller models and how redundant work on small models becomes).
Writing this comment, I feel a strong sense of, "gee, I feel like if anyone would have the insights to know whether this stuff is a safe bet, it would be the teams at Anthropic" and that feels kind of dangerous. Independent oversight such as ARC evals might help us, but a strong internal culture of red-teaming different strategies would also be good.
Quoting from the main article, I wanted to highlight some points:
Furthermore, we think that in practice, doing safety research isn’t enough – it’s also important to build an organization with the institutional knowledge to integrate the latest safety research into real systems as quickly as possible.
I think this is a really good point. The actual implementation of many alignment strategies might be exceedingly technically complicated and it seems unlikely that we could attain that knowledge quickly as opposed to over years of working with frontier models.
In a sense one can view alignment capabilities vs alignment science as a “blue team” vs “red team” distinction, where alignment capabilities research attempts to develop new algorithms, while alignment science tries to understand and expose their limitations.
This distinction also seems good to me. If there is work that can't be published, or until functional independent evaluation is working well, high-quality internal red-teaming seems essential.
[responded to wrong comment!]
Thanks for writing this, Zac and team, I personally appreciate more transparency about AGI labs' safety plans!
Something I find myself confused about is how Anthropic should draw the line between what to publish or not. This post seems to draw the line between "The Three Types of AI Research at Anthropic" (you generally don't publish capabilities research but do publish the rest), but I wonder if there's a case to be more nuanced than that.
To get more concrete, the post brings up how "the AI safety community often debates whether the development of RLHF – which also generates economic value – 'really' was safety research" and says that Anthropic thinks it was. However, the post also states Anthropic "decided to prioritize using [Claude] for safety research rather than public deployments" in the spring of 2022. To me, this feels slightly dissonant—insofar as much of the safety and capabilities benefits from Claude came from RLHF/RLAIF (and perhaps more data, though that's perhaps less infohazardous), this seems like Anthropic decided not to publish "Alignment Capabilities" research for (IMO justified) fear of negative externalities. Perhaps the boundaries around what should generally not be published should also extend to some Alignment Capabilities research then, especially research like RLHF that might have higher capabilities externalities.
Additionally, I've also been thinking that more interpretability research maybe should be kept private. As a recent concrete example, Stanford's Hazy Research lab just published Hyena, a convolutional architecture that's meant to rival transformers by scaling to much longer context lengths while taking significantly fewer FLOPs to train. This is clearly very much public research in the "AI Capabilities" camp, but they cite several results from Transformer Circuits for motivating the theory and design choices behind their new architecture, and say "This work wouldn’t have been possible without inspiring progress on ... mechanistic interpretability." That's all to say that some of the "Alignment Science" research might also be useful as ML theory research and then motivate advances in AI capabilities.
I'm curious if you have thoughts on these cases, and how Anthropic might draw a more cautious line around what they choose to publish.
Unfortunately, I don't think a detailed discussion of what we regard as safe to publish would be responsible, but I can share how we operate at a procedural level. We don't consider any research area to be blanket safe to publish. Instead, we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk. In the case of difficult scenarios, we have a formal infohazard review procedure.
Thanks for the response, Chris, that makes sense and I'm glad to read you have a formal infohazard procedure!
Could you share more about how the Anthropic Policy team fits into all this? I felt that a discussion of their work was somewhat missing from this blog post.
(Zac's note: I'm posting this on behalf of Jack Clark, who is unfortunately unwell today. Everything below is his words.)
Hi there, I’m Jack and I lead our policy team. The primary reason it’s not discussed in the post is that the post was already quite long and we wanted to keep the focus on safety - I helped edit bits of the post and couldn’t figure out a way to shoehorn in stuff about policy without it feeling inelegant / orthogonal.
You do, however, raise a good point, in that we haven’t spent much time publicly explaining what we’re up to as a team. One of my goals for 2023 is to do a long writeup here. But since you asked, here’s some information:
You can generally think of the Anthropic policy team as doing three primary things:
More broadly, we try to be transparent on the micro level, but haven’t invested yet in being transparent on the macro. What I mean by that is many of our RFIs, talks, and ideas are public, but we haven’t yet done a single writeup that gives an overview of our work. I am hoping to do this with the team this year!
Some other desiderata that may be useful:
Our wonderful colleagues on the ‘Societal Impacts’ team led this work on Red Teaming and we (Policy) helped out on the paper and some of the research. We generally think red teaming is a great idea to push to policymakers re AI systems; it’s one of those things that is ‘shovel ready’ for the systems of today but, we think, has some decent chance of helping out in future with increasingly large models.
I don't think that team (if such a team exists) is missing from this post, and if it was missing, then it would be for some pretty good reasons. You can see Jack Clark's thread or DM me if you're interested in that sort of thing.
There is a Policy team listed here. So it presumably exists. I don't think omitting its work from the post has to be for good reasons, it could just be because the post is already quite long. An example of something Anthropic could say which would give me useful information on the policy front; I am making this up, but seems good if true:
In pessimistic and intermediate difficulty scenarios, it may be quite important for AI developers to avoid racing. In addition to avoiding contributing to such racing dynamics ourselves, we are also working to build safety-collaborations among researchers at leading AI safety organizations. If an AI lab finds compelling evidence about dangerous systems, it is paramount that such evidence is disseminated to relevant actors in industry and government. We are building relationships and secure information sharing systems between major AI developers and working with regulators to remain in compliance with relevant laws (e.g., anti-trust).
Again, I have no idea what the policy team is doing, but they could plausibly be doing something like this and could say so, while there may be some things they don't want to talk about.
As Jack notes here, the Policy team was omitted for brevity and focus. You can read that comment for more about the Policy team, including how we aim to give impartial, technically informed advice and share insights with policymakers.
Thank you for this post. It looks like the people at Anthropic have put a lot of thought into this which is good to see.
You mention that there are often surprising qualitative differences between larger and smaller models. How seriously is Anthropic considering a scenario where there is a sudden jump in certain dangerous capabilities (in particular deception) at some level of model intelligence? Does it seem plausible that it might not be possible to foresee this jump from experiments on even slightly weaker models?
We certainly think that abrupt changes of safety properties are very possible! See discussion of how the most pessimistic scenarios may seem optimistic until very powerful systems are created in this post, and also our paper on Predictability and Surprise.
With that said, I think we tend to expect a bit of continuity. Empirically, even the "abrupt changes" we observe with respect to model size tend to take place over order-of-magnitude changes in compute. (There are examples of things like the formation of induction heads where qualitative changes in model properties can happen quite fast over the course of training).
But we certainly wouldn't claim to know this with any confidence, and wouldn't take the possibility of extremely abrupt changes off the table!
In the larger picture, you should also factor in the probability that the oversight (over the breeding of misaligned tendencies) will always be vigilant. The entire history of safety science tells us that this is unlikely, or downright impossible. Mistakes will happen, "obligatory checks" do get skipped, and entirely unanticipated failure modes do emerge. And we should convince ourselves that none of this will happen, with decent probability, from the moment of the first AGI deployment until "the end of time" (or practically, we should show, theoretically, that the ensuing recursive self-improvement or quasi-self-improvement sociotechnical dynamics will only converge to more resilience rather than less resilience). This is very hard to demonstrate, but it must be done to justify AGI deployment. I didn't see evidence in the post that Anthropic appreciates this angle of looking at the problem enough.
Great post. I'm happy to see these plans coming out, following OpenAI's lead.
It seems like all the safety strategies are targeted at outer alignment and interpretability. None of the recent OpenAI, Deepmind, Anthropic, or Conjecture plans seem to target inner alignment, iirc, even though this seems to me like the biggest challenge.
Is Anthropic mostly leaving inner alignment untouched, for now?
It seems like all the safety strategies are targeted at outer alignment and interpretability.
None of the recent OpenAI, Deepmind, Anthropic, or Conjecture plans seem to target inner alignment
Less tongue-in-cheek: certainly it's unclear to what extent interpretability will be sufficient for addressing various forms of inner alignment failures, but I definitely think interpretability research should count as inner alignment research.
I mean, it's mostly semantics, but I think of mechanistic interpretability as "inner" but not alignment, and I think it's clearer that way, personally, so that we don't call everything alignment. Observing properties doesn't automatically get you good properties. I'll read your link, but it's a bit too much for me to wade into atm.
Either way, it's clear how to restate my question: Is mechanistic interpretability work the only inner alignment work Anthropic is doing?
Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:
There's also ongoing work on other teams related to (automated) red teaming of models and understanding how models generalize, which may also turn out to be relevant/helpful for inner alignment. It's pretty unclear to me how useful any of these directions will turn out to be for inner alignment in the end, but we've chosen these directions in large part because we're very concerned about inner alignment, and we're actively looking for new directions that seem useful for mitigating inner misalignment risks.
Thanks for the links and explanation, Ethan.
Thanks for explaining your thoughts on AI safety, it's much appreciated.
I think in general when trying to do good in the world, we should strive for actions that have a high expected value and a low potential downside risk.
I can imagine a high expected value case for Anthropic. But I don't see how Anthropic has few potential downsides. I'm very worried that by participating in the race to AGI, p(doom) might increase.
For an example pointed out in the comments here by habryka:
Could you explain to me why you think there are no large potential downsides to Anthropic? I'm extremely worried the EA/LessWrong community has so far only increased AI risk, and the creation of Anthropic doesn't exactly soothe these worries.
PS: You recently updated your website and it finally has a lot more information about your company and you also finally have a contact email listed, which is great! But I just wanted to point out that when emailing hello [at] anthropic.com I get an email back saying the address wasn't found. I've tried contacting your company about my worries before, but it seems really difficult to reach you.
Good to see Anthropic is serious and seems better than OpenAI.
A few general questions that don't seem to be addressed:
I have some more detailed comments/questions but I want to be sure there aren't obvious answers to these first.
"AI is more dangerous the more different it is from us" seems wrong to me: it is very different and likely to be very dangerous, but that doesn't imply that making it somewhat more like us would make it less dangerous. I don't think brain emulation can be developed in time, replaying evolution seems unhelpful to me, and both seem likely to cause enormous suffering (aka mindcrime).
See my colleague Ethan Perez's comment here on upcoming research, including studying situational awareness as a risk factor for deceptive misalignment.
Thanks. OK I will put some more general thoughts, have to go back a few steps.
To me the more general alignment problem is that AI gives humanity ~10,000 years of progress and probably irreversible change in ~1-10 years. To me the issue is how you raise human intelligence from the level given by biology to the level given by the limits of physics in a way that is as identity-preserving as possible. Building AI seems to be the worst way to do that. If I had a fantasy way, it would be, say, to increase everyone's IQ by 10 points per year for 100+ years until we reach the limit.
We can't do that, but that is why I mentioned WBE. My desire would be to stop AGI, get human mind uploading to work, then let those WBEs raise their IQ in parallel. Their agreed-upon values would then be humanity's values by definition.
If our goal is Coherent Extrapolated Volition or something similar for humanity, then how can we achieve that if we don't increase the IQ of humans (or descendants they identify with)? How can we even know what our own desires/values are at increasing IQs if we don't directly experience them?
I have an opinion on what successful alignment looks like to me, but is it very different for other people? We can all agree on what bad is.
Overall, I like the posted strategy much more than OpenAI’s (in the form of Sam Altman’s post) and Conjecture’s.
I like that the strategy takes some top-down factors into account, namely the scenario breakdown.
Things that still seem to me missing or mistaken in the presented strategy:
So far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless.
AI couldn’t be robustly HHH to everyone - frustrations/conflicts are inevitable, as per Vanchurin et al.'s theory of evolution as multilevel learning, and other related works. Since there are conflicts, and due to the boundedness of computational and other resources for alignment (not alignment as an R&D project, but alignment as a practical task: when a burglar runs towards you, it's impossible to "align" with them on values), AI must be unhelpful, dishonest, and hostile to some actors at some times.
First, it may be tricky to build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers.
“Steerable” → persuadable, per Levin's "Technological Approach to Mind Everywhere".
Some scary, speculative problems might only crop up once AI systems are smart enough to understand their place in the world, to successfully deceive people, or to develop strategies that humans do not understand.
AI is already situationally aware, including as per Anthropic’s own research, "Discovering Language Model Behaviors with Model-Written Evaluations".
Pessimistic scenarios: AI safety is an essentially unsolvable problem – it’s simply an empirical fact that we cannot control or dictate values to a system that’s broadly more intellectually capable than ourselves – and so we must not develop or deploy very advanced AI systems.
Different notions are conflated here. “Controlling and dictating values to AI systems” is neither a synonym for “solving AI safety”, nor desirable, and should not lead to the conclusion that advanced AI systems must not be developed. “Solving AI safety” should be synonymous with having alignment processes, systems, and law codes developed and functioning reliably. This is not the same as “controlling and dictating values”. Alignment mechanisms should permit the evolution of shared values.
However, I do think that developing and implementing such alignment processes “in time” (before the capability research is ready to deliver superhuman AI systems) is almost certainly outside the realm of empirical possibility, especially considering the coordination it would require (including the coordination with many independent actors and hackers because SoTA AI developments get open-sourced and democratised rapidly). “Pivotal acts” lessen the requirement for coordination, but introduce more risk in themselves.
So, I do think we must not develop and deploy very advanced AI systems, and instead focus on merging with AI and doing something like mind upload on a slower timescale.
If we’re in a “near-pessimistic” scenario, this could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime.
I think the precautionary principle dictates that we should do this in any scenario apart from “obviously optimistic” or “nearly obviously optimistic”. In the “AGI Ruin” post, Yudkowsky explained well that any alignment protocol (given certain objective properties of the alignment problem contingent on the systems that we build and the wider socio-technical reality) that “narrowly works” will almost definitely fail due to unknown unknowns. This is an unacceptably high level of risk (unless we judge that stopping short of the deployment of superhuman AGI systems is even riskier, for some reason).
Indications that we are in a pessimistic or near-pessimistic scenario may be sudden and hard to spot. We should therefore always act under the assumption that we still may be in such a scenario unless we have sufficient evidence that we are not.
It’s worse: in the mindset where the alignment is a continuous process between humans and AIs, rather than a problem to “solve”, we shouldn’t count on sudden indications that it will fail, at all. It’s often not how Drift into Failure happens. So, in the latter sentence, I would replace “evidence” with “theoretical explanation”: the explanation that the proposed alignment mechanism is “safe” must be constructive and prospective rather than empirical and inductive.
If it turns out that AI safety is quite tractable, then our alignment capabilities work may be our most impactful research. Conversely, if the alignment problem is more difficult, then we will increasingly depend on alignment science to find holes in alignment capabilities techniques. And if the alignment problem is actually nearly impossible, then we desperately need alignment science in order to build a very strong case for halting the development of advanced AI systems.
I think this paragraph puts everything upside down. “Alignment science” (which I see as a multi-disciplinary research area which blends cognitive science, including epistemology, ethics, rationality, consciousness science, game theory, control theory, resilience theory, and more disciplines) is absolutely needed as a foundation for alignment capabilities work, even if the latter appears to us very successful and unproblematic. And it is the alignment science that should show that the proposed alignment process is safe, in a prospective and constructive way, as described above.
If our work on Scalable Supervision and Process-Oriented Learning produce promising results (see below), we expect to produce models which appear aligned according to even very hard tests. This could either mean we're in a very optimistic scenario or that we're in one of the most pessimistic ones.
This passage doesn’t use the terms “optimistic” and “pessimistic scenarios” in the way they are defined above. The AI system could appear aligned according to very hard tests, but the recognition that alignment is a never-ending process, together with insights from broader alignment science (rather than a narrowly technical “branch” of it that, moreover, implies a certain ontology of cognition where “values” and “processes” have specific meanings), could still show that aligning AI systems over a longer horizon is nevertheless bound to fail.
Our hope is that this may eventually enable us to do something analogous to a "code review", auditing our models to either identify unsafe aspects or else provide strong guarantees of safety.
Interpretability of a system post-training (and fine-tuning) couldn’t provide a “strong guarantee” for safety, since it’s a complex, dynamical system, which could fail in surprising and unexpected ways during deployment. Only always-on oversight (including during deployment) and monitoring, together with various theoretical dynamical models of the system, could (at least in principle) provide some guarantees.
The hope is that we can use scalable oversight to train more robustly safe systems.
Not clear where the “more robustly” comes from. Scalable oversight during training could produce safer/more aligned models, like “manual” oversight would (if it was humanly possible and less costly), ok. But the robustness of safety (alignment) is quite another matter, especially in a broader sense, and it’s not clear how “scalable oversight during training” aids with that.
Learning Processes Rather than Achieving Outcomes
Sounds like it either refers to imitation learning, in which case I'm not sure why this concept is given a new name, or to a sort of process-oriented GOFAI, which will probably not work in practice at the SoTA levels that AI already achieves, let alone superhuman levels. Humans don’t know how to decompose their activity into processes, as evidenced by the fact that process standards (like those from the ISO 9000 series) cannot be followed to the letter to achieve practical results. I would be interested to learn more about this agenda of Anthropic’s.
When a model displays a concerning behavior such as role-playing a deceptively aligned AI, is it just harmless regurgitation of near-identical training sequences? Or has this behavior (or even the beliefs and values that would lead to it) become an integral part of the model’s conception of AI Assistants which they consistently apply across contexts? We are working on techniques to trace a model’s outputs back to the training data, since this will yield an important set of clues for making sense of it.
I really like this, I think this is one of the most interesting pieces of the write-up.