Recent advances in machine learning—in reinforcement learning, language modeling, image and video generation, translation and transcription models, etc.—without similarly striking safety results, have rather dampened the mood in many AI Safety circles. If I were any less concerned by extinction risks from AI, I would have finished my PhD as planned before moving from Australia to SF to work at Anthropic; I believe that the situation is both urgent and important.
On the other hand, despair is neither instrumentally nor terminally valuable. This essay therefore lays out some concrete reasons for hope, which might help rebalance the emotional scales and offer some directions to move in.
Background: a little about Anthropic
I must emphasize here that this essay represents only my own views, and not those of my employer. I’ll try to make this clear by restricting “we” to actions and using “I” for opinions, to avoid attributing my own views to my colleagues. Please forgive any lapses of style or substance.
Anthropic’s raison d’être is AI Safety. It was founded in early 2021, as a public benefit corporation, and focuses on empirical research with advanced ML systems. I see our work as having four key pillars:
- Training near-SOTA models. This ensures that our safety work will in fact be relevant to cutting-edge systems, and we’ve found that many alignment techniques only work at large scales. Understanding how capabilities emerge over model scale and training time seems vital for safety, as a basis to proceed with care or as a source of evidence that continuing to scale capabilities would be immediately risky.
- Direct alignment research. There are many proposals for how advanced AI systems might be aligned, many of which can be tested empirically in near-SOTA (but not smaller) models today. We regularly produce the safest model we can with current techniques, and characterize how it fails in order to inform research and policy. With RLHF as a solid baseline and building block, we're investigating more complicated but robust schemes such as constitutional AI, scalable supervision, and model-assisted evaluations.
- Interpretability research. Fully understanding models could let us rule out learned optimizers, deceptive misalignment, and more. Even limited insights would be incredibly valuable as an independent check on other alignment efforts, and might offer a second chance if they fail.
- Policy and communications. I expect AI capabilities will continue to advance, with fast-growing impacts on employment, the economy, and cybersecurity. Having high-trust relationships between labs and governments, and more generally ensuring policy-makers are well-informed, seems robustly positive.
If you want to know more about what we’re up to, the best place to check is anthropic.com for all our published research. We’ll be posting more information about Anthropic throughout this year, as well as fleshing out the website.
Concrete reasons for hope
My views on alignment are similar to (my understanding of) Nate Soares’. I think the key differences are that I don’t think there’s enough evidence to confidently predict the difficulty of future problems, and that I do think it’s possible for careful labs to avoid active commission of catastrophe. We also seem to have different views on how labs should respond to the situation, which this essay does not discuss.
Language model interventions work pretty well
I wasn’t expecting this, but our helpful/harmless/honest research is in fact going pretty well! The models are far from perfect, but we’ve made much more progress than I would have expected a year ago, with no signs of slowing down yet. HHH omits several vital pieces of the full alignment problem, but if it leads to AI that always shuts down on command and never causes a catastrophe I’ll be pretty happy.
As expected, we’ve also seen a range of failures on more difficult tasks or where train-time supervision was relatively weak – such as inventing a series of misleading post-hoc justifications when inconsistent responses are questioned. The ‘treacherous turn’ scenario is still concerning, but I find it plausible that work on e.g. scalable supervision and model-based red-teaming could help us detect it early.
Few attempts to align ML systems
There are strong theoretical arguments that alignment is difficult, e.g. about convergent instrumental goals, and little empirical progress on aligning general-purpose ML systems. However, the latter only became possible a few years ago with large language models, and even then only in a few labs! There’s also a tradition of taking theoretically very hard problems, and then finding some relaxation or subset which is remarkably easy or useful in practice – for example SMT solvers vs most NP-complete instances, CAP theorem vs CRDTs or Spanner, etc. I expect that increasing hands-on alignment research will give us a similarly rich vein of empirical results and praxis from which to draw more abstract insights.
Interpretability is promising!
It feels like we’re still in the fundamental-science stage, but interpretability is going much better than I expected. We’re not in the ‘best of all possible worlds’ where polysemanticity just isn’t a thing, but compared to early 2020, transformer interpretability is going great. I’m also pretty optimistic about transfer to new architectures, if one comes along – there are some shared motifs between the ImageNet and transformer circuits threads, and an increasing wealth of tooling and experience.
Mechanistic interpretability is also popping up in many other places! I’ve recently enjoyed reading papers from Redwood Research, Conjecture, and DeepMind, for example, and it feels more like a small but growing field than a single research project. Mechanistic interpretability might hit a wall before it becomes useful for TAI/AGI safety, or simply fail to get there in time; or it might not.
Outcome-based training can be limited or avoided
Supervised learning seems to be working really well, and process-based techniques could plausibly scale to superhuman performance; they might even be better for capabilities in regimes with very scarce feedback or reward signals. I’d expect this to be good for safety relative to outcome-based RL systems, and more amenable to audits and monitoring. “Just” being as skilled as the best humans in every domain – with perfect synergy between every skill-set, at lower cost and wildly higher speed, with consistently-followed and constantly-refined playbooks – would be incredibly valuable.
Good enough is good enough
The first TAI system doesn’t have to be a perfectly aligned sovereign—so long as it’s corrigible it can be turned back off, reengineered, and tried again. The goal is to end the acute risk period, which might be possible via direct assistance with alignment research, by enabling policy interventions, or whatever else.
Training can probably stop short of catastrophe
Model capabilities increase with scale and over the course of training, and evaluation results on checkpoints are quite precisely predictable from scaling laws, even for models more capable than those from which the scaling laws were derived. You can write pretty sensitive evaluations for many kinds of concerning behavior; check log-probs or activations as well as sampled tokens; use any interpretability techniques you like (never training against them!) – and just stop training unless you have a strong positive case for safety!
I’d aim to stop before getting concrete reason to think I was training a dangerous model, obviously, but also value defense in depth against my own mistakes.
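The gating logic described above can be sketched in a few lines. This is a hypothetical illustration only: `scaling_law_prediction`, `safe_to_continue`, and the evaluation names are invented placeholders, not any real training harness, and a real check would cover far more signals (log-probs, activations, interpretability probes).

```python
def scaling_law_prediction(compute: float) -> float:
    """Toy scaling law: predicted eval loss as a power law in compute.
    The coefficients here are arbitrary, for illustration only."""
    return 2.0 * compute ** -0.05

def safe_to_continue(measured_loss: float, compute: float,
                     concern_scores: dict[str, float],
                     tolerance: float = 0.05,
                     concern_threshold: float = 0.01) -> bool:
    """Continue training only with a positive case for safety:
    the checkpoint's loss is on the predicted scaling-law trend,
    AND every evaluation for concerning behavior stays below threshold.
    Any surprise -- on- or off-trend -- defaults to stopping."""
    predicted = scaling_law_prediction(compute)
    on_trend = abs(measured_loss - predicted) <= tolerance
    no_flags = all(score < concern_threshold
                   for score in concern_scores.values())
    return on_trend and no_flags

# A checkpoint that matches the trend and shows no flagged behavior:
ok = safe_to_continue(
    measured_loss=scaling_law_prediction(1e6),
    compute=1e6,
    concern_scores={"deception_probe": 0.002, "shutdown_eval": 0.0},
)

# The same checkpoint with one concerning evaluation flagged:
flagged = safe_to_continue(
    measured_loss=scaling_law_prediction(1e6),
    compute=1e6,
    concern_scores={"deception_probe": 0.5, "shutdown_eval": 0.0},
)
```

The key design choice, matching the defense-in-depth point above, is that the default is to stop: continuing requires every check to pass, rather than stopping requiring a check to fail.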
People do respond to evidence
Not everyone, and not all institutions, but many. The theme here is that proposals which need some people to join are often feasible, including those which require specific people. I often talk to people who have been following AI Safety from a distance for a few years, until some recent result convinced them that there was a high-impact use for their skills and it was time to get more involved.
If a major lab saw something which really scared them, I think other labs would in fact agree to a moratorium on further capabilities until it could be thoroughly investigated. Publicly announcing a pause would more or less declare that AGI was imminent, risk a flood of less safety-conscious entrants to the field, and there are questions of antitrust law too, but I’m confident that these are manageable issues.
I don’t expect a ‘sharp left turn’
The ‘sharp left turn’ problem derives from the claim that capabilities generalize better than alignment. This hasn’t been argued so much as asserted, by analogy to human evolution as our only evidence on how human-level general intelligence might develop. I think this analogy is uninformative, because human researchers are capable of enormously better foresight and supervision than was evolution, and I anticipate careful empirical studies of this question in silico despite the difficulty of interpreting the results.
Anecdotally, I’ve seen RLHF generalize alignment-ish properties like helpfulness and harmlessness across domains at least as well as it generalizes capabilities at their present levels, and I don’t share the intuition that this is very likely to change in future. I think ‘alignment generalization failures’ are both serious and likely enough to specifically monitor and mitigate, but not that a sharp left turn is anywhere near certain.
Conclusion: high confidence in doom is unjustified
By “doom” I mean a scenario in which all humans are comprehensively disempowered by AI before the end of this century. I expect that human extinction would follow from comprehensive disempowerment within decades. It Looks Like You’re Trying To Take Over The World is a central example; human accident or misuse might lead to similarly bad outcomes but I’m focusing on technical alignment here.
Estimates of “P(doom)” are based on only tenuous evidence. I think a wide range is consistent with available evidence and reasonable priors; largely because in my view it’s unclear whether the problem is difficult like the steam engine, or net-energy-gain from fusion, or proving whether P = NP, or perhaps more difficult still. I tried writing tighter estimates but couldn’t construct anything I’d endorse.
Emotionally, I’m feeling pretty optimistic. While the situation is very scary indeed and often stressful, the x-risk mitigation community is a lovely and growing group of people, there’s a large frontier of work to be done, and I’m pretty confident that at least some of it will turn out to be helpful. So let’s get (back) to work!
Contrary to stereotypes about grad school, I was really enjoying it! I’m also pretty sad to shut down the startup I’d spun out (hypofuzz.com), though open-sourcing it is some consolation - if it weren’t for x-risk I wouldn’t be in AI at all. ↩︎
I work directly on AI x-risk, and separately I give 10% of my income to GiveWell’s “All Grants” fund (highest EV), and a further 1% to GiveDirectly (personal values). I value this directly; I also believe that credible signals of altruism beyond AI are important for the health of the community. ↩︎
This is mostly a question of your emotional relationship to the facts of the matter; an accurate assessment of the situation is of course instrumentally vital and to me also terminally desirable. ↩︎
The effect is that the company can raise money from investors and prioritize the mission over shareholder profits. ↩︎
I'm thinking here of the helpful/harmless tradeoff from (eg) fig 26a of our RLHF paper; P(IK) calibration, scalable supervision results, etc. – places where you need pretty good capabilities to do useful experiments. ↩︎
I also think that this is a very important habit for anyone working with SOTA or near-SOTA systems. When whoever it is eventually tries building a TAI or AGI system, I would strongly prefer that they have a lot of hands-on practice aligning weaker AI systems as well as an appreciation that this time is likely to be different. ↩︎
(e.g. here) This likely arose from RLHF training instilling a pattern of justifying or modifying claims in a single step when challenged by shallow train-time human supervision; when pressed harder in deployment, the model generalizes by repeating and escalating this pattern to the point of absurdity. ↩︎
e.g. GPT-3, Minerva, AlphaFold, code models – all language- rather than image-based, which seems right to me. Since I first drafted this essay there's also ChatGPT, which continues to make headlines even in mainstream newspapers. ↩︎
For example software and systems engineering, law, finance, recruiting, ops, etc. – the scale and diversity of AI safety projects in the large language model era demands a wider diversity of skills and experience than earlier times. ↩︎
Why would it take decades? (In contrast to scenarios where AI builds nanotech and quickly disassembles us for spare atoms.) Are you imagining a world where AI-powered corporations, governments, &c. are still mostly behaving as designed, but we have no way to change course when it turns out that industrial byproducts are slowly poisoning us, or ...?
I don't feel I can rule out slow/weird scenarios like those you describe, or where extinction is fast but comes considerably after disempowerment, or where industrial civilization is destroyed quickly but it's not worth mopping up immediately - "what happens after AI takes over" is by nature extremely difficult to predict. Very fast disasters are also plausible, of course.
IMO this isn't a realistic hope, since from my point of view they come with unacceptable downsides.
Quoting another post:
This is essentially saying that there's good evidence that the precursors of deceptive alignment are there, and this is something that I think no alignment plan could deal with.
Here's a link to the post:
Thus, I think this:
Is somewhat unrealistic, since I think a key relaxation alignment researchers are making is pretty likely to be violated IRL.
I'm one of the authors of Discovering Language Model Behaviors with Model-Written Evaluations, and well aware of those findings. I'm certainly not claiming that all is well; and agree that with current techniques models are on net exhibiting more concerning behavior as they scale up (i.e. emerging misbehaviors are more concerning than emerging alignment is reassuring). I stand by my observation that I've seen alignment-ish properties generalize about as well as capabilities, and that I don't have a strong expectation that this will change in future.
I also find this summary a little misleading. Consider for example, "the paper finds concrete evidence of current large language models exhibiting: convergent instrumental goal following (e.g. actively expressing a preference not to be shut down), ..." (italics added in both) vs:
While indeed worrying, models generally seem to have weaker intrinsic connections between their stated desires and actual actions than humans. For example, if you ask about code, models can and will discuss SQL injections (or buffer overflows, or other classic weaknesses, bugs, and vulnerabilities) and best practices to avoid them in considerable detail... while also being prone to writing them wherever a naive human might do so. Step-by-step reasoning, planning, or model cascades do provide a mechanism to convert verbal claims into actions; but I'm confident that strong supervision of such intermediates is feasible.
I'm not sure whether you have a specific key relaxation in mind (and if so what), or that any particular safety assumption is pretty likely to be violated?
The key relaxation here is: deceptive alignment will not happen. In many ways, a lot of hopes are resting on deceptive alignment not being a problem.
I disagree, since I think the non-myopia found is a key way for how something like goal misgeneralization or the sharp left turn could happen, where a model remains very capable, but loses its alignment properties due to deceptive alignment.
For those of us who like accurate beliefs, could you make a second list of the truths you filtered out due to not furthering hope?
That's not how I wrote the essay at all - and I don't mean to imply that the situation is good (I find 95% chance of human extinction in my lifetime credible! This is awful!). Hope is an attitude to the facts, not a claim about the facts; though "high confidence in [near-certain] doom is unjustified" sure is. But in the spirit of your fair question, here are some infelicitous points:
Writing up my views on x-risk from AI in detail and with low public miscommunication risk would be a large and challenging project, and honestly I expect that I'll choose to focus on work, open source, and finishing my PhD thesis instead.
Thanks for writing this, it's great to see people's reasons for optimism/pessimism!
I'm surprised by this sentence in conjunction with the rest of this post: the views in this post seem very different from my Nate model. This is based only on what I've read on LessWrong, so it feels a bit weird to write about what I think Nate thinks, but it still seems important to mention. If someone more qualified wants to jump in, all the better. Non-comprehensive list:
Not as important as the other points, but I'm not even sure how much you disagree here. E.g. Nate on difficulty, from the sharp left turn post:
And on the point of labs, I would have guessed Nate agrees with the literal statement, just thinks current labs aren't careful enough, and won't be?
My Nate model doesn't view this as especially informative about how AGI will go. In particular:
If I understand you correctly, the "vital pieces" that are missing are not ones that make it shut down and never cause catastrophe? (Not entirely sure what they are about instead). My Nate model agrees that vital pieces are missing, and that never causing a catastrophe would be great, but crucially thinks that the pieces that are missing are needed to never cause a catastrophe.
In my Nate model, empirical work with pre-AGI/pre-sharp-left-turn systems can only get you so far. If we can now do more empirical alignment work, that still won't help with what are probably the deadliest problems. Once we can empirically work on those, there's very little time left.
Nate has said he's in favor of interpretability research, and I have no idea if he's been positively surprised by the rate of progress. But I would still guess you are way more optimistic in absolute terms about how helpful interpretability is going to be (see his comments here).
Nate wrote a post which I understand to argue against more or less this claim.
Nate has of course written about how he does expect one. My impression is that this isn't just some minor difference in what you think AGI will look like, but points at some pretty deep and important disagreements (that are upstream of some other ones).
Maybe you're aware of all those disagreements and would still call your views "similar", or maybe you have a better Nate model, in which case great! But otherwise, it seems pretty important to at least be aware there are big disagreements, even if that doesn't end up changing your position much.
I'm basing my impression here on having read much of Nate's public writing on AI, and a conversation over shared lunch at a conference a few months ago. His central estimate for P(doom) is certainly substantially higher than mine, but as I remember it we have pretty similar views of the underlying dynamics to date, somewhat diverging about the likelihood of catastrophe with very capable systems, and both hope that future evidence favors the less-doom view.
This is too simplistic. In reality governments are not monolithic so although a certain lab may have 'high trust relationships' with one or more subdivision of the relevant government, it may have adversarial relationships with other subdivisions of the same government.
This is further complicated by the fact that most modern societies are structured around the government having multiple subdivisions with overlapping jurisdiction, each sufficiently influential to have a veto over substantial decision making of the whole government but not sufficient, by themselves, to push through anything.
So this view abstracts away the most challenging coordination problems.
Ironically this still seems pretty pessimistic to me. I'm glad to see something other than "AHHH!" though, so props for that.
I find it probably more prudent to worry about a massive solar flare, or an errant astral body collision, than to worry about "evil" AI taking a "sharp turn".
I put quotes around evil because I'm a fan of Nietzsche's thinking on the matter of good and evil. Like, what, exactly are we saying we're "aligning" with? Is there some universal concept of good?
Many people seem to dismiss blatant problems with the base premise— like the "reproducibility problem". Why do we think that reality is in fact something that can be "solved" if we just had enough processing power, as it were? Is there some hard evidence for that? I'm not so sure. It's not just our senses that are fallible. There are some fundamental problems with the very concept of "measurement", for crying out loud, which I think it's pretty optimistic to think that super-smart AI is just going to be able to skip over.
I also think if AI gets good enough to "turn evil" as it were, it would be good enough to realize that it's a pretty dumb idea. Humans don't really have much in common with silicon-based life forms, afaik. You can find more rare elements, easier, in space, than you can on Earth. What, exactly, would AI gain by wiping out humanity?
I feel that it's popular to be down on AI, and saying how scary all these "recent" advances really are, but it doesn't seem warranted.
Take the biological warfare ideas that were in the "hard turn" link someone linked in their response. Was this latest pandemic really a valid test-run for something with a very high fatality rate? (I think the data is coming in that far more people had COVID than we initially thought, right?)
CRISPR &c. are, to me, far more scary, but I don't see any way of like, regulating that people "be good", as it were. I'm sure most people here have read or seen Jurassic Park, right? Actually, I think our Science Fiction pretty much sums up all this better than anything I've seen thus far.
I'm betting if we do get AGI any time soon it will be more like the movies Her or AI than Terminator or 2001, and I have yet to see any plausible way of stopping, or indeed "ensuring alignment" (again, along what axis? Who's definition of "good"?)
The answer to any question can be used for good or ill. "How to take over the world" is functionally the same as "how to prevent world takeovers", is it not? All this talk of somehow regulating AI seems akin to talk of regulating "hacking" tools, or "strong maths" as it were.
Are we going to next claim that AI is a munition?
It would be neat to see some hard examples of why we should fear and why we think we can control alignment… maybe I'm just not looking in the right places? So far I don't get what all the fear is about— at least not compared to what I would say are more pressing and statistically likely problems we face.
I think we can solve some really hard problems if we work together, so if this is a really hard problem that needs solving, I'm all for getting behind it, but honestly, I'd like to see us not have all our eggs in one basket here on Earth before focusing on something that seems, at least from what I've seen so far, nigh impossible to actually focus on.
https://theprecipice.com/faq has a good summary of reasons to believe that human-created risks are much more likely than naturally occurring risks like solar flares or asteroid or cometary impacts. If you'd like to read the book, which covers existential risks including from AI in more detail, I'm happy to buy you a copy. Specific to AI, Russell's Human Compatible and Christian's The Alignment Problem are both good too.
More generally it sounds like you're missing the ideas of the orthogonality thesis and convergent instrumental goals.
It might be fun to pair Humankind: A Hopeful History with The Precipice, as both have been suggested reading recently.
It seems to me that we are, as individuals, getting more and more powerful. So this question of "alignment" is a quite important one— as much for humanity, with the power it currently has, as for these hypothetical hyper-intelligent AIs.
Looking at it through a Sci-Fi AI lens seems limiting, and I still haven't really found anything more than "the future could go very very badly", which is always a given, I think.
I've read those papers you linked (thanks!). They seem to make some assumptions about the nature of intelligence, and rationality— indeed, the nature of reality itself. (Perhaps the "reality" angle is a bit much for most heads, but the more we learn, the more we learn we need to learn, as it were. Or at least it seems thus to me. What is "real"? But I digress) I like the idea of Berserkers (Saberhagen) better than run amok Pi calculators… however, I can dig it. Self-replicating killer robots are scary. (Just finished Horizon: Zero Dawn - Forbidden West and I must say it was as fantastic as the previous installment!)
Which of the AI books would you recommend I read if I'm interested in solutions? I've read a lot of stuff on this site about AI now (before I'd read mostly Sci-Fi or philosophy here, and I never had an account or interacted), most of it seems to be conceptual and basically rephrasing ideas I've been exposed to through existing works. (Maybe I should note that I'm a fan of Kurzweil's takes on these matters— takes which don't seem to be very popular as of late, if they ever were. For various reasons, I reckon. Fear sells.) I assume Precipice has some uplifting stuff at the end, but I'm interested in AI specifically ATM.
What I mean is, I've seen a few proposals to "ensure" alignment, if you will, with what we have now (versus say warnings to keep in mind once we have AGI or are demonstrably close to it). One is that we start monitoring all compute resources. Another is that we start registering all TPU (and maybe GPU) chips and what they are being used for. Both of these solutions seem scary as hell. Maybe worse than replicating life-eating mecha, since we've in essence experienced ideas akin to the former a few times historically. (Imagine if reading was the domain of a select few and books were regulated!)
If all we're talking about with alignment here, really, is that folks need keep in mind how bad things can potentially go, and what we can do to be resilient to some of the threats (like hardening/distributing our power grids, hardening water supplies, hardening our internet infrastructure, etc.), I am gung-ho!
On the other hand, if we're talking about the "solutions" I mentioned above, or building "good" AIs that we can use to be sure no one is building "bad" AIs, or requiring the embedding of "watermarks" (DRM) into various "AI" content, or
building or extending sophisticated communication-monitoring apparatus, or other such — to my mind — extremely dangerous ideas, I'm thinking I need to maybe convince people to fight that?
In closing, regardless of what the threats are, be they solar flares or comets (please don't jinx us!) or engineered pathogens (intentional or accidental) or rogue AIs yet to be invented — if not conceived of —, a clear "must be done ASAP" goal is colonization of places besides the Earth. That's part of why I'm so stoked about the future right now. We really seem to be making progress after stalling out for a grip.
Guess the same goes for AI, but so far all I see is good stuff coming from that forward motion too.
A little fear is good! But too much? Not so much.
I really like the idea of 80,000 Hours, and seeing it mentioned in the FAQ for the book, so I'm sure there are some other not-too-shabby ideas there. I oft think I should do more for the world, but truth be told (if one cannot tell from my writing), I barely seem able to tend my own garden.