In fact, I am more worried about partial success than total failure in aligning AIs. In particular, I am concerned that we will end up in the “uncanny valley,” where we succeed in aligning AIs to a sufficient level for deployment, but then discover too late some “edge cases” in the real world that have a large negative impact.
I think it's pretty plausible that AIs which are obviously pretty misaligned are deployed (e.g. we've caught them trying to escape or sabotage research, or maybe we've caught an earlier model doing this and we don't have any strong reason to think we've resolved this in the current system). This is made more likely by an aggressive race to the bottom (possibly caused by an arms race over AI as you discuss). Part of my perspective here is that misalignment issues might be very difficult to resolve in time due to rapid AI progress and difficulty studying misalignment. (For instance, because the very AIs you're studying might not want to be studied!).
I also think it's plausible that very capable AIs will end up being schemers/alignment-fakers which look very aligned (sufficient for your deployment checks), but have misaligned long run aims. And, even if you found evidence of this in earlier AIs, this wouldn't suffice to prevent AIs where we haven't confidently ruled this out from being deployed (see prior paragraph). I also think it's plausible that you won't see smoking gun evidence of this before it's too late as I discuss in a prior post of mine.
The issues I'm worried about don't feel like edge cases or an uncanny valley to me (though I suppose you could think of alignment faking as an uncanny valley).
My understanding is that you disagree about the possibility of relatively worst case scenarios with respect to scheming and think that people would radically change their approach if we had clear evidence of (relatively consistent, across context) scheming without a strong resolution of this problem that generalizes to more capable AIs. I hope you're right.
To be clear, I agree that it would be better if AIs are obviously (seriously) misaligned than if they are instead scheming undetected.
I am not sure how much we actually disagree, but let me add some more clarificatoins.
My mental model is what happened with software security. It was never completely ignored, but for a long time I think many companies had the mental model of "do the minimal security work so that it's not obviously broken." For example, it took some huge scandals for Microsoft to make security its highest priority (see Bill Gates 2002 memo and retrospective). Ultimately the larger companies changed their mindset, but I would like us not to repeat this history with AI, especially given that it is likely to progress faster.
At the moment the way we deploy AIs are in settings (chat or using as coding agent for discrete tasks) where there very output is fed to a human that is ultimately responsible for the results. So this setting allows us to be quite tolerant of alignment failures. But I don't think this will last for long.
I do think that alignment faking is a form of "uncanny valley". There are some subtleties with the examples you mention since it is debatable whether we have "caught" models doing bad thing X or "entrapped" them to do so. This is why I like the sycophancy incident since it is one where (1) the bad thing occured in the wild, (2) it points out to a deeper issue which is the mixed objectives that AIs have, and in particular the objective to satisfy the user as well as the objective to follow the model spec / policies.
But, I agree that we should get to the point where it is impossible to even "entrap" the models to exhibit misaligned behavior. Again in cybersecurity there have been many examples of attacks that were initially dismissed by practitioners as being too "academic" but were eventually extended to realistic setting. I think these days people understand that even such "academic attacks" point out to real weaknesses, since (as people often say in cryptography) "attacks only get better".
I think if the attitude in AI was "there can't be any even slightly plausible routes to misalignment related catastrophe" and this was consistently upheld in a reasonable way, that would address my concerns. (So, e.g. by the time we're deploying AIs which could cause huge problems if they were conspiring against us there needs to be a robust solution to alignment faking / scheming which has broad consensus among researchers in the area.)
I don't expect this because we seem very far from success and this might happen rapidly. (Though we might end up here after eating a bunch of ex-ante risk for some period while using AIs to do safety work.)
I left some comments noting disagreements, but I thought it be helpful to note some areas of agreement:
Currently, it is very challenging to train AIs to achieve tasks that are too hard for humans to supervise. I believe that (a) we will need to solve this challenge to unlock “self-improvement” and superhuman AI, and (b) we will be able to solve it. I do not want to discuss here why I believe these statements. I would just say that if you assume (a) and not (b), then the capabilities of AIs, and hence their risks, will be much more limited.
Ensuring AIs accurately follow our instructions, even in settings too complex for direct human oversight, is one such hard-to-supervise objective. Getting AIs to be maximally honest, even in settings that are too hard for us to verify, is another such objective. I am optimistic about the prospects of getting AIs to maximize these objectives.
I agree that to train AIs which are generally very superhuman, you'll need to be able to make AIs highly capable on tasks that are too hard for humans to supervise. And, that if we have no ability to make AIs capable on tasks which are hard for humans to supervise, risks are much more limited.[1]
However, I don't think that making AIs highly capable on tasks which are too hard for humans to supervise necessarily requires being able to ensure AIs do what we want in these settings nor does it require being able to train AIs for specific objectives in these settings.
Instead, you could in principle create very superhuman AIs through transfer (as humans do in many cases) and this wouldn't require any ability to directly supervise on domains where the AI ends up being superhuman nevertheless. Further, you might be able to directly train AIs to be highly capable (as in, without depending on much transfer) using flawed feedback in a given domain (e.g. feedback which is often possible to reward hack but which still teaches the AI the relevant abilities).
So, I agree that the ability to make very superhuman AIs implies that we'll (very likely) be able to make AIs which are capable of following our instructions and which are capable of being maximally honest, but this doesn't imply that we'll be able to ensure these properties (e.g. the AI could intentionally disobey instructions or lie). Further, there is a difference between being able to supervise instruction following and honesty in any given task and being able to produce an AI which robustly instruction follows and is honest. (Things like online training using this supervision only give you average case guarantees, and that's if you actually use online training.)
It's certainly possible that there is substantial transfer from the task of "train AIs to be highly capable (and useful) in harder-to-check domains" to the task of ensuring AIs are robustly honest and instruction following, but it is also easy to imagine ways this could go wrong. E.g., the AIs are faking alignment or at some level of capability, increased capabilities (from transfer or whatever) still makes the AIs (seem) more useful while simultaneously it makes them less instruction following and honest due to issues in the training signal.
(My all-things-considered view is that the default course of advancing capabilities and usefulness will end up figuring out some ways to supervise AIs in training sufficiently well to train them to perform reasonably well on average in most cases (according to human judgment of outcomes), but this performance will substantially depend on transfer and generalization in ways which aren't robust to egregious misalignment. And that this will solve some alignment problems that would have otherwise existed if not for people optimizing for usefulness over pure capabilities. That said I also think it's possible that we'll be seeing increasingly sophisticated and egregious reward hacking rise with capabilities but with a sufficient increase in usefulness in many domains despite this reward hacking such that scaling continues. And, I don't think handling worst case scheming/alignment-faking will be very incentivized by commercial/capabilities incentives by default.)
You might separately be optimistic about getting superhuman AIs to robustly follow instructions, but I don't think "we have a way to make the AIs superhumanly capable in general" implies "we can ensure the AIs actually robustly follow instructions (rather than just being capable of following instructions)".
To the extent that you're defining "capable" in a somewhat non-standard way, it seems good to be careful about this and consider explaining how you are using these terms (or consider defining a new term).
That said, I do think it would be possible in principle to automate AI R&D or at least automate most of AI R&D (perhaps what you mean by unlocking "self-improvement") even if we could only initially make AIs highly capable on tasks which humans can supervise. Humans can supervise the tasks involved in AI R&D and superhumanness isn't necessarily required for this automation. Also, there are verification generation gaps in AI R&D because we can use outcome based feedback, so you can in principle get substantially superhuman AI R&D while only training AIs on tasks you could have in principle supervised. In practice, the way AI R&D is done today often requires doing tasks which are expensive to run and only happen a few times (e.g. big training runs), so literally doing outcome based RL over the whole process wouldn't work with the current setup. But, humans can in principle supervise the process, it's just that the process is expensive to run.
I agree that I am not justifying in this essay my optimism about getting superhuman AI to robustly follow instructions. You can think as the above as more like the intuition that generally points out in that direction.
Justifying this optimism is a side discussion, but TBH rather than discussions I hope that we can make empirical progress toward justifying it.
I believe that AI (including AGI and ASI) can do the same and be a positive force for humanity. I also believe that it is possible to solve the “technical alignment” problem and build AIs that follow the words and intent of our instructions and report faithfully on their actions and observations.
I will not defend these two claims here. However, even granting these optimistic premises, AI’s positive impact is not guaranteed.
It's interesting that you describe the claims "AI can be a positive force for humanity" and "technical alignment can be solved" as optimistic premises. I think people who think misalignment risks are catastrophically high (e.g. a >25% chance of literal AI takeover) would agree with these premises (they think misalignment risks could be avoided, it's just that we're at least somewhat likely to not succeed at this), so these claims don't typically distinguish between typical optimists and pessimists (at least with respect to worries around misalignment risks).
Perhaps when you say "it is possible to solve" you mean "we're very likely to solve it in practice given the realistic time and budget we will have for this problem". In this case, there certainly is disagreement!
I'm not sure whether you were trying to highlight typical disagreements or if you introduced things in this way for some othe reason.
You might imagine that by positing that technical alignment is solvable, I “assumed away” all the potential risks with AI.
I certainly don't think so! As you note, solvable does not imply that it will be solved!
Thanks! Note that I did not optimize this essay just for the LessWrong audience, so different people might have different points on agreements and disagreement.
I think I am indeed more optimistic, but I would not say that we will solve it "by default." Like I say in the essay, I don't think that by simply letting the market work we will get sufficient level of alignment. We need to make an effort- this is similar to the example of Microsoft that I mentioned in this comment, but I hope that unlike the Microsoft example we don't need to wait for huge failures until we do that.
Thank you! This is quite relevant. Some of your concerns are about the technical feasibility of achieving instruction following, which is not a point I’m going into in this post. FWIW when I say “instruction following” I mean models that respect the instruction hierarchy and chain of command (for example as in our model spec). However I do not mean models that obey hypothetical future instructions that were not given to them (which is one option you mention in your post).
If they don't obey future instructions not yet given, the only sensible thing to do to carry out your current instructions thoroughly and with certainty is to make sure you can't issue new instructions. That would logically prevent them from completing their instructions and represent utter failure in their goal.
I think AI assistants would have common sense even if they are obedient. I doubt an AI assistants would interpret “go fetch me coffee” as “kill me first so I can’t interrupt your task and then fetch me coffee” but YMMV
I don't think it's safe to assume that LLM-based AGI will have common sense (maybe this is different from the assistant you're addressing, in that it's a lot smarter and can think for itself more?). I'm talking about machines based on networks, but which can also reason in depth. They will understand common sense, but that doesn't ensure that it will guide their reasoning.
So it depends what you mean by "obedient". And how you trained them to be obedient. And whether that ensures that their interpretation doesn't change once they can reason more deeply than we can.
So I think those questions require serious thought, but you can't tackle it all at once, so starting by assuming that all works is also sensible. I'm just focusing on that first, because I don't think that's likely to work unless we put a lot more careful thought in before we try it.
If you look at my previous posts on the topic, linked from Problems with instruction-following, you'll see that I was initially more focused on downstream concerns like yours. After working in the field for longer and having more in-depth discussions and research on the techniques we'd use to align future LLM agents, I am increasingly concerned that it's not that easy, and we should focus on the difficulty of aligning AGI while we still have time. Difficulties from aligned AGI are also substantial, and I've addressed those as well in my string of work backlinked from Whether governments will control AGI is important and neglected
I am also drawn to the idea that government control of AGI is quite dangerous; I've addressed this tension in Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
But on the whole, I think the risks of widely distributed AGI are even greater. How many people can control a mind that can create technologies capable of taking over or destroying the world before someone uses them?
Michael Nielsen's excellent ASI existential risk: Reconsidering Alignment as a Goal is a similar analysis of why even obedient aligned AGI would be intensely dangerous if it's allowed to proliferate.
On the "safety underinvestment" point I am going to say something which I think is obvious (and has probably been discussed before) but I have not personally seen anyone advocate for:
The DoD should be conducting its own safety/alignment research, and pushing for this should be one of the primary goals of safety advocates.
There is this constant push for labs to invest more in safety. At the same time we all acknowledge this is a zero sum game. Every dollar/joule/flop they put into safety/alignment is a dollar/joule/flop they can't put into capabilities research. I think this dynamic of insisting that labs carry the torch entirely on safety, instead of pushing for government funded safety research, is a big part of why you've seen labs/VCs becoming increasingly hostile to safety advocacy/regulation. Because the only pitch so far makes safety onerous for labs.
Pushing for safety/alignment research to be done by the DoD allows for investments to be made into safety in a way that doesn't require labs reroute scarce resources or decelerate their capabilities progress. It can also be purely additive, there's no reason labs can't continue their own safety research as well.
I think it's useful for government to do safety/alignment but:
1. I don't think it's a zero sum game - often safety/alignment improvements go hand in hand with capabilities.
2. Like we have seen with software security, if safety is not "baked in", it is hard to add it after the fact, so it is important for safety and capabilities researchers to work together.
[Crossposted on Windows On Theory]
Throughout history, technological and scientific advances have had both good and ill effects, but their overall impact has been overwhelmingly positive. Thanks to scientific progress, most people on earth live longer, healthier, and better than they did centuries or even decades ago.
I believe that AI (including AGI and ASI) can do the same and be a positive force for humanity. I also believe that it is possible to solve the “technical alignment” problem and build AIs that follow the words and intent of our instructions and report faithfully on their actions and observations.
I will not defend these two claims here. However, even granting these optimistic premises, AI’s positive impact is not guaranteed. In this essay, I will:
In the next decade, AI progress will be extremely rapid, and such periods of sharp transition can be risky. What we — in industry, academia, and government— do in the coming years will matter a lot to ensure that AI’s benefits far outweigh its costs.
Currently, it is very challenging to train AIs to achieve tasks that are too hard for humans to supervise. I believe that (a) we will need to solve this challenge to unlock “self-improvement” and superhuman AI, and (b) we will be able to solve it. I do not want to discuss here why I believe these statements. I would just say that if you assume (a) and not (b), then the capabilities of AIs, and hence their risks, will be much more limited.
Ensuring AIs accurately follow our instructions, even in settings too complex for direct human oversight, is one such hard-to-supervise objective. Getting AIs to be maximally honest, even in settings that are too hard for us to verify, is another such objective. I am optimistic about the prospects of getting AIs to maximize these objectives. I believe it will be possible train AIs with an arbitrary level of intelligence that:
This is what I mean by “solving” the technical alignment problem. Note that this does not require AIs to have an inherent love for humanity, or be fundamentally incapable of carrying out harmful instructions if those are given to them or they are fine-tuned to do so. However, it does require these AIs to reasonably generalize our intent into new situations– what I called before “robust reasonable compliance.”
Intelligence and obedience are independent qualities. It is possible to have a maximally obedient AI capable of planning and carrying out arbitrarily complex tasks—like proving the Riemann hypothesis, figuring out fusion power, or curing cancer—that have so far proven beyond humanity’s collective intelligence. Thus, faithful and obedient AIs can function as incredibly useful “superintelligent tools”. For both commercial and cultural reasons, I believe such AIs would be the dominant consumers of inference-time compute throughout the first years/decades of AGI/ASI, and as such account for the vast majority of “total worldwide intelligence.” (I expect this to hold even if people experiment with endowing AIs with a “sense of self”, “free will” and/or functioning as independent moral agents.)
The precise “form factor” of faithful obedient AIs is still unknown. It will depend on their capability and alignment profile, as well on their most lucrative or beneficial applications, and all of these could change with time. For example, it may turn out that for a long time AIs will be able to do 95% of the tasks of 95% of human workers, but there will still be a “long tail” of tasks that humans do better, and hence benefit from AI/human collaboration. Also, even if the total amount of intelligence is equivalent to “a country of geniuses in a data center,” this does not mean that this is how we will use AI. Perhaps it will turn out more economically efficient to simulate 1,000 geniuses and ten countries of knowledge workers. Or we would integrate AIs in our economy in other ways that don’t correspond to drop-in replacement of humans.
What does it mean to “solve” alignment? Like everything with AI, we should not expect 100% guarantees, but rather increasingly better approximations of obedience and faithfulness. The pace at which these approximations improve relative to the growth in capabilities and high-stakes deployment could significantly affect AI outcomes.
A good metaphor is cryptography and information security. Cryptography is a “solved” problem in the sense that we have primitives such as encryption and digital signatures for which we can arbitrarily decrease the probability of attacker success as an inverse exponential function of the size of our key. Of course, even with these primitives, there are still many challenges in cybersecurity. Part of this is because even specifying what it means for a system to be secure is difficult, and we also need to deal with implementation bugs. I suspect we will have similar challenges in AI.
That said, cybersecurity is an encouraging example since we have been able to iteratively improve the security of real world systems. For example, every generation of iPhone has become more secure than previous ones, to the extent that even nation states find it difficult to hack. Note that we were extremely lucky in cryptography that the attacker’s resources scale exponentially with the defender’s (i.e. key size and computation of encryption/signing. etc.). Security would have required much more overhead if the dependence was, for example, quadratic. It is still unknown which dependencies of alignment reliability to resources such as train and test time compute can be achieved.
Even if we “solve” alignment and train faithful and obedient AIs, that does not guarantee that all AIs in existence are faithful and obedient, or that they obey good instructions. However, I believe that we can handle a world with such “bad AIs” as long as (1) the vast majority of inference compute (or in other words, the vast majority of “intelligence”) is deployed via AIs that are faithful and obedient to responsible actors, and (2) we do the work (see below) to tilt the offense/defense balance to the side of the defender.
You might imagine that by positing that technical alignment is solvable, I “assumed away” all the potential risks with AI. However, I do not believe this to be the case. The “transition period” as AI capability and deployment rapidly ramps up will be inherently risky. Some potential dangers include:
Risk of reaching the alignment “uncanny valley” and unexpected interactions. Even if technical alignment is solvable, it does not mean that we will solve it in time and in particular get sufficiently good guarantees on obedience and faithfulness. In fact, I am more worried about partial success than total failure in aligning AIs. In particular, I am concerned that we will end up in the “uncanny valley,” where we succeed in aligning AIs to a sufficient level for deployment, but then discover too late some “edge cases” in the real world that have a large negative impact. Note that the world in which future AIs will be deployed will be extraordinarily complex, with a combination of new applications, and a plethora of actors that would also include other AIs (that may not be as faithful or obedient, or obey bad actors). The more we encounter such crises, whether through malfunction, misalignment, misuse, or their combination, the less stable AI’s trajectory will be, and the less likely it is to end up well.
Risk of a literal AI arms race. AI will surely drive a period of intense competition. This competition can be economic, scientific, or military in nature. The economic competition between the U.S. and China has been intense, but has also had beneficial outcomes, including a vast reduction in Chinese poverty. I hope companies and countries will compete for AI leadership in commerce and science, rather than a race to develop more deadly weapons with an incentive to deploy them before the other side does.
Risk of a surveillance state. AI could radically alter the balance of power between citizens and governments, although it is challenging to predict in which direction. One possibility is that instead of empowering individuals, AI will enable authoritarian governments to better control and surveil their citizens.
Societal upheaval. Even if in the long run AI will “lift all boats”, if in the short term there are more losers than winners, this could lead to significant societal instability. This will largely depend on the early economic and social impact of AI. Will AI be first used to broaden access to quality healthcare and education? Or will it be largely used to automate away jobs in the hope that the benefits will eventually “trickle down”?
There are several scenarios which can make the risks above more likely.
Pure internal deployment. One potential scenario is that labs decide that the best way to preserve a competitive edge is not to release their models and use them purely for internal AI R&D. Releasing models creates value for consumers and developers. But beyond that, not releasing models yields a significant risk in creating a feedback cycle where a model is used to train new versions of itself, without the testing and exposure that comes with an external release. Like the (mythical) New York subway albino crocodiles or the (real) “Snake Island” golden lanceheads, models that develop without contact with the outside world could be vulnerable and dangerous in ways that are hard to predict. No amount of internal testing, whether by the model maker or third parties, could capture the complexities that one discovers in real-world usage (with OpenAI’s sycophancy incident being a case in point). Also, no matter how good the AI is, there are always risks and uncertainties when using it to break new ground, even internally. If you are using AI to direct a novel training run, one that is bigger than any prior ones, then by definition, it would be “out of distribution” as no such run exists in the training set.
Single monopoly: Lack of competition can enable the “pure internal deployment” scenario. If a single company is by far and away the market leader, it will be tempting for it to keep its best models to itself and use them to train future models. But if two companies have models of similar capabilities, the one that releases its model will get more market share, mindshare, and resources. So the incentive is to share your models with the world, which I believe is a good thing. This does not mean labs would not use their models for internal AI R&D. But hopefully, they would also release them and not keep their best models secret for extended periods. Of course, releases must still be handled responsibly, with thorough safety testing and transparent disclosure of risks and limitations. Beyond this, while a single commercial monopoly is arguably not as risky as an authoritarian government, the concentration of power with any actor is bad in itself.
Zero-sum instead of positive-sum. Like natural intelligence, artificial intelligence has an unbounded number of potential applications. Whether it is addressing long-standing scientific and technological challenges, discovering new medicines, or extending the reach of education, there are many ways in which AI can improve people’s lives. The best way to spread these benefits quickly is through commercial innovation, powered by the free market. However, it is also possible that as AI’s power becomes more apparent, governments will want to focus its development toward military and surveillance applications. While I am not so naive as to imagine that advances in AI would not be used in the military domain, I hope that the majority of progress will continue in applications that “lift all boats.” I’d much rather have a situation where the most advanced AI is used by startups who are trying to cure cancer, revolutionize education, or even simply make money, than for killer drones or mass surveillance.
Overly obedient AI. Part of the reason I fear AI usage for government control of citizens is precisely because I believe we would be able to make AIs simultaneously super intelligent and obedient. Governments are not likely to deploy AIs that, like Ed Snowden or Mark Felt (aka “Deep Throat”), would leak to the media if the NSA is spying on our own citizens or the president is spying on the opposing party. Similarly, I don’t see governments willingly deploying an AI that, like Stanislav Petrov, would disobey its instructions and refuse to launch a nuclear weapon. Yet, historically, such “disobedient employees” have been essential to protecting humanity. Preserving our freedoms in a world where the government can have an endless supply of maximally obedient and highly intelligent agents is nontrivial, and we would stand a better chance if we first evolved both AI and our understanding of it in the commercial sector.
There are mitigations for this scenario. First, we should maintain “humans in the loop” in high-stakes settings and make sure that these humans do not become mere “rubber stamps” to AI decisions. Also, if done right, AI can also increase transparency in government, as long as we maintain the invariant that AI communication is faithful and legible. For example, we could demand that all AIs used in government follow a model spec that adheres first to the constitution and other laws. In particular, unlike some humans, a law-abiding AI will not try to circumvent transparency regulations, and all its prompts and communication will be accessible to FOIA requests.
Offensive vs. defensive applications. As the name indicates, AGI is a very general technology. It can be used for both developing vaccines and engineering new viruses, for both software verification as well as discovering new vulnerabilities. Moreover, there is significant overlap between such “offensive” and “defensive” uses of AI, and they cannot always be cleanly separated. In the long run, I believe both offensive and defensive applications of AI will be explored. But the order in which these applications are developed can make a huge difference as to whether AI is harmful or beneficial. If we use AI to strengthen our security or vaccine-development infrastructure, then we will be in a much better position if AI is later used to enable attacks.
It’s not always easy to distinguish “offensive” vs. “defensive” applications of AI. For example, even offensive autonomous weapons can be used in a defensive war. But generally, even if we trust ourselves to only use an offensive AI technology X “for a good cause,” we still must contend with the fact that:
Thus, when exploring potential applications of AI, we should ask questions such as: (1) if everyone (including our adversaries) had access to this technology, would it have a stabilizing or destabilizing effect? (2) What could happen if malicious parties (or misaligned AIs) got access to this technology? If companies and governments choose to invest resources in AI applications that have stabilizing or defensive impacts and strengthen institutions and societies, and decline or postpone pursuing destabilizing or offensive impacts, then we will be more likely to navigate the upcoming AI transition safely.
Safety underinvestment. There is a tension between safety and the other objectives of intense competition and iterative deployment. If we deploy AI widely and quickly, it will also be easier for bad actors to get it. Since iterative deployment is its own good, we should compensate for this by overinvesting in safety. While there are market incentives for AI safety, the competitive pressure to be first to market, and the prevalence of “tail risks” that could take time to materialize and require significant investment to even quantify, let alone mitigate, means that we are unlikely to get a sufficient investment in safety through market pressure alone. As mentioned above, we may find ourselves in the “alignment uncanny valley,” where our models appear “safe enough to deploy” and even profitable in the short term, but are vulnerable or misaligned in ways that we will only discover too late. In AI, “unknown unknowns” are par for the course, and it is society at large that will bear the cost if things go very wrong. Labs should invest in AI safety, particularly in solving the “obedience and faithfulness” task to multiple 9’s of reliability, before deploying AIs in applications that require this.
AI has the potential to unleash, within decades, advances in human flourishing that would eclipse those of the last three centuries. But any period of quick change carries risks, and our actions in the next few years will have outsize impacts. We have multiple technical, social, and policy problems to tackle for it to go well. We'd better get going.
Notes and acknowledgements. Thanks to Josh Achiam, Sam Altman, Naomi Bashkansky, Ronnie Chatterji, Kai Chen, Jason Kwon, Jenny Nitshinskaya, Gabe Wu, and Wojciech Zaremba for comments on this post. However, all responsibility for the content is mine. The opinions in this post are my own and do not necessarily reflect my employer’s or my colleagues. The title of the post is inspired by Dario Amodei's highly recommended essay "Machines of Loving Grace."