A non-review of "If Anyone Builds It, Everyone Dies"

by boazbarak
28th Sep 2025
4 min read

10 comments, sorted by top scoring
Eliezer Yudkowsky · 1h

The gap between Before and After is the gap between "you can observe your failures and learn from them" and "failure kills the observer".  Continuous motion between those points does not change the need to generalize across them.

It is amazing how much of an antimeme this is (to some audiences).  I do not know any way of saying this sentence that causes people to see the distributional shift I'm pointing to, rather than mapping it onto some completely other idea about hard takeoffs, or unipolarity, or whatever.

Buck · 31m

Where do you think you've spelled this argument out best? I'm aware of a lot of places where you've made the argument in passing, but I don't know of anywhere where you say it in depth.

(My response last time (which also wasn't really in depth; I should maybe try to articulate my position better sometime...) was this.)

boazbarak · 21m

You seem to be assuming that you cannot draw any useful lessons from cases where failure falls short of killing everyone on earth that would apply to cases where it does. 

However, if AIs advance continuously in capabilities, then there are many intermediate points between today, where (for example) "failure means a prompt injection causes a privacy leak", and a future where "failure means everyone is dead". I believe that if AIs capable of the latter are scaled-up versions of current models, then by studying which alignment methods scale and which do not, we can obtain valuable information.

If you consider the METR graph of (roughly) the duration of tasks quadrupling every year, then you would expect non-trivial gaps between the points at which (to take the cybersecurity example) AI is at the level of a 2025 top expert, AI is equivalent to a 2025 top-level hacking team, and AI reaches 2025 top nation-state capabilities. (And of course, while AI improves, the humans will be using AI assistance as well.)
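To make the arithmetic concrete, here is a minimal sketch in Python (the growth rate, starting horizon, and milestone task-lengths below are illustrative assumptions I am making up for the example, not METR's numbers): under constant exponential growth, a fixed capability ratio between milestones translates into a fixed gap in calendar time.

```python
import math

GROWTH_PER_YEAR = 4.0          # assumed: the task horizon quadruples every year
CURRENT_HORIZON_HOURS = 2.0    # assumed starting point (arbitrary)

# Hypothetical milestones, expressed as the task horizon (in hours) assumed to be
# needed to match each capability level. These numbers are placeholders only.
MILESTONES = {
    "2025 top individual expert": 40.0,    # ~ a focused work week
    "2025 top hacking team":      400.0,   # ~ 10x an individual
    "2025 top nation state":      4000.0,  # ~ 10x a team
}

def years_until(target_hours: float) -> float:
    """Years until the horizon reaches target_hours under constant exponential growth."""
    return math.log(target_hours / CURRENT_HORIZON_HOURS, GROWTH_PER_YEAR)

for name, hours in MILESTONES.items():
    print(f"{name:>28}: ~{years_until(hours):.1f} years from now")
```

Under these placeholder numbers, each 10x jump in required task horizon takes log_4(10) ≈ 1.7 additional years, which is the sense in which such milestones would be spread out rather than arriving all at once.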

I believe there is going to be a long and continuous road ahead between current AI systems and ones like Sable in your book. 
I don't believe that there is going to be an alignment technique that works one day and then completely fails after a 16-hour run on 200K GPUs.
Hence I believe we will be able to learn from both successes and failures of our alignment methods throughout this time.

Of course, it is possible that I am wrong, and future superintelligent systems cannot be obtained by merely scaling up current AIs, but instead require completely different approaches. However, if that is the case, this should update us toward longer timelines, and cause us to consider development of the current paradigm less risky.
 

Vladimir_Nesov · 12m

In this framing, the crux is whether there is an After at all (at any level of capability): the distinction between "failure doesn't kill the observer" (a perpetual Before) and "failure is successfully avoided" (managing to navigate the After).

sjadler · 4h

Thanks for taking the time to write up your reflections. I agree that the before/after distinction seems especially important ('only one shot to get it right'), and is a crux of the EY/NS worldview that I expect many non-readers not to know about.

I’m wondering about your take in this passage:

> In the book they make an analogy to a ladder where every time you climb it you get more rewards but once you reach the top rung then the ladder explodes and kills everyone. However, our experience so far with AI does not suggest that this is a correct world view.

I’m curious what about the world’s experience with AI seems to falsify it from your POV? / casts doubt upon it? Is it about believing that systems have become safer and more controlled over time?

(Nit, but the book doesn’t posit that the explosion happens at the top rung; in that case, we could just avoid ever reaching the top rung. It posits that the explosion happens at a not-yet-known rung, and so each successive rung climb carries some risk of blow-up. I don’t expect this distinction is load-bearing for you though)

boazbarak · 2m

p.s. I just realized that I did not answer your question:

> Is it about believing that systems have become safer and more controlled over time?

No, this is not my issue here. While I hope it won't be the case, systems could well become more risky and less controlled over time. I just believe that if that is the case, then it would be observable via an increased rate of safety failures far before we get to the point where failure means that literally everyone on earth dies.

boazbarak · 15m

See my response to Eliezer. I don't think it's one-shot - I think there are going to be both successes and failures along the way that will give us information that we will be able to use.

Even self-improvement is not a singular event - already AI scientists are using tools such as Codex or Claude Code to improve their own productivity. As models grow in capability, the benefit of such tools will grow, but it is not necessarily one event. Also, I think we would likely require this improvement just to sustain the exponential at its current rate - it would not be sustainable to continue the growth in hiring, so increasing productivity via AI will be necessary.

Re the nit: on page 205 they say "Imagine that every competing AI company is climbing a ladder in the dark. At every rung but the top one, they get five times as much money ... But if anyone reaches the top rung, the ladder explodes and kills everyone. Also nobody knows where the ladder ends."

I'll edit the text a bit so it's clear you don't know where it ends.

RussellThor · 3h

> The idea that there would be a distinct "before" and "after" is also not supported by current evidence which has shown continuous (though exponential!) growth of capabilities over time.

The time when the AI can optimize itself better than a human is a one-off event. You get the overhang/potential take-off here. Also, the AI having a coherent sense of "self" that it could protect by, say, changing its own code or controlling instances of itself could be an attractor and give a "before/after".

boazbarak · 15m

See this response 

boazbarak · now

Also, I don't think a sense of "self" is a singular event either; indeed, today's systems are already growing in situational awareness, which can be thought of as some sense of self. See our scheming paper: https://www.antischeming.ai/


I was hoping to write a full review of "If Anyone Builds It, Everyone Dies" (IABIED, by Yudkowsky and Soares) but realized I won't have time to do it. So here are my quick impressions of and responses to IABIED. I am writing this rather quickly and it's not meant to cover all arguments in the book, nor to discuss all my views on AI alignment; see six thoughts on AI safety and Machines of Faithful Obedience for some of the latter.

First, I like that the book is very honest, both about the authors' fears and predictions and about their policy prescriptions. It is tempting to practice strategic deception: even if you believe that AI will kill us all, to avoid saying so and instead push policy directions that directionally increase AI regulation under other pretenses. I appreciate that the authors are not doing that. As they say, if you are motivated by X but push policies under excuse Y, people will see through it.

I also enjoyed reading the book. Not all parables made sense, but overall the writing is clear. I agree with the authors that the history of humanity is full of missteps and unwarranted risks (e.g. their example of leaded fuel). There is no reason to think that AI will be magically safe on its own just because we have good intentions, or that the market will incentivize safety. We need to work on AI safety, and even if AI falls short of literally killing everyone, there are a number of ways in which its development could turn out badly for humanity or cause catastrophes that could have been averted.

At a high level, my main disagreement with the authors is that their viewpoint is very "binary" while I believe reality is much more continuous. There are several manifestations of this "binary" viewpoint in the book: a hard distinction between "grown" and "crafted" systems, and a hard distinction between current AI and superintelligence.

The authors repeatedly talk about how AI systems are grown, full of inscrutable numbers, and hence we have no knowledge of how to align them. While they are not explicit about it, their implicit assumption is that there is a sharp threshold between non-superintelligent AI and superintelligent AI. As they say, "the greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after." Their story also has a discrete moment of "awakening" where "Sable" is tasked with solving some difficult math problems and develops its own independent goals. Similarly, when they discuss the approach of using AI to help with alignment research, they view it in binary terms: either the AI is too weak to help and may at best help a bit with interpretability, or the AI is already "too smart, too dangerous, and would not be trustworthy."

I believe the line between "grown" and "crafted" is much blurrier than the way the authors present it. First, there is a sense in which complex crafted systems are also "grown". Consider, for example, a system like Microsoft Windows, with tens of millions of lines of source code that have evolved over decades. We don't fully understand it either - which is why we still discover zero-day vulnerabilities. This does not mean we cannot use Windows or shape it. Similarly, while AI systems are indeed "grown", they would not be used by hundreds of millions of users if AI developers did not have strong abilities to shape them into useful products. Yudkowsky and Soares compare training AIs to "tricks ... like the sort of tricks a nutritionist might use to ensure a healthy brain development in a fetus during pregnancy." In reality, model builders have much more control over their systems than even parents who raise and educate their kids over 18 years. ChatGPT might sometimes give the wrong answer, but it doesn't do the equivalent of becoming an artist when its parents wanted it to go to med school.

The idea that there would be a distinct "before" and "after" is also not supported by current evidence, which has shown continuous (though exponential!) growth of capabilities over time. Based on our experience so far, the default expectation would be that AIs will grow in capabilities, including the ability for long-term planning and acting, in a continuous way. We also see that AI's skill profile is generally incomparable to humans'. (For example, it is typically not the case that an AI that achieves a certain score on a benchmark or exam X will perform on task Y similarly to humans who achieve the same score.) Hence there will not be a single moment where AI transitions from human level to superhuman level; rather, AIs will continue to improve, with different skills transitioning from human to superhuman levels at different times.

Continuous improvement means that as AIs become more powerful, our society of humans augmented with AIs also becomes more powerful, both in terms of defensive capabilities and in terms of research on controlling AIs. It also means that we can extract useful lessons about both risks and mitigations from existing AIs, especially if we deploy them in the real world. In contrast, the binary point of view is anti-empirical. One gets the impression that no empirical evidence of alignment advances would change the authors' view, since it would all be evidence from the "before" times, which they don't believe will generalize to the "after" times.

In particular, if we believe in continuous advances, then we have more than one chance to get it right. AIs will not go from cheerful assistants to world destroyers in a heartbeat. We are likely to see many applications of AIs, as well as (unfortunately) more accidents and harmful outcomes, well before they reach the combination of intelligence, misalignment, and unmonitored powers that leads to infecting everyone in the world with a virus that gives them "twelve different kinds of cancer" within a month.

Yudkowsky and Soares talk in the book about various accidents in nuclear reactors and spaceships, but they never mention all the cases where nuclear reactors actually worked and spaceships returned safely. If they are right that there is one threshold which, once passed, means "game over", then this makes sense. In the book they make an analogy to climbing a ladder in the dark, where every rung you climb brings more rewards, but no one can see where the ladder ends, and once you reach the top rung it explodes and kills everyone. However, our experience so far with AI does not suggest that this is the correct worldview.