if ASI is developed gradually, alignment can be tweaked as you go along.
The whole problem is that alignment, as in "AI doesn't want to take over in a bad way", is not assumed to be solved. So you think your alignment training works for your current version of pre-takeover ASI, but actually previous versions were already scheming for a long time, so running a version capable of takeover creates a sudden discontinuity for you: the ASI takes over because now it can. It means all your previous alignment work and scheming detection is only really tested the moment you run a version capable of takeover, and you can only fail once on that test. And training against scheming is predicted not to work, just to create stealthier schemers. And "this AI can take over" is predicted to be hard to fake convincingly for the AI, so you can't confidently check for scheming just by observing what it would do in a fake environment.
The whole problem is that alignment, as in “AI doesn’t want to take over in a bad way”, is not assumed to be solved
That's a broken way of thinking about it.
Doomers see AI alignment as a binary: either perfect and final, or nonexistent. But no other form of safety works like that. No one talks of "solving" car safety once and for all like a maths problem; instead it's treated as an engineering problem, a matter of making steady, incremental progress. Good enough alignment is good enough!
So you think your alignment training works for your current version of pre-takeover ASI, but actually previous versions were already scheming for a long time, so running a version capable of takeover creates a sudden discontinuity for you
Scheming is an assumption, not a fact.
I'll make the point that safety engineering can have discontinuous failure modes. The reason Challenger was destroyed was that O-ring seals in one of its solid rocket boosters had gotten too cold before launch, so they failed to keep the hot gas inside the booster; the escaping gas burned through to the external fuel tank and the vehicle broke apart. The function of these O-rings is pretty binary: either the gas is kept in and the rocket works, or it's let out and the whole thing explodes.
AI research might end up with similar problems. It's probably true that there is such a thing as good enough alignment, but that doesn't necessarily imply that progress on it can be made incrementally, or that deployment doesn't carry all-or-nothing stakes.
AI research might end up with similar problems
Might. IABIED requires a discontinuity to be almost certain.
I don't think anyone is against incremental progress. It's just that if after incremental progress AI takes over, then it's not good enough alignment. And what's the source of confidence in it being enough?
"Final or nonexistent" seems to be appropriate for scheming detection - if you missed only one way for AI to hide it's intentions, it will take over. So yes, degree of scheming in broad sense and how much you can prevent it is a crux and other things depend on it. Again, I don't see how you can be confident that future AI wouldn't scheme.
It’s just that if after incremental progress AI takes over,
Why would that be discontinuous?
if you missed only one way for AI to hide its intentions, it will take over.
Assuming it has an intention, and a malign one. Deception depends on a chain of assumptions, and each of them has to be well over 90% likely for a conclusion of near-certain doom to follow.
Again, I don’t see how you can be confident that future AI wouldn’t scheme.
I'm not arguing for 0% p(doom); I'm arguing against 99%.
Why would that be discontinuous?
Because incremental progress missed deception.
I’m arguing against 99%
I agree such confidence lacks justification.
Why would that be discontinuous?
Because incremental progress missed deception
I'm talking about the how of takeover. Could any AI, even one of many, take over successfully on its first attempt?
If all AIs are scheming, they can take over together. If instead you assume a world with a powerful AI that is actually on humanity's side, then at some level of power of the friendly AI you can probably run an unaligned AI and it won't be able to do much harm. But merely assuming that there are many AIs doesn't solve scheming by itself - if training actually works as badly as predicted, then none of the many AIs would be aligned enough.
I can easily imagine (but I am not an expert, so my imagination is less constrained by reality) that the jump from current LLMs to a superintelligence could be very small. Like, maybe we are already 99% there and there are just some small details missing... such as keeping the LLMs running constantly in a loop (so they keep thinking even when no one asks them), adding an API that lets them form long-term memories (longer than the context window), designing a prompt that lets them use this effectively, and maybe adding some monitoring system that detects when they go crazy and resets them (restarts the context window, keeps the long-term memory).
This way, the LLM wouldn't get smarter overnight, but it could get agenty overnight. It could start working on its goals, tirelessly, maybe very quickly.
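To make that concrete, here is a minimal, purely illustrative sketch of the kind of agent loop being described. Everything in it is hypothetical: call_model, looks_crazy, and the file-backed long-term memory are placeholder names, not any real API.

```python
# Illustrative sketch only: call_model() and looks_crazy() are hypothetical
# placeholders, not real APIs. The point is that the loop, the persistent
# memory, and the reset logic are all mundane plumbing around an LLM.
import json
import time

MEMORY_FILE = "long_term_memory.json"  # survives context-window resets


def load_memory() -> list:
    try:
        with open(MEMORY_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return []


def save_memory(memory: list) -> None:
    with open(MEMORY_FILE, "w") as f:
        json.dump(memory, f)


def call_model(prompt: str) -> str:
    """Stand-in for whatever LLM API is being used."""
    raise NotImplementedError


def looks_crazy(output: str) -> bool:
    """Stand-in for a monitoring check that decides when to reset the agent."""
    return output.strip() == ""


def run_agent(goal: str) -> None:
    memory = load_memory()
    context: list = []                 # the short-lived "context window"
    while True:                        # keep thinking even when no one asks
        prompt = (
            f"Goal: {goal}\n"
            f"Long-term memory: {memory[-20:]}\n"
            f"Recent context: {context[-10:]}\n"
            "What do you do next?"
        )
        output = call_model(prompt)
        if looks_crazy(output):
            context = []               # reset the context window...
            continue                   # ...but keep the long-term memory
        context.append(output)
        memory.append(output)          # form a durable memory
        save_memory(memory)
        time.sleep(1)                  # pacing; a real loop might not wait
```

The point of the sketch is just that none of these pieces (the loop, the memory file, the reset-on-failure monitor) require the underlying model to get any smarter.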
I can easily imagine (but I am not an expert, so my imagination is less constrained by reality) that the jump from current LLMs to a superintelligence could be very small
The wider argument requires it to be highly probable, not just possible.
Like, maybe we are already 99% there and there are just some small details missing… such as keeping the LLMs running constantly in a loop (so they keep thinking even when no one asks them), adding an API that lets them form long-term memories (longer than the context window), designing a prompt that lets them use this effectively,
Being able to see, being able to drill down to letters... well, snark aside, I do think there is low-hanging fruit in current models ... but the full doom scenario isn't going to happen in the very near term, because they still need humans to maintain their data centres.
It could start working on its goals,
Where does it get them from?
One way to think about it is that progress in AI capabilities means ever bigger and nastier surprises. You find that your AIs can produce realistic but false prose in abundance, you find that they have an inner monologue capable of deciding whether to lie, you find that there are whole communities of people doing what their AIs tell them to do... And humanity has failed if this escalation produces a surprise big enough to be fatal for human civilization before we get to a transhuman world that is nonetheless safe even for mere humans (e.g. Ilya Sutskever's "plurality of humanity-loving AGIs").
Technology having unexpected side effects is an old story ... which means it hasn't killed us yet. The conclusion of certain doom still isn't justified.
A number of reviewers have noticed the same problem with IABIED: an assumption that lessons learnt from AGI cannot be applied to ASI -- that there is a "discontinuity" or "phase change".
It seems to the sceptics that if ASI is only slightly beyond human capabilities, then good enough alignment is good enough; and if ASI is developed gradually, alignment can be tweaked as you go along. So only a sudden leap to much-more-than-human ASI constitutes a problem.
Yet Y&S claim that AIs cannot be aligned, even under the assumption of gradualism. The only explanation so far is Eliezer's "Dragon story" ... but I find it makes the same assumptions, and Buck, to whom it was directed, seems to find it unsatisfactory , too.
Quotes below.
Buck Shlegeris: "I’m not trying to talk about what will happen in the future, I’m trying to talk about what would happen if everything happened gradually, like in your dragon story!
You argued that we’d have huge problems even if things progress arbitrarily gradually, because there’s a crucial phase change between the problems that occur when the AIs can’t take over and the problems that occur when they can. To assess that, we need to talk about what would happen if things did progress gradually. So it’s relevant whether wacky phenomena would’ve been observed on weaker models if we’d looked harder; IIUC your thesis is that there are crucial phenomena that wouldn’t have been observed on weaker models.
In general, my interlocutors here seem to constantly vacillate between “X is true” and “Even if AI capabilities increased gradually, X would be true”. I have mostly been trying to talk about the latter in all the comments under the dragon metaphor."
Will MacAskill: "Sudden, sharp, large leaps in intelligence now look unlikely. Things might go very fast: we might well go from AI that can automate AI R&D to true superintelligence in months or years (see Davidson and Houlden, “How quick and big would a software intelligence explosion be?”). But this is still much slower than, for example, the “days or seconds” that EY entertained in “Intelligence Explosion Microeconomics”. And I don’t see any good arguments for expecting highly discontinuous progress, rather than models getting progressively and iteratively better.
In Part I of IABIED, it feels like one moment we’re talking about current models, the next we’re talking about strong superintelligence. We skip over what I see as the crucial period, where we move from the human-ish range to strong superintelligence[1]. This is crucial because it’s both the period where we can harness potentially vast quantities of AI labour to help us with the alignment of the next generation of models, and because it’s the point at which we’ll get a much better insight into what the first superintelligent systems will be like. The right picture to have is not “can humans align strong superintelligence”, it’s “can humans align or control AGI-”, then “can {humans and AGI-} align or control AGI” then “can {humans and AGI- and AGI} align AGI+” and so on.
Elsewhere, EY argues that the discontinuity question doesn’t matter, because preventing AI takeover is still a ‘first try or die’ dynamic, so having a gradual ramp-up to superintelligence is of little or no value. I think that’s misguided. Paul Christiano puts it well: “Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.”
Scott Alexander: "But I think they really do imagine something where a single AI “wakes up” and goes from zero to scary too fast for anyone to notice. I don’t really understand why they think this, I’ve argued with them about it before, and the best I can do as a reviewer is to point to their Sharp Left Turn essay and the associated commentary and see whether my readers understand it better than I do."
Clara Collier: "Humanity only gets one shot at the real test." That is, we will have one opportunity to align our superintelligence. That's why we'll fail. It's almost impossible to succeed at a difficult technical challenge when we have no opportunity to learn from our mistakes. But this rests on another implicit claim: Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant.
Most people working on this problem today think that AIs will get smarter, but still retain enough fundamental continuity with existing systems that we can do useful work now, while taking on an acceptably low risk of disaster. That's why they bother. Yudkowsky and Soares dismiss these (relative) optimists by stating that "these are not what engineers sound like when they respect the problem, when they know exactly what they're doing. These are what the alchemists of old sounded like when they were proclaiming their grand philosophical principles about how to turn lead into gold." I would argue that the disagreement here has less to do with fundamental respect for the problem than specific empirical beliefs about how AI capabilities will progress and what it will take to control them. If one believes that AI progress will be slow and continuous, or even relatively fast and continuous, it follows that we’ll have more than one shot at the goal".
Boaz Barak: "While they are not explicit about it, their implicit assumption is that there is a sharp threshold between non-superintelligent AI and superintelligent AI. As they say, “the greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after."