if ASI is developed gradually, alignment can be tweaked as you go along.
The whole problem is that alignment, as in "AI doesn't want to take over in a bad way", is not assumed to be solved. So you think your alignment training works for your current version of pre-takeover ASI, but actually previous versions were already scheming for a long time, so running a version capable of takeover creates a sudden discontinuity for you: the ASI takes over because now it can. It means all your previous alignment work and scheming detection is only really tested the moment you run a version capable of takeover, and you can only fail once on that test. And training against scheming is predicted not to work, just to create stealthier schemers. And "this AI can take over" is predicted to be hard to fake convincingly for the AI, so you can't confidently check for scheming just by observing what it would do in a fake environment.
The whole problem is that alignment, as in “AI doesn’t want to take over in a bad way”, is not assumed to be solved
That's a broken way of thinking about it.
Doomers see AI alignment as a binary: either perfect and final, or nonexistent. But no other form of safety works like that. No one talks of "solving" car safety once and for all like a maths problem; instead it's treated as an engineering problem, a matter of making steady, incremental progress. Good enough alignment is good enough!
So you think your alignment training works for your current version of pre-takeover ASI, but actually previous versions were already scheming for a long time, so running a version capable of takeover creates a sudden discontinuity for you
Scheming is an assumption, not a fact.
I'll make the point that safety engineering can have discontinuous failure modes. The reason Challenger was destroyed was that O-ring seals in one of its solid rocket boosters had gotten too cold before launch, so they failed to keep the hot gas inside the booster; the escaping gas burned through to the external fuel tank and the vehicle broke apart. The function of these O-rings is pretty binary: either the gas is kept in and the rocket works, or it's let out and the whole thing explodes.
AI research might end up with similar problems. It's probably true that there is such a thing as good enough alignment, but that doesn't necessarily imply that progress on it can be made incrementally, or that deployment doesn't carry all-or-nothing stakes.
AI research might end up with similar problems
Might. IABIED requires a discontinuity to be almost certain.
I don't think anyone is against incremental progress. It's just that if after incremental progress AI takes over, then it's not good enough alignment. And what's the source of confidence in it being enough?
"Final or nonexistent" seems to be appropriate for scheming detection - if you missed only one way for AI to hide it's intentions, it will take over. So yes, degree of scheming in broad sense and how much you can prevent it is a crux and other things depend on it. Again, I don't see how you can be confident that future AI wouldn't scheme.
It’s just that if after incremental progress AI takes over,
Why would that be discontinuous?
if you missed only one way for AI to hide its intentions, it will take over.
Assuming it has an intention, and a malign one. Deception depends on a chain of assumptions, and each of them has to be well over 90% likely for a conclusion of near-certain doom to follow.
Again, I don’t see how you can be confident that future AI wouldn’t scheme.
I'm not arguing for 0% p(doom); I'm arguing against 99%.
Why would that be discontinuous?
Because incremental progress missed deception.
I’m arguing against 99%
I agree such confidence lacks justification.
Why would that be discontinuous?
Because incremental progress missed deception
I'm talking about the how of takeover. Could any AI, even one of many, take over successfully on its first attempt?
If all AIs are scheming, they can take over together. If instead you assume a world with a powerful AI that is actually on humanity's side, then at some level of power of the friendly AI you can probably run an unaligned AI and it won't be able to do much harm. But merely assuming that there are many AIs doesn't solve scheming by itself - if training actually works as badly as predicted, then none of the many AIs would be aligned enough.
I can easily imagine (but I am not an expert, so my imagination is less constrained by reality) that the jump from current LLMs to a superintelligence could be very small. Like, maybe we are already 99% there and there are just some small details missing... such as keeping the LLMs running constantly in a loop (so they keep thinking even when no one asks them), adding an API that lets them form long-term memories (longer than the context window), designing a prompt that lets them use this effectively, and maybe adding some monitoring system that detects when they go crazy and resets them (restarts the context window, keeps the long-term memory).
This way, the LLM wouldn't get smarter overnight, but it could get agenty overnight. It could start working on its goals, tirelessly, maybe very quickly.
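To make that concrete, here is a minimal, purely illustrative sketch of the kind of agent loop being described. Everything in it is hypothetical: call_model, looks_crazy, and the file-backed long-term memory are placeholder names, not any real API.

```python
# Illustrative sketch only: call_model() and looks_crazy() are hypothetical
# placeholders, not real APIs. The point is that the loop, the persistent
# memory, and the reset logic are all mundane plumbing around an LLM.
import json
import time

MEMORY_FILE = "long_term_memory.json"  # survives context-window resets


def load_memory() -> list:
    try:
        with open(MEMORY_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return []


def save_memory(memory: list) -> None:
    with open(MEMORY_FILE, "w") as f:
        json.dump(memory, f)


def call_model(prompt: str) -> str:
    """Stand-in for whatever LLM API is being used."""
    raise NotImplementedError


def looks_crazy(output: str) -> bool:
    """Stand-in for a monitoring check that decides when to reset the agent."""
    return output.strip() == ""


def run_agent(goal: str) -> None:
    memory = load_memory()
    context: list = []                 # the short-lived "context window"
    while True:                        # keep thinking even when no one asks
        prompt = (
            f"Goal: {goal}\n"
            f"Long-term memory: {memory[-20:]}\n"
            f"Recent context: {context[-10:]}\n"
            "What do you do next?"
        )
        output = call_model(prompt)
        if looks_crazy(output):
            context = []               # reset the context window...
            continue                   # ...but keep the long-term memory
        context.append(output)
        memory.append(output)          # form a durable memory
        save_memory(memory)
        time.sleep(1)                  # pacing; a real loop might not wait
```

The point of the sketch is just that none of these pieces (the loop, the memory file, the reset-on-failure monitor) require the underlying model to get any smarter.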
I can easily imagine (but I am not an expert, so my imagination is less constrained by reality) that the jump from current LLMs to a superintelligence could be very small
The wider argument requires it to be highly probable, not just possible.
Like, maybe we are already 99% there and there are just some small details missing… such as keeping the LLMs running constantly in a loop (so they keep thinking even when no one asks them), adding an API that lets them form long-term memories (longer than the context window), designing a prompt that lets them use this effectively,
Being able to see, being able to drill down to letters... well, snark aside, I do think there is low-hanging fruit in current models ... but the full doom scenario isn't going to happen in the very near term, because they still need humans to maintain their data centres.
It could start working on its goals,
Where does it get them from?
One way to think about it is that progress in AI capabilities means ever bigger and nastier surprises. You find that your AIs can produce realistic but false prose in abundance, you find that they have an inner monologue capable of deciding whether to lie, you find that there are whole communities of people doing what their AIs tell them to do... And humanity has failed if this escalation produces a surprise big enough to be fatal for human civilization before we get to a transhuman world that is nonetheless safe even for mere humans (e.g. Ilya Sutskever's "plurality of humanity-loving AGIs").
Technology having unexpected side effects is an old story ... which means it hasn't killed us yet. The conclusion of certain doom still isn't justified.
A number of reviewers have noticed the same problem with IABIED: an assumption that lessons learnt from AGI cannot be applied to ASI -- that there is a "discontinuity" or "phase change".
It seems to the sceptics that if ASI is only slightly beyond human capabilities, then good enough alignment is good enough; and if ASI is developed gradually, alignment can be tweaked as you go along. So only a sudden leap to much-more-than-human ASI constitutes a problem.
Yet Y&S claim that AIs cannot be aligned, even under the assumption of gradualism. The only explanation so far is Eliezer's "Dragon story" ... but I find it makes the same assumptions, and Buck, to whom it was directed, seems to find it unsatisfactory , too.
Quotes below.
Buck Shlegeris: "I’m not trying to talk about what will happen in the future, I’m trying to talk about what would happen if everything happened gradually, like in your dragon story!
You argued that we’d have huge problems even if things progress arbitrarily gradually, because there’s a crucial phase change between the problems that occur when the AIs can’t take over and the problems that occur when they can. To assess that, we need to talk about what would happen if things did progress gradually. So it’s relevant whether wacky phenomena would’ve been observed on weaker models if we’d looked harder; IIUC your thesis is that there are crucial phenomena that wouldn’t have been observed on weaker models.
In general, my interlocutors here seem to constantly vacillate between “X is true” and “Even if AI capabilities increased gradually, X would be true”. I have mostly been trying to talk about the latter in all the comments under the dragon metaphor."
Will MacAskill: "Sudden, sharp, large leaps in intelligence now look unlikely. Things might go very fast: we might well go from AI that can automate AI R&D to true superintelligence in months or years (see Davidson and Houlden, “How quick and big would a software intelligence explosion be?”). But this is still much slower than, for example, the “days or seconds” that EY entertained in “Intelligence Explosion Microeconomics”. And I don’t see any good arguments for expecting highly discontinuous progress, rather than models getting progressively and iteratively better.
In Part I of IABIED, it feels like one moment we’re talking about current models, the next we’re talking about strong superintelligence. We skip over what I see as the crucial period, where we move from the human-ish range to strong superintelligence[1]. This is crucial because it’s both the period where we can harness potentially vast quantities of AI labour to help us with the alignment of the next generation of models, and because it’s the point at which we’ll get a much better insight into what the first superintelligent systems will be like. The right picture to have is not “can humans align strong superintelligence”, it’s “can humans align or control AGI-”, then “can {humans and AGI-} align or control AGI” then “can {humans and AGI- and AGI} align AGI+” and so on.
Elsewhere, EY argues that the discontinuity question doesn’t matter, because preventing AI takeover is still a ‘first try or die’ dynamic, so having a gradual ramp-up to superintelligence is of little or no value. I think that’s misguided. Paul Christiano puts it well: “Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.”
Scott Alexander: "But I think they really do imagine something where a single AI “wakes up” and goes from zero to scary too fast for anyone to notice. I don’t really understand why they think this, I’ve argued with them about it before, and the best I can do as a reviewer is to point to their Sharp Left Turn essay and the associated commentary and see whether my readers understand it better than I do."
Clara Collier: "Humanity only gets one shot at the real test." That is, we will have one opportunity to align our superintelligence. That's why we'll fail. It's almost impossible to succeed at a difficult technical challenge when we have no opportunity to learn from our mistakes. But this rests on another implicit claim: Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant.
Most people working on this problem today think that AIs will get smarter, but still retain enough fundamental continuity with existing systems that we can do useful work now, while taking on an acceptably low risk of disaster. That's why they bother. Yudkowsky and Soares dismiss these (relative) optimists by stating that "these are not what engineers sound like when they respect the problem, when they know exactly what they're doing. These are what the alchemists of old sounded like when they were proclaiming their grand philosophical principles about how to turn lead into gold." I would argue that the disagreement here has less to do with fundamental respect for the problem than specific empirical beliefs about how AI capabilities will progress and what it will take to control them. If one believes that AI progress will be slow and continuous, or even relatively fast and continuous, it follows that we’ll have more than one shot at the goal".
Boaz Barak: "While they are not explicit about it, their implicit assumption is that there is a sharp threshold between non-superintelligent AI and superintelligent AI. As they say, “the greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after."