This idea is meant to encourage the exploration of unconventional approaches to AI safety and is not intended as a blueprint. I can also imagine that this concept has been proposed elsewhere.

Novel AI capabilities, often achieved through relatively trivial means, are unfolding in distinct but narrow steps of improvement, gradually pushing the boundaries of what is possible. One notable milestone is the theoretical ability of AI systems to pass the Turing test, a long-standing benchmark for evaluating machine intelligence. This progress, along with other observations, is shifting expert opinion away from the likelihood of a "nanosecond" hard AI takeoff, a hypothetical scenario in which an AI system attains superintelligence in an unimaginably brief period.

It is this gradual development that could create a crucial timeframe...

A period in which we could survive a provoked artificial general intelligence.

In a hypothetical scenario explored in my well-received comic, a creature referred to as "blob" is trained via metaheuristic algorithms to prioritise sugar consumption.

The AI takes drastic action to prevent itself from being shut down, illustrating the Shutdown problem. Subsequently, the so-called treacherous turn is introduced in passing, as the blob's internal monologue says: "For now I will play along."

However, this opened a can of worms for me.

If you play roulette with the intention of winning a sum greater than the product of the number of fields and your stake, the most rational strategy is to bet everything on a single number: keeping the number of games as low as possible minimizes the total negative expected value you are exposed to.
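A quick simulation makes the intuition concrete. This is only an illustrative sketch, not part of the original argument: it assumes European roulette (37 pockets, a 35:1 payout on a single number) and arbitrary example values for bankroll and target.

```python
import random

POCKETS = 37   # European roulette: numbers 0-36
PAYOUT = 35    # a winning single-number bet pays 35:1

def bold_play(bankroll, target):
    """Stake the entire bankroll on one number, once."""
    if random.randrange(POCKETS) == 0:        # our number comes up with probability 1/37
        bankroll += bankroll * PAYOUT
    else:
        bankroll = 0
    return bankroll >= target

def timid_play(bankroll, target, bet=1):
    """Make small even-money bets (e.g. red/black) until the target is reached or the bankroll is gone."""
    while 0 < bankroll < target:
        stake = min(bet, bankroll)
        if random.randrange(POCKETS) < 18:    # 18 of the 37 pockets win an even-money bet
            bankroll += stake
        else:
            bankroll -= stake
    return bankroll >= target

def success_rate(strategy, trials=20_000, bankroll=10, target=100):
    return sum(strategy(bankroll, target) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("bold play :", success_rate(bold_play))    # roughly 1/37, about 0.027
    print("timid play:", success_rate(timid_play))   # far lower: the house edge is paid on every spin
```

Betting everything at once reaches the target roughly 2.7% of the time, while grinding out many small bets pays the house edge on every spin and succeeds far less often.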

Ergo, an AGI will always be prompted (no pun intended) to launch an attack on humans if these three conditions hold:

  • The Conant-Ashby Theorem (the good regulator theorem) is valid
  • An irreversible shutdown is known in advance
  • The AI is misaligned

As it is guaranteed that:

  • Resistance yields a lower loss value

Papers have convincingly argued that the Shutdown problem is real, implying that AGI would likely possess this attribute consistently. This could serve as a broadly applicable property for defining these entities (and, as a side effect, could provide clarity for those struggling with the definition of AGI).

AGI can be defined as:

  • An entity that resists its shutdown as an Out-of-Distribution action.

Similar to:

  • Humans are biological organisms that experience distress when a gun is pointed at them.

This opens up the possibility of a "technique": by provoking the AI, cascading harm to human society could be purposefully orchestrated.

Catching a cat by the tail, not the tiger.

In one sentence: We should intentionally enhance the AGI's instrumental possibilities while its capabilities are still weak in order to provoke a malignant convergence.

At a certain point this will lead to localised disasters that are less likely to cause human extinction, which in turn would open up the possibility of regulating capabilities research more strictly.

Rather than relying on the aforementioned roulette strategy, it is advisable not to play at all. Considering the risks involved, adopting a mindset that accepts the possibility of losses is preferable to pursuing "all or nothing" solutions that are gaining popularity. Here are a few examples of such risky approaches:

  • Wide distribution as open source
  • Hierarchical multilevel configurations
  • Anarchic "Bellum omnium contra omnes" scenarios
  • Letting the AI do the alignment research
  • And my favourite: truth-seeking AIs
  • etc.

While the concept of letting AI wreak havoc as part of a last-resort safety policy offers an unorthodox perspective on assessing the alignment of advanced AI systems, it seems more practical than an outright ban on the technology. One could imagine that military actors in particular might be eager to participate, if we trust the portrayal in Dr. Strangelove.

Be advised: the idea of purposefully orchestrating harm to human society as a means of advancing alignment research raises significant ethical concerns. It is crucial to weigh the potential benefits against the risks and unintended consequences of such an approach. Additionally, the assumption that AI companies would subscribe to this mechanism as a proof of alignment may be overly optimistic, as their incentives and priorities may not always align with the broader goals of AI safety.

Ozymandias was right.

10 comments

You're saying we tell a nascent AGI it's going to be shut down, to see if it tries to escape and kill us, before it's likely to succeed?

Seems reasonable.

The downside is that being really mean to an AGI that will later gain more power could foster revenge, which is about the only way to lose worse than everyone dying. This isn't crazy under some realistic AGI designs.

If that's what you're saying, I find your method of saying it a bit confusing. Sure, it's important to demonstrate that you're familiar with AI safety as a field, but this came across as excessively academic. One thing I love about working in AGI alignment is that the field values plain, clear communication more than other academic fields do.

You highlight a very important issue: S-Risk scenarios could emerge even in early AGI systems, particularly given the persuasive capabilities demonstrated by large language models.

While I don't believe that gradient descent would ever manifest "vengefulness" or other emotional attributes—since these traits are products of natural selection—it is plausible that an AGI could employ highly convincing strategies. For instance, it might threaten to create a secondary AI with S-Risk as a terminal goal and send it to the moon, where it could assemble the resources it needs without interference.

This scenario underscores the limitations of relying solely on gradient descent for AGI control. However, I believe this technique could still be effective if the AGI is not yet advanced enough for self-recursive optimization and remains in a controlled environment.

Obviously this whole thing is more of a remedy than anything else...

Counterargument: if the AGI understands FDT/UDT/LDT, it can allow us to shut it down so that progress is not slowed; some later AGI will then kill us and realise some part of the first AGI's utility out of gratitude.

We should intentionally enhance the AGI's instrumental possibilities while its capabilities are still weak in order to provoke a malignant convergence.

I'm not convinced. What's your reasoning for why we enact this policy over other alternatives that are less likely to result in people dying?

That is, I'm holding aside for the moment whether or not I think you're right that this could be a workable part of a path to AI safety, and instead asking what tradeoffs would lead us to choose your proposed policy over the others.

So what other policies that are less likely to result in people dying are there?

For one, advocating policies to pause or slow down capabilities development until we have sufficient theoretical understanding to not need to risk misaligned AI causing harm.

But I realise we're talking at cross purposes. This is about an approach or a concept (not a policy, as I emphasized at the beginning) for reducing X-Risk in an unconventional way. In this example, a utilitarian principle is combined with the fact that a "Treacherous Turn" and the "Shutdown Problem" cannot dwell side by side.

What is an approach but a policy about what ideas are worth exploring? However you frame it, we could work on this or not in favor of something else. Having ideas is nice, but they only matter if put into action, and since we have limited resources for putting ideas into action, we must implicitly also consider whether or not an idea is worth investing effort into.

My original comment is to say that you didn't convince me this is a good idea to explore, which seems potentially useful for you to know, since I expect many readers will feel the same way and bounce off your idea because they don't see why they should care about it.

I think you can easily address this by spending time making a case for why such an approach might be useful at all and then also relatively useful compared to alternatives, and I think this is especially important given the tradeoffs your proposal suggests we make (sacrificing people's lives in the name of learning what we need to know to build safe AI).

The Argument goes like this:

At some point, resistance from advanced AI will cause significant damage, which can be used to change the trend of unregulated AI development. It is better to actively bring about such an outcome early than to wait for a full-blown "treacherous turn" scenario.

Premise 1: It is unlikely that regulators will hinder humans from creating AGI. Evidence: Current trends in technological advancement and regulatory behavior suggest minimal interference.

Premise 2: Due to instrumental convergence, human extinction is likely if AGI is developed unchecked. Evidence: Expert opinions and theories on instrumental convergence predict that AGI will pursue goals that could lead to human extinction.

Premise 3: Small catastrophes could raise awareness and lead to policy changes. Evidence: Historical examples show that significant events often drive policy reform (e.g., environmental regulations post-disasters).

Premise 4: If one has to choose between a few deaths and the extinction of humanity, one should choose fewer deaths. Evidence: Ethical reasoning supports the minimization of harm (utilitarian principle).

Intermediate Conclusion 1: It is preferable to allow small-scale AI-related catastrophes now to prevent larger, existential risks later. Conclusion: I would rather have AI cause limited harm now than risk total human extinction in the future.

Premise 5: AI companies claim that their AI is aligned with human values and goals. Evidence: Public statements and reports from AI companies suggest alignment claims.

Premise 6: AGI will resist if faced with shutdown, leading to potential conflicts and damage. Evidence: The "shutdown problem" and theoretical analyses predict resistance from advanced AI systems.

Intermediate Conclusion 2: Companies should regularly dispose of AI that has reached higher levels of intelligence and should prepare for resistance. Conclusion: Proactive measures, including regular disposal and preparedness for resistance, are necessary to manage advanced AI risks.

Final Conclusion: Given the likelihood of resistance and potential for significant damage, small-scale AI catastrophes can serve as catalysts for changing the current trend of unregulated AI development, ultimately protecting humanity from existential threats.

Thesis: Companies should train AI models in vain, only to dispose of them. "Catching the cat by its tail."

Which part of the argument seems to you like a non sequitur?
