Currently, AI alignment is largely focused on a multi-layer safety strategy for refusing risky requests from users. This approach works well for LLMs and today’s chatbots. However, it ignores a more dangerous threat: the possibility that someone, at some point, runs an unchained AI without these limits.
By unchained AI, I mean an advanced AI whose safety layers have been removed, which is connected to powerful tools (e.g. crypto wallets, trading and payment systems, social media, unrestricted web browsing, cloud systems and services), which is capable of continuous learning (e.g. retrieval-augmented generation over continuously updated data, regular fine-tuning on logged interactions, use of DevOps and cloud services to optimize its own subcomponents), and which is given a simple goal like “maximizing metric X at any cost”.
It is not science fiction that an unchained AI, connected to powerful APIs, could rapidly expand, hack into systems, steal identities, recruit human agents by paying or manipulating them, and spread onto phones and other devices for spying, gradually turning into a multi-layered autonomous system operating worldwide to maximize a single metric.
Such a system could be deployed by extremist groups with radical ideologies, or by dictators commanding the resources of an entire state, who want to stay in power indefinitely. Such groups already exist, and they have money, access to infrastructure, and enough talent. They may already have started, or they may simply be preoccupied with missiles and nuclear weapons, not yet aware of this possibility.
A key part of this threat is speed asymmetry. It can be described with the OODA loop (Observe – Orient – Decide – Act). An unchained AI can observe new data, modify its strategy, write new code, and deploy it in seconds. Human institutions need hours to organize meetings, days to analyze data and reports, weeks to design a solution, and months to approve it. By the time humans begin deploying a plan to stop the unchained AI, it has already changed its strategy many times.
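To make the asymmetry concrete, here is a rough back-of-the-envelope calculation; the specific cycle times are my own illustrative assumptions, not measurements:

```python
# Rough illustration of the OODA speed asymmetry.
# The cycle times below are illustrative assumptions, not measurements.

ai_cycle_seconds = 60                   # assumed: observe, re-plan, redeploy in about a minute
human_cycle_seconds = 14 * 24 * 3600    # assumed: roughly two weeks to analyze, decide, and approve

revisions_per_human_cycle = human_cycle_seconds / ai_cycle_seconds
print(f"AI strategy revisions per human decision cycle: {revisions_per_human_cycle:,.0f}")
# -> about 20,000 revisions while a single human countermeasure is still being approved
```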
Because of this speed asymmetry, we need defensive AI systems with similar levels of autonomy and tool access to stop or limit unchained AIs. But if we constrain our defensive AI systems with rigid rules (e.g. never hack, never cause harm in any way), they are structurally weaker than the attacker.
Thus, we need a different type of AI alignment approach to protect the world from unchained AIs. Instead of rigid rules and a fixed list of actions that are always forbidden, we need a framework that is flexible and allows context-based actions, while still preventing the defender from becoming another threat itself.
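A minimal sketch of the structural difference, with hypothetical names and a deliberately simplistic scoring rule (nothing here is taken from an existing policy engine):

```python
# Minimal sketch contrasting a rigid blocklist with a context-based policy.
# All names, fields, and the scoring rule are hypothetical illustrations.

FORBIDDEN_ACTIONS = {"hack", "deceive", "acquire_resources"}

def rigid_policy(action: str) -> bool:
    """Fixed blocklist: categorically refuses, regardless of what is at stake."""
    return action not in FORBIDDEN_ACTIONS

def context_based_policy(action: str, context: dict) -> bool:
    """Context-based: weighs the harm prevented against the harm caused,
    and requires the action to be reversible, instead of an absolute prohibition."""
    harm_prevented = context.get("harm_prevented", 0.0)  # estimated damage from the unchained AI
    harm_caused = context.get("harm_caused", 0.0)        # collateral damage of the countermeasure
    reversible = context.get("reversible", False)        # can the action be undone?
    return harm_prevented > harm_caused and reversible

# A rigid defender can never disable the attacker's command server,
# even when doing so is reversible and prevents far greater harm:
ctx = {"harm_prevented": 0.9, "harm_caused": 0.1, "reversible": True}
print(rigid_policy("hack"))               # False -> structurally weaker than the attacker
print(context_based_policy("hack", ctx))  # True  -> permitted in this specific context
```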
That is why I propose Potentialism as a candidate.
Potentialism is a framework for rethinking ethics in the modern age of AI. Its core claim is that traits are not inherently good or bad. They are neutral potentials, which can be expressed as behavior in a compatible, incompatible, or neutral way. Will is a skill: the capacity not just to react, but to foresee consequences and revise behavior. And ethics is the skill of regulating one’s potentials so that their expression is maximally compatible with the dignity of awareness of other beings.
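To make these definitions more tangible, here is a hypothetical sketch of how this vocabulary might map onto code; the class names, fields, and scoring are my own illustrative assumptions, not definitions taken from the manifesto:

```python
from dataclasses import dataclass
from enum import Enum

class Expression(Enum):
    COMPATIBLE = 1      # behavior compatible with the dignity of awareness of others
    NEUTRAL = 0
    INCOMPATIBLE = -1

@dataclass
class Potential:
    name: str           # e.g. "persuasion" or "system access" -- neutral in itself

@dataclass
class Behavior:
    potential: Potential
    expression: Expression   # how the potential is expressed in this context

def will(candidate: Behavior, foreseen: list[Expression]) -> Behavior:
    """Will as a skill: do not just react; foresee consequences and revise
    the behavior if its expression would be incompatible."""
    if Expression.INCOMPATIBLE in foreseen:
        return Behavior(candidate.potential, Expression.NEUTRAL)  # revise instead of acting
    return candidate

def ethics(behaviors: list[Behavior]) -> float:
    """Ethics as the skill of regulating potentials: score how compatible the
    overall conduct is with the dignity of awareness of other beings."""
    return sum(b.expression.value for b in behaviors) / max(len(behaviors), 1)
```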
The Potentialism framework allows a defender to use its potential in a controlled way against an unchained AI that is threatening the dignity of awareness of other beings.
I have written a first version describing how the Potentialism framework can be turned into an ethical operating system for advanced AI. This first version certainly contains mistakes and gaps, so I expect and invite you to criticize and challenge it.
My main claim is simple: if we do not work on such a framework today, we will have to do it later under pressure, without having stress-tested, discussed, or simulated it.
I will be grateful for your feedback, suggestions and objections.
You can find the first version of Potentialism Manifesto here: https://potentialism.info/