How do we align humans, and what does it mean for Conjecture's new strategy

Igor Ivanov

Divide and conquer

Roman maxim (maybe^[1])

Democracy is the worst form of Government except for all those other forms that have been tried from time to time.

Winston Churchill

Introduction

Recently, Conjecture proposed the idea of a modular AGI, with each module being interpretable, having limited intellect, and functioning similarly to the human mind. They called this concept Cognitive Emulation.

In their proposal, the authors explicitly say:

We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime.

The most upvoted comment on the post, at the time of writing, is critical of this statement:

What? A major reason we're in the current mess is that we don't know how to do this. For example we don't seem to know how to build a corporation (or more broadly an economy) such that its most powerful leaders don't act like Hollywood villains (race for AI to make a competitor 'dance')? Even our "AGI safety" organizations don't behave safely (e.g., racing for capabilities, handing them over to others, e.g. Microsoft, with little or no controls on how they're used).

Given this disagreement, I decided to contemplate humanity's ability to align humans and organizations and draw parallels between this field of knowledge and AI safety.

Problems with power

Many doom scenarios related to AGI involve 2 stages:
1. AGI gains too much power
2. AGI does something very bad due to its Shoggothness^[2]

If we are talking about aligning people and organizations, the "Too much power" part is more relevant, because even the worst psychopaths and dictators are still people, not Shoggoths, and their ugliness is often a result of their unlimited power. I will touch on this topic later in this post.

Throughout history, many people have held an enormous amount of power, and usually it didn't go well, so humanity developed a number of instruments for restricting the power of a single person or a group of people. Perhaps we can extract some value for AI alignment from this experience.

I also want to mention that this post will be focused on power in political systems. There are some differences between these systems and other types of social systems, like corporations or the military, but without a single focus, this post will be less coherent.

The distinction between the nature of human power and AGI power

It's important to point out that human power and AGI power are somewhat different.

People at the top of social hierarchies are usually smarter than most other people, but their intellect is still within the human range, and they also have all the other human vulnerabilities and limitations.

A leader only has power if he or she has influence over other people, so that they do what the leader wants. Instruments for that might be laws, personal loyalty, or money. In other words, no general can win a battle without an army, and no billionaire is a billionaire if no one offers anything to sell.

The source of an AGI's power will be different. An AGI might not need people to project its power and might do everything by itself. It might be able to outsmart everyone, come up with a superhumanly elaborate plan, and execute it with unimaginable coordination. Like human leaders, it might also use humans to reach its goals, but this might or might not happen. So the main source of an AGI's power will probably be its cognitive abilities.

This means that we can restrict human leaders' power by restricting their ability to influence other people, but for restricting the power of AGI it makes sense to restrict its intellectual abilities.

Why concentration of power is harmful

It's common knowledge in political science that excessive concentration of power in a single pair of hands is dangerous, and that it tends to arise naturally in political systems that lack special instruments to restrict it.

There are two main reasons for this:

1. Powerful people live in a bubble of lies
2. There might be no one to say "No"

Bubble of lies

If a leader has too much power and can singlehandedly promote or fire people, then people around the ruler have the incentive to scheme and lie because the opinion of the leader is too important for their career. This makes that leader live in a bubble of good news and sycophancy which leads them to perceive reality in a more and more distorted way. That's why dictators tend to do ridiculous things. For example, when Ugandan dictator Idi Amin proclaimed himself the Conqueror of the British Empire, there was no one around to say "Dude, it's ridiculous. You didn't conquer the British Empire."

No one to say "No"

If one person has too much power, they can do whatever they want disregarding the interests of others.

An autocrat can make dangerous decisions without risk to their political career. They might start a war and punish everyone who dares to say "No". They also control the media, so the population mostly hears and sees what the autocrat wants them to hear and see.

For a democratic leader, starting a war is much more dangerous. They have to convince their citizens that they should go to this war and die there, while an independent press and a strong opposition are happy to use this against that leader.

It's hard to predict whether an AGI with too much power will have issues related to living in a bubble of lies, but it definitely might do things harmful to humans if humans have no vote.

How do we restrict the power of rulers

Time constraint

The longer a leader holds power, the more time they have to grow that power. That's why presidents are usually not allowed to rule for more than two terms.

If you are a president, it is logical to give important positions to people who share your values and goals: loyal people you can trust. It is also convenient to surround yourself with people who owe you something, or whom you can blackmail if they become disloyal. It is also natural to get rid of people who are hostile or uncooperative.

The longer a president rules, the more loyal people they put in important positions, and the more power that president has. But it all takes time, so if a president can only rule for 8 years, these processes usually don't become too damaging.

Time constraints also remind a president that after the end of the term they might be held accountable for their bad actions, and this by itself is a deterrent.

Applicability of time constraints for AI safety

Is it possible to create an AGI that will only exist for a limited time, so it will not be able to gain too much power and do too much harm?

I couldn't find any mention of such an idea on LessWrong, but I think it might be hard for the same reason that it's hard to make a "stop button" for AGI. The longer an AGI can operate, the better it can satisfy its utility function, so it might have an instrumental goal to remove the time restriction and to neutralize the people who might resist. For that reason, I don't see a way to use a time limit as the main power-restricting mechanism.

Transparency laws

One of the problems with politicians is that their intentions sometimes don't match their words. For example, a minister of energy might promote a green energy reform and justify it by saying that pollution from currently operating coal plants is dangerous for people and for the environment. But he doesn't mention that his son is an investor in the biggest importer and distributor of solar panels in the country and will earn a lot of money because of this reform. Of course, the son will thank his dad.

Basically, it means that this minister deceives the general public to achieve his own goals, and his real goal is to earn money.

People found a way of observing this hidden motivation. In many countries, people in power, as well as their relatives, must disclose all their property and streams of income. This helps everyone to better answer the question "Will this person or those close to them earn something because of that action?" So these laws are a way to mechanistically interpret a person's motivations by elucidating their motivational structure.

These laws are not perfect at elucidating politicians' potentially harmful motivations. Politicians might be driven by pride, desire for power, hateful ideology, or a million other reasons, but at least the laws allow people to see some of these motivations.

Application of transparency for AI safety

There are many attempts to make AI more transparent. The whole mechanistic interpretability field, and ELK in particular, are the first that come to my mind. Conjecture's Cognitive Emulation also seems to address this problem by trying to emulate human cognition.

Separation of powers

Typically, in functional democracies, parliament makes laws but doesn't enforce them. Judges of the highest court interpret laws and might veto the ones that contradict the constitution or are unacceptable in some other way, but they can't enforce them either. The president, ministries, and numerous other actors enforce these laws in practice but can't create new ones. This system is called the separation of powers: legislative power, judicial power, and executive power.

This system is designed in a way that restricts the power of any single branch. For example, if members of a parliament want to produce a policy that unfairly benefits them, it's not that easy. They are not able to enforce this policy, and also judges might block it. Similarly, if the police want to make laws helping them to read everyone's messages so they can track drug dealers at the expense of people's privacy, then they have to convince members of parliament that these laws are necessary. Members of parliament are usually elected by the popular vote, and if they pass an oppressive law, their voters might not like it.

Mass media are sometimes called the fourth power because they shape the population's opinion of the three branches of government, and they love to dig through those branches' dirty laundry for sensational stories.

These institutions are not perfect, but if you look at authoritarian or totalitarian countries without real division of powers, they usually look much uglier, and their leaders don't really care about the prosperity of the country and its citizens.

Applicability of the separation of powers for AI safety

Is it possible to create an AGI in such a way that it will be split into several independent, or even competing modules, so that one module makes plans but doesn't implement them, and the other module implements plans but doesn't benefit from them?

I don't know. This idea is somewhat similar to the concept of Oracle AI, which can only make plans and not implement them, but I don't know if there is a way to be sure that the planning module cannot gain power.

Since the source of power for an AGI will be its cognitive abilities, it makes sense to split it into several blocks so that each of them has only limited cognitive abilities. As far as I understand, this is one of the core ideas behind Conjecture's Cognitive Emulation proposal, as well as several other proposals^[3]. The authors argue that these solutions might be safer than a monolithic, black-boxy AGI and will severely reduce its Shoggothness. People have pointed out that, at the moment, this is only a concept, and it is unclear whether it will be competitive with non-modular models and as safe as the authors of these ideas hope, but at least this logic seems interesting to explore.
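To make the analogy a bit more concrete, here is a toy sketch of what such a split could look like at the level of interfaces. Everything in it is hypothetical and invented for illustration (the names `Planner`, `Reviewer`, `Executor`, and the keyword-based veto); it is not how Conjecture or anyone else implements Cognitive Emulation. It only shows the shape of the separation: one module proposes, another can veto, and a third acts only on approved plans. It does nothing to solve the hard part discussed above, namely making sure that no module can gain power outside its role.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """An immutable proposal: once written, no module can alter its steps."""
    steps: tuple[str, ...]


class Planner:
    """'Legislative' module: proposes plans but has no method to execute them."""

    def propose(self, goal: str) -> Plan:
        # Hypothetical stand-in for a limited planning model.
        return Plan(steps=(f"research {goal}", f"draft report on {goal}"))


class Reviewer:
    """'Judicial' module: can veto a plan but cannot write or execute one."""

    def __init__(self, forbidden: frozenset[str]) -> None:
        self.forbidden = forbidden

    def approve(self, plan: Plan) -> bool:
        # Veto any plan in which some step mentions a forbidden phrase.
        return not any(
            phrase in step for step in plan.steps for phrase in self.forbidden
        )


class Executor:
    """'Executive' module: carries out approved plans but cannot create them."""

    def run(self, plan: Plan, approved: bool) -> list[str]:
        if not approved:
            raise PermissionError("plan was vetoed by the reviewer")
        return [f"done: {step}" for step in plan.steps]


def govern(goal: str, planner: Planner, reviewer: Reviewer, executor: Executor) -> list[str]:
    """Route a goal through all three modules in order."""
    plan = planner.propose(goal)
    return executor.run(plan, reviewer.approve(plan))
```

In this toy version, a call like `govern("solar panels", Planner(), Reviewer(frozenset({"acquire resources"})), Executor())` goes through, while a plan containing a step such as "acquire resources without oversight" is vetoed before the executor touches it. A real system would need a far stronger reviewer than a keyword filter, which is exactly where the open questions are.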

Conclusion

This post was meant as a mental exercise for me to contemplate our instruments for aligning humans and human organizations, and to draw parallels with AI safety concepts. Through this lens, Cognitive Emulation seems like a reasonable approach to try, and I wish the Conjecture team luck in implementing this strategy.

^[1] There is no written evidence that anyone in antiquity used this phrase.
^[2] There is a meme in the AI alignment community that compares AGI to a Shoggoth, a monster from the mythos of H.P. Lovecraft.
^[3] See Ought's proposal to supervise process, not outcome, or Drexler's Open Agency Model.

ProgramCrafter (3y)

On topic of "Applicability of time constraints for AI safety":

https://www.lesswrong.com/posts/itTLCFj5NCHhFbK2Q/are-limited-horizon-agents-a-good-heuristic-for-the-off?commentId=xZoL4awrBjD4Wtxkv

These time-limited agents have an incentive to coordinate with future versions of themselves: You’ll make more money today, if past-you set up the conditions for a profitable trade yesterday.
So a sequence of time-limited agents could still develop instrumental power-seeking.

Igor Ivanov (3y)

Thank you! The idea of inter-temporal coordination looks interesting.
