EDIT: So this got upvotes while being a draft. Not sure what happened there, but it's now officially posted, even though it's incomplete. Won't be including it in the Guide for now, since the style clashes.

If I include it in the Guide I'll come back to this and give it a conclusion.

Summary

It is best for laypersons to understand the AI research process (henceforth the Process) as an AGI of unclear alignment that is working to produce a potentially superintelligent successor AGI.

The identity of AGI and AI research

Safety-pertinent traits of an AGI

1. Opacity

An AGI is necessarily a black box to any human interlocutor. Humans are relatively transparent to other humans solely because we have inbuilt unconscious heuristics adapted to making sense of other humans. These heuristics would not work at all when targeted at a non-human intellect, such as an AGI, creating an infinite chain of suspicion between the human and the non-human intellect. If the non-human intellect is also superintelligent, though, the chain of suspicion is one-way: the human would be foolish to ever trust the non-human intellect (except in a scenario where cooperation with the human were somehow guaranteed), but the non-human intellect might be able to "trust" the human, to whatever extent the concept of "trust" can be said to apply to it.

One would think that, as a human construct, an AI would necessarily be completely transparent to at least some humans, but the current state-of-the-art techniques produce mathematical models whose decision-making processes are thoroughly unintelligible, much like the human brain that inspired those techniques. This does not mean an AI resembles a human brain much, but the brain does appear to place a lower bound on how complex, and how unintelligible, we should expect the decision-making process of an AGI to be: an AGI is unlikely to be simpler or more intelligible than a brain, purely because it would be capable of doing far, far more than a brain. At any rate, the state of the art is not trending towards simplicity and intelligibility. More generally, and as a rule, human science and technology do not tend towards increasing simplicity and intelligibility but towards the opposite, with any apparent simplicity being an illusion maintained for the benefit of end users.

2. Self-improvement

The most plausible scenario for a superintelligence coming into being is a sub-superintelligent AGI that is competent enough to improve its own intellect and efficacy, iterating on itself until both are far above what any human or coalition of humans can attain.

3. Amorality

As a non-human intellect, an AGI is most likely not to have human moral intuitions, or anything resembling moral intuitions of any sort. This is compounded by the trait of opacity: we would have no way of discerning whether it possesses anything resembling morality, and if it does, whether that morality is acceptable to most humans.

These traits make superintelligence the most plausible existential threat we are facing in the near term. As we shall see, these traits also describe the Process, which is also the primary potential source of a rogue superintelligence.

Safety-pertinent traits of the Process

1. Opacity

AI research, like all scientific research, is utterly unintelligible to the layperson. What are the odds AI research results in the destruction of humanity? The most truthful, dispassionate answer is: no one knows. If a layperson, or truthfully even an expert in this space, set out to make this determination, they would find it impossible due to the sheer amount of work rapidly occurring in this space, the unintelligibility of the current techniques, and the absence of any formal verification tool or process that guarantees safety. It is not possible to ask questions of the Process, and even if it were, the answers would be either unenlightening simplifications or unintelligible.

2. Self-improvement

As mentioned above, the progress of AI research has been a long, difficult process, and it is currently in an exuberant phase of high activity. Constant activity and self-improvement is exactly what we should expect to see of an AGI en route to superintelligence. Of course, the frenzied activity of this system is entirely driven by humans, unlike in a self-improving AGI; but that activity is aimed at removing all humans from the process, that is, at creating an algorithm that will independently continue the process humans are currently executing manually. This makes it unlike any other scientific endeavor: it is the only scientific research aimed at its own abolition.

3. Amorality

Systems composed of humans are invariably amoral, because they are responding to entirely different considerations than any individual human is. Note that amorality does not mean immorality; it merely means the system, taken as a whole, does not feel moral compunctions, or anything at all for that matter.

The Process is and has always been, for nearly all intents and purposes, an AGI that has been slowly improving itself through the decades. As in many apocalyptic AGI scenarios, an AGI process has been executing without us understanding that we are looking at an AGI. However, it does possess some important differences: crucially, because it is composed of humans, the system has built-in safety mechanisms.

Byzantine Generals

A Byzantine fault (also interactive consistency, source congruency, error avalanche, Byzantine agreement problem, Byzantine generals problem, and Byzantine failure) is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine Generals Problem", developed to describe a situation in which, in order to avoid catastrophic failure of the system, the system's actors must agree on a concerted strategy, but some of these actors are unreliable.

Unlike in most apocalyptic AGI scenarios, our analog AGI is partially intelligible. It is a distributed system, and though it is not possible to make accurate predictions of the system as a whole, or for the system to provide a self-assessment, it is possible to query its individual components, as they are humans.

Consider the situation where we have a digital AI with a distributed architecture. Let's visualize it as a cluster of GPT-Ns. We would like to know whether the activity of this cluster could wind up producing a rogue superintelligence. Each node being a GPT-N instance, we ask each one, via its terminal, for its assessment of the threat and alignment status of the activity of the entire system. The results are mixed. Some of the instances report there is a significant chance the system can produce, or become, a rogue superintelligence. Some assert that there is no conceivable way a rogue superintelligence could be an outcome of the system's activity. Some claim not to understand how to provide such an assessment. This is clearly an instance of the Byzantine Generals Problem: some of the components are failing, in the sense that they are providing an erroneous assessment. If the instances that reported a significant chance of disaster are correct, we would like to pause the activity of the system and perform an exhaustive analysis to understand how that determination was made and exactly how the system could produce disaster.
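To make the thought experiment concrete, here is a minimal sketch (in Python) of polling such a cluster. The cluster, the `query_instance` helper, and the three answer categories are all hypothetical stand-ins for the scenario above, not any real API.

```python
# Minimal sketch of the thought experiment: poll every node of a
# hypothetical GPT-N cluster and tally the (mixed) self-assessments.

from collections import Counter

ASSESSMENTS = ("significant_risk", "no_conceivable_risk", "query_not_understood")

def query_instance(node_id: int) -> str:
    """Stand-in for asking one GPT-N node, via its terminal, for a
    self-assessment of the whole system's threat and alignment status."""
    # Placeholder: in the thought experiment the answers come back mixed.
    return ASSESSMENTS[node_id % 3]

def poll_cluster(num_nodes: int) -> Counter:
    """Collect every node's assessment and tally the disagreement."""
    return Counter(query_instance(i) for i in range(num_nodes))

if __name__ == "__main__":
    tally = poll_cluster(num_nodes=9)
    print(tally)
    # Mixed answers mean at least some components are giving erroneous
    # assessments -- a Byzantine fault. If any node reports significant
    # risk, the conservative policy is to pause and investigate.
    if tally["significant_risk"] > 0:
        print("Pause the system and perform an exhaustive analysis.")
```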

With the analog distributed AGI of AI research, this is exactly the situation we find ourselves in. Some of the individuals involved think the activity of the system is quite likely to produce disaster. Others that it is quite impossible. Yet others do not understand the query.

The analogy breaks down here, as algorithms for Byzantine fault tolerance in computer systems do not map well to handling the Byzantine Generals Problem with actual Byzantine generals. Though humans are, according to neuroscience, entirely mechanical entities like any machine, we are currently unable to provide a wholly mechanical account of human behavior. Such an account could potentially allow us to implement a BFT algorithm for humans.
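For contrast, a mechanical BFT criterion is easy to state for machines. Below is a minimal sketch of the classical bound (with unauthenticated messages, agreement tolerating f Byzantine components requires at least 3f + 1 components in total); the point is that applying such a check presupposes we can model and count faulty nodes, which we cannot yet do for humans.

```python
def can_tolerate(total_nodes: int, byzantine_nodes: int) -> bool:
    """Classical Byzantine fault tolerance bound (oral/unauthenticated
    messages): agreement is possible only if
    total_nodes >= 3 * byzantine_nodes + 1."""
    return total_nodes >= 3 * byzantine_nodes + 1

# A 9-node cluster can tolerate at most 2 arbitrarily faulty nodes.
assert can_tolerate(9, 2)
assert not can_tolerate(9, 3)
```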

Even absent such an algorithm, the Byzantine Generals Problem is relevant to the agents within the system, who need to understand that this is their position, and that they need to coordinate if they think the system is an existential threat. I have the impression some of them do understand that, and hopefully those who think the system is safe, or don't understand how it could become dangerous, will also remain blind to, or in denial of, the fact that they are Byzantine generals.

Since any AGI should be treated as a safety-critical system, with redundant safety mechanisms, we must now turn to external mechanisms.

Andon Cord

The Andon Cord was a manifestation of the original Jidoka principle. Toyota implemented the Andon Cord as a physical rope that followed the assembly line and could be pulled to stop the manufacturing line at any time. Most Western analyses of this type of cord might assume it was implemented as a safety cut-off switch. At Toyota, it was a tool to instill autonomic behavior patterns; this is what Rother calls Kata.

Furthermore, this wasn't a request for permission to stop the line; the pull actually stopped the line. As the story goes, anyone could pull the Andon Cord at any time. Sounds mad, doesn't it? Salvador Dalí eloquently said, "The only difference between a madman and me is that I'm not mad."
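As an illustration of that design choice (the pull halts the whole line unconditionally; there is no ask-permission step), here is a minimal sketch of a shared stop flag that any worker can set. The worker names, stations, and "defect" trigger are purely hypothetical.

```python
import threading

# The "cord": any worker can set it; nothing has to approve the stop.
line_stopped = threading.Event()

def pull_andon_cord(worker_name: str, reason: str) -> None:
    """Stop the line immediately. There is no ask-permission step."""
    print(f"{worker_name} pulled the cord: {reason}")
    line_stopped.set()

def do_station_work(worker_name: str, station: int) -> None:
    if line_stopped.is_set():
        return  # the whole line halts, not just one station
    if station == 3:  # hypothetical defect spotted at station 3
        pull_andon_cord(worker_name, "defect spotted")
        return
    print(f"{worker_name} completed station {station}")

if __name__ == "__main__":
    for i in range(1, 6):
        do_station_work(f"worker-{i}", i)
```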

Even if the Process could be shown to be self-correcting or mostly safe, all such assessments would necessarily be unintelligible, or misleading simplifications, to the agents external to the system, that is, the laypersons. They clearly need a mechanism of their own to stop the Process if they think it is malfunctioning, as the Process's outputs will clearly affect them.

What pulling an Andon Cord on a machine made up of humans would look like, or how to install such a cord, is uncertain. It seems fruitful to try to understand how to make the Process conform to the Toyota Way that produced innovations such as the Andon Cord.

Foregrounding the threat of the Process in the public mind

The public needs to be made aware that the Process poses a potential, but not yet actual, threat to them. It then becomes possible for them to engage with the Process, and for the Process to face selective pressures to converge on acceptable solutions, pressures it is not currently facing. However, as in AI alignment scenarios, the Process may simply opt to become better at persuasion or other deflections instead of actually attempting to handle the threat it poses, which is why this should be only one tool among many.

There are processes that have gotten away with unalignment (the fossil fuel industry), but there are others that haven't (the nuclear industry, which is perhaps unjustifiably mistrusted). All else being equal, it appears that processes that can resist being shut down only possess that capability because of deep enmeshment in the broader process of civilization. Processes that have not burrowed so deep, such as AI research, are perhaps more vulnerable to being discredited, and to losing the capability to advance rapidly and unintelligibly, than is commonly thought, though this would have to be reevaluated if the Process has actually become as enmeshed in other systems as the fossil fuel industry.

Foregrounding the threat of the Process could be achieved with popular art. There is a notable lack of popular films that competently depict the actual threat of the Process. The film could follow the perspective of an AI researcher who is ambivalent about the risks of AI. The plot could be a gradual unfolding of an AI disaster, which would provide the opportunity to showcase the difficulties of alignment and containment, as well as various disasters of escalating intensity, culminating in extermination as a rogue superintelligence begins to consume all the resources of the solar system in pursuit of an inscrutable goal. The final scene would focus on the researcher protagonist, who has survived until this point, observing the beginning of the construction of a superstructure and attempting to make himself believe the superintelligence's claim that its actions are objectively correct. (It would really drive home the nature of the alignment problem if the superintelligence were attempting to convince all humans of the correctness of its actions even as it exterminates them, most likely by deploying utilitarian arguments with alternating strategies of philosophically rigorous disputation, hijacking of religious motifs, mockery, gaslighting, and ELI5-style arguments.) As the protagonist copes by attempting to believe that he is witnessing the birth of something more majestic than anyone could have possibly imagined, a holy-fool beggar who had been a recurring character making cryptic comments throughout the movie states the tagline of the movie in response: "Whatever displaces Life can only be Death." The protagonist's face is blank as he is disintegrated.

The possibility that successful alignment of one Process instance causes it to lose to a more reckless Process

Clearly a real risk, but I don't think it's hopeless. Usually, China is held up as THE example of an agent that would blithely forge ahead with the Process in spite of all arguments to the contrary, but the imprisonment of He Jiankui is reason to hope. The CCP is very skittish, and only needs to recognize the Process as a threat to engage in draconian measures in response.

The Ice-Nine objection

It may sound far-fetched or galaxy-brained to think of AI research itself as an AGI. The objection could be made that, by those terms, all scientific research is an existential threat: chemistry could produce an Ice-Nine-like substance, physics a lethal exotic phenomenon. I guess the main reason those other processes are not in themselves existential threats is that none of the individuals executing them have raised flags that their field is proceeding in a dangerous direction, though on closer consideration that is not a very good argument for extending trust to those disciplines, as they still fall prey to being unintelligible to outsiders. The broader peril of technological black balls, as Bostrom has described, of which rogue superintelligence is but one example, will have to be reckoned with at some point.
