For those of you interested, András Kornai's paper "Bounding the impact of AGI" from this year's AGI-Impacts conference at Oxford had a few interesting ideas (which I've excerpted below).

Summary:

  1. Acceptable risk tolerances for AGI design can be determined using standard safety engineering techniques from other fields
  2. Mathematical proof is the only available tool to secure the tolerances required to prevent intolerable increases in xrisk
  3. Automated theorem proving will be required so that the proof can reasonably be checked by multiple human minds

Safety engineering

Since the original approach of Yudkowsky (2006) to friendly AI, which sought mathematical guarantees of friendliness, was met with considerable skepticism, we revisit the issue of why such guarantees are essential. In designing radioactive equipment, a reasonable guideline is to limit emissions to several orders of magnitude below the natural background radiation level, so that human-caused dangers are lost in the noise compared to the pre-existing threat we must live with anyway. In the full paper, we take the “big five” extinction events that occurred within the past half billion years as background, and argue that we need to design systems with a failure rate below 10^−63 per logical operation.

What needs to be emphasized in the face of this requirement is that the very best physical measurements have only one part in 10^17 precision, not to speak of social and psychological phenomena where our understanding is considerably weaker. What this means is that guarantees of the requisite sort can only be expected from mathematics, where our measurement precision is already considerably better.
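To see the shape of this calculation, here is a rough back-of-the-envelope sketch in Python. The background figures (a mass extinction roughly every 10^8 years with ~10^9 victims) are the ones quoted from the paper's appendix in the comments below; the safety margin and the operation rate are placeholder assumptions of mine, so the sketch only illustrates the structure of the argument, not how the paper arrives at exactly 10^−63.

```python
# Rough sketch of the safety-engineering budget; the extinction figures come
# from the paper's appendix (quoted in the comments below), while the safety
# margin and the operation rate are illustrative placeholders, not Kornai's.
extinction_interval_years = 1e8   # mean time between mass extinctions (paper)
victims_per_extinction    = 1e9   # assumed victims in the next one (paper)
world_population          = 7e9

# Background: annualized per-person probability of dying in a mass extinction.
background_annual_risk = victims_per_extinction / (extinction_interval_years * world_population)

# Safety-engineering guideline: keep human-made risk several orders of
# magnitude below the background level (margin chosen here for illustration).
safety_margin = 1e-6
tolerable_annual_risk = background_annual_risk * safety_margin

# Spread the annual budget over the logical operations an AGI might perform;
# the operation rate is a made-up placeholder, not a figure from the paper.
ops_per_year = 1e18 * 3.15e7   # assumed ops/second times seconds/year
per_operation_budget = tolerable_annual_risk / ops_per_year

print(f"background risk      ~ {background_annual_risk:.0e} per person-year")
print(f"tolerable AGI risk   ~ {tolerable_annual_risk:.0e} per person-year")
print(f"per-operation budget ~ {per_operation_budget:.0e}")  # far below any measurable precision
```

Kornai's own assumptions, which this sketch does not try to reproduce, tighten the budget to the 10^−63 quoted above; either way the number lands dozens of orders of magnitude below the one-part-in-10^17 precision of our best physical measurements, which is the point of the paragraph above.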

 

How reliable is mathematics?

The period since World War II has brought incredible advances in mathematics, such as the Four Color Theorem (Appel and Haken 1976), Fermat’s Last Theorem (Wiles 1995), the classification of finite simple groups (Gorenstein 1982, Aschbacher 2004), and the Poincaré conjecture (Perelman 2003). While the community of mathematicians is entirely convinced of the correctness of these results, few individual mathematicians are, as the complexity of the proofs, both in terms of knowledge assumed from various branches of mathematics and in terms of the length of the deductive chain, is generally beyond our ken. Instead of a personal understanding of the matter, most of us now rely on argumentum ad verecundiam: well, Faltings and Ribet now think that the Wiles-Taylor proof is correct, and even if I don’t know Faltings or Ribet, at least I know and respect people who know and respect them, and if that’s not good enough I can go and devote a few years of my life to understanding the proof for good. Unfortunately, the communal checking of proofs often takes years, and sometimes errors are discovered only after a decade has passed: the hole in the original proof of the Four Color Theorem (Kempe 1879) was detected by Heawood in 1890. Tomonaga in his Nobel lecture (1965) describes how his team’s work in 1947 uncovered a major problem in Dancoff (1939):

Our new method of calculation was not at all different in its contents from Dancoff’s perturbation method, but had the advantage of making the calculation more clear. In fact, what took a few months in the Dancoff type of calculation could be done in a few weeks. And it was by this method that a mistake was discovered in Dancoff’s calculation; we had also made the same mistake in the beginning.

To see that such long-hidden errors are by no means a thing of the past, and to observe the ‘web of trust’ method in action, consider the following example from Mohr (2012).

The eighth-order coefficient A1^(8) arises from 891 Feynman diagrams of which only a few are known analytically. Evaluation of this coefficient numerically by Kinoshita and coworkers has been underway for many years (Kinoshita, 2010). The value used in the 2006 adjustment is A1^(8) = -1.7283(35) as reported by Kinoshita and Nio (2006). However, (...) it was discovered by Aoyama et al. (2007) that a significant error had been made in the calculation. In particular, 2 of the 47 integrals representing 518 diagrams that had not been confirmed independently required a corrected treatment of infrared divergences. (...) The new value is (Aoyama et al., 2007) A1^(8) = -1.9144(35); (111) details of the calculation are given by Aoyama et al. (2008). In view of the extensive effort made by these workers to ensure that the result in Eq. (111) is reliable, the Task Group adopts both its value and quoted uncertainty for use in the 2010 adjustment.

 

Assuming no more than three million mathematics and physics papers published since the beginnings of scientific publishing, and no less than the three errors documented above, we can safely conclude that the overall error rate of the reasoning used in these fields is at least 10^−6 per paper.
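As a quick sanity check on this arithmetic, and to give a feel for the gap between human-checked mathematics and the design target above, here is a small sketch; the steps-per-paper figure is a placeholder assumption of mine, not an estimate from the paper.

```python
# Sanity check on the per-paper error rate claimed above, plus an illustrative
# comparison with the per-operation design target; the steps-per-paper figure
# is a placeholder assumption, not something estimated in the paper.
papers_published  = 3e6     # "no more than three million" papers assumed above
documented_errors = 3       # "no less than the three errors documented above"
rate_per_paper = documented_errors / papers_published
print(f"error rate >= {rate_per_paper:.0e} per paper")                 # 1e-06

# Placeholder assumption: a typical paper chains ~10^4 reasoning steps, so the
# implied per-step reliability of communal human checking is roughly:
steps_per_paper = 1e4
rate_per_step = rate_per_paper / steps_per_paper
print(f"implied error rate ~ {rate_per_step:.0e} per reasoning step")  # 1e-10
print("versus the 10^-63 per logical operation demanded above")
```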

 

The role of automated theorem-proving

That human reasoning, much like manual arithmetic, is a significantly error-prone process comes as no surprise. Starting with de Bruijn’s Automath (see Nederpelt et al. 1994), logicians and computer scientists have invested significant effort in mechanized proof checking, and it is indeed only through such efforts, in particular through the Coq verification (Gonthier 2008) of the entire logic behind the Appel and Haken proof, that all lingering doubts about the Four Color Theorem were laid to rest. The error in A1^(8) was also identified by using FORTRAN code generated by an automatic code generator (Mohr et al. 2012).

To gain an appreciation of the state of the art, consider the theorem that finite groups of odd order are solvable (Feit and Thompson 1963). The proof, which took two humans about two years to work out, takes up an entire issue of the Pacific Journal of Mathematics (255 pages), and it was only this year that a fully formal proof was completed by Gonthier’s team (see Knies 2012). The effort (170,000 lines, 15,000 definitions, and 4,200 theorems in Coq terms) took person-decades of human assistance (15 people working six years, though many of them part-time), even after the toil of Bender and Glauberman (1995) and Peterfalvi (2000), who greatly cleaned up and modularized the original proof, in which elementary group-theoretic and character-theoretic argumentation was completely intermixed.

The classification of finite simple groups is two orders of magnitude bigger: the effort involved about 100 humans, the original proof is scattered among 20,000 pages of papers, the largest (Aschbacher and Smith 2004a,b) taking up two volumes totaling some 1,200 pages. While everybody capable of rendering meaningful judgment considers the proof to be complete and correct, it must be somewhat worrisome at the 10^−64 level that there are no more than a couple of hundred such people, and most of them have something of a vested interest in that they themselves contributed to the proof. Let us suppose that people who are convinced that the classification is bug-free are offered the following bet by some superior intelligence that knows the answer. You must enter a room with as many people as you can convince to come with you and push a button: if the classification is bug-free, you will each receive $100; if not, all of you will immediately die. Perhaps fools rush in where angels fear to tread, but on the whole we still wouldn’t expect too many takers.

Whether the classification of finite simple groups is complete and correct is very hard to say – the planned second-generation proof will still be 5,000 pages, and mechanized proof is not yet in sight. But this is not to say that gaining mathematical knowledge of the required degree of reliability is hopeless, it’s just that instead of monumental chains of abstract reasoning we need to retreat to considerably simpler ones. Take, for example, the first Sylow Theorem, that if the order of a finite group G is divisible by some prime power p^n, G will have a subgroup H of this order. We are absolutely certain about this. Argumentum ad verecundiam of course is still available, but it is not needed: anybody can join the hive-mind by studying the proof. The Coq verification contains 350 lines, 15 definitions, 90 theorems, and took 2 people 2 weeks to produce. The number of people capable of rendering meaningful judgment is at least three orders of magnitude larger, and the vast majority of those who know the proof would consider betting their lives on the truth of this theorem an easy way of winning $100 with no downside risk.
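For a sense of how small the formal statement of such a theorem is, here is the first Sylow theorem written out in Lean 4 with Mathlib. This is only an illustration in a different proof assistant; the 350-line verification mentioned above is a Coq development with its own formulation, and the proof body is deliberately omitted here.

```lean
import Mathlib

/-- The first Sylow theorem as stated above: if `p ^ n` divides the order of a
finite group `G`, then `G` has a subgroup of that order. Illustration only;
the proof is omitted, since the point is the size of the statement. -/
theorem first_sylow_illustration {G : Type*} [Group G] [Finite G]
    {p n : ℕ} (hp : p.Prime) (hdvd : p ^ n ∣ Nat.card G) :
    ∃ H : Subgroup G, Nat.card H = p ^ n := by
  sorry
```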

 

Further remarks

Not only do we have to prove that the planned AGI will be friendly, the proof itself has to be short enough to be verifiable by humans. Consider, for example, the fundamental theorem of algebra. Could it be the case that we, humans, are all deluded into thinking that an n-th degree polynomial has n (complex) roots? Yes, but this is unlikely in the extreme. If this so-called theorem is really a trap laid by a superior intelligence, we are doomed anyway; humanity can find its way around it no more than a bee can find its way around the windowpane. Now consider the four-color theorem, which is still outside the human-verifiable range. It is fair to say that it would be unwise to create AIs whose friendliness critically depends on design limits implied by the truth of this theorem, while AIs whose friendliness is guaranteed by the fundamental theorem of algebra represent a tolerable level of risk.

Recently, Goertzel and Pitt (2012) have laid out a plan to endow AGI with morality by means of carefully controlled machine learning. Much as we are in agreement with their goals, we remain skeptical that their plan can meet the plain safety engineering criteria laid out above.

Comments

To paraphrase Kornai's best idea (which he's importing from outside the field):

A reasonable guideline is limiting the human caused xrisk to several orders of magnitude below the natural background xrisk level, so that human-caused dangers are lost in the noise compared to the pre-existing threat we must live with anyway.

I like this idea (as opposed to foolish proposals like driving risks from human made tech down to zero), but I expect someone here could sharpen the xrisk level that Kornai suggests. Here's a disturbing note from the appendix where he does his calculation:

Here we take the “big five” extinction events that occurred within the past half billion years as background. Assuming a mean time of 10^8 years between mass extinctions and 10^9 victims in the next one yields an annualized death rate of 10, comparing quite favorably to the reported global death rate of  ~500 for contact with hornets, wasps, and bees (ICD-9-CM E905.3). [emphasis added]

Obviously, this is a gross misunderstanding of xrisks and why they matter. No one values human lives linearly straight down to zero or assumes no expansion factors for future generations.

A motivated internet researcher could probably just look up the proper citations from Bostrom's "Global Catastrophic Risks" and create a decomposed model that estimates the background xrisk level from nature alone (and then nature plus non-AI human risks), and develop a better safety margin, lower than the one in this paper (implying that AGI could afford to be a few orders of magnitude riskier than Kornai's rough estimates suggest).

A reasonable guideline is limiting the human caused xrisk to several orders of magnitude below the natural background xrisk level, so that human-caused dangers are lost in the noise compared to the pre-existing threat we must live with anyway.

I like this idea [...]

We are well above that level right now - and that's very unlikely to change before we have machine superintelligence.

Martel (1997) estimates a considerably higher annualized death rate of 3,500 from meteorite impacts alone (she doesn’t consider continental drift or gamma-ray bursts), but the internal logic of safety engineering demands we seek a lower bound, one that we must put up with no matter what strides we make in redistribution of food, global peace, or healthcare.

Is this correct? I'd expect this lower bound to be superior to the one above (10 deaths / year) for the purpose of calculating our present safety factor... unless we're currently able to destroy earth-threatening meteorites and no one told me.

unless we're currently able to destroy earth-threatening meteorites and no one told me.

Well, we do have the technological means to build something to counter one of them, if we were to learn about it tomorrow and it had an ETA of 2-3 years. Assuming the threat is taken seriously and more resources and effort are put into it than are put into killing militant toddlers in the Middle East using drones, that is.

But if one shows up now and it's about to hit Earth on the prophecy-filled turn of the Mayan calendar? Nope, GG.

It is well known that conservatism and caution can result in greater risks under some circumstances. That was the point of my "The risks of caution" and Max's The Perils of Precaution. However, the message does not seem to be sinking in very well - and instead we have statements like: "Not only do we have to prove that the planned AGI will be friendly, the proof itself has to be short enough to be verifiable by humans". I don't think that is a sensible conclusion. It looks more like a naive and unrealistic approach to the issue.


In some very old publication, Eliezer made an xrisk engineering analogy with transistor failure in chip fabrication. The idea was that you often can't reduce transistor failure to acceptably minuscule levels, because rare catastrophic events are more common than acceptable levels. Things like trees falling on houses and destroying laptops, while rare, still contribute relatively huge amounts to transistor failure rates. Despite that, chips are reliable in the absence of catastrophic events. That reliability isn't a consequence of driving down transistor failure rates, it's a consequence of shoving all the failure probability into worlds where all the transistors on a chip fail at once.

Since that was an old publication, I have to take it with a grain of salt, but I still wonder: does designing "systems with a failure rate below 10^−63 per logical operation" miss some crucial point about conditional component failure?
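For what it's worth, the commenter's worry can be made concrete with a toy model (my own sketch, not anything from the publication mentioned): once a rare common-cause event is in play, driving the independent per-component failure rate ever lower stops helping, because the correlated term dominates.

```python
import math

# Toy common-cause failure model (my own illustration, not from the publication
# the commenter mentions); all numbers are arbitrary placeholders.
n_transistors    = 1e9      # transistors on one chip
p_per_transistor = 1e-18    # assumed independent failure probability per transistor-year
p_catastrophe    = 1e-6     # assumed yearly chance of a "tree falls on the laptop" event

# Chance that at least one transistor fails independently in a year.
p_independent = -math.expm1(n_transistors * math.log1p(-p_per_transistor))

# Add a rare event in which every transistor on the chip fails at once.
p_total = 1.0 - (1.0 - p_independent) * (1.0 - p_catastrophe)

print(f"independent failures only:  {p_independent:.1e}")   # ~1e-09
print(f"with the common-cause term: {p_total:.1e}")         # ~1e-06, dominated by the rare event
```

Under these made-up numbers the chip's overall failure probability is essentially the probability of the catastrophe, which seems to be the crux of the commenter's question about a bare per-operation figure like 10^−63.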

Regarding: "We are absolutely certain about this."

That is not a good sign in a paper about risk analysis :-(

Insofar as it is intended to bring out the emotional parallels surrounding certainty of 63 orders of magnitude, I think it does a good job and gives a good example. I would, for one, go into that room with everyone I know for the free $100 (assuming it was conditional on the theorem being true, or that I had at least a week to peruse the specific proof it checked and knew the exact specifications of what it judged to mean "bug-free").

My confidences don't differ by "63 orders of magnitude" about anything. If they did, I would know I was being overconfident about something.

Automated theorem proving will be required so that the proof can reasonably be checked by multiple human minds

Who will prove the correctness of the theorem provers?

I believe Coq's kernel is already short and has been checked using other proving programs that are also short and validated. So I believe the tower of formal validation that exists for these techniques is pretty well secured. I could be wrong about that, though... I'd be curious to know the answer.

Relatedly, there are a lot of levels you can take this to. For instance, I wish someone would create more tools like CompCert (the formally verified C compiler) for producing formally verified programs.