This is a very basic introduction to AGI safety work, cross-posted from the MIRI blog. The discussion of AI V&V methods (mostly in the 'early steps' section) is probably the only part that will be new to regulars here.
Improvements in AI are resulting in the automation of increasingly complex and creative human behaviors. Given enough time, we should expect artificial reasoners to begin to rival humans in arbitrary domains, culminating in artificial general intelligence (AGI).
A machine would qualify as an 'AGI', in the intended sense, if it could adapt to a very wide range of situations to consistently achieve some goal or goals. Such a machine would behave intelligently when supplied with arbitrary physical and computational environments, in the same sense that Deep Blue behaves intelligently when supplied with arbitrary chess board configurations — consistently hitting its victory condition within that narrower domain.
Since generally intelligent software could help automate the process of thinking up and testing hypotheses in the sciences, AGI would be uniquely valuable for speeding technological growth. However, this wide-ranging productivity also makes AGI a unique challenge from a safety perspective. Knowing very little about the architecture of future AGIs, we can nonetheless make a few safety-relevant generalizations:
- Because AGIs are intelligent, they will tend to be complex, adaptive, and capable of autonomous action, and they will have a large impact where employed.
- Because AGIs are general, their users will have incentives to employ them in an increasingly wide range of environments. This makes it hard to construct valid sandbox tests and requirements specifications.
- Because AGIs are artificial, they will deviate from human agents, causing them to violate many of our natural intuitions and expectations about intelligent behavior.
Today's AI software is already tough to verify and validate, thanks to its complexity and its uncertain behavior in the face of state space explosions. Menzies & Pecheur (2005) give a good overview of AI verification and validation (V&V) methods, noting that AI, and especially adaptive AI, will often yield undesired and unexpected behaviors.
An adaptive AI that acts autonomously, like a Mars rover that can't be directly piloted from Earth, represents an additional large increase in difficulty. Autonomous safety-critical agents need to make irreversible decisions in dynamic environments with very low failure rates. The state of the art in safety research for autonomous systems is improving, but continues to lag behind capabilities work. Hinchman et al. (2012) write:
As autonomous systems become more complex, the notion that systems can be fully tested and all problems will be found is becoming an impossible task. This is especially true in unmanned/autonomous systems. Full test is becoming increasingly challenging on complex system. As these systems react to more environmental [stimuli] and have larger decision spaces, testing all possible states and all ranges of the inputs to the system is becoming impossible. [...] As systems become more complex, safety is really risk hazard analysis, i.e. given x amount of testing, the system appears to be safe. A fundamental change is needed. This change was highlighted in the 2010 Air Force Technology Horizon report, "It is possible to develop systems having high levels of autonomy, but it is the lack of suitable V&V methods that prevents all but relatively low levels of autonomy from being certified for use." [...]
The move towards more autonomous systems has lifted this need [for advanced verification and validation techniques and methodologies] to a national level.
AI acting autonomously in arbitrary domains, then, looks particularly difficult to verify. If AI methods continue to see rapid gains in efficiency and versatility, and especially if these gains further increase the opacity of AI algorithms to human inspection, AI safety engineering will become much more difficult in the future. In the absence of any reason to expect a development in the lead-up to AGI that would make high-assurance AGI easy (or AGI itself unlikely), we should be worried about the safety challenges of AGI, and that worry should inform our research priorities today.
Below, I’ll give reasons to doubt that AGI safety challenges are just an extension of narrow-AI safety challenges, and I’ll list some research avenues people at MIRI expect to be fruitful.
New safety challenges from AGI
A natural response to the idea of starting work on high-assurance AGI is that AGI itself appears to be decades away. Why worry about it so early? And, supposing that we should worry about it, why think there's any useful work we can do on AGI safety so far in advance?
In response to the second question: It’s true that, at a glance, AGI looks difficult to effectively prepare for. The issue is important enough, however, to warrant more than a glance. Long-term projects like mitigating climate change or detecting and deflecting asteroids are intuitively difficult. The same is true of interventions that depend on projected future technologies, such as work on post-quantum cryptography in anticipation of quantum computers. In spite of that, we've made important progress on these fronts.
Covert channel communication provides one precedent. It was successfully studied decades in advance of being seen in the wild. Roger Schell cites a few other success cases in Muehlhauser (2014b), and suggests reasons why long-term security and safety work remains uncommon. We don’t know whether early-stage AGI safety work will be similarly productive, but we shouldn’t rule out the possibility before doing basic research into the question. I’ll list some possible places to start looking in the next section.
What about the first question? Why worry specifically about AGI?
I noted above that AGI is an extreme manifestation of many ordinary AI safety challenges. However, MIRI is particularly concerned with unprecedented, AGI-specific behaviors. For example: An AGI’s problem-solving ability (and therefore its scientific and economic value) depends on its ability to model its environment. This includes modeling the dispositions of its human programmers. Since the program’s success depends in large part on its programmers’ beliefs and preferences, an AI pursuing some optimization target can select actions for the effect they have on programmers’ mental states, not just on the AI’s material environment.
This means that safety protocols will need to be sensitive to risks that differ qualitatively from ordinary software failure modes — AGI-specific hazards like ‘the program models its programmed goals as being better served if it passes human safety inspections, so it selects action policies that make it look safer (to humans) than it really is’. If we model AGI behavior using only categories from conventional software design, we risk overlooking new intelligent behaviors, including 'deception'.
At the same time, oversimplifying these novel properties can cause us to anthropomorphize the AGI. If it's naïve to expect ordinary software validation methods to immediately generalize to advanced autonomous agents, it's even more naïve to expect conflict prevention strategies that work on humans to immediately generalize to an AI. A 'deceptive' AGI is just one whose planning algorithm identifies some human misconception as instrumentally useful to its programmed goals. Its methods or reasons for deceiving needn't resemble a human's, even if its capabilities do.
In human society, we think, express, and teach norms like 'don't deceive' or Weld & Etzioni's (1994) 'don't let humans come to harm' relatively quickly and easily. The complex conditional response that makes humans converge on similar goals remains hidden inside a black box — the undocumented spaghetti code that is the human brain. As a result of our lack of introspective access to how our social dispositions are cognitively and neurally implemented, we're likely to underestimate how contingent and complex they are. For example:
- We might expect that an especially intelligent AI system would have especially worthy goals, since knowledge and insight are associated with many other virtues in humans. E.g., Halls (2007) conjectures this on the grounds that criminality negatively correlates with intelligence in humans. Bostrom's (2003) response is that there's no particular reason to expect AIs to converge on anthropocentric terminal values like 'compassion' or 'loyalty' or 'novelty'. A superintelligent AI could consistently have no goal other than to construct paperclips, for example.
- We might try to directly hand-code goals like 'don't deceive' into the agent by breaking them apart into simpler goals, e.g., 'don't communicate information you believe to be false'. However, in the process we're likely to neglect the kinds of subtleties that can be safely left implicit when it's a human child we're teaching — lies of omission, misleading literalism, novel communication methods, or any number of edge cases. As Bostrom (2003) notes, the agent's goals may continue to reflect the programmers' poor translation of their requirements into lines of code, even after its intelligence has arrived at a superior understanding of human psychology.
- We might instead try to instill the AGI with humane values via machine learning — training it to promote outcomes associated with camera inputs of smiling humans, for example. But a powerful search process is likely to hit on solutions that would never occur to a developing human. If the agent becomes more powerful or general over time, initially benign outputs may be a poor indicator of long-term safety.
Advanced AI is also likely to have technological capabilities, such as strong self-modification, that introduce other novel safety obstacles; see Yudkowsky (2013).
These are quick examples of some large and poorly-understood classes of failure mode. However, the biggest risks may be from problem categories so contrary to our intuitions that they will not occur to programmers at all. Relying on our untested intuitions, or on past experience with very different systems, is unlikely to catch every hazard.
As an intelligent but inhuman agent, AGI represents a fundamentally new kind of safety challenge. As such, we’ll need to do basic theoretical work on the general features of AGI before we can understand such agents well enough to predict and plan for them.
What would early, theoretical AGI safety research look like? How does one vet a hypothetical technology? We can distinguish research projects oriented toward system verification from projects oriented toward system requirements.
Verification-directed AGI research extends existing AI safety and security tools that are likely to help confirm various features of advanced autonomous agents. Requirements-directed AGI research instead specifies desirable AGI abilities or behaviors, and tries to build toy models exhibiting the desirable properties. These models are then used to identify problems to be overcome and basic gaps in our conceptual understanding.
In other words, verification-directed approaches would ask 'What tools and procedures can we use to increase our overall confidence that the complex systems of the future will match their specifications?' They include:
- Develop new tools for improving the transparency of AI systems to inspection. Frequently, useful AI techniques like boosting are tried out in an ad-hoc fashion and then promoted when they're observed to work on some problem set. Understanding when and why programs work would enable stronger safety and security guarantees. Computational learning theory may be useful here, for proving bounds on the performance of various machine learning algorithms.
- Extend techniques for designing complex systems to be readily verified. Work on clean-slate hardware and software approaches that maintain high assurance at every stage, like HACMS and SAFE.
- Extend current techniques in program synthesis and formal verification, with a focus on methods applicable to complex and adaptive systems, such as higher-order program verification and Spears' (2000, 2006) incremental reverification. Expand on existing tools — e.g., design better interfaces and training methods for the SPIN model checker to improve its accessibility.
- Apply homotopy type theory to program verification. The theory's univalence axiom lets us derive identities from isomorphisms. Harper & Licata (2011) suggest that if we can implement this as an algorithm, it may allow us to reuse high-assurance code in new contexts without a large loss in confidence.
- Expand the current body of verified software libraries and compilers, such as the Verified Software Toolchain. A lot of program verification work is currently directed at older toolchains, e.g., relatively small libraries in C. Focusing on newer toolchains would limit our ability to verify systems that are already in wide use, but would put us in a better position to verify more advanced safety-critical systems.
Requirements-directed approaches would ask 'What outcomes are we likely to want from an AGI, and what general class of agent could most easily get us those outcomes?' Examples of requirements-directed work include:
- Formalize stability guarantees for intelligent self-modifying agents. A general intelligence could help maintain itself and implement improvements to its own software and hardware, including improvements to its search and decision heuristics à la EURISKO. It may be acceptable for the AI to introduce occasional errors in its own object recognition or speech synthesis modules, but we’d want pretty strong assurance about the integrity of its core decision-making algorithms, including the module that approves self-modifications. At present, the toy models of self-modifying AI discussed in Fallenstein & Soares (2014) run into two paradoxes of self-reference, the ‘Löbian obstacle’ and the ‘procrastination paradox.’ Similar obstacles may arise for real-world AGI, and finding solutions should improve our general understanding of systems that make decisions and predictions about themselves.
- Specify desirable checks on AGI behavior. Some basic architecture choices may simplify the task of making AGI safe, by restricting the system's output channel (e.g., oracle AI in Armstrong et al. (2012)) or installing emergency tripwires and fail-safes (e.g., simplex architectures). AGI checks are a special challenge because of the need to recruit the agent to help actively regulate itself. If this demand isn't met, the agent may devote its problem-solving capabilities to finding loopholes in its restrictions, as in Yampolskiy's (2012) discussion of an 'AI in a box'.
- Design general-purpose methods by which intelligent agents can improve their models of user requirements over time. The domain of action of an autonomous general intelligence is potentially unlimited; or, if it is limited, we may not be able to predict which limits will apply. As such, it needs safe situation-general goals. Coding a universally safe and useful set of decision criteria by hand, however, looks hopelessly difficult. Instead, some indirectly normative method like Dewey's (2011) value learning seems necessary, to allow initially imperfect decision-makers to improve their goal content over time. Related open questions include: what kinds of base cases can we use to train and test a beneficial AI?; and how can AIs be made safe and stable during the training process?
- Formalize other optimality criteria for arbitrary reasoners. Just as a general-purpose adaptive agent would need general-purpose values, it would also need general-purpose methods for tracking features of its environment and of itself, and for selecting action policies on this basis. Mathematically modeling ideal inference (e.g., Hutter (2012)), ideal decision-theoretic expected value calculation (e.g., Altair (2013)), and ideal game-theoretic coordination (e.g., Barasz et al. (2014)) are unlikely to be strictly necessary for AGI. However, they're plausibly necessary for AGI safety, because models like these would give us a solid top-down theoretical foundation upon which to cleanly construct components of autonomous agents for human inspection and verification.
We can probably make progress in both kinds of AGI safety research well in advance of building a working AGI. Requirements-directed research focuses on abstract mathematical agent models, which makes it likely to be applicable to a wide variety of software implementations. Verification-directed approaches will be similarly valuable, to the extent they are flexible enough to apply to future programs that are much more complex and dynamic than any contemporary software. We can compare this to present-day high-assurance design strategies, e.g., in Smith & Woodside (2000) and Hinchman et al. (2012). The latter write, concerning simpler autonomous machines:
With better software analysis techniques, software can be analyzed at design-time with the goal of finding software faults earlier. This analysis can also prove the absence of error or negative properties. As system complexity and functionality increase, complete testing is becoming impossible and enhanced analysis techniques will have to be used. Furthermore, many of these software techniques, such as model checking, can be used in analysis of requirements and system design to find conflicting requirements or logic faults before a single line of code is written saving more time and money over traditional testing methods.
Verification- and requirements-directed work is complementary. The point of building clear mathematical models of agents with desirable properties is to make it easier to design systems whose behaviors are transparent enough to their programmers to be rigorously verified; and verification methods will fail to establish system safety if we have a poor understanding of what kind of system we want in the first place.
Some valuable projects will also fall in between these categories — e.g., developing methods for principled formal validation, which can increase our confidence that we're verifying the right properties given the programmers' and users' goals. (See Cimatti et al. (2012) on formal validation, and also Rushby (2013) on epistemic doubt.)
MIRI's focus: the mathematics of intelligent agents
MIRI's founder, Eliezer Yudkowsky, has been one of the most vocal advocates for research into autonomous high-assurance AGI, or 'friendly AI'. Russell & Norvig (2009) write:
[T]he challenge is one of mechanism design — to define a mechanism for evolving Al systems under a system of checks and balances, and to give the systems utility functions that will remain friendly in the face of such changes. We can't just give a program a static utility function, because circumstances, and our desired responses to circumstances, change over time.
MIRI’s approach is mostly requirements-directed, in part because this angle of attack is likely to enhance our theoretical understanding of the entire problem space, improving our research priorities and other strategic considerations. Moreover, the relevant areas of theoretical computer science and mathematics look much less crowded. There isn’t an established subfield or paradigm for requirements-directed AGI safety work where researchers could find a clear set of open problems, publication venues, supervisors, or peers.
Rather than building on the formal verification literature, MIRI prioritizes jump-starting these new avenues of research. Muehlhauser (2013) writes that engineering innovations often have their germ in prior work in mathematics, which in turn can be inspired by informal philosophical questions. At this point, AGI safety work is just beginning to enter the 'mathematics' stage. Friendly AI researchers construct simplified models of likely AGI properties or subsystems, formally derive features of those models, and check those features against general or case-by-case norms.
Because AGI safety is so under-researched, we're likely to find low-hanging fruit even in investigating basic questions like 'What kind of prior probability distribution works best for formal agents in unknown environments?' As Gerwin Klein notes in Muehlhauser (2014a), "In the end, everything that makes it easier for humans to think about a system, will help to verify it." And, though MIRI's research agenda is determined by social impact considerations, it is also of general intellectual interest, bearing on core open problems in theoretical computer science and mathematical logic.
At the same time, it's important to keep in mind that formal proofs of AGI properties only function as especially strong probabilistic evidence. Formal methods in computer science can decrease risk and uncertainty, but they can't eliminate it. Our assumptions are uncertain, so our conclusions are too.
Though we can't reach complete confidence in the safety of an AI, we can still decrease the probability of catastrophic failure. In the process, we are likely to come to a better understanding of the most severe failure modes. If we begin working now to better understand AGI as a theoretical system, we'll be in a far better position to implement robust safety measures as AI improves in intelligence and autonomy in the decades to come.
∙ Altair (2013). A comparison of decision algorithms on Newcomblike problems. Machine Intelligence Research Institute.
∙ Armstrong et al. (2012). Thinking inside the box: controlling and using an oracle AI. Minds and Machines, 22: 299-324.
∙ Barasz et al. (2014). Robust cooperation in the Prisoner's Dilemma: program equilibrium via probability logic. arXiv.
∙ Bostrom (2003). Ethical issues in advanced artificial intelligence. In Smith et al. (eds.), Cognitive, Emotive and Ethical Aspects of Decision-Making in Humans and in Artificial Intelligence, 2: 12-17.
∙ Cimatti et al. (2012). Validation of requirements for hybrid systems: a formal approach. ACM Transactions on Software Engineering and Methodology, 21.
∙ Dewey (2011). Learning what to value. Artificial General Intelligence 4th International Conference Proceedings: 309-314.
∙ Fallenstein & Soares (2014). Problems of self-reference in self-improving space-time embedded intelligence. Working paper.
∙ Halls (2007). Beyond AI: Creating the Conscience of the Machine.
∙ Harper & Licata (2011). Foundations and applications of higher-dimensional directed type theory. National Science Foundation grant proposal.
∙ Hinchman et al. (2012). Towards safety assurance of trusted autonomy in Air Force flight-critical systems. Layered Assurance Workshop, 17.
∙ Hutter (2012). One decade of Universal Artificial Intelligence. Theoretical Foundations of Artificial General Intelligence, 4: 67-88.
∙ Menzies & Pecheur (2005). Verification and validation and artificial intelligence. Advances in Computers, 65: 154-203.
∙ Muehlhauser (2013). From philosophy to math to engineering. MIRI Blog.
∙ Muehlhauser (2014a). Gerwin Klein on formal methods. MIRI Blog.
∙ Muehlhauser (2014b). Roger Schell on long-term computer security research. MIRI Blog.
∙ Rushby (2013). Logic and epistemology in safety cases. Computer Safety, Reliability, and Security: Proceedings of SafeComp 32 (pp. 1-7).
∙ Russell & Norvig (2009). Artificial Intelligence: A Modern Approach.
∙ Smith & Woodside (2000). Performance validation at early stages of software development. In Gelenbe (ed.), System Performance Evaluation: Methodologies and Applications (pp. 383-396).
∙ Spears (2000). Asimovian adaptive agents. Journal of Artificial Intelligence Research, 13: 95-153.
∙ Spears (2006). Assuring the behavior of adaptive agents. In Rouff et al. (eds.), Agent Technology from a Formal Perspective (pp. 227-257).
∙ Weld & Etzioni (1994). The First Law of Robotics (a call to arms). Proceedings of the Twelfth National Conference on Artificial Intelligence: 1042-1047.
∙ Yampolskiy (2012). Leakproofing the singularity: artificial intelligence confinement problem. Journal of Consciousness Studies, 19: 194-214.
∙ Yudkowsky (2013). Intelligence explosion microeconomics. Technical report.
I wonder if sheaf theory could be useful for this. The basic idea of sheaf theory is that you can prove things about a global structure by looking at local structures, and how you 'glue' them together to get the global structure. Michael Robinson has been doing lots of interesting work using sheaf theory to understand networks. For example, here's a presentation of an application of sheaf cohomology to asynchronous computation design.
This objection sounds like it boils down to "but we might fail!", which is not very convincing. It seems to me that you (and Yudkowsky) are assuming an already superhuman intelligence, which is a bit of a strawman.
We have a century of literature on the proper emotional response of a normal human to various ethical scenarios at various levels of their development. A reasonable approach would be to do psychological evaluations of a sub-human intelligence proto-AGI being educated in a pre-school like environment, with the intent of developing an intelligence with the emotional and moral response matching that of human children before it enters into a recursive self-improvement. At this stage the mind would be simple and transparent enough to audit and ensure no deception is going on.
Some more nitpicky notes:
Complexity does not follow from intelligence, even general intelligence. Complexity is a product of intelligence, not a feature. Just as the simple process of evolution created complex organisms, simple intelligences such as AIXI are capable of doing complex things. But that doesn't make them inherently complex.
This is not a hard problem. It just inherently works on long time scales. Deflecting an asteroid is easy... if you know about it well enough in advance.
If formal methods are only giving you probabilistic evidence, you aren't using appropriate formal methods. There are systems designed to make 1-to-1 correspondences between code and proof (the method I'm familiar with has an intermediate language and maps every subroutine and step of logic to an expression in that intermediate), and this could be used to make the code an airtight proof that, for example, the utility function will only evolve in specified ways and will stay within known limits. This does put limits on how the program can be written, and lesser limits on how the proof can be constructed (it is hard to incorporate nonconstructivist mathematics), but can have every assumption underlying the safety proved correct. (And when I say every step, I include proof that the compiler is sound and will produce a correct program or none at all, proof that each component of the intermediate language reflects the corresponding proof step, etc.)
We only have probabilistic evidence that any formal method is correct. So far we haven't found contradictions implied by the latest and greatest axioms, but mathematics is literally built upon the ruins of old axioms that didn't quite rule out all known contradictions. FAI needs to be able to re-axiomatize its mathematics when inconsistencies are found in the same way that human mathematicians have, while being implemented in a subset of the same mathematics.
Additionally, machines are only probabilistically correct. FAI will probably need to treat its own implementation as a probabilistic formal system.
The standard Zermelo-Fraenkel axioms have lasted a century with only minor modifications -- none of which altered what was provable -- and there weren't many false starts before that. There is argument over whether to include the axiom of choice, but as mentioned the formal methods of program construction naturally use constructivist mathematics, which doesn't use the axiom of choice anyhow.
This blatantly contradicts the history of axiomatic mathematics, which is only about two centuries old and which has standardized on the ZF axioms for half of that. That you claim this calls into question your knowledge about mathematics generally.
If there's anything modern computer science is good at. it's getting guaranteed performance within specified bounds out of unreliable probabilistic systems.
When absolute guarantees are impossible, there are abundant methods available to guarantee correct outcomes up to arbitrarily high thresholds, which can be as high as you like, and it's quite silly to dismiss it as technically probabilistic. You could, for example, pick the probability that a given baryon would undergo radioactive decay (half-life: 10^32 years or greater), the probability that all the atoms in your pants will suddenly jump, in unison, three feet to the left, or some other extremely-improbable threshold.
Is there a formal method for deciding whether or not to include the axiom of choice? As I understand it three of the ZF axioms are independent of the rest, and all are independent of choice. How would AGI choose which independent axioms should be accepted? AGI could be built to only ever accept a fixed list of axioms but that would make it inflexible if further discoveries offer evidence for choice being useful for example.
You are correct, I don't have formal mathematical training beyond college and I pursue formal mathematics out of personal interest, so I welcome corrections. As I understand it geometry was axiomatic for much longer, and the discovery of non-Euclidean geometries required separating the original axioms for different topologies. Is there a way to formally decide now whether or not a similar adjustment may be required for the axioms of ZF(C)? The problem, as I see it, is that formal mathematics is just string manipulation and the choice of which allowed manipulations are useful is dependent on how the world really is. ZF is useful because its language maps very well onto the real world, but as an example unifying general relativity and quantum mechanics has been difficult. Unless it's formally decidable whether ZF is sufficient for a unified theory it seems to me that some method for an AGI to change its accepted axioms based on probabilistic evidence is required, as well as avoid accepting useless or inconsistent independent axioms.
The axiom of choice applies exclusively to infinite sets, and the finite restriction is a consequence of ZF without AC. Since we cannot actually construct infinite sets, infinitesimals, or even irrational numbers, the consequences of ZFC over and above ZF in the real world are negligible at most, and almost certainly nonexistent.
These are not related statements. There are a lot of ways to choose axioms, but I'm not aware of any of them having any consequences in regimes applicable to physics. Any choice of axioms meant to describe the world has to support some basic conclusions like the validity of arithmetic, and that puts great restrictions on what the axioms are, and what they could possibly say about a physical theory.
Geometry was nominally axiomatic for many centuries, but not rigorously axiomatic until the study and development of non-Euclidean geometry, which was the beginning of the rigorous axiomatization of mathematics. Prior to that, statements were frequently taken as axioms on purely intuitionist grounds (in many subfields); it took the demonstration that Euclid's fifth axiom (the parallel postulate) was unnecessary to make mathematicians actually check their assumptions and establish sets of axioms which were consistent.
Thanks. Well written and interesting.
This is in a way surprising given how obsessed our culture is with safety in many other contexts.
I have two questions:
What is the best strategy for persuading governments and others that more resources should be spent on AGI safety research? Do we have good reasons to believe that more resources will be spent on this in the relatively near future?