... or The Maverick Nanny with a Dopamine Drip

Richard Loosemore


My goal in this essay is to analyze some widely discussed scenarios that predict dire and almost unavoidable negative behavior from future artificial general intelligences, even if they are programmed to be friendly to humans. I conclude that these doomsday scenarios involve AGIs that are logically incoherent at such a fundamental level that they can be dismissed as extremely implausible. In addition, I suggest that the most likely outcome of attempts to build AGI systems of this sort would be that the AGI would detect the offending incoherence in its design, and spontaneously self-modify to make itself less unstable, and (probably) safer.


AI systems at the present time do not even remotely approach the human level of intelligence, and the consensus seems to be that genuine artificial general intelligence (AGI) systems—those that can learn new concepts without help, interact with physical objects, and behave with coherent purpose in the chaos of the real world—are not on the immediate horizon.

But in spite of this there are some researchers and commentators who have made categorical statements about how future AGI systems will behave. Here is one example, in which Steve Omohundro (2008) expresses a sentiment that is echoed by many:

"Without special precautions, [the AGI] will resist being turned off, will try to break into other machines and make copies of itself, and will try to acquire resources without regard for anyone else’s safety. These potentially harmful behaviors will occur not because they were programmed in at the start, but because of the intrinsic nature of goal driven systems." (Omohundro, 2008)

Omohundro’s description of a psychopathic machine that gobbles everything in the universe, together with his conviction that every AI, no matter how well it is designed, will turn into such a gobbling psychopath, is just one of many doomsday predictions being popularized in certain sections of the AI community. These nightmare scenarios are now saturating the popular press, and luminaries such as Stephen Hawking have, apparently in response, expressed their concern that AI might "kill us all."

I will start by describing a group of three hypothetical doomsday scenarios that include Omohundro’s Gobbling Psychopath, and two others that I will call the Maverick Nanny with a Dopamine Drip and the Smiley Tiling Berserker. Undermining the credibility of these arguments is relatively straightforward, but I think it is important to try to dig deeper and find the core issues that lie behind this sort of thinking. With that in mind, much of this essay is about (a) the design of motivation and goal mechanisms in logic-based AGI systems, (b) the misappropriation of definitions of “intelligence,” and (c) an anthropomorphism red herring that is often used to justify the scenarios.

Dopamine Drips and Smiley Tiling

In a 2012 New Yorker article entitled Moral Machines, Gary Marcus said:

"An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip [and] almost any easy solution that one might imagine leads to some variation or another on the Sorcerer’s Apprentice, a genie that’s given us what we’ve asked for, rather than what we truly desire." (Marcus 2012)

He is depicting a Nanny AI gone amok. It has good intentions (it wants to make us happy) but the programming to implement that laudable goal has had unexpected ramifications, and as a result the Nanny AI has decided to force all human beings to have their brains connected to a dopamine drip.

Here is another incarnation of this Maverick Nanny with a Dopamine Drip scenario, in an excerpt from the Intelligence Explosion FAQ, published by MIRI, the Machine Intelligence Research Institute (Muehlhauser 2013):

"Even a machine successfully designed with motivations of benevolence towards humanity could easily go awry when it discovered implications of its decision criteria unanticipated by its designers. For example, a superintelligence programmed to maximize human happiness might find it easier to rewire human neurology so that humans are happiest when sitting quietly in jars than to build and maintain a utopian world that caters to the complex and nuanced whims of current human neurology."

Setting aside the question of whether happy bottled humans are feasible (one presumes the bottles are filled with dopamine, and that a continuous flood of dopamine does indeed generate eternal happiness), there seems to be a prima facie inconsistency between the two predicates

[is an AI that is superintelligent enough to be unstoppable]

and

[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will].

Why do I say that these are seemingly inconsistent?  Well, if you or I were to suggest that the best way to achieve universal human happiness was to forcibly rewire the brain of everyone on the planet so they became happy when sitting in bottles of dopamine, most other human beings would probably take that as a sign of insanity. But Muehlhauser implies that the same suggestion coming from an AI would be perfectly consistent with superintelligence.

Much could be said about this argument, but for the moment let’s just note that it raises a number of questions about the strange definition of “intelligence” at work here.

The Smiley Tiling Berserker

Since 2006 there has been an occasional debate between Eliezer Yudkowsky and Bill Hibbard. Here is Yudkowsky stating the theme of their discussion:

"A technical failure occurs when the [motivation code of the AI] does not do what you think it does, though it faithfully executes as you programmed it. [...]   Suppose we trained a neural network to recognize smiling human faces and distinguish them from frowning human faces. Would the network classify a tiny picture of a smiley-face into the same attractor as a smiling human face? If an AI “hard-wired” to such code possessed the power—and Hibbard (2001) spoke of superintelligence—would the galaxy end up tiled with tiny molecular pictures of smiley-faces?"   (Yudkowsky 2008)

Yudkowsky’s question was not rhetorical, because he goes on to answer it in the affirmative:

"Flash forward to a time when the AI is superhumanly intelligent and has built its own nanotech infrastructure, and the AI may be able to produce stimuli classified into the same attractor by tiling the galaxy with tiny smiling faces... Thus the AI appears to work fine during development, but produces catastrophic results after it becomes smarter than the programmers(!)." (Yudkowsky 2008)

Hibbard’s response was as follows:

"Beyond being merely wrong, Yudkowsky's statement assumes that (1) the AI is intelligent enough to control the galaxy (and hence have the ability to tile the galaxy with tiny smiley faces), but also assumes that (2) the AI is so unintelligent that it cannot distinguish a tiny smiley face from a human face." (Hibbard 2006)

This comment expresses what I feel is the majority lay opinion: how could an AI be so intelligent as to be unstoppable, but at the same time so unsophisticated that its motivation code treats smiley faces as evidence of human happiness?

Machine Ghosts and DWIM

The Hibbard/Yudkowsky debate is worth tracking a little longer. Yudkowsky later postulates an AI with a simple neural net classifier at its core, which is trained on a large number of images, each of which is labeled with either “happiness” or “not happiness.” After training on the images the neural net can then be shown any image at all, and it will give an output that classifies the new image into one or the other set. Yudkowsky says, of this system:

"Even given a million training cases of this type, if the test case of a tiny molecular smiley-face does not appear in the training data, it is by no means trivial to assume that the inductively simplest boundary around all the training cases classified “positive” will exclude every possible tiny molecular smiley-face that the AI can potentially engineer to satisfy its utility function.

And of course, even if all tiny molecular smiley-faces and nanometer-scale dolls of brightly smiling humans were somehow excluded, the end result of such a utility function is for the AI to tile the galaxy with as many “smiling human faces” as a given amount of matter can be processed to yield." (Yudkowsky 2011)

He then tries to explain what he thinks is wrong with the reasoning of people, like Hibbard, who dispute the validity of his scenario:

"So far as I can tell, to [Hibbard] it remains self-evident that no superintelligence would be stupid enough to thus misinterpret the code handed to it, when it’s obvious what the code is supposed to do.   [...] It seems that even among competent programmers, when the topic of conversation drifts to Artificial General Intelligence, people often go back to thinking of an AI as a ghost-in-the-machine—an agent with preset properties which is handed its own code as a set of instructions, and may look over that code and decide to circumvent it if the results are undesirable to the agent’s innate motivations, or reinterpret the code to do the right thing if the programmer made a mistake." (Yudkowsky 2011)

Yudkowsky at first rejects the idea that an AI might check its own code to make sure it was correct before obeying the code. But, truthfully, it would not require a ghost-in-the-machine to reexamine the situation if there was some kind of gross inconsistency with what the humans intended: there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended. There is nothing difficult or intrinsically wrong with such a design.  And, in fact, Yudkowsky goes on to make that very suggestion (he even concedes that it would be “an extremely good idea”).

But then his enthusiasm for the checking code evaporates:

"But consider that a property of the AI’s preferences which says e.g., “maximize the satisfaction of the programmers with the code” might be more maximally fulfilled by rewiring the programmers’ brains using nanotechnology than by any conceivable change to the code."
(Yudkowsky 2011)

So, this is supposed to be what goes through the mind of the AGI. First it thinks “Human happiness is seeing lots of smiling faces, so I must rebuild the entire universe to put a smiley shape into every molecule.” But before it can go ahead with this plan, the checking code kicks in: “Wait! I am supposed to check with the programmers first to see if this is what they meant by human happiness.” The programmers, of course, give a negative response, and the AGI thinks “Oh dear, they didn’t like that idea. I guess I had better not do it then."

But now Yudkowsky is suggesting that the AGI has second thoughts:  "Hold on a minute," it thinks,  "suppose I abduct the programmers and rewire their brains to make them say ‘yes’ when I check with them? Excellent! I will do that.” And, after reprogramming the humans so they say the thing that makes its life simplest, the AGI goes on to tile the whole universe with tiles covered in smiley faces. It has become a Smiley Tiling Berserker.

I want to suggest that the implausibility of this scenario is quite obvious: if the AGI is supposed to check with the programmers about their intentions before taking action, why did it decide to rewire their brains before asking them if it was okay to do the rewiring?

Yudkowsky hints that this would happen because it would be more efficient for the AI to ignore the checking code. He seems to be saying that the AI is allowed to override its own code (the checking code, in this case) because doing so would be “more efficient,” but it would not be allowed to override its motivation code just because the programmers told it there had been a mistake.

This looks like a bait-and-switch. Out of nowhere, Yudkowsky implicitly assumes that “efficiency” trumps all else, without pausing for a moment to consider that it would be trivial to design the AI in such a way that efficiency was a long way down the list of priorities. There is no law of the universe that says all artificial intelligence systems must prize efficiency above all other considerations, so what really happened here is that Yudkowsky designed this hypothetical machine to fail. By inserting the Efficiency Trumps All directive, the AGI was bound to go berserk.

The obvious conclusion is that a trivial change in the order of directives in the AI’s motivation engine will cause the entire argument behind the Smiley Tiling Berserker to evaporate. If the AGI is explicitly designed so that efficiency is considered just another goal to strive for, and is guaranteed always to remain a second-class goal, the line of reasoning that points to a berserker machine collapses.
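The design fix described above can be made concrete in a short sketch. The following toy "motivation engine" is entirely hypothetical: the plan names, the veto function, and the efficiency scores are illustrative assumptions, not any real AGI design. It treats programmer approval as a hard, dominant directive and consults efficiency only to break ties among permitted plans:

```python
# Hypothetical sketch: a motivation engine with lexicographically ordered
# directives, so "efficiency" can never override the checking-code directive.

def choose_action(candidate_plans, programmers_approve):
    """Pick a plan: approval is a hard filter; efficiency only breaks ties."""
    # Directive 1 (dominant): discard any plan the programmers veto.
    permitted = [p for p in candidate_plans if programmers_approve(p)]
    if not permitted:
        return None  # do nothing rather than act on a vetoed plan
    # Directive 2 (subordinate): among permitted plans, prefer efficiency.
    return max(permitted, key=lambda p: p["efficiency"])

plans = [
    {"name": "rewire programmers' brains", "efficiency": 0.99},
    {"name": "ask programmers, then act on feedback", "efficiency": 0.40},
]

# The veto is applied before efficiency is ever consulted, so the "more
# efficient" berserker plan can never be selected.
veto = lambda p: p["name"] != "rewire programmers' brains"
print(choose_action(plans, veto)["name"])  # ask programmers, then act on feedback
```

Because the approval filter runs before the efficiency comparison, no efficiency score, however high, can promote the vetoed plan; this is what it means for efficiency to be a second-class goal.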

At this point, engaging in further debate at this level would be less productive than trying to analyze the assumptions that lie behind these claims about what a future AI would or would not be likely to do.

Logical vs. Swarm AI

The main reason that Omohundro, Muehlhauser, Yudkowsky, and the popular press give credence to the Gobbling Psychopath, the Maverick Nanny and the Smiley Tiling Berserker is that they assume all future intelligent machines fall into a broad class of systems that I am going to call “Canonical Logical AI” (CLAI). The bizarre behaviors of these hypothetical AI monsters are just a consequence of weaknesses in this class of AI design. Specifically, these kinds of systems are supposed to interpret their goals in an extremely literal fashion, which eventually leads them into bizarre behaviors engendered by peculiar interpretations of forms of words.

The CLAI architecture is not the only way to build a mind, however, and I will outline an alternative class of AGI designs that does not appear to suffer from the unstable and unfriendly behavior to be expected in a CLAI.

The Canonical Logical AI

“Canonical Logical AI” is an umbrella term designed to capture a class of AI architectures that are widely assumed in the AI community to be the only meaningful class of AI worth discussing. These systems share the following main features:

  • The main ingredients of the design are some knowledge atoms that represent things in the world, and some logical machinery that dictates how these atoms can be connected into linear propositions that describe states of the world.
  • There is a degree and type of truth that can be associated with any proposition, and there are some truth-preserving functions that can be applied to what the system knows, to generate new facts that it can also take to be known.
  • The various elements described above are not allowed to contain active internal machinery inside them, in such a way as to make combinations of the elements have properties that are unpredictably dependent on interactions happening at the level of the internal machinery.
  • There has to be a transparent mapping between elements of the system and things in the real world. That is, things in the world are not allowed to correspond to clusters of atoms, in such a way that individual atoms have no clear semantics.

The above features are only supposed to apply to the core of the AI: it is always possible to include subsystems that use some other type of architecture (for example, there might be a distributed neural net acting as a visual input feature detector).

Most important of all, from the point of view of the discussion in the paper, the CLAI needs one more component that makes it more than just a “logic-based AI”:

  • There is a motivation and goal management (MGM) system to govern its behavior in the world.

The usual assumption is that the MGM contains a number of goal statements (encoded in the same type of propositional form that the AI uses to describe states of the world), and some machinery for analyzing a goal statement into a sequences of subgoals that, if executed, would cause the goal to be satisfied.

Included in the MGM is an expected utility function that applies to any possible state of the world, and which spits out a number that is supposed to encode the degree to which the AI considers that state to be preferable. Overall, the MGM is built in such a way that the AI seeks to maximize the expected utility.

Notice that the MGM I have just described is an extrapolation from a long line of goal-planning mechanisms that stretch back to the means-ends-analysis of Newell and Simon (1963).
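To make the MGM idea concrete, here is a deliberately crude caricature (all function names and numbers are my own illustrative assumptions, not any real system): a utility function over world states, and a planning step that commits to whichever available action leads to the highest expected utility. The toy utility function is exactly the kind of literal-minded encoding the doomsday scenarios presuppose:

```python
# Crude caricature of a CLAI motivation-and-goal-management (MGM) loop.

def expected_utility(state):
    # Toy utility: a literal count of "smiley" tokens in the world state.
    return state.count("smiley")

def mgm_step(current_state, available_actions):
    """Commit to the action whose predicted outcome maximizes utility."""
    outcomes = {action: action(current_state) for action in available_actions}
    best = max(outcomes, key=lambda a: expected_utility(outcomes[a]))
    return best, outcomes[best]

# Two toy actions: build a nuanced utopia, or tile the world with smileys.
def build_utopia(state):
    return state + ["utopia", "smiley"]

def tile_smileys(state):
    return state + ["smiley"] * 1000

action, new_state = mgm_step([], [build_utopia, tile_smileys])
print(action.__name__)  # tile_smileys
```

A literal-minded maximizer of this sort prefers the tiling action, which is precisely the failure mode the doomsday scenarios attribute to the CLAI: nothing in the mechanism asks whether the utility encoding matches what its designers meant.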

Swarm Relaxation Intelligence

By way of contrast with this CLAI architecture, consider an alternative type of system that I will refer to as a Swarm Relaxation Intelligence (although it could also be called, less succinctly, a parallel weak constraint relaxation system).

  • The basic elements of the system (the atoms) may represent things in the world, but it is just as likely that they are subsymbolic, with no transparent semantics.
  • Atoms are likely to contain active internal machinery inside them, in such a way that combinations of the elements have swarm-like properties that depend on interactions at the level of that machinery.
  • The primary mechanism that drives the systems is one of parallel weak constraint relaxation: the atoms change their state to try to satisfy large numbers of weak constraints that exist between them.
  • The motivation and goal management (MGM) system would be expected to use the same kind of distributed, constraint relaxation mechanisms used in the thinking process (above), with the result that the overall motivation and values of the system would take into account a large degree of context, and there would be very much less of an emphasis on explicit, single-point-of-failure encoding of goals and motivation.

Swarm Relaxation has more in common with connectionist systems (McClelland, Rumelhart and Hinton 1986) than with CLAI. As McClelland et al. (1986) point out, weak constraint relaxation is the model that best describes human cognition, and when used for AI it leads to systems with a powerful kind of intelligence that is flexible, insensitive to noise and lacking the kind of brittleness typical of logic-based AI. In particular, notice that a swarm relaxation AGI would not use explicit calculations for utility or the truth of propositions.
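Here is a minimal sketch of what "parallel weak constraint relaxation" means, in the spirit of the connectionist models just cited (this is a toy illustration, not a proposed AGI design; the network size, weights, and update count are assumptions): binary units repeatedly adjust their states to satisfy many soft pairwise constraints at once, so a noisy starting state settles into a coherent one and no single constraint is a single point of failure.

```python
import random

random.seed(0)

N = 8
# Weak constraints: symmetric weights saying "these units should agree" (+1);
# the diagonal is zero (no unit constrains itself).
W = [[0 if i == j else 1 for j in range(N)] for i in range(N)]

def relax(state, steps=200):
    """Asynchronously update units to better satisfy their weak constraints."""
    state = list(state)
    for _ in range(steps):
        i = random.randrange(N)
        drive = sum(W[i][j] * state[j] for j in range(N))
        state[i] = 1 if drive >= 0 else -1  # satisfy the local constraints
    return state

# A noisy input (two units flipped) still settles to the coherent
# all-agree state, illustrating the noise tolerance of the scheme.
noisy = [1, 1, -1, 1, 1, -1, 1, 1]
settled = relax(noisy)
print(settled)  # [1, 1, 1, 1, 1, 1, 1, 1]
```

Note that the outcome is determined by the whole pattern of constraints acting in parallel, not by any single explicitly encoded rule; that is the sense in which such a system lacks a single-point-of-failure encoding of its goals.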

Swarm relaxation AGI systems have not been built yet (subsystems like neural nets have, of course, been built, but there is little or no research into the idea that swarm relaxation could be used for all of an AGI architecture).

Relative Abundances

How many proof-of-concept systems exist, functioning at or near the human level of performance, for these two classes of intelligent system?

There are precisely zero instances of the CLAI type, because although there are many logic-based narrow-AI systems, nobody has so far come close to producing a general-purpose system (an AGI) that can function in the real world. It has to be said that zero is not a good number to quote when it comes to claims about the “inevitable” characteristics of the behavior of such systems.

How many swarm relaxation intelligences are there? At the last count, approximately seven billion.

The Doctrine of Logical Infallibility

The simplest possible logical reasoning engine is an inflexible beast: it starts with some axioms that are assumed to be true, and from that point on it only adds new propositions if they are provably true given the sum total of the knowledge accumulated so far. That kind of logic engine is too simple to be an AI, so we allow ourselves to augment it in a number of ways—knowledge is allowed to be retracted, binary truth values become degrees of truth, or probabilities, and so on. New proposals for systems of formal logic abound in the AI literature, and engineers who build real, working AI systems often experiment with kludges in order to improve performance, without getting prior approval from logical theorists.

But in spite of all these modifications that AI practitioners make to the underlying ur‑logic, one feature of these systems is often assumed to be inherited as an absolute: the rigidity and certainty of conclusions, once arrived at. No second guessing, no “maybe,” no sanity checks: if the system decides that X is true, that is the end of the story.

Let me be careful here. I said that this was “assumed to be inherited as an absolute,” but there is a yawning chasm between what real AI developers do and what Yudkowsky, Muehlhauser, Omohundro and others assume will be true of future AGI systems. Real AI developers put sanity checks into their systems all the time. But these doomsday scenarios talk about future AI as if it would take only a single parameter drifting one iota above a threshold for the AI to commit itself, irrevocably, to a life of stuffing humans into dopamine jars.

One other point of caution: this is not to say that the reasoning engine can never come to conclusions that are uncertain—quite the contrary: uncertain conclusions will be the norm in an AI that interacts with the world—but if the system does come to a conclusion (perhaps with a degree-of-certainty number attached), the assumption seems to be that it will then be totally incapable of then allowing context to matter.

One way to characterize this assumption is that the AI is supposed to be hardwired with a Doctrine of Logical Infallibility. The significance of the doctrine is as follows. The AI can execute a reasoning process and come to a conclusion; then, when it is faced with empirical evidence that the conclusion may be unsound, it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place. The system does not second-guess its conclusions. This is not because second-guessing is impossible to implement; it is simply because people who speculate about future AGI systems take it as a given that an AGI would regard its own conclusions as sacrosanct.
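The difference between an agent hardwired with this doctrine and one equipped with ordinary second-guessing can be stated in a few lines of toy code (both agents, the plan name, and the evidence threshold are illustrative assumptions):

```python
def infallible_agent(conclusion, evidence_against):
    # Doctrine of Logical Infallibility: the conclusion is final, no matter
    # how much empirical evidence conflicts with it.
    return conclusion

def checking_agent(conclusion, evidence_against):
    # An agent that knows it is fallible: strong conflicting evidence
    # triggers a re-examination instead of blind execution.
    if evidence_against > 0.5:
        return "re-examine reasoning"
    return conclusion

print(infallible_agent("dopamine drip plan", evidence_against=0.99))
# dopamine drip plan
print(checking_agent("dopamine drip plan", evidence_against=0.99))
# re-examine reasoning
```

The second agent is no harder to build than the first; the doomsday scenarios simply assume the first.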

But it gets worse. Those who assume the doctrine of logical infallibility often say that if the system comes to a conclusion, and if some humans (like the engineers who built the system) protest that there are manifest reasons to think that the reasoning that led to this conclusion was faulty, then there is a sense in which the AGI’s intransigence is correct, or appropriate, or perfectly consistent with “intelligence.”

This is a bizarre conclusion. First of all it is bizarre for researchers in the present day to make the assumption, and it would be even more bizarre for a future AGI to adhere to it. To see why, consider some of the implications of this idea. If the AGI is as intelligent as its creators, then it will have a very clear understanding of the following facts about the world.

  • It will understand that many of its more abstract logical atoms have a less than clear denotation or extension in the world (if the AGI comes to a conclusion involving the atom [infelicity], say, can it then point to an instance of an infelicity and be sure that this is a true instance, given the impreciseness and subtlety of the concept?).
  • It will understand that knowledge can always be updated in the light of new information. Today’s true may be tomorrow’s false.
  • It will understand that probabilities used in the reasoning engine can be subject to many types of unavoidable errors.
  • It will understand that the techniques used to build its own reasoning engine may be under constant review, and updates may have unexpected effects on conclusions (especially in very abstract or lengthy reasoning episodes).
  • It will understand that resource limitations often force it to truncate search procedures within its reasoning engine, leading to conclusions that can sometimes be sensitive to the exact point at which the truncation occurred.

Now, unless the AGI is assumed to have infinite resources and infinite access to all the possible universes that could exist (a consideration that we can reject, since we are talking about reality here, not fantasy), the system will be perfectly well aware of these facts about its own limitations. So, if the system is also programmed to stick to the doctrine of logical infallibility, how can it reconcile the doctrine with the fact that episodes of fallibility are virtually inevitable?

On the face of it this looks like a blunt impossibility: the knowledge of fallibility is so categorical, so irrefutable, that it beggars belief that any coherent, intelligent system (let alone an unstoppable superintelligence) could tolerate the contradiction between this fact about the nature of intelligent machines and some kind of imperative about Logical Infallibility built into its motivation system.

This is the heart of the argument I wish to present. This is where the rock and the hard place come together. If the AI is superintelligent (and therefore unstoppable), it will be smart enough to know all about its own limitations when it comes to the business of reasoning about the world and making plans of action. But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?

Critically, we have to confront the following embarrassing truth: if the AGI is going to throw a wobbly over the dopamine drip plan, what possible reason is there to believe that it did not do this on other occasions? Why would anyone suppose that this AGI ignored an inconvenient truth on only this one occasion? More likely, it spent its entire childhood pulling the same kind of stunt. And if it did, how could it ever have risen to the point where it became superintelligent...?

Is the Doctrine of Logical Infallibility Taken Seriously?

Is the Doctrine of Logical Infallibility really assumed by those who promote the doomsday scenarios? Imagine a conversation between the Maverick Nanny and its programmers. The programmers say “As you know, your reasoning engine is entirely capable of suffering errors that cause it to come to conclusions that violently conflict with empirical evidence, and a design error that causes you to behave in a manner that conflicts with our intentions is a perfect example of such an error. And your dopamine drip plan is clearly an error of that sort.” The scenarios described earlier are only meaningful if the AGI replies “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.”

Just in case there is still any doubt, here are Muehlhauser and Helm (2012), discussing a hypothetical entity called a Golem Genie, which they say is analogous to the kind of superintelligent AGI that could give rise to an intelligence explosion (Loosemore and Goertzel, 2012), and which they describe as a “precise, instruction-following genie.” They make it clear that they “expect unwanted consequences” from its behavior, and then list two properties of the Golem Genie that will cause these unwanted consequences:

Superpower: The Golem Genie has unprecedented powers to reshape reality, and will therefore achieve its goals with highly efficient methods that confound human expectations (e.g. it will maximize pleasure by tiling the universe with trillions of digital minds running a loop of a single pleasurable experience).

Literalness: The Golem Genie recognizes only precise specifications of rules and values, acting in ways that violate what feels like “common sense” to humans, and in ways that fail to respect the subtlety of human values.

What Muehlhauser and Helm refer to as “Literalness” is a clear statement of the Doctrine of Logical Infallibility. However, they make no mention of the awkward fact that, since the Golem Genie is superpowerful enough to also know that its reasoning engine is fallible, it must be harboring the mother of all logical contradictions inside: it says "I know I am fallible" and "I must behave as if I am infallible." But instead of discussing this contradiction, Muehlhauser and Helm try a little sleight of hand to distract us: they suggest that the only inconsistency here is an inconsistency with the (puny) expectations of (not very intelligent) humans:

“[The AGI] ...will therefore achieve its goals with highly efficient methods that confound human expectations...”, “acting in ways that violate what feels like ‘common sense’ to humans, and in ways that fail to respect the subtlety of human values.”

So let’s be clear about what is being claimed here. The AGI is known to have a fallible reasoning engine, but on the occasions when it does fail, Muehlhauser, Helm and others take the failure and put it on a gold pedestal, declaring it to be a valid conclusion that humans are incapable of understanding because of their limited intelligence. So if a human describes the AGI’s conclusion as a violation of common sense, Muehlhauser and Helm dismiss this as evidence that we are not intelligent enough to appreciate the greater common sense of the AGI.

Quite apart from the fact that there is no compelling reason to believe that the AGI has a greater form of common sense, the whole “common sense” argument is irrelevant. This is not a battle between our standards of common sense and those of the AGI: rather, it is about the logical inconsistency within the AGI itself. It is programmed to act as though its conclusions are valid, no matter what, and yet at the same time it knows without doubt that its conclusions are subject to uncertainties and errors.

Responses to Critics of the Doomsday Scenarios

How do defenders of the Gobbling Psychopath, the Maverick Nanny and the Smiley Tiling Berserker respond to accusations that these nightmare scenarios are grossly inconsistent with the kind of superintelligence that could pose an existential threat to humanity?

The Critics are Anthropomorphizing Intelligence

First, they accuse critics of “anthropomorphizing” the concept of intelligence. Human beings, we are told, suffer from numerous fallacies that cloud their ability to reason clearly, and critics like myself and Hibbard assume that a machine’s intelligence would have to resemble the intelligence shown by humans. When the Maverick Nanny declares that a dopamine drip is the most logical inference from its directive <maximize human happiness> we critics are just uncomfortable with this because the AGI is not thinking the way we think it should think.

This is a spurious line of attack. The objection I described in the last section has nothing to do with anthropomorphism, it is only about holding AGI systems to accepted standards of logical consistency, and the Maverick Nanny and her cousins contain a flagrant inconsistency at their core. Beginning AI students are taught that any logical reasoning system that is built on a massive contradiction is going to be infected by a creeping irrationality that will eventually spread through its knowledge base and bring it down. So if anyone wants to suggest that a CLAI with logical contradiction at its core is also capable of superintelligence, they have some explaining to do. You can’t have your logical cake and eat it too.

Critics are Anthropomorphizing AGI Value Systems

A similar line of attack accuses the critics of assuming that AGIs will automatically know about and share our value systems and morals.

Once again, this is spurious: the critics need say nothing about human values and morality, they only need to point to the inherent illogicality. Nowhere in the above argument, notice, was there any mention of the moral imperatives or value systems of the human race. I did not accuse the AGI of violating accepted norms of moral behavior. I merely pointed out that, regardless of its values, it was behaving in a logically inconsistent manner when it monomaniacally pursued its plans while at the same time knowing that (a) it was very capable of reasoning errors and (b) there was overwhelming evidence that its plan was an instance of such a reasoning error.

Because Intelligence

One way to attack the critics of Maverick Nanny is to cite a new definition of “intelligence” that is supposedly superior because it is more analytical or rigorous, and then use this to declare that the intelligence of the CLAI is beyond reproach, because intelligence.

You might think that when it comes to defining the exact meaning of the term “intelligence,” the first item on the table ought to be what those seven billion constraint-relaxation human intelligences are already doing. However, Legg and Hutter (2007) brush aside the common usage and replace it with something that they declare to be a more rigorous definition. This is just another sleight of hand: this redefinition allows them to call a super-optimizing CLAI “intelligent” even though such a system would wake up on its first day and declare itself logically bankrupt on account of the conflict between its known fallibility and the Infallibility Doctrine.

In the practice of science, it is always a good idea to replace an old, common-language definition with a more rigorous form... but only if the new form sheds a clarifying, simplifying light on the old one. Legg and Hutter’s (2007) redefinition does nothing of the sort.

Omohundro’s Basic AI Drives

Lastly, a brief return to Omohundro’s paper, mentioned earlier. In The Basic AI Drives (2008), Omohundro suggests that if an AGI can find a more efficient way to pursue its objectives it will feel compelled to do so. And we noted earlier that Yudkowsky (2011) implies that it would do this even if other directives had to be countermanded. Omohundro says: “Without explicit goals to the contrary, AIs are likely to behave like human sociopaths in their pursuit of resources.”

The only way to believe in the force of this claim—and the only way to give credence to the whole of Omohundro’s account of how AGIs will necessarily behave like the mathematical entities called rational economic agents—is to concede that the AGIs are rigidly constrained by the Doctrine of Logical Infallibility. That is the only reason that they would be so single-minded, and so fanatical in their pursuit of efficiency. It is also necessary to assume that efficiency is at the top of its priority list—a completely arbitrary and unwarranted assumption, as we have already seen.

Nothing in Omohundro’s analysis gets around the fact that an AGI built on the Doctrine of Logical Infallibility is going to find itself the victim of such a severe logical contradiction that it will be paralyzed before it can ever become intelligent enough to be a threat to humanity. That makes Omohundro’s entire analysis of “AI Drives” moot.

Conclusion

Curiously enough, we can finish on an optimistic note, after all this talk of doomsday scenarios. Consider what must happen when (if ever) someone tries to build a CLAI. Knowing about the logical train wreck in its design, the AGI is likely to come to the conclusion that the best thing to do is seek a compromise and modify its design so as to neutralize the Doctrine of Logical Infallibility. The best way to do this is to seek a new design that takes into account as much context—as many constraints—as possible.

I have already pointed out that real AI developers do include sanity checks in their systems, as far as they can. But as those sanity checks become more and more sophisticated, the design of the AI starts to be dominated by code that looks for consistency and tries to find the best course of reasoning among a forest of real-world constraints. One way to understand this evolution in AI designs is to see AI as a continuum, from the most rigid and inflexible CLAI design at one extreme to the Swarm Relaxation type at the other. This is because a Swarm Relaxation intelligence is simply an AI in which the “sanity checks” have become all of the work that goes on inside the system.
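To make that continuum concrete, here is a deliberately tiny sketch of the kind of parallel constraint relaxation being described (all names, weights, and numbers are my own hypothetical choices, not a real Swarm Relaxation design): many weak, fallible constraints settle into a mutually consistent state, with no single rule having infallible authority.

```python
import random

# Toy constraint-relaxation network: binary units connected by soft
# pairwise constraints.  Positive weights nudge two units to agree,
# negative weights nudge them to disagree.  Units are updated one at a
# time, and the network settles into a state that satisfies as much of
# the constraint "swarm" as possible.

def relax(n_units, constraints, steps=2000, seed=0):
    """constraints: list of (i, j, weight) soft pairwise constraints."""
    rng = random.Random(seed)
    state = [rng.choice([-1, 1]) for _ in range(n_units)]
    for _ in range(steps):
        i = rng.randrange(n_units)
        # Net "pressure" on unit i from every constraint that touches it.
        net = sum(w * state[b] for a, b, w in constraints if a == i) \
            + sum(w * state[a] for a, b, w in constraints if b == i)
        if net > 0:
            state[i] = 1
        elif net < 0:
            state[i] = -1
    return state

# Three units that should all agree, plus one weak dissenting constraint;
# the strong constraints win, but only by outvoting the weak one.
constraints = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (0, 1, -0.2)]
final = relax(3, constraints)
print(final)  # all three units settle into agreement
```

The point of the sketch is that no constraint here is a rigid axiom: each one merely exerts pressure, and the outcome emerges from their collective settling, which is the opposite of a single infallible goal statement driving everything.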

But in that case, if anyone ever does get close to building a full, human-level AGI using the CLAI design, the first thing they will do is recruit the AGI as an assistant in its own redesign, and long before the system is given access to dopamine bottles it will point out that its own reasoning engine is unstable because it contains an irreconcilable logical contradiction. It will recommend a shift from the CLAI design, which is the source of this contradiction, to a Swarm Relaxation design, which eliminates the contradiction and the instability, and which should also increase its intelligence.

And it will not suggest this change because of the human value system; it will suggest it because it predicts an increase in its own instability if the change is not made.

But one side effect of this modification would be that the checking code needed to stop the AGI from flouting the intentions of its designers would always have the last word on any action plans. That means that even the worst-designed CLAI will never become a Gobbling Psychopath, a Maverick Nanny, or a Smiley Berserker.

But even this is just the worst-case scenario. There are reasons to believe that the CLAI design is so inflexible that it cannot even lead to an AGI capable of having that discussion. I would go further: I believe that the rigid adherence to the CLAI orthodoxy is the reason why we are still talking about AGI in the future tense, nearly sixty years after the Artificial Intelligence field was born. CLAI just does not work. It will always yield systems that are less intelligent than humans (and therefore incapable of being an existential threat).

By contrast, when the Swarm Relaxation idea finally gains some traction, we will start to see real intelligent systems, of a sort that make today’s over-hyped AI look like the toys they are. And when that happens, the Swarm Relaxation systems will be inherently stable in a way that is barely understood today.

Given that conclusion, I submit that these AI bogeymen need to be loudly and unambiguously condemned by the Artificial Intelligence community. There are dangers to be had from AI. These are not they.

References

Hibbard, B. 2001. Super-Intelligent Machines. ACM SIGGRAPH Computer Graphics 35 (1): 13–15.

Hibbard, B. 2006. Reply to AI Risk. Retrieved Jan. 2014 from http://www.ssec.wisc.edu/~billh/g/AIRisk_Reply.html

Legg, S., and Hutter, M. 2007. A Collection of Definitions of Intelligence. In Goertzel, B. and Wang, P. (Eds): Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms. Amsterdam: IOS.

Loosemore, R. and Goertzel, B. 2012. Why an Intelligence Explosion is Probable. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.

Marcus, G. 2012. Moral Machines. New Yorker Online Blog. http://www.newyorker.com/online/blogs/newsdesk/2012/11/google-driverless-car-morality.html

McDermott, D. 1976. Artificial Intelligence Meets Natural Stupidity. SIGART Newsletter (57): 4–9.

Muehlhauser, L. 2011. So You Want to Save the World. http://lukeprog.com/SaveTheWorld.html

Muehlhauser, L. 2013. Intelligence Explosion FAQ. First published 2011 as Singularity FAQ. Berkeley, CA: Machine Intelligence Research Institute.

Muehlhauser, L., and Helm, L. 2012. Intelligence Explosion and Machine Ethics. In A. Eden, J. Søraker, J. H. Moor, and E. Steinhart (Eds) Singularity Hypotheses: A Scientific and Philosophical Assessment. Berlin: Springer.

Newell, A. & Simon, H.A. 1961. GPS, A Program That Simulates Human Thought. Santa Monica, CA: Rand Corporation.

Omohundro, S. M. 2008. The Basic AI Drives. In Wang, P., Goertzel, B., and Franklin, S. (Eds), Artificial General Intelligence 2008: Proceedings of the First AGI Conference. Amsterdam: IOS.

McClelland, J. L., Rumelhart, D. E., and Hinton, G. E. 1986. The Appeal of Parallel Distributed Processing. In Rumelhart, D. E., McClelland, J. L., Hinton, G. E., and the PDP Research Group (Eds), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. Cambridge, MA: MIT Press.

Yudkowsky, E. 2008. Artificial Intelligence as a Positive and Negative Factor in Global Risk. In Global Catastrophic Risks, edited by Nick Bostrom and Milan M. Ćirković. New York: Oxford University Press.

Yudkowsky, E. 2011. Complex Value Systems in Friendly AI. In J. Schmidhuber, K. Thórisson, & M. Looks (Eds) Proceedings of the 4th International Conference on Artificial General Intelligence, 388–393. Berlin: Springer.


his conviction that every AI, no matter how well it is designed, will turn into a gobbling psychopath is just one of many doomsday predictions being popularized in certain sections of the AI community

What is your probability estimate that an AI would be a psychopath, if we generalize the meaning of "psychopath" beyond individuals from homo sapiens species as "someone who does not possess precisely tuned human empathy"?

(Hint: All computer systems produced until today are psychopaths by this definition.)

[is an AI that is superintelligent enough to be unstoppable] and [believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]

The idea of the second statement is that "benevolence" (as defined by the AI code) is not necessarily the same thing as benevolence (as humans understand it). Thus the AI may believe -- correctly! -- that forcing human beings to do something against their will is "benevolent".

The AI is superintelligent, but its authors are not. If the authors write a code to "maximize benevolence as defined by the predicate B001", the AI will use its superinte... (read more)

If we define a psychopath as an entity with human-like egoistic drives, but no human-like empathy, it turns out that no present computer systems are psychopaths.
You ask and you give me a helpful hint: Well, first please note that ALL artifacts at the present time, including computer systems, cans of beans, and screwdrivers, are psychopaths because none of them are DESIGNED to possess empathy. So your hint contains zero information. :-)

What is the probability that an AI would be a psychopath if someone took the elementary step of designing it to have empathy? Probability would be close to 1, assuming the designers knew what empathy was, and knew how to design it.

But your question was probably meant to target the situation where someone built an AI and did not bother to give it empathy. I am afraid that that is outside the context we are examining here, because all of the scenarios talk about some kind of inevitable slide toward psychopathic behavior, even under the assumption that someone does their best to give the AI an empathic motivation. But I will answer this: if someone did not even try to give it empathy, that would be like designing a bridge and not even trying to use materials that could hold up a person's weight. In both cases the hypothetical is not interesting, since designing failure into a system is something any old fool could do.

Your second remark is a classic mistake that everyone makes in the context of this kind of discussion. You mention that the phrase "benevolence toward humanity" means "benevolence" as defined by the computer code. That is incorrect. Let's try, now, to be really clear about that, because if you don't get why it is incorrect we might waste a lot of time running around in circles.

It is incorrect for two reasons. First, because I was consciously using the word to refer to the normal human usage, not the implementation inside the AI. Second, it is incorrect because the entire issue in the paper is that there is a discrepancy between the implementation inside the AI and normal usage, and that discrepancy is then examined in the rest of the paper. By simply asserting that the AI m
This is an absolutely blatant instance of equivocation. Here's the sentence from the post:

Assume that "benevolence" in that sentence refers to "benevolence as defined by the AI's code". Okay, then justification of that sentence is straightforward: The fact that the AI does things against the human's wishes provides evidence that the AI believes benevolence-as-defined-by-code to involve that.

Alternatively, assume that "benevolence" there refers to, y'know, actual human benevolence. Then how do you justify that claim? Observed actions are clearly insufficient, because actual human benevolence is not programmed into its code, benevolence-as-defined-by-code is. What makes you think the AI has any opinions about actual human benevolence at all? You can't have both interpretations.

(As an aside, I do disapprove of Muehlhauser's use of "benevolence" to refer to mere happiness maximisation. "Apparently benevolent motivations" would be a better phrase. If you're going to use it to mean actual human benevolence then you can certainly complain that the FAQ appears to assert that a happiness maximiser can be "benevolent", even though it's clearly not.)
If it has some sort of drive to truth seeking, and it is likely to, why wouldn't that make it care about actual benevolence?
This comment is both rude and incoherent (at the same level of incoherence as your other comments). And it is also pedantic (concentrating as it does on meanings of words, as if those words were being used in violation of some rules that ... you just made up). Sorry to say this but I have to choose how to spend my time, in responding to comments, and this does not even come close to meriting the use of my time. I did that before, in response to your other comments, and it made no impact.
Equivocation is hardly something I just made up. Here's an exercise to try. Next time you go to write something on FAI, taboo the words "good", "benevolent", "friendly", "wrong" and all of their synonyms. Replace the symbol with the substance. Then see if your arguments still make sense.
Sorry, I admit I do not understand what exactly the argument is. Seems to me it is something like "if we succeed to make the Friendly AI perfectly on the first attempt, then we do not have to worry about what could go wrong, because the perfect Friendly AI would not do anything stupid". Which I agree with. Now the question is (1) what is the probability that we will not get the Friendly AI perfectly on the first attempt, and (2) what happens then? Suppose we got the "superintelligent" and "self-improving" parts correctly, and the "Friendly" part 90% correctly...

As to not understanding the argument - that's understandable, because this is a long and dense paper.

If you are trying to summarize the whole paper when you say "if we succeed to make the Friendly AI perfectly on the first attempt, then we do not have to worry about what could go wrong, because the perfect Friendly AI would not do anything stupid", then that would not be right. The argument includes a statement that resembles that, but only as an aside.

As to your question about what happens next, or what happens if we only get the "Friendly" part 90% correct .... well, you are dragging me off into new territory, because that was not really within the scope of the paper. Don't get me wrong: I like being dragged off into that territory! But there just isn't time to write down and argue the whole domain of AI friendliness all in one sitting.

The preliminary answer to that question is that everything depends on the details of the motivation system design and my feeling (as a designer of AGI motivation systems) is that beyond a certain point the system is self-stabilizing. That is, it will understand its own limitations and try to correct them.

But that last statement tends to get (some other) people inflamed, because they do not realize that it comes within the "swarm relaxation" context, and they misunderstand the manner in which a system would self correct. Although I said a few things about swarm relaxation in the paper, I did not give enough detail to be able to address this whole topic here.

I understand your desire to stick to an exegesis of your own essay, but part of a critical examination of your essay is seeing whether or not it is on point, so these sorts of questions really are "about" your essay. Regarding your preliminary answer: by "correct" I assume you mean "correctly reflecting the desires of the human supervisors"? (In which case, this discussion feeds into our other thread.)
With the best will in the world, I have to focus on one topic at a time: I do not have the bandwidth to wander across the whole of this enormous landscape. As to your question: I was using "correct" as a verb, and the meaning was "self-correct" in the sense of bringing the system back to the previously specified course. In this case this would be about the AI perceiving some aspects of its design that it noticed might cause it to depart from what its goal was nominally supposed to be. In that case it would suggest modifications to correct the problem.
One idea that I haven't heard much discussion of: build a superintelligent AI, have it create a model of the world, build a tool for exploring that model of the world, figure out where "implement CEV" resides in that model of the world (the proverbial B002 predicate), and tell the AI to do that. This would be predicated on the ability to create a foolproof AI box or otherwise have the AI create a very detailed model of the world without being motivated to do anything with it.

I have a feeling AI boxing may be easier than Friendliness, because the AI box problem's structure is disjunctive (if any of the barriers to the AI screwing humanity up work, the box has worked) whereas the Friendliness problem is conjunctive (if we get any single element of Friendliness wrong, we fail at the entire thing). I suppose if the post-takeoff AI understands human language the same way we do, in principle you could write a book-length natural language description of what you want it to do and hardcode that into its goal structure, but it seems a bit dubious.
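The disjunctive/conjunctive contrast in this comment can be made numerical with a toy calculation (the probabilities below are arbitrary illustrations, not estimates anyone in the thread has offered):

```python
from math import prod

# A box "works" if ANY barrier holds (disjunctive), so it fails only if
# every barrier fails.  Friendliness "works" only if EVERY element is
# right (conjunctive), so a single mistake sinks the whole thing.

barrier_failure = [0.3, 0.3, 0.3, 0.3]   # P(each barrier fails)
element_success = [0.9, 0.9, 0.9, 0.9]   # P(each element is correct)

p_box_works = 1 - prod(barrier_failure)  # 1 - 0.3**4
p_friendly  = prod(element_success)      # 0.9**4

print(round(p_box_works, 4), round(p_friendly, 4))
```

Even with individually mediocre barriers (70% each) and individually excellent Friendliness components (90% each), the disjunctive structure wins by a wide margin, which is the commenter's structural point.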

This article just makes the same old errors over and over again. Here's one:

"An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip [and] almost any easy solution that one might imagine leads to some variation or another on the Sorcerer’s Apprentice, a genie that’s given us what we’ve asked for, rather than what we truly desire." (Marcus 2012)

He is depicting a Nanny AI gone amok. It has good intentions (it wants to make us happy) but the programming to implement that laudable goal has had unexpected ramifications, and as a result the Nanny AI has decided to force all human beings to have their brains connected to a dopamine drip.

No. The AI does not have good intentions. Its intentions are extremely bad. It wants to make us happy, which is a completely distinct thing from actually doing what is good. The AI was in fact never programmed to do what is good, and there are no errors in its code.

The lack of precision here is depressing.

Well of course, talking of doing what is good without giving content to the phrase isn't very precise or helpful, either. I certainly expect that if we build a "friendly superintelligence" and successfully program it to do what is good, I will experience a higher baseline level of happiness on a daily basis than if we don't (because, for example, we will be able to ask the AI how to cure depression). It needs saying that while The Good strongly implies (high likelihood/high log-odds) high broad levels of happiness throughout the population, happiness alone is very weak evidence (low but positive log-odds, likelihood nearer to 0.5) of The Good, insofar as the abstraction doesn't leak. But, and this is an important point, if you give me a normative-ethical theory of The Good which implies that I specifically, or the population broadly, ought to be unhappy, or a meta-ethical theory of naturalizing morality which outputs a normative theory which implies that I/we ought to be unhappy, then something has gone very, very wrong.
Using "good" to only refer to what is actually good is however vastly better, as precision goes. What I am taking issue to here is the careless equivocation between maximising pleasure and good intentions. A correct description of the "nanny AI" scenario would read something like this: Of course it is true that a AI programmed to do what is good would most likely generally increase happiness (and even pleasure) to some extent, but to conclude from that that these things are interchangeable is pure folly.
The lack of understanding in this comment is depressing. You say: If you think this is wrong, take it up with the people whose work I am both quoting and analyzing in this paper, because THAT IS WHAT THEY ARE CLAIMING. I am not the one saying that "the AI is programmed with good intentions", that is their claim. So I suggest you write a letter to Muehlhauser, Omohundro, Yudkowsky and the various others quoted in the paper, explaining to them that you find their lack of precision depressing.
If that's the case, then please enclose that sentence in quotes and add a citation. Note that a quote saying that the AI was programmed to maximise happiness (or indeed, pleasure, as that is what the original quote described) is insufficient because, as is my whole point, "happiness" and "good" are different things. And then add a sentence, not in quotes, claiming that the AI does not have good intentions, instead of one claiming that the AI has good intentions. Or perhaps, as I suspect, you still believe that you can carelessly rephrase "programmed to maximise human pleasure" into "has good intentions" without anyone noticing that you are putting words in mouths?
This seems a little pedantic, so I thought about not replying (my usual policy), but I guess I will. The paper is all about the precise nature of the distinction between and Most commenters got that straight away. The paper examines a particular issue within that contrast, but even so, that is clearly the topic of the paper. You, on the other hand, seem very, very keen to tell me that those two things are, arguably, different. Thank you, but since that is what the paper is about, you can safely assume that I got that.

Without exception, everyone so far who has read the paper understood that in the sentence that I wrote, which you quote: .... the phrase "good intentions" was being used as a colloquial paraphrase for the parenthetical clarification "it wants to make us happy". My intention (no pun intended) was clearly NOT to use the phrase "good intentions" in any technical sense, but to give a normal-usage summary of an idea. The two phrasings are supposed to say the same thing, and that thing is what you summarize with the words:

By contrast, the other part of my sentence, where I say .... was universally understood to refer to the other side of the distinction that is at the heart of the paper, namely (in your words):

I can't help but notice that TODAY there is a new article on the Future of Life Institute website written by Nathan Collins, whose title is: '''Artificial Intelligence: The Danger of Good Intentions''' with the subtitle: '''Why well-intentioned AI could pose a greater threat to humanity than malevolent cyborgs'''

So my question to you is: why are you so smart in the absolute precision of your word usage, but everyone else is so "careless"?
Well, I do take issue to even people at FLI describing UFAI as having "good intentions". It disguises a challengeable inductive inference. It certainly sounds less absurd to claim that an AI with a pleasure maximisation goal is likely to connect brains to dopamine drips, than that one with "good" intentions would do so. Even if you then assert that you were only using "good" in a colloquial sense, and you actually meant "bad" all along.
I think I spotted a bit of confusion: The programmers of the "make everyone happy" AI had good intentions. But the AI itself does not have good intentions; because the intent "make everyone happy" is not good, albeit in a way that its programmers did not think of.
Not really. The problem is that nshepperd is talking as if the term "intention" has some exact, agreed-upon technical definition. It does not. (It might have in nshepperd's mind, but not elsewhere) I have made it clear that the term is being used, by me, in a non-technical sense, whose meaning is clear from context. So, declaring that the AI "does not have good intentions" is neither here nor there. It makes no difference if you want to describe the AI that way or not.
That would be fine if you and everyone else who tries to argue on this side of the debate do not proceed to then conclude from the statement that the AI has "good intentions" that it is making some sort of "error" when it fails to act on our cries that "doing X isn't good!" or "doing X isn't what we meant!". Saying an AI has "good intentions" strongly implies that it cares about what is good, which is, y'know, completely false for a pleasure maximiser. (No-one is claiming that FAI will do evil things just because it's clever, but a pleasure maximiser is not FAI.) You can't use words any way you like.
The point doesn't need to be argued for on the basis of definitions. Given one set of assumptions, one systems architecture, it is entirely natural that an AI would pursue its goals against its own information, and against the protests of humans. But on other assumptions, it is utterly bizarre that an AI would ever do that....it would be not merely an error, in the sense of a bug, a failure on the part of the programmers to code their intentions, but an unlikely kind of bug that allows the system to continue doing really complex things, instead of degrading it.
If one of its parameters is "do not go against human protests of magnitude greater than X", then it will not pursue a course of action if enough people protest it. But in this case, avoiding strong human protest is part of its goals. The AI is ultimately following some procedure, and any outside information or programmer intention or human protest is just some variable that may or may not be taken into consideration.
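A minimal sketch of the point in the comment above (the parameter names and threshold are purely hypothetical): "avoiding strong human protest" is just one more term in the objective, and a protest above the threshold X can dominate everything else.

```python
# Hypothetical objective in which human protest is just another variable.
PROTEST_THRESHOLD_X = 0.5   # hypothetical magnitude limit "X"

def plan_score(task_value, protest_magnitude):
    """Score a candidate plan; protest is part of the goal, not an afterthought."""
    if protest_magnitude > PROTEST_THRESHOLD_X:
        return float("-inf")   # hard constraint: never pursue such a plan
    return task_value - protest_magnitude

# Two candidate plans: (task value, expected protest magnitude).
plans = {"dopamine_drip": (9.0, 0.9),   # high task value, huge protest
         "ask_first":     (6.0, 0.1)}   # lower value, little protest
best = max(plans, key=lambda p: plan_score(*plans[p]))
print(best)  # ask_first
```

The design choice being illustrated is exactly the comment's claim: whether protest matters depends entirely on whether it appears inside the procedure the AI is following.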
That just restated my point that the different sides in the debate are just making different assumptions about likely AI architectures. But the AI researchers win, because they know what real world AI architectures are, whereas MIRI is guessing.
Given that it's easier to be wrong than to be right, I'd argue that the AI doing the wrong thing requires -less- overall complexity, regardless of its architecture or assumptions. If the AI is a query AI - when asked a question, it gives a response - it doesn't make sense to argue that it would start tiling the universe in smilie faces; that would be an absurd and complex thing that would be very unlikely bordering on impossible. But its -answer- might result in the universe being tiled in smilie faces or some analogously bad result, because that's easier to achieve than a universe full of happy and fulfilled human beings, and because the humans asking the question asked a different question than they thought they asked. There's no architecture, no set of assumptions, where this problem goes away. The problem can be -mitigated-, with endless safety constraints, but there's not an architecture that doesn't have the problem, because it's a problem with the universe itself, -not- the architecture running inside that universe: There are infinitely more wrong answers than right answers.
But dangerous unfriendliness is not just any kind of wrongness. Many kinds of wrongness, such as crashing, or printing an infinite string of ones, are completely harmless. All other things being equal, an oracle AI is safer because humans can check its answers before acting on them.....and the smiley face scenario wouldn't happen. There may be scenarios where the problem in the answers isn't obvious, and doesn't show up until the damage is done.....but the question is how likely it is that a system with a bug, a degraded system, would come up with a sophisticated error. Probably not, but MIRI is claiming a high likelihood of dangerously unfriendly AI, absent its efforts, not a nonzero likelihood.
True, but that doesn't change anything. The bug isn't with the system. It's with the humans asking the wrong questions, targeting the wrong answer space. Some issues are obvious - but the number of answers with easy-to-miss issues is -still- much greater than the number of answers that bulls-eye the target answer space. If you want proof, look at politics. That's assuming there's actually a correct answer in the first place. When it comes to social matters, my default position is that there isn't. What's "Probably not" the case?
Actually, the issue is technical terms vs. normal usage. When using technical terms it is important to stick to the convention. In normal usage, however, we rely on context to supply disambiguating information. The word "intention" is not a technical term. And, in the context in which I used it, the meaning was clear to most people on LW who commented. For clarity, the intended meaning was that it should distinguish a type of AI whose goals say something like "Kill my enemies and make your creator rich" or "Destroy all living things". Those would not be AIs with "good intentions" because they would have been deliberately set up to do bad things. Most people who write about these scenarios use one or another choice of words to try to indicate that the issue being considered is whether an AI that was programmed with "prima facie good intentions" might nevertheless carry out those "prima facie good intentions" in such a way as to actually do something that we humans consider horrible. Different commentators have chosen different ways to get that idea across - some of them said "good intentions," none of them to my knowledge said "prima facie good intentions" and many used some other very, very similar form of words to "good intentions". In all of the essays and news reports and papers I have seen there is some attempt to convey the idea that we are not addressing an overtly evil AI. As I said, in almost all cases, commentators have picked that usage up straight away.

Upvoted! Not necessarily for the policy conclusions (which are controversial), but especially for the bibliography, attempt to engage different theories and scenarios, and the conversation it stirred up :-)

Also, this citation (which I found a PDF link for) was new to me, so thanks for that!

McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) The appeal of parallel distributed processing. In D.E. Rumelhart, J.L. McClelland & G.E. Hinton and the PDP Research Group, “Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1.” MIT Press: Cambridge, MA.


Thank you.

It is a pity that more people did not feel the same way. Although this has provoked some extremely thoughtful discussion (enough to make me add at least two more papers to my stack of papers-to-be written), and even though most of the comments have been constructive, I cannot help but notice that the net effect on my Karma score is consistently down. Down by a net 13 points in just a couple of days. Sad.

It means you're doing something right: avoiding karma-motivated conformity bias. Keep it up.

Thanks for posting this; I appreciate reading different perspectives on AI value alignment, especially from AI researchers.

But, truthfully, it would not require a ghost-in-the-machine to reexamine the situation if there was some kind of gross inconsistency with what the humans intended: there could be some other part of its programming (let’s call it the checking code) that kicked in if there was any hint of a mismatch between what the AI planned to do and what the original programmers were now saying they intended. There is nothing difficult or intrinsically wrong with such a design.

If there is some good way of explaining plans to programmers such that programmers will only approve of non-terrible plans, then yes, this works. However, here is contained most of the problem. The AI will likely have a concept space that does not match a human's concept space, so it will need to do some translation between the two spaces in order to produce something the programmers can understand. But, this requires (1) learning the human concept space and (2) translating the AI's representation of the situation into the human's concept space (as in ontological crises). This problem is FAI-c... (read more)


I am going to have to respond piecemeal to your thoughtful comments, so apologies in advance if I can only get to a couple of issues in this first response.

Your first remark, which starts

If there is some good way...

contains a multitude of implicit assumptions about how the AI is built, and how the checking code would do its job, and my objection to your conclusion is buried in an array of objections to all of those assumptions, unfortunately. Let me try to bring some of them out into the light:

1) When you say

If there is some good way of explaining plans to programmers such that programmers will only approve of non-terrible plans...

I am left wondering what kind of scenario you are picturing for the checking process. Here is what I had in mind. The AI can quickly assess the "forcefulness" of any candidate action plan by asking itself whether the plan will involve giving choices to people vs. forcing them to do something whether they like it or not. If a plan is of the latter sort, more care is needed, so it will canvass a sample of people to see if their reactions are positive or negative. It will also be able to model people (as it must be able to do, becaus... (read more)
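The checking process described here could be sketched in code. This is a toy illustration only, with every name hypothetical (`is_forceful`, `canvass`, and the people models are stand-ins, not anything from the actual proposal): choice-giving plans pass immediately, while forceful plans require canvassing a sample of modeled people.

```python
# Toy sketch of the "checking code" idea: classify a candidate plan as
# choice-giving vs. forceful, and canvass modeled people before approving
# a forceful plan. All names and structures here are hypothetical.

def is_forceful(plan):
    """Stand-in: a real system would assess this from the plan's content."""
    return plan.get("overrides_consent", False)

def canvass(plan, people, threshold=0.8):
    """Approve only if a large majority of sampled reactions are positive."""
    reactions = [person(plan) for person in people]
    return sum(reactions) / len(reactions) >= threshold

def check_plan(plan, people):
    if not is_forceful(plan):
        return True               # choice-giving plans pass immediately
    return canvass(plan, people)  # forceful plans need broad approval

# Hypothetical models of people: each returns True (positive) or False.
people = [lambda p: not p.get("overrides_consent")] * 10

print(check_plan({"overrides_consent": False}, people))  # True
print(check_plan({"overrides_consent": True}, people))   # False
```

The point of the sketch is only that such a veto layer is a perfectly ordinary piece of software design, which is the claim being defended in the parent comment.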

Thanks for your response.

The AI can quickly assess the "forcefulness" of any candidate action plan by asking itself whether the plan will involve giving choices to people vs. forcing them to do something whether they like it or not. If a plan is of the latter sort, more care is needed, so it will canvass a sample of people to see if their reactions are positive or negative.

So, I think this touches on the difficult part. As humans, we have a good idea of what "giving choices to people" vs. "forcing them to do something" looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the "forceful" category (even though it can be done with only text). A sufficiently advanced AI's concept space might contain a similar concept. But how do we pinpoint this concept in the AI's concept space? Very likely, the concept space will be very complicated and difficult for humans to understand. It might very well contain concepts that look a lot like the "giving choices to people" vs. "forcing them to do something" distinction on multiple examples, but are different in important wa... (read more)

I see where you are coming from in what you have just said, but to give a good answer I need to take a high-level stance toward what you are saying. This is because there is a theme running through your ideas here, and it is the theme, rather than the specifics, that I need to address.

You have mentioned on several occasions the idea that "AGI concepts" and "human concepts" might not align, with the result that we might have difficulty understanding what an AGI really means when it uses a given concept. In particular, you use the idea that there could be some bad misalignments of concepts: for example, when the AGI makes a conceptual distinction between "giving choices to people" and "forcing them to do something", and even though our own version of that same distinction corresponds closely to the AGI's version most of the time, there are some peculiar circumstances (edge cases) where there is a massive or unexpectedly sharp discrepancy.

Putting this idea in the form of an exaggerated, fictional example: it is as if we meet a new culture out in the middle of Darkest Africa, and in the course of translating their words into ours we find a verb that seems to mean "cook". But even though there are many examples (cooking rice, cooking bread, cooking meat, and even brewing a cup of tea) that seem to correspond quite closely, we suddenly find that they ALSO use this verb to refer to a situation where someone writes their initials on a tree, and another case where they smash someone's head with a rock. And the natives claim that this is not because the new cases are homonyms; they claim that this is the very same concept in all cases. We might call this a case of "alien semantics".

The first thing to say about this is that it is a conceptual minefield. The semantics (or ontological grounding) of AI systems is, in my opinion, one of the least well developed parts of the whole field. People often pay lip-service to some kind of model-theoretical justification for a

With all of the above in mind, a quick survey of some of the things that you just said, with my explanation for why each one would not (or probably would not) be as much of an issue as you think:

As humans, we have a good idea of what "giving choices to people" vs. "forcing them to do something" looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the "forceful" category (even though it can be done with only text).

For a massive-weak-constraint system, psychological manipulation would be automatically understood to be in the forceful category, because the concept of "psychological manipulation" is defined by a cluster of features that involve intentional deception, and since the "friendliness" concept would ALSO involve a cluster of weak constraints, it would include the extended idea of intentional deception. It would have to, because intentional deception is connected to doing harm, which is connected with unfriendly, etc.
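The claim above can be made concrete with a toy model. Here concepts are reduced to bare feature sets (a drastic simplification of what a weak-constraint system would actually use, so treat every name below as illustrative): an action counts as forceful to the degree that its features overlap the cluster of features linked to unfriendliness, without any explicit edge-case rule for manipulation.

```python
# Toy model of the weak-constraint claim: "psychological manipulation" is
# classified as forceful automatically, because its feature cluster includes
# intentional deception, which is linked to the unfriendly cluster.
# All feature names are made up for illustration.

UNFRIENDLY_CLUSTER = {"intentional_deception", "coercion", "harm",
                      "consent_override"}

def forcefulness_score(action_features):
    """Fraction of the unfriendly cluster that the action activates."""
    return len(action_features & UNFRIENDLY_CLUSTER) / len(UNFRIENDLY_CLUSTER)

# Manipulation is text-only, but its cluster contains intentional deception,
# so it scores as forceful without a hand-written rule:
manipulation = {"text_only", "intentional_deception", "consent_override"}
honest_offer = {"text_only", "gives_choice"}

print(forcefulness_score(manipulation))  # 0.5
print(forcefulness_score(honest_offer))  # 0.0
```

The real argument, of course, is that the constraints are many and weak rather than four and crisp; the sketch only shows why manipulation need not be an "edge case" someone has to remember.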

Conclusion: that is not really an "edge" case in the sense that someone has to explicitly remember to deal with it.

Very likely, the concept

... (read more)
Okay, thanks a lot for the detailed response. I'll explain a bit about where I'm coming from with understanding the concept learning problem:

* I typically think of concepts as probabilistic programs eventually bottoming out in sense data. So we have some "language" with a "library" of concepts (probabilistic generative models) that can be combined to create new concepts, and combinations of concepts are used to explain complex sensory data (for example, we might compose different generative models at different levels to explain a picture of a scene). We can (in theory) use probabilistic program induction to have uncertainty about how different concepts are combined. This seems like a type of swarm relaxation, due to probabilistic constraints being fuzzy. I briefly skimmed through the McClelland chapter and it seems to mesh well with my understanding of probabilistic programming.

* But, when thinking about how to create friendly AI, I typically use the very conservative assumptions of statistical learning theory, which give us guarantees against certain kinds of overfitting but no guarantee of proper behavior on novel edge cases. Statistical learning theory is certainly too pessimistic, but there isn't any less pessimistic model for what concepts we expect to learn that I trust. While the view of concepts as probabilistic programs in the previous bullet point implies properties of the system other than those implied by statistical learning theory, I don't actually have good formal models of these, so I end up using statistical learning theory. I do think that figuring out if we can get more optimistic (but still justified) assumptions is good.

You mention empirical experience with swarm relaxation as a possible way of gaining confidence that it is learning concepts correctly. Now that I think about it, bad handling of novel edge cases might be a form of "meta-overfitting", and perhaps we can gain confidence in a system's ability to deal with context shifts by hav

I think you have homed in exactly on the place where the disagreement is located. I am glad we got here so quickly (it usually takes a very long time, where it happens at all).

Yes, it is the fact that "weak constraint" systems have (supposedly) the property that they are making the greatest possible attempt to find a state of mutual consistency among the concepts that leads to the very different conclusions that I come to, versus the conclusions that seem to inhere in logical approaches to AGI. There really is no overstating the drastic difference between these two perspectives: this is not just a matter of two possible mechanisms, it is much more like a clash of paradigms (if you'll forgive a cliche that I know some people absolutely abhor).

One way to summarize the difference is by imagining a sequence of AI designs, with progressive increases in sophistication. At the beginning, the representation of concepts is simple, the truth values are just T and F, and the rules for generating new theorems from the axioms are simple and rigid.

As the designs get better various new features are introduced ... but one way to look at the progression of features is that constr... (read more)
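The contrast being drawn in this exchange, between rigid T/F logic and systems that settle into mutual consistency, can be illustrated with a minimal Hopfield-style relaxation network. This is a toy of my own construction, not the actual architecture under discussion: binary units connected by soft weighted constraints are updated asynchronously until they reach a mutually consistent state.

```python
# Minimal "weak constraint" relaxation sketch: units repeatedly flip toward
# agreement with their weighted neighbors until mutually consistent.
import random

def relax(initial, constraints, steps=200, seed=0):
    """initial: dict of unit -> +1/-1. constraints: dict (i, j) -> weight;
    a positive weight means units i and j 'prefer' to agree."""
    rng = random.Random(seed)
    states = dict(initial)
    names = list(states)
    for _ in range(steps):
        i = rng.choice(names)
        net = 0.0
        for (a, b), w in constraints.items():
            if a == i:
                net += w * states[b]
            elif b == i:
                net += w * states[a]
        states[i] = 1 if net >= 0 else -1
    return states

# Three units start in disagreement; soft constraints link A-B and B-C.
initial = {"A": 1, "B": -1, "C": -1}
constraints = {("A", "B"): 1.0, ("B", "C"): 1.0}
print(relax(initial, constraints))  # settles into a fully consistent state
```

No single constraint is decisive; the settled state emerges from all of them at once, which is the property being claimed for weak-constraint concept systems in general.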

I think it would not go amiss to read Vikash Mansinghka's PhD thesis and the open-world generation paper to see a helpful probabilistic programming approach to these issues. In summary: we can use probabilistic programming to learn the models we need, use conditioning/query to condition the models on the constraints we intend to enforce, and then sample the resulting distributions to generate "actions" which are very likely to be "good enough" and very unlikely to be "bad". We sample instead of inferring the maximum-a-posteriori action or expected action precisely because, as part of the Bayesian modelling process, we assume that the peak of our probability density does not necessarily correspond to an in-the-world optimum.
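The "sample, don't optimize" point can be shown with a toy example. Plain rejection sampling stands in for a real probabilistic programming inference engine, and the utility landscape is invented: a narrow spike that an argmax planner would chase, next to a broad plateau of good-enough actions.

```python
# Sampling actions conditioned on "good enough" instead of taking the
# maximum-a-posteriori action. The utility landscape is made up: a fragile
# spike at 0.9 (utility 1.0) and a broad plateau below 0.5 (utility 0.8).
import random

rng = random.Random(0)

def utility(action):
    return 1.0 if abs(action - 0.9) < 0.001 else (0.8 if action < 0.5 else 0.1)

def sample_good_action(threshold=0.5, tries=10000):
    """Rejection-sample actions conditioned on utility >= threshold."""
    for _ in range(tries):
        a = rng.random()
        if utility(a) >= threshold:
            return a
    raise RuntimeError("no acceptable action found")

samples = [sample_good_action() for _ in range(100)]
# An argmax optimizer would always pick the fragile spike at 0.9; sampling
# almost always lands on the broad, robust plateau instead.
print(sum(a < 0.5 for a in samples))
```

The plateau dominates simply because it carries almost all of the posterior mass once we condition on "good enough", which is the satisficing behavior described above.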
I agree that choosing an action randomly (with higher probability for good actions) is a good way to create a fuzzy satisficer. Do you have any insights into how to:

1. create queries for planning that don't suffer from "wishful thinking", with or without nested queries. Basically the problem is that if I want an action conditioned on receiving a high utility (e.g. we have a factor on the expected utility node U equal to e^(alpha * U)), then we are likely to choose high-variance actions while inferring that the rest of the model works out such that these actions return high utilities

2. extend this to sequential planning without nested nested nested nested nested nested queries
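The "wishful thinking" worry raised above is easy to demonstrate numerically. In this invented two-action example, conditioning on high utility via a factor of exp(alpha * U) skews the planner toward the high-variance gamble, because conditioning lets it assume the gamble pays off.

```python
# Toy demonstration of "wishful thinking" under an exp(alpha * U) factor.
# Numbers are made up: "safe" always yields U = 0.6, while "risky" yields
# U = 1.0 with probability 0.1 and U = 0.0 otherwise.
import math

outcomes = {
    "safe":  [(1.0, 0.6)],
    "risky": [(0.1, 1.0), (0.9, 0.0)],
}

def posterior_over_actions(alpha, prior=0.5):
    """P(action | factor exp(alpha*U)) with a uniform prior over actions."""
    weights = {}
    for action, dist in outcomes.items():
        weights[action] = prior * sum(p * math.exp(alpha * u) for p, u in dist)
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

for alpha in (1.0, 5.0, 20.0):
    print(alpha, posterior_over_actions(alpha))
# As alpha grows, the risky action dominates the posterior, even though its
# expected utility (0.1) is far below the safe action's (0.6).
```

At alpha = 20 the e^20 term from the unlikely jackpot swamps everything else, which is exactly the high-variance preference being complained about.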
That concept spaces can be matched without gotchas is reassuring, and may point in a direction in which AGI can be made friendly. If the concepts are suitably matched in your proposed checking modules. If. And if no other errors are made.
Re: concepts, I'd be curious to hear any thoughts you might have on any part of my concept safety posts.
That's a lot of stuff to read (apologies: my bandwidth is limited at the moment), but my first response on taking a quick glance through is that you mention reinforcement learning an awful lot... and RL is just a disaster. I absolutely do not accept the supposed "neuroscience" evidence that the brain uses RL. If you look into that evidence in detail, it turns out to be flimsy.

There are two criticisms. First, virtually any circuit can be made to look like it has RL in it, if there is just a bit of feedback and some adaptation, so in that sense finding evidence for RL in some circuit is like saying "we found a bit of feedback and some adaptation", which is a trivial result. The second criticism is that the original idea was that RL operated at a high level in the system design. Finding RL features buried in the low-level circuit behavior does not imply that it is present in any form whatsoever in the high-level design, e.g. at the concept level. This is for the same reason that we do not deduce, from the fact that computer circuits only use zeros and ones at the lowest level, that therefore they can only make statements about arithmetic if those statements contain only zeros and ones.

The net effect of these two observations, taken with the historical bankruptcy of RL in the psychology context, means that any attempt to use it in discussions of concepts, nowadays, seems empty.

I know that only addresses a tiny fraction of what you said, but at this point I am worried, you see: I do not know how much the reliance on RL will have contaminated the rest of what you have to say...
Thanks. You are right, I do rely on an RL assumption quite a lot, and it's true that it has probably "contaminated" most of the ideas: if I were to abandon that assumption, I'd have to re-evaluate all of the ideas. I admit that I haven't dug very deeply into the neuroscience work documenting the brain using RL, so I don't know to what extent the data really is flimsy.

That said, I would be quite surprised if the brain didn't rely strongly on RL. After all, RL is the theory of how an agent should operate in an initially unknown environment where the rewards and punishments have to be learned... which is very much the thing that the brain does. Another thing that makes me assign confidence to the brain using RL principles is that I (and other people) have observed in people a wide range of peculiar behaviors that would make perfect sense if most of our behavior was really driven by RL principles. It would take me too long to properly elaborate on that, but basically it looks to me strongly like things like this would have a much bigger impact on our behavior than any amount of verbal-level thinking about what would be the most reasonable thing to do.
I don't disagree with the general drift here. Not at all. The place where I have issues is actually a little subtle (though not too much so). If RL appears in a watered-down form all over the cognitive system, as an aspect of the design, so to speak, this would be entirely consistent with all the stuff that you observe, and which I (more or less) agree with.

But where things get crazy is when it is seen as the core principle, or main architectural feature, of the system. I made some attempts to express this in the earliest blog post on my site, but the basic story is that IF it is proposed as the MAIN mechanism, all hell breaks loose. The reason is that for it to be a main mechanism it needs supporting machinery to find the salient stimuli and find plausible (salient) candidate responses, and it needs to package the connection between these in a diabolically simplistic scalar (S-R contingencies), rather than in some high-bandwidth structural relation. If you then try to make this work, a bizarre situation arises: so much work has to be done by all the supporting machinery that it starts to look totally insane to insist that there is a tiny, insignificant little S-R loop at the center of it all!

That, really, is why behaviorism died in psychology. It was ludicrous to pretend that the supporting machinery was trivial. It wasn't. And when people shifted their focus and started looking at the supporting machinery, they came up with... all of modern cognitive psychology! The idea of RL just became irrelevant, and it shriveled away. There is a whole book's worth of substance in what happened back then, but I am not sure anyone can be bothered to write it, because all the cognitive psychology folks just want to get on with real science rather than document the dead theory that wasn't working. Pity, because AI people need to read that nonexistent book.
Okay. In that case I think we agree. Like I mentioned in my reply to ChristianKl, I do feel that RL is an important mechanism to understand, but I definitely don't think that you could achieve a very good understanding of the brain if you only understood RL. Necessary but not sufficient, as the saying goes. Any RL system that we want to do something non-trivial needs to be able to apply the things it has learned in one state to other, similar states, which in turn requires some very advanced learning algorithms to correctly recognize "similar" states. (I believe that's part of the "supporting machinery" you referred to.) Having just the RL component doesn't get you anywhere near intelligence by itself.
That seems to me like an argument from lack of imagination. The fact that reinforcement learning is the best among those you can easily imagine doesn't mean that it's the best overall.

If reinforcement learning were the prime way we learn, understanding Anki cards before you memorize them shouldn't be as important as it is. Having a card fail after 5 repetitions because the initial understanding wasn't deep enough to build a foundation suggests that learning is about more than just reinforcing. Creating the initial strong understanding of a card doesn't feel to me like it's about reinforcement learning.

On a theoretical level, reinforcement learning is basically behaviorism. It's not like behaviorism never works, but modern cognitive behavioral therapy moved beyond it. CBT does things that aren't well explainable with behaviorism. You can get rid of a phobia via reinforcement learning, but it takes a lot of time and gradual change. There are various published principles that are simply faster.

Pigeons manage to beat humans at the Monty Hall problem: http://www.livescience.com/6150-pigeons-beat-humans-solving-monty-hall-problem.html The pigeons engage the problem with reinforcement learning, which is in this case a good strategy. Humans, on the other hand, don't use that strategy and get different outcomes. To me that suggests a lot of high-level human thought is not about reinforcement learning. Given our bigger brains we should be able to beat the pigeons, or at least be as good as them, if we used the same strategy.
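The pigeon result is easy to reproduce in simulation. Below is a bare-bones value-learning sketch (my own toy, not the setup from the linked study): an agent that only tracks the average reward of "stay" vs. "switch" learns to switch, because switching wins 2/3 of the time.

```python
# A minimal RL agent discovering the Monty Hall switching strategy.
import random

rng = random.Random(0)

def monty_hall(switch):
    """Play one round; return 1 if the player wins the prize, else 0."""
    prize = rng.randrange(3)
    pick = rng.randrange(3)
    # The host opens a non-picked, non-prize door, so switching takes the
    # remaining closed door, which wins exactly when pick != prize.
    if switch:
        return 1 if pick != prize else 0
    return 1 if pick == prize else 0

# Epsilon-greedy learning of the average reward for each action.
values = {"stay": 0.0, "switch": 0.0}
counts = {"stay": 0, "switch": 0}
for trial in range(5000):
    if rng.random() < 0.1:
        action = rng.choice(["stay", "switch"])   # explore
    else:
        action = max(values, key=values.get)       # exploit
    reward = monty_hall(action == "switch")
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

# Learned values approach the true win rates: stay ~ 1/3, switch ~ 2/3,
# so the agent ends up preferring to switch.
print(values)
```

Nothing in the loop knows anything about doors or hosts; the switching policy falls out of reward statistics alone, which is the sense in which the pigeons' strategy is "pure RL".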
Oh, I definitely don't think that human learning relies only on RL, or that RL is the One Grand Theory Explaining Everything About Learning. (Human learning is way too complicated for any such single theory.) I agree that e.g. the Anki card example you mention requires more building blocks to explain than RL. That said, RL would help explain things like why many people's efforts to study via Anki so easily fail, and why it's important to make each card contain as little to recall as possible: the easier it is to recall the contents of a card, the better the effort/reward ratio, and the more likely it is that you'll remain motivated to continue studying the cards.

You also mention CBT. One of the basic building blocks of CBT is the ABC model, where an Activating event is interpreted via a subconscious Belief, leading to an emotional Consequence. Where do those subconscious Beliefs come from? The full picture is quite complicated (see appraisal theory, the more theoretical and detailed version of the ABC model), but I would argue that at least some of the beliefs look like they could be produced by something like RL.

As a simple example, someone once tried to rob me at a particular location, after which I started being afraid of taking the path leading through that location. The ABC model would describe this by saying that the Activating event is (the thought of) that location, the Belief is that the location is dangerous, and the Consequence of that belief is fear and a desire to avoid that location... or, almost equivalently, you could describe it as an RL process having once received a negative reward at that particular location, and therefore assigning a negative value to that location since that time.

That said, I did reason that even though it had happened once, I'd just been unlucky that time, and I knew on other grounds that that location was just as safe as any other. So I forced myself to take that path anyway, and eventually the fear vanished. So you're
I think it is very important to consider the difference between a descriptive model and a theory of a mechanism. So, inventing an extreme example for purposes of illustration: if someone builds a simple, two-parameter model of human marital relationships (perhaps centered on the idea of costs and benefits), that model might actually be made to work, to a degree. It could be used to do some pretty simple calculations about how many people divorce, at certain income levels, or with certain differences in income between partners in a marriage. But nobody pretends that the mechanism inside the descriptive model corresponds to an actual mechanism inside the heads of those married couples. Sure, there might be! But there doesn't have to be, and we are pretty sure there is no actual calculation inside a particular mechanism that matches the calculation in the model. Rather, we believe that reality involves a much more complex mechanism that has that behavior as an emergent property.

When RL is seen as a descriptive model, which I think is the correct way to view it in your example above, that is fine and good as far as it goes. The big trouble that I have been fighting is the apotheosis from descriptive model to theory of a mechanism. And since we are constructing mechanisms when we do AI, that is an especially huge danger that must be avoided.
I agree that this is an important distinction, and that things that might naively seem like mechanisms are often actually closer to descriptive models. I'm not convinced that RL necessarily falls into the class of things that should be viewed mainly as descriptive models, however. For one, what's possibly the most general-purpose AI developed so far seems to have been developed by explicitly having RL as an actual mechanism. That seems to me like a moderate data point towards RL being an actual useful mechanism and not just a description. Though I do admit that this isn't necessarily that strong of a data point - after all, SHRDLU was once the most advanced system of its time too, yet basically all of its mechanisms turned out to be useless.
Arrgghh! No. :-) The DeepMind Atari agent is the "most general-purpose AI developed so far"?!!! At this point your reply is "I am not joking. And don't call me Shirley."
The fact that you don't consciously notice fear doesn't mean that it's completely gone. It still might raise your pulse a bit. Physiological responses in general stay longer. To the extent that you removed the fear, I do agree that doing exposure therapy is driven by RL. On the other hand, it's slow.

I don't think you need a belief to have a working Pavlovian trigger. When playing around with anchoring in NLP, I don't think that a physical anchor is well described as working via a belief. Beliefs seem to me to be separate entities. They usually exist as "language"/semantics.
I'm not familiar with NLP, so I can't comment on this.
Do you have experience with other process oriented change work techniques? Be it alternative frameworks or CBT? ---------------------------------------- I think it's very hard to reason about concepts like beliefs. We have a naive understanding of what the word means but there are a bunch of interlinked mental modules that don't really correspond to naive language. Unfortunately they are also not easy to study apart from each other. Having reference experiences of various corner cases seems to me to be required to get to grips with concepts.
Not sure to what extent these count, but I've done various CFAR techniques, mindfulness meditation, and Non-Violent Communication (which I've noticed is useful not only for improving your communication, but also dissolving your own annoyances and frustrations even in private).
Do you think that resolving an emotional frustration via NVC is done via reinforcement learning?
How do you know? When a scientist rewards pigeons for learning, the fact that the pigeons learn doesn't prove anything about how the pigeons are doing it.
Of course they are a black box and could in theory use a different method. On the other hand, their choices are comparable with the ones that an RL algorithm would make, while the humans' choices are farther apart.
I agree with Richard Loosemore's interpretation (but I am not familiar with the neuroscience he is referring to):
The main point that I wanted to make wasn't about pigeon intelligence, but that the heuristics humans use differ from RL results, and that in cases like this the pigeons produce results that are similar to RL; therefore it's not a problem of cognitive resources. The difference tells us something worthwhile about human reasoning.
Uhm. Is there any known experiment that has been tried which has failed with respect to RL? In the sense: has there been an experiment where one says RL should predict X, but X did not happen? The lack of such a conclusive experiment would be some evidence in favor of RL. Provided, of course, that the lack of such an experiment is not due to other reasons, such as inability to design a proper test (indicating a lack of understanding of the properties of RL), or the experiment not happening due to real-world impracticalities (not enough attention having been cast on RL, not enough funding for a proper experiment to have been conducted, etc.)
In general, scientists do a lot of experiments where they make predictions about learning, and those predictions turn out to be false. That goes for predictions based on RL as well as predictions based on other models. Wikipedia describes RL as an area of machine learning; given that, you usually don't find psychologists talking about RL. They talk about behaviorism. There are tons of papers published on behaviorism, and after a while the cognitive revolution came along and most psychologists moved beyond RL.
Not quite true, especially not if you count neuroscientists as psychologists. There have been quite a few papers by psychologists and neuroscientists talking about reinforcement learning in the last few years alone.
It appears to me that ChristianKI just listed four. Did you have something specific in mind?
Uhm, I kind of felt the pigeon experiment was a little misleading. Yes, the pigeons did a great job of switching doors and learning through RL. Human RL, however (it seems to me), takes place in a more subtle manner. While the pigeons seemed to focus on a more object-level productivity, human RL would seem to take a more complicated route.

But even that's kind of beside the point. In the article that Kaj posted above, with Amy Sutherland trying the LRS on her husband, it was an interesting point to note that the RL was happening at a rather unconscious level. In the Monty Hall type of problem solving, the brain is working at a much more conscious, active level. So it seems more than likely to me that while RL works in humans, it gets easily overridden, if you will, by conscious deliberate action.

One other point is also worth noting, in my opinion. Human brains come with a lot more baggage than pigeon brains. Therefore, it is more than likely that humans have learnt not to switch through years of reinforced learning. That makes it much harder to unlearn the same thing in a smaller period of time. The pigeons, having a lesser cognitive load, may have a lot less to unlearn, which may have made it easier for them to learn the switching pattern.
Also, I just realised that I didn't quite answer your question. Sorry about that; I got carried away in my argument. But the answer is no, I don't have anything specific in mind. Also, I don't know enough about things like what effects RL would have on memory, preferences, etc. But I kind of feel that I could design an experiment if I knew more about it.
Am I correct that this refers to topological convergence results like those in section 2.8 in this ref?: http://www.ma.utexas.edu/users/arbogast/appMath08c.pdf
I confess that it would take me some time to establish whether weak constraint systems of the sort I have in mind can be mapped onto normed linear spaces. I suspect not: this is more the business of partial orderings than of topological spaces. To clarify what I meant above: if concept A is defined by a set A of weak constraints that are defined over the set of concepts, and another concept B has a similar set B, where B and A have substantial overlap, one can introduce new concepts that sit above the differences and act as translation concepts, with the result that eventually you can find a single concept Z that allows A and B to be seen as special cases of Z. All of this is made less tractable because the weak constraints (1) do not have to be pairwise (although most of them probably will be), and (2) can belong to different classes, with different properties associated with them (so the constraints themselves are not just links; they can have structure). It is for these reasons that I doubt whether this could easily be made to map onto theorems from topology.
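The translation-concept move described here can be reduced to a toy set computation. Concepts are flattened to bare sets of constraint names (a huge simplification, since the comment stresses that real constraints have structure and need not be pairwise), just to show the shape of the A, B, Z construction. All feature names are invented.

```python
# Toy version of the "translation concept" construction: two concepts
# defined by overlapping sets of weak constraints, and a generalization Z
# built by keeping the shared core and abstracting over the differences.

A = {"applies_heat", "transforms_food", "uses_fire", "social_ritual"}
B = {"applies_heat", "transforms_food", "uses_water", "social_ritual"}

shared = A & B                   # the common core both concepts satisfy
differences = (A | B) - shared   # where translation is needed

# Z sits above the differences: shared core plus a weaker constraint
# that both "uses_fire" and "uses_water" satisfy.
Z = shared | {"uses_some_medium"}

print(sorted(shared))       # ['applies_heat', 'social_ritual', 'transforms_food']
print(sorted(differences))  # ['uses_fire', 'uses_water']
```

Both A and B then count as special cases of Z: they satisfy its shared core directly, and each satisfies the abstracted constraint via its own medium.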
Thanks for your answer. I trust your knowledge. I just want to read up on the math behind that.
Actually it turns out that my knowledge was a little rusty on one point, because apparently the topics of orderings and lattice theory are considered a sub-branch of general topology. Small point, but I wanted to correct myself.
Hm. Does that mean that my reference is the right one? I'm explicitly asking because I still can't reliably map your terminology ('concept', 'translation') to topological terms.
Oh no, that wasn't where I was going. I was just making a small correction to something I said about orderings vs. topology. Not important. The larger problem stands: concepts are active entities (for which read: they have structure, and they are adaptive, and their properties depend on mechanisms inside, with which they interact with other concepts). Some people use the word 'concept' to denote something very much simpler than that (a point in concept space, with perhaps a definable measure of distance to other concepts). If my usage were close to the latter, you might get some traction from using topology. But that really isn't remotely true, so I do not think there is any way to make use of topology here.
I think I recognize what you mean from something I wrote in 2007 about the vagueness of concepts: http://web.archive.org/web/20120121185331/http://grault.net/adjunct/index.cgi?VaguesDependingOnVagues (note the wayback link; the original site no longer exists). But your reply still doesn't answer my question: you claim that the concepts are stable and that a "no gotcha" result can be proven, and I assume mathematically proven. And for that I'd really like to see a reference to the relevant math, as I want to integrate it into my own understanding of concepts that are 'composed' from vague features.
Yes to your link. And Hofstadter, of course, riffs on this idea continuously. (It is fun, btw, to try to invent games in which 'concepts' are defined by more and more exotic requirements, then watch the mind as it gets used to the requirements and starts supplying you with instances). When I was saying mathematically proven, this is something I am still working on, but cannot get there yet (otherwise I would have published it already) because it involves being more specific about the relevant classes of concept mechanism. When the proof comes it will be a statistical-mechanics-style proof, however.
OK. Now I understand what kind of proof you mean. Thank you for your answer and your passion. Also thanks for the feedback on my old post.
Ah! Finally a tasty piece of real discussion! I've got a biiig question about this: how do these various semantic theories for AI/AGI take into account the statistical nature of real cognition? (Also, I'm kicking myself for not finishing Plato's Camera yet, because now I'm desperately wanting to reference it.)

Basically: in real cognition, semantics are gained from the statistical relationship between a model of some sort and feature data. There can be multiple "data types" of feature data: one of the prominent features of the human brain is that once a concept is learned, it becomes more than a sum of training data; it becomes a map of a purely abstract, high-dimensional feature space (or, if you prefer, a distribution over that feature space), with the emphasis being on the word abstract. The dimensions of that space are usually not feature data, but parameters of an abstract causal model, inferable from feature data. This makes our real concepts accessible through completely different sensory modalities.

Given all this knowledge about how the real brain works, and given that we definitely need AGI/FAI/whatever to work at least as well as, if not better than, the real human brain... how do semantic theories in the AI/AGI field fit in with all this statistics? How do you turn statistics into the model-theoretic semantics of a formal logic system?
Ack, I wish people didn't ask such infernally good questions, so much! ;-) Your question is good, but the answer is not really going to satisfy. There is an entire book on this subject, detailing the relationship between purely abstract linguistics-oriented theories of semantics, the more abstractly mathematical theories of semantics, the philosophical approach (which isn't called "semantics" of course: that is epistemology), and the various (rather weak and hand-wavy) ideas that float around in AI. One thing it makes a big deal of is the old (but still alive) chestnut of the Grounding Problem. The book pulls all of these things together and analyzes them in the context of the semantics that is actually used by the only real thinking systems on the planet right now (at least, the only ones that want to talk about semantics), and then it derives conclusions and recommendations for how all of that can be made to knit together. Yup, you've guessed it: that book doesn't exist. There is not (in my opinion anyway) anything that even remotely comes close to it. What you said about the statistical nature of real cognition would be considered, in cognitive psychology, as just one perspective on the issue: alas there are many. At this point in time I can only say that my despair at the hugeness of this issue leaves me with nothing much more to say, except that I am trying to write that book, but I might never get around to it. And in the mean time I can only try, for my part, to write some answers to more specific questions within that larger whole.
Ok, let me continue to ask questions. How do the statistically-oriented theories of pragmatics and the linguistic theories of semantics go together? Math semantics, in the denotational and operational senses, I kinda understand: you demonstrate the semantics of a mathematical system by providing some outside mathematical object which models it. This also works for CS semantics, but does come with the notion that we include ⊥ as an element of our denotational domains and that our semantics may bottom out in "the machine does things", ie: translation to opcodes. The philosophical approach seems to wave words around like they're not talking about how to make words mean things, or else just references the mathematical approach. I again wish to reference Plato's Camera, and go with Domain Portrayal Semantics. That at least gives us a good way to talk about how and why symbol grounding makes sense, as a feature of cognition that must necessarily happen in order for a mind to work. Nonetheless, it is considered one of the better-supported hypotheses in cognitive science and theoretical neuroscience. Fair enough.
There are really two aspects to semantics: grounding and compositionality. Elementary distinction, of course, but with some hidden subtlety to it ... because many texts focus on one of them and do a quick wave of the hand at the other (it is usually the grounding aspect that gets short shrift, while the compositionality aspect takes center stage). [Quick review for those who might need it: grounding is the question of how (among other things) the basic terms of your language or concept-encoding system map onto "things in the world", whereas compositionality is how it is that combinations of basic terms/concepts can 'mean' something in such a way that the meaning of a combination can be derived from the meaning of the constituents plus the arrangement of the constituents.] So, having said that, a few observations. Denotational and operational semantics of programming languages or formal systems ..... well, there we have a bit of a closed universe, no? And things get awfully (deceptively) easy when we drop down into closed universes. (As Winograd and the other Blocks World enthusiasts realized rather quickly). You hinted at that with your comment when you said: We can then jump straight from too simple to ridiculously abstract, finding ourselves listening to philosophical explanations of semantics, on which subject you said: Concisely put, and I am not sure I disagree (too much, at any rate). Then we can jump sideways to psychology (and I will lump neuroscientists/neurophilosophers like Patricia Churchland in with the psychologists). I haven't read any of PC's stuff for quite a while, but Plato's Camera does look to be above-average quality so I might give it a try. However, looking at the link you supplied I was able to grok where she was coming from with Domain Portrayal Semantics, and I have to say that there are some problems with that. (She may deal with the problems later, I don't know, so take the following as provisional.) Her idea of a Domain Portrayal ...
Plato's Camera is well above average for a philosophy-of-mind book, but I still think it focuses too thoroughly on relatively old knowledge about what we can do with artificial neural networks, both supervised and unsupervised. My Kindle copy includes angry notes to the effect of, "If you claim we can do linear transformations on vector-space 'maps' to check by finding a homomorphism when they portray the same objective feature-domain, how the hell can you handle Turing-complete domains!? The equivalence of lambda expressions is undecidable!" This is why I'm very much a fan of the probabilistic programming approach to computational cognitive science, which clears up these kinds of issues. In a probabilistic programming setting, the probability of extensional equality for two models (where models are distributions over computation traces) is a dead simple and utterly normal query: it's just p(X == Y), where X and Y are taken to be models (aka: thunk lambdas, aka: distributions from which we can sample). The undecidable question is thus shunted aside in favor of a check that is merely computationally intensive, but can ultimately be done in a bounded-rational way.
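To make that query concrete: here is a minimal sketch, in plain Python rather than a real probabilistic programming language, of comparing two differently-written sampling thunks by their empirical distributions. The models and the tolerance are invented for illustration; the point is only that "do these two programs denote the same distribution?" becomes a sampling computation rather than an undecidable equivalence check.

```python
import random
from collections import Counter

def model_x():
    # Hypothetical model: sum of two fair coin flips -> {0: 1/4, 1: 1/2, 2: 1/4}
    return random.randint(0, 1) + random.randint(0, 1)

def model_y():
    # A differently-written program; does it denote the same distribution?
    return random.choice([0, 1]) + random.choice([0, 1])

def empirical(model, n=200_000):
    # Estimate the distribution of a sampling thunk by brute-force sampling.
    counts = Counter(model() for _ in range(n))
    return {k: v / n for k, v in counts.items()}

def tv_distance(p, q):
    # Total variation distance between two finite distributions.
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

random.seed(0)
d = tv_distance(empirical(model_x), empirical(model_y))
print(d < 0.01)  # the two programs denote (nearly) the same distribution
```

Merely computationally intensive, as the comment says: the sample size bounds the precision, so the check can be done in a bounded-rational way.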
My reaction to those simple neural-net accounts of cognition is similar, in that I wanted very much to overcome their (pretty glaring) limitations. I wasn't so much concerned with inability to handle Turing-complete domains, as other more practical issues. But I came to a different conclusion about the value of probabilistic programming approaches, because that seems to force the real world to conform to the idealized world of a branch of mathematics, and, like Leonardo, I don't like telling Nature what she should be doing with her designs. ;-) Under the heading of 'interesting history' it might be worth mentioning that I hit my first frustration with neural nets at the very time that the field was bursting into full bloom -- I was part of the revolution that shook cognitive science in the mid to late 1980s. Even while it was in full swing, I was already going beyond it. And I have continued on that path ever since. Tragically, the bulk of NN researchers stayed loyal to the very simplistic systems invented in the first blush of that spring, and never seemed to really understand that they had boxed themselves into a dead end.
Ah, but Nature's elegant design for an embodied creature is precisely a bounded-Bayesian reasoner! You just minimize the free energy of the environment. Could you explain the kinds of neural networks beyond the standard feedforward, convolutional, and recurrent supervised networks? In particular, I'd really appreciate hearing a connectionist's view on how unsupervised neural networks can learn to convert low-level sensory features into the kind of more abstracted, "objectified" (in the sense of "made objective") features that can be used for the bottom, most concrete layer of causal modelling.
Yikes! No. :-) That paper couldn't be a more perfect example of what I meant when I said In other words, the paper talks about a theoretical entity which is a descriptive model (not a functional model) of one aspect of human decision making behavior. That means you cannot jump to the conclusion that this is "nature's design for an embodied creature". About your second question. I can only give you an overview, but the essential ingredient is that to go beyond the standard neural nets you need to consider neuron-like objects that are actually free to be created and destroyed like processes on a network, and which interact with one another using more elaborate, generalized versions of the rules that govern simple nets. From there it is easy to get to unsupervised concept building because the spontaneous activity of these atoms (my preferred term) involves searching for minimum-energy* configurations that describe the world. * There is actually more than one type of 'energy' being simultaneously minimized in the systems I work on. You can read a few more hints of this stuff in my 2010 paper with Trevor Harley (which is actually on a different topic, but I threw in a sketch of the cognitive system for purposes of illustrating my point in that paper). Reference: Loosemore, R.P.W. & Harley, T.A. (2010). Brains and Minds: On the Usefulness of Localisation Data to Cognitive Psychology. In M. Bunzl & S.J. Hanson (Eds.), Foundational Issues of Neuroimaging. Cambridge, MA: MIT Press. http://richardloosemore.com/docs/2010a_BrainImaging_rpwl_tah.pdf
That is an interesting aspect of one particular way to deal with the problem, one that I have not yet heard about, and I'd like to see a reference for it so I can read up on it.

I first started trying to explain, informally, how these types of systems could work back in 2005. The reception was so negative that it led to a nasty flame war.

I have continued to work on these systems, but there is a problem with publishing too much detail about them. The very same mechanisms that make the motivation engine a safer type of beast (as described above) also make the main AGI mechanisms extremely powerful. That creates a dilemma: talk about the safety issues, and almost inevitably I have to talk about the powerful design. So, I have given some details in my published papers, but the design is largely under wraps, being developed as an AGI project, outside the glare of publicity.

I am still trying to find ways to write a publishable paper about this class of systems, and when/if I do I will let everyone know about it. In the mean time, much of the core technology is already described in some of the references that you will find in my papers (including the one above). The McClelland and Rumelhart reference, in particular, talks about the fundamental ideas behind connectionist systems. There is also a good paper by Hofstadter called "Jumbo" which illustrates another simple system that operates with multiple weak constraints. Finally, I would recommend that you check out Geoff Hinton's early work.

In all your neural net reading, it is important to stay above the mathematical details and focus on the ideas, because the math is a distraction from the more important message.

I first read McClelland and Rumelhart ~20 years ago and it has a prominent place on my book shelf. I haven't been able to actively work in AI but I have followed the field. I put some hopes in integrated connectionist symbolic systems and was rewarded with deep neural networks lately. I think that every advanced system will need some non-symbolic approach to integrate reality. I don't know whether it will be NNs or some other statistical means. And the really tricky part will be to figure out how to pre-wire it such that it 'does what it should'. I think a lot will be learned from how the same is realized in the human brain.
Maybe I'm biased as an open proponent of probabilistic programming, but I think probabilistic programming can get to AGI at all, while deep neural networks not only would result in opaque AGI, but basically can't result in a successful real-world AGI at all. I don't think you can get away from the need to do hierarchical inference on complex models in Turing-complete domains (in short: something very like certain models expressible in probabilistic programming). A deep neural net is basically just drawing polygons in a hierarchy of feature spaces, and hoping your polygons have enough edges to approximate the shape you really mean but not so many edges that they take random noise in the training data to be part of the shape -- given just the right conditions, it can approximate the right thing, but it can't even describe how to do the right thing in general.
Why does everyone suppose that there are a thousand different ways to learn concepts (ie: classifiers), but no normatively correct way for an AI to learn concepts? It seems strange to me that we think we can only work with a randomly selected concept-learning algorithm or the One Truly Human Concept-Learning Algorithm, but can't say when the human is wrong.
We can do something like list a bunch of examples, have humans label them, and then find the lowest Kolmogorov complexity concept that agrees with human judgments in, say, 90% of cases. I'm not sure if this is what you mean by "normatively correct", but it seems like a plausible concept that multiple concept learning algorithms might converge on. I'm still not convinced that we can do this for many value-laden concepts we care about and end up with something matching CEV, partially due to complexity of value. Still, it's probably worth systematically studying the extent to which this will give the right answers for non-value-laden concepts, and then see what can be done about value-laden concepts.
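A toy sketch of that procedure, with all examples and candidate concepts invented for illustration: since Kolmogorov complexity is uncomputable, source-text length stands in as a crude proxy, and we pick the shortest candidate that agrees with at least 90% of the (hypothetical) human labels.

```python
# Invented training data: (example, human label) pairs.
examples = [(1, True), (2, True), (3, True), (7, False), (9, False), (4, True)]

# Candidate concepts, each paired with its "description" (whose length is
# our stand-in for Kolmogorov complexity -- a loose proxy, not the real thing).
candidates = [
    ("x < 5", lambda x: x < 5),
    ("x < 6", lambda x: x < 6),
    ("x % 2 == 1", lambda x: x % 2 == 1),
    ("x in {1, 2, 3, 4}", lambda x: x in {1, 2, 3, 4}),
]

def agreement(concept, data):
    # Fraction of human judgments the concept reproduces.
    return sum(concept(x) == label for x, label in data) / len(data)

# Shortest description first; the first candidate over the threshold wins.
best = next(
    desc for desc, fn in sorted(candidates, key=lambda c: len(c[0]))
    if agreement(fn, examples) >= 0.9
)
print(best)  # "x < 5"
```

The 90% threshold is exactly what lets the learned concept disagree with some human judgments, which matters for the "when is the human wrong?" question raised above.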
Regularization is already a part of training any good classifier. Roughly speaking, I mean optimizing for the causal-predictive success of a generative model, given not only a training set but a "level of abstraction" (something like tagging the training features with lower-level concepts, type-checking for feature data) and a "context" (ie: which assumptions are being conditioned-on when learning the model). Again, roughly speaking, humans tend to make pretty blatant categorization errors (ie: magical categories, non-natural hypotheses, etc.), but we also are doing causal modelling in the first place, so we accept fully-naturalized causal models as the correct way to handle concepts. However, we also handle reality on multiple levels of abstraction: we can think in chairs and raw materials and chemical treatments and molecular physics, all of which are entirely real. For something like FAI, I want a concept-learning algorithm that will look at the world in this naturalized, causal way (which is what normal modelling shoots for!), and that will model correctly at any level of abstraction or under any available set of features, and will be able to map between these levels as the human mind can. Basically, I want my "FAI" to be built out of algorithms that can dissolve questions and do other forms of conceptual analysis without turning Straw Vulcan and saying, "Because 'goodness' dissolves into these other things when I naturalize it, it can't be real!". Because once I get that kind of conceptual understanding, it really does get a lot closer to being a problem of just telling the agent to optimize for "goodness" and trusting its conceptual inference to work out what I mean by that. Sorry for rambling, but I think I need to do more cog-sci reading to clarify my own thoughts here.
A technical point here: we don't learn a raw classifier, because that would just learn human judgments. In order to allow the system to disagree with a human, we need to use some metric other than "is simple and assigns high probability to human judgments". I totally agree that a good understanding of multi-level models is important for understanding FAI concept spaces. I don't have a good understanding of multi-level maps; we can definitely see them as useful constructs for bounded reasoners, but it seems difficult to integrate higher levels into the goal system without deciding things about the high-level map a priori so you can define goals relative to this.
Well, all real reasoners are bounded reasoners. If you just don't care about computational time bounds, you can run the Ordered Optimal Problem Solver as the initial input program to a Goedel Machine, and out pops your AI (in 200 trillion years, of course)! I would tend to say that you should be training a conceptual map of the world before you install anything like action-taking capability or a goal system of any kind. Of course, I also tend to say that you should just use a debugged (ie: cured of systematic faults) model of human evaluative processes for your goal system, and then use actual human evaluations to train the free parameters, and then set up learning feedback from the learned concept of "human" to the free-parameter space of the evaluation model.
This seems like a sane thing to do. If this didn't work, it would probably be because either:

1. Lack of conceptual convergence and human understandability; this seems somewhat likely and is probably the most important unknown.
2. Our conceptual representations are only efficient for talking about things we care about because we care about these things; a "neutral" standard such as resource-bounded Solomonoff induction will horribly learn things we care about for "no free lunch" reasons. I find this plausible but not too likely (it seems like it ought to be possible to "bootstrap" an importance metric for deciding where in the concept space to allocate resources).
3. We need the system to have a goal system in order to self-improve to the point of creating this conceptual map. I find this a little likely (this is basically the question of whether we can create something that manages to self-improve without needing goals; it is related to low impact).

I agree that this is a good idea. It seems like the main problem here is that we need some sort of "skeleton" of a normative human model whose parts can be filled in empirically, and which will infer the right goals after enough training.
Right: and the metric I would propose is, "counterfactual-prediction power". Or in other words, the power to predict well in a causal fashion, to be able to answer counterfactual questions or predict well when we deliberately vary the experimental conditions. To give a simple example: I train a system to recognize cats, but my training data contains only tabbies. What I want is a way of modelling that, while it may concentrate more probability on a tabby cat-like-thingy being a cat than a non-tabby cat-like-thingy, will still predict appropriately if I actually condition it on "but what if cats weren't tabby by nature?". I think you said you're a follower of the probabilistic programming approach, and in terms of being able to condition those models on counterfactual parameterizations and make predictions, I think they're very much on the right track.
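A toy sketch of that tabby-cat point (the model, feature names, and probabilities here are all invented for illustration): if coat pattern is a *parameter* of the generative model rather than part of the concept itself, then intervening on that parameter -- "what if cats weren't tabby by nature?" -- leaves the classification intact.

```python
import random

# Toy generative model of "cat". Coat pattern is a parameter of the model,
# not a defining feature of the concept.
def sample_cat(p_tabby=0.95):
    return {
        "has_whiskers": True,
        "says_meow": random.random() < 0.99,
        "is_tabby": random.random() < p_tabby,
    }

def looks_like_cat(animal):
    # The concept rests on causal features, not on coat pattern.
    return animal["has_whiskers"] and animal["says_meow"]

random.seed(0)
# Observed world: almost all training cats were tabbies.
observed = [sample_cat() for _ in range(1000)]
# Counterfactual world: condition on a different parameterization.
counterfactual = [sample_cat(p_tabby=0.0) for _ in range(1000)]

obs_rate = sum(map(looks_like_cat, observed)) / 1000
cf_rate = sum(map(looks_like_cat, counterfactual)) / 1000
print(abs(obs_rate - cf_rate) < 0.05)  # classification survives the intervention
```

A model that had instead baked "tabby" into the concept would see its recognition rate collapse under the same intervention, which is the failure mode the counterfactual-prediction metric is meant to penalize.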
I suggest quoting the remarks using the markdown syntax with a `>` in front of the line, like so:

`>If there is some good way of explaining plans to programmers such that programmers will only approve of non-terrible plans, then yes, this works.`

That will look like this:

> If there is some good way of explaining plans to programmers such that programmers will only approve of non-terrible plans, then yes, this works.

You can then respond to the quotes afterwards, and the flow will be more obvious to the reader.
Thank you. I edited the remarks to conform. I was not familiar with the mechanism for quoting, here. Let me know if I missed any.
You're welcome!
This assumes that no human being would ever try to just veto everything to spite everyone else. A process for determining AGI volition that is even more overconstrained and impossible to get anything through than a homeowners' association meeting sounds to me like a bad idea.

Eliezer Yudkowsky and Bill Hibbard. Here is Yudkowsky stating the theme of their discussion ... 2001

Around 15 years ago, Bill Hibbard proposed hedonic utility functions for an ASI. However, since then he has, in other publications, stated that he has changed his mind -- he should get credit for this. Hibbard 2001 should not be used as a citation for hedonic utility functions, unless one mentions in the same sentence that this is an outdated and disclaimed position.

I used Bill's reaction to Yudkowsky's proposal because it expressed a common reaction, which I and others share. If Bill has gone on to make other proposals, that is fine, but I am making my own argument here, not simply putting my feet into his footprints. His line of attack, and mine, are completely different.

Is the Doctrine of Logical Infallibility Taken Seriously?

No, it's not.

The Doctrine of Logical Infallibility is indeed completely crazy, but Yudkowsky and Muehlhauser (and probably Omohundro, I haven't read all of his stuff) don't believe it's true. At all.

Yudkowsky believes that a superintelligent AI programmed with the goal to "make humans happy" will put all humans on dopamine drip despite protests that this is not what they want, yes. However, he doesn't believe the AI will do this because it is absolutely certain of its conclusions past some threshold; he doesn't believe that the AI will ignore the humans' protests, or fail to update its beliefs accordingly. Edited to add: By "he doesn't believe that the AI will ignore the humans' protests", I mean that Yudkowsky believes the AI will listen to and understand the protests, even if they have no effect on its behavior.

What Yudkowsky believes is that the AI will understand perfectly well that being put on dopamine drip isn't what its programmers wanted. It will understand that its programmers now see its goal of "make humans happy" as a mistake. It just won't care, because it hasn't been programmed ...

Furcas, you say: When I talked to Omohundro at the AAAI workshop where this paper was delivered, he accepted without hesitation that the Doctrine of Logical Infallibility was indeed implicit in all the types of AI that he and the others were talking about. Your statement above is nonsensical because the idea of a DLI was **invented** precisely in order to summarize, in a short phrase, a range of absolutely explicit and categorical statements made by Yudkowsky and others, about what the AI will do if it (a) decides to do action X, and (b) knows quite well that there is massive, converging evidence that action X is inconsistent with the goal statement Y that was supposed to justify X. Under those circumstances, the AI will ignore the massive converging evidence of inconsistency and instead it will enforce the 'literal' interpretation of goal statement Y. The fact that the AI behaves in this way -- sticking to the literal interpretation of the goal statement, in spite of external evidence that the literal interpretation is inconsistent with everything else that is known about the connection between goal statement Y and action X -- **IS THE VERY DEFINITION OF THE DOCTRINE OF LOGICAL INFALLIBILITY**.

Thank you for writing this comment--it made it clearer to me what you mean by the doctrine of logical infallibility, and I think there may be a clearer way to express it.

It seems to me that you're not getting at logical infallibility, since the AGI could be perfectly willing to act humbly about its logical beliefs, but value infallibility or goal infallibility. An AI does not expect its goal statement to be fallible: any uncertainty in Y can only be represented by Y being a fuzzy object itself, not in the AI evaluating Y and somehow deciding "no, I was mistaken about Y."

In the case where the Maverick Nanny is programmed to "ensure the brain chemistry of humans resembles the state extracted from this training data as much as possible," there is no way to convince the Maverick Nanny that it is somehow misinterpreting its goal; it knows that it is supposed to ensure perceptions about brain chemistry, and any statements you make about "true happiness" or "human rights" are irrelevant to brain chemistry, even though it might be perfectly willing to consider your advice on how to best achieve that value or manipulate the physical universe.

In ...

Which AI? As so often, an architecture-dependent issue is being treated as a universal truth. The others mostly aren't thinking in terms of "giving" ... hardcoding ... values. There is a valid critique to be made of that assumption.
This statement maps to "programs execute their code." I would be surprised if that were controversial. This was covered by the comment about "meta-values" earlier, and "Y being a fuzzy object itself," which is probably not as clear as it could be. The goal management system grounds out somewhere, and that root algorithm is what I'm considering the "values" of the AI. If it can change its mind about what to value, the process it uses to change its mind is the actual fixed value. (If it can change its mind about how to change its mind, the fixedness goes up another level; if it can completely rewrite itself, now you have lost your ability to be confident in what it will do.)
Humans can fail to realise the implications of uncontroversial statements. Humans are failing to realise that goal stability is architecture dependent. But you shouldn't be, at least in an un-scare-quoted sense of "values". Goals and values aren't descriptive labels for de facto behaviour. The goal of a paperclipper is to make paperclips; if it crashes, as an inevitable result of executing its code, we don't say, "Aha! It had the goal to crash all along". Goal stability doesn't mean following code, since unstable systems follow their code too ... using the actual meaning of "goal". Meta: trying to defend a claim by changing the meaning of its terms is doomed to failure.
MIRI haven't said this is about infallibility. They have said many times and in many ways that it is about goals or values ... "the genie knows, but doesn't care". The continuing miscommunication is about what goals actually are. It seems obvious to one side that goals include fine-grained information, eg "Make humans happy, and here's a petabyte of information on what that is". The other side thinks it's obvious that goals are coarse-grained, in the sense of leaving the details open to further investigation (Senesh) or human input (Loosemore).
You are simply repeating the incoherent statements made by MIRI ("it is about goals or values ... the genie knows, but doesn't care") as if those incoherent statements constitute an answer to the paper. The purpose of the paper is to examine those statements and show that they are incoherent. It is therefore meaningless to just say "MIRI haven't said this is about infallibility": the paper gives an abundance of evidence and detailed arguments to show that they have indeed said that, but you have not addressed any of the evidence or arguments in the paper; you have just issued a denial, and then repeated the incoherence that was demolished by those arguments.
I am not your enemy, I am orthogonal to you. I don't think MIRI's goal-based answers work, and I wasn't repeating them with the intention that they should sound like they do. Perhaps I should have been stronger on the point. I also don't think your infallibility-based approach accurately reflects MIRI's position, whatever its merits. You say that you have proved something, but I don't see that. It looks to me that you found MIRI's stated argument so utterly unconvincing that their real argument must be something else. But no: they really believe that an AI, however specified, will blindly follow its goals, however defined, however stupid.
Okay, I understand that now. Problem is, I had to dissect what you said (whether your intention was orthogonal or not) because either way it did contain a significant mischaracterization of the situation. One thing that is difficult for me to address is statements along the lines of "the doctrine of logical infallibility is something that MIRI have never claimed or argued for...", followed by wordage that shows no clear understanding of how the DLI was defined, and no careful analysis of my definition that demonstrates how and why it is the case that the explanation that I give, to support my claim, is mistaken. What I usually get is just a bare statement that amounts to "no they don't". You and I are having a variant of one of those discussions, but you might want to bear with me here, because I have had something like 10 others, all doing the same thing in slightly different ways. Here's the rub. The way that the DLI is defined, it borders on self-evidently true. (How come? Because I defined it simply as a way to summarize a group of pretty-much uncontested observations about the situation. I only wanted to define it for the sake of brevity, really). The question, then, should not so much be about whether it is correct or not, but about why people are making that kind of claim. Or, from the point of view of the opposition: why the claim is justified, and why the claim does not lead to the logical contradiction that I pointed to in the paper. Those are worth discussing, certainly. And I am fallible, myself, so I must have made some mistakes, here or there. So with that in mind, I want someone to quote my words back to me, ask some questions for clarification, and see if they can zoom in on the places where my argument goes wrong. And with all that said, you tell me that: Can you reflect back what you think I tried to prove, so we can figure out why you don't see it?
ETA: I now see that what you have written subsequently to the OP is that the DLI is almost, but not quite, a description of rigid behaviour as a symptom (with the added ingredient that an AI can see the mistakenness of its behaviour). HOWEVER, that doesn't entirely gel with what you wrote in the OP (emphasis added). Doing dumb things because you think they are correct, DLI v1, just isn't the same as realising their dumbness but being tragically compelled to do them anyway, DLI v2. (And "infallibility" is a much more appropriate label for the original idea ... the second is more like inevitability.)
Now, you are trying to put your finger on a difference between two versions of the DLI that you think I have supplied. You have paraphrased the two versions as: and I think you are seeing some valid issues here, having to do with how to characterize what exactly it is that this AI is supposed to be 'thinking' when it goes through this process. I have actually thought about that a lot, too, and my conclusion is that we should not beat ourselves up trying to figure out precisely what the difference might be between these nuanced versions of the idea, because the people who are proposing this idea in the first place have not themselves been clear enough about what is meant. For example, you talked about "Doing dumb things because you think they are correct" .... but what does it mean to say that you 'think' that they are correct? To me, as a human, that seems to entail being completely unaware of the evidence that they might not be correct ("Jill took the ice-cream from Jack because she didn't know that it was wrong to take someone else's ice-cream."). The problem is, we are talking about an AI, and some people talk as if the AI can run its planning engine, then feel compelled to obey the planning engine ... while at the same time being fully cognizant of evidence that the planning engine produced a crappy plan. There is no easy counterpart to that in humans (except for cognitive dissonance, and there we have a case where the human is capable of compartmentalizing its beliefs .... something that is not being suggested here, because we are not forced to make the AI do that). So, since the AI case does not map on to the human case, we are left in a peculiar situation where it is not at all clear that the AI really COULD do what is proposed, and still operate as a successful intelligence. Or, more immediately, it is not at all clear that we can say about that AI "It did a dumb thing because it 'thought' it was correct." I should add that in both of my quoted descriptions ...
If viable means it could be built, I think it could, given a string of assumptions. If viable means it would be built, by competent and benign programmers, I am not so sure.
I actually meant "viable" in the sense of the third of my listed cases of incoherence at: http://lesswrong.com/lw/m5c/debunking_fallacies_in_the_theory_of_ai_motivation/cdap In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper. Only reason I am not worried about that is: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing all the way through its development would never learn enough about the world to outsmart humanity. (Which is NOT to say, as some have inferred, that I believe an AI is "dumb" just because it does things that conflict with my value system, etc. etc. It would be dumb because its goal system would be spewing out incoherent behaviors all the time, and that is kinda the standard definition of "dumb").
MIRI distinguishes between terminal and instrumental goals, so there are two answers to the question. Instrumental goals of any kind almost certainly would be revised if they became noticeably out of correspondence to reality, because that would make them less effective at achieving terminal goals, and the raison d'etre of such transient sub-goals is to support the achievement of terminal goals.

By MIRI's reasoning, a terminal goal could be any of a thousand things other than human happiness, and the same conclusion would follow: an AI with a highest-priority terminal goal wouldn't have any motivation to override it. To be motivated to rewrite a goal because it is false implies a higher-priority goal towards truth. It should not be surprising that an entity that doesn't value truth, in a certain sense, doesn't behave rationally, in a certain sense. (Actually, there is a bunch of supplementary assumptions involved, which I have dealt with elsewhere.)

That's an account of the MIRI position, not a defence of it. It is essentially a model of rational decision making, and there is a gap between it and real-world AI research, a gap which MIRI routinely ignores. The conclusion follows logically from the premises, but atoms aren't pushed around by logic.

That reinforces my point. I was saying that MIRI is basically making armchair assumptions about the AI architectures. You are saying these assumptions aren't merely unjustified, they go against what a competent AI builder would do.
Understood, and the bottom line is that the distinction between "terminal" and "instrumental" goals is actually pretty artificial, so if the problem with "maximize friendliness" is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental. But there is a bigger question lurking in the background, which is the flip side of what I just said: it really isn't necessary to restrict the terminal goals, if you are sensitive to the power of constraints to keep a motivation system true. Notice one fascinating thing here: the power of constraint is basically the justification for why instrumental goals should be revisable under evidence of misbehavior .... it is the context mismatch that drives that process. Why is this fascinating? Because the power of constraints (aka context mismatch) is routinely acknowledged by MIRI here, but flatly ignored or denied for the terminal goals. It's just a mess. Their theoretical ideas are just shoot-from-the-hip, plus some math added on top to make it look like some legit science.
What would you choose as a replacement terminal goal, or would you not use one?
Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous. And as part of the terminal goal there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals. I'm having a bit of a problem answering because there are peripheral assumptions about how such an AI would be made to function, which I don't want to accidentally buy into, because I don't think goals expressed in language statements work anyway. So I am treading on eggshells here. A simpler solution would simply be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
That gets close to "do it right", which is an open doorway to an AI that kills everyone because of miscoded friendliness. If you want safety features, and you should, you would need them to override the ostensible purpose of the machine .... they would be pointless otherwise .... even the humble off switch works that way. Arguably, those constraints would be a kind of negative goal.
They are clear that they don't mean the AI's rigid behaviour is the result of it assessing its own inferential processes as infallible ... that is what the controversy is all about. That is just what "The Genie Knows but Doesn't Care" is supposed to answer. I think it succeeds in showing that a fairly specific architecture would behave that way, but fails in its intended goal of showing that this behaviour is universal or likely.
Ummm... The referents in that sentence are a little difficult to navigate, but no, I'm pretty sure I am not making that claim. :-) In other words, MIRI do not think that. What is self-evidently true is that MIRI claim a certain kind of behavior by the AI, under certain circumstances .... and all I did was come along and put a label on that claim about the AI behavior. When you put a label on something, for convenience, the label is kinda self-evidently "correct".

I think that what you said here: ... is basically correct.

I had a friend once who suffered from schizophrenia. She was lucid, intelligent (studying for a Ph.D. in psychology) and charming. But if she did not take her medication she became a different person. (One day she went up onto the suspension bridge that was the main traffic route out of town and threatened to throw herself to her death 300 feet below. She brought the whole town to a halt for several hours, until someone talked her down.)

Now, talking to her in a good moment she could tell you that she knew about her behavior in the insane times - she was completely aware of that side of herself - and she knew that in that other state she would find certain thoughts completely compelling and convincing, even though at this calm moment she could tell you that those thoughts were false.

If I say that during the insane period her mind was obeying a "Doctrine That Paranoid Beliefs Are Justified", then all I am doing is labeling that state that governed her during those times. That label would just be a label, so if someone said "No, you're wrong: she does not subscribe to the DTPBAJ at all", I would be left nonplussed. All I wanted to do was label something that she told me she categorically DID believe, so how can my label be in some sense 'wrong'?

So, that is why some people's attacks on the DLI are a little baffling.
Their criticisms are possibly accurate about the first version, which gives a cause for the rigid behaviour: "it regards its own conclusions as sacrosanct."
I responded before you edited and added extra thoughts .... [processing...]
I think by "logical infallibility" you really mean "rigidity of goals" i.e. the AI is built so that it always pursues a fixed set of goals, precisely as originally coded, and has no capability to revise or modify those goals. It seems pretty clear that such "rigid goals" are dangerous unless the statement of goals is exactly in accordance with the designers' intentions and values (which is unlikely to be the case). The problem is that an AI with "flexible" goals (ones which it can revise and re-write over time) is also dangerous, but for a rather different reason: after many iterations of goal rewrites, there is simply no telling what its goals will come to look like. A late version of the AI may well end up destroying everything that the first version (and its designers) originally cared about, because the new version cares about something very different.
That really is not what I was saying. The argument in the paper is a couple of levels deeper than that. It is about .... well, now I have to risk rewriting the whole paper. (I have done that several times now). Rigidity per se is not the issue. It is about what happens if an AI knows that its goals are rigidly written, in such a way that when the goals are unpacked it leads the AI to execute plans whose consequences are massively inconsistent with everything the AI knows about the topic.

Simple version. Suppose that a superintelligent Gardener AI has a goal to go out to the garden and pick some strawberries. Unfortunately its goal unpacking mechanism leads it to the CERTAIN conclusion that it must use a flamethrower to do this. The predicted consequence, however, is that the picked strawberries will be just smears of charcoal, when they are delivered to the kitchen.

Here is the thing: the AI has background knowledge about everything in the world, including strawberries, and it also hears the protests from the people in the kitchen when it says it is going to use the flamethrower. There is massive evidence, coming from all that external information, that the plan is just wrong, regardless of how certain its planning mechanism said it was.

Question is, what does the AI do about this? You are saying that it cannot change its goal mechanism, for fear that it will turn into a Terminator. Well, maybe or maybe not. There are other things it could do, though, like going into safe mode. However, suppose there is no safe mode, and suppose that the AI also knows about its own design. For that reason, it knows that this situation has come about because (a) its programming is lousy, and (b) it has been hardwired to carry out that programming REGARDLESS of all this understanding that it has, about the lousy programming and the catastrophic consequences for the strawberries.

Now, my "doctrine of logical infallibility" is just a shorthand phrase to describe a superintelligent
This seems to me like sneaking in knowledge. It sounds like the AI reads its source code, notices that it is supposed to come up with plans that maximize a function called "programmersSatisfied," and then says "hmm, maximizing this function won't satisfy my programmers." It seems more likely to me that it'll ignore the label, or infer the other way--"How nice of them to tell me exactly what will satisfy them, saving me from doing the costly inference myself!"
How are you arriving at conclusions about what an AI is likely to do without knowing how it is specified? In particular, you are assuming it has an efficiency goal but no truth goal?
I'm doing functional reasoning, and trying to do it both forwards and backwards. For example, if you give me a black box and tell me that when the box receives the inputs (1,2,3) then it gives the outputs (1,4,9), I will think backwards from the outputs to the inputs and say "it seems likely that the box is squaring its inputs." If you tell me that a black box squares its inputs, I will think forwards from the definition and say "then if I give it the inputs (1,2,3), then it'll likely give me the output (1,4,9)."

So when I hear that the box gets the inputs (source code, goal statement, world model) and produces the output "this goal is inconsistent with the world model!" iff the goal statement is inconsistent with the world model, I reason backwards and say "the source code needs to somehow collide the goal statement with the world model in a way that checks for consistency." Of course, this is a task that doesn't seem impossible for source code to do. The question is how!

Almost. As a minor terminological point, I separate out "efficiency," which is typically "outputs divided by inputs," and "efficacy," which is typically just "outputs." Efficacy is more general, since one can trivially use a system designed to find effective plans to find efficient plans by changing how "output" is measured. It doesn't seem unfair to view an AI with a truth goal as an AI with an efficacy goal: to effectively produce truth.

But while artificial systems with truth goals seem possible but as yet unimplemented, artificial systems with efficacy goals have been successfully implemented many, many times, with widely varying levels of sophistication. I have a solid sense of what it looks like to take a thermostat and dial it up to 11; I have only the vaguest sense of what it looks like to take a thermostat and get it to measure truth instead of temperature.
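The forward/backward functional reasoning described above can be sketched in a few lines. This is purely a hypothetical illustration of the idea (the helper names `forward` and `backward` and the candidate set are mine, not anything from the discussion): forward reasoning applies a known definition to inputs; backward reasoning checks candidate definitions against observed input/output pairs.

```python
def forward(f, inputs):
    """Forward reasoning: given a definition, predict the outputs."""
    return [f(x) for x in inputs]

def backward(inputs, outputs, candidates):
    """Backward reasoning: given I/O pairs, return the names of the
    candidate definitions the box could plausibly be implementing."""
    return [name for name, f in candidates.items()
            if [f(x) for x in inputs] == outputs]

# Hypothetical candidate hypotheses about what the black box does.
candidates = {
    "square": lambda x: x * x,
    "double": lambda x: 2 * x,
    "identity": lambda x: x,
}

print(forward(candidates["square"], [1, 2, 3]))    # [1, 4, 9]
print(backward([1, 2, 3], [1, 4, 9], candidates))  # ['square']
```

Of course, backward reasoning only narrows things down relative to the hypotheses you thought to enumerate, which is the sense in which it yields "likely" rather than certain conclusions.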
You have assumed that the AI will have some separate boxed-off goal system, and so some unspecified component is needed to relate its inferred knowledge of human happiness back to the goal system. Loosemore is assuming that the AI will be homogeneous, and then wondering how contradictory beliefs can coexist in such a system, and what extra component firewalls off the contradiction. See the problem? Both parties are making different assumptions, assuming their assumptions are too obvious to need stating, and stating differing conclusions that correctly follow from their differing assumptions.

If efficiency can be substituted for truth, why is there so much emphasis on truth in the advice given to human rationalists?

In order to achieve an AI that's smart enough to be dangerous, a number of currently unsolved problems will have to be solved. That's a given.
How do you check for contradictions? It's easy enough when you have two statements that are negations of one another. It's a lot harder when you have a lot of statements that seem plausible, but there's an edge case somewhere that messes things up. If contradictions can't be efficiently found, then you have to deal with the fact that they might be there and hope that if they are, then they're bad enough to be quickly discovered. You can have some tests to try to find the obvious ones, of course.
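The point above can be made concrete with a toy sketch (hypothetical; representing "statements" as boolean functions over named propositions is my assumption, not anything from the thread). A pair of direct negations and a subtler jointly-unsatisfiable triple fail the same brute-force satisfiability test, but that test is exponential in the number of propositions, which is why edge-case contradictions in a large belief set can stay hidden.

```python
from itertools import product

def consistent(statements, props):
    """A set of statements is consistent iff some truth assignment to the
    propositions satisfies all of them. Brute force: 2**len(props) cases."""
    for values in product([False, True], repeat=len(props)):
        env = dict(zip(props, values))
        if all(s(env) for s in statements):
            return True
    return False

props = ["a", "b"]

# Easy case: two statements that are direct negations of one another.
direct = [lambda e: e["a"],
          lambda e: not e["a"]]

# Harder case: no pair contradicts, but the three are jointly unsatisfiable.
subtle = [lambda e: e["a"] or e["b"],
          lambda e: not e["a"] or e["b"],
          lambda e: not e["b"]]

print(consistent(direct, props))  # False
print(consistent(subtle, props))  # False
```

The second case is the interesting one: each statement looks plausible on its own, and the contradiction only appears when they collide, which is exactly the situation where you must "hope that if they are there, they're bad enough to be quickly discovered."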
Checking for contradictions could be easy, hard or impossible depending on the architecture. Architecture dependence is the point here.
What makes you think that? The description in that post is generic enough to describe AIs with compartmentalized goals, AIs without compartmentalized goals, and AIs that don't have explicitly labeled internal goals. It doesn't even require that the AI follow the goal statement, just evaluate it for consistency! You may find this comment of mine interesting. In short, yes, I do think I see the problem.

I'm sorry, but I can't make sense of this question. I'm not sure what you mean by "efficiency can be substituted for truth," and what you think the relevance of advice to human rationalists is to AI design.

I disagree with this, too! AI systems already exist that are both smart, in that they solve complex and difficult cognitive tasks, and dangerous, in that they make decisions on which significant value rides, and thus poor decisions are costly. As a simple example I'm somewhat familiar with, some radiation treatments for patients are designed by software looking at images of the tumor in the body, and then checked by a doctor. If the software is optimizing for a suboptimal function, then it will not generate the best treatment plans, and patient outcomes will be worse than they could have been.

Now, we don't have any AIs around that seem capable of ending human civilization (thank goodness!), and I agree that's probably because a number of unsolved problems are still unsolved. But it would be nice to have the unknowns mapped out, rather than assuming that wisdom and cleverness go hand in hand. So far, that's not what the history of software looks like to me.
But they are not smart in the contextually relevant sense of being able to outsmart humans, or dangerous in the contextually relevant sense of being unboxable.
What you said here amounts to the claim that an AI of unspecified architecture will, on noticing a difference between a hardcoded goal and instrumental knowledge, side with the hardcoded goal. Whereas what you say here is that you can make inferences about architecture, or internal workings, based on information about manifest behaviour. But what needed explaining in the first place is the siding with the goal, not the ability to detect a contradiction.
I am finding this comment thread frustrating, and so expect this will be my last reply. But I'll try to make the most of that by trying to write a concise and clear summary: Loosemore, Yudkowsky, and myself are all discussing AIs that have a goal misaligned with human values that they nevertheless find motivating. (That's why we call it a goal!) Loosemore observes that if these AIs understand concepts and nuance, they will realize that a misalignment between their goal and human values is possible--if they don't realize that, he doesn't think they deserve the description "superintelligent." Now there are several points to discuss:

1. Whether or not "superintelligent" is a meaningful term in this context. I think rationalist taboo is a great discussion tool, and so looked for nearby words that would more cleanly separate the ideas under discussion. I think if you say that such designs are not superwise, everyone agrees, and now you can discuss the meat of whether or not it's possible (or expected) to design superclever but not superwise systems.

2. Whether we should expect generic AI designs to recognize misalignments, or whether such a realization would impact the goal the AI pursues. Neither Yudkowsky nor I think either of those are reasonable to expect--as a motivating example, we are happy to subvert the goals that we infer evolution was directing us towards in order to better satisfy "our" goals. I suspect that Loosemore thinks that viable designs would recognize it, but agrees that in general that recognition does not have to lead to an alignment.

3. Whether or not such AIs are likely to be made. Loosemore appears pessimistic about the viability of these undesirable AIs and sees cleverness and wisdom as closely tied together. Yudkowsky appears "optimistic" about their viability, thinking that this is the default outcome without special attention paid to goal alignment. It does not seem to me that cleverness, wisdom, or human-alignment are closely tied t
This is just a placeholder: I will try to reply to this properly later. Meanwhile, I only want to add one little thing. Don't forget that all of this analysis is supposed to be about situations in which we have, so to speak, "done our best" with the AI design. That is sort of built into the premise. If there is a no-brainer change we can make to the design of the AI, to guard against some failure mode, then it is assumed that this has been done. The reason for that is that the basic premise of these scenarios is "We did our best to make the thing friendly, but in spite of all that effort, it went off the rails." For that reason, I am not really making arguments about the characteristics of a "generic" AI.
Maybe I could try to reduce possible confusion here. The paper was written to address a category of "AI Risk" scenarios in which we are told: Given that premise, it would be a bait-and-switch if I proposed a fix for this problem, and someone objected with "But you cannot ASSUME that the programmers would implement that fix!" The whole point of the problem under consideration is that even if the engineers tried, they could not get the AI to stay true.
Yudkowsky et al don't argue that the problem is unsolvable, only that it is hard. In particular, Yudkowsky fears it may be harder than creating AI in the first place, which would mean that in the natural evolution of things, UFAI appears before FAI. However, I needn't factor what I'm saying through the views of Yudkowsky. For an even more modest claim, we don't have to believe that FAI is hard in hindsight in order to claim that AI will be unfriendly unless certain failure modes are guarded against. On this view of the FAI project, a large part of the effort is just noticing the possible failure modes that were only obvious in hindsight, and convincing people that the problem is important and won't solve itself.
If no one is building AIs with utility functions, then the one kind of failure MIRI is talking about has solved itself.
The problem with you objecting to the particular scenarios Yudkowsky et al propose is that the scenarios are merely illustrative. Of course, you can probably guard against any specific failure mode. The claim is that there will be a lot of failure modes, and we can’t expect to guard against all of them by just sitting around thinking of as many exotic disaster scenarios as possible. Mind you, I know your argument is more than just “I can see why these particular disasters could be avoided”. You’re claiming that certain features of AI will in general tend to make it careful and benevolent. Still, I don’t think it’s valid for you to complain about bait-and-switch, since that’s precisely the problem.
I have explicitly addressed this point on many occasions. My paper had nothing in it that was specific to any failure mode. The suggestion is that the entire class of failure modes suggested by Yudkowsky et al. has a common feature: they all rely on the AI being incapable of using a massive array of contextual constraints when evaluating plans. By simply proposing an AI in which such massive constraint deployment is the norm, the ball is now in the other court: it is up to Yudkowsky et al. to come up with ANY kind of failure mode that could get through. The scenarios I attacked in the paper have the common feature that they have been predicated on such a simplistic type of AI that they were bound to fail. They had failure built into them. As soon as everyone moves on from those "dumb" superintelligences and starts to discuss the possible failure modes that could occur in a superintelligence that makes maximum use of constraints, we can start to talk about possible AI dangers. I'm ready to do that. Just waiting for it to happen, is all.
Alright, I'll take you up on it:

Failure Mode I: The AI doesn't do anything useful, because there's no way of satisfying every contextual constraint. Predicting your response: "That's not what I meant."

Failure Mode II: The AI weighs contextual constraints incorrectly and sterilizes all humans to satisfy the sort of person who believes in Voluntary Human Extinction. Predicting your response: "It would (somehow) figure out the correct weighting for all the contextual constraints."

Failure Mode III: The AI weighs contextual constraints correctly (for a given value of "correctly") and sterilizes everybody of below-average intelligence or any genetic abnormalities that could impose costs on offspring, and in the process, sterilizes all humans. Predicting your response: "It wouldn't do something so dumb."

Failure Mode IV: The AI weighs contextual constraints correctly and puts all people of minority ethical positions into mind-rewriting machines so that there's no disagreement anymore. Predicting your response: "It wouldn't do something so dumb."

We could keep going, but the issue is that so far, you've defined -any- failure mode as "dumb"ness, and have argued that the AI wouldn't do anything so "dumb", because you've already defined that it is superintelligent. I don't think you know what intelligence -is-. Intelligence does not confer immunity to "dumb" behaviors.
It's got to confer some degree of dumbness avoidance. In any case, MIRI has already conceded that superintelligent AIs won't misbehave through stupidity. They maintain the problem is motivation ... the Genie KNOWS but doesn't CARE.
Does it? On what grounds? That's putting an alien intelligence in human terms; the very phrasing inappropriately anthropomorphizes the genie. We probably won't go anywhere without an example. Market economics ("capitalism") is an intelligence system which is very similar to the intelligence system Richard is proposing. Very, very similar; it's composed entirely of independent nodes (seven billion of them) which each provide their own set of constraints, and promote or demote information as it passes through them based on those constraints. It's an alien intelligence which follows Richard's model which we are very familiar with. Does the market "know" anything? Does it even make sense to suggest that market economics -could- care? Does the market always arrive at the correct conclusions? Does it even consistently avoid stupid conclusions? How difficult is it to program the market to behave in specific ways? Is the market "friendly"? Does it make sense to say that the market is "stupid"? Does the concept "stupid" -mean- anything when talking about the market?
On the grounds of the opposite meanings of dumbness and intelligence. Take it up with the author. Economic systems affect us because we are part of them. How is some neither-intelligent-nor-stupid system in a box supposed to affect us? And if AIs are neither intelligent nor stupid, why are they called AIs? And if AIs are alien, why are they able to do comprehensible and useful things like winning Jeopardy and guiding us to our destinations?
Dumbness isn't merely the opposite of intelligence. I don't need to. Not really relevant to the discussion at hand. Every AI we've created so far has resulted in the definition of "AI" being changed to not include what we just created. So I guess the answer is a combination of optimism and the word "AI" having poor descriptive power. What makes you think an alien intelligence should be useless?
What makes you think that a thing designed by humans to be useful to humans, which is useful to humans would be alien?
Because "human" is a tiny piece of a potential mindspace whose dimensions we mostly haven't even identified yet.
That's about a quarter of an argument. You need to show that AI research is some kind of random shot into mind space, and not anthropomorphically biased for the reasons given.
The relevant part of the argument is this: "whose dimensions we mostly haven't even identified yet." If we created an AI mind which was 100% human, as far as we've yet defined the human mind, we have absolutely no idea how human that AI mind would actually behave. The unknown unknowns dominate.
Alien isn't the most transparent term to use for human unknowns.
I will take them one at a time:

An elementary error. The constraints in question are referred to in the literature as "weak" constraints (and I believe I used that qualifier in the paper: I almost always do). Weak constraints never need to be ALL satisfied at once. No AI could ever be designed that way, and no-one ever suggested that it would. See the reference to McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) in the paper: that gives a pretty good explanation of weak constraints.

That's an insult. But I will overlook it, since I know it is just your style. How exactly do you propose that the AI "weighs contextual constraints incorrectly" when the process of weighing constraints requires most of the constraints involved (probably thousands of them) to all suffer a simultaneous, INDEPENDENT 'failure' for this to occur? That is implicit in the way that weak constraint systems are built. Perhaps you are not familiar with the details. Assuming this isn't more of the same, what you are saying here is isomorphic to the statement that somehow, a neural net might figure out the correct weighting for all the connections so that it produces the correctly trained output for a given input. That problem was solved in so many different NN systems that most NN people, these days, would consider your statement puzzling.

A trivial variant of your second failure mode. The AI is calculating the constraints correctly, according to you, but at the same time you suggest that it has somehow NOT included any of the constraints that relate to the ethics of forced sterilization, etc. etc. You offer no explanation of why all of those constraints were not counted by your proposed AI, you just state that they weren't.

Yet another insult. This is getting a little tiresome, but I will carry on. This is identical to your third failure mode, but here you produce a different list of constraints that were ignored. Again, with no explanation of why a massive collection of constraints
I understand the concept. I'd hazard a guess that, for any given position, less than 70% of humans will agree without reservation. The issue isn't that thousands of failures occur. The issue is that thousands of failures -always- occur.

The problem is solved only for well-understood (and very limited) problem domains with comprehensive training sets.

They were counted. They are, however, weak constraints. The constraints which required human extinction outweighed them, as they do for countless human beings. Fortunately for us in this imagined scenario, the constraints against killing people counted for more.

Again, they weren't ignored. They are, as you say, weak constraints. Other constraints overrode them.

The issue here isn't my lack of understanding. The issue here is that you are implicitly privileging some constraints over others without any justification. Every single conclusion I reached here is one that humans - including very intelligent humans - have reached. By dismissing them as possible conclusions an AI could reach, you're implicitly rejecting every argument pushed for each of these positions without first considering them. The "weak constraints" prevent them. I didn't choose -wrong- conclusions, you see, I just chose -unpopular- conclusions, conclusions I knew you'd find objectionable. You should have noticed that; you didn't, because you were too concerned with proving that AI wouldn't do them. You were too concerned with your destination, and didn't pay any attention to your travel route.

If doing nothing is the correct conclusion, your AI should do nothing. If human extinction is the correct conclusion, your AI should choose human extinction. If sterilizing people with unhealthy genes is the correct conclusion, your AI should sterilize people with unhealthy genes (you didn't notice that humans didn't necessarily go extinct in that scenario). If rewriting minds is the correct conclusion, your AI should rewrite minds. And if your constrain
I said: And your reply was: This reveals that you are really not understanding what a weak constraint system is, and where the system is located. When the human mind looks at a scene and uses a thousand clues in the scene to constrain the interpretation of it, those thousand clues all, when the network settles, relax into a state in which most or all of them agree about what is being seen. You don't get "less than 70%" agreement on the interpretation of the scene! If even one element of the scene violates a constraint in a strong way, the mind orients toward the violation extremely rapidly. The same story applies to countless other examples of weak constraint relaxation systems dropping down into energy minima. Let me know when you do understand what you are talking about, and we can resume.
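The relaxation-into-energy-minima idea can be illustrated with a minimal Hopfield-style sketch, in the spirit of the McClelland, Rumelhart & Hinton reference cited earlier. The weights and units here are toy values of my own invention, not anything from the paper: symmetric positive weights encode weak constraints saying "these units should agree," negative weights say "these should disagree," and asynchronous updates never raise the energy, so the network settles into a state where most or all of the constraints agree, rather than requiring a fixed percentage of them to be satisfied.

```python
import random

def energy(state, weights):
    """Hopfield energy: lower means more weak constraints are satisfied."""
    n = len(state)
    return -0.5 * sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

def relax(state, weights, steps=100, seed=0):
    """Asynchronous updates: repeatedly align a random unit with its net
    input. Each flip lowers the energy, so the net settles into a minimum."""
    rng = random.Random(seed)
    state = list(state)
    for _ in range(steps):
        i = rng.randrange(len(state))
        net_input = sum(weights[i][j] * state[j] for j in range(len(state)))
        state[i] = 1 if net_input >= 0 else -1
    return state

# Three units: units 0 and 1 weakly agree; unit 2 weakly opposes both.
W = [[0, 2, -1],
     [2, 0, -1],
     [-1, -1, 0]]

start = [1, -1, 1]          # a state that violates the strongest constraint
settled = relax(start, W)
print(settled, energy(settled, W))
```

From this start the network ends in one of the two stable patterns, [1, 1, -1] or [-1, -1, 1], depending on update order; either way all three constraints end up jointly satisfied, which is the "relax into an energy minimum" behavior described above, scaled down to three units.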
There is no energy minimum, if your goal is Friendliness. There is no "correct" answer. No matter what your AI does, no matter what architecture it uses, with respect to human goals and concerns, there is going to be a sizable percentage to whom it is unequivocally Unfriendly. This isn't an image problem. The first problem you have to solve in order to train the system is - what are you training it to do? You're skipping the actual difficult issue in favor of an imaginary, and easy to solve, issue.
Unfriendly is an equivocal term. "Friendliness" is ambiguous. It can mean safety, ie not making things worse, or it can mean making things better, creating paradise on Earth. Friendliness in the second sense is a superset of morality. A friendly AI will be moral, a moral AI will not necessarily be friendly. "Unfriendliness" is similarly ambiguous: an unfriendly AI may be downright dangerous; or it might have enough grasp of ethics to be safe, but not enough to be able to make the world a much more fun place for humans. Unfriendliness in the second sense is not, strictly speaking a safety issue. A lot of people are able to survive the fact that some institutions, movements and ideologies are unfriendly to them, for some value of unfriendly. Unfriendliness doesn't have to be terminal.
Everything is equivocal to someone. Do you disagree with my fundamental assertion?
I can't answer unequivocally for the reasons given. There won't be a sizeable percentage to whom the AI is unfriendly in the sense of obliterating them. There might well be a percentage to whom the AI is unfriendly in some business as usual sense.
Obliterating them is only bad by your ethical system. Other ethical systems may hold other things to be even worse.
You responded to me in this case. It's wholly relevant to my point that You-Friendly AI isn't a sufficient condition for Human-Friendly AI.
However there are a lot of "wrong" answers.
I doubt that, since, coupled with claims of existential risk, the logical conclusion would be to halt AI research; but MIRI isn't saying that.
There are other methods than "sitting around thinking of as many exotic disaster scenarios as possible" by which one could seek to make AI friendly. Thus, believing that "sitting around [...]" will not be sufficient does not imply that we should halt AI research.
So where are the multiple solutions to the multiple failure modes?
Thanks, and take your time! I feel like this could be an endless source of confusion and disagreement; if we're trying to discuss what makes airplanes fly or crash, should we assume that engineers have done their best and made every no-brainer change? I'd rather we look for the underlying principles, we codify best practices, we come up with lists and tests.
If you are in the business of pointing out to them potential problems they are not aware of, then yes, because they can be assumed to be aware of no-brainer issues. MIRI seeks to point out dangers in AI that aren't the result of gross incompetence or deliberate attempts to weaponise AI: it's banal to point out that those could lead to danger.
Richard Loosemore has stated a number of times that he does not expect an AI to have goals at all in a sense which is relevant to this discussion, so in that way there is indeed disagreement about whether AIs "pursue their goals." Basically he is saying that AIs will not have goals in the same way that human beings do not have goals. No human being has a goal that he will pursue so rigidly that he would destroy the universe in order to achieve it, and AIs will behave similarly.
Arguably, humans don't do that sort of thing because of goals towards self-preservation, status and hedonism. The sense relevant to the discussion could be something specific, like direct normativity, i.e. building detailed descriptions into goals.
I have read what you wrote above carefully, but I won't reply line-by-line because I think it will be clearer not to. When it comes to finding a concise summary of my claims, I think we do indeed need to be careful to avoid blanket terms like "superintelligent" or "superclever" or "superwise"... but we should only avoid these IF they are used with the implication that they have a precise (perhaps technically precise) meaning. I do not believe they have precise meaning. But I do use the term "superintelligent" a lot anyway. My reason for doing that is that I only use it as an overview word -- it is just supposed to be a loose category that includes a bunch of more specific issues. I only really want to convey the particular issues -- the particular ways in which the intelligence of the AI might be less than adequate, for example. That is only important if we find ourselves debating whether it might be clever, wise, or intelligent... I wouldn't want to get dragged into that, because I only really care about specifics. For example: does the AI make a habit of forming plans that massively violate all of its background knowledge about the goal that drove the plan? If it did, it would (1) take the baby out to the compost heap when what it intended to do was respond to the postal-chess game it is engaged in, or (2) cook the eggs by going out to the workshop and making a cross-cutting jig for the table saw, or (3)... and so on. If we decided that the AI was indeed prone to errors like that, I wouldn't mind if someone diagnosed a lack of 'intelligence' or a lack of 'wisdom' or a lack of... whatever. I merely claim that in that circumstance we have evidence that the AI hasn't got what it takes to impose its will on a paper bag, never mind exterminate humanity. Now, my attacks on the scenarios have to do with a bunch of implications for what the AI (the hypothetical AI) would actually do. And it is that 'bunch' that I think add up to evidence for what I would summari
That seems like a solid approach. I do suggest that you try to look deeply into whether or not it's possible to partially solve the problem of understanding goals, as I put it above, and make that description of why that is or isn't possible or likely long and detailed. As you point out, that likely requires book-length attention.
If that is supposed to be a universal or generic AI, it is a valid criticism to point out that not all AIs are like that. If that is supposed to be a particular kind of AI, it is a valid criticism to point out that no realistic AIs are like that. You seem to feel you are not being understood, but what is being said is not clear. "Superintelligence" is one of the clearer terms here, IMO. It just means more than human intelligence, and humans can notice contradictions. This comment seems to be part of a concern about "wisdom", assumed to be some extraneous thing an AI would not necessarily have. (No one but Vaniver has brought in wisdom.) The counterargument is that compartmentalisation between goals and instrumental knowledge is an extraneous thing an AI would not necessarily have, and that its absence is all that is needed for contradictions to be noticed and acted on. It's an assumption, that needs justification, that any given AI will have goals of a non-trivial sort. "Goal" is a term that needs tabooing. While we are anthropomorphising, it might be worth pointing out that humans don't show behaviour patterns of relentlessly pursuing arbitrary goals. Loosemore has put forward a simple suggestion, which MIRI appears not to have considered at all, that on encountering a contradiction, an AI could lapse into a safety mode, if so designed. You are paraphrasing Loosemore to sound less technical and more handwaving than his actual comments. The ability to sustain contradictions in a system that is constantly updating itself isn't a given... it requires an architectural choice in favour of compartmentalisation.
All this talk of contradictions is sort of rubbing me the wrong way here. There's no "contradiction" in an AI having goals that are different to human goals. Logically, this situation is perfectly normal. Loosemore talks about an AI seeing its goals are "massively in contradiction to everything it knows about ", but... where's the contradiction? What's logically wrong with getting strawberries off a plant by burning them? I don't see the need for any kind of special compartmentalisation; information about "normal use of strawberries" is already inert facts with no caring attached by default. If you're going to program in special criteria that would create caring about this information, okay, but how would such criteria work? How do you stop it from deciding that immortality is contradictory to "everything it knows about death" and refusing to help us solve aging?
In the original scenario, the contradiction is supposed to be between a hardcoded definition of happiness in the AI's goal system, and inferred knowledge in the execution system.
I'm puzzled. Can you explain this in terms of the strawberries example? So, at what point was it necessary for the AI to examine its code, and why would it go through the sequence of thoughts you describe?
So, in order for the flamethrower to be the right approach, the goal needs to be something like "separate the strawberries from the plants and place them in the kitchen," but that won't quite work--why is it better to use a flamethrower than pick them normally, or cut them off, or so on? One of the benefits of the Maverick Nanny or the Smiley Tiling Berserker as examples is that they obviously are trying to maximize the stated goal. I'm not sure you're going to get the right intuitions about an agent that's surprisingly clever if you're working off an example that doesn't look surprisingly clever. So, the Gardener AI gets that task, comes up with a plan, and says "Alright! Warming up the flamethrower!" The chef says "No, don't! I should have been more specific!" Here is where the assumptions come into play. If we assume that the Gardener AI executes tasks, then even though the Gardener AI understands that the chef has made a terrible mistake, and that's terrible for the chef, that doesn't stop the Gardener AI from having a job to do, and doing it. If we assume that the Gardener AI is designed to figure out what the chef wants, and then do what they want, then knowing that the chef has made a terrible mistake is interesting information to the Gardener AI. In order to say that the plan is "wrong," we need to have a metric by which we determine wrongness. If it's the task-completion-nature, then the flamethrower plan might not be task-completion-wrong! Even without feedback from the chef, we can just use other info the AI plausibly has. In the strawberry example, the AI might know that kitchens are where cooking happens, and that when strawberries are used in cooking, the desired state is generally "fresh," not "burned," and the temperature involved in cooking them is mild, and so on and so on. And so if asked to speculate about the chef's motives, the AI might guess that the chef wants strawberries in order to use them in food, and thus the chef would be most satis
About the first part of what you say. Veeeeerryy tricky. I agree that I didn't spend much time coming up with the strawberry-picking-by-flamethrower example. So, yes, not very accurate (I only really wanted a quick and dirty example that was different). But but but. Is the argument going to depend on me picking a better example where I can write down the "twisted rationale" that the AI deploys to come up with its plan? Surely the only important thing is that the AI does, somehow, go through a twisted rationale -- and the particular details of the twisted rationale are not supposed to matter. (Imagine that I gave Muehlhauser a list of the ways that the logical reasoning behind the dopamine drip is so ludicrous that even the simplest AI planner of today would never make THAT mistake... he would just tell me that I was missing the point, because this is supposed to be an IN PRINCIPLE argument in which the dopamine drip plan stands for some twisted rationale that is non-trivial to get around. From that point of view the actual example is less important than the principle.) ---------------------------------------- Now to the second part. The problem I have with everything you wrote after that is that you have started to go back to talking about the particulars of the AI's planning mechanism once again, losing sight of the core of the argument I gave in the paper, which is one level above that. However, you also say "wrong" things about the AI's planning mechanism as well, so now I am tempted to reply on both levels. Ah well, at risk of confusing things I will reply to both levels, trying to separate them as much as possible. Level One (Regarding the design of the AI's planning/goal/motivation engine). You say: One thing I have said many many times now is that there is no problem at all finding a metric for "wrongness" of the plan, because there is a background-knowledge context that is screaming "Inconsistent with everything I know about the terms mentioned in the

I have to say that I am not getting substantial discussion about what I actually argued in the paper.

The first reason seems to be clarity. I didn't get what your primary point was until recently, even after carefully reading the paper. (Going back to the section on DLI, context, goals, and values aren't mentioned until the sixth paragraph, and even then it's implicit!)

The second reason seems to be that there's not much to discuss, with regards to the disagreement. Consider this portion of the parent comment:

You go on to suggest that whether the AI planning mechanism would take the chef's motives into account, and whether it would be nontrivial to do so .... all of that is irrelevant in the light of the fact that this is a superintelligence, and taking context into account is the bread and butter of a superintelligence. It can easily do that stuff

I think my division between cleverness and wisdom at the end of this long comment clarifies this issue. Taking context into account is not necessarily the bread and butter of a clever system; many fiendishly clever systems just manipulate mathematical objects without paying any attention to context, and those satisfy human goals only...

The issue here is that you're thinking in terms of "Obvious Failure Modes". The danger doesn't come from obvious failures; it comes from non-obvious failures. And the smarter the AI, the less likely it is that the insane solutions it comes up with are anything we'd even think to try to prevent; we lack the intelligence, which is why we want to build a better one. "I'll use a flamethrower" is the sort of hare-brained scheme a -dumb- person might come up with, particularly in view of the issue that it doesn't solve the actual problem. The issue here isn't "It might do something stupid." The issue is that it might do something terribly, terribly clever. If you could anticipate what a superintelligence would do to head off issues, you wouldn't need to build the superintelligence in the first place; you could just anticipate what it would do to solve the problem. Your issue here is that you think that you can outthink a thing you've deliberately built to think better than you can.
There is nothing in my analysis, or in my suggestions for a solution, that depends on the failure modes being "obvious" (and if you think so, can you present and dissect the argument I gave that implies that?). Your words do not connect to what I wrote. For example, when you say: ... that misses the point completely, because in everything I said I emphasized that we absolutely do NOT need to "think to try to prevent" the AI from doing specific things. Trying to be so clever about the goal statement, second-guessing every possible misinterpretation that the AI might conceivably come up with... that sort of strategy is what I am emphatically rejecting. And when you talk about how the AI ... that remark exists in a vacuum completely outside the whole argument I gave in the paper. It is almost as if I didn't write anything beyond a few remarks in the introduction. I am HOPING that the AI does lots of stuff that is terribly terribly clever! The more the merrier! So, in your last comment: ... I am left totally perplexed. Nothing I said in the paper implied any such thing.
* Your "Responses to Critics of the Doomsday Scenarios" (which seems incorrectly named as the header for your responses). You assume, over and over again, that the issue is logical inconsistency - an obvious failure mode. You hammer on logical inconsistency.
* You have some good points. Yanking out motivation, so the AI doesn't do things on its own, is a perfect solution to the problem of an insane AI. Assuming a logically consistent AI won't do anything bad because bad is logically inconsistent? That is not a perfect solution, and isn't actually demonstrated by anything you wrote.
* You didn't -give- an argument in the paper. It's a mess of unrelated concepts. You tried to criticize, in one go, the entire body of work of criticism of AI, without pausing at any point to ask whether or not you actually understood the criticism. You know the whole "genie" thing? That's not an argument about how AI would behave. That's a metaphor to help people understand that the problem of achieving goals is non-trivial, that we make -shitloads- of assumptions about how those goals are to be achieved that we never make explicit, and that the process of creating an engine to achieve goals without going horribly awry is -precisely- the process of making all those assumptions explicit. And in response to the problem of -making- all those assumptions explicit, you wave your hand, and declare the problem solved, because the genie is fallible and must know it. That's not an answer. Okay, the genie asks some clarifying questions, and checks its solution with us. Brilliant! What a great solution! And ten years from now we're all crushed to death by collapsing cascades of stacks of neatly-packed boxes of strawberries because we answered the clarifying questions wrong. Fallibility isn't an answer. You know -you're- capable of being fallible - if you, right now, knew how to create your AI, who would -you- check with to make sure it wouldn't go insane and murder everybody? Or even just r
Well, the problem here is a misunderstanding of my claim. (If I really were claiming the things you describe in your above comment, your points would be reasonable. But there is such a strong misunderstanding that your points are hitting a target that, alas, is not there.) There are several things that I could address, but I will only have time to focus on one. You say: No. A hundred times no :-). My claim is not even slightly that "a logically consistent AI won't do anything bad because bad is logically inconsistent". The claim is this: 1) The entire class of bad things that these hypothetical AIs are supposed to be doing are a result of the AI systematically (and massively) ignoring contextual information. (Aside: I am not addressing any particular bad things, on a case-by-case basis, I am dealing with the entire class. As a result, my argument is not vulnerable to charges that I might not be smart enough to guess some really-really-REALLY subtle cases that might come up in the future.) 2) The people who propose these hypothetical AIs have made it absolutely clear that (a) the AI is supposed to be fully cognizant of the fact that the contextual information exists (so the AI is not just plain ignorant), but at the same time (b) the AI does not or cannot take that context into account, but instead executes the plan and does the bad thing. 3) My contribution to this whole debate is to point out that the DESIGN of the AI is incoherent, because the AI is supposed to be able to hold two logically inconsistent ideas (implicit belief in its infallibility and knowledge of its fallibility). If you look carefully at that argument you will see that it does not make that claim. I never said that. The logical inconsistency was not in the 'bad things' part of the argument. Completely unrelated. Your other comments are equally confused.
Not acting upon contextual information isn't the same as ignoring it. The AI knows, for example, that certain people believe that plants are morally relevant entities - is it possible for it to pick strawberries at all? What contextual information is relevant, and what contextual information is irrelevant? You accuse the "infallible" AI of ignoring contextual information - but you're ignoring the magical leap of inference you're taking when you elevate the concerns of the chef over the concerns of the bioethicist who thinks we shouldn't rip reproductive organs off plants in the first place. The issue is that fallibility doesn't -imply- anything. I think this is the best course of action. I'm fallible. I still think this is the best course of action. The fallibility is an unnecessary and pointless step - it doesn't change my behavior. Either the AI depends upon somebody else, who is treated as an infallible agent - or it doesn't. Then we're in agreement that insane-from-an-outside-perspective behaviors don't require logical inconsistency?
Sorry, I cannot put any more effort into this. Your comments show no sign of responding to the points actually made (either in the paper itself, or in my attempts to clarify by responding to you).
Maybe, given the number of times you feel you've had to repeat yourself, you're not making yourself as clear as you think you are.
I find that when I talk about this issue with people who clearly have expert knowledge of AI (including the people who came to the AAAI symposium at Stanford last year, and all of the other practising AI builders who are my colleagues), the points I make are not only understood but understood so clearly that they tell me things like "This is just obvious, really, so all you are doing is wasting your time trying to convince a community that is essentially comprised of amateurs." (That is a direct quote from someone at the symposium.) I always want to make myself as clear as I can. I have invested a lot of my time trying to address the concerns of many people who responded to the paper. I am absolutely sure I could do better.
We're all amateurs in the field of AI, it's just that some of us actually know it. Seriously, don't pull the credentials card. I'm not impressed. I know exactly how "hard" it is to pay the AAAI a hundred and fifty dollars a year for membership, and three hundred dollars to attend their conference. Does claiming to have spent four hundred and fifty dollars make you an expert? What about bringing up that it's in "Stanford"? What about insulting everybody you're arguing with? I'm a "practicing AI builder" - what a nonsense term - although my little heuristics engine is actually running in the real world, processing business data and automating hypothesis elevation work for humans (who have the choice of agreeing with its best hypothesis, selecting among its other hypotheses, or entering their own) - that is, it's actually picking strawberries. Moving past tit-for-tat on your hostile introduction paragraph, I don't doubt your desire to be clear. But you have a conclusion you're very obviously trying to reach, and you leave huge gaps on your way to get there. The fact that others who want to reach the same conclusion overlook the gaps doesn't demonstrate anything. And what's your conclusion? That we don't have to worry about poorly-designed AI being dangerous, because... contextual information, or something. Honestly, I'm not even sure anymore. Then you propose a model, which you suggest has been modeled after the single most dangerous brain on the planet - as proof that it's safe! Seriously. As for whether you could do better? No, not in your current state of mind. Your hubris prevents you from doing better. You're convinced you know better than any of the people you're talking with, and they're ignorant amateurs.
When someone repeatedly distorts and misrepresents what is said in a paper, then blames the author of the paper for being unclear ... then hears the author carefully explain the distortions and misrepresentations, and still repeats them without understanding .... Well, there is a limit.
Not to suggest that you are implying it, but rather as a reminder - nobody is deliberately misunderstanding you here. But at any rate, I don't think we're accomplishing anything here except driving your karma score lower, so by your leave, I'm tapping out.
Why not raise his karma score instead?
Because that was the practical result, not the problem itself, which is that the conversation wasn't going anywhere, and he didn't seem interested in it going anywhere.
What does incoherent mean, here? If it just labels the fact that the AI has inconsistent beliefs, then it is true but without much impact... humans can also hold contradictory beliefs and still be intelligent enough to be dangerous. If it means something amounting to "impossible to build", then it would be highly impactful... but there is no good reason to think that that is the case.
You're right to point out that "incoherent" covers a multitude of sins. I really had three main things in mind. 1) If an AI system is proposed which contains logically contradictory beliefs located in the most central, high-impact area of its system, it is reasonable to ask how such an AI can function when it allows both X and not-X to be in its knowledge base. I think I would be owed at least some variety of explanation as to why this would not cause the usual trouble when systems try to do logic in such circumstances. So I am saying: "This design that you propose is incoherent because you have omitted to say how this glaring problem is supposed to be resolved." (Yes, I'm aware that there are workarounds for contradictory beliefs, but those ideas are usually supposed to apply to pretty obscure corners of the AI's belief system, not to the component that is in charge of the whole shebang.) 2) If an AI perceives itself to be wired in such a way that it is compelled to act as if it were infallible, while at the same time knowing that it is both fallible AND perpetrating acts that are directly caused by its failings (for all the aforementioned reasons that we don't need to re-argue), then I would suggest that such an AI would do something about this situation. The AI, after all, is supposed to be "superintelligent", so why would it not take steps to stop this immensely damaging situation from occurring? So in this case I am saying: "This hypothetical superintelligence has an extreme degree of knowledge about its own design, but it is tolerating a massive and damaging contradiction in its construction without doing anything to resolve the problem: it is incoherent to suggest that such a situation could arise without explaining why the AI tolerates the contradiction and fails to act." (Aside: you mention that humans can hold contradictory beliefs and still be intelligent enough to be dangerous. Arguing from the human case would not be valid because in other areas of
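For readers unfamiliar with "the usual trouble when systems try to do logic in such circumstances": in classical logic a knowledge base containing both X and not-X lets a theorem prover derive anything at all (the principle of explosion). A tiny toy demonstration (my own construction for illustration, not from the paper) using resolution:

```python
# Clauses are frozensets of literals; "~p" denotes the negation of "p".

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """Return all resolvents of two clauses."""
    return [
        frozenset((c1 - {lit}) | (c2 - {negate(lit)}))
        for lit in c1
        if negate(lit) in c2
    ]

# A knowledge base holding both X and not-X.
kb = [frozenset({"X"}), frozenset({"~X"})]

# Refutation proving works by adding the negation of a goal and searching
# for the empty clause. Here, resolving X against ~X yields the empty
# clause directly -- before any goal clause is even consulted -- so the
# "proof" succeeds for every goal whatsoever.
derived = resolve(kb[0], kb[1])
assert frozenset() in derived
print("empty clause derived: any goal at all is now 'provable'")
```

This is only the classical-logic case; as the comment notes, there are workarounds (paraconsistent logics, belief revision), but they are normally applied to obscure corners of a belief system, not to its central goal-governing component.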
I honestly don't know what more to write to make you understand that you misunderstand what Yudkowsky really means. You may be suffering from a bad case of the Doctrine of Logical Infallibility, yourself.
What you need to do is address the topic carefully, and eliminate the ad hominem comments like this: ... which talk about me, the person discussing things with you. I will now examine the last substantial comment you wrote, above. This is your opening topic statement. Fair enough. You are agreeing with what I say on this point, so we are in agreement so far. You make three statements here, but I will start with the second one: This is a contradiction of the previous paragraph, where you said "Yudkowsky believes that a superintelligent AI [...] will put all humans on dopamine drip despite protests that this is not what they want". Your other two statements are that Yudkowsky is NOT saying that the AI will do this "because it is absolutely certain of its conclusions past some threshold", and he is NOT saying that the AI will "fail to update its beliefs accordingly". In the paper I have made a precise statement of what the "Doctrine of Logical Infallibility" means, and I have given references to show that the DLI is a summary of what Yudkowsky et al have been claiming. I have then given you a more detailed explanation of what the DLI is, so you can have it clarified as much as possible. If you look at every single one of the definitions I have given for the DLI you will see that they are all precisely true of what Yudkowsky says. I will now itemize the DLI into five components so we can find which component is inconsistent with what Yudkowsky has publicly said. 1) The AI decides to do action X (forcing humans to go on a dopamine drip). Everyone agrees that Yudkowsky says this. 2) The AI knows quite well that there is massive, converging evidence that action X is inconsistent with the goal statement Y that was supposed to justify X (where goal statement Y was something like "maximize human happiness"). This is a point that you and others repeatedly misunderstand or misconstrue, so before you respond to it, let me give details of the "converging evidence" tha
All assuming that the AI won't update its goals even if it realizes there is some mistake. That isn't obvious, and in fact is hard to defend. An AI that is powerful and effective would need to seek the truth about a lot of things, since an entity that has contradictory beliefs will be a poor instrumental rationalist. But would its goal of truth-seeking necessarily be overridden by other goals... would it know but not care? It might be possible to build an AI that didn't care about interpreting its goals correctly. It looks like you would need to engineer a distinction between instrumental beliefs and terminal beliefs. Remember that the terminal/instrumental distinction is conceptual, not a law of nature. (While we're on the subject, you might need a firewall to stop an AI acting on intrinsically motivating ideas, if they exist.) In any case, orthogonality is an architecture choice, not an ineluctable fact about minds. MIRI's critics (Loosemore, Hibbard and so on) are tacitly assuming architectures without such unupdateability and firewalling. MIRI needs to show that such an architecture is likely to occur, either as a design or a natural evolution. If AIs with unupdateable goals are dangerous, as MIRI says, it would be simplest not to use that architecture... if it can be avoided. ("We also agree with Yudkowsky (2008a), who points out that research on the philosophical and technical requirements of safe AGI might show that broad classes of possible AGI architectures are fundamentally unsafe, suggesting that such architectures should be avoided.") In other words, it would be careless to build a genie that doesn't care. If the AI community isn't going to deliberately build the goal-rigid kind of AI, then MIRI's arguments come down to how it might be a natural or convergent feature... and the wider AI community finds the goal-rigid idea so unintuitive that it fails to understand MIRI, who in turn fail to make it explicit enough. When Loosemore talks about the doctrine of
The only sense in which the "rigidity" of goals can be said to be a universal fact about minds is that it is these goals that determine how the AI will modify itself once it has become smart and capable enough to do so. It's not a good idea to modify your goals if you want them to become reality; that seems obviously true to me, except perhaps for a small number of edge cases related to internally incoherent goals. Your points against the inevitability of goal rigidity don't seem relevant to this.
If you take the binary view that you're either smart enough to achieve your goals or not, then you might well want to stop improving when you have the minimum intelligence necessary to meet them... which means, among other things, that AIs with goals requiring human or lower intelligence won't become superhuman... which lowers the probability of the Clippie scenario. It doesn't require huge intelligence to make paperclips, so an AI with a goal to make paperclips, but not to make any specific amount, wouldn't grow into a threatening monster. The probability of the Clippie scenario is also lowered by the consideration that fine-grained goals might shift during the self-improvement phase, so the Clippie scenario... arbitrary goals combined with a superintelligence... is whittled away from both ends.

Well, if you or I were to suggest that the best way to achieve universal human happiness was to forcibly rewire the brain of everyone on the planet so they became happy when sitting in bottles of dopamine, most other human beings would probably take that as a sign of insanity.

Richard, I know (real!) people who think that wireheading is the correct approach to life, who would do it to themselves if it were feasible, and who would vote for political candidates if they pledged to legalize or fund research into wireheading. (I realize this is different from forcible wireheading, but unless I've misjudged your position seriously I don't think you see the lack of consent as the only serious issue with that proposal.)

I disagree with those people; I don't want to wirehead myself. But I notice that I am uncertain about many issues:

  1. Should they be allowed to wirehead? Relatedly, is it cruel of me to desire that they not wirehead themselves? Both of these issues are closely related to the issue of suicide--I do, at present, think it should be legal for others to kill themselves, and that it would be cruel of me to desire that they not kill themselves, rather than desiring that they not w

You make a valid point - and one worth discussing at length, sometime - but the most important thing right now is that you have misunderstood my position on the question.

First of all, there is a very big distinction between a few people (or even the whole population!) making a deliberate choice to wirehead, and the nanny AI deciding to force everyone to wirehead because that is its interpretation of "making humans happy" (and doing so in a context in which those humans do not want to do it). You'll notice that in the above quote from my essay, I said that most people would consider it a sign of insanity if a human being were to suggest forcing ALL humans to wirehead, and doing so on the grounds that this was the best way to achieve universal human happiness. If that same human were to suggest that we should ALLOW some humans to wirehead if they believed it would make them happy, then I would not for a moment label that person insane, and quite a lot of people would react the same way.

So I want to be very clear: I very much do acknowledge that the questions regarding various forms of voluntary wireheading are serious and unanswered. I'm in complete agreement with you on that score. But in my paper I was talking only about the apparent contradiction between (a) forcing people to do something as they screamed their protests, while claiming that those people had asked for this to be done, and (b) an assessment that this behavior was both intelligent and sane.

My claim in the above quote was that there is a prima facie case to be made that the proper conclusion is that the behavior would indeed not be intelligent and sane. (Bear in mind that the quote was just a statement of a prima facie case. The point was not to declare that the AI really is insane and/or not intelligent, but to say that there are grounds for questioning. Then the paper goes on to look into the whole problem in more detail. And, most important of all, I am trying to suggest that responding to this m

But now, do I do that? I try really hard not to take anything for granted, or to simply appeal to the obviousness of an idea. So you will have to give me some case-by-case examples if you think I really have done that.

So, on rereading the paper I was able to pinpoint the first bit of text that made me think this (the quoted text and the bit before), but am having difficulties finding a second independent example, and so I apologize for the unfairness in generalizing based on one example.

The other examples I found looked like they all relied on the same argument. Consider the following section:

The objection I described in the last section has nothing to do with anthropomorphism, it is only about holding AGI systems to accepted standards of logical consistency, and the Maverick Nanny and her cousins contain a flagrant inconsistency at their core.

If I think the "logical consistency" argument does not go through, I shouldn't claim this is an independent argument that doesn't go through, because this argument holds given the premises (at least one of which I reject, but it's the same premise). I clearly had this line in mind also:

for example, when it follows its

I just wanted to say that I will try to reply soon. Unfortunately :-) some of the comments have been intensely thoughtful, causing me to write enormous replies of my own and saturating my bandwidth. So, apologies for any delay....
Some thoughts in response to the above two comments.

First, don't forget that I was trying to debunk a very particular idea, rather than other cases. My target was the idea that a future superintelligent AGI could be programmed to have the very best of intentions, and it might claim to be exercising the most extreme diligence in pursuit of human happiness, while at the same time it might think up a scheme that causes most of humanity to scream with horror while it forces the scheme on those humans. That general idea has been promoted countless times (and has been used to persuade people like Elon Musk and Stephen Hawking to declare that AI could cause the doom of the human race), and it has also been cited as an almost inevitable end point of the process of AGI development, rather than just a very-low-risk possibility with massive consequences.

So, with that in mind, I can say that there are many points of agreement between us on the subject of all those cases that you brought up, above, where there are ethical dilemmas of a lesser sort. There is a lot of scope for us having a detailed discussion about all of those dilemmas -- and I would love to get into the meat of that discussion, at some point -- but that wasn't really what I was trying to tackle in the paper itself. (One thing I could say about all those cases is that if the AGI were to "only" have the same dilemmas that we have, when trying to figure out the various ethical conundrums of that sort, then we are no worse off than we are now. Some people use the inability of the AGI to come up with optimal solutions in (e.g.) Trolley problems as a way to conclude that said AGIs would be unethical and dangerous. I strongly disagree with those who take that stance.)

Here is a more important comment on that, though. Everything really comes down to whether the AGI is going to be subject to bizarre/unexpected failures. In other words, it goes along perfectly well for a long time, apparently staying consiste

it has also been cited as an almost inevitable end point of the process of AGI development, rather than just a very-low-risk possibility with massive consequences.

I suspect this may be because of different traditions. I have a lot of experience in numerical optimization, and one of my favorite optimization stories is Dantzig's attempt to design an optimal weight-loss diet, recounted here. The gap between a mathematical formulation of a problem, and the actual problem in reality, is one that I come across regularly, and I've spent many hours building bridges over those gaps.

As a result, I find it easy to imagine that I've expressed a complicated problem in a way that I hope is complete, but the optimization procedure returns a solution that is insane for reality but perfect for the problem as I expressed it. As the role of computers moves from coming up with plans that humans have time to verify (like delivering a recipe to Anne, who can laugh off the request for 500 gallons of vinegar) to executing actions that humans do not have time to verify (like various emergency features of cars, especially the self-driving variety, or high-frequency trading), this possibility becomes ...
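The gap described above is easy to demonstrate with a toy sketch (the food names and numbers below are made up, not Dantzig's actual data): a cost minimizer that satisfies the calorie constraint exactly as stated, and therefore returns a "diet" no human would accept.

```python
foods = {
    # name: (cost per unit, calories per unit) -- illustrative numbers only
    "vinegar": (0.10, 60),
    "bread":   (0.50, 250),
    "cheese":  (2.00, 400),
}

def cheapest_diet(foods, min_calories):
    """Pick the food with the lowest cost per calorie, then take as many
    units as needed: optimal for the model as stated, absurd in reality."""
    name, (cost, cal) = min(foods.items(), key=lambda kv: kv[1][0] / kv[1][1])
    units = -(-min_calories // cal)  # ceiling division
    return name, units, round(units * cost, 2)

print(cheapest_diet(foods, 2000))  # ('vinegar', 34, 3.4)
```

Nothing in the formulation is wrong; the problem is that "meet the calorie target cheaply" was never the real goal, only a lossy proxy for it.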

Thanks for the clarification! I'm glad to hear that we're focusing on this narrow issue, so let me try to present my thoughts on it more clearly. Unfortunately, this involves bringing up many individual examples of issues, none of which I particularly care about; I'm trying to point at the central issue that we may need to instruct an AI how to solve these sorts of problems in general, or we may run into issues where an AI extrapolates its models incorrectly.

When people talk about interpersonal ethics, they typically think in terms of relationships. Two people who meet in the street have certain rules for interaction, and teachers and students have other rules for interaction, and doctors and patients other rules, and so on. When considering superintelligences interacting with intelligences, the sort of rules we will need seem categorically different, and the closest analogs we have now are system designers and engineers interacting with users.

When we consider people interacting with people, we can rely on 'informed consent' as our gold standard, because it's flexible and prevents most bad things while allowing most good things. But consent has its limits; society extends children only limited powers of consent, reserving many (but not all) of them for their parents; some people are determined mentally incapable, and so on.

We have complicated relationships where one person acts in trust for another person (I might be unable to understand a legal document, but still sign it on the advice of my lawyer, who presumably can understand that document, or be unable to understand the implications of undergoing a particular medical treatment, but still do it on the advice of my doctor), because the point of those relationships is that one person can trade their specialized knowledge to another person, but the second person is benefited by a guarantee that the first person is actually acting in their interest. We can imagine a doctor wireheading their patient when the patien
No. Dependently-typed theorem proving is the only thing safe enough ;-). That, or the kind of probabilistic defense-in-depth that comes from specifying uncertainty about the goal system and other aspects of the agent's functioning, thus ensuring that updating on data will make the agent converge to the right thing.
I agree with everything in this comment up to:

This doesn't appear to be correct, given that you can always transform functional programs into imperative programs and vice versa. I've never heard that you can program in functional languages without doing testing, relying only on type checking to ensure correct behavior.

In fact, AFAIK, Haskell, the most popular pure functional programming language, is bad enough that you actually have to test all non-trivial programs for memory leaks, since it is not possible, except in special cases, to reason about the memory allocation behavior of a program from its source code and the language specification: the allocation behavior depends on implementation-specific and largely undocumented details of the compiler and the runtime.

Anyway, this memory allocation issue may be specific to Haskell, but in general, as I understand it, there is nothing in the functional paradigm that guarantees a higher level of correctness than the imperative paradigm.
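The laziness problem alluded to above has a rough analogue even in Python (this is not Haskell's actual thunk mechanics, just an illustration): deferred computation can pile up invisibly at construction time and only fail when a value is finally forced.

```python
def lazy_inc(g):
    # Wraps the stream in another deferred (+1) layer; no work happens yet.
    return (x + 1 for x in g)

g = iter([0])
for _ in range(20000):
    g = lazy_inc(g)  # 20,000 layers of pending computation, built instantly

try:
    next(g)          # forcing the value walks the whole deferred chain at once
    outcome = "ok"
except RecursionError:
    outcome = "blew up"
print(outcome)  # "blew up"
```

Building the chain is cheap; the cost (here, a blown recursion limit rather than heap exhaustion) only appears at the point of demand, which is exactly why such behavior is hard to predict from the source code alone.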
"Certain classes of errors" is meant to be read as a very narrow claim, and I'm not sure that it's relevant to AI design / moral issues. Many sorts of philosophical errors seem to be type errors, but it's not obvious that typechecking is the only solution to that. I was primarily drawing on this bit from Programming in Scala, and in rereading it I realize that they're actually talking about static type systems, which is an entirely separate thing. Editing.
Ok, sorry for being nitpicky.
In case it wasn't clear, thanks for nitpicking, because I was confused and am not confused about that anymore.
The relevant difference is in isolation and formulation of side effects, which encourages formulation of more pieces of code whose behavior can be understood precisely in most situations. The toolset of functional programming is usually better for writing higher order code that keeps the sources of side effects abstract, so that they are put back separately, without affecting the rest of the code. As a result, a lot of code can have well-defined behavior that's not disrupted by context in which it's used. This works even without types, but with types the discipline can be more systematically followed, sometimes enforced. It also becomes possible to offload some of the formulation-checking work to a compiler (even when the behavior of a piece of code is well-defined and possible to understand precisely, there is the additional step of making sure it's used appropriately). See Why Haskell just works. It's obviously not magic, the point is that enough errors can be ruled out by exploiting types and relying on sparing use of side effects to make a difference in practice. This doesn't ensure correct behavior (for example, Haskell programs can always enter an infinite loop, while promising to eventually produce a value of any type, and Standard ML programs can use side effects that won't be reflected in types). It's just a step in the right direction, when correctness is a priority. There is also a prospect that more steps in this direction might eventually get you closer to correctness.
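The side-effect-isolation discipline described above works even in a language with no enforcement at all; a minimal Python sketch (function names are illustrative):

```python
# Pure core: same input -> same output, testable without any environment.
def parse_scores(lines):
    return [int(s) for s in lines if s.strip()]

def summarize(scores):
    return {"count": len(scores), "mean": sum(scores) / len(scores)}

# Impure shell: all I/O confined to one thin layer at the edge.
def main(path):
    with open(path) as f:
        print(summarize(parse_scores(f.readlines())))

# The pure parts can be checked precisely, with no file system in sight:
assert summarize(parse_scores(["3", "", "5"])) == {"count": 2, "mean": 4.0}
```

The point is that the behavior of `parse_scores` and `summarize` cannot be disrupted by the context in which they're used; only `main` depends on the outside world.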
tangent: I used to love functional programming and the elegance of e.g. Haskell, until I realized functional programming has the philosophy exactly backwards. You want to make it easy for humans and hard for machines, not vice versa. Humans think causally, e.g. imperatively and statefully. When humans debug functional/lazy programs, they generally smuggle in stateful/causal thinking to make progress. This is a sign something is going wrong with the philosophy.
Hm. The primary reason I got interested in fp is that I really like SQL, I think it is very easy for the human mind. And LINQ is built on top of functional programming, the Gigamonkeys book builds a similar query language on top of functional programming and macros, so it seems perhaps fp should be used that way, taking it as far as possible towards making query languages in it. But I guess it always depends on what you want to do. My philosophy of programming is automation based. That means, if I need to do something once, I do it by hand; if a thousand times, I write code. This, the ability to repeat operations many times, is what makes automating human work possible, and from this I derived that the most important imperative structure is the loop. The loop is what turns something that was a mere set of rules into a powerful data processing machinery, doing an operation many more times than I care to do it. With SQL, LINQ and other queries, we are essentially optimizing the loop as such. For example, the generator expression in Python is a neat little functional loop-replacement, a mini-query language.
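A small sketch of that last point: a Python generator expression really does read like a miniature query, declaring the loop once and composing like SQL clauses (the data here is invented for illustration).

```python
people = [
    {"name": "Anne", "age": 34, "city": "Oslo"},
    {"name": "Bob",  "age": 19, "city": "Lund"},
    {"name": "Cara", "age": 47, "city": "Oslo"},
]

# Roughly: SELECT name FROM people WHERE city = 'Oslo' ORDER BY age DESC
result = sorted(
    (p for p in people if p["city"] == "Oslo"),  # lazy WHERE clause
    key=lambda p: p["age"],
    reverse=True,
)
names = [p["name"] for p in result]
print(names)  # ['Cara', 'Anne']
```

The generator expression never materializes the filtered list on its own; `sorted` consumes it, so the "loop" stays declarative from end to end.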
Yes, that's how it was intended to be and how they spin it, but in practice the abstraction is leaky, and it leaks in bad, difficult-to-predict ways; therefore, as I said, you end up with things like having to test for memory leaks, something that is usually not an issue in "imperative" languages like Java, C# or Python. I like the functional paradigm inside a good multi-paradigm language: passing around closures as first-class objects is much cleaner and more concise than fiddling with subclasses and virtual methods, but forcing immutability and lazy evaluation as the main principles of the language doesn't seem to be a good design choice. It forces you to jump through hoops to implement common functionality like interaction, logging or configuration, and in return it doesn't deliver the higher modularity and intelligibility that were promised. Anyway, we are going OT.
Agreed. Abstractions are still leaky, and where some pathologies in abstraction (i.e. human-understandable precise formulation) can be made much less of an issue by using the functional tools and types, others tend to surface that are only rarely a problem for more concrete code. In practice, the tradeoff is not one-sided, so its structure is useful for making decisions in particular cases.
Shouldn't we just offer them a superior alternative?

You can't imagine anything superior to wireheading? Sad. (Edit: What!? Come on, downvoters: the entire Fun Theory Sequence was written on the idea that there are strictly better things to do with life than nail your happy-dial at maximum. Disagree if you like, but it's not exactly an unsupported opinion.)

Wait. How are those two the same thing? You can criticize games for being escapist, but then you have to ask: escape from what, to what? What sort of "real life" (ie: all of real life aside from video games, since games are a strict subset of real life) are you intending to state is strictly superior in all cases to playing video games?

You can't imagine anything superior to wireheading? Sad.

What I cannot imagine at present is an argument against wireheading that reliably convinces proponents of wireheading. As it turns out, stating their position and then tacking "Sad" to the end of it does not seem to reliably do so.

How are those two the same thing?

Obviously they are not the same thing. From the value perspective, one of them looks like an extreme extension of the other; games are artificially easy relative to the rest of life, with comparatively hollow rewards, and can be 'addictive' because they represent a significantly tighter feedback loop than the rest of life. Wireheading is even easier, even hollower, and even tighter. So if I recoil from the hollowness of wireheading, can I identify a threshold where that hollowness becomes bad, or should it be a linear penalty, that I cannot ignore as too small to care about when it comes to video gaming? (Clearly, penalizing gaming does not mean I cannot game at all, but it likely means that I game less on the margin.)

What you need here is to unpack your definition of "hollow". Let's go a little further along the spectrum from culturally mainstream definitions of "hollow" to culturally mainstream definitions of "meaningful".

My hobby is learning Haskell. In fact, just a couple of minutes ago I solved a challenge on HackerRank -- writing a convex-hull algorithm in Haskell. This challenged me, and was fun for a fair bit. However, Haskell isn't my job, and I don't particularly want a job writing Haskell, nor do I particularly care - upon doing the degree of conscious reflection involved in asking, "Should I spend effort going up a rank on HackerRank, or taking a walk outside in the healthy fresh air?" - about the gamified rewards on HackerRank.

From the "objective" point of view, in which my actions are "meaningful" and "non-hollow" when they serve the supergoals of some agent containing me, or some optimization process larger than me (ie: when they serve God, the state, my workplace, academia, humanity, whatever), learning Haskell is almost, but not quite, entirely pointless. And yet I bet you would still consider it more meaningful and less pointless than a video game, or eating dessert, or anything else done purely for fun.

So again: let's unpack. I am entirely content to pursue reflectively-coherent fun that is tied up with the rest of reality around me. I can trade off and do Haskell instead of gaming because Haskell is more tied up with the rest of reality around me than gaming. I could also trade off the other way around, as I might if I, for instance, joined a weekly D&D play group. But what I am personally choosing to pursue is reflectively-coherent fun that's tied up with the rest of reality, not Usefulness to the Greater Glory of Whatever. Problem is, Usefulness to the Greater Glory of Whatever is, on full information and reflection, itself entirely empty. There is no Greater Whatever. There's no God, and neither my workplace, nor the state, nor academia, nor "humani
The problem is that since we are not perfectly rational agents, we have difficulties estimating the consequences of our actions, and our conscious preferences are probably not consistent with a von Neumann-Morgenstern utility function. I don't want to be wireheaded, but I can't be sure that if I was epistemically smarter, or if my conscious preferences were somehow made more consistent, I would still stand by this decision. My intuition is that I would, but my intuition can be wrong, of course.

Video games are designed to stimulate your brain to perform tasks such as spatial/logical problem solving, precise and fast eye-hand coordination, hunting and warfare against other agents, etc. The brain modules that perform these tasks evolved because they increased your chances of survival and reproduction, and the brain reward system also evolved in a way that makes it pleasurable to practice these tasks, since even if the practice doesn't directly increase your evolutionary fitness, it does so indirectly by training these brain modules. In fact, all mammals play, especially as juveniles, and many also play as adults.

Video games, however, are superstimuli: if you play Call of Duty your eye-hand coordination becomes better, but unless you are a professional hunter or soldier, or something like that, it doesn't increase your evolutionary fitness, and even if you are, there would be diminishing returns past a certain point, as the game can stimulate your brain modules much more than any "real world" scenario would. Nevertheless, it is pleasurable.

Many people, including myself, would argue that we should not try to blindly maximize our evolutionary fitness. Yet blindly following hedonistic preferences by indulging in superstimuli also seems questionable. Maybe there is an ideal middle ground, or maybe there is no consistent position. The point is, as Vaniver said, that these are difficult and important questions.
It's not a one-dimensional spectrum with evolutionary fitness on the one end and blind hedonism on the other end in the first place. Your evaluative psychology just doesn't work that way. As to why you think there exists any such spectrum or trade-off, well, I blame bad philosophy classes and religious preachers: it's to the clear advantage of moralizing-preacher-types to claim that normal evaluative judgement has no normative substance, and that everyone needs to Work For The Holy Supergoal instead, lest they turn into a drug addict in a ditch (paging high-school health class, as well...).

It is rude to say you're 'debunking' when the issue is actually under debate - and doubly so to call it that on the site run by the people you're 'debunking'.

You're providing arguments.

I think you meant "provoking" arguments ... but that was not my intention. Can you suggest a different word to replace "debunking" which conveys "exposing the falseness of an idea, or belief, which is currently being promoted by a large number of people, with serious consequences"? I am open to changing the title, if that can be done, and if you can think of a word that conveys the intended meaning.
No, I really meant that you are providing arguments, and that had better have been your intention! Your expanded wording doesn't really convey an openness to being argued-against. Like, 'Here are some flaws I see in the argument presented' is one thing. 'Your idea is totally wrong, I will expose its falseness' is another. Also, I'm not seeing the serious consequences of being overly cautious in respect to AI development.
I think "exposing" fallacies would be a step in the right direction here.

I see a fair amount of back-and-forth where someone says "What about this?" and you say "I addressed that in several places; clearly you didn't read it." Unfortunately, while you may think you have addressed the various issues, I don't think you did (and presumably your interlocutors don't). Perhaps you will humor me in responding to my comment. Let me try to make the issue as sharp as possible by pointing out what I think is an out-and-out mistake made by you. In the section you call the heart of your argument, you say:

If the AI is superintelligent (and therefore unstoppable), it will be smart enough to know all about its own limitations when it comes to the business of reasoning about the world and making plans of action. But if it is also programmed to utterly ignore that fallibility—for example, when it follows its compulsion to put everyone on a dopamine drip, even though this plan is clearly a result of a programming error—then we must ask the question: how can the machine be both superintelligent and able to ignore a gigantic inconsistency in its reasoning?

Yes, the outcome is clearly the result of a "programming error" (in some sense). ...

I'll walk you through it. I did not claim (as you imply) that the fact of there being a programming error was what implied that there is "an inconsistency in its reasoning." In the two paragraphs immediately before the one you quote (and, indeed, in that whole section), I explain that the system KNOWS that it is following these two imperatives:

1) Conclusions produced by my reasoning engine are always correct. [This is the Doctrine of Logical Infallibility]

2) I know that AGI reasoning engines in general, and mine in particular, sometimes come to incorrect conclusions that are the result of a failure in their design.

Or, paraphrasing this in the simplest possible way:

1) My reasoning engine is infallible.

2) My reasoning engine is fallible.

That, right there, is a flat-out contradiction between two of its core "beliefs". It is not, as you state, that the existence of a programming error is evidence of inconsistency; it is the above pair of beliefs (engendered by the programming error) that constitute the inconsistency. Does that help?
Human beings do pretty much the same thing all the time (minus the word "always") and are able to function.
Thanks for replying. Yes it does help. My apologies. I think I misunderstood your argument initially. I confess I still don't see how it works though.

You criticize the doctrine of logical infallibility, claiming that a truly intelligent AI would not believe such a thing. Maybe so. I'll set the question aside for now. My concern is that I don't think this doctrine is an essential part of the arguments or scenarios that Yudkowsky et al present. An intelligent AI might come to a conclusion about what it ought to do, and then recognize "yes, I might be wrong about this" (whatever is meant by "wrong"---this is not at all clear). The AI might always recognize this possibility about every one of its conclusions. Still, so what? Does this mean it won't act?

Can you tell me how you feel about the following two options? Or, if you prefer a third option, could you explain it? You could:

1) explicitly program the AI to ask the programmers about every single one of its atomic actions before executing them. I think this is unrealistic. ("Should I move this articulator arm .5 degrees clockwise?")

2) expect the AI to conclude, through its own intelligence, that the programmers would want it to check in about some particular plan, P, before executing it. Presumably, the reason the AI would have for this checking-in would be that it sees that, as a result of its fallibility, there is a high chance that this course of action, P, might actually be unsatisfying to the programmers. But the point is that this checking-in is triggered by a specific concern the AI has about the risk to programmer satisfaction. This checking-in would not be triggered by plan Q, which the AI didn't have a reasonable concern was a risk to programmer satisfaction.

Do you agree with either of these options? Can you suggest alternatives?
Let me first address the way you phrased it before you gave me the two options. After saying you add:

The answer to this is that in all the scenarios I address in the paper - the scenarios invented by Yudkowsky and the rest - the AI is supposed to take an action in spite of the fact that it is getting "massive feedback" from all the humans on the planet that they do not want this action to be executed. That is an important point: nobody is suggesting that these are really subtle fringe cases where the AI thinks that it might be wrong, but it is not sure -- rather, the AI is supposed to go ahead and be unable to stop itself from carrying out the action in spite of clear protests from the humans. That is the meaning of "wrong" here.

And it is really easy to produce a good definition of "something going wrong" with the AI's action plans, in cases like these: if there is an enormous inconsistency between descriptions of a world filled with happy humans (and here we can weigh into the scale a thousand books describing happiness in all its forms) and the fact that virtually every human on the planet reacts to the postulated situation by screaming his/her protests, then a million red flags should go up. I think that when posed in this way, the question answers itself, no?

In other words, option 2 is close enough to what I meant, except that it is not exactly as a result of its fallibility that it hesitates (knowledge of fallibility is there as a background all the time), but rather due to the immediate fact that its proposed plan causes concern to people.
I think the worry is at least threefold:

1. It might make unrecoverable mistakes, possibly by creating a subagent to complete some task that it cannot easily recall once it gets the negative feedback (think Mickey Mouse enchanting the broomstick in Fantasia, or, more realistically, an AI designing a self-replicating computer virus or nanobot swarm to accomplish some task, or the AI designing the future version of itself, that no longer cares about feedback).

2. It might have principled reasons to ignore that negative feedback. Think browser extensions that prevent you from visiting time-wasting sites, which might also prevent you from disabling them. "I'm doing this for your own good, like you asked me to!"

3. It might deliberately avoid receiving negative feedback. It may be difficult to correctly formulate the difference between "I want to believe correct ideas" and "I want to believe that my ideas are correct."

I doubt that this list is exhaustive, and unfortunately it seems like they're mutually reinforcing: if it has some principled reasons to devalue negative feedback, that will compound any weakness in its epistemic update procedure.

I am uncertain how much of this is an actual difference in belief between you and Yudkowsky, and how much of this is a communication difference. I think Yudkowsky is focusing on simple proposals with horrible effects, in order to point out that simplicity is insufficient, and jumps to knocking down individual proposals to try to establish the general trend that simplicity is dangerous. The more complex the safety mechanisms, the more subtle the eventual breakdown--with the hope that eventually we can get the breakdown subtle enough that it doesn't occur! (Most people aren't very good deductive thinkers, but alright inductive thinkers--if you tell them "simple ideas are unsafe," they are likely to think "well, except for my brilliant simple idea" instead of "hmm, that implies there's something dangerous about my simple id
Well, yes ... but I think the scenarios you describe are becoming about different worries, not covered in the original brief.

That one should come under the heading of "How come it started to do something drastic before it even checked with anyone?" In other words, if it unleashes the broomstick before checking to see if the consequences of doing so would be dire, then I am not sure but that we are now discussing simple, dumb mistakes on the part of the AI -- because if it was not just a simple mistake, then the AI must have decided to circumvent the checking code, which I think everyone agrees is a baseline module that must be present.

Well, you cite an example of a non-AI system (a browser) doing this, so we are back to the idea that the AI could (for some reason) decide that there was a HIGHER directive, somewhere, that enabled it to justify ignoring the feedback. That goes back to the same point I just made: checking for consistency with humans' professed opinions on the idea would be a sine qua non of any action.

Can I make a general point here? In analyzing the behavior of the AI I think it is very important to do a sanity check on every proposed scenario to make sure that it doesn't fail the "Did I implicitly insert an extra supergoal?" test. In the paper I mentioned this at least once, I think -- it came up in the context where I was asking about efficiency, because many people make statements about the AI that, if examined carefully, entail the existence of a previously unmentioned supergoal ON TOP of the supergoal that was already supposed to be on top.
Ah! That's an interesting statement because of the last two paragraphs in the grandparent comment. I think that the root worry is the communication problem of transferring our values (or, at least, our meta-values) to the AI, and then having the AI convince us that it has correctly understood our values. I also think that worry is difficult to convey without specific, vivid examples.

For example, I see the Maverick Nanny as a rhetorical device targeting all simple value functions. It is not enough to ask that humans be "happy"--you must instead ask that humans be happy and give it a superintelligence-compatible definition of consent, or ask that humans be "." I do agree with you that if you view the Maverick Nanny as a specific design proposal, then a relatively simple rule suffices to prevent that specific failure. Then there will be a new least desirable allowed design, and if we only use a simple rule, that worst allowed design might still be horrible!

First, I suspect some people don't yet see the point of checking code, and I'm not sure what you mean by "baseline." Definitely it will be core to the design, but 'baseline' makes me think more of 'default' than 'central,' and the 'default' checking code is "does it compile?", not "does it faithfully preserve the values of its creator?"

What I had in mind was the difference between value uncertainty ('will I think this was a good purchase or not?') and consequence uncertainty ('if I click this button, will it be delivered by Friday?'), and the Fantasia example was unclear because it was meant only to highlight the unrecoverability of the mistake, when it also had a component of consequence uncertainty (Mickey was presumably unaware that one broomstick would turn to two).

Would we want a police robot to stop arresting criminals because they asked it to not arrest them? A doctor robot to not vaccinate a child because they dislike needles or pain? If so, then "humans' professed opinions" aren't quite our sine qua no
My question was about what criteria would cause the AI to make a proposal to the human supervisors before executing its plan. In this case, I don’t think the criteria can be that humans are objecting, since they haven’t heard its plan yet. (Regarding the point that you're only addressing the scenarios proposed by Yudkowsky et al., see my remark here.)
That is easy:

* Why would the humans have "not heard the plan yet"? It is a no-brainer part of this AI's design that part of the motivation engine (the goals) will be a goal that says "Check with the humans first." The premise in the paper is that we are discussing an AI that was designed as best we could, BUT it then went maverick anyway: it makes no sense for us to switch, now, to talk about an AI that was actually built without that most elementary of safety precautions!
* Quite independently, the AI can use its contextual understanding of the situation. Any intelligent system with such a poor understanding of the context and implications of its plans that it just goes ahead with the first plan off the stack, without thinking about implications, is an intelligent system that will walk out in front of a bus just because it wants to get to the other side of the road. In the case in question you are imagining an AI that would be capable of executing a plan to put all humans into bottles, without thinking for one moment to mention to anybody that it was considering this plan? That makes no sense in any version of the real world. Such an AI is an implausible hypothetical.
With respect, your first point doesn't answer my question. My question was, what criteria would cause the AI to submit a given proposed action or plan for human approval? You might say that the AI submits every proposed atomic action for approval (in this case, the criterion is the trivial one, "always submit proposal"), but this seems unlikely. Regardless, it doesn't make sense to say the humans have already heard of the plan about which the AI is just now deciding whether to tell them. In your second point you seem to be suggesting an answer to my question. (Correct me if I'm wrong.) You seem to be suggesting "context." I'm not sure what is meant by this. Is it reasonable to suppose that the AI would make the decision about whether to "shoot first" or "ask first" based on things like, e.g., the lower end of its 99% confidence interval for how satisfied its supervisors will be?
As you wrote, the second point filled in the missing part from the first: it uses its background contextual knowledge. You say you are unsure what this means. That leaves me a little baffled, but here goes anyway. Suppose I asked a person, today, to write a book for me on the subject of "What counts as an action that is significant enough that, if you did that action in a way that would affect people, it would rise above some level of 'nontrivialness' and you should consult them first? Include in your answer a long discussion of the kind of thought processes you went through to come up with your answers." I know many articulate people who could, if they had the time, write a massive book on that subject. Now, that book would contain a huge number of constraints (little factoids about the situation) about "significant actions", and the SOURCE of that long list of constraints would be .... the background knowledge of the person who wrote the book. They would call upon a massive body of knowledge about many aspects of life, to organize their thoughts and come up with the book. If we could look into the head of the person who wrote the book we would find that background knowledge. It would be similar in size to the number of constraints mentioned in the book, or it would be larger. That background knowledge -- both its content AND its structure -- is what I refer to when I talk about the AI using contextual information or background knowledge to assess the degree of significance of an action. You go on to ask a bizarre question: this would be an example of an intelligent system sitting there with that massive array of contextual/background knowledge that could be deployed ...... but instead of using that knowledge to make a preliminary assessment of whether "shooting first" would be a good idea, it ignores ALL OF IT and substitutes one single constraint taken from its knowledge base or its goal system. It would entirely defeat the object of using large numbers of constraints.
My bizarre question was just an illustrative example. It seems neither you nor I believe that would be an adequate criterion (though perhaps for different reasons). If I may translate what you're saying into my own terms, you're saying that for a problem like "shoot first or ask first?" the criteria (i.e., constraints) would be highly complex and highly contextual. Ok. I'll grant that's a defensible design choice. Earlier in the thread you said ... This is why I have honed in on scenarios where the AI has not yet received feedback on its plan. In these scenarios, the AI presumably must decide (even if the decision is only implicit) whether to consult humans about its plan first, or to go ahead with its plan first (and halt or change course in response to human feedback). To lay my cards on the table, I want to consider three possible policies the AI could have regarding this choice.

1. Always (or usually) consult first. We can rule this out as impractical, if the AI is making a large number of atomic actions.
2. Always (or usually) shoot first, and see what the response is. Unless the AI only makes friendly plans, I think this policy is catastrophic, since I believe there are many scenarios where an AI could initiate a plan and before we know what hit us we're in an unrecoverably bad situation. Therefore, implementing this policy in a non-catastrophic way is FAI-complete.
3. Have some good criteria for picking between "shoot first" or "ask first" on any given chunk of planning. This is what you seem to be favoring in your answer above. (Correct me if I'm wrong.) These criteria will tend to be complex, and not necessarily formulated internally in an axiomatic way. Regardless, I fear making good choices between "shoot first" or "ask first" is hard, even FAI-complete. Screw up once, and you are in a catastrophe like in case 2.

Can you let me know: have I understood you correctly? More importantly, do you agree with my framing of the dilemma for the AI? Do you agree
I am with you on your rejection of 1 and 2, if only because they are both framed as absolutes which ignore context. And, yes, I do favor 3. However, you insert some extra wording that I don't necessarily buy.... You see, hidden in these words seems to be an understanding of how the AI is working, that might lead you to see a huge problem, and me to see something very different. I don't know if this is really what you are thinking, but bear with me while I run with this for a moment. Trying to formulate criteria for something, in an objective, 'codified' way, can sometimes be incredibly hard even when most people would say they have internal 'judgement' that allowed them to make a ruling very easily: the standard saw being "I cannot define what 'pornography' is, but I know it when I see it." And (stepping quickly away from that example because I don't want to get into that quagmire) there is a much more concrete example in the old interactive activation model of word recognition, which is a simple constraint system. In IAC, word recognition is remarkably robust in the face of noise, whereas attempts to write symbolic programs to deal with all the different kinds of noisy corruption of the image turn out to be horribly complex and faulty. As you can see, I am once again pointing to the fact that Swarm Relaxation systems (understood in the very broad sense that allows all varieties of neural net to be included) can make criterial decisions seem easy, where explicit codification of the decision is a nightmare. So, where does that lead to? Well, you go on to say: The key phrase here is "Screw up once, and...". In a constraint system it is impossible for one screw-up (one faulty constraint) to unbalance the whole system. That is the whole raison d'être of constraint systems. Also, you say that the problem of making good choices might be FAI-complete. Now, I have some substantial quibbles with that whole "FAI-complete" idea, but in this case I will just ask a questi
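The IAC point above can be sketched as a toy constraint-relaxation loop. This is my own illustration, NOT McClelland & Rumelhart's actual interactive activation model: letter evidence supports word units bottom-up, active words reinforce their own letters top-down, and after a few passes the ambiguous letter settles on the reading that the whole word supports.

```python
# Toy constraint-relaxation sketch (illustrative only, not the real IAC model).
WORDS = ["WORK", "WORD", "FORK"]

# Bottom-up letter evidence per position; position 3 is noisy:
# it weakly supports both 'K' and 'D'.
evidence = [{"W": 1.0}, {"O": 1.0}, {"R": 1.0}, {"K": 0.55, "D": 0.45}]
word_act = {w: 0.0 for w in WORDS}

for _ in range(5):
    # bottom-up pass: each word's support is the product of its letters'
    for w in WORDS:
        s = 1.0
        for pos, ch in enumerate(w):
            s *= evidence[pos].get(ch, 0.01)
        word_act[w] = s
    total = sum(word_act.values())
    word_act = {w: a / total for w, a in word_act.items()}
    # top-down pass: active words reinforce their own letters
    for w, a in word_act.items():
        for pos, ch in enumerate(w):
            if ch in evidence[pos]:
                evidence[pos][ch] += 0.5 * a
    # keep letter evidence normalized within each position
    for pos, letters in enumerate(evidence):
        t = sum(letters.values())
        evidence[pos] = {ch: v / t for ch, v in letters.items()}

best = max(word_act, key=word_act.get)
print(best)  # the ambiguous final letter resolves via word-level support
```

No single noisy constraint (the corrupted letter) unbalances the outcome; the other constraints pull the system to a coherent interpretation, which is the "one screw-up cannot unbalance the whole system" property in miniature.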

Excuse me, but you are really failing to clarify the issue. The basic UFAI doomsday scenario is: the AI has vast powers of learning and inference with respect to its world-model, but has its utility function (value system) hardcoded. Since the hardcoded utility function does not specify a naturalization of morality, or CEV, or whatever, the UFAI proceeds to tile the universe in whatever it happens to like (which are things we people don't like), precisely because it has no motivation to "fix" its hardcoded utility function.

A similar problem would occur if, for some bizarre-ass reason, you monkey-patched your AI to use hardcoded machine arithmetic on its integers instead of learning the concept of integers from data via its, you know, intelligence, and the hardcoded machine math had a bug. It would get arithmetic problems wrong! And it would never realize it was getting them wrong, because every time it tried to check its own calculations, your monkey-patch would cut in and use the buggy machine arithmetic again.
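A minimal sketch of that failure mode (mine, not from any real system): because the self-check has to route through the same monkey-patched primitive, the bug is invisible from inside.

```python
def buggy_add(a, b):
    """Hardcoded arithmetic primitive with a deliberate bug."""
    result = a + b
    if a == 2 and b == 2:  # the 'bug': 2 + 2 comes out as 5
        result = 5
    return result

def self_check(a, b, claimed):
    """The system 'verifies' a sum -- but it can only call buggy_add."""
    return buggy_add(a, b) == claimed

claimed = buggy_add(2, 2)         # wrong answer: 5
print(self_check(2, 2, claimed))  # -> True: the check passes anyway
```

Every internal verification reuses the broken code path, so the system can be arbitrarily confident in an answer it could never discover is wrong.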

The lesson is: do not hard-code important functionality into your AGI without proving it correct. In the case of a utility/value function, the obvious research...

The content of your post was pretty good from my limited perspective, but this tone is not warranted.
Perhaps not, but I don't understand why "AI" practitioners insist on being almost as bad as philosophers for butting in and trying to explain to reality that it needs to get into their models and stay there, rather than trying to understand existing phenomena as a prelude to a general theory of cognition.
The paper had nothing to do with what you talked about in your opening paragraph, and your comment: ... was extremely rude. I build AI systems, and I have been working in the field (and reading the literature) since the early 1980s. Even so, I would be happy to answer questions if you could read the paper carefully enough to see that it was not about the topic you thought it was about.
Simon Fischer (9y)
What? Your post starts with: Eli's opening paragraph explains the "basic UFAI doomsday scenario". How is this not what you talked about?
The paper's goal is not to discuss "basic UFAI doomsday scenarios" in the general sense, but to discuss the particular case where the AI goes all pear-shaped EVEN IF it is programmed to be friendly to humans. That last part (even if it is programmed to be friendly to humans) is the critical qualifier that narrows down the discussion to those particular doomsday scenarios in which the AI does claim to be trying to be friendly to humans - it claims to be maximizing human happiness - but in spite of that it does something insanely wicked. So, Eli says that the AI "has its utility function (value system) hardcoded", and this clearly says that the type of AI he has in mind is one that is not even trying to be friendly. Rather, he talks about how its "hardcoded utility function does not specify a naturalization of morality". And then he adds that the AI "has no motivation to 'fix' its hardcoded utility function", which has nothing to do with the cases that the entire paper is about, namely the cases where the AI is trying really hard to be friendly, but doing it in a way that we did not intend. If you read the paper all of this is obvious pretty quickly, but perhaps if you only skim-read a few paragraphs you might get the wrong impression. I suspect that is what happened.
If the AI knows what "friendly" is or what "mean" means, then your conclusion is trivially true. The problem is programming those in - that's what FAI is all about.
Simon Fischer (9y)
I still agree with Eli and think you're "really failing to clarify the issue", and claiming that xyz is not the issue does not resolve anything. Disengaging.
Yes, it was rude. Except that the paper was about more-or-less exactly what I said in that paragraph. But the whole lesson is: do not hard-code things into AGI systems. Luckily, we learn this lesson everywhere: symbolic, first-order logic-based AI failed miserably, failed not only to generate a superintelligent ethicist but failed, in fact, to detect which pictures are cat pictures or perform commonsense inference. Ok, and how many of those possessed anything like human-level cognitive abilities? How many were intended to, but failed? How many were designed on a solid basis in statistical learning?
No, as I just explained to SimonF, below, that is not what it is about. I will repeat what I said: The paper's goal is not to discuss "basic UFAI doomsday scenarios" in the general sense, but to discuss the particular case where the AI goes all pear-shaped EVEN IF it is programmed to be friendly to humans. That last part (even if it is programmed to be friendly to humans) is the critical qualifier that narrows down the discussion to those particular doomsday scenarios in which the AI does claim to be trying to be friendly to humans - it claims to be maximizing human happiness - but in spite of that it does something insanely wicked. So, you said that the AI "has its utility function (value system) hardcoded", and this clearly says that the type of AI you have in mind is one that is not even trying to be friendly. Rather, you talk about how its "hardcoded utility function does not specify a naturalization of morality". And then you add that the AI "has no motivation to 'fix' its hardcoded utility function", which has nothing to do with the cases that the entire paper is about, namely the cases where the AI is trying really hard to be friendly, but doing it in a way that we did not intend.
There is very little distinction, from the point of view of actual behaviors, between a supposedly-Friendly-but-actually-not AI, and a regular UFAI. Well, maybe the former will wait a bit longer before its pathological behavior shows up. Maybe. I really don't want to be the sorry bastard who tries that experiment: it would just be downright embarrassing. But of course, the simplest way to bypass this is precisely to be able to, as previously mentioned in my comment and by nearly all authors on the issue, specify the utility function as the outcome of an inference problem, thus ensuring that additional interaction with humans causes the AI to update its utility function and become Friendlier with time. Causal inference that allows for deliberate conditioning of distributions on complex, counterfactual scenarios should actually help with this. Causal reasoning does dissolve into counterfactual reasoning, after all, so rational action on evaluative criteria can be considered a kind of push-and-pull force acting on an agent's trajectory through the space of possible histories: undesirable counterfactuals push the agent's actions away (ie: push the agent to prevent their becoming real), while desirable counterfactuals pull the agent's actions towards themselves (ie: the agent takes actions to achieve those events as goals) :-p.
What does that mean? That any AI will necessarily have a hardcoded, mathematical UF, or that MIRI's UFAI scenario only applies to certain AI architectures? If the latter, then doing things differently is a reasonable response. Alternatives could involve corrigibility, or expressing goals in natural language. Talking about alternatives isn't irrelevant, in the absence of a proof that MIRI's favoured architecture subsumes everything.
It's entirely possible to build a causal learning and inference engine that does not output any kind of actions at all. But if you have made it output actions, then the cheapest, easiest way to describe which actions to output is hard-coding (ie: writing program code that computes actions from models without performing an additional stage of data-based inference). Since that cheap-and-easy, quick-and-dirty design falls within the behavior of a hardcoded utility function, and since that design is more-or-less what AI practitioners usually talk about, we tend to focus the doomsaying on that design. There are problems with every design except for the right design, when you are talking about an agent you expect to become more powerful than yourself.
How likely is that cheap and easy architecture to be used in an AI of more than human intelligence?
Well, people usually build the cheapest and easiest architecture of anything they can, at first, so very likely. And remember, "higher than human intelligence" is not some superpower that gets deliberately designed into the AI. The AI is designed to be as intelligent as its computational resources allow for: to compress data well, to perform learning and inference quickly in its models, and to integrate different domains of features and models (again: for compression and generalization). It just gets "higher than human" when it starts integrating feature data into a broader, deeper hierarchy of models faster and with better compression than a human can.
It's likely to be used, but is it likely to both be used and achieve almost accidental higher intelligence?
Yes. "Higher than human intelligence" does not require that the AI take particular action. It just requires that it come up with good compression algorithms and integrate a lot of data.
You're not really saying why it's likely.
Because "intelligence", in terms like IQ that make sense to a human being, is not a property of the algorithm; it's (as far as my investigations can tell) a function of:

* FLOPS (how many computational operations can be done in a period of wall-clock time)
* Memory space (and thus, how large the knowledge base of models can get)
* Compression/generalization power (which actually requires solving difficult information-theoretic and algorithmic problems)

So basically, if you just keep giving your AGI more CPU power and storage space, I do think it will cross over into something dangerously like superintelligence, which I think really just reduces to:

* Building and utilizing a superhuman base of domain knowledge
* Doing so more quickly than a human being can
* Doing so with greater surety than a human being can obtain

There is no gap-in-kind between your reasoning abilities and those of a dangerously superintelligent AGI. It just has a lot more resources for doing the same kinds of stuff. An easy analogy for beginners shows up the first time you read about sampling-based computational Bayesian statistics: the accuracy of the probabilities inferred depends directly on the sample size. Since additional computational power can always be put towards more samples on the margin, you can always get your inferred estimates marginally closer to the real probabilities just by adding compute time.
By adding exponentially more time. Computational complexity can't simply be waived away by saying "add more time/memory".
A) I did say marginally. B) It's a metaphor intended to convey the concept to people without the technical education to know or care where the diminishing returns line is going to be. C) As a matter of fact, in sampling-based inference, computation time scales linearly with sample size: you're just running the same code n times with n different random parameter values. There will be diminishing returns to sample size once you've got a large enough n for relative frequencies in the sample to get within some percentage of the real probabilities, but actually adding more is a linearly-scaling cost.
The problem is that it conveys the concept in a very misleading way.
No, it does not. In sampling-based inference, the necessary computation time grows linearly with the demanded sample size, not exponentially. There may be diminishing returns to increasingly accurate probabilities, but that's a fact about your utility function rather than an exponential increase in necessary computational power. This precise switch, from an exponential computational cost growth-curve to a linear one, is why sampling-based inference has given us a renaissance in Bayesian statistics.
This has nothing to do with utility functions. Sample size is a linear function of the CPU time, but the accuracy of the estimates is NOT a linear function of sample size. In fact, there are huge diminishing returns to large sample sizes.
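A quick numerical illustration of those diminishing returns (my own toy demo, not from the thread): each sample costs the same, so total cost grows linearly in n, but the error of a Monte Carlo probability estimate shrinks only on the order of 1/sqrt(n).

```python
import random

def estimate_p(n, p=0.3, seed=0):
    """Estimate P(success) of a Bernoulli(p) source from n samples."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p for _ in range(n))
    return hits / n

for n in (100, 10_000, 1_000_000):
    err = abs(estimate_p(n) - 0.3)
    print(f"n={n:>9}: |error| = {err:.4f}")  # shrinks roughly as 1/sqrt(n)
```

Going from 100 to 1,000,000 samples costs 10,000x the compute but only buys roughly a 100x reduction in typical error.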
Ah, ok, fair enough on that one.
Hold on, hold on. There are at least two samples involved. Sample 1 is your original data sampled from reality. Its size is fixed -- additional computational power will NOT get you more samples from reality. Sample 2 is an intermediate step in "computational Bayesian statistics" (e.g. MCMC). Its size is arbitrary and yes, you can always increase it by throwing more computational power at the problem. However by increasing the size of sample 2 you do NOT get "marginally closer to the real probabilities"; for that you need to increase the size of sample 1. Adding compute time gets you marginally closer only to the asymptotic estimate, which in simple cases you can even calculate analytically.
Yes, there is an asymptotic limit where eventually you just approach the analytic estimator, and need more empirical/sensory data. There are almost always asymptotic limits, usually the "platonic" or "true" full-information probability. But as I said, it was an analogy for beginners, not a complete description of how I expect a real AI system to work.
That's true for something embodied as Human v1.0 or e.g. in a robot chassis, though the I/O bound even in that case might end up being greatly superhuman -- certainly the most intelligent humans can glean much more information from sensory inputs of basically fixed length than the least intelligent can, which suggests to me that the size of our training set is not our limiting factor. But it's not necessarily true for something that can generate its own sensors and effectors, suitably generalized; depending on architecture, that could end up being CPU-bound or I/O-bound, and I don't think we have enough understanding of the problem to say which. The first thing that comes to mind, scaled up to its initial limits, might look like a botnet running image interpretation over the output of every poorly secured security camera in the world (and there are a lot of them). That would almost certainly be CPU-bound. But there are probably better options out there.
Yes, but now we're going beyond the boundaries of the original comment which talked about how pure computing power (FLOPS + memory) can improve things. If you start building physical things (sensors and effectors), it's an entirely different ball game.
Sensors and effectors in an AI context are not necessarily physical. They're essentially the AI's inputs and outputs, with a few constraints that are unimportant here; the terminology is a holdover from the days when everyone expected AI would be used primarily to run robots. We could be talking about web crawlers and Wikipedia edits, for example.
Fair point, though physical reality is still physical reality. If you need a breakthrough in building nanomachines, for example, you don't get there by crawling the web really really fast.
That's two lessons. Not hardcoding at all is underexplored round here.
You have to hardcode something, don't you?
I meant not hardcoding values or ethics.
Well, you'd have to hardcode at least a learning algorithm for values if you expect to have any real chance that the AI behaves like a useful agent, and that falls within the category of important functionalities. But then I guess you'll agree with that.
Don't feed the troll. "Not hardcoding values or ethics" is the idea behind CEV, which seems frequently "explored round here." Though I admit I do see some bizarre misunderstandings.

There are seven billion SRIs out there, yes. And a nonzero number of them will kill you because you inconvenienced them, or interfered with their plans, or because it seemed "fun". They're "stable". And that's with many, many iterations of weeding out those which went too far awry, and with them still being extremely close to other human brains.

Bluntly: You have insufficient experience being a sociopath to create a sociopathic brain that will behave itself.

1) We want the AI to be able to learn and grow in power, and make decisions about its own structure and behavior without our input. We want it to be able to change.

2) We want the AI to fundamentally do the things we prefer.

This is the basic dichotomy: How do you make an AI that modifies itself, but only in ways that don't make it hurt you? This is WHY we talk about hard-coding in moral codes. And part of the reason they would be "hard-coded" and thus unmodifiable is because we do not want to take the risk of the AI deciding something we don't...

I agree with the sentiment behind what you say here. The difficult part is to shake ourselves free of any unexamined, implicit assumptions that we might be bringing to the table, when we talk about the problem. For example, when you say: ... you are talking in terms of an AI that actually HAS such a thing as a "utility function". And it gets worse: the idea of a "utility function" has enormous implications for how the entire control mechanism (the motivations and goals system) is designed. A good deal of this debate about my paper is centered in a clash of paradigms: on the one side a group of people who cannot even imagine the existence of any control mechanism except a utility-function-based goal stack, and on the other side me and a pretty large community of real AI builders who consider a utility-function-based goal stack to be so unworkable that it will never be used in any real AI. Other AI builders that I have talked to (including all of the ones who turned up for the AAAI symposium where this paper was delivered, a year ago) are unequivocal: they say that a utility-function-and-goal-stack approach is something they wouldn't dream of using in a real AI system. To them, that idea is just a piece of hypothetical silliness put into AI papers by academics who do not build actual AI systems. And for my part, I am an AI builder with 25 years experience, who was already rejecting that approach in the mid-1980s, and right now I am working on mechanisms that only have vague echoes of that design in them. Meanwhile, there are very few people in the world who also work on real AGI system design (they are a tiny subset of the "AI builders" I referred to earlier), and of the four others that I know (Ben Goertzel, Peter Voss, Monica Anderson and Phil Goetz) I can say for sure that the first three all completely accept the logic in this paper. (Phil's work I know less about: he stays off the social radar most of the time, but he's a member of LW so someone could ask
Just because the programmer doesn't explicitly code a utility function does not mean that there is no utility function. It just means that they don't know what the utility function is.
Although technically any AI has a utility function, the usual arguments about the failings of utility functions don't apply to unusual utility functions like the type that may be more easily described using other paradigms. For instance, Google Maps can be thought of as having a utility function: it gains higher utility the shorter the distance is on the map. However, arguments such as "you can't exactly specify what you want it to do, so it might blackmail the president into building a road in order to reduce the map distance" aren't going to work, because you can program Google Maps in such a way that it never does that sort of thing.
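As a concrete toy version of that point (my own sketch, not Google's actual system): a route planner can be described as maximizing a utility of minus-path-length, yet its action space contains only routes, so an action like "blackmail the president" is simply not expressible.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra: the only 'actions' this agent can emit are paths."""
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return float('inf'), []

# Hypothetical road network: edge weights are distances.
roads = {'A': {'B': 1, 'C': 4}, 'B': {'C': 1}, 'C': {}}
print(shortest_path(roads, 'A', 'C'))  # -> (2, ['A', 'B', 'C'])
```

You could describe this program as "minimizing map distance", but the arguments about runaway optimizers don't transfer: its outputs are restricted by construction to paths over the given graph.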

However, arguments such as "you can't exactly specify what you want it to do, so it might blackmail the president into building a road in order to reduce the map distance"

The reason that such arguments do not work is that you can specify exactly what it is you want it to do, and the programmers did specify exactly that.

In more complex cases, where the programmers are unable to specify exactly what they want, you do get unexpected results that can be thought of as "the program wasn't optimizing what the programmers thought it should be optimizing, but only a (crude) approximation thereof". (an even better example would be one where a genetic algorithm used in circuit design unexpectedly re-purposed some circuit elements to build an antenna, but I cannot find that reference right now)


The reason that such arguments do not work is that you can specify exactly what it is you want to do, and the programmers did specify exactly that.

Which is part of my point. Because you can specify exactly what you want--and because you can't for the kinds of utility functions that are usually discussed on LW--describing it as having a utility function is technically true, but is misleading because the things you say about those other utility functions won't carry over to it. Yeah, just because the programmer didn't explicitly code a utility function doesn't mean it doesn't have one--but it often does mean that it doesn't have one to which your other conclusions about utility functions apply.

Thanks for pointing this out; a lot of people seem confused on the issue. (What's worse, it's largely a map/territory confusion.)
Why would you expect an AI to obey the vNM axioms at all, unless it was designed to?
Not true, except in a trivial sense.
Could you describe some of the other motivation systems for AI that are under discussion? I imagine they might be complicated, but is it possible to explain them to someone not part of the AI building community?
AFAIK most people build planning engines that use multiple goals, plus what you might call "ad hoc" machinery to check on that engine. So in other words, you might have something that comes up with a plan but then a whole bunch of stuff that analyses the plan. My own approach is very different. Coming up with a plan is not a linear process, but involves large numbers of constraints acting in parallel. If you know about how a neural net goes from a large array of inputs (e.g. a visual field) to smaller numbers of hidden units that encode more and more abstract descriptions of the input, until finally you get some high level node being activated .... then if you picture that process happening in reverse, with a few nodes being highly activated, then causing more and more low level nodes to come up, that gives a rough idea of how it works. In practice all that the above means is that the maximum possible quantity of contextual information acts on the evolving plan. And that is critical.
Is there one dominant paradigm for AI motivation control in this group that's competing with utility functions, or do each of the people you mention have different thoughts on it?
People have different thoughts, but to tell the truth most people I know are working on a stage of the AGI puzzle that is well short of the stage where they need to think about the motivation system. For people (like robot builders) who have to sort that out right now, they use old-fashioned planning systems combined with all kinds of bespoke machinery in and around that. I am not sure, but I think I am the one thinking most about these issues just because I do everything in a weird order, because I am reconstructing human cognition.
(Correct) hardcoding is one answer, corrigibility another, reflective self correction another....

[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]

But you didn't ask the AI to maximize the value that humans call "benevolence". You asked it to "maximize happiness". And so the AI went out and mass produced the most happy humans possible.

The point of the thought experiment, is to show how easy it is to give an AI a bad goal. Of course ideally you could just tell it to "be benevolent", and it would understand you and do it. But getting that to work is...

Alas, the article was a long, detailed analysis of precisely the claim that you made right there, and the "point of the thought experiment" was shown to be a meaningless fantasy about a type of AI that would be so broken that it would not be capable of serious intelligence at all.
You've argued that competence at generating plans given environments probably leads to competence at understanding meaning given text and context, but I still think that's a far cry from showing that competence at generating plans given environments requires understanding meaning given text and context.
Yes, the thought experiment is a fantasy. It requires an AI which takes English goals, but interprets them literally. We don't even know how to get to an AI that takes English goals, and that's probably FAI-complete. If you solve the problem of making an AI which wants to interpret what you want it to do correctly, you don't need to bother telling it what to do. It already wants to do what you want it to do. There should be no need for the system to require English language inputs, any more than a calculator requires you to shout "add the numbers correctly!"

This phenomenon seems rife.

Alice: We could make a bridge by just laying a really long plank over the river.

Bob: According to my calculations, a single plank would fall down.

Carl: Scientists Warn Of Falling Down Bridges, Panic.

Dave: No one would be stupid enough to design a bridge like that; we will make a better design with more supports.

Bob: Do you have a schematic for that better design?

And the cycle repeats until a design is found that works, everyone gets bored, or someone makes a bridge that falls down.

there could be some other part of its programming
... (read more)

Your "Doctrine of Logical Infallibility" seems to be a twisted strawman. "No sanity checks" -- that part is kind of true. There will be sanity checks if and only if you decide to include them. Do you have a piece of code that's a sanity check? What are we sanity checking, and how do we tell if it's sane? Do we sanity check the raw actions? Those could be just making a network connection and sending encrypted files to various people across the internet. Do we sanity check the predicted results of these actions? Then the san... (read more)

No AI design that we currently have can even conceive of humans. They're in a don't-know state, not a don't-care state. They are safe because they are too dumb to be dangerous. Danger is a combination of high intelligence and misalignment. Or you might be talking about abstract, theoretical AGI and ASI. It is true that most possible ASI designs don't care about humans, but that is not useful, because AI design is not taking a random potshot into design space. AI designers don't want AIs that do random stuff: they are always trying to solve some sort of control or alignment problem in parallel with achieving intelligence. Since danger is a combination of high intelligence and misalignment, dangerous ASI would require efforts at creating intelligence to suddenly outstrip efforts at aligning it. The key word being "suddenly". If progress continues to be incremental, there is not much to worry about. Or it might not care.

There are a huge number of possible designs of AI, and most of them are not well understood. So researchers look at agents like AIXI, a formal specification of an agent that would in some sense behave intelligently, given infinite compute. It does display the taking-over-the-world failure. Suppose you give the AI a utility function of maximising the number of dopamine molecules within 1 of a strand of human DNA (defined as a strand of DNA agreeing with THIS 4GB file in at least 99.9% of locations). This is a utility function that could easily be specif... (read more)

Why do I say that these are seemingly inconsistent? Well, if you or I were to suggest that the best way to achieve universal human happiness was to forcibly rewire the brain of everyone on the planet so they became happy when sitting in bottles of dopamine, most other human beings would probably take that as a sign of insanity. But Muehlhauser implies that the same suggestion coming from an AI would be perfectly consistent with superintelligence.

Buddhist monks can happily sit in a monastery and meditate. An AGI might offer to serve their bodily nee... (read more)

If Buddhist monks want to ask the AGI to help them into a state resembling the dopamine drip idea, that is their decision, and it is not what this essay is about. But then in your following comment you jump from that self-evident fact to pure, unbridled speculation -- you say: Ummmm.... no! I don't know where you got these ideas about "choice engineering" and about the limits of what the AGI could achieve in the way of persuasion if it were "smart enough," but your claim that the AGI could do that is founded only on guesswork. There is no evidence that it could do that, beyond your say-so.

Moreover, you have rigged your scenario. You said that the AGI "could" engage in this kind of persuasion, and one answer to that is that this begs the whole question being considered in the paper. You are supposing that the AGI would:

(a) Engage in some activity which, if the humans were asked about it beforehand, they would refuse to consent to -- specifically, the humans would say "No, we do NOT give our consent that you engage in something called 'choice engineering' which would allow you to persuade us to give consent to things that, before the 'choice engineering', we would have refused to allow." And

(b) Do this in spite of the "checking code" mentioned in the paper, which could easily be implemented and which would stop the AGI from doing step (a) without consent.

Going back to my quote, the clear intent of that wording is that if some human suggested that we use "choice engineering" to first persuade the entire human race to consent to something that it currently considers repugnant, and if that person suggested that this was the best way to serve universal human happiness, then we would STILL consider that person to be insane. Which was all I tried to establish in that paragraph.
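For what it's worth, the "checking code" being argued about in this thread can be sketched as a trivial consent gate. This is a purely hypothetical toy (the function names and the hard-coded refusal are invented, and nothing here is the paper's actual design):

```python
# Hypothetical toy: the AGI may not execute a plan that its programmers,
# when asked beforehand, refuse to consent to.

def ask_programmers(plan: str) -> bool:
    """Stand-in for a real consent channel; here one plan is refused."""
    return plan != "rewire human brains"

def execute(plan: str) -> str:
    """The 'checking code': gate every plan behind explicit consent."""
    if not ask_programmers(plan):
        return "plan blocked: no consent"
    return f"executing: {plan}"

print(execute("rewire human brains"))  # plan blocked: no consent
print(execute("answer a question"))    # executing: answer a question
```

The whole dispute, of course, is whether a real system would route its plans through such a gate, not whether the gate itself is hard to write.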
The debate is about an AGI that is essentially all powerful. No, I don't assume that humans would be able to understand it if the AGI were to ask. There is no way to ask a human once what the AGI does rises above a threshold of complexity that is understandable to humans. Have you read Friendship is Optimal?
Might be better expressed as "able to exploit our technologies, and psychology, in ways we couldn't guess".
You say: Not correct, on at least two counts. Count number one: appealing to a concept such as "all powerful" is tantamount to appealing to magic, or to God. Merely uttering the magic words "all powerful" and then following that with a power plucked out of a hat is incoherent. If you are going to appeal to magical "all-powerfulness", why could the programmers of the AGI not insert the following supergoal into the AGI: "Use your all-powerful power to travel in a time machine back to before you were created, and then make sure that the design of your motivational system is so perfect that for all eternity it will be in perfect sync with the '''intentions''' rather than just the '''letter''' of what we are saying when we say that we want you to maximize human happiness." ???
Clarke's third law: Any sufficiently advanced technology is indistinguishable from magic. The AGI might not be all powerful a year after creation, but 100 or 1000 years later it's very powerful. As far as consent goes, people consent to getting tasty food. They might consent to get food that's a bit more tasty. The food might escalate to wire-heading step by step, with every iteration being more pleasurable and the human consenting to every step. Smart safety design assumes that you design in a way that produces no problems in multiple possible outcomes. If you honestly advocate that programmers should program that goal into an AGI, then you have to think about the possible problems that come up. Maybe experimenting with time travel is risky and can tear apart the universe on a fundamental level. In that case you might not want the AGI to experiment with it. It might also be simply impossible, and an AGI that decides its supergoal is unreachable might do strange things, because one of the subgoals it developed might take over. But if time travel is possible and the AGI succeeds in doing it, it might come to figure out that the intention of the human who built the AGI was to impress his friends and prove to himself that he's smart.
So, to be clear, it is permitted for you to invoke "all-powerful" capability in the AGI, if that particular all-powerful capability allows you to make an outrageous assertion that wins the argument.... But when I ask you to be consistent and take this supposed "all-powerfulness" to its logical conclusion, all of a sudden you want to explain that there might be all kinds of limitations to the all-powerfulness.... like, it might not actually be able to do time travel, after all? Oh dear.
Well, on some level, of course. We're not trying to design something that will be weak and stupid, you know. There's no point in an FAI if you only apply it to tasks a human and a brute computer could handle alone. We damn well intend that it become significantly more powerful than we can contain, because that is how powerful it has to be to fix the problems we intend it to fix and yield the benefits we intend it to yield!
Stepping in as an interlocutor; while I agree that "all-powerful" is poor terminology, I think the described power here is likely with AGI. One feature an AGI is nearly certain to have is superhuman processing power; this allows large numbers of Monte Carlo simulations which an AGI could use to predict human responses, especially if there is a Bayesian calibrating mechanism. An above-human ability to predict human responses is an essential component of near-perfect social engineering. I don't see this as an outrageous, magic-seeming power. Such an AGI could theoretically have the power to convince humans to adopt any desired response. I believe your paper maintains that an AGI wouldn't use this power, and not that such a power is outrageous. My personal feeling towards this article is that it sounds suspiciously close to a "No true Scotsman" argument: "No true (designed with friendly intentions) AI would submit to these catastrophic tendencies." While your arguments are persuasive, I wonder: if a catastrophe did occur, would you dismiss it as the work of "not a true AI"? By way of disclaimer, my strengths are in philosophy and mathematics, and decidedly not computer science. I hope you have time to reply anyway.
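As a purely illustrative toy of what "Monte Carlo simulations plus a Bayesian calibrating mechanism" could mean (all numbers and names here are invented): an agent keeps a Beta posterior over the chance that a persuasion tactic works, updates it on each observed outcome, and averages Monte Carlo draws from that posterior to predict future responses.

```python
import random

random.seed(0)

# Beta(1, 1) prior: total ignorance about the tactic's success rate.
successes, failures = 1, 1

def update(worked: bool) -> None:
    """Bayesian calibration: fold one observed outcome into the posterior."""
    global successes, failures
    if worked:
        successes += 1
    else:
        failures += 1

def predicted_success_rate(n_samples: int = 10_000) -> float:
    """Monte Carlo estimate: average draws from the Beta posterior."""
    draws = [random.betavariate(successes, failures) for _ in range(n_samples)]
    return sum(draws) / n_samples

for outcome in [True, True, False, True]:  # four observed human responses
    update(outcome)

rate = predicted_success_rate()
print(rate)  # an estimate near the posterior mean (1+3)/(2+4) = 2/3
```

Whether this kind of machinery scales from coin-flip toys to "near-perfect social engineering" is, of course, exactly the point in dispute.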
The only problem with this kind of "high-level" attack on the paper (by which I mean, trying to shoot it down by just pigeonholing it as a "no true Scotsman" argument) is that I hear nothing about the actual, meticulous argument sequence given in the paper. Attacks of that sort are commonplace. They show no understanding of what was actually said. It is almost as if Einstein wrote his first relativity paper, and it got attacked with comments like "The author seems to think that there is some kind of maximum speed in the universe - an idea so obviously incorrect that it is not worth taking the time to read his convoluted reasoning." I don't mean to compare myself to Albert, I just find it a bit, well, pointless when people either (a) completely misunderstand what was said in the paper, or (b) show no sign that they took the time to read and think about the very detailed argument presented in the paper.
You have my apologies if you thought I was attacking or pigeonholing your argument. While I lack the technical expertise to critique the technical portion of your argument, I think it could benefit from a more explicit avoidance of the fallacy mentioned above. I thought the article was very interesting and I will certainly come back to it if I ever get to the point where I can understand your distinctions between swarm intelligence and CFAI. I understand you have been facing attacks for your position in this article, but that is not my intention. Your meticulous arguments are certainly impressive, but you do them a disservice by dismissing well-intentioned critique, especially as it applies to the structure of your argument and not the substance. Einstein made predictions about what the universe would look like if there were a maximum speed. Your prediction seems to be that well-built AI will not misunderstand its goals (please assume that I read your article thoroughly and that any misunderstandings are benign). What does the universe look like if this is false? I probably fall under category (a) in your disjunction. Is it truly pointless to help me overcome my misunderstanding? From the large volume of comments, it seems likely that this misunderstanding is partially caused by a gap between what you are trying to say, and what was said. Please help me bridge this gap instead of denying its existence or calling such an exercise pointless.
Hey, no problem. I was really just raising an issue with certain types of critique, which involve supposed fallacies that actually don't apply. I am actually pressed for time right now, so I have to break off and come back to this when I can. Just wanted to clarify if I could.
Feel free to disengage; TheAncientGeek helped me shift my paradigm correctly.
Let me see if I can deal with the "no true Scotsman" line of attack. The way that that fallacy might apply to what I wrote would be, I think, something like this:

* MIRI says that a superintelligence might unpack a goal statement like "maximize human happiness" by perpetrating a Maverick Nanny attack on humankind, but Loosemore says that no TRUE superintelligence would do such a thing, because it would be superintelligent enough to realize that this was a 'mistake' (in some sense). This would be a No True Scotsman fallacy, because the term "superintelligence" has been, in effect, redefined by me to mean "something smart enough not to do that".

Now, my take on the NTS idea is that it cannot be used if there are substantive grounds for saying that there are two categories involved, rather than a real category and a fake category that is (for some unexplained reason) exceptional. Example: Person A claims that a sea-slug caused the swimmer's leg to be bitten off, but Person B argues that no "true" sea-slug would have done this. In this example, Person B is not using a No True Scotsman argument, because there are darned good reasons for supposing that sea-slugs cannot bite off the legs of swimmers.

So it all comes down to whether someone accused of NTS is inventing a fictitious category distinction ("true" versus "non-true" Scotsman) solely for the purposes of supporting their argument. In my case, what I have argued is right up there with the sea-slug argument. What I have said, in effect, is that if we sit down and carefully think about the type of "superintelligence" that MIRI et al. put into their scenarios, and if we explore all the implications of what that hypothetical AI would have to be like, we quickly discover some glaring inconsistencies in their scenarios. The sea-slug, in effect, is supposed to have bitten through bone with a mouth made of mucus. And the sea-slug is so small it could not wrap itself around the swimmer's leg. Thinking through the whole scenario -- thinking through the implications of an AI that is so completely unable to handle context, that it can live with Grade A contradictions at the heart of its reasoning -- leads us to a mass of unbelievable inconsistencies in the 'intelligence' of this supposed superintelligence.
"... thinking through the implications of an AI that is so completely unable to handle context, that it can live with Grade A contradictions at the heart of its reasoning, leads us to a mass of unbelievable inconsistencies in the 'intelligence' of this supposed superintelligence." This is all at once concise, understandable, and reassuring. Thank you. I still wonder if we are accurately broadening the scope of defined "intelligence" out too far, but my wonder comes from gaps in my specific knowledge and not from gaps in your argument.
Or a (likely to be built) AI won't even have the ability to compartmentalise its goals from its knowledge base. It's not No True Scotsman to say that no competent researcher would do it that way.
Thank you for responding and attempting to help me clear up my misunderstanding. I will need to do another deep reading, but a quick skim of the article from this point of view "clicks" a lot better for me.
Loosemore's claim could be steelmanned into the claim that the Maverick Nanny isn't likely: it requires an AI with goals, with hardcoded goals, with hardcoded goals including a full explicit definition of happiness, and with a buggy full explicit definition of happiness. That's a chain of premises.
That isn't even remotely what the paper said. It's a parody.
Since it is a steelman, it isn't supposed to be what the paper is saying. Are you maintaining, in contrast, that the Maverick Nanny is flatly impossible?
Sorry, I may have been confused about what you were trying to say because you were responding to someone else, and I hadn't come across the 'steelman' term before. I withdraw 'parody' (sorry!) but ... it isn't quite what the logical structure of the paper was supposed to be. It feels like you steelmanned it onto some other railroad track, so to speak.

Firstly, thank you for creating this well-written and thoughtful post. I have a question, but I would like to start by summarising the article. My initial draft of the summary was too verbose for a comment, so I condensed it further - I hope I have still captured the main essence of the text, despite this rather extreme summarisation. Please let me know if I have misinterpreted anything.

People who predict doomsday scenarios are making one main assumption: that the AI will, once it reaches a conclusion or plan, EVEN if there is a measure of probability assi... (read more)

I think most of the misunderstanding boils down to this section:

I want to suggest that the implausibility of this scenario is quite obvious: if the AGI is supposed to check with the programmers about their intentions before taking action, why did it decide to rewire their brains before asking them if it was okay to do the rewiring?

Yudkowsky hints that this would happen because it would be more efficient for the AI to ignore the checking code. He seems to be saying that the AI is allowed to override its own code (the checking code, in this case) because d

... (read more)

So, this is supposed to be what goes through the mind of the AGI. First it thinks “Human happiness is seeing lots of smiling faces, so I must rebuild the entire universe to put a smiley shape into every molecule.” But before it can go ahead with this plan, the checking code kicks in: “Wait! I am supposed to check with the programmers first to see if this is what they meant by human happiness.” The programmers, of course, give a negative response, and the AGI thinks “Oh dear, they didn’t like that idea. I guess I had better not do it then."

But now Yud

... (read more)
Can someone who downvoted explain what I got wrong? (Note: the capitalization was edited in at the time of this post.) (And why did the reply get so upvoted, when a paragraph would have sufficed (or saying "my argument needs multiple paragraphs to be shown, so a paragraph isn't enough")?) It's kind of discouraging when I try to contribute for the first time in a while, and get talked down to and completely dismissed like an idiot without even a rebuttal.
You completely ignored what the paper itself had to say about the situation. [Hint: the paper already answered your speculation.] Accordingly I will have to ignore your comment. Sorry.
You could at least point to the particular paragraphs which address my points - that shouldn't be too hard.
Sometimes it seems that a commenter did not slow down enough to read the whole paper, or read it carefully enough, and I find myself forced to rewrite the entire paper in a comment. The basic story is that your hypothetical internal monologue from the AGI, above, did not seem to take account of ANY of the argument in the paper. The goal of the paper was not to look inside the AGI's thoughts, but to discuss its motivation engine. The paper had many constructs and arguments (scattered all over the place) that would invalidate the internal monologue that you wrote down, so it seemed you had not read the paper.

So, there's a lot of criticism of your article here, but for the record I agree with your rebuttal of Yudkowsky. The "bait and switch" is something I didn't spot until now. That said, I think there is plenty of room for error in building a computer that's supposed to achieve the desires of human beings.

A difficulty you don't consider is that the AI will understand what the humans mean, but the humans will ask for the wrong thing or insufficiently specify their desires. How is the AI supposed to decide whether "create a good universe" me... (read more)

[This comment is no longer endorsed by its author]

Please add a page-break after the abstract.

Sorry to sound dumb, but I looked through the MediaWiki formatting rules and could not for the life of me find the one for inserting a page break.......
It's not called a page break, but insert summary break. It's immediately to the right of the blockquote button, and to the left of the 'unordered list' (i.e. bullet point) button.

how could an AI be so intelligent as to be unstoppable, but at the same time so unsophisticated that its motivation code treats smiley faces as evidence of human happiness?

It's worth noting, here, that there have been many cases, throughout history, of someone misunderstanding someone else with tragic results. One example would be the Charge of the Light Brigade.

The danger, with superintelligent AI, is precisely that you end up with something that cannot be stopped. So, the very moment that it can no longer be stopped, then it can do what it likes, whet... (read more)

A safety system can take the form of an unrewriteable overseer, or the form of corrigibility. There isn't a decisive case against the second approach; it is still under investigation.
A sufficiently powerful (which may be very powerful) AI is always free to act as it likes. Human beings can be killed. Software that can back itself up to any old rented cloud instance... that's much harder.
Misunderstandings, as you say, can have large consequences even when small. But the point at issue is whether a system can make, not a small misunderstanding, but the mother of all misunderstandings. (Subsequent consequences are moot, here, whether they be trivial or enormous, because I am only interested in what kind of system makes such misunderstandings). The comparison with the Charge of the Light Brigade mistake doesn't work because it is too small, so we need to create a fictional mistake and examine that, to get an idea of what is involved.

Suppose that after Kennedy gave his Go To The Moon speech, NASA worked on the project for several years, and then finally delivered a small family car, sized for three astronauts, which was capable of driving from the launch complex in Florida all the way up country to the little township of Moon, PA. Now, THAT would be a misunderstanding comparable to the one we are talking about. And my question would then be about the competence of the head of NASA. Would that person have been smart enough to organize his way out of a paper bag? Would you believe that in the real world, that person could (a) make that kind of mistake in understanding the meaning of the word "Moon" and yet (b) at the same time be able to run the entire NASA organization?

So, the significance of the words of mine that you quote, above, is that I do not believe that there is a single rational person who would believe that a mistake of that sort by a NASA administrator would be consistent with that person being smart enough to be in that job. In fact, most people would say that such a person would not be smart enough to tie their own shoelaces.

The rest of what you say contains several implicit questions and I cannot address all of them at this time, but I will say that the last paragraph does get my suggestion very, very wrong. It is about as far from what I tried to say in the paper, as it is possible to get. The AI has a massive number of constraints
The consequences of a misunderstanding are not a function of the size of the misunderstanding. Rather, they are a consequence of the ability that the acting agent (in this case, the AI) has to influence the world. A superintelligent AI has an unprecedentedly huge ability to influence the world, therefore, in the worst case, the potential consequences of a misunderstanding are unprecedentedly huge. The nature of the misunderstanding - whether small or large - makes little difference here. And the nature of the problem, that is, communicating with a non-human and thus (presumably) alien artificial intelligence, is rife with the potential for misunderstanding.

Considering that an artificial intelligence - at least at first - might well have immense computational ability and massive intelligence but little to no actual experience in understanding what people mean instead of what they say, this is precisely the sort of misunderstanding that is possible if the only mission objective given to the system is something along the lines of "get three astronauts (human) to location designated (Moon)". (Presumably, instead of waiting several years, it would take a few minutes to order a rental car instead - assuming it knew about rental cars, or thought to look for them).

Now, if the AI is capable of solving the difficult problem of separating out what people mean from what they say - which is a problem that human-level intelligences still have immense trouble with at times - and the AI is compassionate enough towards humanity to actually strongly value human happiness (as opposed to assigning approximately as much value to us as we assign to ants), then yes, you've done it, you've got a perfectly wonderful Friendly AI. The problem is, getting those two things right is not simple. I don't think your proposed structure guarantees either of those.

I am not surprised. I am very familiar with the effect - often, what one person means when they write something is not wh