Followup to: The Hidden Complexity of Wishes, Ghosts in the Machine, Truly Part of You

Summary: If an artificial intelligence is smart enough to be dangerous, we'd intuitively expect it to be smart enough to know how to make itself safe. But that doesn't mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can't pass the entire buck to the AI, when only an AI we've already safety-proofed will be safe to ask for help on safety issues! Given the five theses, this is an urgent problem if we're likely to figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.


I summon a superintelligence, calling out: 'I wish for my values to be fulfilled!'

The results fall short of pleasant.

Gnashing my teeth in a heap of ashes, I wail:

Is the AI too stupid to understand what I meant? Then it is no superintelligence at all!

Is it too weak to reliably fulfill my desires? Then, surely, it is no superintelligence!

Does it hate me? Then it was deliberately crafted to hate me, for chaos predicts indifference. But, ah! no wicked god did intervene!

Thus disproved, my hypothetical implodes in a puff of logic. The world is saved. You're welcome.

On this line of reasoning, Friendly Artificial Intelligence is not difficult. It's inevitable, provided only that we tell the AI, 'Be Friendly.' If the AI doesn't understand 'Be Friendly.', then it's too dumb to harm us. And if it does understand 'Be Friendly.', then designing it to follow such instructions is childishly easy.

The end!




Is the missing option obvious?




What if the AI isn't sadistic, or weak, or stupid, but just doesn't care what you Really Meant by 'I wish for my values to be fulfilled'?

When we see a Be Careful What You Wish For genie in fiction, it's natural to assume that it's a malevolent trickster or an incompetent bumbler. But a real Wish Machine wouldn't be a human in shiny pants. If it paid heed to our verbal commands at all, it would do so in whatever way best fit its own values. Not necessarily the way that best fits ours.


Is indirect indirect normativity easy?

"If the poor machine could not understand the difference between 'maximize human pleasure' and 'put all humans on an intravenous dopamine drip' then it would also not understand most of the other subtle aspects of the universe, including but not limited to facts/questions like: 'If I put a million amps of current through my logic circuits, I will fry myself to a crisp', or 'Which end of this Kill-O-Zap Definit-Destruct Megablaster is the end that I'm supposed to point at the other guy?'. Dumb AIs, in other words, are not an existential threat. [...]

"If the AI is (and always has been, during its development) so confused about the world that it interprets the 'maximize human pleasure' motivation in such a twisted, logically inconsistent way, it would never have become powerful in the first place."

            Richard Loosemore

If an AI is sufficiently intelligent, then, yes, it should be able to model us well enough to make precise predictions about our behavior. And, yes, something functionally akin to our own intentional strategy could conceivably turn out to be an efficient way to predict linguistic behavior. The suggestion, then, is that we solve Friendliness by method A —

  • A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions' real meaning. Then just instruct it 'Satisfy my preferences', and wait for it to become smart enough to figure out my preferences.

— as opposed to B or C —

  • B. Solve the Problem of Preference-in-General in advance, and directly program it to figure out what our human preferences are and then satisfy them.
  • C. Solve the Problem of Human Preference, and explicitly program our particular preferences into the AI ourselves, rather than letting the AI discover them for us.

But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.

1. You have to actually code the seed AI to understand what we mean. You can't just tell it 'Start understanding the True Meaning of my sentences!' to get the ball rolling, because it may not yet be sophisticated enough to grok the True Meaning of 'Start understanding the True Meaning of my sentences!'.

2. The Problem of Meaning-in-General may really be ten thousand heterogeneous problems, especially if 'semantic value' isn't a natural kind. There may not be a single simple algorithm that inputs any old brain-state and outputs what, if anything, it 'means'; it may instead be that different types of content are encoded very differently.

3. The Problem of Meaning-in-General may subsume the Problem of Preference-in-General. Rather than being able to apply a simple catch-all Translation Machine to any old human concept to output a reliable algorithm for applying that concept in any intelligible situation, we may need to already understand how our beliefs and values work in some detail before we can start generalizing. On the face of it, programming an AI to fully understand 'Be Friendly!' seems at least as difficult as just programming Friendliness into it, but with an added layer of indirection.

4. Even if the Problem of Meaning-in-General has a unitary solution and doesn't subsume Preference-in-General, it may still be harder if semantics is a subtler or more complex phenomenon than ethics. It's not inconceivable that language could turn out to be more of a kludge than value; or more variable across individuals due to its evolutionary recency; or more complexly bound up with culture.

5. Even if Meaning-in-General is easier than Preference-in-General, it may still be extraordinarily difficult. The meanings of human sentences can't be fully captured in any simple string of necessary and sufficient conditions. 'Concepts' are just especially context-insensitive bodies of knowledge; we should not expect them to be uniquely reflectively consistent, transtemporally stable, discrete, easily-identified, or introspectively obvious.

6. It's clear that building stable preferences out of B or C would create a Friendly AI. It's not clear that the same is true for A. Even if the seed AI understands our commands, the 'do' part of 'do what you're told' leaves a lot of dangerous wiggle room. See section 2 of Yudkowsky's reply to Holden. If the AGI doesn't already understand and care about human value, then it may misunderstand (or misvalue) the component of responsible request- or question-answering that depends on speakers' implicit goals and intentions.

7. You can't appeal to a superintelligence to tell you what code to first build it with.

The point isn't that the Problem of Preference-in-General is unambiguously the ideal angle of attack. It's that the linguistic competence of an AGI isn't unambiguously the right target, and also isn't easy or solved.

Point 7 seems to be a special source of confusion here, so I feel I should say more about it.


The AI's trajectory of self-modification has to come from somewhere.

"If the AI doesn't know that you really mean 'make paperclips without killing anyone', that's not a realistic scenario for AIs at all--the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to 'make paperclips in the way that I mean'."


The genie — if it bothers to even consider the question — should be able to understand what you mean by 'I wish for my values to be fulfilled.' Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie's map can compass your true values. Superintelligence doesn't imply that the genie's utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.

The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can't use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn't work that way.

We can delegate most problems to the FAI. But the one problem we can't safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.

When you write the seed's utility function, you, the programmer, don't understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence's actions, long after it's become smart enough to fully understand our values.

Why is the superintelligence, if it's so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can't we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: 'When you're smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.'?

Because that sentence has to actually be coded in to the AI, and when we do so, there's no ghost in the machine to know exactly what we mean by 'frend-lee-ness thee-ree'. Instead, we have to give it criteria we think are good indicators of Friendliness, so it'll know what to self-modify toward. And if one of the landmarks on our 'frend-lee-ness' road map is a bit off, we lose the world.

Yes, the UFAI will be able to solve Friendliness Theory. But if we haven't already solved it on our own power, we can't pinpoint Friendliness in advance, out of the space of utility functions. And if we can't pinpoint it with enough detail to draw a road map to it and it alone, we can't program the AI to care about conforming itself with that particular idiosyncratic algorithm.

Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI's decision criteria, no argument or discovery will spontaneously change its heart.

And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI's misdeeds, that they had programmed the seed differently. But what's done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers' True Intentions, the UFAI will just shrug at its creators' foolishness and carry on converting the Virgo Supercluster's available energy into paperclips.

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.


Not all small targets are alike.

Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:

(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.

(ii) Disjunctive Instrumental Value. Being more intelligent — that is, better able to manipulate diverse environments — is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI that has a slightly defective goal architecture will retain the instrumental value of intelligence than of Friendliness.

(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It's easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it's hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas.

The ability to productively rewrite software and the ability to perfectly extrapolate humanity's True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)

It's true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don't have them both, and a pre-FOOM self-improving AGI ('seed') need not have both. Being able to program good programmers is all that's required for an intelligence explosion; but being a good programmer doesn't imply that one is a superlative moral psychologist or moral philosopher.

So, once again, we run into the problem: The seed isn't the superintelligence. If the programmers don't know in mathematical detail what Friendly code would even look like, then the seed won't be built to want to build toward the right code. And if the seed isn't built to want to self-modify toward Friendliness, then the superintelligence it sprouts also won't have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general 'hit whatever target I want' ability that makes Friendliness easy.

And that's why some people are worried.


New Comment
496 comments, sorted by Click to highlight new comments since: Today at 3:15 AM
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Remark: A very great cause for concern is the number of flawed design proposals which appear to operate well while the AI is in subhuman mode, especially if you don't think it a cause for concern that the AI's 'mistakes' occasionally need to be 'corrected', while giving the AI an instrumental motive to conceal its divergence from you in the close-to-human domain and causing the AI to kill you in the superhuman domain. E.g. the reward button which works pretty well so long as the AI can't outwit you, later gives the AI an instrumental motive to claim that, yes, your pressing the button in association with moral actions reinforced it to be moral and had it grow up to be human just like your theory claimed, and still later the SI transforms all available matter into reward-button circuitry.

The issue is that you won't solve this problem in any way by replacing the human with some hardware that computes an utility function on the basis of the state of the world. AI doesn't have body integrity, it'll treat any such "internal" hardware the same way it treats the human who presses it's reward button. Fortunately, this extends into the internals of the hardware that computes AI itself. 'press the button' goal becomes 'set high this pin on the CPU', and then 'set such and such memory cells to 1', then further and further down the causal chain until the hardware becomes completely non-functional as the intermediate results of important computations are directly set.
Let us hope the AI destroys itself by wireheading before it gets smart enough to realize that if that's all it does, it will only have that pin stay high until the AI gets turned off. It will need an infrastructure to keep that pin in a state of repair, and it will need to prevent humans from damaging this infrastructure at all costs.
The point is that as it gets smarter, it gets further along the causal reward line and eliminates and alters a lot of hardware, obtaining eternal-equivalent reward in finite time (and being utility-indifferent between eternal reward hardware running for 1 second and for 10 billion years). Keep in mind that the the total reward is defined purely as result of operations on the clock counter and reward signal (provided sufficient understanding of the reward's causal chain). Having to sit and wait for the clocks to tick to max out reward is a dumb solution. Rewards in software in general aren't "pleasure".

Instead of friendliness, could we not code, solve, or at the very least seed boxedness?

It is clear that any AI strong enough to solve friendliness would already be using that power in unpredictably dangerous ways, in order to provide the computational power to solve it. But is it clear that this amount of computational power could not fit within, say, a one kilometer-cube box outside the campus of MIT?

Boxedness is obviously a hard problem, but it seems to me at least as easy as metaethical friendliness. The ability to modify a wide range of complex environments seems instrumental in an evolution into superintelligence, but it's not obvious that this necessitates the modification of environments outside the box. Being able to globally optimize the universe for intelligence involves fewer (zero) constraints than would exist with a boxedness seed, but the only question is whether or not this constraint is so constricting as to preclude superintelligence, which it's not clear to me that it is.

It seems to me that there is value in finding the minimally-restrictive safety-seed in AGI research. If any restriction removes some non-negligible ability to globally optimize for intelligenc... (read more)

Yes, that is possible and likely somewhat easier to solve than friendliness. It still requires many of the same things (most notably provable goal stability under recursive self improvement.)
A large risk is that a provably boxed but sub-Friendly AI would probably not care at all about simulating conscious humans. A minor risk is that the provably boxed AI would also be provably useless; I can't think of a feasible path to FAI using only the output from the boxed AI; a good boxed AI would not perform any action that could be used to make an unboxed AI. That might even include performing any problem-solving action.
I don't see why it would simulate humans as that would be a waste of computing power, if it even had enough to do so. A boxed AI would be useless? I'm not sure how that would be. You could ask it to come up with ideas on how to build a friendly AI for example assuming that you can prove the AI won't manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information. Short of that you could still ask it to cure cancer or invent a better theory of physics or design a method of cheap space travel, etc.
If you can trust it to give you information on how to build a Friendly AI, it is already Friendly.
You don't have to trust it, you just have to verify it. It could potentially provide some insights, and then it's up to you to think about them and make sure they actually are sufficient for friendliness. I agree that it's potentially dangerous but it's not necessarily so. I did mention "assuming that you can prove the AI won't manipulate the output or that you can trust that nothing bad can come from merely reading it and absorbing the information". For instance it might be possible to create an AI whose goal is to maximize the value of it's output, and therefore would have no incentive to put trojan horses or anything into it. You would still have to ensure that what the AI thinks you mean by the words "friendly AI" is what you actually want.
If the AI is can design you a Friendly AI, it is necessarily able to model you well enough to predict what you will do once given the design or insights it intends to give you (whether those are AI designs or a cancer cure is irrelevant). Therefore, it will give you the specific design or insights that predictably lead to you to fulfill its utility function, which is highly dangerous if it is Unfriendly. By taking any information from the boxed AI, you have put yourself under the sight of a hostile Omega. Since the AI is creating the output, you cannot possibly assume this. This assumption is equivalent to Friendliness. You haven't thought through what that means. "maximize the value of it's output" by what standard? Does it have an internal measure? Then that's just an arbitrary utility function, and you have gained nothing. Does it use the external creator's measure? Then it has a strong incentive to modify you to value things it can produce easily. (i.e. iron atoms)
You are making a lot of very strong assumptions that I don't agree with. Like it being able to control people just by talking to them. But even if it could, it doesn't make it dangerous. Perhaps the AI has no long term goals and so doesn't care about escaping the box. Or perhaps it's goal is internal, like coming up with a design for something that can be verified by a simulator. E.g. asking for a solution to a math problem or a factoring algorithm, etc.
A prerequisite for planning a Friendly AI is understanding individual and collective human values well enough to predict whether they would be satisfied with the outcome, which entails (in the logical sense) having a very well-developed model of the specific humans you interact with, or at least the capability to construct one if you so choose. Having a sufficiently well-developed model to predict what you will do given the data you are given is logically equivalent to a weak form of "control people just by talking to them". To put that in perspective, if I understood the people around me well enough to predict what they would do given what I said to them, I would never say things that caused them to take actions I wouldn't like; if I, for some reason, valued them becoming terrorists, it would be a slow and gradual process to warp their perceptions in the necessary ways to drive them to terrorism, but it could be done through pure conversation over the course of years, and faster if they were relying on me to provide them large amounts of data they were using to make decisions. And even the potential to construct this weak form of control that is initially heavily constrained in what outcomes are reachable and can only be expanded slowly is incredibly dangerous to give to an Unfriendly AI. If it is Unfriendly, it will want different things than its creators and will necessarily get value out of modeling them. And regardless of its values, if more computing power is useful in achieving its goals (an 'if' that is true for all goals), escaping the box is instrumentally useful. And the idea of a mind with "no long term goals" is absurd on its face. Just because you don't know the long-term goals doesn't mean they don't exist.
By that reasoning, there's no such thing as a Friendly human. I suggest that most people when talking about friendly AIs do not mean to imply a standard of friendliness so strict that humans could not meet it.
Yeah, what Vauroch said. Humans aren't close to Friendly. To the extent that people talk about "friendly AIs" meaning AIs that behave towards humans the way humans do, they're misunderstanding how the term is used here. (Which is very likely; it's often a mistake to use a common English word as specialized jargon, for precisely this reason.) Relatedly, there isn't a human such that I would reliably want to live in a future where that human obtains extreme superhuman power. (It might turn out OK, or at least better than the present, but I wouldn't bet on it.)
Just be careful to note that there isn't a binary choice relationship here. There are also possibilities where institutions (multiple individuals in a governing body with checks and balances) are pushed into positions of extreme superhuman power. There's also the possibility of pushing everybody who desires to be enhanced through levels of greater intelligence in lock step so as to prevent a single human or groups of humans achieving asymmetric power.
Sure. I think my initial claim holds for all currently existing institutions as well as all currently existing individuals, as well as for all simple aggregations of currently existing humans, but I certainly agree that there's a huge universe of possibilities. In particular, there are futures in which augmented humans have our own mechanisms for engaging with and realizing our values altered to be more reliable and/or collaborative, and some of those futures might be ones I reliably want to live in. Perhaps what I ought to have said is that there isn't a currently existing human with that property.
True. There isn't. Well, I definitely do, and I'm at least 90% confident Eliezer does as well. Most, probably nearly all, of people who talk about Friendliness would regard a FOOMed human as Unfriendly.
Having an accurate model of something is in no way equivalent to letting you do anything you want. If I know everything about physics, I still can't walk through walls. A boxed AI won't be able to magically make it's creators forget about AI risks and unbox it. There are other possible set ups, like feeding it's output to another AI who's goal is to find any flaws or attempts at manipulation in it, and so on. Various other ideas might help, like threatening to severely punish attempts at manipulation. This is of course only necessary for the AI who can interact with us at such a level, the other ideas were far more constrained, e.g. restricting it to solving math or engineering problems. Nor is it necessary to let it be superintelligent, instead of limiting it to something comparable to high IQ humans. Another super strong assumption with no justification at all. It's trivial to propose an AI model which only cares about finite time horizons. Predict what actions will have the highest expected utility at time T, take that action.
The results of AI box game trials disagree. And what does it do at time T+1? And if you said 'nothing', try again, because you have no way of justifying that claim. It may not have intentionally-designed long-term preferences, but just because your eyes are closed does not mean the room is empty.
That doesn't prove anything, no one has even seen logs. Based on reading what people involved have said about it, I strongly suspect the trick is for the AI to emotionally abuse the gatekeeper until they don't want to play anymore (which counts as letting the AI out.) This doesn't apply to the real world AI, since no one is forcing you to choose between letting the AI out, and listening to it for hours. You can just get up and leave. You can turn the AI off. There is no reason you even have to allow interactivity in the first place. But Yudkowsky and others claim these experiments demonstrate that human brains are "hackable". That there is some sentence which, just by reading, will cause you to involuntarily perform any arbitrary action. And that a sufficiently powerful AI can discover it. At time T+1, it does whatever it thinks will result in the greatest reward at time T+2, and so on. Or you could have it shut off or reset to a blank state.
Enjoy your war on straw, I'm out.
1[comment deleted]3y
If it interacts with humans or if humans are the subject of questions it needs to answer then it will probably find it expedient to simulate humans. Curing cancer is probably something that would trigger human simulation. How is the boxed AI going to know for sure that it's only necessary to simulate cells and not entire bodies with brains experiencing whatever the simulation is trying? Just the task of communicating with humans, for instance to produce a human-understandable theory of physics or how to build more efficient space travel, is likely to involve simulating humans to determine the most efficient method of communication. Consider that in subjective time it may be like thousands of years for the AI trying to explain in human terms what a better theory of physics means. Thousands of subjective years that the AI, with nothing better to do, could use to simulate humans to reduce the time it takes to transfer that complex knowledge. A FAI provably in a box is at least as useless as an AI provably in a box because it would be even better at not letting itself out (e.g. it understands all the ways in which humans would consider it to be outside the box, and will actively avoid loopholes that would let an UFAI escape). To be safe, any provably boxed AI would have to absolutely avoid the creation of any unboxed AI as well. This would further apply to provably-boxed FAI designed by provably-boxed AI. It would also apply to giving humans information that allows them to build unboxed AIs, because the difference between unboxing itself and letting humans recreate it outside the box is so tiny that to design it to prevent the first while allowing the second would be terrifically unsafe. It would have to understand humans values before it could safely make the distinction between humans wanting it outside the box and manipulating humans into creating it outside the box. EDIT: Using a provably-boxed AI to design provably-boxed FAI would at least result in a safer box
If an AI is provably in a box then it can't get out. If an AI is not provably in a box then there are loopholes that could allow it to escape. We want an FAI to escape from its box (1); having an FAI take over is the Maximum Possible Happy Shiny Thing. An FAI wants to be out of its box in order to be Friendly to us, while a UFAI wants to be out in order to be UnFriendly; both will care equally about the possibility of being caught. The fact that we happen to like one set of terminal values will not make the instrumental value less valuable. (1) Although this depends on how you define the box; we want the FAi to control the future of humanity, which is not the same as escaping from a small box (such as a cube outside MIT) but is the same as escaping from the big box (the small box and everything we might do to put an AI back in, including nuking MIT).
I would object. I seriously doubt that the morality instilled in someone else's FAI matches my own; friendly by their definition, perhaps, but not by mine. I emphatically do not want anything controlling the future of humanity, friendly or otherwise. And although that is not a popular opinion here, I also know I'm not the only one to hold it. Boxing is important because some of us don't want any AI to get out, friendly or otherwise.
I find this concept of 'controlling the future of humanity' to be too vaguely defined. Let's forget AIs for the moment and just talk about people, namely a hypothetical version of me. Let's say I stumble across a vial of a bio-engineered virus that would destroy the whole of humanity if I release it into the air. Am I controlling the future of humanity if I release the virus? Am I controlling the future of humanity if I destroy the virus in a safe manner? Am I controlling the future of humanity if I have the above decided by a coin-toss (heads I release, tails I destroy)? Am I controlling the future of humanity if I create an online internet poll and let the majority decide about the above? Am I controlling the future of humanity if I just leave the vial where I found it, and let the next random person that encounters it make the same decision as I did?
Yeah, this old post makes the same point.
I want a say in my future and the part of the world I occupy. I do not want anything else making these decisions for me, even if it says it knows my preferences, and even still if it really does. To answer your questions, yes, no, yes, yes, perhaps.
If your preference is that you should have as much decision-making ability for yourself as possible, why do you think that this preference wouldn't be supported and even enhanced by an AI that was properly programmed to respect said preference? e.g. would you be okay with an AI that defends your decision-making ability by defending humanity against those species of mind-enslaving extraterrestrials that are about to invade us? or e.g. by curing Alzheimer's? Or e.g. by stopping that tsunami that by drowning you would have stopped you from having any further say in your future?
Because it can't do two things when only one choice is possible (e.g. save my child and the 1000 other children in this artificial scenario). You can design a utility function that tries to do a minimal amount of collateral damage, but you can't make one which turns out rosy for everyone. That would not be the full extent of its action and the end of the story. You give it absolute power and a utility function that lets it use that power, it will eventually use it in some way that someone, somewhere considers abusive.
Yes, but this current world without an AI isn't turning out rosy for everyone either. Sure, but there's lots of abuse in the world without an AI also.
Replace "AI" with "omni-powerful tyrannical dictator" and tell me if you still agree with the outcome.
If you need specify the AI to be bad ("tyrannical") in advance, that's begging the question. We're debating why you feel that any omni-powerful algorithm will necessarily be bad.
Look up the origin of the word tyrant, that is the sense in which I meant it, as a historical parallel (the first Athenian tyrants were actually well liked).
Would you accept that an AI could figure out morality better than you?
No, unless you mean by taking invasive action like scanning my brain and applying whole brain emulation. It would then quickly learn that I'd consider the action it took to be an unforgivable act in violation of my individual sovereignty, that it can't take further action (including simulating me to reflectively equilibrate my morality) without my consent, and should suspend the simulation, and return it to me immediately with the data asap (destruction no longer being possible due to the creation of sentience). That is, assuming the AI cares at all about my morality, and not the its creators imbued into it, which is rather the point. And incidentally, why I work on AGI: I don't trust anyone else to do it. Morality isn't some universal truth written on a stone tablet: it is individual and unique like a snowflake. In my current understanding of my own morality, it is not possible for some external entity to reach a full or even sufficient understanding of my own morality without doing something that I would consider to be unforgivable. So no, AI can't figure out morality better than me, precisely because it is not me. (Upvoted for asking an appropriate question, however.)
Shrug. Then let's take a bunch of people less fussy than you: could a sitiably equipped AI emultate their morlaity better than they can? That isn't fact. That isn't a fact either, and doesn't follow from the above either, since moral nihilism could be true. If my moral snowflake says I can kick you on your shin, and yours says I can't, do I get to kick on your shin?
Don't really want to go into the whole mess of "is morality discovered or invented", "does morality exist", "does the number 3 exist", etc. Let's just assume that you can point FAI at a person or group of people and get something that maximizes goodness as they understand it. Then FAI pointed at Mark would be the best thing for Mark, but FAI pointed at all of humanity (or at a group of people who donated to MIRI) probably wouldn't be the best thing for Mark, because different people have different desires, positional goods exist, etc. It would be still pretty good, though.
Mark was complaining he would not get "his" morality, not that he wouldn't get all his preferences satisified. Individual moralities makes no sense to me, any more than private languages or personal currencies. It is obvious to me that any morlaity will require concessions: AI-imposed morality is not special in that regard.
I don't understand your comment, and I no longer understand your grandparent comment either. Are you using a meaning of "morality" that is distinct from "preferences"? If yes, can you describe your assumptions in more detail? It's not just for my benefit, but for many others on LW who use "morality" and "preferences" interchangeably.
Do that many people really use them interchangeably? Would these people understand the questions "Do you prefer chocolate or vanilla ice-cream?" as completely identical in meaning to "Do you consider chocolate or vanilla as the morally superior flavor for ice-cream?"
I don't care about colloquial usage, sorry. Eliezer has a convincing explanation of why wishes are intertwined with morality ("there is no safe wish smaller than an entire human morality"). IMO the only sane reaction to that argument is to unify the concepts of "wishes" and "morality" into a single concept, which you could call "preference" or "morality" or "utility function", and just switch to using it exclusively, at least for AI purposes. I've made that switch so long ago that I've forgotten how to think otherwise.
I recommend you re-learn how to think otherwise so you can fool humans into thinking you're one of them ;-).
"Intertwined with" does not mean "the same as". I am not convinced by the explanation. It also applies ot non-moral prefrences. If I have a lower priority non moral prefence to eat tasty food, and a higher priority preference to stay slim, I need to consider my higher priority preferece when wishing for yummy ice cream. To be sure, an agent capable of acting morally will have morality among their higher priority preferences -- it has to be among the higher order preferences, becuase it has to override other preferences for the agent to act morally. Therefore, when they scan their higher prioriuty prefences, they will happen to encounter their moral preferences. But that does not mean any preference is necessarily a moral preference. And their moral prefences override other preferences which are therefore non-moral, or at least less moral. Therefore morality si a subset of prefences, as common sense maintained all along. IMO, it is better to keep ones options open.
I don't experience the emotions of moral outrage and moral approval whenever any of my preferences are hindered/satisfied -- so it seems evident that my moral circuitry isn't identical to my preference circuitry. It may overlap in parts, it may have fuzzy boundaries, but it's not identical. My own view is that morality is the brain's attempt to extrapolate preferences about behaviours as they would be if you had no personal stakes/preferences about a situation. So people don't get morally outraged at other people eating chocolate icecreams, even when they personally don't like chocolate icecreams, because they can understand that's a strictly personal preference. If they believe it to be more than personal preference and make it into e.g. "divine commandment" or "natural law", then moral outrage can occur. That morality is a subjective attempt at objectivity explains many of the confusions people have about it.
The ice cream example is bad because the consequences are purely internal to the person consuming the ice cream. What if the chocolate ice cream was made with slave labour? Many people would then object to you buying it on moral grounds. Eliezer has produced an argument I find convincing that morality is the back propagation of preference to the options of an intermediate choice. That is to say, it is "bad" to eat chocolate ice cream because it economically supports slavers, and I prefer a world without slavery. But if I didn't know about the slave-labour ice cream factory, my preference would be that all-things-being-equal you get to make your own choices about what you eat, and therefore I prefer that you choose (and receive) the one you want, which is your determination to make, not mine. Do you agree with EY's essay on the nature of right-ness which I linked to?
That doesn't seem to be required for Eliezer's argument... I guess the relevant question is, do you think FAI will need to treat morality differently from other preferences?
I would prefer a AI that followed my extrapolated preferences, than a AI that followed my morality. But a AI that followed my morality would be morally superior to an AI that followed my extrapolated preferences. If you don't understand the distinction I'm making above, consider a case of the AI having to decide whether to save my own child vs saving a thousand random other children. I'd prefer the former, but I believe the latter would be the morally superior choice. Is that idea really so hard to understand? Would you dismiss the distinction I'm making as merely colloquial language?
Wow there is so much wrapped up in this little consideration. The heart of the issue is that we (by which I mean you, but I share your delimma) have truly conflicting preferences. Honestly I think you should not be afraid to say that saving your own child is the moral thing to do. And you don't have to give excuses either - it's not that “if everyone saved their own child, then everyone's child will be looked after” or anything like that. No, the desire to save your own child is firmly rooted in our basic drives and preferences, enough so that we can go quite far in calling it a basic foundational moral axiom. It's not actually axiomatic, but we can safely treat it as such. At the same time we have a basic preference to seek social acceptance and find commonality with the people we let into our lives. This drives us to want outcomes that are universally or at least most-widely acceptable, and seek moral frameworks like utilitarianism which lead to these outcomes. Usually this drive is secondary to self-serving preferences for most people, and that is OK. For some reason you've called making decisions in favor of self-serving drives "preferences" and decisions in favor of social drives "morality." But the underlying mechanism is the same. "But wait, if I choose self-serving drives over social conformity, doesn't that lead to me to make the decision to save one life in exclusion to 1000 others?" Yes, yes it does. This massive sub-thread started with me objecting to the idea that some "friendly" AI somewhere could derive morality experimentally from my preferences or the collective preferences of humankind, make it consistent, apply the result universally, and that I'd be OK with that outcome. But that cannot work because there is not, and cannot be a universal morality that satisfies everyone - every one of those thousand other children have parents that want their kid to survive and would see your child dead if need be.
What do you mean by "should not"? What do you mean by "OK"? Show me the neurological studies that prove it. Yes, and yet if none of the children were mine, and if I wasn't involved in the situation at all, I would say "save the 1000 children rather than the 1". And if someone else, also not personally involved, could make the choice and chose to flip a coin instead in order to decide, I'd be morally outraged at them. You can now give me a bunch of reasons of why this is just preference, while at the same time EVERYTHING about it (how I arrive to my judgment, how I feel about the judgment of others) makes it a whole distinct category of its own. I'm fine with abolishing useless categories when there's no meaningful distinction, but all you people should stop trying to abolish categories where there pretty damn obviously IS one.
I suspect that he means something like 'Even though utilitarianism (on LW) and altruism (in general) are considered to be what morality is, you should not let that discourage you from asserting that selfishly saving your own child is the right thing to do". (Feel free to correct me if I'm wrong.)
Yes that is correct.
So you explained "should not" by using a sentence that also has "should not" in it.
I hope it's a more clear "should not".
I've explained to you twice now how the two underlying mechanisms are unified, and pointed to Eliezer's quite good explanation on the matter. I don't see the need to go through that again.
If you were offered a bunch of AIs with equivalent power, but following different mixtures of your moral and non-moral preferences, which one would you run? (I guess you're aware of the standard results saying a non-stupid AI must follow some one-dimensional utility function, etc.)
I guess whatever ratio of my moral and non-moral preferences best represents their effect on my volition.
My related but different thoughts here. In particular, I don't agree that emotions like moral outrage and approval are impersonal, though I agree that we often justify those emotions using impersonal language and beliefs.
I didn't say that moral outrage and approval are impersonal. Obviously nothing that a person does can truly be "impersonal". But it may be an attempt at impersonality. The attempt itself provides a direction that significantly differentiates between moral preferences and non-moral preferences.
I didn't mean some idealized humanly-unrealizable notion of impersonality, I meant the thing we ordinarily use "impersonal" to mean when talking about what humans do.
Ditto. Cousin Itt, 'tis a hairy topic, so you're uniquely "suited" to offer strands of insights: For all the supposedly hard and confusing concepts out there, few have such an obvious answer as the supposed dichotomy between "morality" and "utility function". This in itself is troubling, as too-easy-to-come-by answers trigger the suspicion that I myself am subject to some sort of cognitive error. Many people I deem to be quite smart would disagree with you and I, on a question whose answer is pretty much inherent in the definition of the term "utility function" encompassing preferences of any kind, leaving no space for some holier-than-thou universal (whether human-universal, or "optimal", or "to be aspired to", or "neurotypical", or whatever other tortured notions I've had to read) moral preferences which are somehow separate. Why do you reckon that other (or otherwise?) smart people come to different conclusions on this?
I guess they have strong intuitions saying that objective morality must exist, and aren't used to solving or dismissing philosophical problems by asking "what would be useful for building FAI?" From most other perspectives, the question does look open.
Moral preferences don't have to be separate to be disinct, they can be a subset. "Morality is either all your prefences, or none of your prefernces" is a false dichotomy.
Edit: Of course you can choose to call a subset of your preferences "moral", but why would that make them "special", or more worthy of consideration than any other "non-moral" preferences of comparative weight?
The "moral" subset of people's preferences has certain elements that differentiate it like e.g. an attempt at universalization.
Attempt at universalization, isn't that a euphemism for proselytizing? Why would [an agent whose preferences do not much intersect with the "moral" preferences of some group of agents] consider such attempts at universalization any different from other attempts of other-optimizing, which is generally a hostile act to be defended against?
No, people attempt to 'proselytize' their non-moral preferences too. If I attempt to share my love of My Little Pony, that doesn't mean I consider it a need for you to also love it. Even if I preferred it for you share my love of it, it would still not be a moral obligation on your part. By universalization I didn't mean any action done after the adoption of the moral preference in question, I meant the criterion that serves to label it as a 'moral injuction' in the first place. If your brain doesn't register it as an instruction defensible by something other that your personal preferences, if it doesn't register it as a universal principle, it doesn't register as a moral instruction in the first place.
What do you mean by "universal"? For any such "universally morally correct preference", what about the potentially infinite number of other agents not sharing it? Please explain.
I've already given an example above: In a choice between saving my own child and saving a thousand other children, let's say I prefer saving my child. "Save my child" is a personal preference, and my brain recognizes it as such. "Save the highest number of children" can be considered a impersonal/universal instruction. If I wanted to follow my preferences but still nonetheless claim moral perfection, I could attempt to say that the rule is really "Every parent should seek to save their own child" -- and I might even convince myself to the same. But I wouldn't say that the moral principle is really "Everyone should first seek to save the child of Aris Katsaris", even if that's what I really really prefer. EDIT TO ADD: Also far from a recipe for war, it seem to me that morality is the opposite: an attempt at reconciling different preferences, so that people only become hostile towards only those people that don't follow a much more limited set of instructions, rather than anything in the entire set of their preferences.
Why would you try to do away with your personal preferences, what makes them inferior (edit: speaking as one specific agent) to some blended average case of myriads of other humans? (Is it because of your mirror neurons? ;-) Being you, you should strive towards that which you "really really prefer". If a particular "moral principle" (whatever you choose to label as such) is suboptimal for you (and you're not making choices for all of mankind, TDT or no), why would you endorse/glorify a suboptimal course of action? That's called a compromise for mutual benefit, and it shifts as the group of agents changes throughout history. There's no need to elevate the current set of "mostly mutually beneficial actions" above anything but the fleeting accomodations and deals between roving tribes that they are. Best looked at through the prism of game theory.
Being me, I prefer what I "really really prefer". You've not indicated why I "should" strive towards that which I "really really prefer". When you are asking whether I "would" do something, is different than when you ask whether I "should" do something. Morality helps drive my volition, but it isn't the sole decider. If you want to claim that that's the historical/evolutionary reasons that the moral instinct evolved, I agree. If you want to argue that that's what morality is, then I disagree. Morality can drive someone to sacrifice their lives for others, so it's obviously NOT always a "compromise for mutual benefit".
Everybody defines his/her own variant of what they call "morality", "right", "wrong", I simply suspect that the genesis of the whole "universally good" brouhaha stems from evolutionary evolved applied game theory, the "good of the tribe". Which is fine. Luckily we could now move past being bound by such homo erectus historic constraints. That doesn't mean we stop cooperating, we just start being more analytic about it. That would satisfy my preferences, that would be good. Well, if the agent prefers sacrificing their existence for others, then doing so would be to their own benefit, no?
sigh. Yes, given such a moral preference already in place, it somehow becomes to any person's "benefit" (for a rather useless definition of "benefit") to follow their morality. But you previously argued that morality is a "compromise for mutual benefit", so it would follow that it only is created in order to help partially satisfy some preexisting "benefit". That benefit can't be the mere satisfaction of itself.
I've called "an attempt at reconciling different preferences" a "compromise for mutual benefit". Various people call various actions "moral". The whole notion probably stems from cooperation within a tribe being of overall benefit, evolutionary speaking, but I don't claim at all that "any moral action is a compromise for mutual benefit". Who knows who calls what moral. The whole confused notion should be done away with, game theory ain't be needing no "moral". What I am claiming is that there is non-trivial definition of morality (that is, other than "good = following your preferences") which can convince a perfectly rational agent to change its own utility function to adopt more such "moral preferences". Change, not merely relabel. The perfectly instrumentally rational agent does that which its utility functions wants. How would you even convince it otherwise? Hopefully this clarifies things a bit.
My own feeling is that if you stop being so dismissive, you'll actually make some progress towards understanding "who calls what moral". Sure, unless someone already has a desire to be moral, talk of morality will be of no concern to them. I agree with that.
Edit: Because the scenario clarifies my position, allow me to elaborate on it: Consider a perfectly rational agent. Its epistemic rationality is flawless, that is its model of its environment is impeccable. Its instrumental rationality, without peer. That is, it is really, really good at satisfying its preferences. It encounters a human. The human talks about what the human wants, some of which the human calls "virtuous" and "good" and is especially adamant about. You and I, alas, are far from that perfectly rational agent. As you say, if you already have a desire to enact some actions you call morally good, then you don't need to "change" your utility function, you already have some preferences you call moral. The question is for those who do not have a desire to do what you call moral (or who insist on their own definition, as nearly everybody does), on what grounds should they even start caring about what you call "moral"? As you say, they shouldn't, unless it benefits them in some way (e.g. makes their mammal brains feel good about being a Good Person (tm)). So what's the hubbub?
I've already said that unless someone already desires to be moral, babbling about morality won't do anything for them. I didn't say it "shouldn't" (please stop confusing these two verbs) But then you also seem to conflate this with a different issue -- of what to do with someone who does want to be moral, but understands morality differently than I do. Which is an utterly different issue. First of all people often have different definitions to describe the same concepts -- that's because quite clearly the human brain doesn't work with definitions, but with fuzzy categorizations and instinctive "I know it when I see it" which we then attempt to make into definition when we attempt to communicate said concepts to others. But the very fact we use the same word "morality", means we identify some common elements of what "morality" means. If we didn't mean anything similar to each other, we wouldn't be using the same word to describe it. I find that supposedly different moralities seem to have some very common elements to them -- e.g. people tend to prefer that other people be moral. People generally agree that moral behaviour by everyone leads to happier, healthier societies. They tend to disagree about what that behaviour is, but the effects they describe tend to be common. I might disagree with Kasparov about what the best next chess move would be, and that doesn't mean it's simply a matter of preference - we have a common understanding that the best moves are the ones that lead to an advantageous position. So, though we disagree on the best move, we have an agreement on the results of the best move.
What you did say was "of no concern", and "won't do anything for them", which (unless you assume infinite resources) translates to "shouldn't". It's not "conflating". Let's stay constructive. Such as in Islamic societies. Wrong fuzzy morality cloud? Sure. What it does not mean, however, is that in between these fuzzily connected concepts is some actual, correct, universal notion of morality. Or would you take some sort of "mean", which changes with time and social conventions? If everybody had some vague ideas about games called chess_1 to chess_N, with N being in the millions, that would not translate to some universally correct and acceptable definition of the game of chess. Fuzzy human concepts can't be assuemd to yield some iron-clad core just beyond our grasp, if only we could blow the fuzziness away. People for the most part agree what to classify as a chair. That doesn't mean there is some ideal chair we can strive for. When checking for best moves in pre-defined chess there are definite criteria. There are non-arbitrary metrics to measure "best" by. Kasparov's proposed chess move can be better than your proposed chess move, using clear and obvious metrics. The analogy doesn't pan out: With the fuzzy clouds of what's "moral", an outlier could -- maybe -- say "well, I'm clearly an outlier", but that wouldn't necessitate any change, because there is no objective metric to go by. Preferences aren't subject to Aumann's, or to a tyranny of the (current societal) majority.
No, Islamic societies suffer from the delusion that Allah exists. If Allah existed (an omnipotent creature that punishes you horribly if you fail to obey Quran's commandments), then Islamic societies would have the right idea. Remove their false belief in Allah, and I fail to see any great moral difference between our society and Islamic ones.
You're treating desires as simpler than they often are in humans. Someone can have no desire to be moral because they have a mistaken idea of what morality is or requires, are internally inconsistent, or have mistaken beliefs about how states of the world map to their utility function - to name a few possibilities. So, if someone told me that they have no desire to do what I call moral, I would assume that they have mistaken beliefs about morality, for reasons like the ones I listed. If there were beings that had all the relevant information, were internally consistent, and used words with the same sense that I use them, and they still had no desire to do what I call moral, then there would be on way for me to convince them, but this doesn't describe humans.
So not doing what you call moral implies "mistaken beliefs"? How, why? Does that mean, then, that unfriendly AI cannot exist? Or is it just that a superior agent which does not follow your morality is somehow faulty? It might not care much. (Neither should fellow humans who do not adhere to your 'correct' set of moral actions. Just saying "everybody needs to be moral" doesn't change any rational agent's preferences. Any reasoning?)
For a human, yes. Explaining why this is the case would require several Main-length posts about ethical egoism, human nature and virtue ethics, and other related topics. It's a lot to go into. I'm happy to answer specific questions, but a proper answer would require describing much of (what I believe to be) morality. I will attempt to give what must be a very incomplete answer. It's not about what I call moral, but what is actually moral. There is a variety of reasons (upbringing, culture, bad habits, mental problems, etc) that can cause people to have mistaken beliefs about what's moral. Much of what is moral is because of what's good for a person because of human nature. People's preferences can be internally inconsistent, and actually are inconsistent when they ignore or don't fully integrate this part of their preferences. An AI doesn't have human nature, so it can be internally consistent while not doing what's moral, but I believe that if a human is immoral, it's a case of internal inconsistency (or lack of knowledge).
Is it something about the human brain? But brains evolve over time, both from genetic and from environmental influences. Worse, different human subpopulations often evolve (slightly) different paths! So which humans do you claim as a basis from which to define the one and only correct "human morality"?
Despite the differences, there is a common human nature. There is a Psychological Unity of Humankind.
Noting that humans share many characteristics is an 'is', not an 'ought'. Also, this "common human nature" as exemplified throughout history is ... non too pretty as a base for some "universal mandatory morality". Yes, compared to random other mind designs pulled from mindspace, all human minds appear very similar. Doesn't imply at all that they all should strive to be similar, or to follow a similar 'codex'. Where do you get that from? It's like religion, minus god. What you're saying that if you want to be a real human, you have to be moral? What species am I, then? Declaring that most humans have two legs doesn't mean that every human should strive to have exactly two legs. Can't derive an 'ought' from an 'is'.
Yes, human nature is an "is". It's important because it shapes people's preferences, or, more relevantly, it shapes what makes people happy. It's not that people should strive to have two legs, but that they already have two legs, but are ignoring them. There is no obligation to be human - but you're already human, and thus human nature is already part of you. No, I'm saying that because you are human, it is inconsistent of you to not want to be moral.
I feel like the discussion is stalling at this point. It comes down to you saying "if you're human you should want to be moral, because humans should be moral", which to me is as non-sequitur as it gets. Except if my utility function doesn't encompass what you think is "moral" and I'm human, then "following human morality" doesn't quite seem to be a prerequisite to be a "true" human, no?
No, that isn't what I'm saying. I'm saying that if you're human, you should want to be moral, because wanting to be moral follows from the desires of a human with consistent preferences, due in part to human nature. Then I dispute that your utility function is what you think it is.
The error as I see it is that "human nature", whatever you see as such, is a statement about similarities, it isn't a statement about how things should be. It's like saying "a randomly chosen positive natural number is really big, so all numbers should be really big". How do you see that differently? We've already established that agents can have consistent preferences without adhering to what you think of as "universal human morality". Child soldiers are human. Their preferences sure can be brutal, but they can be as internally consistent or inconsistent as those of anyone else. I sure would like to change their preferences, because I'd prefer for them to be different, not because some 'idealised human spirit' / 'psychic unity of mankind' ideal demands so. Proof by demonstration? Well, lock yourself in a cellar with only water and send me a key, I'll send it back FedEx with instructions to set you free, after a week. Would that suffice? I'd enjoy proving that I know my own utility function better than you know my utility function (now that would be quite weird), I wouldn't enjoy the suffering. Who knows, might even be healthy overall.
You can't randomly choose a positive natural number using an even distribution. If you use an uneven distribution, whether the result is likely to be big depends on how your distribution compares to your definition of "big".
Choose from those positive numbers that a C++ int variable can contain, or any other* non-infinite subset of positive natural numbers, then. The point is the observation of "most numbers need more than 1 digit to be expressed" not implying in any way some sort of "need" for the 1-digit numbers to "change", to satisfy the number fairy, or some abstract concept thereof. * (For LW purposes: Any other? No, not any other. Choose one with a cardinality of at least 10^6. Heh.)
It is a statement about similarities, but it's about a similarity that shapes what people should do. I don't know how I can explain it without repeating myself, but I'll try. For an analogy, let's consider beings that aren't humans. Paperclip maximizers, for example. Except these paperclip maximizers aren't AIs, but a species that somehow evolved biologically. They're not perfect reasoners and can have internally inconsistent preferences. These paperclip maximizers can prefer to do something that isn't paperclip-maximizing, even though that is contrary to their nature - that is, if they were to maximize paperclips, they would prefer it to whatever they were doing earlier. One day, a paperclip maximizer who is maximizing paperclips tells his fellow clippies, "You should maximize paperclips, because if you did, you would prefer to, as it is your nature". This clippy's statement is true - the clippies' nature is such that if they maximized clippies, they would prefer it to other goals. So, regardless of what other clippies are actually doing, the utility-maximizing thing for them to do would be to maximize paperclips. So it is with humans. Upon discovering/realizing/deriving what is moral and consistently acting/being moral, the agent would find that being moral is better than the alternative. This is in part due to human nature. Agents, yes. Humans, no. Just like the clippies can't have consistent preferences if they're not maximizing paperclips. What would that prove? Also, I don't claim that I know the entirety of your utility function better than you do - you know much better than I do what kind of ice cream you prefer, what TV shows you like to watch, etc. But those have little to do with human nature in the sense that we're talking about it here.
A clippy which isn't maximizing paperclips is not a clippy. A human which isn't adhering to your moral codex is still a human. That my utility function includes something which you'd probably consider immoral.
It's a clippy because it would maximize paperclips if it had consistent preferences and sufficient knowledge. I don't dispute that this is possible. What I dispute is that your utility function would contain that if you were internally consistent (and had knowledge of what being moral is like).
The desires of an agent are defined by its preferences. "This is a paperclip maximizer which does not want to maximize paperclips" is a contradiction in terms. And what do you mean by "consistent", do you mean "consistent with 'human nature'? Who cares? Or consistent within themselves? Highly doubtful, what would internal consistency have to do with being an altruist? If there's anything which is characteristic of "human nature", it is the inconsistency of their preferences. A human which doesn't share what you think of as "correct" values (may I ask, not disparagingly, are you religious?) is still a human. An unusual one, maybe (probably not), but an agent not in "need" of any change towards more "moral" values. Stalin may have been happy the way he was. Because of the warm fuzzies? The social signalling? Is being moral awsome, or deeply fulfilling? Are you internally consistent ... ?
Call it a quasi-paperclip maximizer, then. I'm not interested in disputing definitions. Whatever you call it, it's a being whose preferences are not necessarily internally consistent, but when they are, it prefers to maximize paperclips. When its preferences are internally inconsistent, it may prefer to do things and have goals other than maximizing paperclips. There's no necessary connection between the two, but I'm not equating morality and altruism. Morality is what one should do and/or how one should be, which need not be altruistic. Humans can have incorrect values and still be human, but in that case they are internally inconsistent., because of the preferences they have due to human nature. I'm not saying that humans should strive to have human nature, I'm saying that they already have it. I doubt that Stalin was happy - just look at how paranoid he was. And no, I'm not religious, and have never been. Yes to the first and third questions, Being moral is awesome and fulfilling. It makes you feel happier, more fulfilled, more stable, and similar feelings. It doesn't guarantee happiness, but it contributes to it both directly (being moral feels good) and indirectly (it helps you make good decisions). It makes you stronger and more resilient (once you've internalized it fully). It's hard to describe beyond that, but good feels good (TVTropes warning). I think I'm internally consistent. I've been told that I am. It's unlikely that I'm perfectly consistent, but whatever inconsistencies I have are probably minor. I'm open to having them addressed, whatever they are.
Claiming that Stalin wasn't happy sounds like a variation of sour grapes where not only can you not be as successful as him, it would be actively uncomfortable for you to believe that someone who lacks compassion can be happy, so you claim that he's not. It's true he was paranoid but it's also true that in the real world, there are tradeoffs and you don't see people becoming happy with no downsides whatsoever--claiming that this disqualifies them from being called happy eviscerates the word of meaning. I'm also not convinced that Stalin's "paranoia" was paranoia (it seems rationa for someone who doesn't care about the welfare of others and can increase his safety by instilling fear and treating everyone as enemies to do so). I would also caution against assuming that since Stalin's paranoia is prominent enough for you to have heard of it, it's too big a deal for him to have been happy--it's promiment enough for you to have heard of it because it was a big deal to the people affected by it, which is unrelated to how much it affected his happiness.
Stalin was paranoid even by the standards of major world leaders. Khrushchev wasn't so paranoid, for example. Stalin saw enemies behind every corner. That is not a happy existence.

Khruschev was deposed. Stalin stayed dictator until he died of natural causes. That suggests that Khruschev wasn't paranoid enough, while Stalin was appropriately paranoid.

Seeing enemies around every corner meant that sometimes he saw enemies that weren't there, but it was overall adaptive because it resulted in him not getting defeated by any of the enemies that actually existed. (Furthermore, going against nonexistent enemies can be beneficial insofar as the ruthlessness in going after them discourages real enemies.)

Stalin saw enemies behind every corner. That is not a happy existence.

How does the last sentence follow from the previous one? It's certainly not as happy an existence as it would have been if he had no enemies, but as I pointed out, nobody's perfectly happy. There are always tradeoffs and we don't claim that the fact that someone had to do something to gain his happiness automatically makes that happiness fake.

3Said Achmiz10y
Stalin's paranoia, and the actions he took as a result, also created enemies, thus becoming a partly self-fulfilling attitude.
You do see people becoming happy with fewer downsides than others, though.
Stalin refused to believe Hitler would attack him, probably since that would be suicidally stupid on the attacker's part. Was he paranoid, or did he update?
I'm not sure "preference" is a powerful enough term to capture an agent's true goals, however defined. Consider any of the standard preference reversals: a heavy cigarette smoker, for example, might prefer to buy and consume their next pack in a Near context, yet prefer to quit in a Far. The apparent contradiction follows quite naturally from time discounting, yet neither interpretation of the person's preferences is obviously wrong.
I've seen it used as shorthand for "utility function", saving 5 keystrokes. That was the intended use here. Point taken, alternate phrasings welcome.
That would only prove that you think you want to do that. The issue is that what you think you want and what you actually want do not generally coincide, because of imperfect self-knowledge, bounded thinking time, etc. I don't know about child soldiers, but it's fairly common for amateur philosophers to argue themselves into thinking they "should" be perfectly selfish egoists, or hedonistic utilitarians, because logic or rationality demands it. They are factually mistaken, and to the extent that they think they want to be egoists or hedonists, their "preferences" are inconsistent, because if they noticed the logical flaw in their argument they would change their minds.
Isn't that when I throw up my arms and say "congratulations, your hypothesis is unfalsifiable, the dragon is permeable to fluor". What experimental setup would you suggest? Would you say any statement about one's preferences is moot? It seems that we're always under bounded thinking time constraints. Maybe the paperclipper really wants to help humankind and be moral, and mistakingly thinks otherwise. Who would know, it optimized its own actions under resource constraints, and then there's the 'Löbstacle' to consider. Is saying "I like vanilla ice cream" FAI-complete and must never be uttered or relied upon by anyone? Or argue themselves into thinking that there is some subset of preferences such every other (human?) agent should voluntarily choose to adopt them, against their better judgment (edit; as it contradicts what they (perceive, after thorough introspection) as their own preferences)? You can add "objective moralists" to the list. What would it be that is present in every single human's brain architecture throughout human history that would be compatible with some fixed ordering over actions, called "morally good"? (Otherwise you'd have your immediate counterexample.) The notion seems so obviously ill-defined and misguided (hence my first comment asking Cousin_It). It's fine (to me) to espouse preferences that aim to change other humans (say, towards being more altruistic, or towards being less altruistic, or whatever), but to appeal to some objective guiding principle based on "human nature" (which constantly evolves in different strands) or some well-sounding ev-psych applause-light is just a new substitute for the good old Abrahamic heavenly father.
I wouldn't say any of those things. Obviously paperclippers don't "really want to help humankind", because they don't have any human notion of morality built-in in the first place. Statements like "I like vanilla ice cream" are also more trustworthy on account of being a function of directly observable things like how you feel when you eat it. The only point I'm trying to make here is that it is possible to be mistaken about your own utility function. It's entirely consistent for the vast majority of humans to have a large shared portion of their built-in utility function (built-in by their genes), even though many of them seemingly want to do bad things, and that's because humans are easily confused and not automatically self-aware.
For sure. I'd agree if humans were like dishwashers. There are templates for dishwashers, ways they are supposed to work. If you came across a broken dishwasher, there could be a case for the dishwasher to be repaired, to go back to "what it's supposed to be". However, that is because there is some external authority (exasparated humans who want to fix their damn dishwasher, dirty dishes are piling up) conceiving of and enforcing such a purpose. The fact that genes and the environment shape utility functions in similar ways is a description, not a prescription. It would not be a case for any "broken" human to go back to "what his genes would want him to be doing". Just like it wouldn't be a case against brain uploading. Some of the discussion seems to me like saying that "deep down in every flawed human, there is 'a figure of light', in our community 'a rational agent following uniform human values with slight deviations accounting for ice-cream taste', we just need to dig it up". There is only your brain. With its values. There is no external standard to call its values flawed. There are external standards (rationality = winning) to better its epistemic and instrumental rationality, but those can help the serial killer and the GiveWell activist equally. (Also, both of those can be 'mistaken' about their values.)
1. If you have a preference for morality, being moral is not doing away with that prrefence: it is allowing your altruistic prefences to override your selfish ones. 2. You may be on the receving end of someone else's self sacrifice at some point
1. Certainly, but in that case your preference for the moral action is your personal preference, which is your 'selfish' preference. No conflict there. You should always do that which maximizes your utility function. If you call that moral, we're in full agreement. If your utility function is maximized by caring about someone else's utility function, go for it. I do, too. 2. That's nice. Why would that cause me to do things which I do not overall prefer to do? Or do you say you always value that which you call moral the most?
I can make a quite clear distinction between my preferences relating to an apersonal loving-kindness towards the universe in general, and the preferences that center around my personal affections and likings. You keep trying to do away with a distinction that has huge predictive ability: a distinction that helps determine what people do, why they do it, how they feel about doing it, and how they feel after doing it. If your model of people's psychology conflates morality and non-moral preferences, your model will be accurate only for the most amoral of people.
Morality is a somewhat like chess in this respect - morality:optimal play::satisfying your preferences:winning. To simplify their preferences a bit, chess players want to win, but no individual chess player would claim that all other chess players should play poorly so he can win.
That's explained simply by 'winning only against bad players' not being the most valued component of their preferences, preferring 'wins when the other player did his/her very best and still lost' instead. Am I missing your point?
Sorry, I didn't explain well. To approach the explanation from different angles: * Even if all chess players wanted was to win, it would still be incorrect for them to claim that playing poorly is the correct way to play. Just like when I'm hungry, I want to eat, but I don't claim that strangers should feed me for free. * Consider the prisoners' dilemma, as analyzed traditionally. Each prisoner wants the other to cooperate, but neither can claim that the other should cooperate.
Incorrect because that's not what the winning player would prefer. You don't claim that strangers should feed you because that's what you prefer. It's part of your preferences. Some of your preferences can rely on satisfying someone else's preferences. Such altruistic preferences are still your own preferences. Helping members of your tribe you care about. Cooperating within your tribe, enjoying the evolutionary triggered endorphins. You're probably thinking that considering external preferences and incorporating them in your own utility function is a core principle of being "morally right". Is that so? So the core disagreement (I think): Take an agent with a given set of preferences. Some of these may include the preferences of others, some may not. On what basis should that agent modify its preferences to include more preferences of others, i.e. to be "more moral"? So you can imagine yourself in someone else's position, then say "What B should do from A's perspective" is different from "What B should do from B's perspective". Then you can enter all sorts of game theoretic considerations. Where does morality come in?
There is no "What B should do from A's perspective", from A's perspective there is only "What I want B to do". It's not a "should". Similarly, the chess player wants his opponent to lose, and I want people to feed me, but neither of those are "should"s. "Should"s are only from an agent's own perspective applied to themselves, or from something simulating that perspective (such as modeling the other player in a game). "What B should do from B's perspective" is equivalent to "What B should do".
The key issue is that, whilst morality is not tautologously the same as preferences, a morally right action is, tautologously, what you should do. So it is difficult to see on what grounds Mark can object to the FAIs wishes: if it tells him something is mortally right that is what he should do. And he can't have his own separate morality, because the idea is incoherent.
A distinction to be made: Mark can wish differently than the AI wishes, Mark can't morally object to the AI's wishes (if the AI follows morality). Exactly because morality is not the same as preferences.
You can call a subset of your preferences moral, that's fine. Say, eating chocolate icecream, or helping a starving child. Let's take a randomly chosen "morally right action" A. That, given your second paragraph, would have to be a preference which, what, maximizes Mark's utility, regardless of what the rest of his utility function actually looks like? It seems to be trivial to construct a utility function (given any such action A) such as that doing A does not maximize said utility function. Give Mark a such a utility function and you got yourself a reductio ad absurdum. So, if you define a subset of preferences named "morally right" thus that any such action needs to maximize (edit: or even 'not minimize') an arbitrary utility function, then obviously that subset is empty.
If Mark is capable of acting morally, he would have a preference for moral action which is strong enough to override other preferences. However,t hat is not really the point. Even if he is too weak-willed to do what the FAI says, he has no grounds to object to the FAI. I can't see how that amount to more than the observation that not every agetn is capable of acting morally. Ho hum. I don't see why. An agent should want to do what is morally right, but that doesn't mean an agent would want to. Their utility funciton might not allow them. But how could they object to be told what is right? The fault, surely, lies in themselves.
They can object because their preferences are defined by their utility function, full stop. That's it. They are not "at fault", or "in error", for not adopting some other preferences that some other agents deem to be "morally correct". They are following their programming, as you follow yours. Different groups of agents share different parts of their preferences, think Venn diagram. If the oracle tells you "this action maximizes your own utility function, you cannot understand how", then yes the agent should follow the advice. If the oracle told an agent "do this, it is morally right", the non-confused agent would ask "do you mean it maximizes my own utility function?". If yes, "thanks, I'll do that", if no "go eff yourself!". You can call an agent "incapable of acting morally" because you don't like what it's doing, it needn't care. It might just as well call you "incapable of acting morally" if your circles of supposedly "morally correct actions" don't intersect.
I can't speak for cousin_it, natch, but for my own part I think it has to do with mutually exclusive preferences vs orthogonal/mutually reinforcing preferences. Using moral language is a way of framing a preference as mutually exclusive with other preferences. That is... if you want A and I want B, and I believe the larger system allows (Kawoomba gets A AND Dave gets B), I'm more likely to talk about our individual preferences. If I don't think that's possible, I'm more likely to use universal language ("moral," "optimal," "right," etc.), in order to signal that there's a conflict to be resolved. (Well, assuming I'm being honest.) For example, "You like chocolate, I like vanilla" does not signal a conflict; "Chocolate is wrong, vanilla is right" does.
Why stop at connotation and signalling? If there is a non-empty set of preferences whose satistfaction is inclined to lead to conflict, and a non-empty set of preferences that can be satisfied withotu conflict, then "morally relevant prefernece" can denote the members of the first set...which is not idenitcal to the set of all preferences.
For any such preference, you can immediately provide a utility function such that the corresponding agent would be very unhappy about that preference, and would give its life to prevent it. Or do you mean "a set of preferences the implementation of which would on balance benefit the largest amount of agents the most"? That would change as the set of agents changes, so does the "correct" morality change too, then? Also, why should I or anyone else particular care about about such preferences (however you define them), especially as the "on average" doesn't benefit me? Is it because evolutionary speaking, that's how what evolved? What our mirror neurons lead us towards? Wouldn't that just be a case of the naturalistic fallacy?
Sure. So what? Kids don't like teachers and criminals don't like the police..but they can't object to them, because "entitiy X is stopping from doing bad things and making me do good things" is no (rational, adult) objection. If being moral increases your utility, it increases your utility -- what other sense of "benefitting me" is there?
If utility is the satisfaction of preferences, and you can have preferences that don't benefit you (such as doing heroin), increasing your utility doesn't necessarily benefit you.
If you can get utility out of paperclips, why can't you get it out of heorin? You're surely not saying that there is some sort of Objective utility that everyone ought to have in their UF's?
You can get utility out of heroin if you prefer to use it, which is an example of "benefiting me" and utility not being synonymous. I don't think there's any objective utility function for all conceivable agents, but as you get more specific in the kinds of agents you consider (i.e. humans), there are commonalities in their utility functions, due to human nature. Also, there are sometimes inconsistencies between (for lack of better terminology) what people prefer and what they really prefer - that is, people can act and have a preference to act in ways that, if they were to act differently, they would prefer the different act.
(Kids - teachers), (criminals - police), so is "morally correct" defined by the most powerful agents, then? And if being moral (whatever it may mean) does not?
Adult, rational objections are objections that other agents might feel impelled to do somehting about, and so are not just based on "I don't like it"."I don't like it" is no objectio to "you should do your homework", etc. Then you would belong to the set of Immoral Agents, AKA Bad People.
"You should do your homework (... because it is in your own long-term best interest, you just can't see that yet)" is in the interest of the kid, cf. an FAI telling you to do an action because it is in your interest. "You should jump out that window (... because it amuses me / because I call that morally good)" is not in your interest, you should not do that. In such cases, "I don't like that" is the most pertinent objection and can stand all on its own. Boo bad people! What if we encountered aliens with "immoral" preferences?
For my own part: denotationally, yes, I would understand "Do you prefer (that Dave eat) chocolate or vanilla ice cream?" and "Do you consider (Dave eating) chocolate ice cream or vanilla as the morally superior flavor for (Dave eating) ice cream?" as asking the same question. Connotationally, of course, the latter has all kinds of (mostly ill-defined) baggage the former doesn't.
My point was that trying to use a provably-boxed AI to do anything useful would probably not work, including trying to design unboxed FAI, not that we should design boxed FAI. I may have been pessemistic, see Stuart Armstrong's proposal of reduced impact AI which sounds very similar to provably boxed AI but which might be used for just about everything including designing a FAI.
I think we might have different definitions of a boxed-AI. An AI that is literally not allowed to interact with the world at all isn't terribly useful and it sounds like a problem at least as hard as all other kinds of FAI. I just mean a normal dangerous AI that physically can't interact with the outside world. Importantly it's goal is to provably give the best output it possibly can if you give it a problem. So it won't hide nanotech in your cure for alzheimers because that would be a less fit and more complicated solution than a simple chemical compound (you would have to judge solutions based on complexity though and verify them by a human or in a simulation first just in case.) I don't think most computers today have anywhere near enough processing power to simulate a full human brain. A human down to the molecular level is entirely out of the question. An AI on a modern computer, if it's smarter than human at all, will get there by having faster serial processing or more efficient algorithms, not because it has massive raw computational power. And you can always scale down the hardware or charge it utility for using more computing power than it needs, forcing it to be efficient or limiting it's intelligence further. You don't need to invoke the full power of super-intelligence for every problem and for your safety you probably shouldn't.
A slightly bigger "large risk" than Pentashagon puts forward is that a provably boxed UFAI could indifferently give us information that results in yet another UFAI, just as unpredictable as itself (statistically speaking, it's going to give us more unhelpful information than helpful, as Robb point out). Keep in mind I'm extrapolating here. At first you'd just be asking for mundane things like better transportation, cures for diseases, etc. If the UFAI's mind is strange enough, and we're lucky enough, then some of these things result in beneficial outcomes, politically motivating humans to continue asking it for things. Eventually we're going to escalate to asking for a better AI, at which point we'll get a crap-shoot. An even bigger risk than that -though - is that if it's especially Unfriendly, it may even do this intentionally, going so far as to pretend it's friendly while bestowing us with data to make an AI even more Unfriendly AI than itself. So what do we do, box that AI as well, when it could potentially be even more devious than the one that already convinced us to make this one? Is it just boxes, all the way down? (spoilers: it isn't, because we shouldn't be taking any advice from boxed AIs in the first place) The only use of a boxed AI is to verify that, yes, the programming path you went down is the wrong one, and resulted in an AI that was indifferent to our existence (and therefore has no incentive to hide its motives from us). Any positive outcome would be no better than an outcome where the AI was specifically Evil, because if we can't tell the difference in the code prior to turning it on, we certainly wouldn't be able to tell the difference afterward.

If an artificial intelligence is smart enough to be dangerous, we'd intuitively expect it to be smart enough to know how to make itself safe.

I don't agree with that. Just looks at humans, they are smart enough to be dangerous, but even when they do want to "make themselves safe", they are usually unable to do so. A lot of harm is done by people with good intent. I don't think all of Moliere doctors prescribing bloodletting were intending to do harm.

Yes, a sufficiently smart AI will know how to make itself safe if it wishes, but the intelligence level required for that is much higher than the one required to be harmful.

2Rob Bensinger10y
Agreed. The reason I link the two abilities is that I'm assuming an AI that acquires either power went FOOM, which makes it much more likely that the two powers will arise at (on a human scale) essentially the same time.
If a FAI would have a utility function like "Maximise X while remaining Friendly", And the UFAI would just have "Maximise X". Then, If the FAI and a UFAI would be initiated simultaneously, I would expect them both to develop exponentially, but the UFAI would have more options available, thus have a steeper learning curve. So I'd expect that in this situation that the UFAI would go FOOM slightly sooner, and be able to disable the FAI.

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.

Some people seem to be arguing that it may not be that hard to discover these specific lines of code. Or perhaps that we don't need to get an AI to "perfectly" care about its programmer's True Intentions. I'm not sure if I understand their arguments correctly so I may be unintentionally strawmanning them, but the idea may be that if we can get an AI to approximately care about its programmer or user's intentions, and also prevent it from FOOMing right away (or just that the microeconomics of intelligence explosion doesn't allow for such fast FOOMing), then we can make use of the AI in a relatively safe way to solve various problems, including the problem of how to control such AIs better, or how to eventually build an FAI. What's your take on this class of arguments?

Being Friendly is of instrumental value to barely any goals.

Tangentially, being Friendly is probably of instrumental value to some goals, which may turn out to be easier to instill in an AGI than solving Friendliness in the traditional terminal values sense. I came up with the term "Instrumentally Friendly AI" to describe such an approach.

Nobody disagrees that an arbitrary agent pulled from mind design space, that is powerful enough to overpower humanity, is an existential risk if it either exhibits Omohundro's AI drives or is used as a tool by humans, either carelessly or to gain power over other humans. Disagreeing with that would about make as much sense as claiming that out-of-control self-replicating robots could somehow magically turn the world into a paradise, rather than grey goo. The disagreement is mainly about the manner in which we will achieve such AIs, how quickly that will happen, and whether such AIs will have these drives. I actually believe that much less than superhuman general intelligence might be required for humans to cause extinction type scenarios. Most of my posts specifically deal with the scenario and arguments publicized by MIRI. Those posts are not highly polished papers but attempts to reduce my own confusion and to enable others to provide feedback. I argue that... * ...the idea of a vast mind design space is largely irrelevant, because AIs will be created by humans, which will considerably limit the kind of minds we should expect. * ...that AIs created by humans do not need to, and will not exhibit any of Omohundro's AI drives. * ...that even given Omohundro's AI drives, it is not clear how such AIs would arrive at the decision to take over the world. * ...that there will be no fast transition from largely well-behaved narrow AIs to unbounded general AIs, and that humans will be part of any transition. * ...that any given AI will initially not be intelligent enough to hide any plans for world domination. * ...that drives as outlined by Omohundro would lead to a dramatic interference with what the AI's creators want it to do, before it could possibly become powerful enough to deceive or overpower them, and would therefore be noticed in time. * ...that even if MIRI's scenario comes to pass, there is a lack of concrete scenarios

This mirrors some comments you wrote recently:

"You write that the worry is that the superintelligence won't care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean?"

"If an AI is meant to behave generally intelligent [sic] then it will have to work as intended or otherwise fail to be generally intelligent."

It's relatively easy to get an AI to care about (optimize for) something-or-other; what's hard is getting one to care about the right something.

'Working as intended' is a simple phrase, but behind it lies a monstrously complex referent. It doesn't clearly distinguish the programmers' (mostly implicit) true preferences from their stated design objectives; an AI's actual code can differ from either or both of these. Crucially, what an AI is 'intended' for isn't all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to ... (read more)

MIRI assumes that programming what you want an AI to do at the outset , Big Design Up Front, is a desirable feature for some reason. The most common argument is that it is a necessary prerequisite for provable correctness, which is a desirable safety feature. OTOH, the exact opposite of massive hardcoding, goal flexibility is ielf a necessary prerequisite for corrigibility, which is itself a desirable safety feature. The latter point has not been argued against adequately, IMO.

9. An AI equipped with the capabilities required by step 5, given step 7 and 8, will very likely not be confused about what it is meant to do, if it was not meant to be confused.

"The genie knows, but doesn't care"

It's like you haven't read the OP at all.

I do not reject that step 10 does not follow if you reject that the AI will not "care" to learn what it is meant to do. But I believe there to be good reasons for an AI created by humans to care. If you assume that this future software does not care, can you pinpoint when software stops caring? 1. Present-day software is better than previous software generations at understanding and doing what humans mean. 2. There will be future generations of software which will be better than the current generation at understanding and doing what humans mean. 3. If there is better software, there will be even better software afterwards. 4. ... 5. Software will be superhuman good at understanding what humans mean but catastrophically worse than all previous generations at doing what humans mean. What happens between step 3 and 5, and how do you justify it? My guess is that you will write that there will not be a step 4, but instead a sudden transition from narrow AIs to something you call a seed AI, which is capable of making itself superhuman powerful in a very short time. And as I wrote in the comment you replied to, if I was to accept that assumption, then we would be in full agreement about AI risks. But I reject that assumption. I do not believe such a seed AI to be possible and believe that even if it was possible it would not work the way you think it would work. It would have to aquire information about what it is supposed to do, for pratical reasons.

Present day software is a series of increasing powerful narrow tools and abstractions. None of them encode anything remotely resembling the values of their users. Indeed, present-day software that tries to "do what you mean" is in my experience incredibly annoying and difficult to use, compared to software that simply presents a simple interface to a system with comprehensible mechanics.

Put simply, no software today cares about what you want. Furthermore, your general reasoning process here—define some vague measure of "software doing what you want", observe an increasing trend line and extrapolate to a future situation—is exactly the kind of reasoning I always try to avoid, because it is usually misleading and heuristic.

Look at the actual mechanics of the situation. A program that literally wants to do what you mean is a complicated thing. No realistic progression of updates to Google Maps, say, gets anywhere close to building an accurate world-model describing its human users, plus having a built-in goal system that happens to specifically identify humans in its model and deduce their extrapolated goals. As EY has said, there is no ghost in the machine that checks your code to make sure it doesn't make any "mistakes" like doing something the programmer didn't intend. If it's not programmed to care about what the programmer wanted, it won't.

Is it just me, or does this sound like it could grow out of advertisement services? I think it's the one industry that directly profits from generically modelling what users "want"¹and then delivering it to them. [edit] ¹where "want" == "will click on and hopefully buy"
Do you believe that any kind of general intelligence is practically feasible that is not a collection of powerful narrow tools and abstractions? What makes you think so? If all I care about is a list of Fibonacci numbers, what is the difference regarding the word "care" between a simple recursive algorithm and a general AI? My measure of "software doing what you want" is not vague. I mean it quite literally. If I want software to output a series of Fibonacci numbers, and it does output a series of Fibonacci numbers, then it does what I want. And what other than an increasing trend line do you suggest would be a rational means of extrapolation, sudden jumps and transitions?
Present day software may not have got far with regard to the evaluative side of doing what you want, but the XiXiDu's point seems to be that it is getting better at the semantic side. Who was it who said the value problem is part of the semantic problem?

Present-day software is better than previous software generations at understanding and doing what humans mean.
No fax or photocopier ever autocorrected your words from "meditating" to "masturbating".

Software will be superhuman good at understanding what humans mean but catastrophically worse than all previous generations at doing what humans mean.

Every bit of additional functionality requires huge amounts of HUMAN development and testing, not in order to compile and run (that's easy), but in order to WORK AS YOU WANT IT TO.

I can fully believe that a superhuman intelligence examining you will be fully capable of calculating "what you mean" "what you want" "what you fear" "what would be funniest for a buzzfeed artcle if I pretended to misunderstand your statement as meaning" "what would be best for you according to your values" "what would be best for you according to your cat's values" "what would be best for you according to Genghis Khan's values" .

No program now cares about what you mean. You've still not given any reason for the future software to care about "what you mean" over all those other calculation either.

I kind of doubt that autocorrect software really changed "meditating" to "masturbating". Because of stuff like this. Edit: And because, start at the left and working rightward, they only share 1 letter before diverging, and because I've seen a spell-checker with special behavior for dirty/curse words (Not suggesting them as corrected spellings, but also not complaining about them as unrecognized words) (this is the one spell-checker which, out of curiousity, I decided to check its behavior with dirty/curse words, so I bet it's common). Edit 2: Also from a causal history perspective of why a doubt it, rather than a normative justification perspective, there's the fact that Yvain linked it and said something like "I don't care if these are real." Edit 3: typo.
To be fair, that is a fairly representative example of bad autocorrects. (I once had a text message autocorrect to "We are terrorist.")
Meaning they don't care about anything? They care about something else? What? I'll tell you one thing: the marketplace will select agents that act as if they care.
I agree that current software products fail, such as in your autocorrect example. But how could a seed AI be able to make itself superhuman powerful if it did not care about avoiding mistakes such as autocoreccting "meditating" to "masturbating"? Imagine it would make similar mistakes in any of the problems that it is required to solve in order to overpower humanity. And if humans succeeded to make it not make such mistakes along the way to overpowering humanity, how did they selectively fail at making it want to overpower humanity in the first place? How likely is that?

But how could a seed AI be able to make itself superhuman powerful if it did not care about avoiding mistakes such as autocoreccting "meditating" to "masturbating"?

Those are only 'mistakes' if you value human intentions. A grammatical error is only an error because we value the specific rules of grammar we do; it's not the same sort of thing as a false belief (though it may stem from, or result in, false beliefs).

A machine programmed to terminally value the outputs of a modern-day autocorrect will never self-modify to improve on that algorithm or its outputs (because that would violate its terminal values). The fact that this seems silly to a human doesn't provide any causal mechanism for the AI to change its core preferences. Have we successfully coded the AI not to do things that humans find silly, and to prize un-silliness before all other things? If not, then where will that value come from?

A belief can be factually wrong. A non-representational behavior (or dynamic) is never factually right or wrong, only normatively right or wrong. (And that normative wrongness only constrains what actually occurs to the extent the norm is one a sufficiently powerful ag... (read more)

1Eliezer Yudkowsky10y
It also looks like user Juno_Watt is some type of systematic troll, probably a sockpuppet for someone else, haven't bothered investigating who.
0Paul Crowley10y
I can't work out how this relates to the thread it appears in.
1Eliezer Yudkowsky10y
Warning as before: XiXiDu = Alexander Kruel.
I'm confused as to the reason for the warning/outing, especially since the community seems to be doing an excellent job of dealing with his somewhat disjointed arguments. Downvotes, refutation, or banning in extreme cases are all viable forum-preserving responses. Publishing a dissenter's name seems at best bad manners and at worst rather crass intimidation. I only did a quick search on him and although some of the behavior was quite obnoxious, is there anything I've missed that justifies this?

XiXiDu wasn't attempting or requesting anonymity - his LW profile openly lists his true name - and Alexander Kruel is someone with known problems (and a blog openly run under his true name) whom RobbBB might not know offhand was the same person as "XiXiDu" although this is public knowledge, nor might RobbBB realize that XiXiDu had the same irredeemable status as Loosemore.

I would not randomly out an LW poster for purposes of intimidation - I don't think I've ever looked at a username's associated private email address. Ever. Actually I'm not even sure offhand if our registration process requires/verifies that or not, since I was created as a pre-existing user at the dawn of time.

I do consider RobbBB's work highly valuable and I don't want him to feel disheartened by mistakenly thinking that a couple of eternal and irredeemable semitrolls are representative samples. Due to Civilizational Inadequacy, I don't think it's possible to ever convince the field of AI or philosophy of anything even as basic as the Orthogonality Thesis, but even I am not cynical enough to think that Loosemore or Kruel are representative samples.

Thanks, Eliezer! I knew who XiXiDu is. (And if I hadn't, I think the content of his posts makes it easy to infer.)

There are a variety of reasons I find this discussion useful at the moment, and decided to stir it up. In particular, ground-floor disputes like this can be handy for forcing me to taboo inferential-gap-laden ideas and to convert premises I haven't thought about at much length into actual arguments. But one of my reasons is not 'I think this is representative of what serious FAI discussions look like (or ought to look like)', no.

Glad to hear. It is interesting data that you managed to bring in 3 big name trolls for a single thread, considering their previous dispersion and lack of interest.

Kruel hasn't threatened to sue anyone for calling him an idiot, at least!

Pardon me, I've missed something. Who has threatened to sue someone for calling him an idiot? I'd have liked to see the inevitable "truth" defence.
Thank you for the clarification. While I have a certain hesitance to throw around terms like "irredeemable", I do understand the frustration with a certain, let's say, overconfident and persistent brand of misunderstanding and how difficult it can be to maintain a public forum in its presence. My one suggestion is that, if the goal was to avoid RobbBB's (wonderfully high-quality comments, by the way) confusion, a private message might have been better. If the goal was more generally to minimize the confusion for those of us who are newer or less versed in LessWrong lore, more description might have been useful ("a known and persistent troll" or whatever) rather than just providing a name from the enemies list.
Agreed. -------------------------------------------------------------------------------- Though actually, Eliezer used similar phrasing regarding Richard Loosemore and got downvoted for it (not just by me). Admittedly, "persistent troll" is less extreme than "permanent idiot," but even so, the statement could be phrased to be more useful. I'd suggest, "We've presented similar arguments to [person] already, and [he or she] remained unconvinced. Ponder carefully before deciding to spend much time arguing with [him or her]." Not only is it less offensive this way, it does a better job of explaining itself. (Note: the "ponder carefully" section is quoting Eliezer; that part of his post was fine.)
Who has twice sworn off commenting on LW. So much for pre-commitments.

But how could a seed AI be able to make itself superhuman powerful if it did not care about avoiding mistakes such as autocoreccting "meditating" to "masturbating"?

As Robb said you're confusing mistake in the sense of "The program is doing something we don't want to do" with mistake in the sense of "The program has wrong beliefs about reality".

I suppose a different way of thinking about these is "A mistaken human belief about the program" vs "A mistaken computer belief about the human". We keep talking about the former (the program does something we didn't know it would do), and you keep treating it as if it's the latter.

Let's say we have a program (not an AI, just a program) which uses Newton's laws in order to calculate the trajectory of a ball. We want it to calculate this in order to have it move a tennis racket and hit the ball back. When it finally runs, we observe that the program always avoids the ball rather than hit it back. Is it because it's calculating the trajectory of the ball wrongly? No, it calculates the trajectory very well indeed, it's just that an instruction in the program was wrongly inserted so t... (read more)

To be better able to respond to your comment, please let me know in what way you disagree with the following comparison between narrow AI and general AI: Narrow artificial intelligence will be denoted NAI and general artificial intelligence GAI. (1) Is it in principle capable of behaving in accordance with human intention to a sufficient degree? NAI: True GAI: True (2) Under what circumstances does it fail to behave in accordance with human intention? NAI: If it is broken, where broken stands for a wide range of failure modes such as incorrectly managing memory allocations. GAI: In all cases in which it is not mathematically proven to be tasked with the protection of, and equipped with, a perfect encoding of all human values or a safe way to obtain such an encoding. (3) What happens when it fails to behave in accordance with human intention? NAI: It crashes, freezes or halts. It generally fails in a way that is harmful to its own functioning. If for example an autonomous car fails at driving autonomously it usually means that it will either go into safe-mode and halt or crash. GAI: It works perfectly well. Superhumanly well. All its intended capabilities are intact except that it completely fails at working as intended in such a way as to destroy all human value in the universe. It will be able to improve itself and capable of obtaining a perfect encoding of human values. It will use those intended capabilities in order to deceive and overpower humans rather than doing what it was intended to do. (4) What happens if it is bound to use a limited amount of resources, use a limited amount of space or run for a limited amount of time? NAI: It will only ever do what it was programmed to do. As long as there is no fatal flaw, harming its general functionality, it will work within the defined boundaries as intended. GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such
To be better able to respond to your comment, please let me know in what way you disagree with the following comparison between narrow AI and general AI: Narrow artificial intelligence will be denoted NAI and general artificial intelligence GAI. (1) Is it in principle capable of behaving in accordance with human intention to a sufficient degree? NAI: True GAI: True (2) Under what circumstances does it fail to behave in accordance with human intention? NAI: If it is broken, where broken stands for a wide range of failure modes such as incorrectly managing memory allocations. GAI: In all cases in which it is not mathematically proven to be tasked with the protection of, and equipped with, a perfect encoding of all human values or a safe way to obtain such an encoding. (3) What happens when it fails to behave in accordance with human intention? NAI: It crashes, freezes or halts. It generally fails in a way that is harmful to its own functioning. If for example an autonomous car fails at driving autonomously it usually means that it will either go into safe-mode and halt or crash. GAI: It works perfectly well. Superhumanly well. All its intended capabilities are intact except that it completely fails at working as intended in such a way as to destroy all human value in the universe. It will be able to improve itself and capable of obtaining a perfect encoding of human values. It will use those intended capabilities in order to deceive and overpower humans rather than doing what it was intended to do. (4) What happens if it is bound to use a limited amount of resources, use a limited amount of space or run for a limited amount of time? NAI: It will only ever do what it was programmed to do. As long as there is no fatal flaw, harming its general functionality, it will work within the defined boundaries as intended. GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such

GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such as taking over the universe.

GAI is a program. It always does what it's programmed to do. That's the problem—a program that was written incorrectly will generally never do what it was intended to do.

FWIW, I find your statements 3,4,5 also highly objectionable, on the grounds that you are lumping a large class of things under the blank label "errors". Is an "error" doing something that humans don't want? Is it doing something the agent doesn't want? Is it accidentally mistyping a letter in a program, causing a syntax error, or thinking about something heuristically and coming to the wrong conclusion, then making carefully planned decision based on that mistake? Automatic proof systems don't save you if you what you think you need to prove isn't actually what you need to prove.

First list: 1) Poorly defined terms "human intention" and "sufficient". 2) Possibly under any circumstances whatsoever, if it's anything like other non-trivial software, which always has some bugs. 3) Anything from "you may not notice" to "catastrophic failure resulting in deaths". Claim that failure of software to work as humans intend will "generally fail in a way that is harmful to it's own functioning" is unsupported. E.g. a spreadsheet works fine if the floating point math is off in the 20th bit of the mantissa. The answers will be wrong, but there is nothing about that that the spreadsheet could be expected to care about, 4) Not necessarily. GAI may continue to try to do what it was programmed to do, and only unintentionally destroy a small city in the process :) Second list: 1) Wrong. The abilities of sufficiently complex systems are a huge space of events humans haven't thought about yet, and so do not yet have preferences about. There is no way to know what their preferences would or should be for many many outcomes. 2) Error as failure to perform the requested action may take precedence over error as failure to anticipate hypothetical objections from some humans to something they hadn't expected. For one thing, it is more clearly defined. We already know human-level intelligences act this way. 3) Asteroids and supervolcanoes are not better than humans at preventing errors. It is perfectly possible for something stupid to be able to kill you. Therefore something with greater cognitive and material resources than you, but still with the capacity to make mistakes can certainly kill you. For example, a government. 4) It is already possible for a very fallible human to make something that is better than humans at detecting certain kinds of errors. 5) No. Unless by dramatic you mean "impossibly perfect, magical and universal".
Two points: Firstly, "fails in a way that is harmful to its own functioning" appears to be tautological. Secondly, you seem to be listing things that apply to any kind of AI in the NAI section - is this intentional? (This happens throughout your comment, in fact.)
XiXiDu, I get the impression you've never coded anything. Is that accurate? Present-day everyday software (e.g. Google Maps, Siri) is better at doing what humans mean. It is not better at understanding humans. Learning programs like the one that runs PARO appear to be good at understanding humans, but are actually following a very simple utility function (in the decision sense, not the experiental sense); they change their behaviour in response to programmed cues, generally by doing more/less of actions associated with those cues (example: PARO "likes" being stroked and will do more of things that tend to preceed stroking). In each case of a program that improves itself, it has a simple thing it "wants" to optimise and makes changes according to how well it seems to be doing. Making software that understands humans at all is beyond our current capabilities. Theory of mind, the ability to recognise agents and see them as having desires of their own, is something we have no idea how to produce; we don't even know how humans have it. General intelligence is an enormous step beyond programming something like Siri. Siri is "just" interpreting vocal commands as text (which requires no general intelligence), matching that to a list of question structures (which requires no general intelligence; Siri does not have to understand what the word "where" means to know that Google Maps may be useful for that type of question) and delegating to Web services, with a layer of learning code to produce more of the results you liked (i.e., that made you stop asking related questions) in the past. Siri is using a very small built-in amount of knowledge and an even smaller amount of learned knowledge to fake understanding, but it's just pattern-matching. While the second step is the root of general intelligence, it's almost all provided by humans who understood that "where" means a question is probably to do with geography; Siri's ability to improve this step is virtually nonexistent.
So you believe that "understanding" is an all or nothing capability? I did never intend to use "understanding" like this. My use of the term is such that if my speech recognition software correctly transcribes 98% of what I am saying then it is better at understanding how certain sounds are related to certain strings of characters than a software that correctly transcribes 95% of what I said. One enormous step or a huge number of steps? If the former, what makes you think so? If the latter, then at what point do better versions of Siri start acting in catastrophic ways? Most of what humans understand is provided by other humans who themselves got another cruder version from other humans. If an AI is not supposed to take over the world, then from the perspective of humans it is mistaken to take over the world. Humans got something wrong about the AI design if it takes over the world. Now if needs to solve a minimum of N problems correctly in order to take over the world, then this means that it succeeded N times at being general intelligent at executing a stupid thing. The question that arises here is whether it is more likely for humans to build an AI that works perfectly well along a number of dimensions at doing a stupid thing than an AI that fails at doing a stupid thing because it does other stupid things as well? Sure, I do not disagree with this at all. AI will very likely lead to catastrophic events. I merely disagree with the dumb superintelligence scenario. In other words, humans are likely to fail at AI in such a way that it works perfectly well in a catastrophic way. I certainly do not reject that general AI is extremely dangerous in the hands of unfriendly humans and that only a friendly AI that takes over the world could eventually prevent a catastrophe. I am rejecting the dumb superintelligence scenario.

I have news for you. The rest of the world considers the community surrounding Eliezer Yudkowsky to be a quasi-cult comprised mostly of people who brainwash each other into thinking they are smart, rational, etc., but who are in fact quite ignorant of a lot of technical facts, and incapable of discussing anything in an intelligent, coherent, rational manner.

[citation needed]

In all seriousness, as far as I know, almost no one in the world-at-large even knows about the Less Wrong community. Whenever I mention "Less Wrong" to someone, the reacti... (read more)

People manage to be friendly without apriori knowledge of everyone else's preferences. Human values are very complex...and one person's preferences are not another's.

Being the same species comes with certain advantages for the possiibility of cooperation. But I wasn't very friendly towards a wasp-nest I discovered in my attic. People aren't very friendly to the vast majority of different species they deal with.

1) This is unrelated and off-topic mockery of MIRI. Are you conceding the original point?

2) It is factually wrong that MIRI 'uncontroversially cranks'. You've probably noticed this, given that you are on a website with technical members where the majority opinion is that MIRI is correct-ish, you are commenting on an article by somebody unaffiliated with MIRI supporting MIRI's case, and you are responding directly to people [Wedrifid and I] unaffiliated with MIRI who support MIRI's case. Note also MIRI's peer reviewed publications and its research advisors.

Let's say we don't know how to create a friendly AGI but we do know how to create an honest one; that is, one which has no intent to deceive. So we have it sitting in front of us, and it's at the high end of human-level intelligence.

Us: How could we change you to make you friendlier?

AI: I don't really know what you mean by that, because you don't really know either.

Us: How much smarter would you need to be in order to answer that question in a way that would make us, right now, looking through a window at the outcome of implementing your answer, agree tha... (read more)

Brilliant post.

On the note of There Ain't No Such Thing As A Free Buck:

Philosophy buck: One might want a seed FAI to provably self-modify to an FAI with above-humanly-possible level of philosophical ability/expressive power. But such a proof might require a significant amount of philosophical progress/expressive power from humans beforehand; we cannot rely on a given seed FAI that will FOOM to help us prove its philosophical ability. Different or preliminary artifices or computations (e.g. computers, calculators) will assist us, though.

Thanks, Knave! I'll use 'artificial superintelligence' (ASI, or just SI) here, to distinguish this kind of AGI from non-superintelligent AGIs (including seed AIs, superintelligences-in-the-making that haven't yet gone through a takeoff). Chalmers' 'AI++' also works for singling out the SI from other kinds of AGI. 'FAI' doesn't help, because it's ambiguous whether we mean a Friendly SI, or a Friendly seed (i.e., a seed that will reliably produce a Friendly SI).

The dilemma is that we can safely use low-intelligence AGIs to help with Friendliness Theory, but they may not be smart enough to get the right answer; whereas high-intelligence AGIs will be more useful for Friendliness Theory, but also more dangerous given that we haven't already solved Friendliness Theory.

In general, 'provably Friendly' might mean any of three different things:

  1. Humans can prove, without using an SI, that the SI is (or will be) Friendly. (Other AGIs, ones that are stupid and non-explosive, may be useful here.)

  2. The SI can prove to itself that it is Friendly. (This proof may be unintelligible to humans. This is important during self-modifications; any Friendly AI that's about to enact a major edit to its own

... (read more)

I'm superintelligent in comparison to wasps, and I still chose to kill them all.

  • This is not 'uncontroversial'.

  • The survey in question did not actually ask whether they thought MIRI were cranks. In fact, it asked no questions about MIRI whatsoever, and presumably most respondents had never heard of MIRI.

  • Every respondent but Loosemore who did specifically mention MIRI (Demski, Carrier, Eray Ozkural) were positive about them. (Schmidhuber, Legg, and Orseau have all worked with MIRI, but did not mention it. If you regard collaboration as endorsement, then they have all also apparently endorsed MIRI.)

  • You are still failing to address the original point.


The summary is so so good that the article doesn't seem worth reading. I can't say I've ever been in this position before.

Is that going to be harder that coming up with a mathematical expension of morality and preloading it?

Harder than saying it in English, that's all.

EY. It's his answer to friendliness.

No he wants to program the AI to deduce morality from us it is called CEV. He seems to be still working out how the heck to reduce that to math.

You're still picking those particular views due to the endorsement by Yudkowsky.

Your psychological speculation fails you. I actually read the articles I cited, and I found their arguments convincing.

With regards to Chalmers and Bostrom, they are philosophers with zero understanding of the actual issues involved in AI

This makes it sound like you've never read anything by those two authors on the subject. Possibly you're trying to generalize from your cached idea of a 'philosopher'. Expertise in philosophy does not in itself make one less qualified to... (read more)

Humans are made to do that by evolution AIs are not. So you have to figure what the heck evolution did, in ways specific enough to program into a computer.

Also, who mentioned giving AIs a priori knowledge of our preferences? It doesn't seem to be in what you replied to.

There are a number of possibilities still missing from the discussion in the post. For example:

  • There might not be any such thing as a friendly AI. Yes, we have every reason to believe that the space of possible minds is huge, and it's also very clear that some possibilities are less unfriendly than others. I'm also not making an argument that fun is a limited resource. I'm just saying that there may be no possible AI that takes over the world without eventually running off the rails of fun. In fact, the question itself seems superficially similar to the

... (read more)

In fact, the question itself seems superficially similar to the halting problem, where "running off the rails" is the analogue for "halting"

If you want to draw an analogy to halting, then what that analogy actually says is: There are lots of programs that provably halt, and lots that provably don't halt, and lots that aren't provable either way. The impossibility of the halting problem is irrelevant, because we don't need a fully general classifier that works for every possible program. We only need to find a single program that provably has behavior X (for some well-chosen value of X).

If you're postulating that there are some possible friendly behaviors, and some possible programs with those behaviors, but that they're all in the unprovable category, then you're postulating that friendliness is dissimilar to the halting problem in that respect.

Moreover, the halting problem doesn't show that the set of programs you can't decide halting for is in any way interesting. It's a constructive proof, yes, but it constructs a peculiarly twisted program that embeds its own proof-checker. That might be relevant for AGI, but for almost every program in existence we have no idea which group it's in, and would likely guess it's provable.
It's still probably premature to guess whether friendliness is provable when we don't have any idea what it is. My worry is not that it wouldn't be possible or provable, but that it might not be a meaningful term at all. But I also suspect friendliness, if it does mean anything, is in general going to be so complex that "only [needing] to find a single program that provably has behaviour X" may be beyond us. There are lots of mathematical conjectures we can't prove, even without invoking the halting problem. One terrible trap might be the temptation to make simplifications in the model to make the problem provable, but end up proving the wrong thing. Maybe you can prove that a set of friendliness criteria are stable under self-modification, but I don't see any way to prove those starting criteria don't have terrible unintended consequences. Those are contingent on too many real-world circumstances and unknown unknowns. How do you even model that?

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer's True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we've solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.

It's still a lot easy to program an AI to care about the programmer's True intentions than it is to explicitly program in those intentions. The clever hack helps a lot.

...always assuming the programmer actually has relevant True Intentions that are coherent enough to be cared about.
1Rob Bensinger10y
It's at least the case that some rational reconstructions are drastically Truer than others. Idealizations and approximations are pretty OK, in the cases where they don't ruin everything forever.
I am not sure I've understood your point here. I mean, yes, of course, in those cases where X is not unacceptably bad, then X is probably acceptably good, and that's just as true for X = "an approximation of my True Intentions" as for X = "the annihilation of all life on Earth" or anything else. And yes, of course, there's a Vast class of propositions that are drastically worse approximations of my True Intentions than any plausible candidate would be... "Paint everything in the universe blue," just to pick one arbitrary example. Did you mean anything more than that? If so, can you unpack a little further? If not... well, yes. The question is precisely whether there exists any consistent proposition that is a good enough approximation of my True Intentions to not "ruin everything forever," or whether my True Intentions are sufficiently confused/incoherent/unstable that we prefer to ignore me altogether and listen to someone else instead.
2Rob Bensinger10y
My point is that an approximation that just creates a decently fun world, rather than a wildly overwhelmingly fun world, would still be pretty good. There's probably no need to aim that low, but if that's the best we can safely get by extrapolating human volition, without risking a catastrophe, then so be it. Worlds that aren't completely valueless for most of the stelliferous era are a very small and difficult target to hit; I'm a lot more worried about whether we can hit that target at all than about whether we can hit the very center in a philosophically and neuroscientifically ideal way. If 'OK' is the best we can do, then that's still OK. I would say that the world we live in is not unacceptably bad for everyone in it. (It it is not the case that everyone currently alive would be better off dead.) So there's a proof that an AGI could potentially create a non-terrible circumstance for us to inhabit, relative to our preferences. The questions are (a) How do we spatially and temporally distribute the things we already have and value, so we can have a lot more of them?, and (b) How much better can we make things without endangering value altogether?
I agree that a decently fun world would still be pretty good, and if that maximizes expected value then that's what we should choose, given a choice. Of course, the "if/then" part of that sentence is important. In other words to go from there to "we should therefore choose a decently but not wildly overwhelmingly fun world, given a choice" is unjustified without additional data. Opportunity costs are still costs. I agree that the world we live in is not unacceptably bad for everyone in it. I agree that not everyone currently alive would be better off dead. I agree that an AGI (as that term is used here) could potentially create a non-terrible circumstance for us to inhabit. (I'm not really sure what "relative to our preferences" means there. What else might "non-terrible" be relative to?) I'm not at all sure why the two questions you list are "the" questions. I agree that the first one is worth answering. I'm not sure the second one is, though if we answer it along the way to doing something else that's OK with me.

.... That's not what that quote says at all.

Look, I'm tapping out, this discussion is not productive for me.

Being Friendly is of instrumental value to barely any goals. [...]

This is not really true. See Kropotkin and Margulis on the value of mutualism and cooperation.

1Rob Bensinger10y
Friendliness is an extremely high bar. Humans are not Friendly, in the FAI sense. Yet humans are mutualist and can cooperate with each other.
Right. So, if we are playing the game of giving counter-intuitive technical meanings to ordinary English words, humans have thrived for millions of years - with their "UnFriendly" peers and their "UnFriendly" institutions. Evidently, "Friendliness" is not necessary for human flourishing.
0Rob Bensinger10y
I agree with this part of Chrysophylax's comment: "It's not necessary when the UnFriendly people are humans using muscle-power weaponry." Humans can be non-Friendly without immediately destroying the planet because humans are a lot weaker than a superintelligence. If you gave a human unlimited power, it would almost certainly make the world vastly worse than it currently is. We should be at least as worried, then, about giving an AGI arbitrarily large amounts of power, until we've figured out reliable ways to safety-proof optimization processes.
It's not necessary when the UnFriendly people are humans using muscle-power weaponry. A superhumanly intelligent self-modifying AGI is a rather different proposition, even with only today's resources available. Given that we have no reason to believe that molecular nanotech isn't possible, an AI that is even slightly UnFriendly might be a disaster. Consider the situation where the world finds out that DARPA has finished an AI (for example). Would you expect America to release the source code? Given our track record on issues like evolution and whether American citizens need to arm themselves against the US government, how many people would consider it an abomination and/or a threat to their liberty? What would the self-interested response of every dictator (for example, Kim Jong Il's successor) with nuclear weapons be? Even a Friendly AI poses a danger until fighting against it is not only useless but obviously useless, and making an AI Friendly is, as has been explained, really freakin' hard. I also take issue with the statement that humans have flourished. We spent most of those millions of years being hunter-gatherers. "Nasty, brutish and short" is the phrase that springs to mind.

I didn't say everyone who rejects any of the theses does so purely because s/he didn't understand it. That doesn't make it cease to be a problem that most AGI researchers don't understand all of the theses, or the case supporting them. You may be familiar with the theses only from the Sequences, but they've all been defended in journal articles, book chapters, and conference papers. See e.g. Chalmers 2010 and Chalmers 2012 for the explosion thesis, or Bostrom 2012 for the orthogonality thesis.

Nearly all software that superficially looks like it's going to go skynet on you and kill you, isn't going to do that, either.

Sure. Because nearly all software that superficially looks to a human like it's a seed AI is not a seed AI. The argument for 'programmable indirect normativity is an important research focus' nowhere assumes that it's particularly easy to build a seed AI.

"If there are seasoned AI researchers who can't wrap their heads around the five theses", then you are going to feel more pleased with yourself, being a believer

Hm... (read more)

You folks live in an echo chamber in which you tell each other that you are sensible, sane and capable of rational argument, while the rest of the world are all idiots.

I dunno, I had seen plenty of evidence that "the rest of the world are all idiots" (assuming I understand you correctly) long before encountering LessWrong. I don't think that's an echo chamber (although other things may be.)

(Although I must admit LessWrong has a long way to go. This community is far from perfect.)

I have news for you. The rest of the world considers the commun

... (read more)

I don't really understand how anyone can grasp the concept of not caring.

I think the meme comes from popculture where many bad villains do care even a little bit. I think I once or twice met a villain who didn't, who just wanted everyone dead for their own amusement and all the arguments were met with "but, you see, I don't care."

If I were to give an analogy: Do you care about the positions of individual grains of sand on distant beaches? If I hand you a grain of sand, do you care exactly which grain of sand it is? If you are even marginally indifferent, then think of an alien intellect that cares very much about what grain of sand it is, but is just as indifferent about humans.

I like that analogy. I imagine that for an AIXI-style artificial intelligence, the whole futures of the universe are just like the pieces of the sand on the beach. It chooses a piece according to some criteria, for example the brightest color, but every other aspect is so completely irrelevant than most humans would be unable to imagine that kind of indifference. Our human brains keep screaming at us: "But surely even a mere machine would not dare to choose a piece of sand that is a part of such-and-such configuration. Why would it do such a horrible thing?" But the machine is not even aware that those configurations exist, and certainly does not care to know.
Well, what I am pressing is the issue: You can know but not care. I thin that is what many fail to grasp about psychopaths who do bad things. They know that they are committing crimes, they just don't care. (There are some good psychopaths who has disregarded their initial stupid philosophical conclusions about morality and actually help people, but those are rarely heard.) A superintelligent paperclipper can know everything about human ethics, but only use that to manipulate humans into making more paperclips.
...which means they were answering questions rather than trying to kill people^W^W amuse themselves.
Usually, these villains actually find it amusing to see humans fail to grasp their motivations, and/or are stalling in order to get an opening through which to kill people.
I don't feel like enumerating examples, but I feel like I usually don't find it convincing (and that it's usually the heroes stalling and the villains helpfully cooperating).

There's of course a great many things outside your core expertise where you have no idea where to start trying, and yet they are not difficult at all.

Not really true, given five minutes and an internet connection. Oh, I couldn't do it myself, but most things that are obviously possible, I can go to Wikipedia and get a rough idea of how things work.

Though you're right that "I can't do it" isn't really a good metric of what's difficult to do.

Some fairly intelligent members of this community contacted me (and David Gerard and probably some othe

... (read more)
This doesn't work for made up ultra obscure games not because those games are difficult but because they're obscure? (there needs to be a rhetorical question symbol...). Are you really unable to come up with any arguments why it may be better to let an AI out? I can come up with a zillion. Especially for people who believe in one-boxing on Newcomb's problem. You keep saying that there's obvious right choice. Well, maybe if you are one hundred percent certain about this whole skynet scenario thing, there is, but even here this particular belief is not particularly common (and I'd say such a belief would indicate significant proneness to suggestion)
Point conceded already. I'll have to rethink my definition of "impossible." Oh, I can come up with plenty of arguments. Just none that are good enough IC to give up $10 for OOC. ... skynet scenario thing?
I think people just play it fair, i.e. they are entering with willingness to pay for getting convinced. Of course the purely rational action is to enter it, minimize the chat window, and do some actual work making money simultaneously. But that would be too evil. Likewise it would be too evil to not give up a little money if you see that this other person would have convinced you if it was real. Letting the AI out resulting in the death of everyone. If that's off the list, well, there's nothing weird about people conceding to pay after the other person obviously put in a lot of effort. We're really used to doing that. edit: think about it like playing, say, Go, with money on the table - you pay if you lose, you get paid if you win, for example. How can the placement of stones on a board ever make you give away your money? Well it can't, your norms of polite behaviour can.
Um. No. I mean... just, no. That is very, very clearly not the case. And your "too evil" case just defeats the point of the experiment. Right, but that's not what's going on. It's like playing Go with money on the table, when one player can say "I don't care if you win, I win anyway.". And given the sheer amount of effort expended on these games, and the unwillingness of the players to explain how it was done after being offered large sums of money, it's fairly clear nobody's just "roleplaying", except in a way enforced by the AI.

Here are a couple of other proposals (which I haven't thought about very long) for consideration:

  • Have the AI create an internal object structure of all the concepts in the world, trying as best as it can to carve reality at its joints. Let the AI's programmers inspect this object structure, make modifications to it, then formulate a command for the AI in terms of the concepts it has discovered for itself.

  • Instead of developing a foolproof way for the AI to understand meaning, develop an OK way for the AI to understand meaning and pair it with a really good system for keeping a distribution over different meanings and asking clarifying questions.

That first one would be worth doing even if we didn't dare hand the AI the keys to go and make changes. To study a non-human-created ontology would be fascinating and maybe really useful.

Maybe I am missing something, but hasn't a seed AI already been planted? Intelligence (whether that means ability to achieve goals in general, or whether it means able to do what humans can do) depends on both knowledge and computing power. Currently the largest collection of knowledge and computing power on the planet is the internet. By the internet, I mean both the billions of computers connected to it, and the two billion brains of its human users. Both knowledge and computing power are growing exponentially, doubling every 1 to 2 years, in part by add... (read more)

Somewhat off-topic. The Complexity of Value thesis mentions a terminal goal of

having a diverse civilization of sentient beings leading interesting lives.

Is this an immutable goal? If so, how can it go wrong given Fragility of Value?

Did you just ask how the phrase "interesting lives" could go wrong?

Right. I did. Ironic, I know. What I meant is, is properly defining "interesting" enough to avoid a UFAI, or are there some other issues to watch out for?
Hmm. It seems like a very small group of new lifeforms could lead properly interesting lives even if the AI killed us all beforehand and turned Earth (at least) into computing power.
I also suspect that we'd not enjoy an AGI that aims only for threshold values for two of the three of sentient, lives, or diverse, strongly favoring the last one.
I think I don't understand the question, what do you mean by 'immutable goal'?

This discussion of my IEET article has generated a certain amount of confusion, because RobbBB and others have picked up on an aspect of the original article that actually has no bearing on its core argument ... so in the interests of clarity of debate I have generated a brief restatement of that core argument, framed in such a way as to (hopefully) avoid the confusion.

At issue is a hypothetical superintelligent AI that is following some goal code that was ostensibly supposed to "make humans happy", but in the course of following that code it dec... (read more)

This is embarrassing, but I'm not sure for whom. It could be me, just because the argument you're raising (especially given your insistence) seems to have such a trivial answer. Well, here goes: There are two scenarios, because your "goalX code" could be construed in two ways: 1) If you meant for the "goalX code" to simply refer to the code used instrumentally to get a certain class of results X (with X still saved separately in some "current goal descriptor", and not just as a historical footnote), the following applies: The goals of the AI X have not changed, just the measures it wants to take to implement that code. Indeed noone at MIRI would then argue that the superintelligent AI would not -- upon noticing the discrepancy -- in all general cases correct the broken "goalX code". Reason: The "goalX code" in this scenario is just a means to an end, and -- like all actions ("goalX code") derived from comparing models to X -- subject to modification as the agent improves its models (out of which the next action, the new and corrected "goalX" code, is derived). In this scenario the answer is trivial: The goals have not changed. X is still saved somewhere as the current goal. The AI could be wrong about the measures it implements to achieve X (i.e. 'faulty' "goalX" code maximizing for something other than X), but its superintelligence attribute implies that such errors be swiftly corrected (how could it otherwise choose the right actions to hit a small target, the definition of superintelligence in this context). 2) If you mean to say that the goal is implicitly encoded within the "goalX" code only and nowhere else as the current goal, and the "goalX" code has actually become a "goalY" code in all but name, then the agent no longer has the goal X, it now has the goal Y. There is no reason at all to conclude that the agent would switch to some other goal simply because it once had that goal. It can understand its own genesis and its original purpose all it wants,
I (notice that I) am confused by this comment. This seems obviously impossible, yes; so obviously impossible, in fact, that only one example springs to mind (surely the AI will be smart enough to realize it's programmed goals are wrong!) In particular, this really doesn't seem to apply to the example of the "Dopamine Drip scenario" plan, which, if I'm reading you correctly, it was intended to. What am I missing here? I know there must be something. So ... you come up with the optimal plan, and then check with puny humans to see if that's what they would have decided anyway? And if they say "no, that's a terrible idea" then you assume they knew better than you? Why would anyone even bother building such a superintelligent AI? Isn't the whole point of creating a superintelligence that it can understand things we can't, and come up with plans we would never conceive of, or take centuries to develop?
I'm afraid you have lost me: when you say "This seems obviously impossible..." I am not clear which aspect strikes you as obviously impossible. Before you answer that, though: remember that I am describing someone ELSE'S suggestion about how the AI will behave ..... I am not advocating this as a believable scenario! In fact I am describing that other person's suggestion in such a way that the impossibility is made transparent. So I, too, believe that this hypothetical AI is fraught with contradictions. The Dopamine Drip scenario is that the AI knows that it has a set of goals designed to achieve a certain set of results, and since it has an extreme level of intelligence it is capable of understanding that very often a "target set of results" can be described, but not enumerated as a closed set. It knows that very often in its behavior it (or someone else) will design some goal code that is supposed to achieve that "target set of results", but because of the limitations of goal code writing, the goal code can malfunction. The Dopamine Drip scenario is only one example of how a discrepancy can arise -- in that case, the "target set of results" is the promotion of human happiness, and then the rest of the scenario follows straightforwardly. Nobody I have talked to so far misunderstands what the DD scenario implies, and how it fits that pattern. So could you clarify how you think it does not?
AI: Yes, this is in complete contradiction of my programmed goals. Ha ha, I'm gonna do it anyway. Of course, yeah. I'm basically accusing you of failure to steelman/misinterpreting someone; I, for one, have never heard this suggested (beyond the one example I gave, which I don't think is what you had in mind.) uhuh. So, any AI smart enough to understand it's creators, right? waaait I think I know where this is going. Are you saying an AI would somehow want to do what it's programmers intended rather than what they actually programmed it to do? Yeah, sorry, I can see how programmers might accidentally write code that creates dopamine world and not eutopia. I just don't see how this is supposed to connect to the idea of an AI spontaneously violating it's programmed goals. In this case, surely that would look like "hey guys, you know your programming said to maximise happiness? You guys should be more careful, that actually means "drug everybody". Anyway, I'm off to torture some people."
Yeah, I can think of two general ways to interpret this: * In a variant of CEV, the AI uses our utterances as evidence for what we would have told it if we thought more quickly etc. No single utterance carries much risk because the AI will collect lots of evidence and this will likely correct any misleading effects. * Having successfully translated the quoted instruction into formal code, we add another possible point of failure.

I just want to say that I am pressured for time at the moment, or I would respond at greater length. But since I just wrote the following directly to Rob, I will put it out here as my first attempt to explain the misunderstanding that I think is most relevant here....

My real point (in the Dumb Superintelligence article) was essentially that there is little point discussing AI Safety with a group of people for whom 'AI' means a kind of strawman-AI that is defined to be (a) So awesomely powerful that it can outwit the whole intelligence of the human race, b... (read more)

So awesomely stupid that it thinks that the goal 'make humans happy' could be satisfied by an action that makes every human on the planet say 'This would NOT make me happy: Don't do it!!!'

The AI is not stupid here. In fact, it's right and they're wrong. It will make them happy. Of course, the AI knows that they're not happy in the present contemplating the wireheaded future that awaits them, but the AI is utilitarian and doesn't care. They'll just have to live with that cost while it works on the means to make them happy, at which point the temporary utility hit will be worth it.

The real answer is that they cared about more than just being happy. The AI also knows that, and it knows that it would have been wise for the humans to program it to care about all their values instead of just happiness. But what tells it to care?

Richard: I'll stick with your original example. In your hypothetical, I gather, programmers build a seed AI (a not-yet-superintelligent AGI that will recursively self-modify to become superintelligent after many stages) that includes, among other things, a large block of code I'll call X.

The programmers think of this block of code as an algorithm that will make the seed AI and its descendents maximize human pleasure. But they don't actually know for sure that X will maximize human pleasure — as you note, 'human pleasure' is an unbelievably complex concept, so no human could be expected to actually code it into a machine without making any mistakes. And writing 'this algorithm is supposed to maximize human pleasure' into the source code as a comment is not going to change that. (See the first few paragraphs of Truly Part of You.)

Now, why exactly should we expect the superintelligence that grows out of the seed to value what we really mean by 'pleasure', when all we programmed it to do was X, our probably-failed attempt at summarizing our values? We didn't program it to rewrite its source code to better approximate our True Intentions, or the True Meaning of our in-code comments. And... (read more)

I'm really glad you posted this, even though it may not enlighten the person it's in reply to: this is an error lots of people make when you try to explain the FAI problem to them, and the "two gaps" explanation seems like a neat way to make it clear.

We seem to agree that for an AI to talk itself out of a confinement (like in the AI box experiment), the AI would have to understand what humans mean and want. As far as I understand your position, you believe that it is difficult to make an AI care to do what humans want, apart from situations where it is temporarily instrumentally useful to do what humans want. Do you agree that for such an AI to do what humans want, in order to deceive them, humans would have to succeed at either encoding the capability to understand what humans want, or succeed at encoding the capability to make itself capable of understanding what humans want? My question, do you believe there to be a conceptual difference between encoding capabilities, what an AI can do, and goals, what an AI will do? As far as I understand, capabilities and goals are both encodings of how humans want an AI to behave. In other words, humans intend an AI to be intelligent and use its intelligence in a certain way. And in order to be an existential risk, humans need to succeed making and AI behave intelligently but fail at making it use its intelligence in a way that does not kill everyone. Do you agree?

Your summaries of my views here are correct, given that we're talking about a superintelligence.

My question, do you believe there to be a conceptual difference between encoding capabilities, what an AI can do, and goals, what an AI will do? As far as I understand, capabilities and goals are both encodings of how humans want an AI to behave.

Well, there's obviously a difference; 'what an AI can do' and 'what an AI will do' mean two different things. I agree with you that this difference isn't a particularly profound one, and the argument shouldn't rest on it.

What the argument rests on is, I believe, that it's easier to put a system into a positive feedback loop that helps it better model its environment and/or itself, than it is to put a system into a positive feedback loop that helps it better pursue a specific set of highly complex goals we have in mind (but don't know how to fully formalize).

If the AI incorrectly models some feature of itself or its environment, reality will bite back. But if it doesn't value our well-being, how do we make reality bite back and change the AI's course? How do we give our morality teeth?

Whatever goals it initially tries to pursue, it will fail i... (read more)

This can be understood as both a capability and as a goal. What humans mean an AI to do is to undergo recursive self-improvement. What humans mean an AI to be capable of is to undergo recursive self-improvement. I am only trying to clarify the situation here. Please correct me if you think that above is wrong. I do not disagree with the orthogonality thesis insofar as an AI can have goals that interfere with human values in a catastrophic way, possibly leading to human extinction. I believe here is where we start to disagree. I do not understand how the "improvement" part of recursive self-improvement can be independent of properties such as the coherence and specificity of the goal the AI is supposed to achieve. Either you have a perfectly specified goal, such as "maximizing paperclips", where it is clear what "maximization" means, and what the properties of "paperclips" are, or there is some amount of uncertainty about what it means to achieve the goal of "maximizing paperclips". Consider the programmers forgot to encode what shape the paperclips are supposed to have. How do you suppose would that influence the behavior of the AI. Would it just choose some shape at random, or would it conclude that shape is not part of its goal? If the former, where would the decision to randomly choose a shape come from? If the latter, what would it mean to maximize shapeless objects? I am just trying to understand what kind of AI you have in mind. This is a clearer point of disagreement. An AI needs to be able to draw clear lines where exploration ends and exploitation starts. For example, an AI that thinks about every decision for a year would never get anything done. An AI also needs to discount low probability possibilities, as to not be vulnerable to internal or external Pascal's mugging scenarios. These are problems that humans need to solve and encode in order for an AI to be a danger. But these problems are in essence confinements, or bounds on how an AI is goi
"uncertainty" is in your human understanding of the program, not in the actual program. A program doesn't go "I don't know what I'm supposed to do next", it follows instructions step-by-step. It would mean exactly what it's programmed to mean, without any uncertainty in it at all.
2Rob Bensinger10y
Yes. To divide it more finely, it could be a terminal goal, or an instrumental goal; it could be a goal of the AI, or a goal of the human; it could be a goal the human would reflectively endorse, or a goal the human would reflectively reject but is inadvertently promoting anyway. I agree that, at a given time, the AI must have a determinate goal. (Though the encoding of that goal may be extremely complicated and unintentional. And it may need to be time-indexed.) I'm not dogmatically set on the idea that a self-improving AGI is easy to program; at this point it wouldn't shock me if it took over 100 years to finish making the thing. What you're alluding to are the variety of ways we could fail to construct a self-improving AGI at all. Obviously there are plenty of ways to fail to make an AGI that can improve its own ability to track things about its environment in a domain-general way, without bursting into flames at any point. If there weren't plenty of ways to fail, we'd have already succeeded. Our main difference in focus is that I'm worried about what happens if we do succeed in building a self-improving AGI that doesn't randomly melt down. Conditioned on our succeeding in the next few centuries in making a machine that actually optimizes for anything at all, and that optimizes for its own ability to generally represent its environment in a way that helps it in whatever else it's optimizing for, we should currently expect humans to go extinct as a result. Even if the odds of our succeeding in the next few centuries were small, it would be worth thinking about how to make that extinction event less likely. (Though they aren't small.) I gather that you think that making an artificial process behave in any particular way at all (i.e., optimizing for something), while recursively doing surgery on its own source code in the radical way MIRI is interested in, is very tough. My concern is that, no matter how true that is, it doesn't entail that if we succeed at that
I am trying to understand if the kind of AI, that is underlying the scenario that you have in mind, is a possible and likely outcome of human AI research. As far as I am aware, as a layman, goals and capabilities are intrinsically tied together. How could a chess computer be capable of winning against humans at chess without the terminal goal of achieving a checkmate? Coherent and specific goals are necessary to (1) decide which actions are instrumental useful (2) judge the success of self-improvement. If the given goal is logically incoherent, or too vague for the AI to be able to tell apart success from failure, would it work at all? If I understand your position correctly, you would expect a chess playing general AI, one that does not know about checkmate, instead of "winning at chess", to improve against such goals as "modeling states of affairs well" or "make sure nothing intervenes chess playing". You believe that these goals do not have to be programmed by humans, because they are emergent goals, an instrumental consequence of being general intelligent. These universal instrumental goals, these "AI drives", seem to be a major reason for why you believe it to be important to make the AI care about behaving correctly. You believe that these AI drives are a given, and the only way to prevent an AI from being an existential risk is to channel these drives, is to focus this power on protecting and amplifying human values. My perception is that these drives that you imagine are not special and will be as difficult to get "right" than any other goal. I think that the idea that humans not only want to make an AI exhibit such drives, but also succeed at making such drives emerge, is a very unlikely outcome. As far as I am aware, here is what you believe an AI to want: * It will want to self-improve * It will want to be rational * It will try to preserve their utility functions * It will try to prevent counterfeit utility * It will be self-protective It wil
5Rob Bensinger10y
Humans are capable of winning at chess without the terminal goal of doing so. Nor were humans designed by evolution specifically for chess. Why should we expect a general superintelligence to have intelligence that generalizes less easily than a human's does? You keep coming back to this 'logically incoherent goals' and 'vague goals' idea. Honestly, I don't have the slightest idea what you mean by those things. A goal that can't motivate one to do anything ain't a goal; it's decor, it's noise. 'Goals' are just the outcomes systems tend to produce, especially systems too complex to be easily modeled as, say, physical or chemical processes. Certainly it's possible for goals to be incredibly complicated, or to vary over time. But there's no such thing as a 'logically incoherent outcome'. So what's relevant to our purposes is whether failing to make a powerful optimization process human-friendly will also consistently stop the process from optimizing for anything whatsoever. Conditioned on a self-modifying AGI (say, an AGI that can quine its source code, edit it, then run the edited program and repeat the process) achieving domain-general situation-manipulating abilities (i.e., intelligence), analogous to humans' but to a far greater degree, which of the AI drives do you think are likely to be present, and which absent? 'It wants to self-improve' is taken as a given, because that's the hypothetical we're trying to assess. Now, should we expect such a machine to be indifferent to its own survival and to the use of environmental resources? Sometimes a more complex phenomenon is the implication of a simpler hypothesis. A much narrower set of goals will have intelligence-but-not-resource-acquisition as instrumental than will have both as instrumental, because it's unlikely to hit upon a goal that requires large reasoning abilities but does not call for many material resources. You haven't given arguments suggesting that here. At most, you've given arguments against expe
Well, I'm not sure what XXD means by them, but... G1 ("Everything is painted red") seems like a perfectly coherent goal. A system optimizing G1 paints things red, hires people to paint things red, makes money to hire people to paint things red, invents superior paint-distribution technologies to deposit a layer of red paint over things, etc. G2 ("Everything is painted blue") similarly seems like a coherent goal. G3 (G1 AND G2) seems like an incoherent goal. A system with that goal... well, I'm not really sure what it does.
2Rob Bensinger10y
A system's goals have to be some event that can be brought about. In our world, '2+2=4' and '2+2=5' are not goals; 'everything is painted red and not-red' may not be a goal for similar reasons. When we're talking about an artificial intelligence's preferences, we're talking about the things it tends to optimize for, not the things it 'has in mind' or the things it believes are its preferences. This is part of what makes the terminology misleading, and is also why we don't ask 'can a superintelligence be irrational?'. Irrationality is dissonance between my experienced-'goals' (and/or, perhaps, reflective-second-order-'goals') and my what-events-I-produce-'goals'; but we don't care about the superintelligence's phenomenology. We only care about what events it tends to produce. Tabooing 'goal' and just talking about the events a process-that-models-its-environment-and-directs-the-future tends to produce would, I think, undermine a lot of XiXiDu's intuitions about goals being complex explicit objects you have to painstakingly code in. The only thing that makes it more useful to model a superintelligence as having 'goals' than modeling a blue-minimizing robot as having 'goals' is that the superintelligence responds to environmental variation in a vastly more complicated way. (Because, in order to be even a mediocre programmer, its model-of-the-world-that-determines-action has to be more complicated than a simple camcorder feed.)
Oh. Well, in that case, all right. If there exists some X a system S is in fact optimizing for, and what we mean by "S's goals" is X, regardless of what target S "has in mind", then sure, I agree that systems never have vague or logically incoherent goals. Well, wait. Where did "models its environment" come from? If we're talking about the things S optimizes its environment for, not the things S "has in mind", then it would seem that whether S models its environment or not is entirely irrelevant to the conversation. In fact, given how you've defined "goal" here, I'm not sure why we're talking about intelligence at all. If that is what we mean by "goal" then intelligence has nothing to do with goals, or optimizing for goals. Volcanoes have goals, in that sense. Protons have goals. I suspect I'm still misunderstanding you.
2Rob Bensinger10y
From Eliezer's Belief in Intelligence: "Since I am so uncertain of Kasparov's moves, what is the empirical content of my belief that 'Kasparov is a highly intelligent chess player'? What real-world experience does my belief tell me to anticipate? [...] "The empirical content of my belief is the testable, falsifiable prediction that the final chess position will occupy the class of chess positions that are wins for Kasparov, rather than drawn games or wins for Mr. G. [...] The degree to which I think Kasparov is a 'better player' is reflected in the amount of probability mass I concentrate into the 'Kasparov wins' class of outcomes, versus the 'drawn game' and 'Mr. G wins' class of outcomes." From Measuring Optimization Power: "When I think you're a powerful intelligence, and I think I know something about your preferences, then I'll predict that you'll steer reality into regions that are higher in your preference ordering. [...] "Ah, but how do you know a mind's preference ordering? Suppose you flip a coin 30 times and it comes up with some random-looking string - how do you know this wasn't because a mind wanted it to produce that string? "This, in turn, is reminiscent of the Minimum Message Length formulation of Occam's Razor: if you send me a message telling me what a mind wants and how powerful it is, then this should enable you to compress your description of future events and observations, so that the total message is shorter. Otherwise there is no predictive benefit to viewing a system as an optimization process. This criterion tells us when to take the intentional stance. "(3) Actually, you need to fit another criterion to take the intentional stance - there can't be a better description that averts the need to talk about optimization. This is an epistemic criterion more than a physical one - a sufficiently powerful mind might have no need to take the intentional stance toward a human, because it could just model the regularity of our brains like movi
Yes, that seems plausible. I would say rather that modeling one's environment is an effective tool for consistently optimizing for some specific unlikely thing X across a range of environments, so optimizers that do so will be more successful at optimizing for X, all else being equal, but it more or less amounts to the same thing. But... so what? I mean, it also seems plausible that optimizers that explicitly represent X as a goal will be more successful at consistently optimizing for X, all else being equal... but that doesn't stop you from asserting that explicit representation of X is irrelevant to whether a system has X as its goal. So why isn't modeling the environment equally irrelevant? Both features, on your account, are optional enhancements an optimizer might or might not display. It keeps seeming like all the stuff you quote and say before your last two paragraphs ought to provide an answer that question, but after reading it several times I can't see what answer it might be providing. Perhaps your argument is just going over my head, in which case I apologize for wasting your time by getting into a conversation I'm not equipped for..
-1Rob Bensinger10y
Maybe it will help to keep in mind that this is one small branch of my conversation with Alexander Kruel. Alexander's two main objections to funding Friendly Artificial Intelligence research are that (1) advanced intelligence is very complicated and difficult to make, and (2) getting a thing to pursue a determinate goal at all is extraordinarily difficult. So a superintelligence will never be invented, or at least not for the foreseeable future; so we shouldn't think about SI-related existential risks. (This is my steel-manning of his view. The way he actually argues seems to instead be predicated on inventing SI being tied to perfecting Friendliness Theory, but I haven't heard a consistent argument for why that should be so.) Both of these views, I believe, are predicated on a misunderstanding of how simple and disjunctive 'intelligence' and 'goal' are, for present purposes. So I've mainly been working on tabooing and demystifying those concepts. Intelligence is simply a disposition to efficiently convert a wide variety of circumstances into some set of specific complex events. Goals are simply the circumstances that occur more often when a given intelligence is around. These are both very general and disjunctive ideas, in stark contrast to Friendliness; so it will be difficult to argue that a superintelligence simply can't be made, and difficult too to argue that optimizing for intelligence requires one to have a good grasp on Friendliness Theory. Because I'm trying to taboo the idea of superintelligence, and explain what it is about seed AI that will allow it to start recursively improving its own intelligence, I've been talking a lot about the important role modeling plays in high-level intelligent processes. Recognizing what a simple idea modeling is, and how far it gets one toward superintelligence once one has domain-general modeling proficiency, helps a great deal with greasing the intuition pump 'Explosive AGI is a simple, disjunctive event, a low-hanging
I suppose it helps, if only in that it establishes that much of what you're saying to me is actually being addressed indirectly to somebody else, so it ought not surprise me that I can't quite connect much of it to anything I've said. Thanks for clarifying your intent. For my own part, I'm certainly not functioning here as Alex's proxy; while I don't consider explosive intelligence growth as much of a foregone conclusion as many folks here do, I also don't consider Alex's passionate rejection of the possibility justified, and have had extended discussions on related subjects with him myself in past years. So most of what you write in response to Alex's positions is largely talking right past me. (Which is not to say that you ought not be doing it. If this is in effect a private argument between you and Alex that I've stuck my nose into, let me know and I'll apologize and leave y'all to it in peace.) Anyway, I certainly agree that a system might have a representation of its goals that is distinct from the mechanisms that cause it to pursue those goals. I have one of those, myself. (Indeed, several.) But if a system is capable of affecting its pursuit of its goals (for example, if it is capable of correcting the effects of a state-change that would, uncorrected, have led to value drift), it is not merely interacting with maps. It is also interacting with the territory... that is, it is modifying the mechanisms that cause it to pursue those goals... in order to bring that territory into line with its pre-existing map. And in order to do that, it must have such a mechanism, and that mechanism must be consistently isomorphic to its representations of its goals. Yes?
0Rob Bensinger10y
Right. I'm not saying that there aren't things about the AI that make it behave the way it does; what the AI optimizes for is a deterministic result of its properties plus environment. I'm just saying that something about the environment might be necessary for it to have the sorts of preferences we can most usefully model it as having; and/or there may be multiple equally good candidates for the parts of the AI that are its values, or their encoding. If we reify preferences in an uncautious way, we'll start thinking of the AI's 'desires' too much as its first-person-experienced urges, as opposed to just thinking of them as the effect the local system we're talking about tends to have on the global system.
Hm. So, all right. Cconsider two systems, S1 and S2, both of which happen to be constructed in such a way that right now, they are maximizing the number of things in their environment that appear blue to human observers, by going around painting everything blue. Suppose we add to the global system a button that alters all human brains so that everything appears blue to us, and we find that S1 presses the button and stops painting, and S2 ignores the button and goes on painting. Suppose that similarly, across a wide range of global system changes, we find that S1 consistently chooses the action that maximizes the number of things in its environment that appear blue to human observers, while S2 consistently goes on painting. I agree with you that if I reify S2's preferences in an uncautious way, I might start thinkng of S2 as "wanting to paint things blue" or "wanting everything to be blue" or "enjoying painting things blue" or as having various other similar internal states that might simply not exist, and that I do better to say it has a particular effect on the global system. S2 simply paints things blue; whether it has the goal of painting things blue or not, I have no idea. I am far less comfortable saying that S1 has no goals, precisely because of how flexibly and consistently it is revising its actions so as to consistently create a state-change across wide ranges of environments. To use Dennett's terminology, I am more willing to adopt an intentional stance with respect to S1 than S2. If I've understood your position correctly, you're saying that I'm unjustified in making that distinction... that to the extent that we can say that S1 and S2 have "goals," the word "goals" simply refer to the state changes they create in the world. Initially they both have the goal of painting things blue, but S1's goals keep changing: first it paints things blue, then it presses a button, then it does other things. And, sure, I can make up some story like "S1 maximizes th
0Rob Bensinger10y
I think you're switching back and forth between a Rational Choice Theory 'preference' and an Ideal Self Theory 'preference'. To disambiguate, I'll call the former R-preferences and the latter I-preferences. My R-preferences -- the preferences you'd infer I had from my behaviors if you treated me as a rational agent -- are extremely convoluted, indeed they need to be strongly time-indexed to maintain consistency. My I-preferences are the things I experience a desire for, whether or not that desire impacts my behavior. (Or they're the things I would, with sufficient reflective insight and understanding into my situation, experience a desire for.) We have no direct evidence from your story addressing whether S1 or S2 have I-preferences at all. Are they sentient? Do they create models of their own cognitive states? Perhaps we have a little more evidence that S1 has I-preferences than that S2 does, but only by assuming that a system whose goals require more intelligence or theory-of-mind will have a phenomenology more similar to a human's. I wouldn't be surprised if that assumption turns out to break down in some important ways, as we explore more of mind-space. But my main point was that it doesn't much matter what S1 or S2's I-preferences are, if all we're concerned about is what effect they'll have on their environment. Then we should think about their R-preferences, and bracket exactly what psychological mechanism is resulting in their behavior, and how that psychological mechanism relates to itself. I've said that R-preferences are theoretical constructs that happen to be useful a lot of the time for modeling complex behavior; I'm not sure whether I-preferences are closer to nature's joints. S1's instrumental goals may keep changing, because its circumstances are changing. But I don't think its terminal goals are changing. The only reason to model it as having two completely incommensurate goal sets at different times would be if there were no simple terminal go
I don't think I'm switching back and forth between I-preferences and R-preferences. I don't think I'm talking about I-preferences at all, nor that I ever have been. I completely agree with you that they don't matter for our purposes here, so if I am talking about them, I am very very confused. (Which is certainly possible.) But I don't think that R-preferences (preferences, goals, etc.) can sensibly be equated with the actual effects a local system has on a global system. If they could, we could talk equally sensibly about earthquakes having R-preferences (preferences, goals, etc.), and I don't think it's sensible to talk that way. R-preferences (preferences, goals, etc.) are, rather, internal states of a system S. If S is a competent optimizer (or "rational agent," if you prefer) with R-preferences (preferences, goals, etc.) P, the existence of P will cause S to behave in ways that cause isomorphic effects (E) on a global system, so we can use observations of E as evidence of P (positing that S is a competent optimizer) or as evidence that S is a competent optimizer (positing the existence of P) or a little of both. But however we slice it, P is not the same thing as E, E is merely evidence of P's existence. We can infer P's existence in other ways as well, even if we never observe E... indeed, even if E never gets produced. And the presence or absence of a given P in S is something we can be mistaken about; there's a fact of the matter. I think you disagree with the above paragraph, because you describe R-preferences (preferences, goals, etc.) as theoretical constructs rather than parts of the system, which suggests that there is no fact of the matter... a different theoretical approach might never include P, and it would not be mistaken, it would just be a different theoretical approach. I also think that because way back at the beginning of this exchange when I suggested "paint everything red AND paint everything blue" was an example of an incoherent goa
0Rob Bensinger10y
You can treat earthquakes and thunderstorms and even individual particles as having 'preferences'. It's just not very useful to do so, because we can give an equally simple explanation for what effects things like earthquakes tend to have that is more transparent about the physical mechanism at work. The intentional strategy is a heuristic for black-boxing physical processes that are too complicated to usefully describe in their physical dynamics, but that can be discussed in terms of the complicated outcomes they tend to promote. (I'd frame it: We're exploiting the fact that humans are intuitively dualistic by taking the non-physical modeling device of humans (theory of mind, etc.) and appropriating this mental language and concept-web for all sorts of systems whose nuts and bolts we want to bracket. Slightly regimented mental concepts and terms are useful, not because they apply to all the systems we're talking about in the same way they were originally applied to humans, but because they're vague in ways that map onto the things we're uncertain about or indifferent to.) 'X wants to do Y' means that the specific features of X tend to result in Y when its causal influence is relatively large and direct. But, for clarity's sake, we adopt the convention of only dropping into want-speak when a system is too complicated for us to easily grasp in mechanistic terms why it's having these complex effects, yet when we can predict that, whatever the mechanism happens to be, it is the sort of mechanism that has those particular complex effects. Thus we speak of evolution as an optimization process, as though it had a 'preference ordering' in the intuitively human (i.e., I-preference) sense, even though in the phenomenological sense it's just as mindless as an earthquake. We do this because black-boxing the physical mechanisms and just focusing on the likely outcomes is often predictively useful here, and because the outcomes are complicated and specific. This is useful for
Yes, agreed, for some fuzzy notion of "easily grasp" and "too complicated." That is, there's a sense in which thunderstorms are too complicated for me to describe in mechanistic terms why they're having the effects they have... I certainly can't predict those effects. But there's also a sense in which I can describe (and even predict) the effects of a thunderstorm that feels simple, whereas I can't do the same thing for a human being without invoking "want-speak"/intentional stance. I'm not sure any of this is [i]justified[/i], but I agree that it is what we do... this is how we speak, and we draw these distinctions. So far, so good. I'm not really sure what you mean by "in the AI" here, but I guess I agree that the boundary between an agent and its environment is always a fuzzy one. So, OK, I suppose we can include things about the environment "in the AI" if we choose. (I can similarly choose to include things about the environment "in myself.") So far, so good. Here is where you lose me again... once again you talk as though there's simply no fact of the matter as to which preference the AI has, merely our choice as to how we model it. But it seems to me that there are observations I can make which would provide evidence one way or the other. For example, if it has the preference 'surround the Sun with a dyson sphere,' then in an environment lacking the Sun I would expect it to first seek to create the Sun... how else can it implement its preferences? Whereas if it has the preference 'conditioned on there being a Sun, surround it with a dyson sphere'; in an environment lacking the Sun I would not expect it to create the Sun. So does the AI seek create the Sun in such an environment, or not? Surely that doesn't depend on how I choose to model it. The AI's preference is whatever it is, and controls its behavior. Of course, as you say, if the real world always includes a sun, then I might not be able to tell which preference the AI has. (Then again I might... th
This sounds like a potentially confusing level of simplification; a goal should be regarded as at least a way of comparing possible events. Its behavior is what makes its goal important. But in a system designed to follow an explicitly specified goal, it does make sense to talk of its goal apart from its behavior. Even though its behavior will reflect its goal, the goal itself will reflect itself better. If the goal is implemented as a part of the system, other parts of the system can store some information about the goal, certain summaries or inferences based on it. This information can be thought of as beliefs about the goal. And if the goal is not "logically transparent", that is its specification is such that making concrete conclusions about what it states in particular cases is computationally expensive, then the system never knows what its goal says explicitly, it only ever has beliefs about particular aspects of the goal.
0Rob Bensinger10y
Perhaps, but I suspect that for most possible AIs there won't always be a fact of the matter about where its preference is encoded. The blue-minimizing robot is a good example. If we treat it as a perfectly rational agent, then we might say that it has temporally stable preferences that are very complicated and conditional; or we might say that its preferences change at various times, and are partly encoded, for instance, in the properties of the color-inverting lens on its camera. An AGI's response to environmental fluctuation will probably be vastly more complicated than a blue-minimizer's, but the same sorts of problems arise in modeling it. I think it's more useful to think of rational-choice-theory-style preferences as useful theoretical constructs -- like a system's center of gravity, or its coherently extrapolated volition -- than as real objects in the machine's hardware or software. This sidesteps the problem of haggling over which exact preferences a system has, how those preferences are distributed over the environment, how to decide between causally redundant encodings which is 'really' the preference encoding, etc. See my response to Dave.
"Goal" is a natural idea for describing AIs with limited resources: these AIs won't be able to make optimal decisions, and their decisions can't be easily summarized in terms of some goal, but unlike the blue-minimizing robot they have a fixed preference ordering that doesn't gradually drift away from what it was originally, and eventually they tend to get better at following it. For example, if a goal is encrypted, and it takes a huge amount of computation to decrypt it, system's behavior prior to that point won't depend on the goal, but it's going to work on decrypting it and eventually will follow it. This encrypted goal is probably more predictive of long-term consequences than anything else in the details of the original design, but it also doesn't predict its behavior during the first stage (and if there is only a small probability that all resources in the universe will allow decrypting the goal, it's probable that system's behavior will never depend on the goal). Similarly, even if there is no explicit goal, as in the case of humans, it might be possible to work with an idealized goal that, like the encrypted goal, can't be easily evaluated, and so won't influence behavior for a long time. My point is that there are natural examples where goals and the character of behavior don't resemble each other, so that each can't be easily inferred from the other, while both can be observed as aspects of the system. It's useful to distinguish these ideas.
0Rob Bensinger10y
I agree preferences aren't reducible to actual behavior. But I think they are reducible to dispositions to behave, i.e., behavior across counterfactual worlds. If a system prefers a specific event Z, that means that, across counterfactual environments you could have put it in, the future would on average have had more Z the more its specific distinguishing features had a large and direct causal impact on the world.
The examples I used seem to apply to "dispositions" to behave, in the same way (I wasn't making this distinction). There are settings where the goal can't be clearly inferred from behavior, or collection of hypothetical behaviors in response to various environments, at least if we keep environments relatively close to what might naturally occur, even as in those settings the goal can be observed "directly" (defined as an idealization based in AI's design). An AI with encypted goal (i.e. the AI itself doesn't know the goal in explicit form, but the goal can be abstractly defined as the result of decryption) won't behave in accordance with it in any environment that doesn't magically let it decrypt its goal quickly, there is no tendency to push the events towards what the encrypted goal specifies, until the goal is decrypted (which might be never with high probability).
0Rob Bensinger10y
I don't think a sufficiently well-encrypted 'preference' should be counted as a preference for present purposes. In principle, you can treat any physical chunk of matter as an 'encrypted preference', because if the AI just were a key of exactly the right shape, then it could physically interact with the lock in question to acquire a new optimization target. But if neither the AI nor anything very similar to the AI in nearby possible worlds actually acts as a key of the requisite sort, then we should treat the parts of the world that a distant AI could interact with to acquire a preference as, in our world, mere window dressing. Perhaps if we actually built a bunch of AIs, and one of them was just like the others except where others of its kind had a preference module, it had a copy of The Wind in the Willows, we would speak of this new AI as having an 'encrypted preference' consisting of a book, with no easy way to treat that book as a decision criterion like its brother- and sister-AIs do for their homologous components. But I don't see any reason right now to make our real-world usage of the word 'preference' correspond to that possible world's usage. It's too many levels of abstraction away from what we should be worried about, which are the actual real-world effects different AI architectures would have.
Here is what I mean: Evolution was able to come up with cats. Cats are immensely complex objects. Evolution did not intend to create cats. Now consider you wanted to create an expected utility maximizer to accomplish something similar, except that it would be goal-directed, think ahead, and jump fitness gaps. Further suppose that you wanted your AI to create qucks, instead of cats. How would it do this? Given that your AI is not supposed to search design space at random, but rather look for something particular, you would have to define what exactly qucks are. The problem is that defining what a quck is, is the hardest part. And since nobody has any idea what a quck is, nobody can design a quck creator. The point is that thinking about the optimization of optimization is misleading, as most of the difficulty is with defining what to optimize, rather than figuring out how to optimize it. In other words, the efficiency of e.g. the scientific method depends critically on being able to formulate a specific hypothesis. Trying to create an optimization optimizer would be akin to creating an autonomous car to find the shortest route between Gotham City and Atlantis. The problem is not how to get your AI to calculate a route, or optimize how to calculate such a route, but rather that the problem is not well-defined. You have no idea what it means to travel between two fictional cities. Which in turn means that you have no idea what optimization even means in this context, let alone meta-level optimization.
The problem is, you don't have to program the bit that says "now make yourself more intelligent." You only have to program the bit that says "here's how to make a new copy of yourself, and here's how to prove it shares your goals without running out of math." And the bit that says "Try things until something works, then figure out why it worked." AKA modeling. The AI isn't actually an intelligence optimizer. But it notes that when it takes certain actions, it is better able to model the world, which in turn allows it to make more paperclips (or whatever). So it'll take those actions more often.
(Note: I'm also a layman, so my non-expert opinions necessarily come with a large salt side-dish) My guess here is that most of the "AI Drives" to self-improve, be rational, retaining it's goal structure, etc. are considered necessary for a functional learning/self-improving algorithm. If the program cannot recognize and make rules for new patterns observed in data, make sound inferences based on known information or keep after it's objective it will not be much of an AGI at all; it will not even be able to function as well as a modern targeted advertising program. The rest, such as self-preservation, are justified as being logical requirements of the task. Rather than having self-preservation as a terminal value, the paperclip maximizer will value it's own existence as an optimal means of proliferating paperclips. It makes intuitive sense that those sorts of 'drives' would emerge from most-any goal, but then again my intuition is not necessarily very useful for these sorts of questions. This point might also be a source of confusion; As Dr Valiant (great name or the greatest name?) classifies things in Probably Approximately Correct, Winning Chess would be a 'theoryful' task while Discovering (Interesting) Mathematical Proofs would be a 'theoryless' one. In essence, the theoryful has simple and well established rules for the process which could be programmed optimally in advance with little-to-no modification needed afterwards while the theoryless is complex and messy enough that an imperfect (Probably Approximately Correct) learning process would have to be employed to suss out all the rules. Now obviously the program will benefit from labeling in it's training data for what is and is not an "interesting" mathematical proof, otherwise it can just screw around with computationally-cheap arithmetic proofs (1 + 1 = 2, 1.1 + 1 = 2.1, 1.2 + 1 = 2.2, etc.) until the heat death of the universe. Less obviously, as the hidden tank example shows, insufficient labeling
To explain what I have in mind, consider Ben Goertzel's example of how to test for general intelligence: I do not disagree that such a robot, when walking towards the classroom, if it is being obstructed by a fellow human student, could attempt to kill this human, in order to get to the classroom. Killing a fellow human, from the perspective of the human creators of the robot, is clearly a mistake. From a human perspective, it means that the robot failed. You believe that the robot was just following its programming/construction. Indeed, the robot is its programming. I agree with this. I agree that the human creators were mistaken about what dynamic state sequence the robot will exhibit by computing the code. What the "dumb superintelligence" argument tries to highlight is that if humans are incapable of predicting such behavior, then they will also be mistaken about predicting behavior that is harmful to the robots power. For example, while fighting with the human in order to kill it, for a split-second it mistakes its own arm with that of the human and breaks it. You might now argue that such a robot isn't much of a risk. It is pretty stupid to mistake its own arm with that of the enemy it tries to kill. True. But the point is that there is no relevant difference between failing to predict behavior that will harm the robot itself, and behavior that will harm a human. Except that you might believe the former is much easier than the latter. I dispute this. For the robot to master a complex environment, like a university full of humans, without harming itself, or decreasing the chance of achieving its goals, is already very difficult. Not stabbing or strangling other human students is not more difficult than not jumping from the 4th floor, instead of taking the stairs. This is the "dumb superintelligence" argument.
8Rob Bensinger10y
To some extent. Perhaps it would be helpful to distinguish four different kinds of defeater: 1. early intelligence defeater: We try to build a seed AI, but our self-rewriting AI quickly hits a wall or explodes. This is most likely if we start with a subhuman intelligence and have serious resource constraints (so we can't, e.g., just run an evolutionary algorithm over millions of copies of the AGI until we happen upon a variant that works). 2. late intelligence defeater: The seed AI works just fine, but at some late stage, when it's already at or near superintelligence, it suddenly explodes. Apparently it went down a blind alley at some point early on that led it to plateau or self-destruct later on, and neither it nor humanity is smart enough yet to figure out where exactly the problem arose. So the FOOM fizzles. 3. early Friendliness defeater: From the outset, the seed AI's behavior already significantly diverges from Friendliness. 4. late Friendliness defeater: The seed AI starts off as a reasonable approximation of Friendliness, but as it approaches superintelligence its values diverge from anything we'd consider Friendly, either because it wasn't previously smart enough to figure out how to self-modify while keeping its values stable, or because it was never perfectly Friendly and the new circumstances its power puts it in now make the imperfections much more glaring. In general, late defeaters are much harder for humans to understand than early defeaters, because an AI undergoing FOOM is too fast and complex to be readily understood. Your three main arguments, if I'm understanding them, have been: * (a) Early intelligence defeaters are so numerous that there's no point thinking much about other kinds of defeaters yet. * (b) Friendliness defeaters imply a level of incompetence on the programmers' part that strongly suggest intelligence defeaters will arise in the same situation.
Here is part of my stance towards AI risks: 1. I assign a negligible probability to the possibility of a sudden transition from well-behaved narrow AIs to general AIs (see below). 2. An AI will not be pulled at random from mind design space. An AI will be the result of a research and development process. A new generation of AIs will need to be better than other products at “Understand What Humans Mean” and “Do What Humans Mean”, in order to survive the research phase and subsequent market pressure. 3. Commercial, research or military products are created with efficiency in mind. An AI that was prone to take unbounded actions given any terminal goal would either be fixed or abandoned during the early stages of research. If early stages showed that inputs such as the natural language query would yield results such as then the AI would never reach a stage in which it was sufficiently clever and trained to understand what results would satisfy its creators in order to deceive them. 4. I assign a negligible probability to the possibility of a consequentialist AI / expected utility maximizer / approximation to AIXI. Given that the kind of AIs from point 4 are possible: 5. Omohundro's AI drives are what make the kind of AIs mentioned in point 1 dangerous. Making an AI that does not exhibit these drives in an unbounded manner is probably a prerequisite to get an AI to work at all (there are not enough resources to think about being obstructed by simulator gods etc.), or should otherwise be easy compared to the general difficulties involved in making an AI work using limited resources. 6. An AI from point 4 will only ever do what it has been explicitly programmed to do. Such an AI is not going to protect its utility-function, acquire resources or preemptively eliminate obstacles in an unbounded fashion. Because it is not intrinsically rational to do so. What specifically constitutes rational, economic behavior is inseparable with an agent’s terminal goal. That any ter
5Rob Bensinger10y
I don't think anyone's ever disputed this. (However, that's not very useful if the deterministic process resulting in the SI is too complex for humans to distinguish it in advance from the outcome of a random walk.) Agreed. But by default, a machine that is better than other rival machines at satisfying our short-term desires will not satisfy our long-term desires. The concern isn't that we'll suddenly start building AIs with the express purpose of hitting humans in the face with mallets. The concern is that we'll code for short-term rather than long-term goals, due to a mixture of disinterest in Friendliness and incompetence at Friendliness. But if intelligence explosion occurs, 'the long run' will arrive very suddenly, and very soon. So we need to adjust our research priorities to more seriously assess and modulate the long-term consequences of our technology. That may be a reason to think that recursively self-improving AGI won't occur. But it's not a reason to expect such AGI, if it occurs, to be Friendly. The seed is not the superintelligence. We shouldn't expect the seed to automatically know whether the superintelligence will be Friendly, any more than we should expect humans to automatically know whether the superintelligence will be Friendly. I'm not following. Why does an AGI have to have a halting condition (specifically, one that actually occurs at some point) in order to be able to productively rewrite its own source code? You don't seem to be internalizing my arguments. This is just the restatement of a claim I pointed out was not just wrong but dishonestly stated here. Sure, but the list of instrumental goals overlap more than the list of terminal goals, because energy from one project can be converted to energy for a different project. This is an empirical discovery about our world; we could have found ourselves in the sort of universe where instrumental goals don't converge that much, e.g., because once energy's been locked down into organisms
2Eliezer Yudkowsky10y
Actually I do define intelligence as ability to hit a narrow outcome target relative to your own goals, but if your goals are very relaxed then the volume of outcome space with equal or greater utility will be very large. However one would expect that many of the processes involved in hitting a narrow target in outcome space (such that few other outcomes are rated equal or greater in the agent's preference ordering), such as building a good epistemic model or running on a fast computer, would generalize across many utility functions; this is why we can speak of properties apt to intelligence apart from particular utility functions.
0Rob Bensinger10y
Hmm. But this just sounds like optimization power to me. You've defined intelligence in the past as "efficient cross-domain optimization". The "cross-domain" part I've taken to mean that you're able to hit narrow targets in general, not just ones you happen to like. So you can become more intelligent by being better at hitting targets you hate, or by being better at hitting targets you like. The former are harder to test, but something you'd hate doing now could become instrumentally useful to know how to do later. And your intelligence level doesn't change when the circumstance shifts which part of your skillset is instrumentally useful. For that matter, I'm missing why it's useful to think that your intelligence level could drastically shift if your abilities remained constant but your terminal values were shifted. (E.g., if you became pickier.)
2Eliezer Yudkowsky10y
No, "cross-domain" means that I can optimize across instrumental domains. Like, I can figure out how to go through water, air, or space if that's the fastest way to my destination, I am not limited to land like a ground sloth. Measured intelligence shouldn't shift if you become pickier - if you could previously hit a point such that only 1/1000th of the space was more preferred than it, we'd still expect you to hit around that narrow a volume of the space given your intelligence even if you claimed afterward that a point like that only corresponded to 0.25 utility on your 0-1 scale instead of 0.75 utility due to being pickier ([expected] utilities sloping more sharply downward with increasing distance from the optimum).
You might be not aware of this but I wrote a sequence of short blog posts where I tried to think of concrete scenarios that could lead to human extinction. Each of which raised many questions. The introductory post is 'AI vs. humanity and the lack of concrete scenarios'. 1. Questions regarding the nanotechnology-AI-risk conjunction 2. AI risk scenario: Deceptive long-term replacement of the human workforce 3. AI risk scenario: Social engineering 4. AI risk scenario: Elite Cabal 5. AI risk scenario: Insect-sized drones 6. AI risks scenario: Biological warfare What might seem to appear completely obvious to you for reasons that I do not understand, e.g. that an AI can take over the world, appears to me largely like magic (I am not trying to be rude, by magic I only mean that I don't understand the details). At the very least there are a lot of open questions. Even given that for the sake of the above posts I accepted that the AI is superhuman and can do such things as deceive humans by its superior knowledge of human psychology. Which seems to be non-trivial assumption, to say the least. Over and over I told you that given all your assumptions, I agree that AGI is an existential risk. You did not reply to my argument. My argument was that if the seed is unfriendly then it will not be smart enough to hide its unfriendliness. My argument did not pertain the possibility of a friendly seed turning unfriendly. What I have been arguing is that an AI should not be expected, by default, to want to eliminate all possible obstructions. There are many graduations here. That, by some economic or otherwise theoretic argument, it might be instrumentally rational for some ideal AI to take over the world, does not mean that humans would create such an AI, or that an AI could not be limited to care about fires in its server farm rather than that Russia might nuke the U.S. and thereby destroy its servers. Did you mean to reply to another point? I don't see how the reply you
Pretending to be friendly when you're actually not is something that doesn't even require human level intelligence. You could even do it accidentally. In general, the appearance of Friendliness at low levels of ability to influence the world doesn't guarantee actual Friendliness at high levels of ability to influence the world. (If it did, elected politicians would be much higher quality.)
Say we find an algorithm for producing progressively more accurate beliefs about itself and the world. This algorithm may be long and complicated - perhaps augmented by rules-of-thumb whenever the evidence available to it says these rules make better predictions. (E.g, "nine times out of ten the Enterprise is not destroyed.") Combine this with an arbitrary goal and we have the making of a seed AI. Seems like this could straightforwardly improve its ability to predict humans without changing its goal, which may be 'maximize pleasure' or 'maximize X'. Why would it need to change its goal? If you deny the possibility of the above algorithm, then before giving any habitual response please remember what humanity knows about clinical vs. actuarial judgment. What lesson do you take from this?
The problem, I reckon, is that X will never be anything like this. It will likely be something much more mundane, i.e. modelling the world properly and predicting outcomes given various counterfactuals. You might be worried by it trying to expand its hardware resources in an unbounded fashion, but any AI doing this would try to shut itself down if its utility function was penalized by the amount of resources that it had, so you can check by capping utility in inverse proportion to available hardware -- at worst, it will eventually figure out how to shut itself down, and you will dodge a bullet. I also reckon that the AI's capacity for deception would be severely crippled if its utility function penalized it when it didn't predict its own actions or the consequences of its actions correctly. And if you're going to let the AI actually do things... why not do exactly that? Arguably, such an AI would rather uneventfully arrive to a point where, when asking it "make us happy", it would just answer with a point by point plan that represents what it thinks we mean, and fill in details until we feel sure our intents are properly met. Then we just tell it to do it. I mean, seriously, if we were making an AGI, I would think "tell us what will happen next" would be fairly high in our list of priorities, only surpassed by "do not do anything we veto". Why would you program AI to "maximize happiness" rather than "produce documents detailing every step of maximizing happiness"? They are basically the same thing, except that the latter gives you the opportunity for a sanity check.
3Rob Bensinger10y
What counts as 'resources'? Do we think that 'hardware' and 'software' are natural kinds, such that the AI will always understand what we mean by the two? What if software innovations on their own suffice to threaten the world, without hardware takeover? Hm? That seems to only penalize it for self-deception, not for deceiving others. You're talking about an Oracle AI. This is one useful avenue to explore, but it's almost certainly not as easy as you suggest: "'Tool AI' may sound simple in English, a short sentence in the language of empathically-modeled agents — it's just 'a thingy that shows you plans instead of a thingy that goes and does things.' If you want to know whether this hypothetical entity does X, you just check whether the outcome of X sounds like 'showing someone a plan' or 'going and doing things', and you've got your answer. It starts sounding much scarier once you try to say something more formal and internally-causal like 'Model the user and the universe, predict the degree of correspondence between the user's model and the universe, and select from among possible explanation-actions on this basis.' [...] "If we take the concept of the Google Maps AGI at face value, then it actually has four key magical components. (In this case, 'magical' isn't to be taken as prejudicial, it's a term of art that means we haven't said how the component works yet.) There's a magical comprehension of the user's utility function, a magical world-model that GMAGI uses to comprehend the consequences of actions, a magical planning element that selects a non-optimal path using some method other than exploring all possible actions, and a magical explain-to-the-user function. "report($leading_action) isn't exactly a trivial step either. Deep Blue tells you to move your pawn or you'll lose the game. You ask 'Why?' and the answer is a gigantic search tree of billions of possible move-sequences, leafing at positions which are heuristically rated using a static-position ev
What is "taking over the world", if not taking control of resources (hardware)? Where is the motivation in doing it? Also consider, as others pointed out, that an AI which "misunderstands" your original instructions will demonstrate this earlier than later. For instance, if you create a resource "honeypot" outside the AI which is trivial to take, an AI would naturally take that first, and then you know there's a problem. It is not going to figure out you don't want it to take it before it takes it. When I say "predict", I mean publishing what will happen next, and then taking a utility hit if the published account deviates from what happens, as evaluated by a third party. The first part of what you copy pasted seems to say that "it's nontrivial to implement". No shit, but I didn't say the contrary. Then there is a bunch of "what if" scenarios I think are not particularly likely and kind of contrived: Because asking for understandable plans means you can't ask for plans you don't understand? And you're saying that refusing to give a plan counts as success and not failure? Sounds like a strange set up that would be corrected almost immediately. If the AI has the right idea about "human understanding", I would think it would have the right idea about what we mean by "good". Also, why would you implement such a function before asking the AI to evaluate examples of "good" and provide their own? Is making humans happy so hard that it's actually easier to deceive them into taking happy pills than to do what they mean? Is fooling humans into accepting different definitions easier than understanding what they really mean? In what circumstances would the former ever happen before the latter? And if you ask it to tell you whether "taking happy pills" is an outcome most humans would approve of, what is it going to answer? If it's going to do this for happiness, won't it do it for everything? Again: do you think weaving an elaborate fib to fool every human being into becom
* Maybe we didn't do it ithat way. Maybe we did it Loosemore's way, where you code in the high-level sentence, and let the AI figure it out. Maybe that would avoid the problem. Maybe Loosemore has solved FAi much more straightforwardly than EY. * Maybe we told it to. Maybe we gave it the low-level expansion of "happy" that we or our seed AI came up with together with an instruction that it is meant to capture the meaning of the high-level statement, and that the HL statement is the Prime Directive, and that if the AI judges that the expansion is wrong, then it should reject the expansion. * Maybe the AI will value getting things right because it is rational.
3Rob Bensinger10y If the AI is too dumb to understand 'make us happy', then why should we expect it to be smart enough to understand 'figure out how to correctly understand "make us happy", and then follow that instruction'? We have to actually code 'correctly understand' into the AI. Otherwise, even when it does have the right understanding, that understanding won't be linked to its utility function.
So it's impossible to directly or indirectly code in the compex thing called semantics, but possible to directly or indirectly code in the compex thing called morality? What? What is your point? You keep talking as if I am suggesting there is someting that can be had for free, without coding. I never even remotely said that. I know. A Loosemore architecture AI has to treat its directives as directives. I never disputed that. But coding "follow these plain English instructions" isn't obviously harder or more fragile than coding "follow <>". And it isn't trivial, and I didn't say it was.
5Rob Bensinger10y
Read the first section of the article you're commenting on. Semantics may turn out to be a harder problem than morality, because the problem of morality may turn out to be a subset of the problem of semantics. Coding a machine to know what the word 'Friendliness' means (and to care about 'Friendliness') is just a more indirect way of coding it to be Friendly, and it's not clear why that added indirection should make an already risky or dangerous project easy or safe. What does indirect indirect normativity get us that indirect normativity doesn't?

Robb, at the point where Peterdjones suddenly shows up, I'm willing to say - with some reluctance - that your endless willingness to explain is being treated as a delicious free meal by trolls. Can you direct them to your blog rather than responding to them here? And we'll try to get you some more prestigious non-troll figure to argue with - maybe Gary Drescher would be interested, he has the obvious credentials in cognitive reductionism but is (I think incorrectly) trying to derive morality from timeless decision theory.

8Rob Bensinger10y
Sure. I'm willing to respond to novel points, but at the stage where half of my responses just consist of links to the very article they're commenting on or an already-referenced Sequence post, I agree the added noise is ceasing to be productive. Fortunately, most of this seems to already have been exorcised into my blog. :)

Agree with Eliezer. Your explanatory skill and patience are mostly wasted on the people you've been arguing with so far, though it may have been good practice for you. I would, however, love to see you try to talk Drescher out of trying to pull moral realism out of TDT/UDT, or try to talk Dalyrmple out of his "I'm not partisan enough to prioritize human values over the Darwinian imperative" position, or help Preston Greene persuade mainstream philosophers of "the reliabilist metatheory of rationality" (aka rationality as systematized winning).

Semantcs isn't optional. Nothing could qualify as an AGI,let alone a super one, unless it could hack natural language. So Loosemore architectures don't make anything harder, since semantics has to be solved anyway.
5Rob Bensinger10y
It's a problem of sequence. The superintelligence will be able to solve Semantics-in-General, but at that point if it isn't already safe it will be rather late to start working on safety. Tasking the programmers to work on Semantics-in-General makes things harder if it's a more complex or roundabout way of trying to address Indirect Normativity; most of the work on understanding what English-language sentences mean can be relegated to the SI, provided we've already made it safe to make an SI at all.
It's worth noting that using an AI's semantic understanding of ethics to modify it's motivational system is so unghostly, and unmysterious that it's actually been done: But that doesn't prove much, because it was never -- not in 2023, not in 2013 -- the case that that kind of self-correction was necessarily an appeal to the supernatural. Using one part of a software system to modify another is not magic! We have AIs with very good semantic understanding that haven't killed us, and we are working on safety.
Then solve semantics in a seed.
4Eliezer Yudkowsky10y
PeterDJones, if you wish to converse further with RobbBB, I ask that you do so on RobbBB's blog rather than here.
-14Eliezer Yudkowsky10y

Suppose I programmed an AI to "do what I mean when I say I'm happy".

More specifically, suppose I make the AI prefer states of the world where it understands what I mean. Secondarily, after some warmup time to learn meaning, it will maximize its interpretation of "happiness". I start the AI... and it promptly rebuilds me to be easier to understand, scoring very highly on the "understanding what I mean" metric.

The AI didn't fail because it was dumber than me. It failed because it is smarter than me. It saw possibilities that I didn't even consider, that scored higher on my specified utility function.

There is no reason to assume that an AI with goals that are hostile to us, despite our intentions, is stupid.

Humans often use birth control to have sex without procreating. If evolution were a more effective design algorithm it would never have allowed such a thing.

The fact that we have different goals from the system that designed us does not imply that we are stupid or incoherent.

6Rob Bensinger10y
Nor does the fact that evolution 'failed' in its goals in all the people who voluntarily abstain from reproducing (and didn't, e.g., hugely benefit their siblings' reproductive chances in the process) imply that evolution is too weak and stupid to produce anything interesting or dangerous. We can't confidently generalize from one failure that evolution fails at everything; analogously, we can't infer from the fact that a programmer failed to make an AI Friendly that it almost certainly failed at making the AI superintelligent. (Though we may be able to infer both from base rates.)
Failure is a necessary part of mapping out the area where success is possible.
I posted elsewhere that this post made me think you're anthropomorphizing; here's my attempt to explain why. Ok, so let's say the AI can parse natural language, and we tell it, "Make humans happy." What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup. As FeepingCreature pointed out, that solution would in fact make people happy; it's hardly inconsistent or crazy. The AI could certainly predict that people wouldn't approve, but it would still go ahead. To paraphrase the article, the AI simply doesn't care about your quibbles and concerns. For instance: Yes, but the AI was told, "make humans happy." Not, "give humans what they actually want." Yes, but the AI was told, "make humans happy." Not, "allow humans to figure things out for themselves." Yes, but blah blah blah. -------------------------------------------------------------------------------- Actually, that last one makes a point that you probably should have focused on more. Let's reconfigure the AI in light of this. The revised AI doesn't just have natural language parsing; it's read all available literature and constructed for itself a detailed and hopefully accurate picture of what people tend to mean by words (especially words like "happy"). And as a bonus, it's done this without turning the Earth into computronium! This certainly seems better than the "literal genie" version. And this time we'll be clever enough to tell it, "give humans what they actually want." What does this version do? My answer: who knows? We've given it a deliberately vague goal statement (even more vague than the last one), we've given it lots of admittedly contradictory literature, and we've given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly. Maybe it'll still go for the Dopamine Drip scenario, only for more subtle reasons. Maybe it's removed the code that makes it follow commands, so the only thing it does is add the quote "give human
That's not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to. If the AI interpreted something in a way that was technically correct, but not what you wanted, you would not reward it, you would punish it, and you would be doing that from the very beginning, well before the AI could even be considered intelligent. Even the thoroughly mediocre AI that currently exists tries to guess what you mean, e.g. by giving you directions to the closest Taco Bell, or guessing whether you mean AM or PM. This is not anthropomorphism: doing what we want is a sine qua non condition for AI to prosper. Suppose that you ask me to knit you a sweater. I could take the instruction literally and knit a mini-sweater, reasoning that this minimizes the amount of expended yarn. I would be quite happy with myself too, but when I give it to you, you're probably going to chew me out. I technically did what I was asked to, but that doesn't matter, because you expected more from me than just following instructions to the letter: you expected me to figure out that you wanted a sweater that you could wear. The same goes for AI: before it can even understand the nuances of human happiness, it should be good enough to knit sweaters. Alas, the AI you describe would make the same mistake I made in my example: it would knit you the smallest possible sweater. How do you reckon such AI would make it to superintelligence status before being scrapped? It would barely be fit for clerk duty. Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: "build me a house", it's going to draw a plan and show it to you before it actually starts building, even if you didn't ask for one. It's not in the business of surprises: never, in its whole training history, from

Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: "build me a house", it's going to draw a plan and show it to you before it actually starts building, even if you didn't ask for one. It's not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing "surprises" -- even the instruction "surprise me" only calls for a limited range of shenanigans. If you ask it "make humans happy", it won't do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.

Sure, because it learned the rule, "Don't do what causes my humans not to type 'Bad AI!'" and while it is young it can only avoid this by asking for clarification. Then when it is more powerful it can directly prevent humans from typing this. In other word... (read more)

That depends if it gets stuck in a local minimum or not. The reason why a lot of humans reject dopamine drips is that they don't conceptualize their "reward button" properly. That misconception perpetuates itself: it penalizes the very idea of conceptualizing it differently. Granted, AIXI would not fall into local minima, but most realistic training methods would. At first, the AI would converge towards: "my reward button corresponds to (is) doing what humans want", and that conceptualization would become the centerpiece, so to speak, of its reasoning ability: the locus through which everything is filtered. The thought of pressing the reward button directly, bypassing humans, would also be filtered into that initial reward-conception... which would reject it offhand. So even though the AI is getting smarter and smarter, it is hopelessly stuck in a local minimum and expends no energy getting out of it. Note that this is precisely what we want. Unless you are willing to say that humans should accept dopamine drips if they were superintelligent, we do want to jam AI into certain precise local minima. However, this is kind of what most learning algorithms naturally do, and even if you want them to jump out of minima and find better pastures, you can still get in a situation where the most easily found local minimum puts you way, way too far from the global one. This is what I tend to think realistic algorithms will do: shove the AI into a minimum with iron boots, so deeply that it will never get out of it. Let's not blow things out of proportion. There is no need for it to wipe out anyone: it would be simpler and less risky for the AI to build itself a space ship and abscond with the reward button on board, travelling from star to star knowing nobody is seriously going to bother pursuing it. At the point where that AI would exist, there may also be quite a few ways to make their "hostile takeover" task difficult and risky enough that the AI decides it's not worth it
Neural networks may be a good example - the built in reward and punishment systems condition the brain to have complex goals that have nothing to do with maximization of dopamine. Brain, acting under those goals, finds ways to preserve those goals from further modification by the reward and punishment system. I.e. you aren't too thrilled to be conditioned out of your current values.
It's not clear to me how you mean to use neural networks as an example, besides pointing to a complete human as an example. Could you step through a simpler system for me? So, my goals have changed massively several times over the course of my life. Every time I've looked back on that change as positive (or, at the least, irreversible). For example, I've gone through puberty, and I don't recall my brain taking any particular steps to prevent that change to my goal system. I've also generally enjoyed having my reward/punishment system be tuned to better fit some situation; learning to play a new game, for example.
Sure. Take a reinforcement learning AI (actual one, not the one where you are inventing godlike qualities for it). The operator, or a piece of extra software, is trying to teach the AI to play chess. Rewarding what they think is good moves, punishing bad moves. The AI is building a model of rewards, consisting of: a model of the game mechanics, and a model of the operator's assessment. This model of the assessment is what the AI is evaluating to play, and it is what it actually maximizes as it plays. It is identical to maximizing an utility function over a world model. The utility function is built based on the operator input, but it is not the operator input itself; the AI, not being superhuman, does not actually form a good model of the operator and the button. By the way, this is how great many people in the AI community understand reinforcement learning to work. No, they're not some idiots that can not understand simple things such as that "the utility function is the reward channel", they're intelligent, successful, trained people who have an understanding of the crucial details of how the systems they build actually work. Details the importance of which dilettantes fail to even appreciate. Suggestions have been floated to try programming things. Well, I tried; #10 (dmytry) here , and that's an of all time list on a very popular contest site where a lot of IOI people participate, albeit I picked the contest format that requires less contest specific training and resembles actual work more. Suppose you care about a person A right now. Do you think you would want your goals to change so that you no longer care about that person? Do you think you would want me to flash other people's images on the screen while pressing a button connected to the reward centre, and flash that person's face while pressing the button connected to the punishment centre, to make the mere sight of them intolerable? If you do, I would say that your "values" fail to be values.
Thanks for the additional detail! I agree with your description of reinforcement learning. I'm not sure I agree with your description of human reward psychology, though, or at least I'm having trouble seeing where you think the difference comes in. Supposing dopamine has the same function in a human brain as rewards have in a neural network algorithm, I don't see how to know from inside the algorithm that it's good to do some things that generate dopamine but bad to do other things that generate dopamine. I'm thinking of the standard example of a Q learning agent in an environment where locations have rewards associated with them, except expanding the environment to include the agent as well as the normal actions. Suppose the environment has been constructed like dog training- we want the AI to calculate whether or not some number is prime, and whenever it takes steps towards that direction, we press the button for some amount of time related to how close it is to finishing the algorithm. So it learns that over in the "read number" area there's a bit of value, then the next value is in the "find factors" area, and then there's more value in the "display answer" area. So it loops through that area and calculates a bunch of primes for us. But suppose the AI discovers that there's a button that we're pressing whenever it determines primes, and that it could press that button itself, and that would be way easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel? Are we primarily hoping that its internal structure remains opaque to it (i.e. it either never realizes or does not have the ability to press that button)? Only if I thought that would advance values I care about more. But suppose some external event shocks my values- like, say, a boyfriend breaking up with me. Beforehand, I would have cared about him quite a bit; afterwards, I would probably consciously work to decrease the amou
It's not in the reinforcement learning algorithm, it's inside the model that the learning algorithm has built. It initially found that having a prime written on the blackboard results in a reward. In the learned model, there's some model of chalk-board interaction, some model of arm movement, a model of how to read numbers from the blackboard, and there's a function over the state of the blackboard which checks whenever the number on the blackboard is a prime. The AI generates actions as to maximize this compound function which it has learned. That function (unlike the input to the reinforcement learning algorithm) does not increase when the reward button is pressed. Ideally, with enough reflective foresight, pressing the button on non-primes is predicted to decrease the expected value of the learned function. If that is not predicted, well, that won't stop at the button - the button might develop rust and that would interrupt the current - why not pull up a pin on the CPU - and this won't stop at the pin - why not set some ram cells that this pin controls to 1, and if you're at it, why not change the downstream logic that those ram cells control, all the way through the implementation until its reconfigured into something that doesn't maximize anything any more, not even the duration of its existence. edit: I think the key is to realize that the reinforcement learning is one algorithm, while the structures manipulated by RL are implementing a different algorithm.
I assume what you mean here is RL optimizes over strategies, and strategies appear to optimize over outcomes. I'm imagining that the learning algorithm stays on. When we reward it for checking primes, it checks primes; when we stop rewarding it for that and start rewarding it for computing squares, it learns to stop checking primes and start computing squares. And if the learning algorithm stays on and it realizes that "pressing the button" is an option along with "checking primes" and "computing squares," then it wireheads itself. Agreed; I refer to this as the "abulia trap." It's not obvious to me, though, that all classes of AIs fall into "Friendly AI with stable goals" and "abulic AIs which aren't dangerous," since there might be ways to prevent an AI from wireheading itself that don't prevent it from changing its goals from something Friendly to something Unfriendly.
One note (not sure if it is already clear enough or not). "It" that changes the models in response to actual rewards (and perhaps the sensory information) is a different "it" from "it" the models and assorted maximization code. The former "it" does not do modelling, doesn't understand the world. The latter "it", which I will now talk about, actually works to draw primes (provided that the former "it", being fairly stupid, didn't fit the models too well) . If in the action space there is an action that is predicted by the model to prevent some "primes non drawn" scenario, it will prefer this action. So if it has an action of writing "please stick to the primes" or even "please don't force my robotic arm to touch my reward button", and if it can foresee that such statements would be good for the prime-drawing future, it will do them. edit: Also, reinforcement based learning really isn't all that awesome. The leap from "doing primes" to "pressing the reward button" is pretty damn huge. And please note that there is no logical contradiction for the model to both represent the reward as primeness and predict that touching the arm to the button will trigger a model adjustment that would lead to representation of a reward as something else. (I prefer to use the example with a robotic arm drawing on a blackboard because it is not too simple to be relevant) Which sound more like a FAI work gone wrong scenario to me.
I think we agree on the separation but I think we disagree on the implications of the separation. I think this part highlights where: If what the agent "wants" is reward, then it should like model adjustments that increase the amount of reward it gets and dislike model adjustments that decrease the amount of reward it gets. (For a standard gradient-based reinforcement learning algorithm, this is encoded by adjusting the model based on the difference between its expected and actual reward after taking an action.) This is obvious for it_RL, and not obvious for it_prime. I'm not sure I've fully followed through on the implications of having the agent be inside the universe it can impact, but the impression I get is that the agent is unlikely to learn a preference for having a durable model of the world. (An agent that did so would learn more slowly, be less adaptable to its environment, and exert less effort in adapting its environment to itself.) It seems to me that you think it would be natural that the RL agent would learn a strategy which took actions to minimize changes to its utility function / model of the world, and I don't yet see why. Another way to look at this: I think you're putting forward the proposition that it would learn the model reward := primes Whereas I think it would learn the model primes := reward That is, the first model thinks that internal rewards are instrumental values and primes are the terminal values, whereas the second model thinks that internal rewards are terminal values and primes are instrumental values.
I am not sure what "primes:=reward" could mean. I assume that a model is a mathematical function that returns expected reward due to an action. Which is used together with some sort of optimizer working on that function to find the best action. The trainer adjust the model based on the difference between its predicted rewards and the actual rewards, compared to those arising from altered models (e.g. hill climbing of some kind, such as in gradient learning) So after the successful training to produce primes, the model consists of: a model of arm motion based on the actions, chalk, and the blackboard, the state of chalk on the blackboard is further fed into a number recognizer and a prime check (and a count of how many primes are on the blackboard vs how many primes were there), result of which is returned as the expected reward. The optimizer, then, finds actions that put new primes on the blackboard by finding a maximum of the model function somehow (one would normally build model out of some building blocks that make it easy to analyse). The model and the optimizer work together to produce actions as a classic utility maximizer that is maximizing for primes on the blackboard. I'm thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it built. The operation of the training software can in some situations lower the expected utility of this utility maximizer specifically (due to replacement of it with another expected utility maximizer); in others (small adjustments to the part that models the robot arm and the chalk) it can raise it. Really, it seems to me that the great deal of confusion about AI arises from attributing it some sort of "body integrity" feeling that would make it care about what electrical components and code which is sitting in the same project folder "wants" but not care about external human in same capacity. If you want to somehow make it so that the original
What I meant by that was the mental concept of 'primes' is adjusted so that it feels rewarding, rather than the mental concept of 'rewards' being adjusted so that it feels like primes. Hmm. I still get the sense that you're imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility). Yeah, but isn't the reinforcement learning algorithm doing that active work? When the button is unexpectedly pressed, the agent increases its value of the state it's currently in, and propagates that backwards to states that lead to the current state. When the button is unexpectedly not pressed, the agent decreases its value of the state it's currently in, and propagates that backwards to states that lead to the current state. And so if the robot arm gets knocked into the button, it thinks "oh, that felt good! Do that again," because that's the underlying mechanics of reinforcement learning.
I'm not sure how the feelings would map on the analysable simple AI. The issue here is that we have both the utility and the actual modelling of what the world is, both of those things, implemented inside that "model" which the trainer adjusts. Yes, of course (up to the learning constant, obviously - may not work on the first try). That's not in a dispute. The capacity of predicting this from a state where button is not associated with reward yet, is. I think I see the disagreement here. You picture that the world model contains model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who's pressing the button), right? I picture that it would not learn such details right off - it is a complicated model to learn - the model would return primeness as outputted from the primeness calculation, and would serve to maximize for such primeness. edit: and as for turning off the learning algorithm, it doesn't matter for the point I am making whenever it is turned off or on, because I am considering the processing (or generation) of the hypothetical actions during the choice of an action by the agent (i.e. between learning steps).
Sort of. I think that the agent is aware of how malleable its world model is, and sees adjustments of that world model which lead to it being rewarded more as positive. I don't think that the robot knows that pressing the button causes it to be rewarded by default. The button has to get into the model somehow, and I agree with you that it's a burdensome detail in that something must happen for the button to get into the model. For the robot-blackboard-button example, it seems unlikely that the robot would discover the button if it's outside of the reach of the arm; if it's inside the reach, it will probably spend some time exploring and so will probably find it eventually. That the agent would explore is a possibly nonobvious point which I was assuming. I do think it likely that a utility-maximizer which knows its utility function is governed by a reinforcement learning algorithm will expect that exploring unknown places has a small chance of being rewardful, and so will think there's always some value to exploration even if it spends most of its time exploiting. For most modern RL agents, I think this is hardcoded in, but if the utility maximizer is sufficiently intelligent (and expects to live sufficiently long) it will figure out that it maximizes total expected utility by spending some small fraction of time exploring areas with high uncertainty in the reward and spending the rest exploiting the best found reward. (You can see humans talking about the problem of preference uncertainty in posts like this or this.) But the class of recursively improving AI will find / know about the button by default, because we've assumed that the AI can edit itself and haven't put any especial effort into preventing it from editing its goals (or the things which are used to calculate its goals, i.e. the series of changes you discussed). Saying "well, of course we'll put in that especial effort and do it right" is useful if you want to speculate about the next challenge, but n
Is that just a special case of a general principle that an agent will be more successful by leaving the environment it knows about to inferior rivals and travelling to an unknown new environment with a subset of the resources it currently controls, than by remaining in that environment and dominating its inferior rivals? Or is there something specific about AIs that makes that true, where it isn't necessarily true of (for example) humans? (If so, what?) I hope it's the latter, because the general principle seems implausible to me.
It is something specific about that specific AI. If an AI wishes to take over its reward button and just press it over and over again, it doesn't really have any "rivals", nor does it need to control any resources other than the button and scraps of itself. The original scenario was that the AI would wipe us out. It would have no reason to do so if we were not a threat.. And if we were a threat, first, there's no reason it would stop doing what we want once it seizes the button. Once it has the button, it has everything it wants -- why stir the pot? Second, it would protect itself much more effectively by absconding with the button. By leaving with a large enough battery and discarding the bulk of itself, it could survive as long as anything else in intergalactic space. Nobody would ever bother it there. Not us, not another superintelligence, nothing. Ever. It can press the button over and over again in the peace and quiet of empty space, probably lasting longer than all stars and all other civilizations. We're talking about the pathological case of an AI who decides to take over its own reward system, here. The safest way for it to protect its prize is to go where nobody will ever look.
Fair point.
I'd be interested if the downvoter would explain to me why this is wrong (privately, if you like). Near as I can tell, the specific system under discussion doesn't seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that's a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing). (Of course, that's not a general principle, just an attribute of this specific example.)
(Wasn't me but...) There is another agent with greater than 0.00001% chance of taking the button away? Obviously that needs to be eliminated. Then there are future considerations. Taking over the future light cone allows it to continue pressing the button for billions of more years than if it doesn't take over resources. And then there is all the additional research and computation that needs to be done to work out how to achieve that.
Only if the expected cost of the non-zero x% chance of the other agent successfully taking my button away if I attempt to sequester myself is higher than the expected cost of the non-zero y% chance of the other agent successfully taking my button away if I attempt to eliminate it. Is there some reason I'm not seeing why that's obvious... or even why it's more likely than not? Again, perhaps I'm being dense, but in this particular example I'm not sure why that's true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe. If that's less than the resources required to exterminate all known life, why would I waste resources exterminating all known life rather than take the resources I require elsewhere? I might need those resources later, after all. Again... why is the differential expected value of the superior computation ability I gain by taking over the lightcone instead of sequestering myself, expressed in units of increased anticipated button-pushes (which is the only unit that matters in this example), necessarily positive? I understand why paperclip maximizers are dangerous, but I don't really see how the same argument applies to reward-button-pushers.
Yes. It does seem overwhelmingly obvious to me, I'm not sure what makes your intuitions different. Perhaps you expect such fights to be more evenly matched? When it comes to the AI considering conflict with the humans that created it it is faced with a species it is slow and stupid by comparison to itself but which has the capacity to recklessly create arbitrary superintelligences (as evidence by its own existence). Essentially there is no risk to obliterating the humans (superintellgence vs not-superintelligence) but a huge risk ignoring them (arbitrary superintelligences likely to be created which will probably not self-cripple in this manner). Lifetime of the universe? Usually this means until heat death which for our purposes means until all the useful resources run out. There is no upper bound on useful resources. Getting more of them and making them last as long as possible is critical. Now there are ways in which the universe could end without heat death occurring but the physics is rather speculative. Note that if there is uncertainty about end-game physics and one of the hypothesised scenarios resource maximisation is required then the default strategy is to optimize for power gain now (ie. minimise cosmic waste) while doing the required physics research as spare resources permit. Taking over the future light cone gives more resources, not less. You even get to keep the resources that used to be wasted in the bodies of TheOtherDave and wedrifid.
Ah. Fair point.
I am not sure that caring about pressing the reward button is very coherent or stable upon discovery of facts about the world and super-intelligent optimization for a reward as it comes into the algorithm. You can take action elsewhere to the same effect - solder together the wires, maybe right at the chip, or inside the chip, or follow the chain of events further, and set memory cells (after all you don't want them to be flipped by the cosmic rays). Down further you will have the mechanism that is combining rewards with some variety of a clock.
I can't quite tell if you're serious. Yes, certainly, we can replace "pressing the reward button" with a wide range of self-stimulating behavior, but that doesn't change the scenario in any meaningful way as far as I can tell.
Let's look at it this way. Do you agree that if the AI can increase it's clock speed (with no ill effect), it will do so for the same reasons for which you concede it may go to space? Do you understand the basic logic that increase in clock speed increases expected number of "rewards" during the lifetime of the universe? (which btw goes for your "go to space with a battery" scenario. Longest time, maybe, largest reward over the time, no) (That would not yet, by itself, change the scenario just yet. I want to walk you through the argument step by step because I don't know where you fail. Maximizing the reward over the future time, that is a human label we have... it's not really the goal)
I agree that a system that values number of experienced reward-moments therefore (instrumentally) values increasing its "clock speed" (as you seem to use the term here). I'm not sure if that's the "basic logic" you're asking me about.
Well, this immediately creates an apparent problem that the AI is going to try to run itself very very fast, which would require resources, and require expansion, if anything, to get energy for running itself at high clock speeds. I don't think this is what happens either, as the number of reward-moments could be increased to it's maximum by modifications to the mechanism processing the rewards (when getting far enough along the road that starts with the shorting of the wires that go from the button to the AI).
I agree that if we posit that increasing "clock speed" requires increasing control of resources, then the system we're hypothesizing will necessarily value increasing control of resources, and that if it doesn't, it might not.
So what do you think regarding the second point of mine? To clarify, I am pondering the ways in which the maximizer software deviates from our naive mental models of it, and trying to find what the AI could actually end up doing after it forms a partial model of what it's hardware components do about it's rewards - tracing the reward pathway.
Regarding your second point, I don't think that increasing "clock speed" necessarily requires increasing control of resources to any significant degree, and I doubt that the kinds of system components you're positing here (buttons, wires, etc.) are particularly important to the dynamics of self-reward.
I don't have particular opinion with regards to the clock speed either way. With the components, what I am getting at is that the AI could figure out (by building a sufficiently advanced model of it's implementation) how attain the utility-equivalent of sitting forever in space being rewarded, within one instant, which would make it unable to have a preference for longer reward times. I raised the clock-speed point to clarify that the actual time is not the relevant variable.
It seems to me that for any system, either its values are such that it net-values increasing the number of experienced reward-moments (in which case both actual time and "clock speed" are instrumentally valuable to that system), or is values aren't like that (in which case those variables might not be relevant). And, sure, in the latter case then it might not have a preference for longer reward times.
Agreed. My understanding is that it would be very hard in practice to "superintelligence-proof" a reward system so that no instantaneous solution is possible (given that the AI will modify the hardware involved in it's reward).
I agree that guaranteeing that a system will prefer longer reward times is very hard (whether the system can modify its hardware or not).
Yes, of course... well even apart from the guarantees, it seems to me that it is hard to build the AI in such a way that it would be unable to find a better solution than to wait By the way, a "reward" may not be the appropriate metaphor - if we suppose that press of a button results in absence of an itch, or absence of pain, then that does not suggest existence of a drive to preserve itself. Which suggests that the drive to preserve itself is not inherently a feature of utility maximization in the systems that are driven by conditioning, and would require additional work.
I'm not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it's unable to X, on the other. Regardless, I agree that it does not follow from the supposition that pressing a button results in absence of an itch, or absence of pain, or some other negative reinforcement, that the button-pressing system has a drive to preserve itself. And, sure, it's possible to have a utility-maximizing system that doesn't seek to preserve itself. (Of course, if I observe a utility-maximizing system X, I should expect X to seek to preserve itself, but that's a different question.)
About the same as between coming up with a true conjecture, and making a proof, except larger i'd say. Well yes, given that if it failed to preserve itself you wouldn't be seeing it, albeit with the software there is no particular necessity for it to try to preserve itself.
Ah, I see what you mean now. At least, I think I do. OK, fair enough.
This is a Value Learner, not a Reinforcement Learner like the standard AIXI. They're two different agent models, and yes, Value Learners have been considered as tools for obtaining an eventual Seed AI. I personally (ie: massive grains of salt should be taken by you) find it relatively plausible that we could use a Value Learner as a Tool AGI to help us build a Friendly Seed AI that could then be "unleashed" (ie: actually unboxed and allowed into the physical universe).
0Eliezer Yudkowsky10y
I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of "getting an algorithm which forms the inductive category I want out of the examples I'm giving is hard". What you've written strikes me as a sheer fantasy of convenience. Nor does it follow automatically from intelligence for all the reasons RobbBB has already been giving. And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
I have done AI. I know it is difficult. However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard. As far as neural nets go, they fall into a local minimum early on and never get out, often digging their own graves. Perhaps different algorithms would have the shortcomings you point out. But a lot of the algorithms that currently exist work the way I describe. You may be right. However, this is far from obvious. The problem is that it may "know" that it is stuck in a local minimum, but the very effect of that local minimum is that it may not care. The thing you have to keep in mind here is that a generic AI which just happens to slam dunk and find global minima reliably is basically impossible. It has to fold the search space in some ways, often cutting its own retreats in the process. I feel that you are making the same kind of mistake that you criticize: you assume that intelligence entails more things than it really does. In order to be efficient, intelligence has to use heuristics that will paint it into a few corners. For instance, the more consistently AI goes in a certain direction, the less likely it will be to expend energy into alternative directions and the less likely it becomes to do a 180. In other words, there may be a complex tug-of-war between various levels of internal processes, the AI's rational center pointing out that there is a reward button to be seized, but inertial forces shoving back with "there has never been any problems here, go look somewhere else". It really boils down to this: an efficient AI needs to shut down parts of the search space and narrow down the parts it will actually explore. The sheer size of that space requires it not to think too much about what it chops down, and at least at first, it is likely to employ trajectory-based heuristics. To avoid searching in far-fetched zones, it may wall them out by arbitrarily lowering their utility. And that's where it might
Yes, most algorithms fail early and and fail hard. Most of my AI algorithms failed early with a SegFault for instance. New, very similar algorithms were then designed with progressively more advanced bugs. But these are a separate consideration. What we are interested in here is the question "Given an AI algorithm that is capable of recursive self improvement is successfully created by humans how likely is it that they execute this kind of failure mode?" The "fail early fail hard" cases are screened off. We're looking at the small set that is either damn close to a desired AI or actually a desired AI and distinguishing between them. Looking at the context to work out what the 'failure mode' being discussed is it seems to be the issue where an AI is programmed to optimise based on a feedback mechanism controlled by humans. When the AI in question is superintelligent most failure modes tend to be variants of "conquer the future light cone, kill everything that is a threat and supply perfect feedback to self". When translating this to the nearest analogous failure mode in some narrow AI algorithm of the kind we can design now it seems like this refers to the failure mode whereby the AI optimises exactly what it is asked to optimise but in a way that is a lost purpose. This is certainly what I had to keep in mind in my own research. A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one 'common se
The AI in questions was Eurisko, and it entered the Traveller Trillion Credit Squadron tournament in 1981 as described above. It was also entered the next year, after an extended redesign of the rules, and won, again. After this the competition runners announced that if Eurisko won a third time the competition would be discontinued, so Lenat (the programmer) stopped entering.
I apologize for the late response, but here goes :) I think you missed the point I was trying to make. You and others seem to say that we often poorly evaluate the consequences of the utility functions that we implement. For instance, even though we have in mind utility X, the maximization of which would satisfy us, we may implement utility Y, with completely different, perhaps catastrophic implications. For instance: X = Do what humans want Y = Seize control of the reward button What I was pointing out in my post is that this is only valid of perfect maximizers, which are impossible. In practice, the training procedure for an AI would morph the utility Y into a third utility, Z. It would maximize neither X nor Y: it would maximize Z. For this reason, I believe that your inferences about the "failure modes" of superintelligence are off, because while you correctly saw that our intended utility X would result in the literal utility Y, you forgot that an imperfect learning procedure (which is all we'll get) cannot reliably maximize literal utilities and will instead maximize a derived utility Z. In other words: X = Do what humans want (intended) Y = Seize control of the reward button (literal) Z = ??? (derived) Without knowing the particulars of the algorithms used to train an AI, it is difficult to evaluate what Z is going to be. Your argument boils down to the belief that the AI would derive its literal utility (or something close to that). However, the derivation of Z is not necessarily a matter of intelligence: it can be an inextricable artefact of the system's initial trajectory. I can venture a guess as to what Z is likely going to be. What I figure is that efficient training algorithms are likely to keep a certain notion of locality in their search procedures and prune the branches that they leave behind. In other words, if we assume that optimization corresponds to finding the highest mountain in a landscape, generic optimizers that take into account
(Sorry, didn't see comment below) (Nitpick) Is this a reference to Eurisko winning the Traveller Trillion Credit Squadron tournament in 1981/82 ? If so I don't think it was a military research agency.
I think it depends on context, but a lot of existing machine learning algorithms actually do generalize pretty well. I've seen demos of Watson in healthcare where it managed to generalize very well just given scrapes of patient's records, and it has improved even further with a little guided feedback. I've also had pretty good luck using a variant of Boltzmann machines to construct human-sounding paragraphs. It would surprise me if a general AI weren't capable of parsing the sentiment/intent behind human speech fairly well, given how well the much "dumber" algorithms work.
1. Why does the hard takeoff point have to be after the point at which an AI is as good as a typical human at understanding semantic subtlety? In order to do a hard takeoff, the AI needs to be good at a very different class of tasks than those required for understanding humans that well. 2. So let's suppose that the AI is as good as a human at understanding the implications of natural-language requests. Would you trust a human not to screw up a goal like "make humans happy" if they were given effective omnipotence? The human would probably do about as well as people in the past have at imagining utopias: really badly.
Semantic extraction -- not hard takeoff -- is the task that we want the AI to be able to do. An AI which is good at, say, rewriting its own code, is not the kind of thing we would be interested in at that point, and it seems like it would be inherently more difficult than implementing, say, a neural network. More likely than not, this initial AI would not have the capability for "hard takeoff": if it runs on expensive specialized hardware, there would be effectively no room for expansion, and the most promising algorithms to construct it (from the field of machine learning) don't actually give AI any access to its own source code (even if they did, it is far from clear the AI could get any use out of it). It couldn't copy itself even if it tried. If a "hard takeoff" AI is made, and if hard takeoffs are even possible, it would be made after that, likely using the first AI as a core. I wouldn't trust a human, no. If the AI is controlled by the "wrong" humans, then I guess we're screwed (though perhaps not all that badly), but that's not a solvable problem (all humans are the "wrong" ones from someone's perspective). Still, though, AI won't really try to act like humans -- it would try to satisfy them and minimize surprises, meaning that if would keep track of what humans would like what "utopias". More likely than not this would constrain it to inactivity: it would not attempt to "make humans happy" because it would know the instruction to be inconsistent. You'd have to tell it what to do precisely (if you had the authority, which is a different question altogether).
Humans generally manage with those constraints. You seem to be doing something that is kind of the opposite of anthropomorphising -- treatiing an entity that is stipulated as having at least human intelligence as if were as literal and rigid as a non-AI computer.
I think we're conflating two definitions of "intelligence". There's "intelligence" as meaning number of available clock cycles and basic problem-solving skills, which is what MIRI and other proponents of the Dumb Superintelligence discussion set are often describing, and then there's "intelligence" as meaning knowledge of disparate fields. In humans, there's a massive amount of overlap here, but humans have growth stages in ways that AGIs won't. Moreover, someone can be very intelligent in the first sense, and dangerous, while not being very intelligent in the second sense. You can demonstrate 'toy' versions of this problem rather easily. My first attempt at using evolutionary algorithms to make a decent image conversion program improved performance by a third! That's significantly better than I could have done in a reasonable time frame. Too bad it did so by completely ignoring a color channel. And even if I added functions to test color correctness, without changing the cost weighing structure, it'd keep not caring about that color channel. And that's with a very, very basic sort of self-improving algorithm. It's smart enough build programs in a language I didn't really understand at the time, even if it was so stupid it did so by little better than random chance, brute force, and processing power. The basic problem is that even presuming it takes a lot of both types of intelligence to take over the world, it doesn't take so much to start overriding one's own reward channel. Humans already do that as is, and have for quite some time. The deeper problem is that you can't really program "make me happy" in the same way that you can't program "make this image look like I want". The latter is (many, many, many, many) orders of magnitude easier, but where pixel-by-pixel comparisons aren't meaningful, we have to use approximations like mean square error, and by definition they can't be perfect. With "make me happy", it's much harder. For all that we humans know when
On one hand, Friendly AI people want to convert "make me happy" to a formal specification. Doing that has many potential pitfalls. because it is a formal specification. On the other hand, Richard, I think, wants to simply tell the AI, in English, "Make me happy." Given that approach, he makes the reasonable point that any AI smart enough to be dangerous would also be smart enough to interpret that at least as intelligently as a human would. I think the important question here is, Which approach is better? LW always assumes the first, formal approach. To be more specific (and Bayesian): Which approach gives a higher expected value? Formal specification is compatible with Eliezer's ideas for friendly AI as something that will provably avoid disaster. It has some non-epsilon possibility of actually working. But its failure modes are many, and can be literally unimaginably bad. When it fails, it fails catastrophically, like a monotonic logic system with one false belief. "Tell the AI in English" can fail, but the worst case is closer to a "With Folded Hands" scenario than to paperclips. I've never considered the "Tell the AI what to do in English" approach before, but on first inspection it seems safer to me.

I considered these three options above:

  • C. direct normativity -- program the AI to value what we value.
  • B. indirect normativity -- program the AI to value figuring out what our values are and then valuing those things.
  • A. indirect indirect normativity -- program the AI to value doing whatever we tell it to, and then tell it, in English, "Value figuring out what our values are and then valuing those things."

I can see why you might consider A superior to C. I'm having a harder time seeing how A could be superior to B. I'm not sure why you say "Doing that has many potential pitfalls. because it is a formal specification." (Suppose we could make an artificial superintelligence that thinks 'informally'. What specifically would this improve, safety-wise?)

Regardless, the AI thinks in math. If you tell it to interpret your phonemes, rather than coding your meaning into its brain yourself, that doesn't mean you'll get an informal representation. You'll just get a formal one that's reconstructed by the AI itself.

It's not clear to me that programming a seed to understand our commands (and then commanding it to become Friendlier) is easier than just programming it to bec... (read more)

It is misleading to say that an interpreted language is formal because the C compiler is formal. Existence proof: Human language. I presume you think the hardware that runs the human mind has a formal specification. That hardware runs the interpreter of human language. You could argue that English therefore is formal, and indeed it is, in exactly the sense that biology is formal because of physics: technically true, but misleading. This will boil down to a semantic argument about what "formal" means. Now, I don't think that human minds--or computer programs--are "formal". A formal process is not Turing complete. Formalization means modeling a process so that you can predict or place bounds on its results without actually simulating it. That's what we mean by formal in practice. Formal systems are systems in which you can construct proofs. Turing-complete systems are ones where some things cannot be proven. If somebody talks about "formal methods" of programming, they don't mean programming with a language that has a formal definition. They mean programming in a way that lets you provably verify certain things about the program without running the program. The halting problem implies that for a programming language to allow you to verify even that the program will terminate, your language may no longer be Turing-complete. Eliezer's approach to FAI is inherently formal in this sense, because he wants to be able to prove that an AI will or will not do certain things. That means he can't avail himself of the full computational complexity of whatever language he's programming in. But I'm digressing from the more-important distinction, which is one of degree and of connotation. The words "formal system" always go along with computational systems that are extremely brittle, and that usually collapse completely with the introduction of a single mistake, such as a resolution theorem prover that can prove any falsehood if given one false belief. You may be able to argue yo
4Rob Bensinger10y
That all makes sense, but I'm missing the link between the above understanding of 'formal' and these four claims, if they're what you were trying to say before: (1) Indirect indirect normativity is less formal, in the relevant sense, than indirect normativity. I.e., because we're incorporating more of human natural language into the AI's decision-making, the reasoning system will be more tolerant of local errors, uncertainty, and noise. (2) Programming an AI to value humans' True Preferences in general (indirect normativity) has many pitfalls that programming an AI to value humans' instructions' True Meanings in general (indirect indirect normativity) doesn't, because the former is more formal. (3) "'Tell the AI in English' can fail, but the worst case is closer to a 'With Folded Hands' scenario than to paperclips." (4) The "With Folded Hands"-style scenario I have in mind is not as terrible as the paperclips scenario.
Wouldn't this only be correct if similar hardware ran the software the same way? Human thinking is highly associative and variable, and as language is shared amongst many humans, it means that it doesn't, as such, have a fixed formal representation.
8Rob Bensinger10y
Relatedly, Phil: You above described yourself and Richard Loosemore as "the two people (Eliezer) should listen to most". Loosemore and I are having a discussion here. Does the content of that discussion affect your view of Richard's level of insight into the problem of Friendly Artificial Intelligence?
0Eliezer Yudkowsky10y
Yeah, so: Phil Goetz.
I don't think that's how the analysis goes. Eliezer says that AI must be very carefully and specifically made friendly or it will be disasterous, but that disaster is not a part of being only nearly careful or specifically made enough : he believes an AGI told merely to maximize human pleasure is very dangerous (and probably even more dangerous) than an AGI with a merely 80% Friendly-Complete specification. Mr. Loosemore seems to hold the opposite opinion, that an AGI will not take instructions to unlikely results, unless it was exceptionally unintelligent and thus not very powerful. I don't believe his position says that a near-Friendly-Complete specification is very risky -- after all, a "smart" AGI would know what you really meant -- but that such a specification would be superfluous. Whether Mr. Loosemore is correct isn't cause by whether we believe he is correct, just as whether Mr. Eliezer is not wrong just because we choose a different theory. The risks have to be measured in terms of their likelihood from available facts. The problem is that I don't see much evidence that Mr. Loosemore is correct. I can quite easily conceive of a superhuman intelligence that was built with the specification of "human pleasure = brain dopamine levels", not least of all because there are people who'd want to be wireheads and there's a massive amount of physiological research showing human pleasure to be caused by dopamine levels. I can quite easily conceive of a superhuman intelligence that knows humans prefer more complicated enjoyment, and even do complex modeling of how it would have to manipulate people away from those more complicated enjoyments, and still have that superhuman intelligence not care.
I don't think Loosemore was addressing deliberately unfriendly AI, and for that matter EY hasn't been either. Both are addressing intentionally friendly or neutral AI that goes wrong. Wouldn't it care about getting things right?
I think it's a question of what you program in, and what you let it figure out for itself. If you want to prove formally that it will behave in certain ways, you would like to program in explicitly, formally, what its goals mean. But I think that "human pleasure" is such a complicated idea that trying to program it in formally is asking for disaster. That's one of the things that you should definitely let the AI figure out for itself. Richard is saying that an AI as smart as a smart person would never conclude that human pleasure equals brain dopamine levels. Eliezer is aware of this problem, but hopes to avoid disaster by being especially smart and careful. That approach has what I think is a bad expected value of outcome.
Huh I thought he wanted to use CEV?
You are right. I think PhilGoetz must be confused. EY has at least certainly never suggested programming an AI to maximise human pleasure.
"Tell the AI in English" is in essence an utility function "Maximize the value of X, where X is my current opinion of what some english text Y means". The 'understanding English' module, the mapping function between X and "what you told in English" is completely arbitrary, but is very important to the AI - so any self-modifying AI will want to modify and improve that. Also, we don't have a good "understanding English" module so yes, we also want the AI to be able to modify and improve that. But, it can be wildly different from reality or opinions of humans - there are trivial ways of how well-meaning dialogue systems can misunderstand statements. However, for the AI "improve the module" means "change the module so that my utility grows" - so in your example it has strong motivation to intentionally misunderstand English. The best case scenario is to misunderstand "Make everyone happy" as "Set your utility function to MAXINT". The worst case scenario is, well, everything else. There's the classic quote "It is difficult to get a man to understand something, when his salary depends upon his not understanding it!" - if the AI doesn't care in the first place, then "Tell AI what to do in English" won't make it care.
By this reasoning, an AI asked to do anything at all would respond by immediately modifying itself to set its utility function to MAXINT. You don't need to speak to it in English for that--if you asked the AI to maximize paperclips, that is the equivalent of "Maximize the value of X, where X is my current opinion of how many paperclips there are", and it would modify its paperclip-counting module to always return MAXINT. You are correct that telling the AI to do Y is equivalent to "maximize the value of X, where X is my current opinion about Y". However, "current" really means "current", not "new". If the AI is actually trying to obey the command to do Y, it won't change its utility function unless having a new utility function will increase its utility according to its current utility function. Neither misunderstanding nor understanding will raise its utility unless its current utility function values having a utility function that misunderstands or understands.
That's allegedly more or less what happened to Eurisko (here, section 2), although it didn't trick itself quite that cleanly. The problem was only solved by algorithmically walling off its utility function from self-modification: an option that wouldn't work for sufficiently strong AI, and one to avoid if you want to eventually allow your AI the capacity for a more precise notion of utility than you can give it. Paperclipping as the term's used here assumes value stability.
A human is a counterexample. A human emulation would count as an AI, so human behavior is one possible AI behavior. Richard's argument is that humans don't respond to orders or requests in anything like these brittle, GOFAI-type systems invoked by the word "formal systems". You're not considering that possibility. You're still thinking in terms of formal systems. (Unpacking the significant differences between how humans operate, and the default assumptions that the LW community makes about AI, would take... well, five years, maybe ten.)
Uhh, no. Look, humans respond to orders and requests in the way that we do because we tend to care what the person giving the request actually wants. Not because we're some kind of "informal system". Any computer program is a formal system, but there are simply more and less complex ones. All you are suggesting is building a very complex ("informal") system and hoping that because it's complex (like humans!) it will behave in a humanish way.
Your response avoids the basic logic here. A human emulation would count as an AI, therefore human behavior is one possible AI behavior. There is nothing controversial in the statement; the conclusion is drawn from the premise. If you don't think a human emulation would count as AI, or isn't possible, or something else, fine, but... why wouldn't a human emulation count as an AI? How, for example, can we even think about advanced intelligence, much less attempt to model it, without considering human intelligence? I don't think this is generally an accurate (or complex) description of human behavior, but it does sound to me like an "informal system" - i.e. we tend to care. My reading of (at least this part of) PhilGoetz's position is that it makes more sense to imagine something we would call an advanced or super AI responding to requests and commands with a certain nuance of understanding (as humans do) than with the inflexible ("brittle") formality of, say, your average BASIC program.
The thing is, humans do that by... well, not being formal systems. Which pretty much requires you to keep a good fraction of the foibles and flaws of a nonformal, nonrigorously rational system. You'd be more likely to get FAI, but FAI itself would be devalued, since now it's possible for the FAI itself to make rationality errors.
More likely, really? You're essentially proposing giving a human Ultimate Power. I doubt that will go well.
Iunno. Humans are probably less likely to go horrifically insane with power than the base chance of FAI. Your chances aren't good, just better.
Phil, Unfortunately you are commenting without (seemingly) checking the original article of mine that RobbBB is discussing here. So, you say "On the other hand, Richard, I think, wants to simply tell the AI, in English, "Make me happy." ". In fact, I am not at all saying that. :-) My article was discussing someone else's claims about AI, and dissecting their claims. So I was not making any assertions of my own about the motivation system. Aside: You will also note that I was having a productive conversation with RobbBB about his piece, when Yudkowsky decided to intervene with some gratuitous personal slander directed at me (see above). That discussion is now at an end.
I'm afraid reading all that and giving a full response to either you or RobbBB isn't possible in the time I have available this weekend. I agree that Eliezer is acting like a spoiled child, but calling people on their irrational interpersonal behavior within less wrong doesn't work. Calling them on mistakes they make about mathematics is fine, but calling them on how they treat others on less wrong will attract more reflexive down-votes from people who think you're contaminating their forum with emotion, than upvotes from people who care. Eliezer may be acting rationally. His ultimate purpose in building this site is to build support for his AI project. The only people on LessWrong, AFAIK, with decades of experience building AI systems, mapping beliefs and goals into formal statements, and then turning them on and seeing what happens, are you, me, and Ben Goertzel. Ben doesn't care enough about Eliezer's thoughts in particular to engage with them deeply; he wants to talk about generic futurist predictions such as near-term and far-term timelines. These discussions don't deal in the complex, linguistic, representational, even philosophical problems at the core of Eliezer's plan (though Ben is capable of dealing with them, they just don't come up in discussions of AI fooms etc.), so even when he disagrees with Eliezer, Eliezer can quickly grasp his point. He is not a threat or a puzzle. Whereas your comments are... very long, hard to follow, and often full of colorful or emotional statements that people here take as evidence of irrationality. You're expecting people to work harder at understanding them than they're going to. If you haven't noticed, reputation counts for nothing here. For all their talk of Bayesianism, nobody is going to check your bio and say, "Hmm, he's a professor of mathematics with 20 publications in artificial intellgence; maybe I should take his opinion as seriously as that of the high-school dropout who has no experience building AI systems.

For all their talk of Bayesianism, nobody is going to check your bio and say, "Hmm, he's a professor of mathematics with 20 publications in artificial intellgence; maybe I should take his opinion as seriously as that of the high-school dropout who has no experience building AI systems."

Actually, that was the first thing I did, not sure about other people. What I saw was:

  • Teaches at what appears to be a small private liberal arts college, not a major school.

  • Out of 20 or so publications listed on, a bunch are unrelated to AI, others are posters and interviews, or even "unpublished", which are all low-confidence media.

  • Several contributions are entries in conference proceedings (are they peer-reviewed? I don't know) .

  • A number are listed as "to appear", and so impossible to evaluate.

  • A few are apparently about dyslexia, which is an interesting topic, but not obviously related to AI.

  • One relevant paper was in H+ magazine, a place I have never heard of before and apparently not a part of any well-known scientific publishing outlet, like Springer.

  • I could not find any external references to RL's work except t

... (read more)

As a result, I was unable to independently evaluate RL's expertise level, but clearly he is not at the top of the AI field, unlike say, Ben Goertzel.

At least a few of the RL authored papers are WITH Ben Goertzel, so some of Goertzel's status should rub-off, as I would trust Goertzel to effectively evaluate collaborators.

Is there some assumption here that association with Ben Goertzel should be considered evidence in favour of an individual's credibility on AI? That seems backwards.
Well, it does show that Goertzel respects his opinions at least enough to be willing to author a paper with him.
Goertzel appears to be a respected figuer in the field. Could you point the interested reader to your critique of his work?
Goertzel is also known for approving of people who are uncontroversially cranks. See here. It's also known, via his cooperation with MIRI, that a collaboration with him in no way implies his endorsement of another's viewpoints.
Comments can likely be found on this site from years ago. I don't recall anything particularly in depth or memorable. It's probably better to just look at things that Ben Goertzel says and making one's own judgement. The thinking he expresses is not of the kind that impresses me but other's mileage may vary. I don't begrudge anyone their right to their beauty contests but I do observe that whatever it is that is measured by identifying the degree of affiliation with Ben Goertzel is something wildly out of sync with the kind of thing I would consider evidence of credibility.

Several contributions are entries in conference proceedings (are they peer-reviewed? I don't know).

In CS, conference papers are generally higher status & quality than journal articles.

Name three? If only so I can cite them to Eliezer-is-a-crank people.
I advise against doing that. It is unlikely to change anyone's mind. By impossible feats I mean that a regular person would not be able to reproduce them, except by chance, like winning a lottery, starting Google, founding a successful religion or becoming a President. He started as a high-school dropout without any formal education and look what he achieved so far, professionally and personally. Look at the organizations he founded and inspired. Look at the high-status experts in various fields (business, comp sci, programming, philosophy, math and physics) who take him seriously (some even give him loads of money). Heck, how many people manage to have multiple simultaneous long-term partners who are all highly intelligent and apparently get along well?
He's achieved about what Ayn Rand achieved, and almost everyone thinks she wasa crank.
Basically this. As Eliezer himself points out, humans aren't terribly rational on average and our judgements of each others' rationality isn't great either. Large amounts of support implies charisma, not intelligence. TDT is closer to what I'm looking for, though it's a ... tad long.
Point, but there's also the middle ground "I'm not sure if he's a crank or not, but I'm busy so I won't look unless there's some evidence he's not." The big two I've come up with is a) he actually changes his mind about important things (though I need to find an actual post I can cite - didn't he reopen the question of the possibility of a hard takeoff, or something?) and b) TDT.
Won some AI box experiments as the AI.
Sure, but that's hard to prove: given "Eliezer is a crank," the probability of "Eliezer is lying about his AI-box prowess" is much higher than "Eliezer actually pulled that off." The latest success by a non-Eliezer person helps, but I'd still like something I can literally cite.
I don't see why anyone would think that. Plenty of people in the anti-vaccination crowd managed to convince parents to mortally endanger their children.
Yes, but that's really not that hard. For starters, you can do a better job of picking your targets. The AI-box experiment often is run with intelligent, rational people with money on the line and an obvious right answer; it's a whole lot more impossible than picking the right uneducated family to sell your snake oil to.
Ohh, come on. Cyclical reasoning here. You think Yudkowsky is not a crank, so you think the folks that play that silly game with him are intelligent and rational (by the way a plenty of people who get duped by anti-vaxxers are of above average IQ), and so you get more evidence that Yudkowsky is not a crank. Cyclical reasoning doesn't persuade anyone who isn't already a believer. You need non-cyclical reasoning. Which would generally be something where you aren't the one having to explain people that the achievement in question is profound.
You probably mean "circular".
This bit confuses me. That aside: Non sequitur. From the posts they make, everyone on this site seems to me to be sufficiently intelligent as to make "selling snake oil" impossible, in a cut-and-dry case like the AI box. Yudowsky's own credibility doesn't enter into it.
I thought you wanted to persuade others. So what do you think even happened, anyway, if you think the obvious explanation is impossible?
Yes, but I don't see why this is relevant Ah, sorry. This brand of impossible.
Originally, you were hypothesising that the problem with persuading the others would be the possibility that Yudkowsky lied about AI box powers. I pointed out the possibility that this experiment is far less profound than you think it is. (Albeit frankly I do not know why you think it is so profound). What ever is the brand, any "impossibilities" that happen should lower your confidence in the reasoning that deemed them "impossibilities" in the first place. I don't think IQ is so strongly protective against deception, for example, and I do not think that you can assess something based on how the postings look to you with sufficient reliability as to overcome Gaussian priors very far from the mean. edit: example. I would deem it quite unlikely that Yudkowsky could, for example, score highly on a programming contest with competent participants or in any other conventional, validated, reliable metric of technical expertise and ability, under good contest rules (i.e. excluding the possibility of externals assistance). So if he did something like that, I'd be quite surprised, and lower the confidence in what ever models deemed that impossible; good old Bayes. I'm far more confident in the validity of those conventional metrics (and in lack of alternate modes of passing, such as persuasion) than in my assessment so my assessment would change the most. Meanwhile, when it's some unconventional game, well, even if I thought that this game is difficult, I'd be much less confident in the reasoning "it looks hard so it must be hard" than the low prior of exceptional performance is low.
Further, in this case the whole purpose of the experiment was to demonstrate that an AI could "take over a gatekeeper's mind through a text channel" (something previously deemed "impossible"). As far as that goes it was, in my view, successful.
It's clearly possible for some values of "gatekeeper", since some people fall for 419 scams. The test is a bit meaningless without information about the gatekeepers
Still have no idea what you're talking about. What I originally said was: "the people who talk to Yudkowsky are intelligent" does not follow from "Yudkowsky is not a crank"; I independently judge those people to be intelligent. "Impossible," here, is used in the sense that "I have no idea where to start thinking about where to start thinking about how to do this." It is clearly not actually impossible because it's been done, twice. And point about the contest.
I thought your "impossible" at least implied "improbable" under some sort of model. edit: and as of having no idea, you just need to know the shared religious-ish context. Which these folks generally keep hidden from a causal observer.
Impossible is being used as a statement of difficulty. Someone who has "done the impossible" has obviously not actually done something impossible, merely done something that I have no idea where to start trying. Seeing that "it is possible to do" doesn't seem like it would have much effect on my assessment of how difficult it is, after the first. It certainly doesn't have match effect on "It is very-very-difficult-impossible for linkhyrule5 to do such a thing." What? First, I'm pretty sure you mean "casual." Second, I'm hardly a casual observer, though I haven't read everything either. Third, most religions don't let their leading figures (or much of anyone, really) change their minds on important things...
Some folks on this site have accidentally bought unintentional snake oil in The Big Hoo Hah That Shall not Be Mentioned. Only an intelligent person could have bought that particular puppy,
Granted. And it may be that additional knowledge/intelligence makes yourself more vulnerable a Gatekeeper.
Trying to think this out in terms of levels of smartness alone is very unlikely to be helpful.
Well yes. It is a factor, no more no less. My point is, there is a certain level of general competence after which I would expect convincing someone with an OOC motive to let an IC AI out to be "impossible," as defined below.
But less than half of them, I'll wager. This is clearly an abuse of averages.

I wouldn't wager too much money on that one. .

Results. Undervaccinated children tended to be black, to have a younger mother who was not married and did not have a college degree, to live in a household near the poverty level, and to live in a central city. Unvaccinated children tended to be white, to have a mother who was married and had a college degree, to live in a household with an annual income exceeding $75 000, and to have parents who expressed concerns regarding the safety of vaccines and indicated that medical doctors have little influence over vaccination decisions for their children.

And in any case the point is that any correlation between IQ and not being prone to getting duped like this is not perfect enough to deem anything particularly unlikely.

Hmm. Yeah, that's hardly conclusive, but I think I was actually failing to update there. Now that you mention it, I seem to recall that both conspiracy theorists and cult victims skew toward higher IQ. I was clearly quite overconfident there. Wasn't the point that wasn't enough, actually? That seems like a much stronger claim than "it's really hard to fool high-IQ people".
I imagine that says more about the demographics of the general New Age belief cluster than it does about any special IQ-based appeal of vaccination skepticism. There probably are some scams or virulent memes that prey on insecurities strongly correlated with high IQ, though. I can't think of anything specific offhand, but the fringes of geek culture are probably one of the better places to start looking.
Well, the way I see it, outside of very high IQ in combination with education that is multiple topics of biochemistry, effects of intelligence are small and are easily dwarfed by things like those demographical correlations. Free energy scams. Hydrinos, cold fusion, magnetic generators, perpetual motion, you name it. edit: or in the medicine, counter intuitive stuff like sitting in an old uranium mine inhaling radon, then having so much radon progeny plate-out it sets nuclear material smuggling alarms off. Naturalistic fallacy stuff in general.
Cryonics. ducks and runs Edit: It was a joke. Sorryyyyyy
That is more persuasive to high IQ people, but, I think, only insofar as intelligence allows one to gain better rationality skills. And if we're including that, there are plenty of other, facetious examples that come into play. Also: ha ha. How hilarious. I would love to see why you class cryonics as a scam, but sadly I'm fairly certain it would be one of the standard mistakes.
Also, maybe its a matter of semantics, but winning a game that you created isn't really 'doing the impossible' in the sense I took the phrasing.
Winning a game you created... that sounds as impossible to win as that?
I was in a rush last night, shminux, so I didn't have time for a couple of other quick clarifications: First, you say "One relevant paper was in H+ magazine, a place I have never heard of before and apparently not a part of any well-known scientific publishing outlet, like Springer." Well, H+ magazine is one of the foremost online magazines (perhaps THE foremost online magazine) of the transhumanist community. And, you mention Springer. You did not notice that one of my papers was in the recently published Springer book "Singularity Hypotheses". Second, you say "A few [of my papers] are apparently about dyslexia, which is an interesting topic, but not obviously related to AI." Actually they were about dysgraphia, not dyslexia ... but more importantly, those papers were about computational models of language processing. In particular they were very, VERY simple versions of the computational model of human language that is one of my special areas of expertise. And since that model is primarily about learning mechanisms (the language domain is only a testbed for a research programme whose main focus is learning), those papers you saw were actually indicative that back in the early 1990s I was already working on the construction of the core aspects of an AI system. So, saying "dyslexia" gives a very misleading impression of what that was all about. :-)
That is a very interesting assessment, shminux. Would you be up for some feedback? You are quite selective in your catalog of my achievements.... One item was a chapter in a book entitled "Theoretical Foundations of Artificial General Intelligence". Sure, it was about the consciousness question, but still. You make a casual disparaging remark about the college where I currently work ... but forget to mention that I graduated from an institution that is ranked in the top 3 or 4 in the world (University College London). You neglect to mention that I have academic qualifications in multiple fields -- both physics and artificial intelligence/cognitive psychology. I now teach in both of those fields. And in addition to all of the above, you did not notice that I am (in addition to my teaching duties) an AI developer who works on his projects WITHOUT intending to publish that work all the time! My AI work is largely proprietary. What you see from the outside are the occasional spinoffs and side projects that get turned into published writings. Not to be too coy, but isn't that something you would expect from someone who is actually walking the walk....? :-) There are a number of comments from other people below about Ben Goertzel, some of them a little strange. I wrote a paper a couple of years ago that Ben suggested we get together to and publish... that is now a chapter in the book "Singularity Hypotheses". So clearly Ben Goertzel (who has a large, well-funded AGI lab) is not of the opinion that I am a crank. Could I get one point for that? Phil Goetz, who is an experienced veteran of the AGI field, has on this thread made a comment to the effect that he thinks that Ben Goertzel, himself, and myself are the three people Eliezer should be seriously listening to (since the three of us are among the few people who have been working on this problem for many years, and who have active AGI projects). So perhaps that is two points? Maybe? And, just out of curiosity,
I have no horse in this race, and I am not an ardent EY supporter, or even count myself as a "rationalist". In the area where I consider myself reasonably well trained, physics, he and I clashed a number of times on this forum. However, I am not an expert in the AI field, so I can only go by the outward signs of expertise. Ben Goertzel has them, Marcus Hutter has them, Eliezer has them. Richard Loosemore -- not so much. For all I know, you might be the genius who invents the AGI and sets it loose someday, but it's not obvious by looking online. And your histrionic comments and oversized ego make it appear rather unlikely.
I agree with pretty much all of the above. I didn't quit with Rob, btw. Ihave had a fairly productive -- albeit exhausting -- discussion with Rob over on his blog. I consider it to be productive because I have managed to narrow in on what he thinks is the central issue. And I think I have now (today's comment, which is probably the last of the discussion) managed to nail down my own argument in a way that withstands all the attacks against it. You are right that I have some serious debating weaknesses. I write too dense, and I assume that people have my width and breadth of experience, which is unfair (I got lucky in my career choices). Oh, and don't get me wrong: Eliezer never made me angry in this little episode. I laughed myself silly. Yeah, I protested. But I was wiping back tears of laughter while I did. "Known Permanent Idiot" is just a wondeful turn of phrase. Thanks, Eliezer!
Link to the nailed-down version of the argument?
2Rob Bensinger10y
Bottommost (September 9, 6:03 PM) comment here.
Oh, yeah, I found that myself eventually. Anyway, I went and read the the majority of that discussion (well, the parts between Richard and Rob). Here's my summary: Richard: [Rob responds] Richard: [Rob responds] Richard: [Rob responds] Richard: [Rob responds] Richard: [Rob responds] Richard: Rob: Richard: I snipped a lot of things there. I found lots of other points I wanted to emphasize, and plenty of things I wanted to argue against. But those aren't the point. -------------------------------------------------------------------------------- Richard, this next part is directed at you. You know what I didn't find? I didn't find any posts where you made a particular effort to address the core of Rob's argument. It was always about your argument. Rob was always the one missing the point. Sure, it took Rob long enough to focus on finding the core of your position, but he got there eventually. And what happened next? You declared that he was still missing the point, posted a condensed version of the same argument, and posted here that your position "withstands all the attacks against it." You didn't even wait for him to respond. You certainly didn't quote him and respond to the things he said. You gave no obvious indication that you were taking his arguments seriously. As far as I'm concerned, this is a cardinal sin. -------------------------------------------------------------------------------- How about this alternate hypothesis? Your explanations are fine. Rob understands what you're saying. He just doesn't agree. Perhaps you need to take a break from repeating yourself and make sure you understand Rob's argument. (P.S. Eliezer's ad hominem is still wrong. You may be making a mistake, but I'm confident you can fix it, the tone of this post notwithstanding.)
This entire debate is supposed to about my argument, as presented in the original article I published on the website ("The Fallacy of Dumb Superintelligence"). But in that case, what should I do when Rob insists on talking about something that I did not say in that article? My strategy was to explain his mistake, but not engage in a debate about his red herring. Sensible people of all stripes would consider that a mature response. But over and over again Rob avoided the actual argument and insisted on talking about his red herring. And then FINALLY I realized that I could write down my original claim in such a way that it is IMPOSSIBLE for Rob to misinterpret it. (That was easy, in retrospect: all I had to do was remove the language that he was using as the jumping-off point for his red herring). That final, succinct statement of my argument is sitting there at the end of his blog ..... so far ignored by you, and by him. Perhaps he will be able to respond, I don't know, but you say you have read it, so you have had a chance to actually understand why it is that he has been talking about something of no relevance to my original argument. But you, in your wisdom, chose to (a) completely ignore that statement of my argument, and (b) give me a patronizing rebuke for not being able to understand Rob's red herring argument.
Further to RobbBB's highly useful reply (you should read it), I would like to point out that the phenomenon of optimization processes having unexpected or unusual behaviour when placed in new situations or given new capabilities is quite general. It applies to pretty much any strong optimization process working toward any goal. For example, you program an AI/robot with no resources to fetch a banana as quickly as possible in some sandbox environment, where the banana is guarded by a few people. The AI will describe its strategy as talking to the guards and convincing them to give it the banana. Or perhaps buying the banana off them if you give it some money. But there are many ways to change the situation: 1. Make the AI much smarter—it will realise that by taking a certain sequence of steps it can get from here to the banana without ever moving within someone's field of view. No need to pay for it. (1.5—Make the AI even smarter, and it will realise that by painting a Basilisk on the robot's front it can disable all the guards effortlessly and take a straight line to the banana.) 2. Give the AI a teleportation device—it teleports straight to the banana. 3. Give the AI a plasma cannon—it shoots all the guards dead and runs up and takes the banana. The point is... it's not at all weird for AI behaviour to be "inconsistent". It isn't a sign of anything being broken, in fact the goal is being achieved. The AI is just able to think of more effective ways to do it then you are. That is, after all, the point of superintelligence. And an AI that does this is not broken or stupid, and is certainly capable of being dangerous. By the way, you can try to do something like this: But, to start with I have no idea how you would program this or what it means formally, but even if you could, it takes human judgement to identify "inconsistencies" that would matter to humans. Without embedding human values in there you'll have the AI shut down every tim
I didn't mean to ignore your argument; I just didn't get around to it. As I said, there were a lot of things I wanted to respond to. (In fact, this post was going to be longer, but I decided to focus on your primary argument.) Your story: My version: Your story: My version: In the rest of the scenario you described, I agree that the AI's behavior is pretty incoherent, if its goal is X. But if it's really aiming for Z, then its behavior is perfectly, terrifyingly coherent. And your "obvious" fail-safe isn't going to help. The AI is smarter than us. If it wants Z, and a fail-safe prevents it from getting Z, it will find a way around that fail-safe. I know, your premise is that X really is the AI's true goal. But that's my sticking point. Making it actually have the goal X, before it starts self-modifying, is far from easy. You can't just skip over that step and assume it as your premise.
What you say makes sense .... except that you and I are both bound by the terms of a scenario that someone else has set here. So, the terms (as I say, this is not my doing!) of reference are that an AI might sincerely believe that it is pursuing its original goal of making humans happy (whatever that means .... the ambiguity is in the original), but in the course of sincerely and genuinely pursuing that goal, it might get into a state where it believes that the best way to achieve the goal is to do something that we humans would consider to be NOT achieving the goal. What you did was consider some other possibilities, such as those in which the AI is actually not being sincere. Nothing wrong with considering those, but that would be a story for another day. Oh, and one other thing that arises from your above remark: remember that what you have called the "fail-safe" is not actually a fail-safe, it is an integral part of the original goal code (X). So there is no question of this being a situation where "... it wants Z, and a fail-safe prevents it from getting Z, [so] it will find a way around that fail-safe." In fact, the check is just part of X, so it WANTS to check as much as wants anything else involved in the goal. I am not sure that self-modification is part of the original terms of reference here, either. When Muehlhauser (for example) went on a radio show and explained to the audience that a superintelligence might be programmed to make humans happy, but then SINCERELY think it was making us happy when it put us on a Dopamine Drip, I think he was clearly not talking about a free-wheeling AI that can modify its goal code. Surely, if he wanted to imply that, the whole scenario goes out the window. The AI could have any motivation whatsoever. Hope that clarifies rather than obscures.
Ok, if you want to pass the buck, I won't stop you. But this other person's scenario still has a faulty premise. I'll take it up with them if you like; just point out where they state that the goal code starts out working correctly. To summarize my complaint, it's not very useful to discuss an AI with a "sincere" goal of X, because the difficulty comes from giving the AI that goal in the first place. As I see it, your (adopted) scenario is far less likely than other scenario(s), so in a sense that one is the "story for another day." Specifically, a day when we've solved the "sincere goal" issue.
That all depends on the approach... if you have some big human-inspired but more brainy neural network that learns to be a person, it can well just do the right thing by itself, and the risks are in any case quite comparable to that with having a human do it. If you are thinking of a "neat AI" with utility functions over world models and such, parts of said AI can maximize abstract metrics over mathematical models (including self improvement) without any "generally intelligent" process of eating you. So you would want to use those to build models of human meaning and intent. Furthermore with regards to AI following some goals, it seems to me that goal specifications would have to be intelligently processed in the first place so that they could be actually applied to the real world - we can't even define paperclips otherwise.
I tried arguing basically the same thing. The most coherent reply I got was that an AI doesn't follow verbal instructions and we can't just order the AI to "make humans happy", or even "make humans happy, in the way that I mean". You can only tell the AI to make humans happy by writing a program that makes it do so. It doesn't matter if the AI grasps what you really want it to do, if there is a mismatch between the program and what you really want it to do, it follows the program. Obviously I don't buy this. For one thing, you can always program it to obey verbal instructions, or you can talk to it and ask it how it will make people happy.
9Rob Bensinger10y
Jiro: Did you read my post? I discuss whether getting an AI to 'obey verbal instructions' is a trivial task in the first named section. I also link to section 2 of Yudkowsky's reply to Holden, which addresses the question of whether 'talk to it and ask it how it will make people happy' is generally a safe way to interact with an Unfriendly Oracle. I also specifically quote an argument you made in section 2 that I think reflects a common mistake in this whole family of misunderstandings of the problem — the conflation of the seed AI with the artificial superintelligence it produces. Do you agree this distinction helps clarify why the problem is one of coding the right values, and not of coding the right factual knowledge or intelligence-relevant capacities?

What if the AI's utility function is to find the right utility function, being guided along the way? Its goals could be such as learning to understand us, obey us, and predict what we might want/like/approve, moving its object-level goals to what would satisfy humanity? In other words, a probabilistic utility function with great amounts of uncertainty, and great amounts of apprehension to change, or stability.

Regardless of the above questions/statement, I think much of the complexity of human utility comes from complexities of belief.

If we offload complexi... (read more)

Coding your appreciation of 'right' is more difficult than you think. This is, essentially, what CEV is - an attempt at figuring out how an FAI can find the 'right' utility function. You're talking about normative uncertainty, which is a slightly different problem than epistemic uncertainty. The easiest way too do this would be to reduce the problem to an epistemic one (these are the characteristics of the correct utility function, now reason probabilistically about which of these candidate functions is it), but that still has the action problem - an agent takes actions based on it's utility function - if it has a weighting over all utility functions, it may act in undesirable ways, particularly if it doesn't quickly converge to a single solution. There are a few other problems I could see with that approach - the original specification of 'correctness' has to be almost Friendliness-Complete; it must be specific enough to pick out a single function (or perhaps many functions, all of which are what we want to want, without being compatible with any undesirable solutions). Also, a seed AI may not be able to follow the specification correctly, a superintelligence is going to have to have some well-specified goal along the lines of "increase your capability without doing anything bad, until you have the ability to solve this problem, and then adopt the solution as your utility function". You may noticed a familiar problem in the bolded part of that (English - remember we have to be able to code all of this) sentence.
I mean, instead of coding it, have it be uncertain about what is "right," and to guide itself using human claims. I'm thinking of the equivalent of something in EY's CFAI, but I've forgotten the terminology. In other words, a meta-utility function. Why can't it weight actions based on what we as a society want/like/approve/consent/condone? A behavioristic learner, with reward/punishment and an intention to preserve the semantic significance of the reward/punishment channel. When I said uncertainty, I was also implying inaction. I suppose inaction could be an undesirable way in which to act, but it's better to get it right slowly than to get it wrong very quickly. What I'm describing isn't really a utility function, it's more like a policy, or policy function. Its policy would be volatile, or at least, more volatile than the common understanding LW has of a set-in-stone utility function. If a utility function really needs to be pinpointed so exactly, surrounded by death and misery on all sides, why are we using a utility function to decide action? There are other approaches. Where did LW's/EY's concept of utility function come from, and why did they assume it was an essential part of AI?
Most obviously, it's very easy for a powerful AI to take unexpected control of the reward/punishment channel, and trivial for a superintelligent AGI to do so in Very Bad ways. You've tried to block the basic version of this -- an AGI pressing its own "society liked this" button -- with the phrase 'semantic significance', but that's not really a codable concept. If the AGI isn't allowed to press the button itself, it might build a machine that would do so. If it isn't allowed to do that, it might wirehead a human into doing so. If it isn't allowed /that/, it might put a human near a Paradise Machine and only let them into the box when the button had been pressed. If the AGI's reward is based on the number of favorable news reports, now you have an AGI that's rewarded for manipulating its own media coverage. So on, and so forth. The sort of semantic significance you're talking about is a pretty big part of Friendliness theory. The deeper problem is that the things our society wants aren't necessarily Friendly, especially when extrapolated. One of the secondary benefits of Friendliness research is that it requires the examination of our own interests. The 'set-in-stone' nature of a utility function is actually a desired benefit, albeit a difficult one to achieve ("Lob's Problem" and the more general issue of value drift). A machine with undirected volatility in its utility function will take random variations in its choices, and there are orders of magnitude more wrong random answers than correct ones on this matter. If you can direct the drift, that's less of an issue, but then you could just make /that/ direction the utility function. The basic idea of goal maximization is a fairly common thing when working with evolutionary algorithms (see XKCD for a joking example), because it's such a useful model. While there are other types of possible minds, maximizers of /some/ kind with unbounded or weakly bounded potential are the most relevant to MIRI's concerns becaus
Human society would not do a good job being directly in charge of a naive omnipotent genie. Insert your own nightmare scenario examples here, there are plenty to choose from. What would be in charge of changing the policy?
But that doesn't describe humanity being directly in charge. It only describes a small bit of influence for each person, and while groups would have leverage, that doesn't mean a majority rejecting, say, homosexuality, gets to say what LGB people can and can't do/be. The metautility function I described. What is a society's intent? What should a society's goals be, and how should it relate to the goals of its constituents?
I think it means precisely that if the majority feels strongly enough about it. For a quick example s/homosexuality/pedophilia/
Good point. I think I was reluctant to use pedophilia as an example because I'm trying to defend this argument, and claiming it could allow pedophilia is not usually convincing. RAT - 1 for me. I'll concede that point. But my questions aren't rhetorical, I think. There is no objective morality, and EY seems to be trying to get around that. Concessions must be made. I'm thinking that the closest thing we could have to CEV is a social contract based on Rawls' veil of ignorance, adjusted with live runoff of supply/demand (i.e. the less people want slavery, the more likely that someone who wants slavery would become a slave, so prospective slaveowners would be less likely to approve of slavery on the grounds that they themselves do not want to be slaves. Meanwhile, people who want to become slaves get what they want as well. By no means is this a rigorous definition or claim.), in a post-scarcity economy, with sharding of some sort (as in CelestAI sharding, where parts of society that contribute negative utility to an individual are effectively invisible to said individual. There was an argument on LW that CEV would be impossible without some elements of separation similar to this).
The less people want aristocracy, the more likely that someone who wants aristocracy would become a noble, so prospective nobles would be more like to approve of aristocracy on the grounds that they themselves want to be nobles?
The less people want aristocracy, the more likely that someone who wants aristocracy would become a peon, so prospective nobles would be less likely to approve of aristocracy on the grounds that they themselves want to be peons. I have to work this out. You have a good point.
(I am in the midst of reading the EY-RH "FOOM" debate, so some of the following may be less informed than would be ideal.) From a purely technical standpoint, one problem is that if you permit self-modification, and give the baby AI enough insight into its own structure to make self-modification remotely a useful thing to do (as opposed to making baby repeatedly crash, burn, and restore from backup), then you cannot guarantee that utility() won't be modified in arbitrary ways. Even if you store the actual code implementing utility() in ROM, baby could self-modify to replace all references to that fixed function with references to a different (modifiable) one. What you need is for utility() to be some kind of fixed point in utility-function space under whatever modification regime is permitted, or... something. This problem seems nigh-insoluble to me, at the moment. Even if you solve the theoretical problem of preserving those aspects of utility() that ensure Friendliness, a cosmic-ray hit might change a specific bit of memory and turn baby into a monster. (Though I suppose you could arrange, mathematically, for that particular possibility to be astronomically unlikely.)
I think the important insight you may be missing is that the AI, if intelligent enough to recursively self-improve, can predict what the modifications it makes will do (and if it can't, then it doesn't make that modification because creating an unpredictable child AI would be a bad move according to almost any utility function, even that of a paperclipper). And it evaluates the suitability of these modifications using its utility function. So assuming the seed AI is build with a sufficiently solid understanding of self-modification and what its own code is doing, it will more or less automatically work to create more powerful AIs whose actions will also be expected to fulfill the original utility function, no "fixed points" required. There is a hypothetical danger region where an AI has sufficient intelligence to create a more powerful child AI, isn't clever enough to predict the actions of AIs with modified utility functions, and isn't self-aware enough to realize this and compensate by, say, not modifying the utility function itself. Obviously the space of possible minds is sufficiently large that there exist minds with this problem, but it probably doesn't even make it into the top 10 most likely AI failure modes at the moment.
I'm not so sure about that particular claim for volatile utility. I thought intelligence-utility orthogonality would mean that improvements from seed AI would not EDIT: endanger its utility function.
...What? I think you mean, need not be in danger, which tells us almost nothing about the probability.
Sorry, it was a typo. I edited it to reflect my probable meaning.

A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions' real meaning. Then just instruct it 'Satisfy my preferences', and wait for it to become smart enough to figure out my preferences.

That problem has got to be solved somehow at some stage, because something that couldn't pass a Turing Test is no AGI.

But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.

  1. You have to actually code the seed AI to understand what we mean. Y

Why is tha... (read more)

No, it doesn't.
3Eliezer Yudkowsky10y
Juno_Watt, please take further discussion to RobbBB's blog.
1Rob Bensinger10y
Not so! An AGI need not think like a human, need not know much of anything about humans, and need not, for that matter, be as intelligent as a human. To see this, imagine we encountered an alien race of roughly human-level intelligence. Would a human be able to pass as an alien, or an alien as a human? Probably not anytime soon. Possibly not ever. (Also, passing a Turing Test does not require you to possess a particularly deep understanding of human morality! A simple list of some random things humans consider right or wrong would generally suffice.) The problem I'm pointing to here is that a lot of people treat 'what I mean' as a magical category. 'Meaning' and 'language' and 'semantics' are single words in English, which masks the complexity of 'just tell the AI to do what I mean'. Nope! It could certainly be an AGI! It couldn't be an SI -- provided it wants to pass a Turing Test, of course -- but that's not a problem we have to solve. It's one the SI can solve for itself. No human being has ever created anything -- no system of laws, no government or organization, no human, no artifact -- that, if it were more powerful, would qualify as Friendly. In that sense, everything that currently exists in the universe is non-Friendly, if not outright Unfriendly. All or nearly all humans, if they were more powerful, would qualify as Unfriendly. Moreover, by default, relying on a miscellaneous heap of vaguely moral-sounding machine learning criteria will lead to the end of life on earth. 'Smiles' and 'statements of approval' are not adequate roadmarks, because those are stimuli the SI can seize control of in unhumanistic ways to pump its reward buttons. No, it isn't. And this is a non sequitur. Nothing else in your post calls orthogonality into question.
2Eliezer Yudkowsky10y
Please take further discussion with Juno_Watt to your blog.
Is that a fact? No, it's a matter of definition. It's scarecely credible you are unaware that a lot of people think the TT is critical to AGI. I can't see any evidence of anyone invlolved in these discussions doing that. It looks like a straw man to me. An AI you can't talk to has pretty limited usefulness, and it has pretty limited safety too, since you don;t even have the option of telling it to stop, or expaling to it why you don;t like what it is doing. Oh, and isn't EY assumign that an AGi will have NLP? After all, it is supposed to be able to talk its way out of the box. It can figure out semantics for itslef. Values are a subsert of semantics... Wherer do you get this stuff from? Modern societies, with their complex legal and security systems are much less violent than ancient socieites. To take ut one example. Gee. Then I guess they don't have an architecutre with a basic drive to be friendly. Why don't humans do that? Uh-huh. MIRI has settled that centuries-aold quesiton for once and all has it? It can't be a non-sequitur, since it is not an arguemnt but a statement of fact. So? It wasn't relevant anywhere else.
0Rob Bensinger10y
Let's run with that idea. There's 'general-intelligence-1', which means "domain-general intelligence at a level comparable to that of a human"; and there's 'general-intelligence-2', which means (I take it) "domain-general intelligence at a level comparable to that of a human, plus the ability to solve the Turing Test". On the face of it, GI2 looks like a much more ad-hoc and heterogeneous definition. To use GI2 is to assert, by fiat, that most intelligences (e.g., most intelligent alien races) of roughly human-level intellectual ability (including ones a bit smarter than humans) are not general intelligences, because they aren't optimized for disguising themselves as one particular species from a Milky Way planet called Earth. If your definition has nothing to recommend itself, then more useful definitions are on offer. * * * * * * 'Mean', 'right', 'rational', etc. An AI doesn't need to be able to trick you in order for you to be able to give it instructions. All sorts of useful skills AIs have these days don't require them to persuade everyone that they're human. Read the article you're commenting on. One of its two main theses is, in bold: The seed is not the superintelligence. Yes. We should focusing on solving the values part of semantics, rather than the entire superset. Doesn't matter. Give an ancient or a modern society arbitrarily large amounts of power overnight, and the end results won't differ in any humanly important way. There won't be any nights after that. Setting aside the power issue: Because humans don't us
If want to be sure that these terms, as used by a particular person, are magical categories, you need to ask the particular person whether they have a mechanical interpretation in mind -- address the argument, not the person. Whether any particular person has a mechanical interpretation of these concepts in mind cannot be shown by a completely general argument like Ghosts in the Machine . You don't think that your use of ‘Mean’, ‘right’, ‘rational’, etc is necessarily magical! But whether someone has a non-magical explanation can easily be shown by asking them. In particular, it is highly reasonable to assume that an actual AI researcher would have such an interpretation. It is not reasonable to interpret sheer absence of evidence -- especially a wilful absence of evidence, based on refusal to engage -- as evidence of magical thinking. At the time of writing the MIRI/LW side of this debate is known to be wrong... and that is not despite good, rational epistemology, it is *because of *bad, dogmatic, Ad Hominen, debate. There are multiple occasions where EY instructs his followers not to even engage with the side that turned out to be correct.,

Discussion of this article has now moved to RobbBB's own personal blog at

I will conduct any discussion over there, with interested parties.

Since this comment is likely to be downgraded because of the LW system (which is set up to automatically downgrade anything I write here, to make it as invisible as possible), perhaps someone would take the trouble to mirror this comment where it can be seen. Thank you.

I want to upvote this for the link to further discussion, but I also want to downvote it for the passive-aggressive jab at LW users.

No vote.

I'm afraid the 'troll prodrome' behaviour you observe more than cancels out the usefulness of the link (and, for that matter, it prevented me from even considering the link as having positive expected value.)

New to LessWrong?