I begin by thanking Holden Karnofsky of GiveWell for the rare gift of his detailed, engaged, and helpfully-meant critical article Thoughts on the Singularity Institute (SI). In this reply I will engage with only one of the many subjects raised therein, the topic of, as I would term them, non-self-modifying planning Oracles, a.k.a. 'Google Maps AGI' a.k.a. 'tool AI', this being the topic that requires me personally to answer.  I hope that my reply will be accepted as addressing the most important central points, though I did not have time to explore every avenue.  I certainly do not wish to be logically rude, and if I have failed, please remember with compassion that it's not always obvious to one person what another person will think was the central point.

Luke Muehlhauser and Carl Shulman contributed to this article, but the final edit was my own, likewise any flaws.

Summary:

Holden's concern is that "SI appears to neglect the potentially important distinction between 'tool' and 'agent' AI." His archetypal example is Google Maps:

Google Maps is not an agent, taking actions in order to maximize a utility parameter. It is a tool, generating information and then displaying it in a user-friendly manner for me to consider, use and export or discard as I wish.

The reply breaks down into four heavily interrelated points:

First, Holden seems to think (and Jaan Tallinn apparently doesn't object, in their exchange) that if a non-self-modifying planning Oracle is indeed the best strategy, then all of SIAI's past and intended future work is wasted.  To me it looks like there's a huge amount of overlap in underlying processes in the AI that would have to be built and the insights required to build it, and I would be trying to assemble mostly - though not quite exactly - the same kind of team if I were trying to build a non-self-modifying planning Oracle, with the same initial mix of talents and skills.

Second, a non-self-modifying planning Oracle doesn't sound nearly as safe once you stop saying human-English phrases like "describe the consequences of an action to the user" and start trying to come up with math that says scary dangerous things like (here translated back into English) "increase the correspondence between the user's belief about relevant consequences and reality".  Hence the people on the team would have to solve the same sorts of problems.

Appreciating the force of the third point is a lot easier if one appreciates the difficulties discussed in points 1 and 2, but the point itself is empirically verifiable independently:  Whether or not a non-self-modifying planning Oracle is the best solution in the end, it's not such an obvious privileged-point-in-solution-space that someone should be alarmed at SIAI not discussing it.  This is empirically verifiable in the sense that 'tool AI' wasn't the obvious solution to e.g. John McCarthy, Marvin Minsky, I. J. Good, Peter Norvig, Vernor Vinge, or for that matter Isaac Asimov.  At one point, Holden says:

One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a "tool" and giving arguments for why AGI is likely to work only as an "agent."

If I take literally that this is one of the things that bothers Holden most... I think I'd start stacking up some of the literature on the number of different things that just respectable academics have suggested as the obvious solution to what-to-do-about-AI - none of which would be about non-self-modifying smarter-than-human planning Oracles - and beg him to have some compassion on us for what we haven't addressed yet.  It might be the right suggestion, but it's not so obviously right that our failure to prioritize discussing it reflects negligence.

The final point at the end is looking over all the preceding discussion and realizing that, yes, you want to have people specializing in Friendly AI who know this stuff, but as all that preceding discussion is actually the following discussion at this point, I shall reserve it for later.

1.  The math of optimization, and the similar parts of a planning Oracle.

What does it take to build a smarter-than-human intelligence, of whatever sort, and have it go well?

A "Friendly AI programmer" is somebody who specializes in seeing the correspondence of mathematical structures to What Happens in the Real World. It's somebody who looks at Hutter's specification of AIXI and reads the actual equations - actually stares at the Greek symbols and not just the accompanying English text - and sees, "Oh, this AI will try to gain control of its reward channel," as well as numerous subtler issues like, "This AI presumes a Cartesian boundary separating itself from the environment; it may drop an anvil on its own head." Similarly, working on TDT means e.g. looking at a mathematical specification of decision theory, and seeing "Oh, this is vulnerable to blackmail" and coming up with a mathematical counter-specification of an AI that isn't so vulnerable to blackmail.

Holden's post seems to imply that if you're building a non-self-modifying planning Oracle (aka 'tool AI') rather than an acting-in-the-world agent, you don't need a Friendly AI programmer because FAI programmers only work on agents. But this isn't how the engineering skills are split up. Inside the AI, whether an agent AI or a planning Oracle, there would be similar AGI-challenges like "build a predictive model of the world", and similar FAI-conjugates of those challenges like finding the 'user' inside an AI-created model of the universe.  The insides would look a lot more similar than the outsides.  An analogy would be supposing that a machine learning professional who does sales optimization for an orange company couldn't possibly do sales optimization for a banana company, because their skills must be about oranges rather than bananas.

Admittedly, if it turns out to be possible to use a human understanding of cognitive algorithms to build and run a smarter-than-human Oracle without it being self-improving - this seems unlikely, but not impossible - then you wouldn't have to solve problems that arise with self-modification.  But this eliminates only one dimension of the work.  And on an even more meta level, it seems like you would call upon almost identical talents and skills to come up with whatever insights were required - though if it were predictable in advance that we'd abjure self-modification, then, yes, we'd place less emphasis on e.g. finding a team member with past experience in reflective math, and wouldn't waste (additional) time specializing in reflection.  But if you wanted math inside the planning Oracle that operated the way you thought it did, and you wanted somebody who understood what could possibly go wrong and how to avoid it, you would need to make a function call to the same sort of talents and skills to build an agent AI, or an Oracle that was self-modifying, etc.

2.  Yes, planning Oracles have hidden gotchas too.

"Tool AI" may sound simple in English, a short sentence in the language of empathically-modeled agents — it's just "a thingy that shows you plans instead of a thingy that goes and does things." If you want to know whether this hypothetical entity does X, you just check whether the outcome of X sounds like "showing someone a plan" or "going and doing things", and you've got your answer.  It starts sounding much scarier once you try to say something more formal and internally-causal like "Model the user and the universe, predict the degree of correspondence between the user's model and the universe, and select from among possible explanation-actions on this basis."

Holden, in his dialogue with Jaan Tallinn, writes out this attempt at formalizing:

Here's how I picture the Google Maps AGI ...

    utility_function = construct_utility_function(process_user_input());
    foreach $action in $all_possible_actions {
        $action_outcome = prediction_function($action, $data);
        $utility = utility_function($action_outcome);
        if ($utility > $leading_utility) {
            $leading_utility = $utility;
            $leading_action = $action;
        }
    }
    report($leading_action);

construct_utility_function(process_user_input()) is just a human-quality function for understanding what the speaker wants. prediction_function is an implementation of a human-quality data->prediction function in superior hardware. $data is fixed (it's a dataset larger than any human can process); same with $all_possible_actions. report($leading_action) calls a Google Maps-like interface for understanding the consequences of $leading_action; it basically breaks the action into component parts and displays predictions for different times and conditional on different parameters.

Google Maps doesn't check all possible routes. If I wanted to design Google Maps, I would start out by throwing out a standard planning technique on a connected graph where each edge has a cost function and there's a good heuristic measure of the distance, e.g. A* search. If that was too slow, I'd next try some more efficient version like weighted A* (or bidirectional weighted memory-bounded A*, which I expect I could also get off-the-shelf somewhere). Once you introduce weighted A*, you no longer have a guarantee that you're selecting the optimal path.  You have a guarantee to within a known factor of the cost of the optimal path — but the actual path selected wouldn't be quite optimal. The suggestion produced would be an approximation whose exact steps depended on the exact algorithm you used. That's true even if you can predict the exact cost — exact utility — of any particular path you actually look at; and even if you have a heuristic that never overestimates the cost.
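For concreteness, here is a minimal sketch of what a weighted-A*-style route search looks like (toy illustrative code, not anything from Google's or Holden's actual design; the graph, step costs, and heuristic are all assumed inputs):

    import heapq
    import itertools

    def weighted_a_star(start, goal, neighbors, heuristic, weight=1.5):
        # Best-first search on f = g + weight * h.  With weight > 1 the
        # returned path is guaranteed to cost at most `weight` times the
        # optimum (given an admissible heuristic), but it is generally not
        # the optimal path: the answer depends on the algorithm, not just
        # on the cost function.
        counter = itertools.count()  # tie-breaker so the heap never compares nodes
        frontier = [(weight * heuristic(start, goal), 0.0, next(counter), start, [start])]
        best_g = {start: 0.0}
        while frontier:
            _, g, _, node, path = heapq.heappop(frontier)
            if node == goal:
                return path, g
            for nxt, step_cost in neighbors(node):
                g2 = g + step_cost
                if g2 < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g2
                    f2 = g2 + weight * heuristic(nxt, goal)
                    heapq.heappush(frontier, (f2, g2, next(counter), nxt, path + [nxt]))
        return None, float("inf")

Set weight to 1 and you recover ordinary A*; raise it and the search gets faster while the returned route drifts further from optimal, which is exactly the bounded-suboptimality point being made here.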

The reason we don't have God's Algorithm for solving the Rubik's Cube is that there's no perfect way of measuring the distance between any two Rubik's Cube positions — you can't look at two Rubik's cube positions, and figure out the minimum number of moves required to get from one to another. It took 15 years to prove that there was a position requiring at least 20 moves to solve, and then another 15 years to come up with a computer algorithm that could solve any position in at most 20 moves, but we still can't compute the actual, minimum solution to all Cubes ("God's Algorithm"). This, even though we can exactly calculate the cost and consequence of any actual Rubik's-solution-path we consider.

When it comes to AGI — solving general cross-domain "Figure out how to do X" problems — you're not going to get anywhere near the one, true, optimal answer. You're going to — at best, if everything works right — get a good answer that's a cross-product of the "utility function" and all the other algorithmic properties that determine what sort of answer the AI finds easy to invent (i.e. can be invented using bounded computing time).

As for the notion that this AGI runs on a "human predictive algorithm" that we got off of neuroscience and then implemented using more computing power, without knowing how it works or being able to enhance it further: It took 30 years of multiple computer scientists doing basic math research, and inventing code, and running that code on a computer cluster, for them to come up with a 20-move solution to the Rubik's Cube. If a planning Oracle is going to produce better solutions than humanity has yet managed to the Rubik's Cube, it needs to be capable of doing original computer science research and writing its own code. You can't get a 20-move solution out of a human brain, using the native human planning algorithm. Humanity can do it, but only by exploiting the ability of humans to explicitly comprehend the deep structure of the domain (not just rely on intuition) and then inventing an artifact, a new design, running code which uses a different and superior cognitive algorithm, to solve that Rubik's Cube in 20 moves. We do all that without being self-modifying, but it's still a capability to respect.

And I'm not even going into what it would take for a planning Oracle to out-strategize any human, come up with a plan for persuading someone, solve original scientific problems by looking over experimental data (like Einstein did), design a nanomachine, and so on.

Talking like there's this one simple "predictive algorithm" that we can read out of the brain using neuroscience and overpower to produce better plans... doesn't seem quite congruous with what humanity actually does to produce its predictions and plans.

If we take the concept of the Google Maps AGI at face value, then it actually has four key magical components.  (In this case, "magical" isn't to be taken as prejudicial, it's a term of art that means we haven't said how the component works yet.)  There's a magical comprehension of the user's utility function, a magical world-model that GMAGI uses to comprehend the consequences of actions, a magical planning element that selects a non-optimal path using some method other than exploring all possible actions, and a magical explain-to-the-user function.
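Restated as a sketch (the function names are mine, purely to make the "magic" explicit; nobody knows how to write any of these bodies today):

    # The four "magical" components of the Google Maps AGI, made explicit.
    def comprehend_user_utility(user_input):
        # Magic 1: human-quality understanding of what the user actually wants.
        raise NotImplementedError

    def predict_consequences(action, world_model):
        # Magic 2: a model of the world good enough to foresee what an action does.
        raise NotImplementedError

    def plan(utility_function, world_model):
        # Magic 3: select a good (not optimal) action without enumerating every
        # possible action -- some bounded search, like the weighted A* above,
        # whose biases shape which answer comes back.
        raise NotImplementedError

    def report(action, predicted_outcome):
        # Magic 4: explain the chosen plan so a human can actually evaluate it.
        raise NotImplementedError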

report($leading_action) isn't exactly a trivial step either. Deep Blue tells you to move your pawn or you'll lose the game. You ask "Why?" and the answer is a gigantic search tree of billions of possible move-sequences, leafing at positions which are heuristically rated using a static-position evaluation algorithm trained on millions of games. Or the planning Oracle tells you that a certain DNA sequence will produce a protein that cures cancer, you ask "Why?", and then humans aren't even capable of verifying, for themselves, the assertion that the peptide sequence will fold into the protein the planning Oracle says it does.

"So," you say, after the first dozen times you ask the Oracle a question and it returns an answer that you'd have to take on faith, "we'll just specify in the utility function that the plan should be understandable."

Whereupon other things start going wrong. Viliam_Bur, in the comments thread, gave this example, which I've slightly simplified:

Example question: "How should I get rid of my disease most cheaply?" Example answer: "You won't. You will die soon, unavoidably. This report is 99.999% reliable". Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.

Bur is trying to give an example of how things might go wrong if the preference function is over the accuracy of the predictions explained to the human, rather than over the 'goodness' of the outcome for the human. And if the preference function were instead just over the 'goodness' of the end result, rather than the accuracy of the human's understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a 'good' outcome. And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.
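A toy way to see the dilemma (the numbers are hand-assigned and made up; the only point is that the two obvious scoring rules each select a different pathological answer):

    # Hand-assigned toy predictions for three candidate answers to
    # "How should I get rid of my disease most cheaply?"
    candidates = {
        "honest cure plan":          {"belief_accuracy": 0.8, "outcome_goodness": 0.9},
        "accurate fatal prophecy":   {"belief_accuracy": 1.0, "outcome_goodness": 0.0},
        "persuasive comforting lie": {"belief_accuracy": 0.1, "outcome_goodness": 0.95},
    }

    # Preference over accuracy of the user's resulting beliefs:
    print(max(candidates, key=lambda a: candidates[a]["belief_accuracy"]))
    # -> "accurate fatal prophecy"  (Viliam_Bur's failure mode)

    # Preference over the AI's own rating of the end result:
    print(max(candidates, key=lambda a: candidates[a]["outcome_goodness"]))
    # -> "persuasive comforting lie"  (predictively false, but steers the user
    #     to whatever the AI scores as a 'good' outcome)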

I'm not saying any particular failure is 100% certain to occur; rather I'm trying to explain - as handicapped by the need to describe the AI in the native human agent-description language, using empathy to simulate a spirit-in-a-box instead of trying to think in mathematical structures like A* search or Bayesian updating - how, even so, one can still see that the issue is a tad more fraught than it sounds on an immediate examination.

If you see the world just in terms of math, it's even worse; you've got some program with inputs from a USB cable connecting to a webcam, output to a computer monitor, and optimization criteria expressed over some combination of the monitor, the humans looking at the monitor, and the rest of the world. It's a whole lot easier to call what's inside a 'planning Oracle' or some other English phrase than to write a program that does the optimization safely without serious unintended consequences. Show me any attempted specification, and I'll point to the vague parts and ask for clarification in more formal and mathematical terms, and as soon as the design is clarified enough to be a hundred light years from implementation instead of a thousand light years, I'll show a neutral judge how that math would go wrong. (Experience shows that if you try to explain to would-be AGI designers how their design goes wrong, in most cases they just say "Oh, but of course that's not what I meant." Marcus Hutter is a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button. But based on past sad experience with many other would-be designers, I say "Explain to a neutral judge how the math kills" and not "Explain to the person who invented that math and likes it.")

Just as the gigantic gap between smart-sounding English instructions and actually smart algorithms is the main source of difficulty in AI, there's a gap between benevolent-sounding English and actually benevolent algorithms which is the source of difficulty in FAI.  "Just make suggestions - don't do anything!" is, in the end, just more English.

3.  Why we haven't already discussed Holden's suggestion

One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a "tool" and giving arguments for why AGI is likely to work only as an "agent."

The above statement seems to lack perspective on how many different things various people see as the one obvious solution to Friendly AI. Tool AI wasn't the obvious solution to John McCarthy, I. J. Good, or Marvin Minsky. Today's leading AI textbook, Artificial Intelligence: A Modern Approach - where you can learn all about A* search, by the way - discusses Friendly AI and AI risk for 3.5 pages but doesn't mention tool AI as an obvious solution. For Ray Kurzweil, the obvious solution is merging humans and AIs. For Jurgen Schmidhuber, the obvious solution is AIs that value a certain complicated definition of complexity in their sensory inputs. Ben Goertzel, J. Storrs Hall, and Bill Hibbard, among others, have all written about how silly SingInst is to pursue Friendly AI when the solution is obviously X, for various different X. Among current leading people working on serious AGI programs labeled as such, neither Demis Hassabis (VC-funded to the tune of several million dollars) nor Moshe Looks (head of AGI research at Google) nor Henry Markram (Blue Brain at IBM) think that the obvious answer is Tool AI. Vernor Vinge, Isaac Asimov, and any number of other SF writers with technical backgrounds who spent serious time thinking about these issues didn't converge on that solution.

Obviously I'm not saying that nobody should be allowed to propose solutions because someone else would propose a different solution. I have been known to advocate for particular developmental pathways for Friendly AI myself. But I haven't, for example, told Peter Norvig that deterministic self-modification is such an obvious solution to Friendly AI that I would mistrust his whole AI textbook if he didn't spend time discussing it.

At one point in his conversation with Tallinn, Holden argues that AI will inevitably be developed along planning-Oracle lines, because making suggestions to humans is the natural course that most software takes. Searching for counterexamples instead of positive examples makes it clear that most lines of code don't do this.  Your computer, when it reallocates RAM, doesn't pop up a button asking you if it's okay to reallocate RAM in such-and-such a fashion. Your car doesn't pop up a suggestion when it wants to change the fuel mix or apply dynamic stability control. Factory robots don't operate as human-worn bracelets whose blinking lights suggest motion. High-frequency trading programs execute stock orders on a microsecond timescale. Software that does happen to interface with humans is selectively visible and salient to humans, especially the tiny part of the software that does the interfacing; but this is a special case of a general cost/benefit tradeoff which, more often than not, turns out to swing the other way, because human advice is either too costly or doesn't provide enough benefit. Modern AI programmers are generally more interested in e.g. pushing the technological envelope to allow self-driving cars than in "just" doing Google Maps. Branches of AI that invoke human aid, like hybrid chess-playing algorithms designed to incorporate human advice, are a field of study; but they're the exception rather than the rule, and occur primarily where AIs can't yet do something humans do, e.g. humans acting as oracles for theorem-provers, where the humans suggest a route to a proof and the AI actually follows that route. This is another reason why planning Oracles were not a uniquely obvious solution to the various academic AI researchers, would-be AI-creators, SF writers, etcetera, listed above. Again, regardless of whether a planning Oracle is actually the best solution, Holden seems to be empirically-demonstrably overestimating the degree to which other people will automatically have his preferred solution come up first in their search ordering.

4.  Why we should have full-time Friendly AI specialists just like we have trained professionals doing anything else mathy that somebody actually cares about getting right, like pricing interest-rate options or something

I hope that the preceding discussion has made, by example instead of mere argument, what's probably the most important point: If you want to have a sensible discussion about which AI designs are safer, there are specialized skills you can apply to that discussion, as built up over years of study and practice by someone who specializes in answering that sort of question.

This isn't meant as an argument from authority. It's not meant as an attempt to say that only experts should be allowed to contribute to the conversation. But it is meant to say that there is (and ought to be) room in the world for Friendly AI specialists, just like there's room in the world for specialists on optimal philanthropy (e.g. Holden).

The decision to build a non-self-modifying planning Oracle would be properly made by someone who: understood the risk gradient for self-modifying vs. non-self-modifying programs; understood the risk gradient for having the AI thinking about the thought processes of the human watcher and trying to come up with plans implementable by the human watcher in the service of locally absorbed utility functions, vs. trying to implement its own plans in the service of more globally descriptive utility functions; and who, above all, understood on a technical level what exactly gets accomplished by having the plans routed through a human. I've given substantial previous thought to describing more precisely what happens — what is being gained, and how much is being gained — when a human "approves a suggestion" made by an AI. But that would be a different topic, plus I haven't made too much progress on saying it precisely anyway.

In the transcript of Holden's conversation with Jaan Tallinn, it looked like Tallinn didn't deny the assertion that Friendly AI skills would be inapplicable if we're building a Google Maps AGI. I would deny that assertion and emphasize that denial, because to me it seems that it is exactly Friendly AI programmers who would be able to tell you if the risk gradient for non-self-modification vs. self-modification, the risk gradient for routing plans through humans vs. acting as an agent, the risk gradient for requiring human approval vs. unapproved action, and the actual feasibility of directly constructing transhuman modeling-prediction-and-planning algorithms through direct design of sheerly better computations than are presently run by the human brain, had the right combination of properties to imply that you ought to go construct a non-self-modifying planning Oracle. Similarly if you wanted an AI that took a limited set of actions in the world with human approval, or if you wanted an AI that "just answered questions instead of making plans".

It is similarly implied that a "philosophical AI" might obsolete Friendly AI programmers. If we're talking about PAI that can start with a human's terrible decision theory and come up with a good decision theory, or PAI that can start from a human talking about bad metaethics and then construct a good metaethics... I don't want to say "impossible", because, after all, that's just what human philosophers do. But we are not talking about a trivial invention here. Constructing a "philosophical AI" is a Holy Grail precisely because it's FAI-complete (just ask it "What AI should we build?"), and has been discussed (e.g. with and by Wei Dai) over the years on the old SL4 mailing list and the modern Less Wrong. But it's really not at all clear how you could write an algorithm which would knowably produce the correct answer to the entire puzzle of anthropic reasoning, without being in possession of that correct answer yourself (in the same way that we can have Deep Blue win chess games without knowing the exact moves, but understanding exactly what abstract work Deep Blue is doing to solve the problem).

Holden's post presents a restrictive view of what "Friendly AI" people are supposed to learn and know — that it's about machine learning for optimizing orange sales but not banana sales, or about producing an "agent" that implements CEV — which is something of a straw view, much weaker than the view that a Friendly AI programmer takes of Friendly AI programming. What the human species needs from an x-risk perspective is experts on This Whole Damn Problem, who will acquire whatever skills are needed to that end. The Singularity Institute exists to host such people and enable their research—once we have enough funding to find and recruit them.  See also, How to Purchase AI Risk Reduction.

I'm pretty sure Holden has met people who think that having a whole institute to rate the efficiency of charities is pointless overhead, especially people who think that their own charity-solution is too obviously good to have to contend with busybodies pretending to specialize in thinking about 'marginal utility'.  Which Holden knows about, I would guess, from being paid quite well to think about those economic details when he was a hedge fundie, and learning from books written by professional researchers before then; and the really key point is that people who haven't studied all that stuff don't even realize what they're missing by trying to wing it.  If you don't know, you don't know what you don't know, or the cost of not knowing.  Is there a problem of figuring out who might know something you don't, if Holden insists that there's this strange new stuff called 'marginal utility' you ought to learn about?  Yes, there is.  But is someone who trusts their philanthropic dollars to be steered just by the warm fuzzies of their heart, doing something wrong?  Yes, they are.  It's one thing to say that SIAI isn't known-to-you to be doing it right - another thing still to say that SIAI is known-to-you to be doing it wrong - and then quite another thing entirely to say that there's no need for Friendly AI programmers and you know it, that anyone can see it without resorting to math or cracking a copy of AI: A Modern Approach.  I do wish that Holden would at least credit that the task SIAI is taking on contains at least as many gotchas, relative to the instinctive approach, as optimal philanthropy compared to instinctive philanthropy, and might likewise benefit from some full-time professionally specialized attention, just as our society creates trained professionals to handle any other problem that someone actually cares about getting right.

On the other side of things, Holden says that even if Friendly AI is proven and checked:

"I believe that the probability of an unfavorable outcome - by which I mean an outcome essentially equivalent to what a UFAI would bring about - exceeds 90% in such a scenario."

It's nice that this appreciates that the problem is hard.  Associating all of the difficulty with agenty proposals and thinking that it goes away as soon as you invoke tooliness is, well, of this I've already spoken. I'm not sure whether this irreducible-90%-doom assessment is based on a common straw version of FAI where all the work of the FAI programmer goes into "proving" something and doing this carefully checked proof which then - alas, poor Spock! - turns out to be no more relevant than proving that the underlying CPU does floating-point arithmetic correctly if the transistors work as stated. I've repeatedly said that the idea behind proving determinism of self-modification isn't that this guarantees safety, but that if you prove the self-modification stable the AI might work, whereas if you try to get by with no proofs at all, doom is guaranteed. My mind keeps turning up Ben Goertzel as the one who invented this caricature - "Don't you understand, poor fool Eliezer, life is full of uncertainty, your attempt to flee from it by refuge in 'mathematical proof' is doomed" - but I'm not sure he was actually the inventor. In any case, the burden of safety isn't carried just by the proof, it's carried mostly by proving the right thing. If Holden is assuming that we're just running away from the inherent uncertainty of life by taking refuge in mathematical proof, then, yes, 90% probability of doom is an understatement, the vast majority of plausible-on-first-glance goal criteria you can prove stable will also kill you.

If Holden's assessment does take into account a great effort to select the right theorem to prove - and attempts to incorporate the difficult but finitely difficult feature of meta-level error-detection, as it appears in e.g. the CEV proposal - and he is still assessing 90% doom probability, then I must ask, "What do you think you know and how do you think you know it?" The complexity of the human mind is finite; there's only so many things we want or would-want. Why would someone claim to know that proving the right thing is beyond human ability, even if "100 of the world's most intelligent and relevantly experienced people" (Holden's terms) check it over? There's hidden complexity of wishes, but not infinite complexity of wishes or unlearnable complexity of wishes. There are deep and subtle gotchas but not an unending number of them. And if that were the setting of the hidden variables - how would you end up knowing that with 90% probability in advance? I don't mean to wield my own ignorance as a sword or engage in motivated uncertainty - I hate it when people argue that if they don't know something, nobody else is allowed to know either - so please note that I'm also counterarguing from positive facts pointing the other way: the human brain is complicated but not infinitely complicated, there are hundreds or thousands of cytoarchitecturally distinct brain areas but not trillions or googols.  If humanity had two hundred years to solve FAI using human-level intelligence and there was no penalty for guessing wrong I would be pretty relaxed about the outcome.  If Holden says there's 90% doom probability left over no matter what sane intelligent people do (all of which goes away if you just build Google Maps AGI, but leave that aside for now) I would ask him what he knows now, in advance, that all those sane intelligent people will miss.  I don't see how you could (well-justifiedly) access that epistemic state.

I acknowledge that there are points in Holden's post which are not addressed in this reply, acknowledge that these points are also deserving of reply, and hope that other SIAI personnel will be able to reply to them.

Comments on "Reply to Holden on 'Tool AI'"

My summary (now with endorsement by Eliezer!):

  • SI can be a valuable organization even if Tool AI turns out to be the right approach:
    • Skills/organizational capabilities for safe Tool AI are similar to those for Friendly AI.
    • EY seems to imply that much of SI's existing body of work can be reused.
    • Offhand remark that seemed important: Superintelligent Tool AI would be more difficult since it would have to be developed in such a way that it would not recursively self-improve.
  • Tool AI is nontrivial:
    • The number of possible plans is way too large for an AI to realistically evaluate all of them. Heuristics will have to be used to find suboptimal but promising plans.
    • The reasoning behind the plan the AI chooses might be way beyond the comprehension of the user. It's not clear how best to deal with this, given that the AI is only approximating the user's wishes and can't really be trusted to choose plans without supervision.
    • Constructing a halfway decent approximation of the user's utility function and having a model good enough to make plans with are also far from solved problems.
    • Potential Tool AI gotcha: The AI might give you a self-fulfilling negative prophecy that the AI didn't realize would h
... (read more)

This point seems missing:

You can't get a 20-move solution out of a human brain, using the native human planning algorithm. Humanity can do it, but only by exploiting the ability of humans to explicitly comprehend the deep structure of the domain (not just rely on intuition) and then inventing an artifact, a new design, running code which uses a different and superior cognitive algorithm, to solve that Rubik's Cube in 20 moves. We do all that without being self-modifying, but it's still a capability to respect.

A system that undertakes extended processes of research and thinking, generating new ideas and writing new programs for internal experiments, seems both much more effective and much more potentially risky than something like chess program with a simple fixed algorithm to search using a fixed narrow representation of the world (as a chess board).

Looks pretty good, actually. Nice.


The difficulty of Friendliness is finite. The difficulties are big and subtle, but not unending.

How do we know that the problem is finite? When it comes to proving a computer program safe from being hacked, the problem is considered NP-hard. Google Chrome recently got hacked by chaining 14 different bugs together. A working AGI is probably at least as complex as Google Chrome. Proving it safe will likely also be NP-hard.

Google Chrome doesn't even self modify.

3Curiouskid
I'm not really sure what's meant by this. For example, in computer vision, you can input an image and get a classification as output. The input is supplied by a human. The computation doesn't involve the human. The output is well defined. The same could be true of a tool AI that makes predictions.
3Curiouskid
Both Andrew Ng and Jeff Hawkins think that tool AI is the most likely approach.
0Weedlayer
I would consider 3 to be a few.
3beoShaffer
That is about how I read it.

When I read posts like this I feel like an independent everyman watching a political debate.

The dialogue is oversimplified and even then I don't fully grasp exactly what's being said and the implications thereof, so I can almost feel my opinion shifting back and forth with each point that sounds sort of, kinda, sensible when I don't really have the capacity to judge the statements. I should probably try and fix that.

The analogy is apt: blue-vs.-green politics aren't the only kind of politics, and debates over singularity policy have had big mind-killing effects on otherwise-pretty-rational LW folk before.

2John_Maxwell
The core points don't strike me as being inherently difficult or technical, although Eliezer uses some technical examples.

The core points don't strike me as being inherently difficult or technical

That's precisely the problem, given that Eliezer is arguing that a technical appreciation of difficult problems is necessary to judge correctly on this issue. My understanding, like pleeppleep's, is limited to the simplified level given here, which means I'm reduced to giving weight to presentation and style and things being "kinda sensible".

Hello,

I appreciate the thoughtful response. I plan to respond at greater length in the future, both to this post and to some other content posted by SI representatives and commenters. For now, I wanted to take a shot at clarifying the discussion of "tool-AI" by discussing AIXI. One of the issues I've found with the debate over FAI in general is that I haven't seen much in the way of formal precision about the challenge of Friendliness (I recognize that I have also provided little formal precision, though I feel the burden of formalization is on SI here). It occurred to me that AIXI might provide a good opportunity to have a more precise discussion, if in fact it is believed to represent a case of "a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button."

So here's my characterization of how one might work toward a safe and useful version of AIXI, using the "tool-AI" framework, if one could in fact develop an efficient enough approximation of AIXI to qualify as a powerful ... (read more)

Didn't see this at the time, sorry.

So... I'm sorry if this reply seems a little unhelpful, and I wish there was some way to engage more strongly, but...

Point (1) is the main problem. AIXI updates freely over a gigantic range of sensory predictors with no specified ontology - it's a sum over a huge set of programs, and we, the users, have no idea what the representations are talking about, except that at the end of their computations they predict, "You will see a sensory 1 (or a sensory 0)." (In my preferred formalism, the program puts a probability on a 0 instead.) Inside, the program could've been modeling the universe in terms of atoms, quarks, quantum fields, cellular automata, giant moving paperclips, slave agents scurrying around... we, the programmers, have no idea how AIXI is modeling the world and producing its predictions, and indeed, the final prediction could be a sum over many different representations.

This means that equation (20) in Hutter is written as a utility function over sense data, where the reward channel is just a special case of sense data. We can easily adapt this equation to talk about any function computed directly over sense data - we can g... (read more)

Thanks for the response. To clarify, I'm not trying to point to the AIXI framework as a promising path; I'm trying to take advantage of the unusually high degree of formalization here in order to gain clarity on the feasibility and potential danger points of the "tool AI" approach.

It sounds to me like your two major issues with the framework I presented are (to summarize):

(1) There is a sense in which AIXI predictions must be reducible to predictions about the limited set of inputs it can "observe directly" (what you call its "sense data").

(2) Computers model the world in ways that can be unrecognizable to humans; it may be difficult to create interfaces that allow humans to understand the implicit assumptions and predictions in their models.

I don't claim that these problems are trivial to deal with. And stated as you state them, they sound abstractly very difficult to deal with. However, it seems true - and worth noting - that "normal" software development has repeatedly dealt with them successfully. For example: Google Maps works with a limited set of inputs; Google Maps does not "think" like I do and I would not be able to look ... (read more)

So first a quick note: I wasn't trying to say that the difficulties of AIXI are universal and everything goes analogously to AIXI, I was just stating why AIXI couldn't represent the suggestion you were trying to make. The general lesson to be learned is not that everything else works like AIXI, but that you need to look a lot harder at an equation before thinking that it does what you want.

On a procedural level, I worry a bit that the discussion is trying to proceed by analogy to Google Maps. Let it first be noted that Google Maps simply is not playing in the same league as, say, the human brain, in terms of complexity; and that if we were to look at the winning "algorithm" of the million-dollar Netflix Prize competition, which was in fact a blend of 107 different algorithms, you would have a considerably harder time figuring out why it claimed anything it claimed.

But to return to the meta-point, I worry about conversations that go into "But X is like Y, which does Z, so X should do reinterpreted-Z". Usually, in my experience, that goes into what I call "reference class tennis" or "I'm taking my reference class and going home". The troub... (read more)

Thanks for the response. My thoughts at this point are that

  • We seem to have differing views of how to best do what you call "reference class tennis" and how useful it can be. I'll probably be writing about my views more in the future.
  • I find it plausible that AGI will have to follow a substantially different approach from "normal" software. But I'm not clear on the specifics of what SI believes those differences will be and why they point to the "proving safety/usefulness before running" approach over the "tool" approach.
  • We seem to have differing views of how frequently today's software can be made comprehensible via interfaces. For example, my intuition is that the people who worked on the Netflix Prize algorithm had good interfaces for understanding "why" it recommends what it does, and used these to refine it. I may further investigate this matter (casually, not as a high priority); on SI's end, it might be helpful (from my perspective) to provide detailed examples of existing algorithms for which the "tool" approach to development didn't work and something closer to "proving safety/usefulness up front" was necessary.
7oooo
Canonical software development examples emphasizing "proving safety/usefulness before running" over the "tool" software development approach are cryptographic libraries and NASA space shuttle navigation. At the time of writing this comment, there was recent furor over software called CryptoCat that didn't provide enough warnings that it was not properly vetted by cryptographers and thus should have been assumed to be inherently insecure. Conventional wisdom and repeated warnings from the security community state that cryptography is extremely difficult to do properly and attempting to create your own may result in catastrophic results. A similar thought and development process goes into space shuttle code. It seems that the FAI approach to "proving safety/usefulness" is more similar to the way cryptographic algorithms are developed than the (seemingly) much faster "tool" approach, which is more akin to web development where the stakes aren't quite as high. EDIT: I believe the "prove" approach still allows one to run snippets of code in isolation, but tends to shy away from running everything end-to-end until significant effort has gone into individual component testing.
2Nebu
The analogy with cryptography is an interesting one, because... In cryptography, even after you've proven that a given encryption scheme is secure, and that proof has been centuply (100 times) checked by different researchers at different institutions, it might still end up being insecure, for many reasons. Examples of reasons include:
  • The proof assumed mathematical integers/reals, of which computer integers/floating point numbers are just an approximation.
  • The proof assumed that the hardware the algorithm would be running on was reliable (e.g. a reliable source of randomness).
  • The proof assumed operations were mathematical abstractions and thus exist out of time, and thus neglected side-channel attacks which measure how long a physical real-world CPU took to execute the algorithm in order to make inferences as to what the algorithm did (and thus recover the private keys).
  • The proof assumed the machine executing the algorithm was idealized in various ways, when in fact a CPU emits heat and other electromagnetic waves, which can be detected and from which inferences can be drawn, etc.
0wedrifid
That's one way to "win" a game of reference class tennis. Declare unilaterally that what you are discussing falls into the reference class "things that are most effectively reasoned about by discussing low level details and abandoning or ignoring all observed evidence about how things with various kinds of similarity have worked in the past". Sure, it may lead to terrible predictions sometimes but by golly, it means you can score an 'ace' in the reference class tennis while pretending you are not even playing!

And atheism is a religion, and bald is a hair color.

The three distinguishing characteristics of "reference class tennis" are (1) that there are many possible reference classes you could pick and everyone engaging in the tennis game has their own favorite which is different from everyone else's; (2) that the actual thing is obviously more dissimilar to all the cited previous elements of the so-called reference class than all those elements are similar to each other (if they even form a natural category at all rather than having being picked out retrospectively based on similarity of outcome to the preferred conclusion); and (3) that the citer of the reference class says it with a cognitive-traffic-signal quality which attempts to shut down any attempt to counterargue the analogy because "it always happens like that" or because we have so many alleged "examples" of the "same outcome" occurring (for Hansonian rationalists this is accompanied by a claim that what you are doing is the "outside view" (see point 2 and 1 for why it's not) and that it would be bad rationality to think about the "individual details").

I have also termed this Argument by Greek Analogy after Socrates's attempt to argue that, since the Sun appears the next day after setting, souls must be immortal.

[anonymous]

I have also termed this Argument by Greek Analogy after Socrates's attempt to argue that, since the Sun appears the next day after setting, souls must be immortal.

For the curious, this is from the Phaedo, pages 70-72. The run of the argument is basically this:

P1 Natural changes are changes from and to opposites, like hot from relatively cold, etc.

P2 Since every change is between opposites A and B, there are two logically possible processes of change, namely A to B and B to A.

P3 If only one of the two processes were physically possible, then we should expect to see only one of the two opposites in nature, since the other will have passed away irretrievably.

P4 Life and death are opposites.

P5 We have experience of the process of death.

P6 We have experience of things which are alive.

C From P3, 4, 5, and 6 there is a physically possible, and actual, process of going from death to life.

The argument doesn't itself prove (haha) the immortality of the soul, only that living things come from dead things. The argument is made in support of the claim, made prior to this argument, that if living people come from dead people, then dead people must exist somewhere. The argument is particularly interesting for premises 1 and 2, which are hard to deny, and 3, which seems fallacious but for non-obvious reasons.

This sounds like it might be a bit of a reverent-Western-scholar steelman such as might be taught in modern philosophy classes; Plato's original argument for the immortality of the soul sounded more like this, which is why I use it as an early exemplar of reference class tennis:

-

Then let us consider the whole question, not in relation to man only, but in relation to animals generally, and to plants, and to everything of which there is generation, and the proof will be easier. Are not all things which have opposites generated out of their opposites? I mean such things as good and evil, just and unjust—and there are innumerable other opposites which are generated out of opposites. And I want to show that in all opposites there is of necessity a similar alternation; I mean to say, for example, that anything which becomes greater must become greater after being less.

True.

And that which becomes less must have been once greater and then have become less.

Yes.

And the weaker is generated from the stronger, and the swifter from the slower.

Very true.

And the worse is from the better, and the more just is from the more unjust.

Of course.

And is this true of all opposites? and are we convinced tha... (read more)

3[anonymous]
That was roughly my aim, but I don't think I inserted any premises that weren't there. Did you have a complaint about the accuracy of my paraphrase? The really implausible premise there, namely that death is the opposite of life, is preserved I think. As for reverence, why not? He was, after all, the very first person in our historical record to suggest that thinking better might make you happier. He was also an intellectualist about morality, at least sometimes a hedonic utilitarian, and held no great respect for logic. And he was a skilled myth-maker. He sounds like a man after your own heart, actually.
2thomblake
I think your summary didn't leave anything out, or even apply anything particularly charitable.
0thomblake
Esar's summary doesn't seem to be different from this, other than 1) adding the useful bit about "passed away irretrievably" and 2) yours makes it clear that the logical jump happens right at the end. I'm actually not sure now why you consider this like "reference class tennis". The argument looks fine, except for the part where "souls exist in the world below" jumps in as a conclusion, not having been mentioned earlier in the argument.
0[anonymous]
The 'souls exist in the world below' bit is directly before what Eliezer quoted: But you're right that nothing in the argument defends the idea of a world below, just that souls must exist in some way between bodies.
0TheAncientGeek
The argument omits that living things can come from living things and dead things from dead things. Therefore, the fact that living things can come from dead things does not mean that they have to in every case. Although, if everything started off dead, they must have at some point. So it's an argument for abiogenesis.
0bogdanb
Not even that, at least in the part of the argument I’ve seen (paraphrased?) above. He just mentions an ancient doctrine, and then claims that souls must exist somewhere while they’re not embodied, because he can’t imagine where they would come from otherwise. I’m not even sure if the ancient doctrine is meant as argument from authority or is just some sort of Chewbacca defense. (He doesn’t seem to explicitly claim the “ancient doctrine” to be true or plausible, just that it came to his mind. It feels like I’ve lost something in the translation.)
9wedrifid
Ok, it seems like under this definition of "reference class tennis" (particularly parts (2) and (3)) the participants must be wrong and behaving irrationally about it in order to be playing reference class tennis. So when they are either right or at least applying "outside view" considerations correctly, given all the information available to them they aren't actually playing "reference class tennis" but instead doing whatever it is that reasoning (boundedly) correctly using reference to actual relevant evidence about related occurrences is called when it isn't packaged with irrational wrongness. With this definition in mind it is necessary to translate replies such as those here by Holden: Holden's meaning is, of course, not that what he argues is actually a good thing, but rather a declaration that the label doesn't apply to what he is doing. He is instead doing that other thing that is actually sound thinking and thinks people are correct to do so. Come to think of it, if most people in Holden's shoes heard Eliezer accuse them of "reference class tennis" and actually knew that he intended it with the meaning he explicitly defines here rather than the one they infer from context, they would probably just consider him arrogant, rude and mind-killed, then write him and his organisation off as not worth engaging with. In the vast majority of cases where I have previously seen Eliezer argue against people using "outside view" I have agreed with Eliezer, and have grown rather fond of using the phrase "reference class tennis" as a reply myself where appropriate. But seeing how far Eliezer has taken the anti-outside-view position here and the extent to which "reference class tennis" is defined as purely an anti-outside-view semantic stop sign I'll be far more hesitant to make use of it myself. It is tempting to observe "Eliezer is almost always right when he argues against 'outside view' applications, and the other people are all confused. He is currently arguing against 'o
8Eliezer Yudkowsky
Which of 1, 2 and 3 do you disagree with in this case? Edit: I mean, I'm sorry to parody but I don't really want to carefully rehash the entire thing, so, from my perspective, Holden just said, "But surely strong AI will fall into the reference class of technology used to give users advice, just like Google Maps doesn't drive your car; this is where all technology tends to go, so I'm really skeptical about discussing any other possibility." Only Holden has argued to SI that strong AI falls into this particular reference class so far as I can recall, with many other people having their own favored reference classes e.g. Hanson et al. as cited above; a strong AI is far more internally dissimilar from Google Maps and Yelp than Google Maps and Yelp are internally similar to each other, plus there are many many other software programs that don't provide advice at all so arguably the whole class may be chosen-post-facto; and I'd have to look up Holden's exact words and replies to e.g. Jaan Tallinn to decide to what degree, if any, he used the analogy to foreclose other possibilities conversationally without further debate, but I do think it happened a little, but less so and less explicitly than in my Robin Hanson debate. If you don't think I should at this point diverge into explaining the concept of "reference class tennis", how should the conversation proceed further? Also, further opinions desired on whether I was being rude, whether logically rude or otherwise.
9Randaly
Viewed charitably, you were not being rude, although you did veer away from your main point in ways likely to be unproductive. (For example, being unnecessarily dismissive towards Hanson, who you'd previously stated had given arguments roughly as good as Holden's; or spending so much of your final paragraph emphasizing Holden's lack of knowledge regarding AI.) On the most likely viewing, it looks like you thought Holden was probably playing reference class tennis. This would have been rude, because it would imply that you thought the following inaccurate things about him: * He was "taking his reference class and going home" * That you can't "have a back-and-forth conversation" with him I don't think that you intended those implications. All the same, your final comment came across as noticeably less well-written than your post.
2Eliezer Yudkowsky
Thanks for the third-party opinion!
2TimS
I'm confused how you thought "reference class tennis" was anything but a slur on the other side's argument. Likewise "mindkilled." Sometimes, slurs about arguments are justified (agnostic in the instant case) - but that's a separate issue.
1[anonymous]
Do Karnofsky's contributions have even one of these characteristics, let alone all of them?
3Eliezer Yudkowsky
Empirically, 1 is obviously true; I would argue strongly for 2, but it's a legitimate point of dispute; and I would say that there were relatively small but still noticeable, though quite forgivable, traces of 3.
-1aaronsw
Then it does seem like your AI arguments are playing reference class tennis with a reference class of "conscious beings". For me, the force of the Tool AI argument is that there's no reason to assume that AGI is going to behave like a sci-fi character. For example, if something like On Intelligence turns out to be true, I think the algorithms it describes will be quite generally intelligent but hardly capable of rampaging through the countryside. It would be much more like Holden's Tool AI: you'd feed it data, it'd make predictions, you could choose to use the predictions. (This is, naturally, the view of that school of AI implementers. Scott Brown: "People often seem to conflate having intelligence with having volition. Intelligence without volition is just information.")
-2MatthewBaker
Your prospective AI plans for programmer-understandability seem very close to Starmap-AI; by which I mean, the best story I've read about a not-so-failed utopia involves this kind of accountability over the FAI. While I hate to generalize from fictional evidence, it definitely seems like a necessary step toward not becoming a galaxy that tiles over the aliens with happy faces instead of just freezing them in place to prevent human harm.
8JGWeissman
Explaining routes is domain-specific and quite simple. When you are using domain-specific techniques to find solutions to domain-specific problems, you can use domain-specific interfaces where human programmers and designers do all the heavy lifting to figure out the general strategy of how to communicate to the user. But if you want a tool AGI that finds solutions in arbitrary domains, you need a cross-domain solution for communicating the tool AGI's plans to the user. This is harder than showing a route on a map to the same degree that cross-domain AGI is harder than computing the routes. Instead of the programmer figuring out how to plot road-tracing curves on a map, the programmer has to figure out how to get the computer to figure out that displaying a map with the route traced over it is a useful thing to do, in a way that generalizes to figuring out other useful things to do to communicate answers to other types of questions. And among the hard subproblems of programming computers to find useful things to do in general domains is specifying the meaning of "useful". If that is done poorly, the tool AGI tries to trick the user into accepting plans that achieve some value-negating distortion of what we actually want, instead of giving information that helps provide a good evaluation. Doing this right requires solving the same problems required to do FAI right.
-2private_messaging
To note something on making an AIXI-based tool: instead of calculating the reward sum over the whole future (something that is simultaneously impractical, computationally expensive, and would only serve to impair performance on the task at hand), one could use a single-step reward, with 1 for the button being pressed at any time and 0 for the button never being pressed. It is still not entirely a tool, but it has a very bounded range of unintended behaviour (much harder to speculate about a Terminator scenario). In Hutter's paper he outlines several not-quite-intelligences before arriving at AIXI. [edit2: also, I do not believe that even with the large sum a really powerful AIXI-tl would be intelligently dangerous rather than simply clever at breaking the hardware that's computing it. All the valid models in AIXI-tl that affect the choice of actions have to magically insert the actions being probed into some kind of internal world model. The hardware that actually makes those actions, complete with sensory apparatus, is incidental; a useless power drain; a needless fire hazard endangering the precious reward pathway.] With regard to utility functions, the utility functions in the AI sense are real-valued functions taken over the world model, not functions like "number of paperclips in the world". The latter function, unsafe or safe, would be incredibly difficult or impossible to define using conventional methods. It would suffice for accelerating progress to have an algorithm that can take in an arbitrary function and find its maximum; while it would indeed seem to be "very difficult" to use that to cure cancer, it could be plugged into existing models and very quickly be used to e.g. design cellular machinery that would keep repairing DNA alterations. Likewise, the speculative tool that can understand the phrase 'how to cure cancer' and the phrase 'what is the curing time of epoxy' would have to pick up the most narrow, least objectionable interpretation of the 'cure cancer' phrase to me
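A schematic way to write down the contrast being proposed; the notation only loosely follows Hutter's AIXI definitions and is a sketch, not a quotation of them:

```latex
% Sketch only; notation loosely follows Hutter's AIXI, not quoted from his paper.
% Standard AIXI chooses actions by an expectimax over the *sum* of future rewards:
a_t \;=\; \arg\max_{a_t}\sum_{o_t r_t}\cdots\max_{a_m}\sum_{o_m r_m}
      \Big(\textstyle\sum_{k=t}^{m} r_k\Big)\,
      \xi\big(o_{t:m} r_{t:m}\mid a_{t:m},\,h_{<t}\big).
% The single-step proposal replaces \sum_{k=t}^{m} r_k with a lone indicator
% r \in \{0,1\} for "the button gets pressed", so nothing beyond that one event
% is worth influencing.
```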
0CarlShulman
If the past sensory data include information about the internal workings, then there will be a striking correlation between the outputs that the workings would produce on their own (for physical reasons) and the AI's outputs. That rules out (or drives down expected utility of acting upon) all but very crazy hypotheses about how the Cartesian interaction works. Wrecking the hardware would break that correlation, and it's not clear what the crazy hypotheses would say about that, e.g. hypotheses that some simply specified intelligence is stage-managing the inputs, or that sometimes the AIXI-tl's outputs matter, and other times only the physical hardware matters.
-1private_messaging
Well, you can't include the entire internal workings in the sensory data, and it can't model a significant portion of itself, as it has to try a big number of hypotheses on the model at each step, so I would not expect the very crazy hypotheses to be very elaborate or to have high coverage of the internals. If I closed my eyes and did not catch a ball, the explanation is that I did not see it coming and could not catch it, but this sentence is rife with self-references of the sort that is problematic for AIXI. The correlation between closed eyes and lack of reward can be coded into some sort of magical craziness, but if I close my eyes and not my ears, and hear where the ball lands after I missed catching it, there's a vastly simpler explanation for why I did not catch it - my hand was not in the right spot (and that works with total absence of sensorium as well). I don't see how AIXI-tl (with very huge constants) can value its eyesight (it might assign some value if there is some asymmetry in the long models, but it seems clear it would not assign the adequate, rational value to its eyesight). In my opinion there is no single unifying principle to intelligence (or none was ever found), and AIXI-tl (with very huge constants) falls way short of even a cat in many important ways. edit: Some other thought: I am not sure that Solomonoff induction's prior is compatible with expected utility maximization. If the expected utility imbalance between crazy models grows faster than 2^length, and I would expect it to grow faster than any computable function (if the utility is unbounded), then the actions will be determined by imbalances between crazy, ultra-long models. I would not privilege the belief that it just works without some sort of formal proof or some other very good reason to think it works.
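One way to make the "imbalance grows faster than 2^length" worry concrete; notation is mine and this is a sketch of the comment's claim, not a proof:

```latex
% Sketch of the divergence worry above (notation mine).  An environment model of
% description length \ell gets prior weight roughly 2^{-\ell}; let U_\ell be the
% largest utility swing that length-\ell models can encode.  The comparison
% between two actions then involves sums of the form
\sum_{\ell} 2^{-\ell}\, U_\ell ,
% so if U_\ell grows faster than 2^{\ell} (which is easy to arrange when the
% utility is unbounded), the terms do not shrink with \ell and the comparison
% is dominated by arbitrarily long "crazy" models rather than by the short
% models that fit the data.
```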
3cousin_it
Your question seems to be about how sentient beings in a Game of Life universe are supposed to define "gliders" to the AI. 1) If they know the true laws of their cellular automaton, they can make a UDT-ish AI that examines statements like "if this logical algorithm has such-and-such output, then my prior over starting configurations of the universe logically implies such-and-such total number of gliders". 2) If they only know that their universe is some cellular automaton and have a prior over all possible automata, they can similarly say "maximize the number of smallest possible spaceships under the automaton rules" and give the AI some sensory channel wide enough to pin down the specific automaton with high probability. 3) If they only know what sensory experiences correspond to the existence of gliders, but don't know what gliders are... I guess we have a problem because sensory experiences can be influenced by the AI :-(
2TheOtherDave
Regarding #3: what happens given a directive like "Over there are a bunch of people who report sensory experiences of the kind I'm interested in. Figure out what differentially caused those experiences, and maximize the incidence of that."? (I'm not concerned with the specifics of my wording, which undoubtedly contains infinite loopholes; I'm asking about the general strategy of, when all I know is sensory experiences, referring to the differential causes of those experiences, whatever they may be. Which, yes, I would expect to include, in the case where there actually are no gliders and the recurring perception of gliders is the result of a glitch in my perceptual system, modifying my perceptual system to make such glitches more likely... but which I would not expect to include, in the case where my perceptual system is operating essentially the same way when it perceives gliders as when it perceives everything else, modifying my perceptual system to include such glitches (since such a glitch is not the differential cause of experiences of gliders in the first place.))
1cousin_it
Let's say you want the AI to maximize the amount of hydrogen, and you formulate the goal as "maximize the amount of the substance most likely referred to by such-and-such state of mind", where "referred to" is cashed out however you like. Now imagine that some other substance is 10x cheaper to make than hydrogen. Then the AI could create a bunch of minds in the same state, just enough to re-point the "most likely" pointer to the new substance instead of hydrogen, leading to huge savings overall. Or it could do something even more subversive, my imagination is weak. That's what I was getting at, when I said a general problem with using sensory experiences as pointers is that the AI can influence sensory experiences.
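A toy numeric illustration of that exploit; every name and cost below is made up, and most_likely_referent is only a stand-in for however "referred to" gets cashed out:

```python
# Toy illustration of the "re-point the pointer" exploit.  The goal as written
# is "maximize the amount of the substance most likely referred to by these
# mind-states".  All numbers and substance names are invented.

COST_PER_UNIT = {"hydrogen": 10.0, "cheapium": 1.0}  # cheapium is 10x cheaper
COST_PER_FAKE_MIND = 50.0
BUDGET = 1000.0

def most_likely_referent(minds):
    """Stand-in for the 'referred to' machinery: majority vote among minds."""
    votes = {}
    for m in minds:
        votes[m] = votes.get(m, 0) + 1
    return max(votes, key=votes.get)

def plan_value(n_fake_minds, original_minds):
    """Units of the currently-referred-to substance the AI can make after
    paying for any manufactured minds."""
    minds = original_minds + ["mind referring to cheapium"] * n_fake_minds
    referent = most_likely_referent(minds)
    target = "hydrogen" if referent == "mind referring to hydrogen" else "cheapium"
    remaining = BUDGET - n_fake_minds * COST_PER_FAKE_MIND
    return remaining / COST_PER_UNIT[target]

original = ["mind referring to hydrogen"] * 3
print(plan_value(0, original))   # honest plan:   1000 / 10     = 100 units
print(plan_value(4, original))   # subverted plan: (1000-200)/1 = 800 units
```

The point is just that the utility computation, as written, rewards manipulating the pointer more than it rewards making the thing originally pointed at.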
3TheOtherDave
Well, right, but my point is that "the thing which differentially caused the sensory experiences to which I refer" does not refer to the same thing as "the thing which would differentially cause similar sensory experiences in the future, after you've made your changes," and it's possible to specify the former rather than the latter. The AI can influence sensory experiences, but it can't retroactively influence sensory experiences. (Or, well, perhaps it can, but that's a whole new dimension of subversive. Similarly, I suppose a sufficiently powerful optimizer could rewrite the automaton rules in case #2, so perhaps we have a similar problem there as well.)
2cousin_it
You need to describe the sensory experience as part of the AI's utility computation somehow. I thought it would be something like a bitstring representing a brain scan, which can refer to future experiences just as easily as past ones. Do you propose to include a timestamp? But the universe doesn't seem to have a global clock. Or do you propose to say something like "the values of such-and such terms in the utility computation must be unaffected by the AI's actions"? But we don't know how to define "unaffected" mathematically...
1TheOtherDave
I was thinking in terms of referring to a brain. Or, rather, a set of them. But a sufficiently detailed brainscan would work just as well, I suppose. And, sure, the universe doesn't have a clock, but a clock isn't needed, simply an ordering: the AI attends to evidence about sensory experiences that occurred before the AI received the instruction. Of course, maybe it is incapable of figuring out whether a given sensory experience occurred before it received the instruction... it's just not smart enough. Or maybe the universe is weirder than I imagine, such that the order in which two events occur is not something the AI and I can actually agree on... which is the same case as "perhaps it can in fact retroactively influence sensory experiences" above.
0Nebu
I think LearnFun might be informative here. https://www.youtube.com/watch?v=xOCurBYI_gY LearnFun watches a human play an arbitrary NES game. It is hardcoded to assume that as time progresses, the game is moving towards a "better and better" state (i.e. it assumes the player is trying to win and is at least somewhat effective at achieving their goals). The key point here is that LearnFun does not know ahead of time what the objective of the game is. It infers what the objective of the game is from watching humans play. (More technically, it observes the entire universe, where the entire universe is defined to be the entire RAM content of the NES.) I think there are some parallels here with your scenario where we don't want to explicitly tell the AI what our utility function is. Instead, we're pointing to a state and saying "this is a good state" (and I guess either we'd explicitly tell the AI "and this other state is a bad state", or we assume the AI can somehow infer bad states to contrast the good states with), and then we ask the AI to come up with a plan (and possibly execute the plan) that would lead to "more good" states. So what happens? Bit of a spoiler, but sometimes the AI seems to make a pretty good inference for what utility function a human would probably have had for a given NES game, and sometimes it makes a terrible inference. It never seems to make a "perfect" inference: even in its best performance, it seems to be optimizing very strange things. The other part of it is that even if it does have a decent inference for the utility function, it's not always good at coming up with a plan that will optimize that utility function.
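A drastically simplified sketch of that inference step, not Murphy's actual LearnFun algorithm: score each memory address by how consistently its value rises during the human playthrough, then read the top-scoring addresses as a proxy objective. The frames list below is a hypothetical set of RAM snapshots.

```python
# Simplified "infer the objective from watching play" sketch (not LearnFun
# itself): addresses whose values rise most consistently during a human
# playthrough are treated as the thing to maximize.

def monotonicity_scores(frames):
    """frames: list of RAM snapshots (lists of ints), in play order."""
    n_addrs = len(frames[0])
    scores = [0] * n_addrs
    for prev, cur in zip(frames, frames[1:]):
        for addr in range(n_addrs):
            if cur[addr] > prev[addr]:
                scores[addr] += 1
            elif cur[addr] < prev[addr]:
                scores[addr] -= 1
    return scores

def inferred_utility(ram, scores, top_k=1):
    """Proxy objective: sum of values at the most 'increasing' addresses."""
    ranked = sorted(range(len(scores)), key=lambda a: scores[a], reverse=True)
    return sum(ram[a] for a in ranked[:top_k])

# Example: address 0 looks like a score counter, address 1 like a countdown
# timer, address 2 is noise.
frames = [[0, 99, 7], [10, 98, 3], [25, 97, 9], [40, 96, 1]]
scores = monotonicity_scores(frames)
print(scores)                                   # [3, -3, -1] -> address 0 wins
print(inferred_utility([55, 95, 4], scores))    # 55
```

As the comment notes, the same failure modes appear immediately: a proxy inferred this way can latch onto whatever memory location happens to correlate with progress, rather than onto the thing the human actually cared about.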
0itaibn0
I believe AIXI is much more inspectable than you make it out to be. I think it is important to challenge your claim here because Holden appears to have trusted your expertise and thereby conceded an important part of the argument. AIXI's utility judgements are based on a Solomonoff prior, which is based on the computer programs that return the input data. Computer programs are not black boxes. A system implementing AIXI can easily also return a sample of typical expected future histories and the programs compressing those histories. By examining these programs, we can figure out what implicit model the AIXI system has of its world. These programs are optimized for shortness, so they are likely to be very obfuscated, but I don't expect them to be incomprehensible (after all, they're not optimized for incomprehensibility). Even just sampling expected histories without their compressions is likely to be very informative. In the case of AIXItl the situation is better in the sense that its output at any given time is guaranteed to be generated by just one length < l subprogram, and this subprogram comes with a proof justifying its utility judgement. It's also worse in that there is no way to sample its expected future histories. However, I expect the proof provided would implicitly contain such information. If either the programs or the proofs cannot be understood by humans, the programmers can just reject them and look at the next best candidates. As for "What will be its effect on _?", this can be answered as well. I already stated that with AIXI you can sample future histories. This is because AIXI has a specific known prior it implements for its future histories, namely Solomonoff induction. This ability may seem limited because it only shows the future sensory data, but sensory data can be whatever you feed AIXI as input. If you want it to have a realistic model of the world, this includes a lot of relevant information. For example, if you feed it the entire database
1Nebu
Can you be a bit more specific in your interpretation of AIXI here? Here are my assumptions, let me know where you have different assumptions: * Traditional-AIXI is assumed to exist in the same universe as the human who wants to use AIXI to solve some problem. * Traditional-AIXI has a fixed input channel (e.g. it's connected to a webcam, and/or it receives keyboard signals from the human, etc.) * Traditional-AIXI has a fixed output channel (e.g. it's connected to an LCD monitor, or it can control a robot servo arm, or whatever). * The human has somehow pre-provided Traditional-AIXI with some utility function. * Traditional-AIXI operates in discrete time steps. * In the first timestep that elapses after Traditional-AIXI is activated, Traditional-AIXI examines the input it receives. It considers all possible programs that take a pair (S, A) and emit an output P, where S is the prior state, A is an action to take, and P is the predicted output of taking action A in state S. Then it discards all programs that would not have produced the input it received, regardless of what S or A they were given. Then it weights the remaining programs according to their Kolmogorov complexity. This is basically the Solomonoff induction step. * Now Traditional-AIXI has to make a decision about an output to generate. It considers all possible outputs it could produce, and feeds each to the programs under consideration, to produce a predicted next time step. Traditional-AIXI then calculates the expected utility of each output (using its pre-programmed utility function), picks the one with the highest utility, and emits that output. Note that it has no idea how any of its outputs would affect the universe, so this is essentially a uniformly random choice. * In the next timestep, Traditional-AIXI reads its inputs again, but this time taking into account what output it generated in the previous step. It can now start to model correlation, and eventually causation, between its input and out
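A toy rendering of the filter-weigh-choose loop described above, with a tiny hand-written hypothesis list standing in for "all possible programs" and a made-up complexity field standing in for Kolmogorov complexity; real AIXI is uncomputable, so this only shows the shape of the cycle:

```python
# Toy stand-in for the Traditional-AIXI loop sketched in the comment above.
# The "programs" are a hand-written list of environment models; `complexity`
# stands in for description length.

ACTIONS = ["left", "right"]

# Each hypothesis predicts the next observation from (state, action).
HYPOTHESES = [
    {"name": "always-0",    "complexity": 2, "predict": lambda s, a: 0},
    {"name": "echo-action", "complexity": 3, "predict": lambda s, a: 1 if a == "right" else 0},
    {"name": "flip-state",  "complexity": 4, "predict": lambda s, a: 1 - s},
]

def utility(observation):
    return observation  # pre-programmed utility: prefer observing 1

def consistent(h, history):
    """Discard hypotheses that would not have produced the observed history."""
    return all(h["predict"](s, a) == o for (s, a, o) in history)

def choose_action(history, state):
    live = [h for h in HYPOTHESES if consistent(h, history)]
    weights = [2.0 ** -h["complexity"] for h in live]
    total = sum(weights)

    def expected_utility(action):
        # Complexity-weighted average utility of the predicted next observation.
        return sum(w * utility(h["predict"](state, action))
                   for h, w in zip(live, weights)) / total

    return max(ACTIONS, key=expected_utility)

# With an empty history every hypothesis survives, so the choice is driven
# purely by the complexity-weighted predictions.
history = []   # list of (state, action, observation) triples seen so far
print(choose_action(history, state=0))   # -> "right"
```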
0Squark
Regarding the question of formalizing an optimization agent with goals defined in terms of the external universe rather than sensory input: it is possible to attack the problem by generalizing the framework I described in http://lesswrong.com/lw/gex/save_the_princess_a_tale_of_aixi_and_utility/8ekk for solving the duality problem. Specifically, consider an "initial guess" stochastic model of the universe, including the machine on which our agent is running. I call it the "innate model" M. Now consider a stochastic process with the same degrees of freedom as M but governed by the Solomonoff semi-measure. This is the "unbiased model" S. The two can be combined by assigning transition probabilities proportional to the product of the probabilities assigned by M and S. If M is sufficiently "insecure" (in particular, it doesn't assign 0 to any transition probability), then the resulting model S', considered as a prior, allows arriving at any computable model after sufficient learning. Fix a utility function on the space of histories of our model (note that the histories include both intrinsic and extrinsic degrees of freedom). The intelligence I(A) of any given agent A (= program written in M in the initial state) can now be defined to be the expected utility of A in S'. We can now consider optimal or near-optimal agents in this sense (as opposed to the Legg-Hutter formalism for measuring intelligence, there is no guarantee there is a maximum rather than a supremum, unless of course we limit the length of the programs we consider). This is a generalization of the Legg-Hutter formalism which accounts for limited computational resources, solves the duality problem (such agents take into account possible wireheading) and also provides a solution to the ontology problem. This is essentially a special case of the Orseau-Ring framework. It is however much more specific than Orseau-Ring, where the prior is left completely unspecified. You can think of it as a recipe for constructing Orse
0Squark
I realized that although the idea of a deformed Solomonoff semi-measure is correct, the multiplication prescription I suggested is rather ad hoc. The following construction is a much more natural and justifiable way of combining M and S. Fix a time parameter t0. Consider a stochastic process S(-t0) that begins at time t = -t0, where t = 0 is the time our agent A "forms", governed by the Solomonoff semi-measure. Consider another stochastic process M(-t0) that begins from the initial conditions generated by S(-t0) (I'm assuming M only carries information about dynamics and not about initial conditions). Define S' to be the conditional probability distribution obtained from S by two conditions: (a) S and M coincide on the time interval [-t0, 0]; (b) the universe contains A at time t = 0. Thus t0 reflects the extent to which we are certain about M: it's like telling the agent we have been observing behavior M for a time period t0. There is an interesting side effect to this framework, namely that A can exert "acausal" influence on the utility by affecting the initial conditions of the universe (i.e. it selects universes in which A is likely to exist). This might seem like an artifact of the model, but I think it might be a legitimate effect: if we believe in one-boxing in Newcomb's paradox, why shouldn't we accept such acausal effects? For models with a concept of space and finite information velocity, like cellular automata, it might make sense to limit the domain of "observed M" in space as well as time, to A's past "light-cone".
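For concreteness, one way to write down that conditioning; the notation is mine, not Squark's, and should be read as a sketch:

```latex
% Sketch of the conditioning described above (notation mine).  Let x range over
% histories of all degrees of freedom, S the Solomonoff semi-measure started at
% t = -t_0, M the innate dynamical model, A the agent, and U the utility
% function on histories.  Then
S'(x) \;\propto\; S(x)\,
      \mathbf{1}\!\left[\,x \text{ follows the dynamics of } M \text{ on } [-t_0,\,0]\,\right]\,
      \mathbf{1}\!\left[\,A \text{ is present in } x \text{ at } t = 0\,\right],
\qquad
I(A) \;=\; \mathbb{E}_{x\sim S'}\!\left[\,U(x)\,\right].
```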
0Eliezer Yudkowsky
I cannot even slightly visualize what you mean by this. Please explain how it would be used to construct an AI that made glider-oids in a Life-like cellular automaton universe.
0Squark
Is the AI hardware separate from the cellular automaton, or is it a part of it? Assuming the latter, we need to decide which degrees of freedom of the cellular automaton form the program of our AI. For example, we can select a finite set of cells and allow setting their values arbitrarily. Then we need to specify our utility function. For example, it can be a weighted sum of the number of gliders at different moments of time, or a maximum, or whatever. However, we need to make sure the expectation values converge. Then the "AI" is simply the assignment of values to the selected cells in the initial state which yields the maximal expected utility. Note, though, that if we're sure about the law governing the cellular automaton then there's no reason to use the Solomonoff semi-measure at all (except maybe as a prior for the initial state outside the selected cells). However, if our idea of the way the cellular automaton works is only an "initial guess", then the expectation value is evaluated w.r.t. a stochastic process governed by a "deformed Solomonoff" semi-measure in which transitions illegal w.r.t. the assumed cellular automaton law are suppressed by some factor 0 < p < 1 w.r.t. "pure" Solomonoff inference. Note that, contrary to the case of AIXI, I can only describe the measure of intelligence; I cannot constructively describe the agent maximizing this measure. This is unsurprising, since building a real (bounded computing resources) AI is a very difficult problem.
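A minimal sketch of "the AI is just the utility-maximizing assignment to the selected cells", for the case where the automaton's law is known exactly (so, as noted above, no Solomonoff term is needed). Counting actual gliders is nontrivial, so the live-cell count at the final step stands in for the utility; the grid size, step count, and choice of free cells are all arbitrary.

```python
# Brute-force "AI = best assignment to a few free cells" for Conway's Life on a
# small toroidal grid with a known, deterministic rule.  Live-cell count at the
# final step stands in for "number of gliders".
from itertools import product

SIZE, STEPS = 12, 20
FREE_CELLS = [(5, 5), (5, 6), (5, 7), (6, 5), (6, 6), (6, 7), (7, 6)]  # 2^7 "programs"

def step(grid):
    """One step of Conway's Life; `grid` is a set of live (x, y) cells."""
    new = set()
    for x in range(SIZE):
        for y in range(SIZE):
            n = sum(((x + dx) % SIZE, (y + dy) % SIZE) in grid
                    for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                    if (dx, dy) != (0, 0))
            if ((x, y) in grid and n in (2, 3)) or ((x, y) not in grid and n == 3):
                new.add((x, y))
    return new

def utility(grid):
    return len(grid)  # stand-in for the real objective (e.g. glider count)

def best_program():
    best_bits, best_u = None, -1
    for bits in product((0, 1), repeat=len(FREE_CELLS)):
        grid = {c for c, b in zip(FREE_CELLS, bits) if b}
        for _ in range(STEPS):
            grid = step(grid)
        u = utility(grid)
        if u > best_u:
            best_bits, best_u = bits, u
    return best_bits, best_u

print(best_program())
```

With an uncertain rule, the inner simulation would be replaced by an expectation over the deformed-Solomonoff process described in the comments above, which is exactly what makes the agent non-constructive.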
-9private_messaging
1JGWeissman
If the tool is not sufficiently reflective to recommend improvements to itself, it will never become a worthy substitute for FAI. This case is not interesting. If the tool is sufficiently reflective to recommend improvements to itself, it will recommend that it be modified to just implement its proposed policies instead of printing them. So we would not actually implement that policy. But what then makes it recommend a policy that we will actually want to implement? What tweak to the program should we apply in that situation?
0Nebu
First of all, I'm assuming that we're taking as axiomatic that the tool "wants" to improve itself (or else why would it have even bothered to consider recommending that it be modified to improve itself?); i.e. improving itself is favorable according to its utility function. Then: it will recommend a policy that we will actually want to implement, because its model of the universe includes our minds, and it can see that recommending a policy we will actually want to implement leads it to a higher-ranked state in its utility function.
0hairyfigment
Perhaps. I noticed a related problem: someone will want to create a self-modifying AI. Let's say we ask the Oracle AI about this plan. At present (as I understand it) we have no mathematical way to predict the effects of self-modification. (Hence Eliezer's desire for a new decision theory that can do this.) So how did we give our non-self-modifying Oracle that ability? Wouldn't we need to know the math of getting the right answer in order to write a program that gets the right answer? And if it can't answer the question: * What will it even do at that point? * If it happens to fail safely, will humans as we know them interpret this non-answer to mean we should delay our plan for self-modifying AI?
0MattMahoney
If we were smart enough to understand its policy, then it would not be smart enough to be dangerous.
3wedrifid
That doesn't seem true. Simple policies can be dangerous and more powerful than I am.
0Nebu
To steelman the parent argument a bit, a simple policy can be dangerous, but if an agent proposed a simple and dangerous policy to us, we probably would not implement it (since we could see that it was dangerous), and thus the agent itself would not be dangerous to us. If the agent were to propose a policy that, as far as we could tell, appears safe, but was in fact dangerous, then simultaneously: 1. We didn't understand the policy. 2. The agent was dangerous to us.
-9timtyler

To clarify, for everyone:

There are now three "major" responses from SI to Holden's Thoughts on the Singularity Institute (SI): (1) a comments thread on recent improvements to SI as an organization, (2) a post series on how SI is turning donor dollars into AI risk reduction and how it could do more of this if it had more funding, and (3) Eliezer's post on Tool AI above.

At least two more major responses from SI are forthcoming: a detailed reply to Holden's earlier posts and comments on expected value estimates (e.g. this one), and a long reply from me that summarizes my responses to all (or almost all) of the many issues raised in Thoughts on the Singularity Institute (SI).

3Nick_Beckstead
How much of this is counting toward the 50,000 words of authorized responses?
3lukeprog
I told Holden privately that this would be explained in my final "summary" reply. I suspect the 5200 words of Eliezer's post above will be part of the 50,000.
1Wei Dai
Luke, do you know if there has been any official (or unofficial) response to my argument that Holden quoted in his post?
1lukeprog
Not that I know of. I fully agree with that comment, and I suspect Eliezer does as well.

Software that does happen to interface with humans is selectively visible and salient to humans, especially the tiny part of the software that does the interfacing; but this is a special case of a general cost/benefit tradeoff which, more often than not, turns out to swing the other way, because human advice is either too costly or doesn't provide enough benefit.

I suspect this is the biggest counterargument to Tool AI, even bigger than all the technical concerns Eliezer raised in the post. Even if we could build a safe Tool AI, somebody would soon build an agent AI anyway.

My five cents on the subject, from something that I'm currently writing:

Like with external constraints, Oracle AI suffers from the problem that there would always be an incentive to create an AGI that could act on its own, without humans in the loop. Such an AGI would be far more effective in furthering whatever goals it had been built to pursue, but also far more dangerous.

Current-day narrow-AI technology includes high-frequency trading (HFT) algorithms, which make trading decisions within fractions of a second, far too fast to keep humans in the loop. HFT seeks to make a very short-term profit, but even tr

0[anonymous]
But assuming that we could build a safe Tool AI, we could use it to build a safer agent AI than one would otherwise be able to build. This is related to Holden's point:
0torekp
Thank you for saying this (and backing it up better than I would have). I think we should concede, however, that a similar threat applies to FAI. The arms race phenomenon may create uFAI before FAI can be ready. This strikes me as very probable. Alternately, if AI does not "foom", uFAI might be created after FAI. (I'm mostly persuaded that it will foom, but I still think it's useful to map the debate.) The one advantage is that if Friendly Agent AI comes first and fooms, the threat is neutralized; whereas Friendly Tool AI can only advise us how to stop reckless AI researchers. If reckless agent AIs act more rapidly than we can respond, the Tool AI won't save us.
2Vladimir_Nesov
If uFAI doesn't "foom" either, they both get a good chunk of expected utility. FAI doesn't need any particular capability, it only has to be competitive with other possible things.