All of Benya's Comments + Replies

I don't follow. Could you make this example more formal, giving a set of outcomes, a set of lotteries over these outcomes, and a preference relation on these that corresponds to "I will act so that, at some point, there will have been a chance of me becoming a heavyweight champion of the world", and which fails Continuity but satisfies all other VNM axioms? (Intuitively this sounds more like it's violating Independence, but I may well be misunderstanding what you're trying to do since I don't know how to do the above formalization of your argument.)

Take a reasoner who can make pre-commitments (or a UDT/TDT type). This reasoner, in effect, only has to make a single decision for all time. Let A, B, C, ... be pure outcomes and a, b, c, ... be lotteries. Then define the following pseudo-utility function f: f(a) = 1 if the outcome A appears with non-zero probability in a, and f(a) = 0 otherwise. The decision maker will use f to rank options. This clearly satisfies completeness and transitivity (because it uses a numerical scale). And then... it gets tricky. I've seen independence written both in a < form and in a <= form. I have a strong hunch that the two versions are equivalent, given the other axioms. Anyway, the above decision process satisfies <= independence (but not < independence). To see that it satisfies <= independence, note that f(pa+(1-p)b) = max(f(a), f(b)) for 0 < p < 1. So if f(a) <= f(b), then f(pa+(1-p)c) = max(f(a), f(c)) <= max(f(b), f(c)) = f(pb+(1-p)c).
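The following is a minimal sketch of the pseudo-utility f above. It assumes that lotteries are finite dictionaries from outcome names to probabilities; this representation and the helper names `mix` and `f` are illustrative, not from the original. The sketch checks the identity f(pa+(1-p)b) = max(f(a), f(b)) for 0 < p < 1 and shows concretely why the strict (<) form of independence fails.

```python
from typing import Dict

Lottery = Dict[str, float]  # outcome name -> probability (illustrative representation)

def mix(p: float, a: Lottery, b: Lottery) -> Lottery:
    """Compound lottery p*a + (1-p)*b."""
    out: Lottery = {}
    for o, q in a.items():
        out[o] = out.get(o, 0.0) + p * q
    for o, q in b.items():
        out[o] = out.get(o, 0.0) + (1 - p) * q
    return out

def f(lottery: Lottery, target: str = "A") -> int:
    """Pseudo-utility: 1 if the target outcome has non-zero probability, else 0."""
    return 1 if lottery.get(target, 0.0) > 0 else 0

a = {"B": 1.0}            # f(a) = 0
b = {"A": 0.5, "C": 0.5}  # f(b) = 1
c = {"A": 1.0}            # f(c) = 1
p = 0.3                   # any p strictly between 0 and 1

# <= independence: f(pa+(1-p)c) = max(f(a), f(c)) <= max(f(b), f(c)) = f(pb+(1-p)c)
assert f(mix(p, a, c)) == max(f(a), f(c))
assert f(mix(p, a, c)) <= f(mix(p, b, c))

# < independence fails: f(a) < f(b), but mixing both with c erases the difference.
print(f(a) < f(b), f(mix(p, a, c)) < f(mix(p, b, c)))  # True False
```

The last line shows the asymmetry. A strict preference between a and b does not survive mixing with a lottery c that already contains A, while the weak inequality always does.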

Also, Magical Britain keeps Muggles out, going so far as to enforce this by not even allowing Muggles to know that Magical Britain exists. I highly doubt that Muggle Britain would do that to potential illegal immigrants even if it did have the technology...

That's just selection bias. You wouldn't know about it if they did.

Incidentally, the same argument also applies to Governor Earl Warren's statement quoted in "Absence of evidence is evidence of absence": he can be seen as arguing that there are at least three possibilities: (1) there is no fifth column, (2) there is a fifth column and it is supposed to do sabotage independent of an invasion, (3) there is a fifth column and it is supposed to aid a Japanese invasion of the West Coast. In case (2), you would expect to have seen sabotage; in cases (1) and (3), you wouldn't, because if the fifth column were known to exist by the t...

The true message of the first video is even more subliminal: The whiteboard behind him shows some math recently developed by MIRI, along with a (rather boring) diagram of Botworld :-)

Sorry about that; I've had limited time to spend on this, and have mostly come down on the side of trying to get more of my previous thinking out there rather than replying to comments. (It's a tradeoff where neither of the options is good, but I'll try to at least improve my number of replies.) I've replied there. (Actually, now that I spent some time writing that reply, I realize that I should probably just have pointed to Coscott's existing reply in this thread.)

I'm not sure which of the following two questions you meant to ask (though I guess probably the second one), so I'll answer both:

(a) "Under what circumstances is something (either an l-zombie or conscious)?" I am not saying that something is an l-zombie only if someone has actually written out the code of the program; for the purposes of this post, I assume that all natural numbers exist as platonic objects, and therefore all observers in programs that someone could in principle write and run exist at least as l-zombies.

(b) "When is a prog...

Thanks for clarifying!

Thank you for the feedback, and sorry for causing you distress! I genuinely did not consider that this choice could cause distress, though it could have occurred to me, and I apologize.

On how I came to think that it might be a good idea (as opposed to missing that it might be a bad idea): while there's math in this post, the point is really the philosophy rather than the math (whose role is just to help with thinking more clearly about the philosophy, e.g. to see that PBDT fails in the same way as NBDT on this example). The original counterfactual m...

In short, I don't think SUDT (or UDT) by itself solves the problem of counterfactual mugging. [...] Perhaps SUDT also needs to specify a rule for selecting utility functions (e.g. some sort of disinterested "veil of ignorance" on the decider's identity, or an equivalent ban on utilities which sneak in a selfish or self-interested term).

I'll first give an answer to a relatively literal reading of your comment, and then one to what IMO you are "really" getting at.

Answer to a literal reading: I believe that what you value is part of ...

Wei Dai
I brought up some related points elsewhere. At this point, I'm not totally sure that UDT solves counterfactual mugging correctly. The problem I see is that UDT is incompatible with selfishness. For example, if you make a copy of a UDT agent, then copy 1 and copy 2 will both care equally about copy 1 and copy 2, but if you make a copy of a typical selfish human, each copy will care more about itself than about the other copy. This kind of selfishness seems strongly related to intuitions for picking (H) over (T). Until we fully understand whether selfishness is right or wrong, and how it ought to be implemented or fixed (e.g., do we encode our current degrees of caring into a UDT utility function, or rewind our values to some past state, or use some other decision theory that has a concept of "self"?), it's hard to argue that UDT must be correct, especially in its handling of counterfactual mugging.
Thank you for a very comprehensive reply.

That's fine. However, normal utility functions do have self-interested components, as well as parochial components (caring about people and things that are "close" to us in various ways, above those which are more "distant"). It's also true that utilities are not totally determined by such components, and include some general pro bono terms; further, we think that in some sense utilities ought to be disinterested rather than selfish or parochial. Hence my thought that SUDT could be strengthened by barring selfish or parochial terms, or by imposing some sort of veil of ignorance so that only terms like u(+NotMe) and u(-NotMe) affect decisions.

Allowing for self-interest, in the counterfactual mugging scenario we most likely have u(+Me) >> u(+NotMe) > u(-NotMe) >> u(-Me), rather than u(+NotMe) = u(-NotMe). The decider will still be inclined to pick "H" (matching our initial intuition), but with some hesitation, particularly if Omega's coin was very heavily weighted to tails in the first place. The internal dialogue in that case will go something like this: "Hmm, it was so very unlikely that the coin fell heads; I can't believe that happened! Hmm, perhaps it didn't, and I'm in some sort of Omega-simulation. For the good of the world outside my simulation, I'd better pick T after all."

That's roughly where I am with my own reaction to Counterfactual Mugging right now. Against a background of modal realism or a many-worlds interpretation (which in my opinion is where UDT makes most sense), caring only about the good of "our" world looks like a sort of parochialism, which is why Counterfactual Mugging is interesting. Suddenly it seems to matter whether these other worlds exist or not, rather than their being just a philosophical curiosity.

It's priors over logical states of affairs. Consider the following sentence: "There is a cellular automaton that can be described in at most 10 KB in programming language X, plus a computable function f() which can be described in another 10 KB in the same programming language, such that f() returns a space/time location within the cellular automaton corresponding to Earth as we know it in early 2014." This could be false even if Tegmark IV is true, and prior probability (i.e., probability without trying to do an anthropic update of the form "I observe this, so it's probably simple") says it's probably false.

Thanks. But how can I even think the concept "corresponding to Earth as we know it" without relying on a large body of empirical knowledge that influences my probability assignments? I'm having trouble understanding what the prior is prior to. Of course I can refrain from explicitly calculating the K-complexity, say, of the theory in a physics textbook. But even without doing such a calculation, I still have some gut level sense of the simplicity/complexity of physics, very much based on my concrete experiences. Does that not count as anthropic?

To summarize that part of the post: (1) The view I'm discussing there argues that the reason we find ourselves in a simple-looking world is that all possible experiences are consciously experienced, including the ones where the world looks simple, and we just happen to experience the latter. (2) If this is correct, then you cannot use the fact that you look around and see a simple-looking world to infer that you live in a simple-looking world, because there are plenty of complex interventionistic worlds that look deceptively simple. In fact, the prior prob...

But most worlds aren't "complex worlds appearing simple", most worlds are just "complex worlds", right? So the fact that we find ourselves in a simple world should still enormously surprise us. And any theory that causes us to "naturally" expect simple worlds would seem to have an enormous advantage.

I don't feel like considering these different ways to approach K-complexity addresses the point I was trying to make. The rebuttal seems to be arguing that we should weigh the TMs that don't read the end of the tape equally, rather than giving more weight to TMs that read less of the tape. But my point isn't that I don't want to weigh complex TMs as much as simple TMs; it is (1) that I seem to be willing to consider TMs with one obviously disorderly event "pretty simple", even though I think they have high K-complexity; and (2) given this, the utility I ...

So, I can see that you would care in the same way as you would in a multiverse with magical reality fluid that's distributed in the same proportions as your measure of caring, and if your measure of caring is K-complexity with respect to a universal Turing machine (UTM) we would consider simple, it's at least one plausible possibility that the true magical reality fluid is distributed in roughly those proportions. But given the state of our confusion, I think that conditional on there being a true measure, any single hypothesis as to how that measure is dist...

Scott Garrabrant
Conditional on there being a true measure, I would think it is reasonably likely that that measure is 100% at one possible universe.

But you see Eliezer's comments because a conscious copy of Eliezer has been run.

A conscious copy of Eliezer that thought about what Eliezer would do when faced with that situation, not a conscious copy of Eliezer actually faced with that situation -- the latter Eliezer is still an l-zombie, if we live in a world with l-zombies.

Is Eliezer thinking about what he would do when faced with that situation not him running an extremely simplified simulation of himself? Obviously this simulation is not equivalent to the real Eliezer, but there's clearly something being run here, so it can't be an l-zombie.

For l-zombies to do anything they need to be run, whereupon they stop being l-zombies.

Omega doesn't necessarily need to run a conscious copy of Eliezer to be pretty sure that Eliezer would pay up in the counterfactual mugging; it could use other information about Eliezer, like Eliezer's comments on LW, the way that I just did. It should be possible to achieve pretty high confidence that way about what Eliezer-being-asked-about-a-counterfactual-mugging would do, even if that version of Eliezer should happen to be an l-zombie.

But you see Eliezer's comments because a conscious copy of Eliezer has been run. If I'm figuring out what output a program "would" give "if" it were run, in what sense am I not running it? Suppose I have a program MaybeZombie, and I run a Turing Test with it as the Testee and you as the Tester. Every time you send a question to MaybeZombie, I figure out what MaybeZombie would say if it were run, and send that response back to you. Can I get MaybeZombie to pass a Turing Test, without ever running it?

(Agree with Coscott's comment.)

I meant useful in the context of AI since any such sequence would obviously have to be non-computable and thus not something the AI (or person) could make pragmatic use of.

I was replying to this:

Ultimately, you can always collapse any computable sequence of computable theories (necessary for the AI to even manipulate) into a single computable theory so there was never any hope this kind of sequence could be useful.

I.e., I was talking about computable sequences of computable theories, not about non-computable ones.

Also, it is far from clear that

...

Actually, the 'proof' you gave that no true list of theories like this exists made the assumption (not listed in this paper) that the sequence of indexes for the computable theories is definable over arithmetic. In general there is no reason this must be true, but of course for the purposes of an AI it must.

("This paper" being Eliezer's writeup of the procrastination paradox.) That's true, thanks.

Ultimately, you can always collapse any computable sequence of computable theories (necessary for the AI to even manipulate) into a single computable theory so there was never any hope this kind of sequence could be useful.

...
I meant useful in the context of AI, since any such sequence would obviously have to be non-computable and thus not something the AI (or person) could make pragmatic use of. Also, it is far from clear that T_0 is the union of all theories (and this is the problem in the proof in the other write-up). It may well be that there is a sequence of theories like this, all true in the standard model of arithmetic, but that their construction requires that T_n add extra statements beyond the schema for the proof predicate in T_{n+1}. Also, the claim that T_n must be stronger than T_{n+1} (prove a superset of ... be computable we can't take all these theories to be complete) is far from obvious if you don't require that T_n be true in the standard model. If T_n is true in the standard model then, as it proves that Pf(T_{n+1}, \phi) -> \phi, this is true; so if T_{n+1} |- \phi, then (as this is witnessed by a finite proof) there is a proof from T_n that this holds, and thus a proof of \phi. However, without this assumption I don't even see how to prove the containment claim.
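Spelled out, the containment step described at the end of this comment runs roughly as follows. This is a sketch of that step only. Step 2 additionally assumes that T_n is strong enough to verify any concrete finite proof in T_{n+1} (for example, because it extends PA); that assumption is added here and is not stated in the comment.

```latex
\begin{align*}
&\text{1. } T_{n+1} \vdash \phi
  && \text{assumption}\\
&\text{2. } T_n \vdash \mathrm{Pf}(T_{n+1}, \phi)
  && \text{the proof in 1 is finite, so } T_n \text{ can verify it}\\
&\text{3. } T_n \vdash \mathrm{Pf}(T_{n+1}, \phi) \to \phi
  && \text{reflection schema in } T_n\\
&\text{4. } T_n \vdash \phi
  && \text{modus ponens on 2 and 3}
\end{align*}
```

Since \phi was arbitrary, every theorem of T_{n+1} is then a theorem of T_n.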

I'm hard-pressed to think of any more I could want from [the coco-value] (aside from easy extensions to bigger classes of games).

Invariance to affine transformations of players' utility functions. This solution requires that both players value outcomes in a common currency, plus the physical ability to transfer utility in this currency outside the game (unless there are two outcomes o_1 and o_2 of the game such that A(o_1) + B(o_1) = A(o_2) + B(o_2) = max_o (A(o) + B(o)), and such that A(o_1) >= A's coco-value >= A(o_2), in which case the players can...
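The condition in parentheses above can be checked mechanically. Below is a minimal sketch. It assumes a finite game given as a table of each player's utility per outcome, and it takes A's coco-value as an input number rather than computing it. The function names are hypothetical. The last lines illustrate the first point, that the construction is not invariant to affine transformations: multiplying A's utilities by 10 changes which outcome maximizes A + B.

```python
from typing import Dict, List

Utilities = Dict[str, float]  # outcome -> utility (illustrative representation)

def sum_maximizers(A: Utilities, B: Utilities) -> List[str]:
    """Outcomes o that attain max_o (A(o) + B(o)), sorted by name."""
    best = max(A[o] + B[o] for o in A)
    return sorted(o for o in A if A[o] + B[o] == best)

def reachable_without_transfer(A: Utilities, B: Utilities, coco_a: float) -> bool:
    """True if there are sum-maximizing outcomes o_1, o_2 with
    A(o_1) >= coco_a >= A(o_2); o_1 and o_2 may coincide."""
    ms = sum_maximizers(A, B)
    return any(A[o] >= coco_a for o in ms) and any(A[o] <= coco_a for o in ms)

A = {"x": 4.0, "y": 0.0, "z": 1.0}
B = {"x": 0.0, "y": 4.0, "z": 1.0}
print(sum_maximizers(A, B))                   # ['x', 'y']
print(reachable_without_transfer(A, B, 2.0))  # True

# Rescaling one player's utility changes the sum-maximizing outcome:
A2 = {"x": 1.0, "y": 3.0}
B2 = {"x": 5.0, "y": 2.0}
print(sum_maximizers(A2, B2))                                  # ['x']
print(sum_maximizers({o: 10 * u for o, u in A2.items()}, B2))  # ['y']
```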

Invariance under the same affine transformation of both players' utility functions, or under independent transformations?
This is done by the transfer function between the players, since if I redefine my utility to be 10 times its previous value, then it takes only one of your utility to give me 10, and 10 of my utility to give you one. Now, of course, you want to lie about the transfer function instead of your utility; "no, I don't like dollars you've given me as much as dollars I've earned myself."
Oh, I definitely agree. I meant it's hard to hope for anything more inside environments with transferable/quasilinear utility. It's a big assumption, but I've resigned myself to it somewhat since we need it for most of the positive results in mechanism design. What you say is true but seems entirely irrelevant to the question what the superrational outcome in an asymmetric game should be.

I think the point is that in PD symmetry+precommitment => cooperation, and asymmetry + precommitment = symmetry (this is the "trivial fix"), so asymmetry + precommitment => cooperation.
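Below is a minimal sketch of the symmetric half of this argument. It assumes standard Prisoner's Dilemma payoffs (T=5, R=3, P=1, S=0); these numbers are an assumption for illustration, not taken from the thread. If both players are committed to the same decision procedure, only the diagonal outcomes (C, C) and (D, D) are reachable. Picking the best diagonal outcome then yields cooperation, even though D is the best response to either fixed move of the other player.

```python
# Row player's payoff for (my_move, their_move); the game is symmetric.
# Payoff values are assumed standard PD numbers, not from the source.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
MOVES = ("C", "D")

def best_unilateral(their_move: str) -> str:
    """Best response when the other player's move is held fixed."""
    return max(MOVES, key=lambda m: PAYOFF[(m, their_move)])

def best_symmetric() -> str:
    """Best move when both players are known to output the same move."""
    return max(MOVES, key=lambda m: PAYOFF[(m, m)])

print(best_unilateral("C"), best_unilateral("D"))  # D D
print(best_symmetric())                            # C
```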

Retracted my comment for being unhelpful (I don't recognize what I said in what you heard, so I'm clearly not managing to explain myself here).

Thanks for trying, anyway :)

Agree with Nisan's intuition, though I also agree with Wei Dai's position that we shouldn't feel sure that Bayesian probability is the right way to handle logical uncertainty. To more directly answer the question of what it means to assign a probability to the twin prime conjecture: if Omega reveals to you that you live in a simulation, and it offers you a choice between (a) Omega throws a bent coin which has probability p of landing heads, and shuts down the simulation if it lands tails, otherwise keeps running it forever; and (b) Omega changes the code of t...

[This comment is no longer endorsed by its author]
Not from your example, I do not. I suspect that if you remove this local Omega meme, you are saying that there are many different possible worlds in your inner simulator and in p*100% of them the conjecture ends up being proven... some day before that world ends. Unless you are a Platonist and assign mathematical "truths" independent immaterial existence.

I'm not saying we'll take the genome and read it to figure out how the brain does what it does; I'm saying that we run a brain simulation, do science (experiments) on it and study how it works, similarly to how we study how DNA transcription or ATP production or muscle contraction or a neuron's ion pumps or the Krebs cycle or honeybee communication or hormone release or cell division or the immune system or chick begging or the heart's pacemaker work. There are a lot of things evolution hasn't obfuscated so much that we haven't been able to figure out what they're doing. Of course there are also a lot of things we don't understand yet, but I don't see how that leads to the conclusion that evolution is generally obfuscatory.

I guess it tends to create physical structures that are simple, but I think the computational stuff tends to be weird. If you have a strand of DNA, the only way to tell what kind of chemistry that will result in is to run it. From what little I've heard, it sounds like any sort of program made by a genetic algorithm that can actually run is too crazy to understand. For example, I've heard of a set of transistors hooked together to be able to tell "yes" and "no" apart, or something like that. There were transistors that were just draining energy, but were vital. Running it on another set of transistors wouldn't work. It required the exact specs of those transistors. That being said, the sort of sources I hear that from are also the kind that say ridiculous things about quantum physics, so I guess I'll need an expert to tell me if that's true. Has anyone here studied evolved computers?

Saying that all civilizations able to create strong AI will reliably be wise enough to avoid creating strong AI seems like a really strong statement, without any particular reason to be true. By analogy, if you replace civilizations by individual research teams, would it be safe to rely on each team capable of creating uFAI to realize the dangers of doing so and therefore to refrain from doing so, so that we can safely take a much longer time to figure out FAI? Even if it were the case that most teams capable of creating uFAI hold back like this, one single rogue team may be enough to destroy the world, and it just seems really likely that there will be some not-so-wise people in any large enough group.


Good points.

evolution hit on some necessary extraordinarily unlikely combination to give us intelligence and for P vs NP reasons we can't find it

For this one, you also need to explain why we can't reverse-engineer it from the human brain.

no civilization smart enough to create strong AI is stupid enough to create strong AI

This seems particularly unlikely in several ways; I'll skip the most obvious one, but also it seems unlikely that humans are "safe" in that they don't create a FOOMing AI but it wouldn't be possible even with much thought...

It was designed by evolution. Say what you will about the blind idiot god, but it's really good at obfuscation. We could copy a human brain, and maybe even make some minor improvements, but there is no way we could ever hope to understand it.
What was the most obvious one?
"Reverse-engineer" is an almost perfect metaphor for "solve an NP problem."

Combining your ideas together -- our overlord actually is a Safe AI created by humans.

How it happened:

Humans became aware of the risks of intelligence explosions. Because they were not sure they could create a Friendly AI in the first attempt, and creating an Unfriendly AI would be too risky, instead they decided to first create a Safe AI. The Safe AI was planned to become a hundred times smarter than humans but not any smarter, answer some questions, and then turn itself off completely; and it had a mathematically proved safety mechanism to prevent it fro...

I would agree with your reasoning if CFAR claimed that they can reliably turn people into altruists free of cognitive biases within the span of their four-day workshop. If they claimed that and were correct in that, then it shouldn't matter whether they (a) require up-front payment and offer a refund or (b) have people decide what to pay after the workshop, since a bias-free altruist would end up paying the same in either case. There would only be a difference if CFAR didn't achieve what, in this counterfactual scenario, it claimed to achieve, so they...

It's not so much what CFAR is claiming as what their goals are and which outcomes they prefer. The goal is to create people who are effective, rational do-gooders. I see four main possibilities here: First, that they succeed in doing so. Second, that they fail and go out of business. Third, that they become a sort of self-help cult like the Landmark Forum, i.e. they charge people money without delivering much benefit. Fourth, they become a sort of fraternal organization, i.e. membership does bring benefits mainly from being able to network with other members. Obviously (1) is the top choice. But if (1) does not occur, which would they prefer -- (2), or some combination of (3) and (4)? By charging money up front, they are on the path to (3) or (4) as a second choice. Which goes against their stated goal. So let's assume that they do not claim to be able to turn people into effective rational do-gooders. The fact remains that they hope to do so. And one needs to ask, what do they hope for as a second choice?

Yep: CFAR advertised their fundraiser in their latest newsletter, which I received on Dec. 5.

The only scenario I can see where this would make sense is if SIAI expects small donors to donate less than $(1/2)N in a dollar-for-dollar scheme, so that its total gain from the fundraiser would be below $(3/2)N, but expects to get the full $(3/2)N in a two-dollars-for-every-dollar scheme. But not only does this seem like a very unlikely story [...]

One year later, the roaring success of MIRI's Winter 2013 Matching Challenge, which is offering 3:1 matching for new large donors (people donating >= $5K who have donated less than $5K in total in the pas...

Yes, a real-life reasoner would have to use probabilistic reasoning to carry out these sorts of inference. We do not yet have a real understanding of how to do probabilistic reasoning about logical statements, although there has been a bit of work on it in the past. This is one topic MIRI is currently doing research on. In the meantime, we also examine problems of self-reference in ordinary deductive logic, since we understand it very well. It's not certain that the results there will carry over in any way into the probabilistic setting, and it'...

I'll accept that doing everything probabilistically is expensive, but I really don't see how it wouldn't solve the problem to at least assign probabilities to imported statements. The more elements in the chain of trust, the weaker it is. Eventually, someone needs it reliably enough that it becomes necessary to check it. And of course any chain of trust like that ought to have a system for providing proof upon demand, which will be invoked roughly every N steps of trust. The recipients of the proof would then become nodes of authority on the issue. This seems rather how actual people operate (though we often skip the 'where to get proof of this' step), and so any proof that it will become unworkable has a bit of a steep hill to climb.
I see. Thanks for the link.

There is a way to write a predicate Proves(p,f) in the language of PA which is true if f is the Gödel number of a formula and p is the Gödel number of a proof of that formula from the axioms of PA. You can then define a predicate Provable(f) := exists p. Proves(p,f); then Provable(f) says that f is the Gödel number of a provable formula. Writing "A" for the Gödel number of the formula A, we can then write

PA |- Provable("A")

to say that there's a proof that A is provable, and

PA |- Provable("Provable("A")")

to say that t...
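The structure of Proves and Provable can be illustrated on a much smaller formal system than PA. The sketch below uses Hofstadter's MIU string-rewriting system as a stand-in. This is an assumption made for illustration: MIU is not PA, and no Gödel numbering is involved. Checking whether a given p is a proof of f is a simple mechanical check, like Proves(p,f). Provable(f) := exists p. Proves(p,f), by contrast, is an unbounded search over proofs. The helper names are hypothetical. `find_proof` cuts the search off after `max_steps` rule applications, so a `None` result does not show that f is unprovable.

```python
from collections import deque
from typing import List, Optional, Set

AXIOM = "MI"  # the single axiom of the toy MIU system

def successors(s: str) -> Set[str]:
    """All strings derivable from s by one MIU rule."""
    out: Set[str] = set()
    if s.endswith("I"):                 # rule 1: xI -> xIU
        out.add(s + "U")
    if s.startswith("M"):               # rule 2: Mx -> Mxx
        out.add("M" + s[1:] * 2)
    for i in range(len(s) - 2):         # rule 3: III -> U
        if s[i:i + 3] == "III":
            out.add(s[:i] + "U" + s[i + 3:])
    for i in range(len(s) - 1):         # rule 4: UU -> (deleted)
        if s[i:i + 2] == "UU":
            out.add(s[:i] + s[i + 2:])
    return out

def proves(p: List[str], f: str) -> bool:
    """Decidable check, analogous to Proves(p, f)."""
    if not p or p[0] != AXIOM or p[-1] != f:
        return False
    return all(nxt in successors(prev) for prev, nxt in zip(p, p[1:]))

def find_proof(f: str, max_steps: int) -> Optional[List[str]]:
    """Bounded breadth-first search for a proof, a truncated version of Provable(f)."""
    queue = deque([[AXIOM]])
    seen = {AXIOM}
    while queue:
        path = queue.popleft()
        if path[-1] == f:
            return path
        if len(path) - 1 >= max_steps:
            continue
        for nxt in sorted(successors(path[-1])):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_proof("MUI", 3))  # ['MI', 'MII', 'MIIII', 'MUI']
print(find_proof("MU", 5))   # None ("MU" is not a theorem of MIU)
```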

Let me see if I can put that in my own words; if not, I didn't understand it. You are saying that humans, who do not operate strictly by PA, know that a proof of the existence of a proof is itself a proof; but a reasoner strictly limited to PA would not know any such thing, because it's not a theorem of PA. (PA being just an example - it could be any formal system, or at least any formal system that doesn't include the concept of proofs among its atoms, or concepts.) So such a reasoner can be shown a proof that a proof of A exists, but will not know that A is therefore a theorem of PA. Correct? To me this seems more like a point about limitations of PA than about AI or logic per se; my conclusion would be "therefore, any serious AI needs a formal system with more oomph than PA". Is this a case of looking at PA "because that's where the light is", ie it's easy to reason about; or is there a case that solving such problems can inform reasoning about more realistic systems?

An example of this: CFAR has published some results on an experiment where they tried to see if they could improve people's probability estimates by asking them how surprised they'd be by the truth about some question turning out one way or another. They expected it would, but it turned out it didn't. And that doesn't surprise me. If imagined feelings of surprise contained some information naive probability-estimation methods didn't, why wouldn't we have evolved to tap that information automatically?

Because so few of our ancestors died because they got nume...

Ah, you're right. Will edit post to reflect that.

Mark, have you read Eliezer's article about the Löbian obstacle, and what was your reaction to it?

I'm in the early stages of writing up my own work on the Löbian obstacle for publication, which will need to include its own (more condensed, rather than expanded) exposition of the Löbian obstacle; but I liked Eliezer's article, so it would be helpful to know why you didn't think it argued the point well enough.

I have, although formal logic is not my field, so please excuse me if I have misunderstood it. Eliezer does not demonstrate that overcoming the Löbian obstacle is necessary in the construction of tiling agents; he rather assumes it. No form of program verification is actually required if you do not use the structure of a logical agent. Consider, for example, the GOLUM architecture[1], which is a form of tiling agent that proceeds by direct experimentation (simulation). It does not require an ability to prove logical facts about the soundness and behavior of its offspring, just an ability to run them in simulation. Of course logical program analysis helps in focusing on the situations which give rise to differing behavior between the two programs, but there are no Gödelian difficulties there (even if there were, you could fall back on probabilistic sampling of environments, searching for setups which trigger different results).

The MIRI argument, as I understand it, is: “a program which tried to predict the result of modifying itself runs into a Löbian obstacle; we need to overcome the Löbian obstacle to create self-modifying programs with steadfast goal systems.” (I hope I am not constructing a strawman in simplifying it as such.) The problem comes from the implicit assumption that the self-modifying agent will use methods of formal logic to reason about the future actions of its modified self. This need not be the case! There are other methods which work well in practice, converge on stable solutions under the right circumstances, and have been well explored in theory and in practice.

I'm reminded of the apocryphal story of two space-age engineers who meet after the fall of the Soviet Union. The American, who oversaw a $1.5 million programme to develop the “Astronaut Pen” which would write in hard vacuum and microgravity environments, was curious to know how his Russian counterpart solved the same problem. “Simple,” he replied, “we used a pencil.” You could ex

Don't worry, I wasn't offended :)

Good to hear, and thanks for the reassurance :-) And yeah, I know all too well the problem of having too little time to write something polished, and I certainly prefer having the discussion in fairly raw form to not having it at all.

One possibility is that MIRI's arguments actually do look that terrible to you

What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.

I'm not really sure what you mean by a "concrete instantiation". I c...

I don't have time to reply to all of this right now, but since you explicitly requested a reply to: The answer is yes, I think this is essentially right although I would probably want to add some hedges to my version of the statement (and of course the usual hedge that our intuitions probably conflict at multiple points but that this is probably the major one and I'm happy to focus in on it).

Since the PSM was designed without self-modification in mind, "safe but unable to improve itself in effective ways".

(Not sure how this thought experiment helps the discussion along.)

Can you please elaborate? Suppose that in the recesses of the code there is an instantiation of the bubble sort algorithm. The planner proposes to replace it with, say, merge sort. Do you think that the PSM would generally disapprove of such a change? Do you think it would approve it, but would still be unable to approve the modifications needed for significant improvement?

MIRI's stated goals are similar to those of mainstream AI research, and MIRI's approach in particular includes as subgoals the goals of research fields such as model checking and automated theorem proving.

It's definitely not a goal of mainstream AI, and not even a goal of most AGI researchers, to create self-modifying AI that provably preserves its goals. MIRI's work on this topic doesn't seem relevant to what mainstream AI researchers want to achieve.

Zooming out from MIRI's technical work to MIRI's general mission, it's certainly true that MIRI's failure t...

I thought the example was pretty terrible.

Glad to see you're doing well, Benja :)

Sorry for being curmudgeonly there -- I did afterwards wish that I had tempered that. The thing is that when you write something like

I also agree that the idea of "logical uncertainty" is very interesting. I spend much of my time as a grad student working on problems that could be construed as versions of logical uncertainty.

that sounds to me like you're painting MIRI as working on these topics just because it's fun, and supporting its work by arguments tha...

Thought experiment: We have an AI which controls a robotic arm inside a box. The box also contains various cooking tools and ingredients to make cake, and a big red button that kills a puppy. We prefer cake to no cake, and we prefer our cake to be delicious and moist cake, but above all we prefer the companion cu-... puppy to stay alive. Therefore, we implement in the AI a "puppy safety module" (PSM) which vetoes any course of action proposed by the planning module if it determines that there is any non-negligible probability of the red button being depressed. When the PSM can't make an accurate prediction, it always errs on the safe side: things like throwing tools against the box walls, where they can bounce unpredictably and hit the red button, don't get approved. Athena, the Greek goddess of wisdom, justice and math, came down from Mt. Olympus and debugged the code for the PSM. She was also kind enough to check all the compilers, OS and hardware for bugs and glitches. Do you agree that with this AI the puppy is most likely safe?

Now, let's assume that the cakes made by this AI aren't very good. We don't really know how to program it to make better cakes; we try our best reinforcement learning algorithms, but it doesn't improve much. Then somebody suggests that we allow the AI to modify its own code: in addition to sending motor commands to the arm, the planning module can also edit a copy of the AI's code and eventually replace the whole program with this successor. Of course, these actions must still be approved by the PSM, but they are otherwise unrestricted: in principle they can modify the PSM, add a bypass, or get rid of it entirely. Do you think that this AI is unsafe for the puppy? Or do you think it is safe but unable to improve itself in effective ways?
MIRI claims to have had an important insight on AI design (the so-called "Löbian obstacle") that experts in relevant fields (AI, model checking, automated theorem proving, etc.) didn't have. MIRI attempted to communicate this insight, but so far the experts have mostly ignored MIRI's claims or denied that they are likely to be important and relevant. What is the most likely explanation for that? It seems that we can narrow it down to two hypotheses:

A) MIRI's insight is really relevant and important to AI design, but communication with the experts failed because of some problem on MIRI's side, on the experts' side (e.g. stubbornness, stupidity), or both (e.g. excessively different backgrounds).

B) MIRI is mistaken about the value of its insight (possible psychological causes may include confirmation bias, the Dunning–Kruger effect, groupthink, overconfident personalities, etc.).

I would say that, barring evidence to the contrary, hypothesis B is the most likely explanation.
I look forward to a clear, detailed explanation of MIRI's thinking on this subject. In particular this counter-intuitive result: deserves some technical elaboration.
Don't worry, I wasn't offended :) I don't think that MIRI is working on these topics just because they are fun, and I apologize for implying that. I should note here that I respect the work that you and Paul have done, and as I said at the beginning, I was somewhat hesitant to start this discussion at all, because I was worried that it would have a negative impact on either your or Paul's reputation (regardless of whether my criticisms ended up being justified) or on our relationship. But in the end I decided that it was better to raise my objections in fairly raw form and deal with any damage later.

What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.

I'm not sure I understand what you're saying here, but I'm not convinced that this is the sort of reasoning I'd use. It seems like Paul's argument is similar to yours, though, and I'm going to talk to him in person in a few days, so perhaps the most efficient thing will be for me to talk to him and then report back.

I don't think that "whole brain emulations can safely self-modify" is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: "The argument in favor of studying Löb's theorem is very abstract, and it is fairly unintuitive that human reasoning should run into that obstacle. Standard epistemic hygiene calls for trying to produce concrete examples to motivate this line of work. I have not seen this done by MIRI, and all of the examples I can think of, both from having done AI and verification work myself and from looking at what my colleagues do in program analysis, point squarely in the opposite direction."

When I say "failure to understand the surrounding literature", I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I d

Jacob, have you seen Luke's interview with me, where I've tried to reply to some arguments of the sort you've given in this thread and elsewhere?

I don't think [the fact that humans' predictions about themselves and each other often fail] is sufficient to dismiss my example. Whether or not we prove things, we certainly have some way of reasoning at least somewhat reliably about how we and others will behave. It seems important to ask why we expect AI to be fundamentally different; I don't think that drawing a distinction between heuristics and logical pro...
Glad to see you're doing well, Benja :)

Here's a concrete way you could try to get stable self-modification. Suppose for concreteness that we have a C program, call it X, and that within the C program there is an array called "worldState" of length M and a floating point number called "utility". A simple instantiation of X would look something like:

while(true){ action = chooseAction(worldState); worldState = propagateWorldState(worldState, action); utility = calculateUtility(worldState); }

We would like to consider modifications to X where we replace chooseAction with some new method chooseAction2 to get a program X2. Suppose we want to ensure some condition such as: from the current world state, if we use X2 instead of X, then after some finite period of time the sequence of utilities we get from using chooseAction2 will always be larger than the corresponding sequence if we had used chooseAction. Abusing notation a bit, this is the same as verifying the statement "there exists N such that for all n > N, utility2(n) > utility(n)" [although note that utility2 and utility have fairly complicated descriptions if you actually try to write them out].

Now I agree that reasoning about this for arbitrary choices of chooseAction and chooseAction2 will be quite difficult (probably undecidable, although I haven't proved that). But the key point is that I get to choose chooseAction2, and there are many decision procedures that can prove such a statement in special cases. For instance, I could partition the space of world states into finitely many pieces and write down a transition function that over-approximates the possible transitions (for instance, by having a transition from Piece1 to Piece2 if any element of Piece1 can transition to any element of Piece2). Then I only need to reason about finite automata, and those are trivially decidable. You could argue that this proof system is fairly weak, but again, the AI gets to tailor its choices of chooseAction2 to be easy

Things that result in fewer resources going into AI specifically would result in fewer UFAI resources without reducing overall economic growth, but it needs to be kept in mind that some such research occurs in financial firms pushing trading algorithms, and a lot more at Google, not just in places like universities.

To the extent that industry researchers publish less than academia (this seems particularly likely in financial firms, and to a lesser degree at Google), a hypothetical complete shutdown of academic AI research should reduce uFAI's paralleliz...

I'd definitely be interested to talk more about many of these, especially anthropics and reduced impact / Oracle AI, and potentially collaborate. Lots of topics for future Oxford visits! :-)

Hope you'll get interest from others as well.

Yep, we'll have a lot to talk about!

Sorry for the long-delayed reply, Wei!

So you think that humans do not have a built-in solution to the Löbstacle, and you must also think we are capable of building an FAI that does have a built-in solution to the Löbstacle. That means an intelligence without a solution to the Löbstacle can produce another intelligence that shares its values and does have a solution to the Löbstacle.

But then why is it necessary for us to solve this problem? [...] Why can't we instead build an FAI without solving this problem, and depend on the FAI to solve the pro...

Drats. But also, yay, information! Thanks for trying this!

ETA: Worth noting that I found that post useful, though.

Glad to hear that & looking forward to seeing how it works! I very much understand that one might be concerned about posting "quick and dirty" thoughts (I find it so very difficult to lower my own standards even when it's obviously blocking me from getting stuff done), but there seems to be little cost to trying it with a Discussion post and seeing how it goes -- yay value of information! :-)

The experiment seems to have failed.

For future readers: The discussion has continued here.

Note that you're wrongly discouraging people from doing strategy research by saying that they need to catch up to insiders' unpublished knowledge when they really don't.

What makes you say that? I believe you can reinvent much of what Eliezer and Carl and Bostrom and a few others already know but haven't written down. Not sure that's true for almost everyone else.

I read the idea as being that people rediscovering and writing up stuff that goes 5% towards what E/C/N have already figured out but haven't written down would be a net positive and it's...

One way to accelerate the production of strategy exposition is to lower one's standards. It's much easier to sketch one's quick thoughts on an issue than it is to write a well-organized, clearly-expressed, well-referenced, reader-tested analysis (like When Will AI Be Created?), and this is often enough to provoke some productive debate (at least on Less Wrong). See e.g. Reply to Holden on Tool AI and Do Earths with slower economic growth have a better chance at FAI?. So, in the next few days I'll post my "quick and dirty" thoughts on one strategic issue (IA and FAI) to LW Discussion, and see what comes of it.

You should frequently change your passwords, use strong passwords, and not use the same password for multiple services (otherwise every service that shares the password becomes a point of failure through which all of those accounts can be compromised). It's not easy to live up to this in practice, but there are approximations that are much easier:

  • Using a password manager is better than using the same password for lots of services. Clipperz is a web service that does the encryption on your computer (so your passwords never get sent to the server), and can be inst...
You can generate a very strong passphrase with Diceware. Physical dice are more secure than almost any electronic device, and dictionary words let you memorize the randomness very efficiently. This can then be used with KeePass or some other password manager. Also useful for brainwallets and other kinds of data where offline attacks are likely.
I like the approach of password recipes to have a unique password for each service without needing to memorize very much.
... or you can just store your KeePass database in Google Drive.

As a pedestrian or cyclist, you're not all that easy to see from a car at night, worse if you don't wear white. High-visibility vests (that thing that construction workers wear, yellow or orange with reflective stripes) fix the problem and cost around $7-$8 from Amazon including shipping, or £3 in the UK.

Less than £2 on eBay. I bought mine for 99p including postage, but I can't find any for that price now.