All of Eliezer Yudkowsky's Comments + Replies

Rationalism before the Sequences

Just jaunt superquantumly to another quantum world instead of superluminally to an unobservable galaxy.  What about these two physically impossible counterfactuals is less than perfectly isomorphic?  Except for some mere ease of false-to-fact visualization inside a human imagination that finds it easier to track nonexistent imaginary Newtonian billiard balls than existent quantum clouds of amplitude, with the latter case, in reality, covering both unobservable galaxies distant in space and unobservable galaxies distant in phase space.

Rationalism before the Sequences

I reiterate the galaxy example; saying that you could counterfactually make an observation by violating physical law is not the same as saying that something's meaning cashes out to anticipated experiences.  Consider the (exact) analogy between believing that galaxies exist after they go over the horizon, and that other quantum worlds go on existing after we decohere them away from us by observing ourselves being inside only one of them.  Predictivism is exactly the sort of ground on which some people have tried to claim that MWI isn't meaningful... (read more)

1Eric Raymond2dIt seems to me that you've been taking your model of predictivism from people who need to read some Kripke. In Peirce's predictivism, to assert that a statement is meaningful is precisely to assert that you have a truth condition for it, but that doesn't mean you necessarily have the capability to test the condition. Consider Russell's teapot. "A teapot orbits between Earth and Mars" is a truth claim that must unambiguously have a true or false value. There is a truth condition on it; if you build sufficiently powerful telescopes and perform a whole-sky survey you will find it. It would be entirely silly to claim that the claim is meaningless because the telescopes don't exist. The claim "Galaxies continue to exist when they exit our light-cone" has exactly the same status. The fact that you happen to believe the right sort of telescope not only does not exist but cannot exist is irrelevant - you could after all be mistaken in believing that sort of observation is impossible. I think it is quite likely you are mistaken, as nonlocal realism seems the most likely escape from the bind Bell's inequalities put us in. MWI presents a subtler problem, unlike Russell's Teapot, because we haven't the faintest idea what observing another quantum world would be like. In the case of the overly-distant galaxies, I can sketch a test condition for the claim that involves taking a superluminal jaunt 13 billion light-years thataway and checking all around me to see if the distribution of galaxies has a huge NOT THERE on the side away from Earth. I think a predictivist would be right to ask that you supply an analogous counterfactual before the claim "other quantum worlds exist" can be said to have a meaning.
Rationalism before the Sequences

One minor note is that, among the reasons I haven't looked especially hard into the origins of "verificationism"(?) as a theory of meaning, is that I do in fact - as I understand it - explicitly deny this theory.  The meaning of a statement is not the future experimental predictions that it brings about, nor isomorphic up to those predictions; all meaning about the causal universe derives from causal interactions with us, but you can have meaningful statements with no experimental consequences, for example:  "Galaxies continue to exist after the ... (read more)

7Eric Raymond9d"Galaxies continue to exist after the expanding universe carries them over the horizon of observation from us" trivially unpacks to "If we had methods to make observations outside our light cone, we would pick up the signatures of galaxies after the expanding universe has carried them over the horizon of observation from us defined by c." You say "Any meaningful belief has a truth-condition". This is exactly Peirce's 1878 insight about the meaning of truth claims, expressed in slightly different language - after all, your "truth-condition" unpacks to a bundle of observables, does it not? The standard term of art you are missing when you say "verificationist" is "predictivist". I can grasp no way in which you are not a predictivist other than terminological quibbles, Eliezer. You can refute me by uttering a claim that you consider meaningful, e.g. having a "truth-condition", where the truth condition does not implicitly cash out as hypothetical-future observables - or, in your personal terminology, "anticipated experiences". Amusingly, your "anticipated experiences" terminology is actually closer to the language of Peirce 1878 than the way I would normally express it, which is influenced by later philosophers in the predictivist line, notably Reichenbach.
AI and the Probability of Conflict

My point is that plausible scenarios for Aligned AGI give you AGI that remains aligned only when run within power bounds, and this seems to me like one of the largest facts affecting the outcome of arms-race dynamics.

3tonyoconnor13dThanks for the clarification. If that's the plausible scenario for Aligned AGI, then I was drawing a sharper line between Aligned and Unaligned than was warranted. I will edit some part of the text on my website to reflect that.
AI and the Probability of Conflict

This all assumes that AGI does whatever its supposed operator wants it to do, and that other parties believe as much?  I think the first part of this is very false, though the second part alas seems very realistic, so I think this misses the key thing that makes an AGI arms race lethal.

I expect that a dignified apocalypse looks like, "We could do limited things with this software and hope to not destroy the world, but as we ramp up the power and iterate the for-loops more times, the probability of destroying the world goes up along a logistic curve." ... (read more)

5tonyoconnor15dThanks for your comment. If someone wants to estimate the overall existential risk attached to AGI, then it seems fitting that they would estimate the existential risk attached to the scenarios where we have 1) only unaligned AGI, 2) only aligned AGI, or 3) both. The scenario you portray is a subset of 1). I find it plausible. But most relevant discussion on this forum is devoted to 1) so I wanted to think about 2). If some non-zero probability is attached to 2), that should be a useful exercise. I thought it was clear I was referring to Aligned AGI in the intro and the section heading. And of course, exploring a scenario doesn't mean I think it is the only scenario that could materialise.
Disentangling Corrigibility: 2015-2021

Thank you very much!  It seems worth distinguishing the concept invention from the name brainstorming, in a case like this one, but I now agree that Rob Miles invented the word itself.

The technical term corrigibility, coined by Robert Miles, was introduced to the AGI safety/alignment community in the 2015 MIRI/FHI paper titled Corrigibility.

Eg I'd suggest that to avoid confusion this kind of language should be something like "The technical term corrigibility, a name suggested by Robert Miles to denote concepts previously discussed at MIRI, was introduced..." &c.

1Koen.Holtman14dThanks a lot all! I just edited the post above to change the language as suggested. FWIW, Paul's post on corrigibility here [] was my primary source for the info that Robert Miles named the technical term. Nice to see the original suggestion as made on Facebook too.
6Ben Pace16dYou're welcome. Yeah "invented the concept" and "named the concept" are different (and both important!).
How do we prepare for final crunch time?

Seems rather obvious to me that the sort of person who is like, "Oh, well, we can't possibly work on this until later" will, come Later, be like, "Oh, well, it's too late to start doing basic research now, we'll have to work with whatever basic strategies we came up with already."


Seems true, but also didn't seem to be what this post was about?

Disentangling Corrigibility: 2015-2021

Why do you think the term "corrigibility" was coined by Robert Miles?  My autobiographical memory tends to be worryingly fallible, but I remember coining this term myself after some brainstorming (possibly at a MIRI workshop).  This is a kind of thing that I usually try to avoid enforcing because it would look bad if all of the concepts that I did in fact invent were being cited as traceable to me - the truth about how much of this field I invented does not look good for the field or for humanity's prospects - but outright errors of this sort sho... (read more)

1Koen.Holtman14dI wrote that paper [] and abstract back in 2019. Just re-read the abstract. I am somewhat puzzled how you can read the abstract and feel that it makes 'very large claims' that would be 'very surprising' when fulfilled. I don't feel that the claims are that large or hard to believe. Feel free to tell me more when you have read the paper. My more recent papers make somewhat similar claims about corrigibility results, but they use more accessible math.
3Robert Miles15dYeah I definitely wouldn't say I 'coined' it, I just suggested the name
6Ben Pace16dI'm 94% confident it came from a Facebook thread where you blegged for help naming the concept and Rob suggested it. I'll have a look now to find it and report back. Edit: having a hard time finding it, though note that Paul repeats the claim at the top of his post [] on corrigibility in 2017.
Logan Strohl on exercise norms

Lots of people work for their privileges!  I practiced writing for a LONG time - and remain continuously aware that other people cannot be expected to express their ideas clearly, even assuming their ideas to be clear, because I have Writing Privilege and they do not.  Does my Writing Privilege have an innate component?  Of course it does; my birth lottery placed me in a highly literate household full of actually good books, which combined with genuine genetic talent got me a 670 Verbal score on the pre-restandardized SAT at age eleven; but ... (read more)

My research methodology

Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.

Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn't fall over, which is not the situation we are currently in.

My research methodology

But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).  Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time.  Say, if you imagine somebody at Deepmind coming in without a lot of... (read more)

  • I still feel fine about what I said, but that's two people finding it confusing (and thinking it is misleading) so I just changed it to something that is somewhat less contentful but hopefully clearer and less misleading.
  • Clarifying what I mean by way of analogy: suppose I'm worried about unzipping a malicious file causing my computer to start logging all my keystrokes and sending them to a remote server. I'd say that seems like a strange and extreme failure mode that you should be able to robustly avoid if we write our code right, regardless of how the log
... (read more)
8SDM18dIs a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished. I'm not saying this is an exact analogy for AGI alignment - there are lots of specific technical reasons to expect that alignment is not like bridge building and that there are reasons why the approaches we're likely to try will break on us suddenly in ways we can't fix as we go - treacherous turns, inner misalignment or reactions to distributional shift. It's just that there are different answers to the question of what's the default outcome depending on if you're asking what to expect abstractly or in the context of how things are in fact done. Instrumental Convergence plus a specific potential failure mode (like e.g. we won't pay sufficient attention to out of distribution robustness), is like saying 'you know the vast majority of physically possible bridge designs fall over straight away and also there's a giant crack in that load-bearing concrete pillar over there' - if for some reason your colleague has a mental block around the idea that a bridge could in principle fall down then the first part is needed (hence why IC is important for presentations of AGI risk because lots of people have crazily wrong intuitions about the nature of AI or intelligence), but otherwise IC doesn't do much to help the case for expecting catastrophic misalignment and isn't enough to establish that failure is a default outcome. It seems like your reason for saying that catastrophic misalignment can't be considered an abnormal or extreme failure mode comes down to this pre-technical-detail Instrumental Convergence thesis - that IC by itself gives us a significant reason to worry, even if we all agree that IC is not the whole story. = 'because strongly optimizing for almost
Why those who care about catastrophic and existential risk should care about autonomous weapons

To answer your research question, in much the same way that in computer security any non-understood behavior of the system which violates our beliefs about how it's supposed to work is a "bug" and very likely en route to an exploit - in the same way that OpenBSD treats every crash as a security problem, because the system is not supposed to crash and therefore any crash proves that our beliefs about the system are false and therefore our beliefs about its security may also be false because its behavior is not known - in AI safety, you would expect system s... (read more)

2jimrandomh23dI started a reply to this comment and it turned into this shortform post [] .
Strong Evidence is Common

Corollary: most beliefs worth having are extreme.

9Davidmanheim1mo"Worth having" is a separate argument about relative value of new information. It is reasonable when markets exist or we are competing in other ways where we can exploit our relative advantage. But there's a different mistake that is possible which I want to note. Most extreme beliefs are false; for every correct belief, there are many, many extreme beliefs that are false. Strong consensus on some belief is (evidence for the existence of) strong evidence of the truth of that belief, at least among the considered alternatives. So picking a belief on the basis of extremity ("Most sheeple think X, so consider Y") is doing this the wrong way around, because extremity alone is negligible evidence of value. (Prosecutor's fallacy.) What makes the claim that extremity isn't a useful indicator of value less valid? That is, where should we think that extreme beliefs should even be considered? I think the answer is when the evidence is both novel and cumulatively outweighs the prior consensus, or the belief is new / previously unconsidered. ("We went to the moon to inspect the landing site," not "we watched the same video again and it's clearly fake.") So we should only consider extreme beliefs, even on the basis of our seemingly overwhelming evidence, if the proposed belief is significantly newer than the extant consensus AND we have a strong argument that the evidence is not yet widely shared / understood.

Any belief so extreme wouldn't really feel like a "belief" in the colloquial sense, though; I don't internally label my belief that there is a chair under my butt as a "belief". That label instinctually gets used for things I am much less certain about, so most normal people doing an internal search for "beliefs" will only think of things that they are not extremely certain of. Most beliefs worth having are extreme, but most beliefs internally labeled as "belief" worth having are not extreme.
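As a rough illustration of why such mundane beliefs count as extreme (the specific numbers here are hypothetical, not from the thread): the log-odds shift needed to go from a tiny prior to near-certainty is only a few dozen bits, which everyday evidence routinely supplies.

```python
import math

def bits_of_evidence(prior, posterior):
    """Log-odds shift, in bits, needed to move a prior to a posterior."""
    odds = lambda p: p / (1 - p)
    return math.log2(odds(posterior) / odds(prior))

# Hypothetical numbers: before someone tells you their name, your
# probability of the specific name might be ~1e-7; afterward, ~0.99.
# One short utterance carried roughly thirty bits of evidence.
print(round(bits_of_evidence(1e-7, 0.99), 1))  # prints 29.9
```

The point being illustrated: a posterior of 0.99 on a one-in-ten-million hypothesis is an "extreme" belief by any odds-ratio measure, yet reaching it takes only ordinary observations.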

MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

I expect there to be a massive and important distinction between "passive transparency" and "active transparency", with the latter being much more shaky and potentially concealing of fatality, and the former being cruder as tech at the present rate which is unfortunate because it has so many fewer ways to go wrong.  I hope any terminology chosen continues to make the distinction clear.

Possibly relevant here is my transparency trichotomy between inspection transparency, training transparency, and architectural transparency. My guess is that inspection transparency and training transparency would mostly go in your “active transparency” bucket and architectural transparency would mostly go in your “passive transparency” bucket. I think there is a position here that makes sense to me, which is perhaps what you're advocating, that architectural transparency isn't relying on any sort of path-continuity arguments in terms of how your training ... (read more)

Excerpt from Arbital Solomonoff induction dialogue

Seems just false.  If you're not worried about confronting agents of equal size (which is equally a concern for a Solomonoff inductor) then a naive bounded Solomonoff inductor running on a Grahamputer will give you essentially the same result for all practical purposes as a Solomonoff inductor.  That's far more than enough compute to contain our physical universe as a hypothesis.  You don't bother with MCMC on a Grahamputer.
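For concreteness, here is a toy sketch of the structure of a naive bounded Solomonoff inductor (the "program" space and interpreter below are drastically simplified stand-ins I made up, not anything from the dialogue): enumerate every program up to a length bound, weight each by 2^-length, discard those inconsistent with the observed data, and let the survivors vote on the next bit.

```python
from itertools import product

def toy_solomonoff_predict(history, max_len=10):
    """Bounded Solomonoff-style prediction over a toy program space.

    Each 'program' is a bit-string p with len(p) <= max_len, which our
    toy interpreter runs as "emit p repeated forever". Prior weight is
    2**-len(p); programs inconsistent with the observed history are
    discarded, and the survivors vote (by weight) on the next bit.
    """
    w_one = w_total = 0.0
    n = len(history)
    for length in range(1, max_len + 1):
        for p in product("01", repeat=length):
            out = [p[i % length] for i in range(n + 1)]
            if "".join(out[:n]) == history:   # consistent with the data?
                w = 2.0 ** -length            # 2^-length prior
                w_total += w
                if out[n] == "1":
                    w_one += w
    return w_one / w_total                    # P(next bit = 1)

# After seeing "0101", the shortest consistent program ("01") dominates
# the posterior, so the predicted probability of a '1' is low (~0.27).
print(toy_solomonoff_predict("0101"))
```

A real bounded inductor would run arbitrary programs on a universal machine under time and length cutoffs; the enumeration, length prior, and consistency filtering are the parts this sketch keeps.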

2Jameson Quinn2moIf we're positing a Grahamputer, then "yeah but it's essentially the same if you're not worried about agents of equal size" seems too loose. In other words, with great compute power, comes great compute responsibility.
Excerpt from Arbital Solomonoff induction dialogue

(IIRC, that dialogue is basically me-written.)

Your Cheerful Price

I used it this afternoon to pay a housemate to sterilize the contents of a package. They said $5.

3adamzerner2moDo any other examples come to mind? I'm finding it difficult to think about without, say, 5 concrete examples to latch on to.
Extensions and Intensions

Correction for future note:  The extensional definition is the complete set of objects obeying a definition.  Defining a thing by pointing out some examples (without pointing out all possible examples) is called an "ostensive definition".  H/t @clonusmini on Twitter.  Original discussion in "Language in Thought and Action" here.

Why I'm excited about Debate

Now, consider the following simplistic model for naive (un)aligned AGI:

The AGI outputs English sentences.  Each time the AGI does, the human operator replies on a scale of 1 to 100 with how good and valuable and useful that sentence seemed to the human.  The human may also input other sentences to the AGI as a hint about what kind of output the human is currently looking for; and the AGI also has purely passive sensory inputs like a fixed webcam stream or a pregathered internet archive.

How does this fail as an alignment methodology?  Doesn't... (read more)

1William_S3moIf the High-Rated Sentence Producer was restricted to output only single steps of a mathematical proof and the single steps were evaluated independently, with the human unable to look at previous steps, then I wouldn't expect this kind of reward hacking to occur. In math proofs, we can build proofs for more complex questions out of individual steps that don't need to increase in complexity. As I see it, debate on arbitrary questions could work if we figured out how to do something similar, having arguments split into single steps and evaluated independently (as in the recent OpenAI debate work), such that the debate AI can tackle more complicated questions with steps that are restricted to the complexity that humans can currently work with. Hard to know if this is possible, but still seems worth trying to work on.
1ofer3moOne might argue: So I think it's important to also note that, getting a neural network to "perform roughly at human-level in an aligned manner" may be a much harder task than getting a neural network to achieve maximal rating by breaking the operator. The former may be a much narrower target. This point is closely related to what you wrote here [] in the context of amplification:

I think I agree with all of this. In fact, this argument is one reason why I think Debate could be valuable, because it will hopefully increase the maximum complexity of arguments that humans can reliably evaluate.

This eventually fails at some point, but hopefully it fails after the point at which we can use Debate to solve alignment in a more scalable way. (I don't have particularly strong intuitions about whether this hope is justified, though.)

Why I'm excited about Debate

I’m reasonably compelled by Sperber and Mercer’s claim that explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments.

Seems obviously false.  If we simplistically imagine humans as being swayed by, and separately arguing, an increasingly sophisticated series of argument types that we could label 0, 1, 2, ...N, N+1, and which are all each encoded in a single allele that somehow arose to fixation, then the capacity to initially recognize and be swayed by a type N+1 argument is a... (read more)

If arguments had no meaning but to argue other people into things, if they were being subject only to neutral selection or genetic drift or mere conformism, there really wouldn't be any reason for "the kind of arguments humans can be swayed by" to work to build a spaceship.  We'd just end up with some arbitrary set of rules fixed in place.

I agree with this. My position is not that explicit reasoning is arbitrary, but that it developed via an adversarial process where arguers would try to convince listeners of things, and then listeners would try to di... (read more)

Now, consider the following simplistic model for naive (un)aligned AGI:

The AGI outputs English sentences.  Each time the AGI does, the human operator replies on a scale of 1 to 100 with how good and valuable and useful that sentence seemed to the human.  The human may also input other sentences to the AGI as a hint about what kind of output the human is currently looking for; and the AGI also has purely passive sensory inputs like a fixed webcam stream or a pregathered internet archive.

How does this fail as an alignment methodology?  Doesn't... (read more)

2adamShimi3moTo check if I understand correctly, you're arguing that the selection pressure to use argument in order to win requires the ability to be swayed by arguments, and the latter already requires explicit reasoning? That seems convincing as a counter-argument to "explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments.", but I'm not knowledgeable enough about the work quoted to check if they don't have a more subtle position.
Inner Alignment in Salt-Starved Rats

Now, for the rats, there’s an evolutionarily-adaptive goal of "when in a salt-deprived state, try to eat salt". The genome is “trying” to install that goal in the rat’s brain. And apparently, it worked! That goal was installed! 

This is importantly technically false in a way that should not be forgotten on pain of planetary extinction:

The outer loss function training the rat genome was strictly inclusive genetic fitness.  The rats ended up with zero internal concept of inclusive genetic fitness, and indeed, no coherent utility function; and instea... (read more)

Thanks for your comment! I think that you're implicitly relying on a different flavor of "inner alignment" than the one I have in mind.

(And confusingly, the brain can be described using either version of "inner alignment"! With different resulting mental pictures in the two cases!!)

See my post Mesa-Optimizers vs "Steered Optimizers" for details on those two flavors of inner alignment.

I'll summarize here for convenience.

I think you're imagining that the AGI programmer will set up SGD (or equivalent) and the thing SGD does is analogous to evolution acting on... (read more)

Matt Botvinick on the spontaneous emergence of learning algorithms

What is all of humanity if not a walking catastrophic inner alignment failure? We were optimized for one thing: inclusive genetic fitness. And only a tiny fraction of humanity could correctly define what that is!

1maximkazhenkov8moIsn't evolution a better analogy for deep learning anyway? All natural selection does is gradient descent (hill climbing technically), with no capacity for lookahead. And we've known this one for 150 years!

I mean, it could both be the case that there exists catastrophic inner alignment failure between humans and evolution, and also that humans don't regularly experience catastrophic inner alignment failures internally.

In practice I do suspect humans regularly experience internal (within-brain) inner alignment failures, but given that suspicion I feel surprised by how functional humans manage to be. That is, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.

Developmental Stages of GPTs
I don't want to take away from MIRI's work (I still support them, and I think that if the GPTs peter out, we'll be glad they've been continuing their work), but I think it's an essential time to support projects that can work for a GPT-style near-term AGI

I'd love to know of a non-zero integer number of plans that could possibly, possibly, possibly work for not dying to a GPT-style near-term AGI.

1[comment deleted]8mo

Here are 11. I wouldn't personally assign greater than 50/50 odds to any of them working, but I do think they all pass the threshold of “could possibly, possibly, possibly work.” It is worth noting that only some of them are language modeling approaches—though they are all prosaic ML approaches—so it does sort of also depend on your definition of “GPT-style” how many of them count or not.

Maybe put out some sort of prize for the best ideas for plans?

Open & Welcome Thread - February 2020

Thank you for sharing this info. My faith is now shaken.

2Mason Bially10moI always thought the EMH was obviously invalid due to its connection with the P=NP issue (which is to say the EMH iff P=NP).

From someone replying to you on Twitter:

Someone made a profitable trade ergo markets aren’t efficient?

This is why I said "at least for me". You'd be right to discount the evidence and he would be right to discount the evidence even more, because of more room for selection bias.

ETA: Hmm, intuitively this makes sense but I'm not sure how it squares up with Aumann Agreement. Maybe someone can try to work out the actual math?

Time Binders

Yes, via "Language in Thought and Action" and the Null-A novels.

Is Clickbait Destroying Our General Intelligence?

(Deleted section on why I thought cultural general-intelligence software was not much of the work of AGI:)

...because the soft fidelity of implicit unconscious cultural transmission can store less serially deep and intricate algorithms than the high-fidelity DNA transmission used to store the kind of algorithms that appear in computational neuroscience.

I recommend Terrence Deacon's The Symbolic Species for some good discussion of the surprising importance of the shallow algorithms and parameters that can get transmitted culturally. The human-raised chi... (read more)

Anatomically modern humans appeared around 300K years ago, but civilisation started only 5K years ago. It seems that this time was needed to polish the training data set for general intelligence.

I read a book about prehistoric art, and it strikes me that the idea of a drawing took tens of thousands of years to consolidate. This idea of drawing later evolved into symbols and text.

Paul's research agenda FAQ

It would be helpful to know to what extent Paul feels like he endorses the FAQ here. This makes it sound like Yet Another Stab At Boiling Down The Disagreement would say that I disagree with Paul on two critical points:

  • (1) To what extent "using gradient descent or anything like it to do supervised learning" involves a huge amount of Project Chaos and Software Despair before things get straightened out, if they ever do;
  • (2) Whether there's a simple scalable core to corrigibility that you can find by searching for thought processes that seem
... (read more)
It would be helpful to know to what extent Paul feels like he endorses the FAQ here... I don't want to invest huge amounts arguing with this until I know to what extent Paul agrees with either the FAQ, or that this sounds like a plausible locus of disagreement.

Note that the second paragraph of zhukeepa's post now contains this:

ETA: Paul does not have major disagreements with anything expressed in this FAQ. There are many small points he might have expressed differently, but he endorses this as a reasonable representation of his views. This is in
... (read more)


It's difficult to tell, having spent some time (but not a very large amount of time) following this back-and-forth, whether much progress is being made in furthering Eliezer's and Paul's understanding of each other's positions and arguments. My impression is that there has been some progress, mostly from Paul vetoing Eliezer's interpretations of Paul's agenda, but by nature this is a slow kind of progress - there are likely many more substantially incorrect interpretations than substantially correct ones, so even... (read more)

9paulfchristiano3yI agree with the first part of this. The second isn't really true because the resulting AI might be very inefficient (e.g. suppose you could tell which cognitive strategies are safe but not which are effective). Overall I don't think it's likely to be useful to talk about this topic until having much more clarity on other stuff (I think this section is responding to a misreading of my proposal). This stuff about inspecting thoughts fits into the picture when you say: "But even if you are willing to spend a ton of time looking at a particular decision, how could you tell if it was optimized to cause a catastrophic failure?" and I say "if the AI has learned how to cause a catastrophic failure, we can hope to set up the oversight process so it's not that much harder to explain how it's causing a catastrophic failure" and then you say "I doubt it" and I say "well that's the hope, it's complicated" and then we discuss whether that problem is actually soluble. And that does have a bunch of hard steps, especially the one where we need to be able to open up some complex model that our AI formed of the world in order to justify a claim about why some action is catastrophic.
Eliezer thinks that in the alternate world where this is true, GANs pretty much worked the first time they were tried

Note that GANs did in fact pretty much work the first time they were tried, at least according to Ian's telling, in the strong sense that he had them working on the same night that he came up with the idea over drinks. (That wasn't a journalist editorializing, that's the story as he tells it.)

GANs seem to be unstable in just about the ways you'd expect them to be unstable on paper, we don't have to posit any magical... (read more)

Eliezer thinks that if you have any optimization powerful enough to reproduce humanlike cognition inside a detailed boundary by looking at a human-labeled dataset trying to outline the boundary, the thing doing the optimization is powerful enough that we cannot assume its neutrality the way we can assume the neutrality of gradient descent.

To clarify: it's not that you think that gradient descent can't in fact find human-level cognition by trial and error, it's that you think "the neutrality of gradient descent" is an artifact of ... (read more)

But you will get the kind of weird squiggles in the learned function that adversarial examples expose in current nets - special inputs that weren't in the training distribution, but look like typical members of the training distribution from the perspective of the training distribution itself, will break what we think is the intended labeling from outside the system.

I don't really know what you mean by "squiggles." If you take data that is off the distribution, then your model can perform poorly. This can be a problem if your distribut... (read more)
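Whatever "squiggles" cashes out to for deep nets, the basic off-distribution failure is already visible in a bare linear model. In an FGSM-style construction (sketched here with entirely toy numbers), a perturbation that is tiny per-coordinate moves a linear score by ε times the L1 norm of the weights, which in high dimension dwarfs the original margin:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000                          # input dimension
w = rng.normal(size=d)            # linear classifier: label = sign(w @ x)

x = rng.normal(size=d)            # a "natural-looking" input...
x += (5.0 - w @ x) / (w @ w) * w  # ...nudged so it sits at margin +5 (confidently positive)

eps = 0.01                        # per-coordinate budget: 1% of typical coordinate size
x_adv = x - eps * np.sign(w)      # move every coordinate eps against the weights

# The score drops by exactly eps * ||w||_1, which scales linearly with dimension:
print(w @ x, w @ x_adv)           # margin +5 becomes negative; the label flips
```

The adversarial input is indistinguishable from a typical training point coordinate-by-coordinate, yet breaks the intended labeling — the linear-model ancestor of the phenomenon in question.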

Eliezer thinks that while corrigibility probably has a core which is of lower algorithmic complexity than all of human value, this core is liable to be very hard to find or reproduce by supervised learning of human-labeled data, because deference is an unusually anti-natural shape for cognition, in a way that a simple utility function would not be an anti-natural shape for cognition. Utility functions have multiple fixpoints requiring the infusion of non-environmental data, our externally desired choice of utility function would be non-natural in that sens
... (read more)
A Rationalist Argument for Voting

Voting in elections is a wonderful example of logical decision theory in the wild. The chance that you are genuinely logically correlated to a random trade partner is probably small, in cases where you don't have mutual knowledge of LDT; leaving altruism and reputation as sustaining reasons for cooperation. With millions of voters, the chance that you are correlated to thousands of them is much better.

Or perhaps you'd prefer to believe the dictate of Causal Decision Theory that if an election is won by 3 votes, nobody's vote influenced it,... (read more)
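For concreteness, the "one vote never matters" intuition can be checked against the actual arithmetic. Under a toy model where 2m other voters each flip a fair coin, your vote is pivotal exactly when they tie, and that probability falls off only like 1/√(πm), not exponentially (stdlib only):

```python
import math

def prob_tie(m, p=0.5):
    """P(exactly m of 2m independent voters choose side A), via log-gamma to avoid overflow."""
    log_p = (math.lgamma(2 * m + 1) - 2 * math.lgamma(m + 1)
             + m * math.log(p) + m * math.log(1 - p))
    return math.exp(log_p)

m = 5000                     # 10,000 other voters
p_pivotal = prob_tie(m)      # ~ 1 / sqrt(pi * m) by Stirling -- under 1%, but nowhere near 0
print(p_pivotal)
```

Even on purely causal accounting the expected swing is p_pivotal times the stakes; the LDT point is that if your decision is logically correlated with thousands of like-minded voters, the effective swing is multiplied by roughly that number.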

8steven04613yIt seems to me there are also millions of potential acausal trade partners in non-voting contexts, e.g. in the context of whether to spend most of your effort egoistically or altruistically and toward which cause, whether to obey the law, etc. The only special feature of voting that I can see is it gives you a share in years' worth of policy at the cost of only a much smaller amount of your time, making it potentially unusually efficient for altruists.
9Wei_Dai3yDo you know if anyone has done/published a calculation on whether, given reasonable beliefs about (i.e., a large amount of uncertainty over) opportunity costs and logical correlations, voting is actually a good thing to do from an x-risk perspective?
Toolbox-thinking and Law-thinking

Savage's Theorem isn't going to convince anyone who doesn't start out believing that preference ought to be a total preorder. Coherence theorems are talking to anyone who starts out believing that they'd rather have more apples.

I can't make sense of this comment.

If one is talking about one's preferences over number of apples, then the statement that it is a total preorder, is a weaker statement than the statement that more is better. (Also, you know, real number assumptions all over the place.) If one is talking about preferences not just over number of apples but in general, then even so it seems to me that the complete class theorem seems to be making some very strong assumptions, much stronger than the assumption of a total preorder! (Again, look at all those real number assum... (read more)
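For readers keeping score on what the axioms actually demand: a total preorder is just completeness plus transitivity, and for a finite option set both are mechanically checkable. A minimal illustration with two hypothetical preference relations (stdlib only):

```python
from itertools import product

def is_total_preorder(options, weakly_prefers):
    """weakly_prefers(a, b) means 'a is at least as good as b'."""
    options = list(options)
    complete = all(weakly_prefers(a, b) or weakly_prefers(b, a)
                   for a, b in product(options, repeat=2))
    transitive = all(not (weakly_prefers(a, b) and weakly_prefers(b, c))
                     or weakly_prefers(a, c)
                     for a, b, c in product(options, repeat=3))
    return complete and transitive

apples = range(10)
more_is_better = lambda a, b: a >= b     # "I'd rather have more apples": a total order
cyclic = lambda a, b: (a - b) % 10 <= 5  # complete, but preferences chase their tail

print(is_total_preorder(apples, more_is_better))  # True
print(is_total_preorder(apples, cyclic))          # False: fails transitivity
```

This makes the dialectical point concrete: "more apples is better" already hands you a total preorder over apple counts for free, so the total-preorder assumption isn't doing extra argumentative work against the apple-preferrer.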

Local Validity as a Key to Sanity and Civilization

There will be a single very cold day occasionally regardless of whether global warming is true or false. Anyone who knows the phrase "modus tollens" ought to know that. That said, if two unenlightened ones are arguing back and forth in all sincerity by telling each other about the hot versus cold days they remember, neither is being dishonest, but both are making invalid arguments. But this is not the scenario offered in the original, which concerns somebody who does possess the mental resources to know better, but is tempted to rationalize in... (read more)

1Said Achmiz3yYes, that is what I was saying. The (apparent) reasoning I am challenging, Eliezer, is this: “Alice is making an invalid argument for side A” -> “therefore Alice would not make an invalid argument for side B”. This seems faulty. You seem to be taking someone’s use of invalid reasoning as evidence of one-sidedness (and thus dishonesty), whereas it could just be evidence of not understanding what does and does not constitute a valid argument (but no dishonesty). In other words, “this” (i.e., “somebody who … is tempted to rationalize in order to reach the more agreeable conclusion”) is not quite the scenario offered in the original. The scenario you offer, rather, is one where you conclude that somebody is rationalizing—but what I am saying is that your conclusion rests on a faulty inference. This is all a minor point, of course, and does not take away from your main points. The reason I bring it up, is that I see you as imputing ill intent (or at least, blameworthy bias) to at least some people who aren’t deserving of that judgment. Avoiding this sort of thing is also important for sanity, civilization, etc.
A LessWrong Crypto Autopsy

This is pretty low on the list of opportunities I'd kick myself for missing. A longer reply is here:

Arbital postmortem

The vision for Arbital would have provided incentives to write content, but those features were not implemented before the project ran out of time. I did not feel that at any point the versions of Arbital that were in fact implemented were at a state where I predicted they'd attract lots of users, and said so.

6ChristianKl3yGiven that the project did have time to pivot and try something different, it seems to me as if time was there. It sounds to me like the main problem was communication and agreeing on a common vision?

Interesting, any chance you could describe it?

I'm very curious how you solved the incentives problem; would you mind detailing it? Alexei mentioned that you already did the write-up, so even a link to your rough draft would satisfy me.

Pascal’s Muggle Pays

Unless I'm missing something, the trouble with this is that, absent a leverage penalty, all of the reasons you've listed for not having a muggable decision algorithm... drumroll... center on the real world, which, absent a leverage penalty, is vastly outweighed by tiny probabilities of googolplexes and ackermann numbers of utilons. If you don't already consider the Mugger's claim to be vastly improbable, then all the considerations of "But if I logically decide to let myself be mugged that retrologically increases his probability ... (read more)

5Zvi3yI thought for a while about the best way to formalize what I'm thinking in a way that works here. I do observe that I keep getting tempted by "hey there's an obvious leverage penalty here" by the "this can only happen for real one time in X where X is the number of lives saved times the percent of people who agree" because of the details of the mugging. Or alternatively, the maximum total impact people together can expect to have in paying off things like this (before the Mugger shows up) seems reasonably capped at one life per person, so our collective real-world decisions clearly matter far more than that very-much-not-hard upper bound. I think that points to the answer I like most, which is that my reasons aren't tied only to the real world. They're also tied to the actions of other logically correlated agents, which includes the people on other worlds that I'm possibly being offered the chance to save, and also the Matrix lords. I mean, we'd hate to have bad results presented at the Matrix Lord decision theory conference he's doing research for, that seems rather important. Science. My decision reaches into the origin worlds of the Matrix Lords, and if they spent all their days going around Mugging each other with lies about Matrix Lords during their industrial period, it's doubtful the Matrix gets constructed at all. Then I wouldn't even exist to pay the guy. It reaches into all the other worlds in the Matrix, which have to make the same decisions I do, and we can't all be given this opportunity to win ackermann numbers of utilons. I think I don't need to do that, and I can get out of this simply by saying that before I see the Mugger I have to view potential genuine Muggers offering googleplexes of utility without evidence as not a large source of potential wins (e.g. at this point my probability really is that low). I mean, assuming the potential Mugger hasn't seen any of these discussions (I do think there's at least a one in a ten chance someone trie

Additional prediction: it was more fun to write this than the book, and the writing involved an initial long contiguous chunk.

(Where it's coming from: my enjoyment of reading the above was mostly a little bit of thrill, of the kind I get from watching someone break rules that I always wished I could break. If that makes sense.)

Zombies Redacted

Sure. Measure a human's input and output. Play back the recording. Or did you mean across all possible cases? In the latter case see

4CronoDAS5yYeah, I meant in all possible cases. Start with a Brain In A Vat. Scan that brain and implement a GLUT in Platospace, then hook up the Brain-In-A-Vat and the GLUT to identical robots, and you'll have one robot that's conscious and one that isn't, right?
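For readers who haven't met the GLUT ("Giant Lookup Table") before: it is just an agent's input/output behavior, tabulated exhaustively in advance. A toy version for a tiny deterministic agent over short input histories (a sketch; the philosophical puzzle is that the table reproduces the behavior with no computation inside, just retrieval):

```python
from itertools import product

def agent(history):
    """A tiny stand-in 'brain': answers 1 iff it has seen more 1s than 0s so far."""
    return 1 if sum(history) * 2 > len(history) else 0

# Compile the agent's entire behavior, up to horizon k, into a lookup table.
k = 8
glut = {h: agent(h) for n in range(k + 1) for h in product((0, 1), repeat=n)}

# The GLUT now matches the agent on every possible input sequence within the horizon.
history = (1, 0, 1, 1)
print(agent(history), glut[history])  # identical outputs, radically different internals
```

Scaling this to a human brain-in-a-vat is of course physically absurd — the table grows exponentially in the horizon — which is exactly why it only ever appears as a thought experiment.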
3HungryHobo5y"Error rendering the page: Couldn't find page"
0Stuart_Armstrong5yYep, very much related.
Machine learning and unintended consequences

Ed Fredkin has since sent me a personal email:

By the way, the story about the two pictures of a field, with and without army tanks in the picture, comes from me. I attended a meeting in Los Angeles, about half a century ago where someone gave a paper showing how a random net could be trained to detect the tanks in the picture. I was in the audience. At the end of the talk I stood up and made the comment that it was obvious that the picture with the tanks was made on a sunny day while the other picture (of the same field without the tanks) was made on

... (read more)
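Whatever the provenance of the original anecdote, the failure mode it describes is real and trivially reproducible: if a nuisance variable (brightness) is perfectly confounded with the label in the training set, a learner will happily latch onto the nuisance. A tiny synthetic sketch, with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_images(n, tank, sunny):
    """4x4 'images': sunny adds brightness everywhere; a tank adds a bright corner blob."""
    base = rng.normal(size=(n, 4, 4)) + (2.0 if sunny else 0.0)
    if tank:
        base[:, :2, :2] += 1.0
    return base

# Training set replicates the story: every tank photo is sunny, every empty field cloudy.
train_x = np.concatenate([make_images(50, tank=True, sunny=True),
                          make_images(50, tank=False, sunny=False)])
train_y = np.array([1] * 50 + [0] * 50)

# A lazy "classifier": threshold on mean brightness, fit to the training data.
threshold = train_x.mean(axis=(1, 2)).mean()
predict = lambda x: (x.mean(axis=(1, 2)) > threshold).astype(int)
train_acc = (predict(train_x) == train_y).mean()

# Deployment: tanks on a cloudy day, empty fields on a sunny day.
test_x = np.concatenate([make_images(50, tank=True, sunny=False),
                         make_images(50, tank=False, sunny=True)])
test_y = np.array([1] * 50 + [0] * 50)
test_acc = (predict(test_x) == test_y).mean()

print(train_acc, test_acc)  # near-perfect in training, worse than chance once deployed
```

The classifier "worked" by every training metric while learning sunny-vs-cloudy rather than tank-vs-no-tank — the unintended-consequences pattern the post is about, independent of whether the 1960s version ever happened.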
0casebash5yI'm curious, do you disagree with the post? I believe that the point being made is a) overwhelmingly supported by logic, or at the very least a logically consistent alternate viewpoint b) important to rationality (by preventing people trying to solve problems with no solution) c) overlooked in previous discussion or at least underdeveloped. Because of this I took the socially risky gambit of moving a low voted (positive at the time) post to main.
6Gleb_Tsipursky5yYup, will not, based on the FB discussion here [] . Thanks for helping me update!
A toy model of the control problem

I assume the point of the toy model is to explore corrigibility or other mechanisms that are supposed to kick in after A and B end up not perfectly value-aligned, or maybe just to show an example of why a non-value-aligning solution for A controlling B might not work, or maybe specifically to exhibit a case of a not-perfectly-value-aligned agent manipulating its controller.

A toy model of the control problem

When I consider this as a potential way to pose an open problem, the main thing that jumps out at me as being missing is something that doesn't allow A to model all of B's possible actions concretely. The problem is trivial if A can fully model B, precompute B's actions, and precompute the consequences of those actions.

The levels of 'reason for concern about AI safety' might ascend something like this:

  • 0 - system with a finite state space you can fully model, like Tic-Tac-Toe
  • 1 - you can't model the system in advance and therefore it may exhibit unantici
... (read more)
1Stuart_Armstrong6yAdded a cheap way to get us somewhat in the region of 2, just by assuming that B/C can model A, which precludes A being able to model B/C in general.
Procedural Knowledge Gaps

I recall originally reading something about a measure of exercise-linked gene expression and I'm pretty sure it wasn't that New Scientist article, but regardless, it's plausible that some mismemory occurred and this more detailed search screens off my memory either way. 20% of the population being immune to exercise seems to match real-world experience a bit better than 40% so far as my own eye can see - I eyeball-feel more like a 20% minority than a 40% minority, if that makes sense. I have revised my beliefs to match your statements. Thank you for tracking that down!

0gwern5yThat's certainly possible. Bouchard and others, after observing that some subjects were exercise-resistant and finding that like everything else it's heritable, have moved onto gene expression and GWAS hits. Any of those papers could've generated some journalism covering the earlier HERITAGE results as background. Another study suggests it's more like 7%. Probably hard to get a real estimate: how do you do the aggregation across multiple measured traits? If someone appears to be exercise resistant on visceral fat, but not blood glucose levels, do you count them as a case of exercise resistance? On top of the usual sampling error.
Don't You Care If It Works? - Part 1

"Does somebody being right about X increase your confidence in their ability to earn excess returns on a liquid equity market?" has to be the worst possible question to ask about whether being right in one thing should increase your confidence about them being right elsewhere. Liquid markets are some of the hardest things in the entire world to outguess! Being right about MWI is enormously easier than being right about what Microsoft stock will do relative to the rest of the S&P 500 over the next 6 months.

There's a gotcha to the gotcha wh... (read more)

2Lumifer6yWhat do you mean, "knowable"? Showing that MWI is correct while other interpretations are not is straight-up Nobel material.
If MWI is correct, should we expect to experience Quantum Torment?

You're confusing subjective probability and objective quantum measure. If you flip a quantum coin, half your measure goes to worlds where it comes up heads and half goes to where it comes up tails. This is an objective fact, and we know it solidly. If you don't know whether cryonics works, you're probably still already localized by your memories and sensory information to either worlds where it works or worlds where it doesn't; all or nothing, even if you're ignorant of which.

-3Fivehundred6yHow far do "memories and sensory information" extend? I'm worried about what happens during sleep. It's been argued that dreams [] are a stability mechanism that prevent self-change, but I don't know if that applies to the external world. Following this line of argument, our memories could change while we are awake if we aren't actively remembering them.
Pascal's Muggle: Infinitesimal Priors and Strong Evidence

can even strip out the part about agents and carry out the reasoning on pure causal nodes; the chance of a randomly selected causal node being in a unique100 position on a causal graph with respect to 3↑↑↑3 other nodes ought to be at most 100/3↑↑↑3 for finite causal graphs.

0G0W516yYou're absolutely right. I'm not sure how I missed or forgot about reading that.
Rationality is about pattern recognition, not reasoning

You have not understood correctly regarding Carl. He claimed, in hindsight, that Zuckerberg's potential could've been distinguished in foresight, but he did not do so.

4JonahS6yI'm puzzled, is there a way to read his comment other than as him doing it at the time?
Pascal's Muggle: Infinitesimal Priors and Strong Evidence

I don't think you can give me a moment of pleasure that intense without using 3^^^3 worth of atoms on which to run my brain, and I think the leverage penalty still applies then. You definitely can't give me a moment of worthwhile happiness that intense without 3^^^3 units of background computation.

0G0W516yThe article said the leverage penalty "[penalizes] hypotheses that let you affect a large number of people, in proportion to the number of people affected." If this is all the leverage penalty does, then it doesn't matter if it takes 3^^^3 atoms or units of computation, because atoms and computations aren't people. That said, the article doesn't precisely define what the leverage penalty is, so there could be something I'm missing. So, what exactly is the leverage penalty? Does it penalize how many units of computation, rather than people, you can affect? This sounds much less arbitrary than the vague definition of "person" and sounds much easier to define: simply divide the prior of a hypothesis by the number of bits flipped by your actions in it and then normalize.
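One way to see what the leverage penalty buys, whichever units it is denominated in (people, atoms, computations, or the commenter's proposed bits flipped): dividing the prior by N while the offered payoff scales with N makes the expected value independent of N, so astronomically scaled offers stop automatically dominating. A sketch of the arithmetic with exact rationals:

```python
from fractions import Fraction

def penalized_ev(base_prior, n_affected, utility_per_unit):
    """Leverage-penalized expected value: prior is divided by the number of units affected."""
    prior = base_prior / n_affected          # the leverage penalty
    payoff = n_affected * utility_per_unit   # the mugger's offer scales with N
    return prior * payoff                    # the Ns cancel exactly

base_prior = Fraction(1, 1000)
u = Fraction(1)

small_offer = penalized_ev(base_prior, 10**6, u)
vast_offer = penalized_ev(base_prior, 10**100, u)  # a stand-in for 3^^^3, which fits nowhere
print(small_offer, vast_offer)  # identical: the penalty exactly cancels the offer's scale
```

This is only the shape of the argument, not a full decision theory — the hard part the thread is wrestling with is justifying the division and picking the right N, not performing it.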
Load More