I reiterate the galaxy example; saying that you could counterfactually make an observation by violating physical law is not the same as saying that something's meaning cashes out to anticipated experiences. Consider the (exact) analogy between believing that galaxies exist after they go over the horizon, and that other quantum worlds go on existing after we decohere them away from us by observing ourselves being inside only one of them. Predictivism is exactly the sort of ground on which some people have tried to claim that MWI isn't meaningful... (read more)
One minor note: among the reasons I haven't looked especially hard into the origins of "verificationism"(?) as a theory of meaning is that I do in fact - as I understand it - explicitly deny this theory. The meaning of a statement is not the future experimental predictions that it brings about, nor isomorphic up to those predictions; all meaning about the causal universe derives from causal interactions with us, but you can have meaningful statements with no experimental consequences, for example: "Galaxies continue to exist after the ... (read more)
My point is that plausible scenarios for Aligned AGI give you AGI that remains aligned only when run within power bounds, and this seems to me like one of the largest facts affecting the outcome of arms-race dynamics.
This all assumes that AGI does whatever its supposed operator wants it to do, and that other parties believe as much? I think the first part of this is very false, though the second part alas seems very realistic, so I think this misses the key thing that makes an AGI arms race lethal.
I expect that a dignified apocalypse looks like, "We could do limited things with this software and hope to not destroy the world, but as we ramp up the power and iterate the for-loops more times, the probability of destroying the world goes up along a logistic curve." ... (read more)
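Purely to pin down the shape of that claim (my parameterization, nothing from the original comment): writing x for how far the power and for-loop count have been ramped up, a logistic curve for the failure probability looks like

```latex
P(\text{catastrophe} \mid x) \;=\; \frac{1}{1 + e^{-k\,(x - x_0)}}, \qquad k > 0,
```

with the steepness k and the midpoint x_0 left unspecified.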
Thank you very much! It seems worth distinguishing the concept invention from the name brainstorming, in a case like this one, but I now agree that Rob Miles invented the word itself.
The technical term corrigibility, coined by Robert Miles, was introduced to the AGI safety/alignment community in the 2015 MIRI/FHI paper titled Corrigibility.
Eg I'd suggest that to avoid confusion this kind of language should be something like "The technical term corrigibility, a name suggested by Robert Miles to denote concepts previously discussed at MIRI, was introduced..." &c.
Seems rather obvious to me that the sort of person who is like, "Oh, well, we can't possibly work on this until later" will, come Later, be like, "Oh, well, it's too late to start doing basic research now, we'll have to work with whatever basic strategies we came up with already."
Seems true, but also didn't seem to be what this post was about?
Why do you think the term "corrigibility" was coined by Robert Miles? My autobiographical memory tends to be worryingly fallible, but I remember coining this term myself after some brainstorming (possibly at a MIRI workshop). This is a kind of thing that I usually try to avoid enforcing because it would look bad if all of the concepts that I did in fact invent were being cited as traceable to me - the truth about how much of this field I invented does not look good for the field or for humanity's prospects - but outright errors of this sort sho... (read more)
Lots of people work for their privileges! I practiced writing for a LONG time - and remain continuously aware that other people cannot be expected to express their ideas clearly, even assuming their ideas to be clear, because I have Writing Privilege and they do not. Does my Writing Privilege have an innate component? Of course it does; my birth lottery placed me in a highly literate household full of actually good books, which combined with genuine genetic talent got me a 670 Verbal score on the pre-restandardized SAT at age eleven; but ... (read more)
Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.
Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn't fall over, which is not the situation we are currently in.
But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips). Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time. Say, if you imagine somebody at Deepmind coming in without a lot of... (read more)
To answer your research question, in much the same way that in computer security any non-understood behavior of the system which violates our beliefs about how it's supposed to work is a "bug" and very likely en route to an exploit - in the same way that OpenBSD treats every crash as a security problem, because the system is not supposed to crash and therefore any crash proves that our beliefs about the system are false and therefore our beliefs about its security may also be false because its behavior is not known - in AI safety, you would expect system s... (read more)
Corollary: most beliefs worth having are extreme.
Though any belief so extreme wouldn't really feel like a "belief" in the colloquial sense: I don't internally label my belief that there is a chair under my butt as a "belief". That label instinctually gets used for things I am much less certain about, so most normal people doing an internal search for "beliefs" will only think of things that they are not extremely certain of. Most beliefs worth having are extreme, but among the things we internally label as "beliefs", most of the ones worth having are not extreme.
I expect there to be a massive and important distinction between "passive transparency" and "active transparency", with the latter being much more shaky and potentially concealing of fatality, and the former being cruder as tech at the present rate, which is unfortunate because it has so many fewer ways to go wrong. I hope any terminology chosen continues to make the distinction clear.
Possibly relevant here is my transparency trichotomy between inspection transparency, training transparency, and architectural transparency. My guess is that inspection transparency and training transparency would mostly go in your “active transparency” bucket and architectural transparency would mostly go in your “passive transparency” bucket. I think there is a position here that makes sense to me, which is perhaps what you're advocating, that architectural transparency isn't relying on any sort of path-continuity arguments in terms of how your training ... (read more)
Seems just false. If you're not worried about confronting agents of equal size (which is equally a concern for a Solomonoff inductor), then a naive bounded Solomonoff inductor running on a Grahamputer will give you essentially the same result for all practical purposes as a Solomonoff inductor. That's far more than enough compute to contain our physical universe as a hypothesis. You don't bother with MCMC on a Grahamputer.
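For concreteness, here is a toy sketch, entirely my own construction, of the flavor of "naive bounded Solomonoff induction" being gestured at: enumerate a bounded hypothesis class, discard hypotheses inconsistent with the data so far, and weight the survivors by 2^-description-length. To keep it runnable I substitute fixed-order lookup tables for real programs on a universal machine; a Grahamputer-scale version would enumerate and time-bound actual programs instead.

```python
import itertools

# Toy stand-in for bounded Solomonoff induction over binary sequences.
# Hypothesis class: deterministic order-k lookup tables mapping the last k bits
# to a predicted next bit.  A table of order k takes roughly k + 2**k bits to
# describe, so it gets prior weight 2**-(k + 2**k), mimicking the 2**-length
# prior over programs.

MAX_ORDER = 3  # the "length bound" on hypotheses

def hypotheses():
    for k in range(1, MAX_ORDER + 1):
        for table in itertools.product([0, 1], repeat=2 ** k):
            yield k, table, 2.0 ** -(k + 2 ** k)

def consistent(k, table, data):
    """Does this lookup table reproduce every observed bit after the first k?"""
    for i in range(k, len(data)):
        index = int("".join(map(str, data[i - k:i])), 2)
        if table[index] != data[i]:
            return False
    return True

def predict_next(data):
    """Posterior-weighted probability that the next bit is 1."""
    weight_one = weight_total = 0.0
    for k, table, prior in hypotheses():
        if len(data) >= k and consistent(k, table, data):
            index = int("".join(map(str, data[-k:])), 2)
            weight_total += prior
            weight_one += prior * table[index]
    return weight_one / weight_total if weight_total else 0.5

print(predict_next([0, 1, 0, 1, 0, 1, 0, 1]))  # close to 0: the simple surviving
                                               # hypotheses continue the alternation
```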
(IIRC, that dialogue is basically me-written.)
I used it this afternoon to pay a housemate to sterilize the contents of a package. They said $5.
Correction for future note: The extensional definition is the complete set of objects obeying a definition. Defining a thing by pointing out some examples (without pointing out all possible examples) is called an "ostensive definition". H/t @clonusmini on Twitter. Original discussion in "Language in Thought and Action" here.
Now, consider the following simplistic model for naive (un)aligned AGI:
The AGI outputs English sentences. Each time the AGI does, the human operator replies on a scale of 1 to 100 with how good and valuable and useful that sentence seemed to the human. The human may also input other sentences to the AGI as a hint about what kind of output the human is currently looking for; and the AGI also has purely passive sensory inputs like a fixed webcam stream or a pregathered internet archive.
How does this fail as an alignment methodology? Doesn't... (read more)
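For readers who want the toy model pinned down as an interface, here is a minimal sketch; every name in it is my own placeholder, and it deliberately says nothing about how the AGI turns ratings into behavior, which is exactly where the alignment question lives.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interaction:
    agi_sentence: str          # one English sentence output by the AGI
    human_rating: int          # 1..100: how good/valuable/useful it seemed
    human_hint: Optional[str]  # optional sentence hinting at what is wanted

@dataclass
class NaiveAlignmentSetup:
    passive_inputs: List[str]  # e.g. a fixed webcam stream or pregathered archive
    history: List[Interaction] = field(default_factory=list)

    def step(self, agi_sentence: str, human_rating: int,
             hint: Optional[str] = None) -> None:
        assert 1 <= human_rating <= 100
        self.history.append(Interaction(agi_sentence, human_rating, hint))
```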
I think I agree with all of this. In fact, this argument is one reason why I think Debate could be valuable, because it will hopefully increase the maximum complexity of arguments that humans can reliably evaluate.
This eventually fails at some point, but hopefully it fails after the point at which we can use Debate to solve alignment in a more scalable way. (I don't have particularly strong intuitions about whether this hope is justified, though.)
I’m reasonably compelled by Sperber and Mercier’s claim that explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments.
Seems obviously false. If we simplistically imagine humans as being swayed by, and separately arguing, an increasingly sophisticated series of argument types that we could label 0, 1, 2, ..., N, N+1, and which are each encoded in a single allele that somehow arose to fixation, then the capacity to initially recognize and be swayed by a type N+1 argument is a... (read more)
If arguments had no meaning but to argue other people into things, if they were subject only to neutral selection or genetic drift or mere conformism, there really wouldn't be any reason for "the kind of arguments humans can be swayed by" to work to build a spaceship. We'd just end up with some arbitrary set of rules fixed in place.
I agree with this. My position is not that explicit reasoning is arbitrary, but that it developed via an adversarial process where arguers would try to convince listeners of things, and then listeners would try to di... (read more)
Now, for the rats, there’s an evolutionarily-adaptive goal of "when in a salt-deprived state, try to eat salt". The genome is “trying” to install that goal in the rat’s brain. And apparently, it worked! That goal was installed!
This is importantly technically false in a way that should not be forgotten on pain of planetary extinction:
The outer loss function training the rat genome was strictly inclusive genetic fitness. The rats ended up with zero internal concept of inclusive genetic fitness, and indeed, no coherent utility function; and instea... (read more)
Thanks for your comment! I think that you're implicitly relying on a different flavor of "inner alignment" than the one I have in mind.
(And confusingly, the brain can be described using either version of "inner alignment"! With different resulting mental pictures in the two cases!!)
See my post Mesa-Optimizers vs "Steered Optimizers" for details on those two flavors of inner alignment.
I'll summarize here for convenience.
I think you're imagining that the AGI programmer will set up SGD (or equivalent) and the thing SGD does is analogous to evolution acting on... (read more)
What is all of humanity if not a walking catastrophic inner alignment failure? We were optimized for one thing: inclusive genetic fitness. And only a tiny fraction of humanity could correctly define what that is!
I mean, it could both be the case that there exists catastrophic inner alignment failure between humans and evolution, and also that humans don't regularly experience catastrophic inner alignment failures internally.
In practice I do suspect humans regularly experience internal (within-brain) inner alignment failures, but given that suspicion I feel surprised by how functional humans manage to be. That is, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.
I don't want to take away from MIRI's work (I still support them, and I think that if the GPTs peter out, we'll be glad they've been continuing their work), but I think it's an essential time to support projects that can work for a GPT-style near-term AGI
I'd love to know of a non-zero integer number of plans that could possibly, possibly, possibly work for not dying to a GPT-style near-term AGI.
Here are 11. I wouldn't personally assign greater than 50/50 odds to any of them working, but I do think they all pass the threshold of “could possibly, possibly, possibly work.” It is worth noting that only some of them are language modeling approaches—though they are all prosaic ML approaches—so it does sort of also depend on your definition of “GPT-style” how many of them count or not.
Maybe put out some sort of prize for the best ideas for plans?
Thank you for sharing this info. My faith is now shaken.
From someone replying to you on Twitter:
Someone made a profitable trade ergo markets aren’t efficient?
This is why I said "at least for me". You'd be right to discount the evidence and he would be right to discount the evidence even more, because of more room for selection bias.
ETA: Hmm, intuitively this makes sense but I'm not sure how it squares up with Aumann Agreement. Maybe someone can try to work out the actual math?
Yes, via "Language in Thought and Action" and the Null-A novels.
(Deleted section on why I thought cultural general-intelligence software was not much of the work of AGI:)
...because the soft fidelity of implicit unconscious cultural transmission can store less serially deep and intricate algorithms than the high-fidelity DNA transmission used to store the kind of algorithms that appear in computational neuroscience.
I recommend Terrence Deacon's The Symbolic Species for some good discussion of the surprising importance of the shallow algorithms and parameters that can get transmitted culturally. The human-raised chi... (read more)
Anatomically modern humans appeared around 300K years ago, but civilisation started only 5K years ago. It seems that this time was needed to polish the training data set for general intelligence.
I read a book about prehistoric art, and it strikes me that the idea of a drawing took tens of thousands of years to consolidate. This idea of drawing later evolved into symbols and text.
It would be helpful to know to what extent Paul feels like he endorses the FAQ here. This makes it sound like Yet Another Stab At Boiling Down The Disagreement would say that I disagree with Paul on two critical points:
It would be helpful to know to what extent Paul feels like he endorses the FAQ here... I don't want to invest huge amounts arguing with this until I know to what extent Paul agrees with either the FAQ, or that this sounds like a plausible locus of disagreement.
Note that the second paragraph of zhukeepa's post now contains this:
ETA: Paul does not have major disagreements with anything expressed in this FAQ. There are many small points he might have expressed differently, but he endorses this as a reasonable representation of his views. This is in... (read more)
Meta-comment:
It's difficult to tell, having spent some time (but not a very large amount of time) following this back-and-forth, whether much progress is being made in furthering Eliezer's and Paul's understanding of each other's positions and arguments. My impression is that there has been some progress, mostly from Paul vetoing Eliezer's interpretations of Paul's agenda, but by nature this is a slow kind of progress - there are likely many more substantially incorrect interpretations than substantially correct ones, so even... (read more)
Eliezer thinks that in the alternate world where this is true, GANs pretty much worked the first time they were tried
Note that GANs did in fact pretty much work the first time they were tried, at least according to Ian's telling, in the strong sense that he had them working on the same night that he came up with the idea over drinks. (That wasn't a journalist editorializing, that's the story as he tells it.)
GANs seem to be unstable in just about the ways you'd expect them to be unstable on paper, we don't have to posit any magical... (read more)
Eliezer thinks that if you have any optimization powerful enough to reproduce humanlike cognition inside a detailed boundary by looking at a human-labeled dataset trying to outline the boundary, the thing doing the optimization is powerful enough that we cannot assume its neutrality the way we can assume the neutrality of gradient descent.
To clarify: it's not that you think that gradient descent can't in fact find human-level cognition by trial and error, it's that you think "the neutrality of gradient descent" is an artifact of ... (read more)
But you will get the kind of weird squiggles in the learned function that adversarial examples expose in current nets - special inputs that weren't in the training distribution, but look like typical members of the training distribution from the perspective of the training distribution itself, will break what we think is the intended labeling from outside the system.
I don't really know what you mean by "squiggles." If you take data that is off the distribution, then your model can perform poorly. This can be a problem if your distribut... (read more)
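As a concrete illustration of how such off-distribution "squiggles" get exposed in current nets, here is a hedged sketch of the standard fast-gradient-sign attack; the model and data below are throwaway placeholders, not anything from this discussion.

```python
import torch
import torch.nn as nn

def fgsm_example(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                 eps: float = 0.03) -> torch.Tensor:
    """Nudge the input in the direction that most increases the loss, producing
    a point that looks like a typical training example but gets labeled
    differently by the learned function."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()

# Purely illustrative usage with a fake model and a fake "image":
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)
y = torch.tensor([3])
x_adv = fgsm_example(model, x, y)
print((x_adv - x).abs().max())  # perturbation stays within eps (up to clamping)
```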
Eliezer thinks that while corrigibility probably has a core which is of lower algorithmic complexity than all of human value, this core is liable to be very hard to find or reproduce by supervised learning of human-labeled data, because deference is an unusually anti-natural shape for cognition, in a way that a simple utility function would not be an anti-natural shape for cognition. Utility functions have multiple fixpoints requiring the infusion of non-environmental data, our externally desired choice of utility function would be non-natural in that sens... (read more)
Voting in elections is a wonderful example of logical decision theory in the wild. The chance that you are genuinely logically correlated to a random trade partner is probably small, in cases where you don't have mutual knowledge of LDT; leaving altruism and reputation as sustaining reasons for cooperation. With millions of voters, the chance that you are correlated to thousands of them is much better.
Or perhaps you'd prefer to believe the dictate of Causal Decision Theory that if an election is won by 3 votes, nobody's vote influenced it,... (read more)
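As a hedged toy calculation (my own model and numbers, not either commenter's) of why correlation with other voters matters: under a simple binomial model, a single vote is pivotal only on an exact tie, while a decision logically correlated with a bloc of voters is pivotal whenever the margin falls inside the bloc, so the expected impact scales roughly linearly with the number of correlated voters.

```python
from scipy.stats import binom

# Toy model: N other voters each independently vote for candidate A with p = 0.5.
# Real electorates aren't i.i.d.; the point is only the scaling with bloc size.
N = 1_000_000
p = 0.5

# A single extra vote for A changes the outcome only if the others tie exactly.
p_single_pivotal = binom.pmf(N // 2, N, p)  # ~8e-4 in this toy model

# If your decision is logically correlated with 100 other voters, a bloc of
# b = 101 votes moves together; it flips the outcome on every tally where A
# loses (or ties) without the bloc but wins with it -- about b/2 tallies.
b = 101
pivotal_tallies = [m for m in range(N // 2 - b, N // 2 + 1)
                   if m <= N - m and m + b > N - m]
p_bloc_pivotal = sum(binom.pmf(m, N, p) for m in pivotal_tallies)

print(p_single_pivotal)
print(p_bloc_pivotal / p_single_pivotal)  # ~50x: roughly linear in bloc size
```

None of this models turnout, correlation strength, or the CDT/LDT dispute itself; it only makes the scaling visible.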
Savage's Theorem isn't going to convince anyone who doesn't start out believing that preference ought to be a total preorder. Coherence theorems are talking to anyone who starts out believing that they'd rather have more apples.
I can't make sense of this comment.
If one is talking about one's preferences over number of apples, then the statement that it is a total preorder, is a weaker statement than the statement that more is better. (Also, you know, real number assumptions all over the place.) If one is talking about preferences not just over number of apples but in general, then even so it seems to me that the complete class theorem seems to be making some very strong assumptions, much stronger than the assumption of a total preorder! (Again, look at all those real number assum... (read more)
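To put the two claims side by side (my formalization, not either commenter's):

```latex
\textbf{Total preorder: } \succeq \text{ is transitive and complete:}\quad
(x \succeq y \,\wedge\, y \succeq z) \Rightarrow x \succeq z,
\qquad \forall x, y:\ x \succeq y \,\vee\, y \succeq x.

\textbf{``More apples is better'': }\quad n > m \;\Rightarrow\; x_n \succ x_m
\quad \text{(outcomes } x_n \text{ indexed by apple count } n\text{).}
```

The second yields a total preorder on these outcomes but not conversely ("fewer apples is better" is also a total preorder), which is the sense in which the preorder assumption is the strictly weaker one.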
There will be a single very cold day occasionally regardless of whether global warming is true or false. Anyone who knows the phrase "modus tollens" ought to know that. That said, if two unenlightened ones are arguing back and forth in all sincerity by telling each other about the hot versus cold days they remember, neither is being dishonest, but both are making invalid arguments. But this is not the scenario offered in the original, which concerns somebody who does possess the mental resources to know better, but is tempted to rationalize in... (read more)
This is pretty low on the list of opportunities I'd kick myself for missing. A longer reply is here: https://www.facebook.com/yudkowsky/posts/10156147605134228
The vision for Arbital would have provided incentives to write content, but those features were not implemented before the project ran out of time. I did not feel that at any point the versions of Arbital that were in fact implemented were at a state where I predicted they'd attract lots of users, and said so.
Interesting, any chance you could describe it?
I'm very curious how you solved the incentives problem; would you mind detailing it? Alexei mentioned that you already did the write-up, so even a link to your rough draft would satisfy me.
Unless I'm missing something, the trouble with this is that, absent a leverage penalty, all of the reasons you've listed for not having a muggable decision algorithm... drumroll... center on the real world, which, absent a leverage penalty, is vastly outweighed by tiny probabilities of googolplexes and ackermann numbers of utilons. If you don't already consider the Mugger's claim to be vastly improbable, then all the considerations of "But if I logically decide to let myself be mugged that retrologically increases his probability ... (read more)
Zvi's probably right.
Additional prediction: it was more fun to write this than the book, and the writing involved an initial long contiguous chunk.
(Where it's coming from: my enjoyment of reading the above was mostly a little bit of thrill, of the kind I get from watching someone break rules that I always wished I could break. If that makes sense.)
Sure. Measure a human's input and output. Play back the recording. Or did you mean across all possible cases? In the latter case see http://lesswrong.com/lw/pa/gazp_vs_glut/
Ed Fredkin has since sent me a personal email:
By the way, the story about the two pictures of a field, with and without army tanks in the picture, comes from me. I attended a meeting in Los Angeles, about half a century ago where someone gave a paper showing how a random net could be trained to detect the tanks in the picture. I was in the audience. At the end of the talk I stood up and made the comment that it was obvious that the picture with the tanks was made on a sunny day while the other picture (of the same field without the tanks) was made on... (read more)
Moving to Discussion.
Please don't.
I assume the point of the toy model is to explore corrigibility or other mechanisms that are supposed to kick in after A and B end up not perfectly value-aligned, or maybe just to show an example of why a non-value-aligning solution for A controlling B might not work, or maybe specifically to exhibit a case of a not-perfectly-value-aligned agent manipulating its controller.
When I consider this as a potential way to pose an open problem, the main thing that jumps out at me as being missing is something that doesn't allow A to model all of B's possible actions concretely. The problem is trivial if A can fully model B, precompute B's actions, and precompute the consequences of those actions.
The levels of 'reason for concern about AI safety' might ascend something like this:
I recall originally reading something about a measure of exercise-linked gene expression and I'm pretty sure it wasn't that New Scientist article, but regardless, it's plausible that some mismemory occurred and this more detailed search screens off my memory either way. 20% of the population being immune to exercise seems to match real-world experience a bit better than 40% so far as my own eye can see - I eyeball-feel more like a 20% minority than a 40% minority, if that makes sense. I have revised my beliefs to match your statements. Thank you for tracking that down!
"Does somebody being right about X increase your confidence in their ability to earn excess returns on a liquid equity market?" has to be the worst possible question to ask about whether being right in one thing should increase your confidence about them being right elsewhere. Liquid markets are some of the hardest things in the entire world to outguess! Being right about MWI is enormously being easier than being right about what Microsoft stock will do relative to the rest of S&P 500 over the next 6 months.
There's a gotcha to the gotcha wh... (read more)
You're confusing subjective probability and objective quantum measure. If you flip a quantum coin, half your measure goes to worlds where it comes up heads and half goes to where it comes up tails. This is an objective fact, and we know it solidly. If you don't know whether cryonics works, you're probably still already localized by your memories and sensory information to either worlds where it works or worlds where it doesn't; all or nothing, even if you're ignorant of which.
can even strip out the part about agents and carry out the reasoning on pure causal nodes; the chance of a randomly selected causal node being in a unique100 position on a causal graph with respect to 3↑↑↑3 other nodes ought to be at most 100/3↑↑↑3 for finite causal graphs.
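Spelled out as I read it (my notation and assumptions): call a position "unique100" if at most 100 nodes of the finite causal graph G share it, and suppose it is defined with respect to at least 3↑↑↑3 other nodes, so that |G| > 3↑↑↑3. Then for a uniformly random node v:

```latex
\Pr_{v \sim \mathrm{Uniform}(G)}\big[\, v \text{ occupies a given unique100 position} \,\big]
\;\le\; \frac{100}{|G|} \;\le\; \frac{100}{3\uparrow\uparrow\uparrow 3}.
```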
Yes, as his post facto argument.
You have not understood correctly regarding Carl. He claimed, in hindsight, that Zuckerberg's potential could've been distinguished in foresight, but he did not do so.
Moved to Discussion.
I don't think you can give me a moment of pleasure that intense without using 3^^^3 worth of atoms on which to run my brain, and I think the leverage penalty still applies then. You definitely can't give me a moment of worthwhile happiness that intense without 3^^^3 units of background computation.
Just jaunt superquantumly to another quantum world instead of superluminally to an unobservable galaxy. What about these two physically impossible counterfactuals is less than perfectly isomorphic? Except for some mere ease of false-to-fact visualization inside a human imagination that finds it easier to track nonexistent imaginary Newtonian billiard balls than existent quantum clouds of amplitude, with the latter case, in reality, covering both unobservable galaxies distant in space and unobservable galaxies distant in phase space.