Putting together two matching ideas I've encountered.

The Question:


Considering all humans dead, do you still think it's going to be the boring paperclip kind of AGI to eat all reachable resources? Any chance that inscrutable large float vectors and lightspeed coordination difficulties will spawn godshatter AGI shards that we might find amusing or cool in some way? (Value is fragile notwithstanding)

Eliezer Yudkowsky:

Yep and nope respectively. That's not how anything works.

Yudkowsky's confidence here puzzled me at first. Why be so sure that powerful ML training won't even produce something interesting from the human perspective?

Admittedly, the humane utility function is very particular and not one we're likely to find easily. And a miss in modeling it along plenty of value dimensions would mean a future optimized into a shape we see no value in. But what rules out a near miss in modeling the humane utility function, a miss along other value dimensions, that loses some but not all of what humanity cares about? A near miss would entail some significant loss of value and interestingness from our perspective -- but it'd still be somewhat interesting.

The Answer:

Scott Alexander:

The base optimizer is usually something stupid that doesn’t “know” in any meaningful sense that it has an objective - eg evolution, or gradient descent. The first thing it hits upon which does a halfway decent job of optimizing its target will serve as a mesa-optimizer objective. There’s no good reason this should be the real objective. In the human case, it was “a feeling of friction on the genitals”, which is exactly the kind of thing reptiles and chimps and australopithecines can understand. Evolution couldn’t have lucked into giving its mesa-optimizers the real objective (“increase the relative frequency of your alleles in the next generation”) because a reptile or even an australopithecine is millennia away from understanding what an “allele” is.

Even a "near miss" model is impossibly tiny in simplicity-weighted model space. A search algorithm with a simplicity bias isn't going to stumble all the way over to that particular model when so many other far-simpler models will perform just as well.
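
To make the "impossibly tiny" claim concrete, here is a minimal toy sketch. It assumes a prior that gives a model with a k-bit description an unnormalized weight of 2^-k, which is one common way of formalizing a simplicity bias and not necessarily what real ML training implements. The description lengths below (50 bits for a simple proxy objective, 1000 bits for a near-miss model of human values) are hypothetical numbers picked only for illustration.

```python
from fractions import Fraction


def simplicity_weight(description_bits: int) -> Fraction:
    """Unnormalized prior weight 2^-k for a model that takes k bits to specify."""
    return Fraction(1, 2 ** description_bits)


def relative_odds(complex_bits: int, simple_bits: int) -> Fraction:
    """Prior weight on the complex model per unit of weight on the simple one."""
    return simplicity_weight(complex_bits) / simplicity_weight(simple_bits)


# Hypothetical description lengths, chosen only for illustration.
PROXY_BITS = 50        # a simple proxy objective that fits the training signal
NEAR_MISS_BITS = 1000  # a model that captures most of what humans value

odds = relative_odds(NEAR_MISS_BITS, PROXY_BITS)
print(f"near-miss : proxy odds = 2^-{NEAR_MISS_BITS - PROXY_BITS}")
```

Under these assumptions, every extra bit needed to specify the near-miss model halves its weight relative to the proxy, so the gap is exponential in the difference in description length.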

But isn't there some chance that ML training will happen to find a near-miss model anyways?

No. That forlorn hope is something you couldn't ever expect given those premises, like, e.g., hoping countertop puddles will spontaneously reassemble into ice cubes, once you know thermodynamics. An agent that wants to maximize its inclusive genetic fitness just wasn't an option for evolution, and having interesting utility functions won't be an option for powerful ML training on an arbitrary task without seriously altering how we do that kind of search to make it so.



The emergence of the meme ecosystem could be considered an "interesting" partial success from evolution's perspective.

I think it's plausible that an unaligned singularity could lead to things we consider interesting, because human values might be more generic than they appear, the apparent complexity an emergent feature resulting from power-seeking and curiosity drives or mesa-optimization. I also think the whole framework of "a singleton with a fixed utility function becomes all-powerful and optimizes that for all eternity" might be wrong, since human values don't seem to work that way.

I think "somewhat interesting" is a very low bar and should be distinguished from near-misses which still preserve a lot of what humanity values.

I don’t agree with this line of reasoning because actual ML systems don’t implement either a simplicity or speed prior. If we assume away inner alignment failures, then current ML systems implement something like a speed-capped simplicity prior.

If we allow inner alignment failures, then things become FAR more complex. A self-perpetuating mesa optimizer will try to influence the distribution of future hypotheses to make sure the system retains the mesa optimizer in question. The system’s “prior” thus becomes path dependent in a way that neither the speed nor simplicity priors capture at all.

A non-inner aligned system is more likely to retain hypotheses that arise earlier in training, since the mesa optimizers implementing those hypotheses can guide the system’s learning process away from directions that would remove the mesa optimizer in question. This seems like the sort of thing that could cause an unaligned AI to nonetheless have a plurality of nonhuman values that it perpetuates into the future.

I don't know. I think the space of things that some humans would find kind of cool is large. 

Like I have the instinctive reaction of "warp drives and dyson spheres, wow cool".

That isn't to say that I place a large amount of utility on that outcome, just that my initial reaction is ooh, cool. 


The chance of a mars rover nearly but not quite hitting mars is large. It was optimized in the direction of mars, and one little mistake can cause a near miss.

For AI, if humans try to make FAI, but make some fairly small mistake, we could well get a near miss. Of course, if we make lots of big mistakes, we don't get that.  

I agree with the first bit! I think a universe of relativistic grabby AGI spheres-of-control is at least slightly cool.

It's possible that the actual post-unaligned-singularity world doesn't look like that, though. Like, "space empires" is a concept that reflects a lot of what we care about. Maybe large-scale instrumentally efficient behavior in our world is really boring, and doesn't end up keeping any of the cool bits of our hard sci-fi's speculative mature technology.

The chance of a mars rover nearly but not quite hitting mars is large. It was optimized in the direction of mars, and one little mistake can cause a near miss.

For AI, if humans try to make FAI, but make some fairly small mistake, we could well get a near miss. Of course, if we make lots of big mistakes, we don't get that.  

I don't think the mars mission analogy goes through given the risk of deceptive alignment.

Optimizing in the direction of a mars landing when there's a risk of a deceptive agent just means everything looks golden and mars-oriented up until you deploy the model … and then the deployed model starts pursuing the unrelated goal it's been waiting to pursue all along. The deployed model's goal is allowed to be completely detached from mars, as long as the model with that unrelated goal was good at playing along once it was instantiated in training.

Well if we screw up that badly with deceptive misalignment, that corresponds to crashing on the launchpad. 

It is reasonably likely that humans will have some technique they use that is intended to minimize deceptive misalignment. Or that gradient descent shapes the goals to something similar to what we want before the AI is smart enough to be deceptive. 

Why did 'evolution (which is not interesting?) end up creating something interesting (us)' while (your premise) 'ML (which is not interesting) will not end up creating something interesting (to us)'?

Edited to add italics, above.

Evolution lacks the foresight that humans have. It seems relatively plausible that if evolution had only a little more foresight, it would have been able to instill its objectives much more robustly. Consider that even with the very stupid objective function we were given (maximize inclusive genetic fitness) many humans still say they want children intrinsically. That's not exactly the same thing as wanting to maximize genetic fitness, but it's pretty close.

Plus, my interpretation of agentofuser's question was that they weren't asking whether unaligned AGI would produce something we'd consider good outright; merely that it would retain some elements of human value. I think this is far more plausible than it seems Eliezer does, for reasons that Paul Christiano talked about here.

Why does the person asking this question care about whether "interesting"-to-humans things happen, in a future where no humans exist to find them interesting?

Because beings like us (in some relevant respects) that outlive us might carry the torch of our values into that future! Don't read too much into the word "interesting" here: I just meant "valuable by our lights in some respect, even if only slightly."

Sure, it sucks if humanity doesn't live to be happy in the distant future, but if some other AGI civilization is happy in that future, I prefer that to an empty paperclip-maximized universe without any happiness.

That all sounds fair. I've seen rationalists claim before that it's better for "interesting" things (in the literal sense) to exist than not, even if nothing sentient is interested by them, so that's why I assumed you meant the same.