Morality is Scary

The idea that the AI should defer to the "most recent" human values is an instance of the sort of trap I'm worried about. I suspect we could be led down an incremental path of small value changes in practically any direction, which could terminate in our willing and eager self-extinction or permanent wireheading. But how much tyranny should present-humanity be allowed to have over the choices of future humanity? 

I don't think "none" is as wise an answer as it might sound at first. To answer "none" implies a kind of moral relativism that none of us actually hold, and which would make us merely the authors of a process that ultimately destroys everything we currently value.

But the answer of "complete control of the future by the present" also seems obviously wrong, because we will learn about entirely new things worth caring about that we can't predict now, and sometimes it is natural to change what we like.

More fundamentally, I think the assumption that there exist "human terminal goals" presumes too much. Specifically, it's an assumption that presumes that our desires, in anticipation and in retrospect, are destined to fundamentally and predictably cohere. I would bet money that this isn't the case.

Morality is Scary

Yes, there is a broad class of wireheading solutions that we would want to avoid, and it is not clear how to specify a rule that distinguishes them from outcomes that we would want. When I was a small child I was certain that I would never want to move away from home. Then I grew up, changed my mind, and moved away from home. It is important that I was able to do something which a past version of myself would be horrified by. But this does not imply that there should be a general rule allowing all such changes. Understanding which changes to your utility function are good or bad is, as far as decision theory is concerned, undefined.

Morality is Scary

I am also scared of futures where "alignment is solved" under the current prevailing usage of "human values."

Humans want things that we won't end up liking, and prefer things that we will regret getting relative to other options that we previously dispreferred. We are remarkably ignorant of what we will, in retrospect, end up having liked, even over short timescales. Over longer timescales, we learn to like new things that we couldn't have predicted a priori, meaning that even our earnest and thoughtfully-considered best guess of our preferences in advance will predictably be a mismatch for what we would have preferred in retrospect. 

And this is not some kind of bug; it is centrally important to what it is to be a person. "Growing up" requires a constant process of learning that you don't actually like certain things you used to like, and that you now suddenly like new things. This truth ranges over all arenas of existence, from learning to like black coffee to realizing you want to have children.

I am personally partial to the idea of something like Coherent Extrapolated Volition. But it seems suspicious that I've never seen anybody on LW sketch out how a decision theory ought to behave in situations where the agent's utility function will have predictably changed by the time the outcome arrives, so that the "best choice" is actually a currently dispreferred choice. (In other words, situations where the "best choice" in retrospect and the "best choice" in expectation do not match.) It seems dangerous to throw ourselves into a future where "best-in-retrospect" wins every time, because I can imagine many alterations to my utility function that I definitely wouldn't want to accept in advance, but which would make me "happier" in the end. And it also seems awful to accept a process by which "best-in-expectation" wins every time, because I think a likely result is that we are frozen into whatever our current utility function looks like forever. And I do not see any principled and philosophically obvious method by which we ought to arbitrate between in-advance and in-retrospect preferences.

Another way of saying the above is that it seems that "wanting" and "liking" ought to cohere but how they ought to cohere seems tricky to define without baking in some question-begging assumptions.
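To make the mismatch concrete, here is a toy sketch. Everything in it is an assumption made up for illustration: the two options, the numbers, and especially the `weight_on_later` parameter, which stands in for exactly the arbitration rule that nobody knows how to choose in a principled way.

```python
# Toy model only: the options, utilities, and blending rule are hypothetical.

OPTIONS = ["stay_home", "move_away"]

# What the agent prefers now, before the outcome arrives.
UTILITY_NOW = {"stay_home": 10, "move_away": 2}
# What the agent will predictably prefer once the outcome arrives.
UTILITY_LATER = {"stay_home": 3, "move_away": 9}


def best_option(options, utility):
    """Return the option scoring highest under a single utility table."""
    return max(options, key=lambda option: utility[option])


def blended_choice(options, weight_on_later):
    """Score options by a weighted mix of in-advance and in-retrospect utility.

    weight_on_later = 0.0 means "best-in-expectation always wins";
    weight_on_later = 1.0 means "best-in-retrospect always wins".
    """
    def score(option):
        return ((1 - weight_on_later) * UTILITY_NOW[option]
                + weight_on_later * UTILITY_LATER[option])
    return max(options, key=score)


best_in_expectation = best_option(OPTIONS, UTILITY_NOW)
best_in_retrospect = best_option(OPTIONS, UTILITY_LATER)

if __name__ == "__main__":
    print("in advance:   ", best_in_expectation)
    print("in retrospect:", best_in_retrospect)
    for w in (0.0, 0.5, 0.6, 1.0):
        print(f"weight_on_later={w}:", blended_choice(OPTIONS, w))
```

The choice flips somewhere between a weight of 0.5 and 0.6, and nothing inside the model says which weight is right. That is the gap: the arbitration has to come from outside the decision procedure.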

Why do you believe AI alignment is possible?

As I see it there are mainly two hard questions in alignment. 

One is, how do you map human preferences in such a way that you can ask a machine to satisfy them? I don't see any reason why this would be impossible for a superintelligent being to figure out. It is somewhat similar (though obviously not identical) to asking a human to figure out how to make fish happy.

The second is, how do you get a sufficiently intelligent machine to do anything whatsoever without doing a lot of terrible stuff you didn't want as a side effect? As Yudkowsky says:

The way I sometimes put it is that I think that almost all of the difficulty of the alignment problem is contained in aligning an AI on the task, “Make two strawberries identical down to the cellular (but not molecular) level.” Where I give this particular task because it is difficult enough to force the AI to invent new technology. It has to invent its own biotechnology, “Make two identical strawberries down to the cellular level.” It has to be quite sophisticated biotechnology, but at the same time, very clearly something that’s physically possible.

This does not sound like a deep moral question. It does not sound like a trolley problem. It does not sound like it gets into deep issues of human flourishing. But I think that most of the difficulty is already contained in, “Put two identical strawberries on a plate without destroying the whole damned universe.” There’s already this whole list of ways that it is more convenient to build the technology for the strawberries if you build your own superintelligences in the environment, and you prevent yourself from being shut down, or you build giant fortresses around the strawberries, to drive the probability to as close to 1 as possible that the strawberries got on the plate.

When I consider whether this implied desideratum is even possible, I just note that I and many others continue to not inject heroin. In fact, I almost never seem to act in ways that look much like driving the probability of any particular outcome as close to 1 as possible. So clearly it's possible to embed some kind of motivational wiring into an intelligent being, such that the intelligent being achieves all sorts of interesting things without doing too many terrible things as a side effect. If I had to guess, I would say that the way we go about this is something like: wanting a bunch of different, largely incommensurable things at the same time, some of which are very abstract, some of which are mutually contradictory, and somehow all these different preferences keep the whole system mostly in balance most of the time. In other words, it's inelegant and messy and not obvious how you would translate it into code, but it is there, and it seems to basically work. Or, at least, I think it works as well as we can expect, and serves as a limiting case.

What would we do if alignment were futile?

After seeing a number of rather gloomy posts on the site in the last few days, I feel a need to point out that problems that we don't currently know how to solve always look impossible. A smart guy once pointed out how silly it was that Lord Kelvin claimed "The influence of animal or vegetable life on matter is infinitely beyond the range of any scientific inquiry hitherto entered on." Kelvin just didn't know how to do it. That's fine. Deciding it's a Hard Problem just sort of throws up mental blocks to finding potential obvious solutions.

Maybe alignment will seem really easy in retrospect. Maybe it's the sort of thing that requires only two small insights that we don't currently have. Maybe we already have all the insights we need and somebody just needs to connect them together in a non-obvious way. Maybe somebody has already had the key idea, and just thought to themselves, no, it can't be that simple! (I actually sort of viscerally suspect that the lynchpin of alignment will turn out to be something really dumb and easy that we've simply overlooked, and not something like Special Relativity.) Everything seems hard in advance, and we've spent far more effort as a civilization studying asphalt than we have alignment. We've tried almost nothing so far. 

In the same way that we have an existence-proof of AGI (humans existing) we also have a highly suggestive example of something that looks a lot like alignment (humans existing and often choosing not to do heroin), except probably not robust to infinite capability increase, blah blah.

The "probabilistic mainline path" always looks really grim when success depends on innovations and inventions you don't currently know how to do. Nobody knows what probability to put on obtaining such innovations in advance. If you asked me ten years ago I would have put the odds of SpaceX Starship existing at like 2%, probably even after thinking really hard about it. 

Speaking of Stag Hunts

One thing we are working on in the Guild of the ROSE is a sort of accreditation or ranking system, which we informally call the "belt system" because it has many but not all of the right connotations. It is possible to have expertise in how to think better, and it's desirable to have a way of recognizing people who demonstrate their expertise, for a variety of reasons. Currently the ranking system is planned to be partly based on performance within the courses we are providing, and partly based on objective tests of skill ("belt tests"). But we are still experimenting with various ideas and haven't rolled it out.

Reality-Revealing and Reality-Masking Puzzles

The specific example I would go to here would be when I was vegetarian for a couple of years and eventually gave up on it and went back to eating meat. I basically felt like I had given up on what was "right" and went back to doing "evil". But remaining vegetarian was increasingly miserable for me, so eventually I quit trying. 

I think it's actually more fair to call this a conflict between two different System-1 subagents. One of the definitive aspects of System-1 is that it doesn't tend to be coherent. Part of System-1 wanted to not feel bad about killing animals, and a different part of System-1 wanted to eat bacon and not feel low-energy all the time. So here there was a very evident clash between two competing System-1 felt needs, and the one that System-2 disapproved of ended up being the winner, despite months and months of consistent badgering by System-2.

I think you see this a lot especially in younger people who think that they can derive their "values" logically and then become happier by pursuing their logically derived values. It takes a bit of age and experience to just empirically observe what your values appear to be based on what you actually end up doing and enjoying.

Self-Integrity and the Drowning Child

I thought I agreed but upon rereading your comment I am no longer sure. As you say, the notion of a utility function implies a consistent mapping between world states and utility valuations, which is something that humans do not do in practice, and cannot do even in principle because of computational limits.

But I am not sure I follow the very last bit. Surely the best map of the dath ilan parable is just a matrix, or table, describing all the possible outcomes, with degrees of distinction provided to whatever level of detail the subject considers relevant. This, I think, is the most practical and useful amount of compression. Compress further, into a “utility function”, and you now have the equivalent of a street map that includes only topology but without street names, if you’ll forgive the metaphor.

Further, if we aren’t at any point multiplying utilities by probabilities in this thought experiment, one has to ask why you would even want utilities in the first place, rather than simply ranking the outcomes in preference order and picking the best one.
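A small sketch of why this matters, under made-up assumptions (the outcomes `A`, `B`, `C`, the two utility tables, and the lotteries are all hypothetical). When outcomes are certain, a bare preference ranking is enough to choose. Numeric utilities only start doing work once probabilities enter, because two utility tables with the same ranking can disagree about a gamble.

```python
# Toy illustration only: outcomes, utilities, and lotteries are hypothetical.

RANKING = ["A", "B", "C"]  # preference order, best first

# Two utility tables that both agree with RANKING.
U1 = {"A": 10, "B": 6, "C": 0}
U2 = {"A": 10, "B": 4, "C": 0}

SURE_B = {"B": 1.0}            # outcome B with certainty
GAMBLE = {"A": 0.5, "C": 0.5}  # coin flip between the best and worst outcome


def pick_by_ranking(ranking, available):
    """Choose among certain outcomes using only the preference order."""
    return min(available, key=ranking.index)


def expected_utility(lottery, utility):
    """Probability-weighted sum of utilities over a lottery's outcomes."""
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())


def prefers_gamble(utility):
    """True if this utility table rates the gamble above the sure outcome."""
    return expected_utility(GAMBLE, utility) > expected_utility(SURE_B, utility)


def ranking_from(utility):
    """Recover the preference order implied by a utility table."""
    return sorted(utility, key=utility.get, reverse=True)


if __name__ == "__main__":
    print(pick_by_ranking(RANKING, ["C", "B"]))  # no numbers needed
    print(ranking_from(U1) == ranking_from(U2))  # same ordering
    print(prefers_gamble(U1), prefers_gamble(U2))  # different gamble choices
```

Both tables rank the outcomes identically, yet only the second one takes the gamble. So if the thought experiment never involves uncertainty, the numbers add nothing beyond the ranking.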

Self-Integrity and the Drowning Child

Perhaps the parable could have been circumvented entirely by never teaching the children that such a thing as a “utility function” existed in the first place. I was mildly surprised to learn that the dath ilani used the concept at all, rather than speaking of preferences directly. There are very few conversations about relative preference that are improved by introducing the phrase “utility function.”

Shoulder Advisors 101

This phenomenon is also why we have the term "role model." Successful examples of people similar to us are extremely valuable, and it is in fact very difficult to succeed without such examples.
