Some background:

I have followed the writing of Eliezer on AI and AI safety with great interest (and mostly, I agree with his conclusions).

I have done my share of programming.

But, I confess, most of the technical side of AI alignment is beyond my current level of understanding (currently I am reading and trying to understand the sequence on brain-like AGI safety).

I do, however, find the ethical side of AI alignment very interesting.


In 2004, Eliezer Yudkowsky wrote a 38-page paper on Coherent Extrapolated Volition, or CEV. It was an attempt to create a philosophy of Friendliness, to somewhat formalize our understanding of how we would want a Friendly AI to behave.

In calculating CEV, an AI would predict what an idealized version of us would want, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge. This initial dynamic would be used to generate the AI's utility function. 

There are many objections to CEV. I have browsed the posts tagged CEV, and in particular enjoyed the list of CEV-tropes, a slightly tongue-in-cheek categorization of common speculations (or possibly misconceptions) about CEV.

So I think it is rather uncontroversial to say that we do not understand Friendliness yet. Not enough to actually say what we would want a Friendly AI to do once it is created and becomes a superintelligence.

Or perhaps we do have a decent idea of what we would want it to do, but not how we would formalize that understanding in a way that doesn't result in some perverse instantiation of our ethics (as some people argue CEV would; some versions of CEV, anyway - CEV is underspecified, and there are many possible ways to implement it).

In the above-mentioned paper on CEV, Eliezer Yudkowsky writes the following warning:

Arguing about Friendliness is easy, fun, and distracting. Without a technical solution to FAI, it doesn’t matter what the would-be designer of a superintelligence wants; those intentions will be irrelevant to the outcome. Arguing over Friendliness content is planning the Victory Party After The Revolution—not just before winning the battle for Friendly AI, but before there is any prospect of the human species putting up a fight before we go down. The goal is not to put up a good fight, but to win, which is much harder. But right now the question is whether the human species can field a non-pathetic force in defense of six billion lives and futures.

While I can see the merits of this warning, I do have some objections to it, and I think that some part of our effort might be well-spent talking about Friendliness.

Part of the argument, I feel, is that building a GAI that is safe and does "what we want it to do" is orthogonal to "what we want it to do". That we just build a general-purpose super-agent that can fulfill any utility function, then load our utility function into it.

I'm not sure I agree.
After all, Friendliness is not some static function.
As the AI grows, so will its understanding of Friendliness.
That's a rather unusual behavior for a utility function, isn't it? To start with a "seed" that grows (and changes?) with improved understanding of the world and humanity's values.
Perhaps there is some synergy in considering exactly how that would affect our nascent AI.
Perhaps there is some synergy in, from the beginning, considering the exact utility function we would want our AI to have, the exact thing we want it to do, rather than focusing on building an AI that could have all possible utility functions.
Perhaps an improved understanding of Friendliness would improve the rest of our alignment efforts.

Even if it were true that we only need to understand Friendliness at the last moment, before FAI is ready to launch:

We don't know how long it would take to solve the technical problem of AI alignment. But we don't know how long it would take to solve the ethical problem of AI alignment, either. Why do you assume it's an easy problem to solve?

Or perhaps it would take time to convince other humans of the validity of our solution, to, if you will, align humans and stakeholders of various AI projects, or those who have influence over AI research, to our understanding of Friendliness.

I also have this, perhaps very far-fetched, idea that an improved understanding of Friendliness might be of benefit to humans, even if we completely set aside the possibility of superhuman AI.

After all, if we agree that there is a set of values, a set of behaviors that we would want a superintelligence acting in humanity's best interest to have, why wouldn't I myself choose to hold these values and do these behaviors?
If there is a moral philosophy that we agree is, if not universal, then the best approximation to human-value-universality, why wouldn't humans find it compelling? More compelling, perhaps, than any existing philosophy or value system, if they truly thought about it?
If we would want superhuman minds to be aligned to some optimal implementation of human values, why wouldn't we want human minds to be aligned to the very same values?

(OK, this part was indeed far-fetched and I can imagine many counter-arguments to it. I apologize for getting ahead of myself.)


Nevertheless, I do not suggest that those working on the technical side of AI alignment redirect their efforts to thinking about Friendliness. After all, the technical part of alignment is very difficult and very important.

Personally, as someone who finds the ethical side of AI alignment far more compelling, I (as of now, anyway, before I have received feedback on this post) intend to attempt to write a series of posts further exploring the concept of Friendliness.

Epistemically, this post is a shot in the dark. Am I confused? Am I wasting my time while I should be doing something completely different? I welcome you to correct my understanding, or to offer counterpoints.

6 comments

I lean in a similar direction.

My guess is that Friendliness and the first-person perspective are fundamentally entangled. The launching of FAI wouldn't seem like some outside process booting up and making the world mysteriously good. It'd feel like your own intelligence, compassion, and awareness expanding, and like you are solving the challenges of the world as you joyfully accept full responsibility for all of existence.

I also suspect that the tendency to avoid or merely footnote this point is a major cause of technical hurdles. There's a reason self-reference (a la the Löbstacle) keeps showing up: That's the math reflecting the role of consciousness. Or said the other way, consciousness is the evolved solution to problems like the Löbstacle.

These are guesses & intuitions on my part though. I'm not saying this from a place of technical expertise.



I guess I would put it this way: I don't see a fundamental difference between a human intelligence executing an aligned-version-of-human-ethics and a machine intelligence executing the same aligned-version-of-human-ethics, even though there may be significant differences in actual implementation.


After all, if we agree that there is a set of values, a set of behaviors that we would want a superintelligence acting in humanity's best interest to have, why wouldn't I myself choose to hold these values and do these behaviors?

We have finite computational power, and time costs are not free.

Taking 10 decades to make every decision is itself suboptimal. If we agree that yes, this is what we would do given sufficient resources, but we don't have them, so we fall back on heuristics that are themselves suboptimal, but less suboptimal than the time cost of making the optimal decision...


I agree.

The unspoken assumption here is that what is limiting us is not (just) the computation limits of our brains, but our philosophical understanding of humanity's ethics.

That there is (or could be, anyway) a version of human-optimal-ethics that my brain could execute, and that would outperform existing heuristics.

But the hypothesis that the current heuristics we have are already somewhat-optimal for our biological brains (or at least not worth the effort trying to improve) is perfectly reasonable.

I fully agree here. This is a very valuable post. 

After all, if we agree that there is a set of values, a set of behaviors that we would want a superintelligence acting in humanity's best interest to have, why wouldn't I myself choose to hold these values and do these behaviors?

I know Jordan Peterson is quite the controversial figure, but that's some core advice of his. Aim for the highest, the best you could possibly aim for - what else is there to do? We're bounded by death, you've got nothing to lose, and everything to gain - why not aim for the highest?

What’s quite interesting is that, if you do what it is that you’re called upon to do—which is to lift your eyes up above the mundane, daily, selfish, impulsive issues that might upset you—and you attempt to enter into a contractual relationship with that which you might hold in the highest regard, whatever that might be—to aim high, and to make that important above all else in your life—that fortifies you against the vicissitudes of existence, like nothing else can. I truly believe that’s the most practical advice that you could possibly receive.

I sincerely believe there is nothing more worthwhile for us humans to do than that: aim for the best, for ourselves, for our families, for our communities, for the world, in the now, in the short term, and in the long term. It seems... obvious? And if we truly work that out and act on it, wouldn't that help convince an AGI to do the same? 

(You might be interested in this recent post of mine)


Having read your post, I have disagreements with your expectations about AGI.

But it doesn't matter. It seems that we agree that "human alignment", and self-alignment to a better version of human ethics, is a very worthwhile task. (And so is civilizational alignment, even though I don't hold much hope for it yet.)

To put it this way, if we align our civilization, we win. Because, once aligned, we wouldn't build AGI unless we were absolutely sure it would be safe and aligned with our values.

My hope is that we can, perhaps, at least align humans who are directly involved with building systems that might become AGI, with our principles regarding AI safety.