Can we simulate human evolution to create a somewhat aligned AGI?

[-][anonymous]4y100

I personally place much higher likelihood on the thesis that recovering basic cooperative values (where an ASI is nice to humans and gives us some of what we want) requires way, way less than simulating "evolution": most human values seem like they may be emergent behaviors in repeated positive-sum multi-agent games. It seems like, at least to prevent treacherous turns, we mostly need (1) a bias towards multi-agent positive-sum solutions, (2) a dislike of defection, (3) the "golden rule" of treating other agents as you would like to be treated, and (4) respect for (and gaining utility from the utility of) lesser life-forms/animals.
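As a toy illustration of what "dislike of defection" looks like in a repeated game, here is a minimal sketch of an iterated prisoner's dilemma. Everything in it is an assumption made for illustration: the payoff numbers are the conventional textbook values, and `tit_for_tat`, `always_defect`, and `play` are hypothetical names, not part of any proposal in the post. It involves no learning, so it only shows the incentive structure, not how such behavior would emerge in RL agents.

```python
# Toy iterated prisoner's dilemma. Payoffs are conventional textbook
# values chosen for illustration, not taken from the post.
PAYOFF = {  # (my move, their move) -> my payoff
    ("C", "C"): 3,  # mutual cooperation
    ("C", "D"): 0,  # I cooperate, they defect
    ("D", "C"): 5,  # I defect, they cooperate
    ("D", "D"): 1,  # mutual defection
}


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"


def always_defect(opponent_history):
    """Defect unconditionally."""
    return "D"


def play(strategy_a, strategy_b, rounds=100):
    """Play a repeated game and return the total payoff of each player."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each strategy only sees the other player's past moves.
        move_a = strategy_a(hist_b)
        move_b = strategy_b(hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play(tit_for_tat, tit_for_tat))      # (300, 300)
    print(play(tit_for_tat, always_defect))    # (99, 104)
    print(play(always_defect, always_defect))  # (100, 100)
```

Under these assumed payoffs, the defector narrowly beats tit-for-tat head-to-head, but a pair of reciprocators earns three times what a pair of defectors does, which is the sense in which the repeated game is positive-sum.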

The primary outlier is "respect for lesser life-forms", which I wouldn't assume would emerge from standard cooperative multi-agent games. That seems like it might be elicitable in a repeated game played behind a Rawlsian veil of ignorance, where in each round the agent does not know in advance whether it will be an animal or a human.
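To make the veil-of-ignorance idea concrete, here is a minimal sketch under heavily simplified assumptions. An agent picks a single "care level" (how much the human role gives up to benefit the animal role) before learning which role it gets. The payoff functions, the candidate care levels, and all names (`human_payoff`, `animal_payoff`, `best_care`) are made up for illustration.

```python
# Toy veil-of-ignorance game. All payoff numbers are invented for
# illustration; they are not from the post or from any real environment.
CARE_LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]


def human_payoff(care):
    """Being in the human role: caring for animals costs something."""
    return 10 - 4 * care


def animal_payoff(care):
    """Being in the animal role: payoff depends entirely on human care."""
    return 8 * care


def expected_payoff(care, p_animal):
    """Expected payoff when the role is drawn after the care level is set."""
    return p_animal * animal_payoff(care) + (1 - p_animal) * human_payoff(care)


def best_care(p_animal):
    """Care level that maximizes expected payoff for a given role probability."""
    return max(CARE_LEVELS, key=lambda care: expected_payoff(care, p_animal))


if __name__ == "__main__":
    print(best_care(0.0))  # agent knows it will be human
    print(best_care(0.5))  # agent is equally likely to be either
```

With these assumed numbers, an agent that knows it will always be human picks zero care, while an agent that might be the animal half the time picks maximal care. Whether a learned policy would keep that disposition once the veil is lifted is exactly the open question.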

Obviously, it'd also be good if we could transmit lots of other concepts like beauty and novelty intact to an ASI. Thankfully, people have already thought about a lot of this; there's a whole field of "evolutionary psychology" which can be thought of as people coming up with hypotheses for the conditions of multi-agent RL environments under which different observed human/non-human behavioral patterns may emerge. We don't know whether they're right in practice (they primarily rely on observational evidence) but these are empirically-testable hypotheses once you have reasonably-general RL agents.

Note that a few extremely challenging concepts do remain, like "beauty". I'm personally very skeptical that even a good simulation of all of evolution would reliably end up with the human concept of beauty - do we know if animals have any related concepts? But we may still get substantial leverage just from an ASI having sympathy for us and knowing we care about beauty.

Concretely, it'll be useful to see people continuing to try and elicit as many such behaviors as possible in multi-agent RL, and progress on that will give us a pretty good sense of how good an alignment heuristic this would be. It could be very valuable to write out a "theory of impact" for this agenda, outlining exactly what type of success indicators would be valuable to alignment and what the components of porting a good solution would be.

[-]Steven Byrnes4y*50

If it helps, I have some discussion on this topic here (Section 8.3 and especially 8.3.3.1).

This is a nice post and I was mostly nodding along.

I expect it’s moot because of the training competitiveness issue.

I also happen to believe that this evolutionary scenario would only count as “success” if we have a very very low bar for what constitutes successful alignment (e.g. “not worse than a hot-tempered psycho human who grew up on an alien planet”), and if we have that low a bar for “success”, then I’m actually pretty optimistic about our prospects for non-evolutionary alignment “success”.

I also think I'm less optimistic than you about the simulated evolved aliens creating unaligned AGIs (and/or blowing each other to smithereens in other ways). Your Section 5 arguments are not convincing to me because (1) this could happen after they break out of the simulation into the real world, (2) competition could favor AGIs that lack social instincts and other things that make for a good life worth living, and if so, it doesn't matter whether they build such AGIs from scratch or self-modify into them. Or something like that, I guess.

[-]Donald Hobson4y40

I think that to pull this off well, you would need to match pretty closely to reality.

Genome-based AI (start with the human genome and simulate it growing into a person) sounds easier.

Once you replace evolution with SGD, replace DNA and proteins with something easier to simulate, replace learning memories with downloading them, and replace the ancestral environment with some video game, the approximation is so crude that you are basically training a neural net to do things that seem nice, and hoping for the best.

If you could rerun evolution starting from chimps, you may well get creatures with fairly similar values. If you rerun evolution and then post-select on various pieces of text, you get very similar values.

If you start from the first RNA, getting near human values is hard.

Then consider that human values can vary by culture a fair bit.

Consider the question of whether or not simulations of human minds are morally important.

Answer yes and you get endless virtual utopia. The person who answered no sees humanity wiped out and the universe filled with worthless computers.

Answer no and you get a smaller and less fun real utopia, plus people simulating whatever they feel like. Quite possibly the vast majority of human minds live unpleasant lives as characters in violence filled video games.

Now consider that you will probably find both positions on LessWrong. This isn't a cultural difference between us and ancient Mongols. This is a cultural difference between people who are very culturally similar.

Now you can say that one side is right. You can optimize some combination and get a world that both sides like.

On a sufficiently basic level, most humans value tasty food (though some people will refuse it for all sorts of reasons).

Far from the day-to-day world, human values are unconstrained by survival constraints. (Evolution so far has not selected for any particular view on whether simulations are morally important.)

There may be a single truth that all humans are converging towards. But maybe not.

If you just simulate the whole world, and put an "exit simulation" button that only an ASI could press, then these aliens have no better shot at alignment than us.

If you zoom in on the world, picking out the alien equivalent of MIRI, and giving them extra help over the careless aliens creating UFAI, then you need to locate the alien MIRI, when the aliens speak an alien language. They still might screw up anyway.

[-]Yitz4y40

I’m honestly really confused why more effort isn’t being put into contingency alignment plans; it seems quite likely to me that partial alignment should be easier and faster to develop than full alignment, and it isn’t inevitable that alignment will be an all-or-nothing endeavor. Thanks for the thought-provoking analysis!

[-]Andrew Vlahos4y30

No. Humans do major harm to each other, often even when they are trying to help. And that's if things go right; an AI based on human behavior has a high chance of causing harm deliberately.

[-]TAG4y10

The way you have explained this idea assumes a certain model of ethics/friendliness -- that ethics is human value, and all human value indifferently. Other models make the problem a lot simpler.
It's started already. Current technologies already share an ecosystem with humans and are being selected for some kind of friendliness.
It would probably be stymied by rapid takeoff, but so would all the alternatives... rapid takeoff towards ASI is the hard problem.

How to create a successor AI by simulating evolution?

Will this actually work?

1. Aren't human values fragile?

2. Will this be more tractable than alignment?

3. Will the training process approximate alien CEV?

4. Will this be competitive?

5. Won't the simulated aliens create unaligned AGI?

6. Won't implementing this plan require dangerous capabilities?