It is often said that a partial alignment solution will bring about an S-risk, as your agent cares enough about humans to keep them around, but not enough about them to allow them to flourish. This is usually not treated as a pressing worry, because the thinking is that we are currently very far from even a partial solution, and that the insight gained by developing one will be easily applied to take us into full-solution territory.

The short & simple reasoning for this is that conditional on humans being around, most things your agent could do to your humans are bad.

Shard theory makes a weak claim of alignment by default, and the strong claim that shard theoretic agents will be easily partially aligned to their human overseers (human-like values will get a seat at the shard negotiating table). In this regime, we get partial alignment for free, and must work for the full alignment. Per the above paragraph, this is a worrying world to be in!

In humans, partial altruism leading to devastating consequences is a regular event. Children refuse to kill their dying parents and instead send them to various torturous hospice centers, or else send them to slightly better (but still bad) nursing homes. They do this out of a sense of care for their parents, but not enough of one to prioritize those parents above their other goals, such as work or relaxation.

Out of a sense of fairness, justice, and empathy[1], many very smart people often advocate for net-harmful policies such as the establishment of Communism on one extreme, and rent ceilings on the other.

Out of a sense of community[2], many otherwise caring & helpful people end up complicit in genocides, the establishment of borders between countries, and the enforcement & continuation of harmful social norms.

In humans we see many instances of such misplaced caring, and assuming shard theory is essentially correct except for the part where it says human values will be transferred exactly into our agent, I am skeptical that our shard theoretic agent will correct these flaws. In each instance of the flaw, there is the choice for it to either say 'no, this is a misapplication of the principle I'm trying to follow here', and change the action, or 'actually, I prefer worlds with rent ceilings, and so I guess I don't care so much about net-harm in these circumstances after all', and change the principle.

In practice, humans contend with these decisions all the time, so I don't think it's just a matter of always sticking with the principle.

The hard part lies in establishing & verifying a reflection mechanism which would resolve these tradeoffs in ways I'd like, and which is itself safe against deals struck between it and the other shards to short-circuit it, so that it is not subject to lost-purposes cases like the above. This is the CEV problem (pdf warning).

An argument I anticipate: I want to make a corrigibility-bot, it doesn't need to understand all of human values, it just needs to listen to me when I tell it to go do something in the world. The things I tell it to do are going to be piecemeal and small, so there is a fast[3] feedback loop between its actions and me telling it to do stuff.

My response: I agree this seems like a good thing to aim for. I anticipate someone saying small and slow actions do not make a pivotal act, and another saying this does not ultimately defend you against the agent catastrophically misunderstanding your feedback, or incorporating it in ways you did not intend. I think both are good criticisms if you succeed, but I also do not see a way of making such an agent using current techniques.

Give me a concrete training story for this, and perhaps we can have further discussion. Extending arguments in Diamond Alignment & assuming shard theory is basically correct still does not leave me feeling happy about our prospects. I anticipate you will get an agent which occasionally goes to you for guidance, but mostly has a bunch of shallow ethical compunctions which I expect to either lead to the above S-risks, or become deep ethical considerations that are then corrupted away via inhuman reflective reasoning. It still (mostly) performs the same actions in the same situations, but does so for strange reasons which don't generalize in the ways we expect. It would ruin its suit to save a drowning child, but refuse to let the child die if it was suffering.

  1. Or a drive to signal a sense of fairness, justice, or empathy; a root cause which has more application to the problem when using an RLHF-type approach to alignment.

  2. Or a drive to signal a sense of community.

  3. In the control theory sense. Few things happen between feedback cycles.


such as the establishment of Communism on one extreme, and rent ceilings on the other.

pedantic comment: I don't think these particularly represent either end of the spectrum well.

I don't really like those examples either. I'm curious to hear what alternatives you have in mind, if any.

I think they're fine examples, my pedantry mostly points in the direction of "don't claim it's a spectrum". (but, note this is all kinda unimportant)

I really loved the post! I wish more people took S-risks completely seriously before dismissing them, and you make some really great points. 

In most of your examples, however, it seems the majority of the harm is in an inability to reason about the consequences of our actions, and if humans became smarter and better informed it seems like a lot of this would be ironed out. 

I will say the hospice/euthanasia example really strikes a chord with me, but even there, isn't it more a product of cowardice than a failure of our values?

Isn't this only S-risk in the weak sense of "there's a lot of suffering" - not the strong sense of "literally maximize suffering"? E.g. it seems plausible to me that mistakes like "not letting someone die if they're suffering" still give you a net positive universe.

Also, insofar as shard theory is a good description of humans, would you say random-human-god-emperor is an S-risk? and if so, with what probability?

I'm making the claim that there will be more suffering than eudaimonia.

You need to model the agent as having values other than 'keep everyone alive'. Any resources being spent on making sure people don't die would be diverted to those alternative ends. For example, suppose you got a super-intelligent shard-theoretic agent which 1) if anyone is about to die, prevents them from dying, 2) if it's able to make diamonds, makes diamonds, and 3) if it's able to get more power, gets more power. It notices that it can definitely make more diamonds & get more power by chopping off the arms and legs of humans, and indeed this decreases the chance the humans die since they're less likely to get into fights or jump off bridges! So it makes this plan, the plan goes through, and nobody has arms or legs anymore. No part of it makes decisions on the basis of whether or not the humans will end up happy with the new state of the world.

Moral of the story: The world will likely rapidly change when we get a superintelligence, and unless the superintelligence is sufficiently aligned to us, no decision will be made for reasons we care about. Because most ways the world could be, conditional on humans existing, are bad, we conclude a net-bad world for many billions or trillions of years.

Also, insofar as shard theory is a good description of humans, would you say random-human-god-emperor is an S-risk? and if so, with what probability?

40% you take a random human, make them god, and you get an S-risk. Humans turn consistently awful when you give them a lot of power, and many humans are already pretty awful. I like to think in terms of CEV style stuff, and if you perform the CEV on a random human, I think maybe only 5% of the time do you get an S-risk.

An example I think about a lot is the naturalistic fallacy. There is a lot of horrible suffering that happens in the natural world, and a lot of people seem to be way too comfortable with that. We don't have any really high leverage options right now to do anything about it, but it strikes me as plausible that even if we could do something about it, we wouldn't want to. (perhaps we would even make it worse by populating other planets with life https://www.youtube.com/watch?v=HpcTJW4ur54)

It's not just comfort. Institutions have baked in mechanisms to deflect blame when an event happens that is "natural". So rather than comparing possible outcomes as a result of their actions and always picking the best one, letting nature happen is ok.

Examples: if a patient refuses treatment, letting them die "naturally" is seen as less bad than attempting treatment and having them die during the attempt.

Or the NRC protecting the public from radiation leaks with heavy regulation but not from the radioactives in coal ash that will be released as a consequence of NRC decisions.

Or the FDA delaying Moderna after it was ready in 1 weekend because it's natural to die of a virus.

Presumably an AI this advanced would have the ability to eliminate all superficial forms of suffering such as hunger and disease. So how would we suffer? If the AI cannot fix higher order suffering such as ennui or existential dread, that is not an alignment problem.

You make some good points, but your examples of negative actions are uniformly terrible. Mainly because there are no obvious positive action alternatives to hospices, borders, legislated economic equality etc. and yet you make it sound like they exist. There are other approaches, definitely, but they are fraught with different issues.

I agree my examples are terrible. Alternatives would be helpful, if you’ve got any.

In each of the cases I mention, I expect doing nothing would be better than performing the actions that the relevant actors in our society perform(ed). I am dissatisfied with the particular examples I gave because I also suspect many like you would disagree due to what I see as a misapplication of ethical principles. I also don’t think they highlight the primary point I’m trying to make, except the hospice example. Though they may be useful for intuition building, which is why I include them.

imo if we get close enough to aligned that "the AI doesn't support euthanasia" is an issue, we're well out of the valley of actually dangerous circumstances. Human values already vary extensively and this post feels like trying to cook out some sort of objectivity in a place it doesn't really exist.

The horror story people are worried about is "we suffer a lot but the AI doesn't care/makes it worse, and the AI doesn't allow you to escape by death."