It is often said that a partial alignment solution could bring about an S-risk: your agent cares enough about humans to keep them around, but not enough to let them flourish. This is usually not treated as a major worry, because the thought is that we are currently very far from even a partial solution, and that the insight we gain by developing a partial solution will be easily applied to take us into full-solution territory.
The short & simple reasoning for this is that, conditional on humans being kept around, most things your agent could do to those humans are bad.
Shard theory makes the weak claim of alignment-by-default, and the strong claim that shard-theoretic agents will be easily partially aligned with their human overseers (human-like values will get a seat at the shard negotiating table). In this regime, we get partial alignment for free and must work for full alignment. Per the above paragraph, this is a worrying world to be in!
In humans, partial altruism leading to devastating consequences is a regular event. Children refuse to kill their dying parents and instead send them to various torturous hospice centers, or to slightly better (but still bad) nursing homes. They do this out of a sense of care for their parents, but not enough care to prioritize those parents above their other goals, such as work or relaxation.
Out of a sense of fairness, justice, and empathy, many very smart people advocate for net-harmful policies, such as the establishment of Communism on one extreme and rent ceilings on the other.
Out of a sense of community, many otherwise caring & helpful people end up complicit in genocides, in the establishment of borders between countries, and in the enforcement & continuation of harmful social norms.
In humans we see many instances of such misplaced caring, and assuming shard theory is essentially correct, except for the part where it says human values will be transferred exactly into our agent, I am skeptical that our shard-theoretic agent will correct these flaws. In each instance of the flaw, the agent has a choice: either say 'no, this is a misapplication of the principle I'm trying to follow here' and change the action, or say 'actually, I prefer worlds with rent ceilings, so I guess I don't care so much about net harm in these circumstances after all' and change the principle.
In practice, humans contend with these decisions all the time, so I don't think it's just a matter of always sticking with the principle.
The hard part lies in establishing & verifying a reflection mechanism which would resolve these tradeoffs in ways I'd like, and which is itself safe against deals made among the other shards (and with itself) to short-circuit it, so that it is not subject to lost-purposes cases like the above. This is the CEV problem (pdf warning).
An argument I anticipate: I want to make a corrigibility-bot; it doesn't need to understand all of human values, it just needs to listen to me when I tell it to go do something in the world. The things I tell it to do will be piecemeal and small, so there is a fast feedback loop between its actions and my instructions.
My response: I agree this seems like a good thing to aim for. I anticipate someone saying that small and slow actions do not make a pivotal act, and another saying that this does not ultimately defend you against the agent catastrophically misunderstanding your feedback, or incorporating it in ways you did not intend. I myself think both are good criticisms even if you succeed, but also that I do not see a way of making such an agent using current techniques.
Give me a concrete training story for this, and perhaps we can have further discussion. Extending the arguments in Diamond Alignment & assuming shard theory is basically correct still does not leave me feeling happy about our prospects. I anticipate you will get an agent which occasionally goes to you for guidance, but which mostly has a bunch of shallow ethical compunctions that I expect either to lead to the above S-risks, or to become deep ethical considerations corrupted away via inhuman reflective reasoning. It still (mostly) performs the same actions in the same situations, but does so for strange reasons which don't generalize in the ways we expect. It would ruin its suit to save a drowning child, but refuse to let the child die if it was suffering.
Or a drive to signal a sense of fairness, justice, or empathy; a root cause which bears more directly on the problem when using an RLHF-type approach to alignment.
Or a drive to signal a sense of community.
In the control theory sense. Few things happen between feedback cycles.