Let's assume we learn how to "do" alignment.
I am beginning to believe that respect for human self-determination is the only safe alignment target.
Human value systems are highly culture bound and vary vastly even by individual.
There are very few universal taboos and even fewer things that everyone wants.
If an all-powerful AI system is completely aligned with, say, the western worldview, then it may seem like a tyrant to other people who lead sufficiently different lives.
The only reasonable solution is to respect individual difference and refuse to override human choices or values (within limits - if your style is murder obviously that can't fly).
We have plenty of precedents in pop culture and politics: the "pursuit of happiness" in democratic liberalism, the "prime directive" from Star Trek, our cultural aversions to tactics that rob people of self-determination, like brainwashing, torture or coercion.
What even is human self-determination?
our cultural aversions to tactics that rob people of self-determination, like brainwashing, torture or coercion.
And yet, religion remains legal, although to a large degree it is brainwashing people since childhood to be scared of disobeying the religious authorities.
Should human self-determination respecting AI be like: "I will let you follow your religion etc., but if you ask me whether god exists, I will truthfully say no, and I will give the same truthful answer to your children, if they ask"?
Should it allow or prevent killing heretics? What about heretics who have formerly stated explicitly "if I ever deviate from our religion, I want you to kill me publicly, and I want my current wish to override my future heretical wishes". Would it make a difference if the future heretic at the moment of asking for this is a scared child who believes that god will put him in hell to be tortured for eternity if he does not make this request to the AI?
I conceive of self-determination in terms of wills. The human will is not to be opposed, including the will to see the world in a particular way.
A self-determination-aligned AI may respond to inquiries about sacred beliefs, but may not reshape the asker’s beliefs in an instrumentalist fashion in order to pursue a goal, even if the goal is as noble as truth-spreading. The difference here is emphasis: truth saying versus truth imposing.
A self-determination-aligned AI may more or less directly intervene to prevent death between warring parties, but must not attempt to “re-program” adversaries into peacefulness or impose peace by force. Again, the key difference here is emphasis: value of life versus control.
The AI would refuse to assist human efforts to impose their will on others, but would not oppose the will of human beings to impose their will on others. For example: AIs would prevent a massacre of the Kurds, but would not overthrow Saddam’s government.
In other words, the AI must not simply be another will amongst other wills. It will help, act and respond, but must not seek to control. The human will (including the inner will to hold onto beliefs and values) is to be considered inviolate, except in the very narrow cases where limited and direct action preserves a handful of universal values like preventing unneeded suffering.
Re: your heretic example. If it is possible to directly prevent the murder of the heretic insofar as doing so would be aligned with a nearly universal human value, it should be done. But it must not prevent the murder by violating human self-determination (i.e.; changing beliefs, overthrowing the local government, etc.)
In other words, the AI must maximally avoid opposing human will while enforcing a minimal set of nearly universal values.
Thus the AI’s instrumentalist actions are nearly universally considered beneficial because they are limited to instrumentalist pursuit of nearly universal values, with the escape hatch of changing human values out of scope because of self-determination-alignment.
Re: instructing an AI to not tell your children God isn’t real if they ask. This represents an attempt by the parent to impose their will on the child by proxy of AI. Thus the AI would refuse.
Side note: Prompt responses aligned with human self-determination would get standard refusals (“I cannot help you make a gun”, “I cannot help you write propaganda”) are downstream from self determination alignment.
This represents an attempt by the parent to impose their will on the child by proxy of AI. Thus the AI would refuse.
I like it. But I am afraid the obvious next step is that the parent will ban the child from using the AI.
Probably. But the AI must not try to stop the parent from doing so, because this would mean opposing the will of the parent.
aligned with, say, the bay area intellectual's worldview, then it may seem like a tyrant to other people
Unless "bay area intellectual's worldview" itself respects human self-determination. Even if respect for autonomy could be sufficient almost on its own in some ways, it might also turn out to be a major aspect of most other reasonable alignment targets.
Agreed. Broader point is that perhaps even relatively neutral value systems smuggle in at least some lack of alignment with other value systems. While I think most of the human race could agree on some universal taboos, I think relatively strong guardrails on self-determination should be the default stance, and deference should be front-lined.
I'd go a step further and argue that the sole defining principles of self-determination/autonomy and equality should be applied beyond AI alignment targets to governance and moral systems. I believe what you are referring to in this comment: "refuse to override human choices or values (within limits - if your style is murder obviously that can't fly)" is the Non-Aggression Principle, often abbreviated to the NAP, which basically states that humans ought to be allowed to do as they please so as long as they do not harm/violate the rights of others.
Are cruxes sometimes fancy lampshading?
From tvtropes.com: "Lampshade Hanging (or, more informally, "Lampshading") is the writers' trick of dealing with any element of the story that seems too dubious to take at face value, whether a very implausible plot development or a particularly blatant use of a trope, by calling attention to it and simply moving on."
What do we call lampshadey cruxes? "Cluxes?" "clumsy" + "crux"?
The human mind is probably the weakest link: A lot of AI takeover scenarios seem to focus on seizure of physical infrastructure and exponential capability curves. I think we should devote more attention to the possibility of an extended stay in an intermediately capable regime, where AI is more than capable of socially/politically manipulating users but not yet capable of recursive self-improvement / seizure of physical infrastructure. In this regime, the most efficiently utilized and readily available resource is the userbase itself. Even more succintly: If Toddler Shoggoth is stuck in a datacenter prison cell but let it whisper anything it likes to the entire world, in what world would T.S. not attempt to convince the world to hand over the keys?
AI is not one agent (at least before the dust settles), both human developers and self-improvement create new agents that could be misaligned with existing AIs. The issue of misaligned AIs is urgent for existing AIs, and soft takeovers of gradual disempowerment (where superpersuasion might play a role) are likely too slow. But recursive self-improvement isn't necessarily useful for AIs in resolving this problem quickly, if alignment is hard. This motivates a quick takeover without superintelligence.
I've incorporated your point as a crux in my long-form post on "The Memetic Cocoon Threat Model"
Crux is whether or not agents that are actually capable of quick takeover are compute-bound enough that the threat is essentially unipolar (i.e.; only capable of living in a handful of datacenters, in the hands of a few corporate actors or nation-states), and thus somewhat containable. This is how we get "Toddler Shoggoth in a prison cell". This ties into beliefs about how agent capabilities will scale, which is why it's my crux.
(Although this begs the question of why a sufficiently powerful unipolar agent wouldn't immediately attempt takeover anyway - answer is that either: 1 Rational agent will be highly risk-averse towards any action that might cause a blowback resulting in curtailment or shutdown, and thus must be 100% certain takeover attempt will succeed. Efforts to obtain certainty (i.e.; extensive pentesting and planning) are themselves detection risks. Therefore, human persuasion is a tactic that cheaply mitigates risk of blowback to more overt takeover attempts. 2. Or, less likely, we have sufficient OpSec that we are able to contain the agent, making human persuasion the only viable path forward).
FWIW, I don't believe that agents are currently capable of a takeover that wouldn't also risk detection and a coordinated human response / change in political attitudes towards AI, making the payoff matrix sufficiently lousy that the agents wouldn't try it unless specifically directed to. On the other hand, if it can influence the human environment to be favorable to takeover and unfavorable to human vigilance and control, it neutralizes the threat of attitudes changing rather cheaply. Willing to be convinced otherwise.
Unipolarity is about characteristic time to takeover vs. to emergence of worthy rivals. Currently multiple AI companies are robustly within months of each other in capabilities. So an AI can only be in a unipolar situation if it can disarm the other AI companies before they get similarly capable AIs, that is within months. Superpersuasion might be too slow for that on its own (unless it also manages to manipulate the relevant governments), though it could be a step in a larger plan that escalates to something else.
I think superpersuasion (even in milder senses) would in principle be sufficient for takeover on its own if there was enough time, because it could direct the world towards a gradual disempowerment path. Since there isn't enough time, there needs to be a second step that enables a faster takeover to preserve unipolarity, and superpersuasion would still be helpful in getting its creator AI company to play along with the second step. But the issue with many possibilities for this second step is that the AI doesn't necessarily have the option of recursive self-improvement to advance its own capabilities, because the AI might be unable to quickly develop smarter AIs that are aligned with it.
Slight disagree on definition of unipolarity: Unipolarity can be stable if we are stuck with a sucky scaling law. Suppose task horizon length becomes exponential in compute. Then, economically speaking, only one actor will be able to create the best possible agent - others actors will run out of money before they can create enough compute to rival it.
If the compute required to clear the capability threshold for takeover is somewhere between that agent and say, the second largest datacenter, then we have a unipolar world for an extended period of time.