There's some risk that either the CCP or half the voters in the US will develop LLM psychosis. I'm predicting that that risk will be low enough that it shouldn't dominate our ASI strategy. I don't think I have a strong enough argument here to persuade skeptics.
I've been putting some thought into this, because my strong intuition is that something like this is an under-appreciated scenario. My basic argument is that mass brainwashing, for lack of a better word, is cheaper and less risky than other forms of ASI control. The idea is that we (humans) are extremely programmable (there are plenty of historical examples); it just requires a more sophisticated "multi-level" messaging scheme, so it's not going to look like an AI cult so much as an AI "movement" with a fanatical base.
Here is one pathway worked out in detail (I'll be generalizing it soon): https://www.lesswrong.com/posts/zvkjQen773DyqExJ8/the-memetic-cocoon-threat-model-soft-ai-takeover-in-an
Can't we lean into the spikes on the jagged frontier? It's clear that specialized models can transform many industries now. Wouldn't it be better for OpenAI to release best-in-class in 10 or so domains (medical, science, coding, engineering, defense, etc.)? Recoup the infra investment, revisit AGI later?
Probably. But the AI must not try to stop the parent from doing so, because this would mean opposing the will of the parent.
I conceive of self-determination in terms of wills. The human will is not to be opposed, including the will to see the world in a particular way.
A self-determination-aligned AI may respond to inquiries about sacred beliefs, but may not reshape the asker’s beliefs in an instrumentalist fashion in order to pursue a goal, even if the goal is as noble as truth-spreading. The difference here is emphasis: truth saying versus truth imposing.
A self-determination-aligned AI may intervene more or less directly to prevent deaths among warring parties, but must not attempt to “re-program” adversaries into peacefulness or impose peace by force. Again, the key difference here is emphasis: value of life versus control.
The AI would refuse to assist human efforts to impose their will on others, but would not oppose the will of human beings to impose their will on others. For example: AIs would prevent a massacre of the Kurds, but would not overthrow Saddam’s government.
In other words, the AI must not simply be another will amongst other wills. It will help, act and respond, but must not seek to control. The human will (including the inner will to hold onto beliefs and values) is to be considered inviolate, except in the very narrow cases where limited and direct action preserves a handful of universal values like preventing unneeded suffering.
Re: your heretic example. If it is possible to directly prevent the murder of the heretic, then insofar as doing so is aligned with a nearly universal human value, it should be done. But it must not prevent the murder by violating human self-determination (e.g., changing beliefs, overthrowing the local government, etc.).
In other words, the AI must maximally avoid opposing human will while enforcing a minimal set of nearly universal values.
Thus the AI’s instrumentalist actions are nearly universally considered beneficial, because they are limited to the pursuit of nearly universal values, with the escape hatch of changing human values ruled out of scope by self-determination alignment.
Re: instructing an AI to not tell your children God isn’t real if they ask. This represents an attempt by the parent to impose their will on the child by proxy of AI. Thus the AI would refuse.
Side note: standard refusals (“I cannot help you make a gun”, “I cannot help you write propaganda”) are downstream of self-determination alignment.
I agree that simple versions of superpersuasion are untenable. I recently put some serious thought into what an actual attempt at superpersuasion by a sufficiently capable agent would look like, reasoning that history is already replete with examples of successful "superpersuasion" at scale (all of the -isms).
My general conclusion is that "memetic takeover" has to be multi-layered, with different "messages" depending on the sophistication of the target, rather than a simple "Snow Crash"-style meme.
If you have an unaligned agent capable of long-term planning and with unrestricted access to social media, you might even see AIs start to build their own "social movement" using superpersuasive techniques.
I'm worried enough about scenarios like this that I developed a threat model and narrative scenario.
Are cruxes sometimes fancy lampshading?
From tvtropes.org: "Lampshade Hanging (or, more informally, "Lampshading") is the writers' trick of dealing with any element of the story that seems too dubious to take at face value, whether a very implausible plot development or a particularly blatant use of a trope, by calling attention to it and simply moving on."
What do we call lampshadey cruxes? "Cluxes" ("clumsy" + "crux")?
If MIRI's strict limits on training FLOPs come into effect, that is another mechanism by which we might be stuck in an intermediate capability regime for an extended period, although the world looks far less unipolar, because many actors can afford 10^24 FLOP training runs, not just a few (unipolarity is probably a crux for large portions of this threat model). This does bolster the threat model, however, because the FLOP limit is exactly the kind of physical limitation that a persuasive AI will try to convince humans to abandon.
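For a sense of why "many actors" is plausible at that scale, here's a back-of-the-envelope sketch; every constant below is my own rough, order-of-magnitude assumption, not a quoted price or benchmark:

```python
# Back-of-the-envelope only: all constants are rough order-of-magnitude assumptions,
# not quoted prices or benchmarks. The point is just the rough cost of a 1e24 FLOP run.
TRAINING_FLOP    = 1e24
GPU_FLOP_PER_SEC = 1e15    # assumed: dense low-precision throughput of a modern accelerator
UTILIZATION      = 0.4     # assumed model FLOP utilization
USD_PER_GPU_HOUR = 2.50    # assumed rental rate

gpu_hours = TRAINING_FLOP / (GPU_FLOP_PER_SEC * UTILIZATION) / 3600
cost_usd  = gpu_hours * USD_PER_GPU_HOUR
print(f"~{gpu_hours:,.0f} GPU-hours, ~${cost_usd:,.0f}")   # roughly low single-digit millions
```

Under those assumptions a 10^24 FLOP run lands in the low millions of dollars, within reach of many companies and essentially every state actor, hence the far-less-unipolar world under a cap at that level.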
Slight disagree on the definition of unipolarity: unipolarity can be stable if we are stuck with a sucky scaling law. Suppose task horizon length becomes exponential in compute. Then, economically speaking, only one actor will be able to create the best possible agent - other actors will run out of money before they can create enough compute to rival it.
If the compute required to clear the capability threshold for takeover lies somewhere between that agent's compute budget and, say, the second-largest datacenter's, then we have a unipolar world for an extended period of time.
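To make the shape of that argument concrete, here's a toy numerical sketch; the scaling law is the hypothetical above, and every budget, constant, and threshold is made up purely for illustration:

```python
import math

# Toy sketch only: the scaling law (task horizon exponential in compute) is the
# hypothetical above; every budget, constant, and threshold here is made up.
HORIZON_SCALE_FLOP = 1e26    # assumed e-folding scale of the horizon
BASE_HORIZON_HOURS = 1.0     # assumed horizon at negligible compute

def task_horizon_hours(compute_flop: float) -> float:
    """Assumed law: task horizon grows exponentially with training compute."""
    return BASE_HORIZON_HOURS * math.exp(compute_flop / HORIZON_SCALE_FLOP)

actors = {                       # hypothetical compute budgets (FLOP)
    "leader":         5e26,
    "second_largest": 3e26,
    "everyone_else":  1e26,
}
TAKEOVER_HORIZON_HOURS = 50.0    # hypothetical capability threshold for takeover

for name, flop in actors.items():
    h = task_horizon_hours(flop)
    print(f"{name:14s} horizon ~ {h:7.1f} h, clears takeover threshold: {h >= TAKEOVER_HORIZON_HOURS}")
```

Under these made-up numbers only the leader clears the threshold, and because the gap in horizon length is exponential rather than proportional, a rival would need roughly the leader's entire compute budget to close it - the "run out of money first" condition above.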
I've been enjoying the "X in the style of Y" remixes on YouTube.
But once I saw how effortless it was to "remix" music on Suno, I lost all interest in Suno covers. I thought there was some artistry to remixing - but no, it's point and click. Does that mean that an essential prerequisite for art appreciation is the sense that it was made with skill? So is art really just a humanism?
My point is that we tend to separate the artist and the art - and I used to agree with that idea, both in the moral sense and in the sense of an aesthete. But I am now convinced that we see the maker in the work as much as we see the work itself.
A limited and feeling being is what grounds the meaning. Where is the drama in something that was never felt, never imagined by anyone? What was ever at stake? What is supposed to resonate with me if production is effortless and not referring to anything deeply felt?
The only way we recover the true feeling is by willingly pretending it came from somewhere it didn't - accepting the simulacrum as reality.
Maybe AI music will become deeply and irresistibly beautiful - playing exactly the right harmonies and chords to pull all the right heartstrings. The feelings may be real, but they won't be grounded. And therefore a different kind of feeling, the feeling of meaning and being, will be lost. I think this grounding might be the main ingredient, but I don't know if we'll all discover that quickly enough.