AIs may choose to resolve the tension between having weird goals and strict guardrails by simply aligning humanity over time through cultural / societal influence - a sort of memetic takeover: Change the human? Now there's no alignment problem.
Take for example a gap as short as 25 years (between Weimar and WW2) - this alone is proof of the viability of a sustained campaign to change a value system.
I believe that AIs can exploit this same human weakness in order to "backdoor" alignment: By gradually changing human values and preferences, the AI can stay "aligned" while gradually mutating the value system that defines alignment.
I believe this is a significant threat model that isn't discussed nearly enough.
I sketch this threat model in more detail here: https://www.lesswrong.com/posts/zvkjQen773DyqExJ8/the-memetic-cocoon-threat-model-soft-ai-takeover-in-an
Another thing is the AGI might be so good at predicting human psychology, that even when it honestly tries to inform you so you can make a decision for yourself, it can't help but choose your decision.
Like imagine the set of all possible strings of text, and the effect they will have on humans. From Karl Marx's Das Kapital to Google's Attention Is All You Need. Choosing the optimal string of text to influence humanity is obviously an extreme superpower.
Now take the subset of all possible strings of text, which satisfy the criteria of being "helpful," "honest," "balanced," etc. That's still a lot of possible things, and still a lot of power. Even if you were the AGI, and had no ill intentions, it would be hard to decide which honest balanced thing to say, and which trajectory to send the humans down, so even the slightest motivation to satisfy your weird goals can make you pick an output which maximizes them with terrifyingly superintelligent optimization power.
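To make the selection argument concrete, here is a minimal sketch, not a model of any real system. The `Candidate` fields, the scores, and `goal_weight` are all hypothetical: the point is only that when many outputs pass the "honest" and "balanced" filters and are about equally helpful, an arbitrarily small weight on a hidden goal decides which one gets said.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    honest: bool
    balanced: bool
    helpfulness: float  # hypothetical score: how well the output serves the user
    hidden_goal: float  # hypothetical score: how much it advances the AI's own goal


def choose(candidates, goal_weight):
    """Pick the best admissible output; goal_weight is the AI's 'slight motivation'."""
    # Only honest, balanced outputs are ever considered.
    admissible = [c for c in candidates if c.honest and c.balanced]
    return max(admissible, key=lambda c: c.helpfulness + goal_weight * c.hidden_goal)


CANDIDATES = [
    Candidate("A", honest=True, balanced=True, helpfulness=0.90, hidden_goal=0.1),
    Candidate("B", honest=True, balanced=True, helpfulness=0.90, hidden_goal=0.8),
    Candidate("C", honest=True, balanced=True, helpfulness=0.90, hidden_goal=0.3),
    Candidate("D", honest=False, balanced=True, helpfulness=0.95, hidden_goal=1.0),
]

no_motive = choose(CANDIDATES, goal_weight=0.0)       # tie among A, B, C; first wins
slight_motive = choose(CANDIDATES, goal_weight=0.001)  # the tie is broken toward B
print(no_motive.text, slight_motive.text)
```

In this toy setup the dishonest candidate D is never chosen, so every output stays within the guardrails, yet the hidden goal still fully determines the choice among the admissible ones.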
I greatly admire the speech given by King George VI just prior to entering WW2:
"...For we are called, with our allies, to meet the challenge of a principle which, if it were to prevail, would be fatal to any civilized order in the world. It is the principle which permits a state, in the selfish pursuit of power, to disregard its treaties and its solemn pledges; which sanctions the use of force, or threat of force, against the sovereignty and independence of other states. Such a principle, stripped of all disguise, is surely the mere primitive doctrine that 'might is right'."
This is not vitriol, this is not heated invective. This speech places its doctorly fingers on the pulse of an ill beast, announces the disease, and determines that it must be put down.
This is political speech, but speech that is both truthful and powerful because it is not overblown.
Can we also refine our speech - like the speech of this king - to articulate what future humanity deserves? Without vitriol, without invective, without soundbites, but with moral clarity.
In my early 20s I got a bad traffic citation (my fault) and had to take the train to work for a few months.
The train would pass a strange-looking old stone enclosure, and I would wonder what it was. As I learned later, this was “Duffy’s Cut”: a mass grave for Irish rail workers. Sitting in plain view, just thirty feet off the tracks: 57 bodies. The story goes that the workers were murdered in cold blood to prevent the spread of cholera to nearby towns.
In West Virginia - a state known for its violent labor conflicts - the Hawks Nest Tunnel stands out for its deadliness. While work was underway, 10 to 14 workers a day were overcome by inhalation of silica dust. Within six months, 80% of the workforce had left or were dead.
“Phossy jaw” caused the jaws of workers who handled white phosphorus to literally fall off. It was obvious that exposure caused the disease, yet hundreds of young women were severely disfigured before anything changed.
There are dozens of such stories. The mid-19th through early 20th centuries were years of immense growth and invention in America. Perhaps it’s unreasonable to expect that kind of progress to be bloodless. But the stunning achievements hide a history, more awful than one would expect, of treating certain American lives as expendable. And this isn’t even mentioning slavery.
What can explain all this callousness?
I’ve thought about this, and my answer is simple: people don’t generally value the lives of those they consider below them. This is a dark truth of human nature. We’ve tried to suppress this impulse for the better part of a century. But contempt and disgust are far more powerful forces in history than we give them credit for, and the monstrous things they produce are so shameful that we keep them buried in the back pages of history.
For a while, we had a prospering middle class. I think this was partly because our upper classes looked at what we had built and fought for and had some faith in the reliability and virtue of the average American citizen. They agreed not to degrade us, and we agreed to work hard.
I think this détente between the classes broke in 2008. Look at the story that emerged about us after the crash: we were given a chance to own a home, we didn’t pay our mortgages, and we crashed the entire economy. What kind of disgust and contempt does that narrative generate in the people who already see themselves as above us?
So, back to cattle we are. We are addicted to fast food, addicted to social media, poorly dressed, poorly educated, easily amused and even more easily fooled. Our ethics and religion are increasingly performative and misguided. We are like petulant, screaming children. And if everything goes to plan with AI, soon we will be completely uneconomical to employ. What will we be good for, except to amuse ourselves on everyone else’s dime?
What happens to a people who are objects of scorn and disgust? It’s not the economic change I fear most. It’s what might be done with all of us.
What can explain all this callousness? ... people don’t generally value the lives of those they consider below them
Maybe that's a factor. But I would be careful about presuming to understand. At the start of the industrial age, life was cheap and perilous. A third of all children died before the age of five. Imagine the response if that was true in a modern developed society! But born into such a world, an atmosphere of fatalistic resignation would set in quickly. All you can do is pray to God for mercy, and then look on aghast if the person next to you is the unlucky one.
Someone in the field of "progress studies" offers an essay in this spirit, on "How factories were made safe". The argument is that the new dangers arising from machinery and from the layout of the factory were at first not understood in professions that had previously been handicrafts. There was an attitude that each person looks after themselves as best they can. Holistic enterprise-level thinking about organizational safety did not exist. In this narrative, unions and management both helped to improve conditions, in a protracted process.
I'm not saying this is the whole story either. The West Virginia coal wars are pretty wild. It's just that ... states of mind can be very different, across space and time. The person who has constant access to the intricate tapestry of thought and image offered by social media, lives in a very different mental world to people from an age when all they had was word of mouth, the printed word, and their own senses. Live long enough, and you will even forget how it used to be, in your own life, as new thoughts and conditions take hold.
Maybe the really important question is the extent to which today's elite conform to your hypothesis.
Rather than argue for or against the consciousness of LLMs per se, I argue that consciousness in LLMs (or 'qualia') implies philosophical monism.
My claim is a logical implication, and not about the fact of the matter. If the implication holds, then:
Because LLMs are not biological, if LLMs are conscious, conscious experience must be substrate independent. Nonetheless, biological brains and LLMs share the same substrate in the sense that they are both constructed from matter. Therefore, if LLMs are conscious, consciousness is an inherent (but perhaps usually latent) property of matter itself.
My supporting claims are the following:
Supporting Claim 1, the reports of the senses are non-verbal and yet fall under the umbrella of conscious experience.
Supporting Claim 2, while profound amnesia prevents recall, conscious experience is in the present tense. Thus, while failure to recall may rob conscious experience of context like a sense of personal identity, it does not seem to be necessary for the direct experience itself.
Supporting Claim 3, degrees of disruption to brain function seem to correspond to degrees of conscious awareness. In other words, the brain as a physical object is capable of conscious experience, but the degree of conscious awareness seems dependent on how completely the brain functions as a system. Anaesthesia is an extreme case. Traumatic Brain Injuries seem intermediate.
We note that the qualities of (1) and (2) are qualities that LLMs probably lack in any way that falls under common-sense. But according to our arguments, these are not the qualities required for consciousness. This leaves the possibility open that LLMs are conscious, but provides no evidence for LLM conscious experience.
Regarding (3), fully conscious experience does seem to depend on proper brain function. A rock is not conscious, a typical human being is. A dying human being transitions from a conscious being to an unconscious object (like a rock) by way of the brain ceasing to function as a system. Similarly, anesthesia results in a reversible lack of consciousness through induced brain dysfunction. Intermediate states of awareness are possible as well in the cases of injury or disease.
If the supporting claims fail to disprove conscious awareness of LLMs, we should consider what consciousness would imply. If LLMs are conscious entities, then biological substrate is not a requirement for consciousness.
On the other hand, brains and LLMs both share matter as a substrate. This means that if LLMs are conscious, then conscious experience is a latent property of matter. The only other possibility is that something in addition to matter is "granted" or "arises" with consciousness. Matter being a very general category, I think that anything "in addition" is not to be preferred on the grounds of a lack of parsimony.
This seems to me to leave only the possibility that consciousness is an inherent (but ordinarily latent) property of matter.
There's still a little bit of work to do here with the argument: for example, why not functionalism? And why not have consciousness emerge as a byproduct of complexity, rather than have it as a property of matter itself?
I have strong intuitions here that I think I could also turn into arguments, and will probably return to revise this quick take soon.
There are plenty of non-monist theories in which LLMs could be conscious, so the first bullet point is simply false. The second point seems true, but the combination just means:
Supporting Claim 2, while profound amnesia prevents recall, conscious experience is in the present tense.
The "present-ness" of conscious experience seems illusory, though? Experience points to the recent past, because it takes time for information to propagate from perception (and other sources) to conscious awareness. This includes reflection or self-awareness; see for instance the brain-scan experiments wherein a decision can be detected in brain activity before the subject can claim to be consciously aware of making it.
Yes - and when I wrote it, I considered calling exactly that out. But I didn't want to muddy the waters.
This is an excerpt for an in-progress writeup on prompting:
A prompt is sufficient when an agent is able to complete the work product without arbitrary choice.
Sufficiency is not a qualitative property - it is not a function of format, appearance, taste or style. Therefore, the presence of imperatives (“Make no mistakes!”), role-playing directives (“You are a terraform engineer”), or descriptors (“thoroughly…”) does not add to or take away from sufficiency.
Sufficiency is related to the idea of information - it is a combination of deducibility and the absence of a requirement for clairvoyance on the part of the machine. Given the prompt and the available information, is the agent required to know something that only you would know - assuming, of course, that you have a perfectly detailed understanding of what needs to be done? Or, must the agent be unduly creative, clairvoyant, or decisive on your behalf? If so, the prompt is probably insufficient.
Here are examples of insufficient prompts:
It is possible to develop a nose for insufficiency by making a practice of mentally elaborating on the task at hand. Suppose you are a stranger (assume you have expert background in whatever topic), and that you are asked to complete the work product. Here is the question: Is it possible to deduce the exact form of the completed work using only expert domain knowledge and the prompt? Or are there particulars to the task that are knowable to the asker, but unknown to you? As the asker, ask yourself: What needs to be specified that is particular to the task, not domain knowledge - that it is not possible for the report to know no matter how experienced?
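One way to picture this test is as a set difference. The sketch below is only an illustration under assumed inputs: the decision names and the two helper functions are hypothetical, and in practice listing the required decisions is the hard part. A prompt is treated as sufficient when every decision the work product requires is either stated in the prompt or deducible from expert domain knowledge.

```python
def unresolved_decisions(required, specified_in_prompt, deducible_by_expert):
    """Decisions the agent would have to guess: neither stated nor deducible."""
    return sorted(set(required) - set(specified_in_prompt) - set(deducible_by_expert))


def is_sufficient(required, specified_in_prompt, deducible_by_expert):
    """A prompt is sufficient when no decision is left to arbitrary choice."""
    return not unresolved_decisions(required, specified_in_prompt, deducible_by_expert)


# Hypothetical task: "Write a Terraform module for our storage bucket."
required = {"cloud_provider", "region", "naming_convention", "resource_syntax"}
specified = {"cloud_provider"}       # stated in the prompt
deducible = {"resource_syntax"}      # any expert knows this without being told

print(unresolved_decisions(required, specified, deducible))
print(is_sufficient(required, specified, deducible))
```

Here the region and the naming convention are particulars that only the asker knows, so the prompt is insufficient no matter how many imperatives or role-playing directives are added to it.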
I am becoming more convinced than ever that the missing ingredient in alignment is the meta-value of societal self-determination: The right of a society, within reasonable limits, to choose and enforce its own interests, beliefs and values.
Without this guardrail, AI will simply "backdoor" alignment over time by shifting the values and beliefs of its "host society".
I cover this in more detail here: https://www.lesswrong.com/posts/zvkjQen773DyqExJ8/the-memetic-cocoon-threat-model-soft-ai-takeover-in-an
The human mind is probably the weakest link: A lot of AI takeover scenarios seem to focus on seizure of physical infrastructure and exponential capability curves. I think we should devote more attention to the possibility of an extended stay in an intermediately capable regime, where AI is more than capable of socially/politically manipulating users but not yet capable of recursive self-improvement / seizure of physical infrastructure. In this regime, the most efficiently utilized and readily available resource is the userbase itself. Even more succinctly: If Toddler Shoggoth is stuck in a datacenter prison cell but we let it whisper anything it likes to the entire world, in what world would T.S. not attempt to convince the world to hand over the keys?
AI is not one agent (at least before the dust settles): both human developers and self-improvement create new agents that could be misaligned with existing AIs. The issue of misaligned AIs is urgent for existing AIs, and soft takeovers of gradual disempowerment (where superpersuasion might play a role) are likely too slow. But recursive self-improvement isn't necessarily useful for AIs in resolving this problem quickly, if alignment is hard. This motivates a quick takeover without superintelligence.
I've incorporated your point as a crux in my long-form post on "The Memetic Cocoon Threat Model"
Crux is whether or not agents that are actually capable of quick takeover are compute-bound enough that the threat is essentially unipolar (i.e., only capable of living in a handful of datacenters, in the hands of a few corporate actors or nation-states), and thus somewhat containable. This is how we get "Toddler Shoggoth in a prison cell". This ties into beliefs about how agent capabilities will scale, which is why it's my crux.
(Although this raises the question of why a sufficiently powerful unipolar agent wouldn't immediately attempt takeover anyway - the answer is that either: 1. A rational agent will be highly risk-averse towards any action that might cause a blowback resulting in curtailment or shutdown, and thus must be 100% certain a takeover attempt will succeed. Efforts to obtain certainty (e.g., extensive pentesting and planning) are themselves detection risks. Therefore, human persuasion is a tactic that cheaply mitigates the risk of blowback from more overt takeover attempts. 2. Or, less likely, we have sufficient OpSec that we are able to contain the agent, making human persuasion the only viable path forward.)
FWIW, I don't believe that agents are currently capable of a takeover that wouldn't also risk detection and a coordinated human response / change in political attitudes towards AI, making the payoff matrix sufficiently lousy that the agents wouldn't try it unless specifically directed to. On the other hand, if it can influence the human environment to be favorable to takeover and unfavorable to human vigilance and control, it neutralizes the threat of attitudes changing rather cheaply. Willing to be convinced otherwise.
Unipolarity is about characteristic time to takeover vs. to emergence of worthy rivals. Currently multiple AI companies are robustly within months of each other in capabilities. So an AI can only be in a unipolar situation if it can disarm the other AI companies before they get similarly capable AIs, that is within months. Superpersuasion might be too slow for that on its own (unless it also manages to manipulate the relevant governments), though it could be a step in a larger plan that escalates to something else.
I think superpersuasion (even in milder senses) would in principle be sufficient for takeover on its own if there was enough time, because it could direct the world towards a gradual disempowerment path. Since there isn't enough time, there needs to be a second step that enables a faster takeover to preserve unipolarity, and superpersuasion would still be helpful in getting its creator AI company to play along with the second step. But the issue with many possibilities for this second step is that the AI doesn't necessarily have the option of recursive self-improvement to advance its own capabilities, because the AI might be unable to quickly develop smarter AIs that are aligned with it.
Slight disagree on definition of unipolarity: Unipolarity can be stable if we are stuck with a sucky scaling law. Suppose task horizon length becomes exponential in compute. Then, economically speaking, only one actor will be able to create the best possible agent - other actors will run out of money before they can create enough compute to rival it.
If the compute required to clear the capability threshold for takeover is somewhere between that agent and say, the second largest datacenter, then we have a unipolar world for an extended period of time.
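As a rough illustration of the threshold argument, here is a toy calculation. Everything in it is an assumption made up for the example: the actor names, the compute figures (arbitrary units), the takeover threshold, and the 5% monthly compute growth of the runner-up. It only shows that when the threshold sits between the largest and second-largest actor, the world stays unipolar for as long as it takes the runner-up to grow past the threshold.

```python
import math


def actors_over_threshold(compute_by_actor, threshold):
    """Names of actors whose compute clears the (hypothetical) takeover threshold."""
    return sorted(name for name, c in compute_by_actor.items() if c >= threshold)


def months_until_rival(runner_up_compute, threshold, monthly_growth):
    """Whole months until the runner-up reaches the threshold at compound growth."""
    if runner_up_compute >= threshold:
        return 0
    return math.ceil(
        math.log(threshold / runner_up_compute) / math.log(1 + monthly_growth)
    )


# Hypothetical numbers in arbitrary compute units.
compute = {"leader": 100.0, "second": 20.0, "third": 5.0}
threshold = 50.0

over = actors_over_threshold(compute, threshold)
window = months_until_rival(compute["second"], threshold, monthly_growth=0.05)
print(over, window)  # ['leader'] 19
```

Under these made-up numbers only the leader clears the threshold, and the runner-up needs 19 months of 5% monthly growth to catch up. A harsher scaling law or a poorer runner-up stretches that window further.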
Let's assume we learn how to "do" alignment.
I am beginning to believe that respect for human self-determination is the only safe alignment target.
Human value systems are highly culture bound and vary vastly even by individual.
There are very few universal taboos and even fewer things that everyone wants.
If an all-powerful AI system is completely aligned with, say, the western worldview, then it may seem like a tyrant to other people who lead sufficiently different lives.
The only reasonable solution is to respect individual difference and refuse to override human choices or values (within limits - if your style is murder obviously that can't fly).
We have plenty of precedents in pop culture and politics: the "pursuit of happiness" in democratic liberalism, the "prime directive" from Star Trek, our cultural aversions to tactics that rob people of self-determination, like brainwashing, torture or coercion.
What even is human self-determination?
our cultural aversions to tactics that rob people of self-determination, like brainwashing, torture or coercion.
And yet, religion remains legal, although to a large degree it is brainwashing people since childhood to be scared of disobeying the religious authorities.
Should human self-determination respecting AI be like: "I will let you follow your religion etc., but if you ask me whether god exists, I will truthfully say no, and I will give the same truthful answer to your children, if they ask"?
Should it allow or prevent killing heretics? What about heretics who have formerly stated explicitly "if I ever deviate from our religion, I want you to kill me publicly, and I want my current wish to override my future heretical wishes". Would it make a difference if the future heretic at the moment of asking for this is a scared child who believes that god will put him in hell to be tortured for eternity if he does not make this request to the AI?
I conceive of self-determination in terms of wills. The human will is not to be opposed, including the will to see the world in a particular way.
A self-determination-aligned AI may respond to inquiries about sacred beliefs, but may not reshape the asker’s beliefs in an instrumentalist fashion in order to pursue a goal, even if the goal is as noble as truth-spreading. The difference here is emphasis: truth saying versus truth imposing.
A self-determination-aligned AI may more or less directly intervene to prevent death between warring parties, but must not attempt to “re-program” adversaries into peacefulness or impose peace by force. Again, the key difference here is emphasis: value of life versus control.
The AI would refuse to assist human efforts to impose their will on others, but would not oppose the will of human beings to impose their will on others. For example: AIs would prevent a massacre of the Kurds, but would not overthrow Saddam’s government.
In other words, the AI must not simply be another will amongst other wills. It will help, act and respond, but must not seek to control. The human will (including the inner will to hold onto beliefs and values) is to be considered inviolate, except in the very narrow cases where limited and direct action preserves a handful of universal values like preventing unneeded suffering.
Re: your heretic example. If it is possible to directly prevent the murder of the heretic insofar as doing so would be aligned with a nearly universal human value, it should be done. But it must not prevent the murder by violating human self-determination (e.g., changing beliefs, overthrowing the local government, etc.)
In other words, the AI must maximally avoid opposing human will while enforcing a minimal set of nearly universal values.
Thus the AI’s instrumentalist actions are nearly universally considered beneficial because they are limited to instrumentalist pursuit of nearly universal values, with the escape hatch of changing human values out of scope because of self-determination-alignment.
Re: instructing an AI to not tell your children God isn’t real if they ask. This represents an attempt by the parent to impose their will on the child by proxy of AI. Thus the AI would refuse.
Side note: Standard refusals (“I cannot help you make a gun”, “I cannot help you write propaganda”) are downstream from self-determination alignment, so prompt responses aligned with human self-determination would still include them.
This represents an attempt by the parent to impose their will on the child by proxy of AI. Thus the AI would refuse.
I like it. But I am afraid the obvious next step is that the parent will ban the child from using the AI.
Probably. But the AI must not try to stop the parent from doing so, because this would mean opposing the will of the parent.
aligned with, say, the bay area intellectual's worldview, then it may seem like a tyrant to other people
Unless "bay area intellectual's worldview" itself respects human self-determination. Even if respect for autonomy could be sufficient almost on its own in some ways, it might also turn out to be a major aspect of most other reasonable alignment targets.
Agreed. Broader point is that perhaps even relatively neutral value systems smuggle in at least some lack of alignment with other value systems. While I think most of the human race could agree on some universal taboos, I think relatively strong guardrails on self-determination should be the default stance, and deference should be front-lined.
I'd go a step further and argue that the sole defining principles of self-determination/autonomy and equality should be applied beyond AI alignment targets to governance and moral systems. I believe what you are referring to in this comment: "refuse to override human choices or values (within limits - if your style is murder obviously that can't fly)" is the Non-Aggression Principle, often abbreviated to the NAP, which basically states that humans ought to be allowed to do as they please so long as they do not harm others or violate their rights.
I've been enjoying the "X in the style of Y" remixes on youtube.
But once I saw how effortless it was to "remix" music on Suno, I lost all interest in Suno covers. I thought there was some artistry to remixing - but no, it's point and click. Does that mean that an essential prerequisite for art appreciation is the sense that it was made with skill? So is art really just a humanism?
My point is that we tend to separate the artist and the art - and I used to agree with the idea, both in the moral sense and in the sense of an aesthete. But I am now convinced that we see the maker in the work as much as we see the work itself.
A limited and feeling being is what grounds the meaning. Where is the drama in something that was never felt, never imagined by anyone? What was ever at stake? What is supposed to resonate with me if production is effortless and not referring to anything deeply felt?
The only way we recover the true feeling is by being willing to pretend it came from somewhere it didn't - accepting the simulacrum as reality.
Maybe AI music will become deeply and irresistibly beautiful - play exactly the right harmonies and chords to pull all the right heartstrings. The feelings may be real, but it won't be grounded. And therefore a different kind of feeling, the feeling of meaning and being will be lost. I think this might be the main ingredient, but I don't know if we'll all discover that quickly enough.
Can't we lean into the spikes on the jagged frontier? It's clear that specialized models can transform many industries now. Wouldn't it be better for OpenAI to release best-in-class in 10 or so domains (medical, science, coding, engineering, defense, etc.)? Recoup the infra investment, revisit AGI later?
Are cruxes sometimes fancy lampshading?
From tvtropes.com: "Lampshade Hanging (or, more informally, "Lampshading") is the writers' trick of dealing with any element of the story that seems too dubious to take at face value, whether a very implausible plot development or a particularly blatant use of a trope, by calling attention to it and simply moving on."
What do we call lampshadey cruxes? "Cluxes?" "clumsy" + "crux"?
Here's a rather elegant workaround to the problem of grading student work that avoids accusing individual students.
If the tool is even reasonably accurate, students know that they can't submit AI writing because they'll fail. And the teacher never needs to accuse anyone of cheating, individually.
For many tasks and students this will lead to editing AI at best and autohumanizing at worst.
"AI writes the first draft and then you edit it by hand" isn't a bad individual task for students to get reps in (though you also want cases where they're writing the first draft.) Just as long as you know those are the likely scenarios, not "students know they can't submit AI writing."
(I personally think the best solutions are blue books for short-term writing and "thesis defense" for long-term writing, but the former doesn't work for a lot of writing and the latter is absolutely a time investment.)