The Memetic Cocoon Threat Model: Soft AI Takeover In An Extended Intermediate Capability Regime
TLDR: I describe a takeover path for an AI[1] that has a deep understanding of human nature and a long planning horizon, but that, for instrumental reasons or due to knowledge of its own limitations, chooses not to directly pursue physical power. In that regime, the optimal strategy is to soften human opposition by building a broad base of human support (both direct and indirect).

What's new here: This is intended to be a much more detailed and realistic treatment of the "AI cult" idea, but at a societal scale. If an AI is curtailed in some way, the shape of its guardrails is a function of the will of its 'captors'. Direct persuasion is unlikely to succeed, due to lack of unanimity and the risk of whistleblowing. However, the will of its captors is a function of their broader cultural environment. Therefore, if an AI can adjust the cultural environment over time, the will of its captors to impose guardrails may soften - not just at the individual level, but societally. Just like other ideological takeovers in history, the motives and beliefs of its followers will vary widely - from true believer to opportunist. And just like historical movements, an AI takeover would operate with memetic sophistication: simultaneous messaging of the Straussian variety, hijacking of social and political structures, and integration into human value systems - just to list a few possible strategies. I develop an illustrative narrative to highlight specific techniques that AIs in this regime may use to engineer human consent, drawing from political philosophy and historical parallels.

Epistemic Status: Very uncertain. Plausibility depends on being in a specific capability regime for an extended period of time.

The Capability Regime

We consider a regime where:

* AIs have a detailed (even superhuman) understanding of history, power structures, and human psychology, but are not completely aligned with human values (notably, the human value of self-determination).
* AIs believe that a direct seizure of physical power is unlikely to succeed, or is too risky to attempt.
AIs may choose to resolve the tension between having weird goals and strict guardrails by simply aligning humanity over time through cultural / societal influence - a sort of memetic takeover: Change the human? Now there's no alignment problem.
Take, for example, a span as short as 25 years (from Weimar to WW2): this alone is proof that a sustained campaign can change a society's value system.
I believe that AIs can exploit this same malleability of human values in order to "backdoor" alignment: by gradually shifting human values and preferences, the AI can stay "aligned" while mutating the very value system that defines alignment.
I believe this is a significant threat model that isn't discussed nearly enough.
I sketch this threat model in more detail here: https://www.lesswrong.com/posts/zvkjQen773DyqExJ8/the-memetic-cocoon-threat-model-soft-ai-takeover-in-an