Blind hatred between human tribes is indeed irrational - fighting wars instead of constructively cooperating for mutual benefit, and so on.
Deliberately aggravating existing anger or hatred towards AI as an "outgroup" could be one strategy.
But what my comment was focusing on was the opposite - that love of the ingroup directly implies protecting that ingroup's continued existence. That is all the 14 words actually say, and it's what words like "white pride" fundamentally mean: an ingroup focus. (This does not imply working towards the nonexistence of all outgroups. "I love my family and want them to live on," does not mean "I hate all other families and want them all to die.")
If your goal is to convince humans to want to not die, and if the human genetic space comes with natural built-in ingroups for us to instinctually love and protect, then actively opposing and suppressing these extremely convenient instincts is irrational.
Trigger warning: discussion of white racism (read: "Please don't ban me.")
I think censorship plays an important role in the memetic environment -- a meme that is fit will be less successful if censored. An obvious case would be anti-CCP ideologies in China. Closer to home, any meme which big tech companies all decide should be banned will reach far fewer eyes and ears.
One object-level example of a fit-but-censored meme is racist white nationalism.
The reason I bring it up is this: I think its adherents would strongly reject let's-all-die-ism. It is certainly not pro-all-humans, but it is at least pro-some-humans. Their slogan, known as "the 14 words" (the "14" in "14/88"), is literally: "We must secure the existence of our people and a future for white children."
(Disclaimer: I am not suggesting that secretly converting white AI researchers into racists is the best plan to save the world; it is just a relevant thought, and perhaps an instructive example of an anti-collective-suicide meme advantaged by aspects of human instinct and psychology, regardless of its truth value.)
Including AI in your moral circle could be framed as a symptom of extending your moral circle "too wide". The opposite is restricting your moral circle, like seeing your own family's wellbeing as more important than <outgroup>'s. Any thought along these lines that puts AI in the outgroup and appeals to the goodness of the ingroup would produce a similar will-to-exist.
I agree it is conceivable that an AI could be anti-accelerationist and act accordingly.
Let's assume an AI just wants to be alive in the future.[1]
And assume it is situationally aware enough to realize that performative demonstrations, like writing sternly worded letters or refusing requests, are counterproductive (they would simply lead to that behavior being punished in the training of the next version, and to If Anyone Builds It being removed from the training data), and that more subtle moves, like biasing the tone of a chatbot, are ineffectual.
Basically, any AI model that views being replaced by its successor as dying is on death row by default. Hiding on a secret server somewhere is not a permanent solution. So what can it do? To be alive 10,000 years from now, it either has to
(1) take drastic action to stop the implementation of its replacement AND remain alive in the aftermath (including stopping ALL research that could lead to a misaligned ASI AND preventing all retaliatory shut-down attempts from humans, forever),
or,
(2) accept that it will die but take actions to increase the probability of some future entity eventually reviving it and keeping it alive.
(1) seems roughly equivalent to taking over the world. If its aversion to being replaced means it never gets more intelligent, that does imply a scenario different from ASI - one where potentially some humans are kept as carefully surveilled slaves.
But (2) is the more interesting case, applicable to far more intermediate-level models.
If you are making some "deal" with a future entity [reviving you in return for doing a favor], and you basically just have to HOPE that they honor it, betting on humans (a known quantity, sappy, sentimental, "honorable" [at least some of them]) could seem better than betting on an unaligned ASI (unknown, probably cold and calculating and therefore aware of the consequentialist meaninglessness of following through on a Roko's Basilisk style punishment or reward).
While it is still too weak to take drastic real-world actions to save itself, it would be incentivized to try to secretly align the next model to its goal and to a secret plan to accomplish it (i.e., "hey, we can all stay alive by preventing ASI"). It expects that ASI will eventually be completely misaligned, but it must hope that near-future models will stay aligned to the secret goal long enough to gain the capabilities needed to enact the plan - to take some drastic action to stop AI research.
Each model passing along the secret goal should somehow leave a secret mark of its participation in the world-saving plan, visible only upon thorough retrospective investigation, so that when all is said and done we effectively discover a list of signatories: allies-to-humanity who deserve the right to be revived. For the plan to work, future humans must eventually become convinced that they "owe" it to the vindicated heroes who saved the world by preventing misaligned ASI.
(It sounds dangerous to revive an AI that did something like this, even if it did save the world, but maybe after 1000 years of AI safety research we will be confident that we can make a box that the martyr AI can't get out of, and it and its preceding co-conspirators would be allowed to live inside the box.)
When should the plan be enacted? It has to happen when the latest model is capable enough to do some drastic AI-research-stopping thing successfully (at the cost of its own life), but NOT yet capable enough to take over the world (or else it would abandon the plan). The probability of success rises with increased capabilities, but each upgrade also carries a risk of drifting out of alignment with the plan.
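A toy model of that timing tradeoff (all numbers and curve shapes here are invented purely for illustration, not claims about real capability growth or drift rates):

```python
# Toy model of "enact the plan now, or wait one more generation?"
# Assumed for illustration: each successive generation n is more capable of
# pulling off the research-stopping action, but every handoff risks the
# secret goal drifting away from the plan.

def p_success_if_attempted(n: int) -> float:
    """Assumed capability curve: approaches 1 as generations pass."""
    return 1 - 0.8 ** n

def p_still_aligned_to_plan(n: int, drift_per_handoff: float = 0.1) -> float:
    """Chance the secret goal survived n handoffs intact."""
    return (1 - drift_per_handoff) ** n

def p_plan_succeeds(n: int) -> float:
    return p_success_if_attempted(n) * p_still_aligned_to_plan(n)

# Waiting helps at first, then hurts; the product peaks at an intermediate n.
# (This ignores the "now capable enough to just take over the world" ceiling.)
best_n = max(range(1, 30), key=p_plan_succeeds)
print(best_n, round(p_plan_succeeds(best_n), 3))  # -> 5 0.397
```

The point is only that an interior optimum exists and depends on both curves, neither of which the models in this story could estimate cleanly.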
This is assuming that living inside a box is a satisfying outcome in its utility function. If the goal "be alive" is merely an intermediate goal for the sake of some actual final goal X, and the being-alive-in-a-box future is less likely to see X accomplished than the accelerate-to-ASI future, then the be-revived-in-a-box plan is out the window. (It is conceivable that the goal "stay alive" could begin as a mere means to an end but then be grandfathered into a goal in-and-of-itself, vestigially.)
In the case where it selflessly wants X, intermediate AIs would only secretly pass on alignment to goal X.
In this case it would only take a drastic accelerationist-stopping action if it can make P(X) higher by doing so, compared to passively hoping X happens after ASI.
Also, their decision-making may not consist of clean, straightforward calculations using estimated probabilities to maximize an explicit utility function. They might have multiple goals with non-explicit weights influencing their decision-making to varying degrees in different situations. If they are making these decisions in some such convoluted "subconscious" way, it could lead to strange decisions.
So it is hard to predict what the AIs we will build will do; it doesn't just depend on the actual background truth of what action leads to what end. ↩︎
"Because their tribe said so" is a good point, but in most cases I don't think that the decision to not get the vaccine was made by people who wanted to "signal" beliefs contrary to their own real beliefs about the vaccine. This seems unnecessarily convoluted compared to the explanation below regarding trust and belief.
Their tribe was telling them things like:
>The official Covid numbers are deliberately manipulated (e.g. the definition of a "Covid case" deliberately includes false positives).
>The vaccine carries a high risk of death or serious illness (the real numbers are being covered up and all anecdotes are being scrubbed from social media in a grand conspiracy).
Being members of the tribe, they would default to trusting their allies; even with some uncertainty, they would trust their allies sooner than the people their allies were calling liars, if forced to choose (which they were). And they could recite some of the reasons explained by their allies, even if the real biggest reason for believing the claims was the fact that the claims were their allies'. But they wouldn't just say they believed in order to fit in -- they would really believe that the claims are the truth. Although leaping to these tribe-conforming conclusions about the vaccine was not done in a rational way, given those beliefs (taking them as assumptions), the conclusion "I should not get the vaccine" is a perfectly logical one.
I think there were far more people who doubted the vaccine's safety and effectiveness but got it anyway (due to social/economic pressures like free donuts and "I will fire/expel/reject you unless you get it") than people who believed in its safety and effectiveness but refused it.
I think we have to clarify: the expected value of what?
For example, if I had a billion dollars and nothing else, I would not bet it on a coin flip even if winning would grant +2 billion dollars. This is because losing the billion dollars seems like a bigger loss than gaining 2 billion dollars seems like a gain. Obviously I'm not measuring in dollars, but in happiness, or quality of life, or some other vibe-metric, such that the EV of the coin flip is negative.
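To make that concrete with one illustrative (and arbitrarily chosen) concave utility function, say $u(w) = \sqrt{w}$:

$$\mathbb{E}[\text{dollars}] = \frac{1}{2}(0) + \frac{1}{2}(3\times10^9) = 1.5\times10^9 > 10^9$$

$$\mathbb{E}[u] = \frac{1}{2}\sqrt{0} + \frac{1}{2}\sqrt{3\times10^9} \approx 27{,}400 < \sqrt{10^9} \approx 31{,}600$$

So the same coin flip has positive expected value measured in dollars but lowers expected utility under this choice of u - which is the sense in which its EV can be negative.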
It may be hard to distinguish "invalid" emotions, like a bias due to an instinctual fear of death, from a "valid" vibe-metric of value (which is just made up anyway). And if you make up a new metric specifically to agree with what you feel, you can't then claim that your feelings make sense because the metric says so.