## LESSWRONG

A Black Swan is better formulated as:
- Extreme tail event: its probability cannot be computed within the current paradigm. Its weight is p < epsilon.
- Extreme impact if it happens: a paradigm revolution.
- Can be rationalised in hindsight, because there were hints. "Most" did not spot the pattern; some may have.

If spotted a priori, one could call it a Dragon King: https://en.wikipedia.org/wiki/Dragon_king_theory

The Argument:
"Math + Evidence + Rationality + Limits makes it Rational to drop Long Tail for Decision Making"
is a prime example of a heuristic which fails into ...
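To make the objection concrete, here is a minimal sketch of how dropping the long tail can flip a decision. All numbers are hypothetical, and the sketch assumes the tail probability is known, which a true Black Swan does not grant; it only shows that a weight below epsilon can still dominate the expected value when the impact is extreme.

```python
def expected_value(outcomes):
    """Sum of probability * payoff over (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)


# Hypothetical cutoff: events with probability below this are ignored.
EPSILON = 1e-6

# Hypothetical bet: a small, almost certain gain and a rare, extreme loss.
outcomes = [
    (1 - 1e-7, 1.0),   # ordinary outcome
    (1e-7, -1e9),      # tail event, p < EPSILON, extreme impact
]

# Expected value over every outcome, tail included.
full_ev = expected_value(outcomes)

# Expected value after dropping the long tail, as the quoted argument suggests.
truncated_ev = expected_value([(p, x) for p, x in outcomes if p >= EPSILON])

print("with tail:   ", full_ev)
print("without tail:", truncated_ev)
```

With these illustrative numbers the truncated estimate is positive while the full one is strongly negative, so the "rational" truncation recommends a bet that the full accounting rejects.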

When it comes to rationality, the Black Swan Theory ( https://en.wikipedia.org/wiki/Black_swan_theory ) is an extremely useful test.

A truly rational paradigm should be built with anti-fragility in mind, especially towards Black Swan events that would challenge its axioms.

1 Gerald Monroe 4mo

This is actually a quote from Arbital. Their article explains the connection.

My point is that SFOT likely never works in any environment relevant to AI Alignment, where such diagonal methods show that any Agent with a fixed Objective Function is crippled by an adequate counter.

Therefore SFOT should not be used when exploring AI alignment.

Can SFOT hold in ad-hoc limited situations that do not represent the real world? Maybe, but that was not my point.

Finding one counter-example that shows SFOT does not hold in a specific setting (Clippy in my scenario) proves that it does not hold in general, which was my goal.

The discussion here is about the strong form. Proving that a « terminal » agent is crippled is exactly what is needed to prove the strong form does not hold.

1 Anon User 5mo
Maybe there is a better way to put it - SFOT holds for objective functions/environments that only depend on the agent I/O behavior. Once the agent itself is embodied, then yes, you can use all kinds of diagonal tricks to get weird counterexamples. Implications for alignment - yes, if your agent is fully explainable and you can transparently examine its workings, chances are that alignment is easier. But that is kind of obvious without having to use SFOT to reason about it. Edited to add: "diagonal tricks" above refers to things in the conceptual neighborhood of https://en.m.wikipedia.org/wiki/Diagonal_lemma

(1) « Liking », or « desire » can be defined as « All other things equal, Agents will go to what they Desire/Like most, whenever given a choice ». Individual desire/liking/tastes vary.

(2) In Evolutionary Game Theory, in a Game where a Mitochondria-like Agent offers you a choice between:

• (Join eukaryotes) mutualistic endosymbiosis, at the cost of obeying apoptosis or being flagged as a Cancerous enemy
• (Non eukaryotes) refusal of this offer, at the cost of being treated by the eukaryotes as a threat or a lesser symbiote

then that Agent is likely to win. To a rational agent, it’s a winning wager. My last publication expands on this.

What would prevent a Human brain from hosting an AI?

FYI some humans have quite impressive skills:

• Hypermnesia, random: 100k digits of Pi (Akira Haraguchi) That’s many kB of utterly random programming.
• Hypermnesia, visual: accurate visual memory (Stephen Wiltshire, NYC Skyline memorised in 10 min)
• Hypermnesia, language: fluency in 40+ languages (Powell Alexander Janulus)
• High IQ, computation, etc. : countless records.

Could a peak human brain act as a (memory-constrained) Universal Turing/Oracle Machine and run a light enough AI, especially if that AI is programmed to use the Human Memory as its Web-like database?

Arbital is where I found this specific wording for the strong form.

Since I wrote this (two weeks ago), I have been working on addressing some lesser forms, as presented in section 4.5 of Stuart Armstrong’s article.

2 TAG 5mo
Arbital says: You say: I don't see the connection.

We can then consider that the « Stronger Strong Form », about « Eternally Terminal » Agents which CANNOT change, does not hold :-)

1 Anon User 6mo
Well, yeah, if you specifically choose a crippled version of the high-U agent that is somehow unable to pursue the winning strategy, it will lose - but IMHO that's not what the discussion here should be about.

(1) « people liking thing does not seem like a relevant parameter of design ».

This is quite a bold statement. I personally agree with the mainstream view that designs are easier to get adopted when the adopters like them.

(2) Nice objection, and the observation of complex life forms gives a potential answer :

• All healthy cells of multicellular organisms obey Apoptosis.
• Apoptosis literally is « suicide in a way that’s easy to recycle because the organism asks you » (the source of the request can be internal via mitochondria, or external, generally le
...
1 VojtaKovarik 6mo
Fair point. I guess "not relevant" is too strong a phrasing. And it would have been more accurate to say something like "people liking things might be neither sufficient nor necessary to get designs adopted, and it is not clear (definitely at least to me) how much it matters compared to other aspects". Re (2): Interesting. I would be curious to know to what extent this is just a surface-level-only metaphor, or unjustified anthropomorphisation of cells, vs actually having implications for AI design. (But I don't understand biology at all, so I don't really have a clue :( .)

Hypothesis:

Basilisk could give a virus to any complex enough Turing Machine, that proves Basilisk’s Wager is either:

• a clear mutualistic win-win with the Basilisks (Hive)
• or a “you will need to waste all your resources trying to avoid our traps”

My first post should be validated soon, and is a proof that the strong form does not hold: in some games, some terminal alignments perform worse than their non-terminal equivalents.

A hypothesis is that most goals, if they become “terminal” (“in itself”, impervious to change), prevent evolution and mutualistic relationships with other agents.

Evolution gives us many organically designed Systems which offer potential Solutions:

• white blood cells move everywhere to spot and kill defective cells with a literal kill switch (apoptosis)

A team of Leucocytes (white blood cells):

• organically checking the whole organisation at all levels
• scanning for signs of amorality/misalignment or any other error
• flagging for surveillance, giving warnings, or sending to a restorative justice process depending on gravity
• agents themselves held to the highest standard

This is a system that can be implemented in a Company and f...

1 VojtaKovarik 6mo
I agree that the general point (biology needs to address similar issues, so we can use it for inspiration) is interesting. (Seems related to https://www.pibbss.ai/ .) That said, I am somewhat doubtful about the implied conclusion (that this is likely to help with AI, because it won't mind): (1) there are already many workspace practices that people don't like, so "people liking things" doesn't seem like a relevant parameter of design, (2) (this is totally vague, handwavy, and possibly untrue, but:) biological processes might also not "like" being surveilled, replaced, etc., so the argument proves too much.