I think a different use of MI is warranted here. While I highly doubt the ability to differentiate whether a value system is meshing well with someone for "good" or "bad" reasons, it seems more plausible to me that you could measure the reversibility of a value system.
The distinguishing feature of a trap here isn't so much the badness as the fact that it's irreversible. If you used interpretability techniques to check whether someone could be reprogrammed out of a belief, you'd avoid a lot of tricky situations.
Potential solution via mechanistic interpretability
Sounds unlikely to me. Because the space of values is so large, I don't expect we can fix upfront a set of "valid mental moves to justify a value", even if these are pretty high-level abstractions. Put another way, I expect even these generators of the space of values (or of the space of "human judgements of values") to be too numerous, and thus to face the same tension between exploration and virality.
The "duty to listen" could be qualified: We accept the duty to listen to value system V, if and only if V endorses the duty to keep listening. We cultivate an allergy to value systems that forbid further listening. The allergy enables us to consider (some) bad viral value systems, without getting trapped in them.
I suspect there are sneakier forms of bad viral value systems that would trap us without triggering the allergy. I wonder what shape these have.
I wrote this in January 2024 as a memo for the 2024 Summit on Existential Security. Will MacAskill suggested in conversation that I post it publicly, so here it is (modulo minor revisions)! Thanks to Will MacAskill, Ben Hilton, and Lionel Levine for comments.
Suppose we get an aligned superintelligence and are in a position to spread our civilization to the stars. It has been proposed that we should continue to reflect on our values, so that we don't end up locking in a future that we wouldn't want upon further reflection. This seems correct to me, but it also seems difficult to do well, with a substantial probability of a catastrophic outcome resulting from a poorly-executed reflection process. This is because there are two competing goals:
Goal 1 (exploration): people should consider a broad enough array of value systems that they eventually encounter really good ones.
Goal 2 (trap avoidance): people should avoid considering traps: value systems that are bad, but so viral that merely considering them would likely lead to permanently adopting them.
(Note that some ideas are problematically viral, but not so viral that merely considering them would permanently lock you into adopting them. Such ideas could be a large source of value loss in the far future.)
See the appendix for a formalization of these goals.
It might be counterintuitive that traps could exist. But I think that by default, they do. That’s because I think viral value systems exist, and I don’t expect a strong correlation between virality and goodness. (I probably expect a positive correlation, but not one that’s strong enough that most viral values are really good. Also, maybe the correlation turns negative at the extreme tail of virality.)
I guess I think the burden of explanation here is on the other side: why would it be the case that the vast majority of viral value systems are not bad? Considering which ideas have been the most viral over human history, I think they have an okay but not great track record.
To satisfy goal 1 (exploration), it is necessary to have something like a duty to listen. In other words, if another person or entity comes up with a new value system that might be substantially better than your current value system, you should consider the new value system. Without a duty to listen, it seems likely that people will end up settling on their final value system prematurely.
To satisfy goal 2 (trap avoidance), it is necessary to avoid considering traps (because a trap has the property that if you consider it, you’re likely to permanently adopt it).
In other words, you need people to consider a broad enough array of ideas that they end up listening to a really good idea at least once, while avoiding all bad + viral ideas. This is really hard, because it somehow forces you to distinguish between really good ideas and bad + viral ideas! This is extra hard because the best ideas will probably be viral! (At least to the extent that it’s generally easier to convince people of true things than of false things.)
To summarize, solving this problem likely involves distinguishing good + viral ideas from bad + viral ideas, without considering them. That sounds really hard!
The problem, as I just stated it, might seem impossible: how can you distinguish ideas without considering them? There’s a bit of slipperiness about what exactly I mean by “consider”, but the relevant notion of “consideration” is: entertaining the idea sufficiently that, if it is viral, you have a large chance of “catching the virus” (adopting the idea).
I work at the Alignment Research Center, where we sometimes think about mechanistic anomaly detection (MAD): using the internals of a neural network to determine whether the network got high reward on an input for the usual reasons (doing a good job) or for an anomalous reason (such as tricking the reward model into thinking it did a good job while doing something else; see e.g. sensor tampering).
Now, imagine a human brain as a neural net. It seems quite plausible that you could do something similar to MAD, if you had sufficiently good interpretability on human brains: you could look at cases where a human became convinced of a claim or value, and try to distinguish cases where the human became convinced of the claim “for good reasons” (e.g. the human followed an argument that seemed sound to the human) versus “for bad reasons” (e.g. considering the idea triggered some sort of exploit/backdoor in the human’s brain).
So imagine that AI capabilities are strong enough to simulate humans considering new ideas. If you can look at a human's internal state (as represented inside the AI) and determine that the human is being convinced of a claim or value "for bad reasons", then the argument can be flagged as anomalous. Once you have an anomaly-screening process in place, maybe you can "scan" arguments for anomalies before exposing humans to them.
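Here is a minimal toy sketch of what that screening step could look like, assuming we could extract some interpretability-derived feature vector at each step of the simulated human's deliberation. The Mahalanobis-distance detector, the feature dimensionality, and the threshold are illustrative stand-ins, not ARC's actual MAD machinery.

```python
import numpy as np

def fit_reference(activations: np.ndarray):
    """Fit a Gaussian to features from episodes where a (simulated) human was
    convinced "for good reasons" (e.g. following an argument that seemed sound)."""
    mean = activations.mean(axis=0)
    cov = np.cov(activations, rowvar=False) + 1e-6 * np.eye(activations.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_score(x: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one deliberation step from the reference distribution."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def screen_argument(episode: np.ndarray, mean, cov_inv, threshold: float) -> bool:
    """Flag an argument as anomalous if any step of the episode looks unlike
    ordinary "convinced for good reasons" cognition."""
    return max(anomaly_score(x, mean, cov_inv) for x in episode) > threshold

# Hypothetical usage: rows are whatever features interpretability gives us while
# the simulated human considers an argument.
rng = np.random.default_rng(0)
good_episodes = rng.normal(size=(500, 16))                    # stand-in "good reasons" data
mean, cov_inv = fit_reference(good_episodes)
new_episode = rng.normal(loc=0.0, scale=3.0, size=(20, 16))   # suspiciously off-distribution
print("flag as anomalous:", screen_argument(new_episode, mean, cov_inv, threshold=8.0))
```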
You could imagine designing a reflection process that circumvents this issue, e.g. by setting up systems that make all ideas much less viral.
(Thanks to Will MacAskill for this idea) Let's say you have 100 clever design ideas for the reflection process, but you're not sure whether they're going to work. You could imagine running them all in parallel in different slices of the universe. Suppose that 10% of them actually end up with good values, while 90% end up in traps. It's possible that there are enough gains to be had from moral trade that the slices can trade and the good values end up 90% satisfied in the whole universe.
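To see how the arithmetic could come out that way, here is a toy calculation under the strong, purely illustrative assumption that the trapped value systems' goals are satiable while the good values scale with resources; all of the numbers are made-up stand-ins.

```python
# Toy model of moral trade between universe slices. Strong illustrative assumption:
# trapped value systems have satiable goals, good values scale with resources.

N_SLICES = 100             # parallel reflection-process designs, one slice each
GOOD = 10                  # slices that end up with good values (10%)
TRAPPED = N_SLICES - GOOD  # slices that end up in traps (90%)
SATIATION = 10.0           # resources at which a trapped value system is fully satisfied

# No trade: each side spends its own slices on its own values.
good_no_trade = GOOD                        # good values realized in only 10 slices
trapped_no_trade = min(TRAPPED, SATIATION)  # trapped values saturate; surplus is wasted

# Trade: trapped slices keep just enough resources to satiate their goals and hand
# the surplus to the good values, in exchange for a small concession they value at
# the margin (modeled here as one slice's worth of resources).
concession = 1.0
good_with_trade = GOOD - concession + (TRAPPED - SATIATION)
trapped_with_trade = min(SATIATION + concession, SATIATION)  # utility capped at satiation

print(f"good values, no trade:   {good_no_trade / N_SLICES:.0%} of the universe")
print(f"good values, with trade: {good_with_trade / N_SLICES:.0%} of the universe")
print(f"trapped values satisfied without trade: {trapped_no_trade:.0f} of {SATIATION:.0f} needed")
print(f"trapped values satisfied with trade:    {trapped_with_trade:.0f} of {SATIATION:.0f} needed")
```

Under these (again, made-up) numbers the good values go from 10% to roughly 90% of the universe, while the trapped values stay fully satisfied.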
(Thanks to Will MacAskill for suggesting something along these lines) Let's say that Alice and Bob agree that a good value system should have property P. Alice claims that she has found a value system V that satisfies P. From Bob's perspective, either Alice is telling the truth, or Alice is trying to spread V (which is viral) to Bob by getting Bob to consider V. Bob can ask Alice for a zero-knowledge proof or argument that V satisfies P, so that Bob can become convinced that V satisfies P without being exposed to the details of V.
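A minimal sketch of the interface Bob wants, with a sandboxed evaluator standing in for an actual zero-knowledge protocol; the value-system representation and the example predicate are hypothetical.

```python
# NOT an actual zero-knowledge proof: a sandboxed evaluator stands in for the
# cryptography, so that Bob learns only the single bit "V satisfies P" and
# never sees V itself.
from typing import Callable

ValueSystem = str          # stand-in representation of a value system
Predicate = Callable[[ValueSystem], bool]

class SandboxedEvaluator:
    """Evaluates P(V) on Alice's behalf without revealing V to Bob."""
    def __init__(self, value_system: ValueSystem):
        self._v = value_system          # Alice's V stays inside the sandbox

    def check(self, predicate: Predicate) -> bool:
        return predicate(self._v)       # only the boolean verdict leaves

# Alice's (possibly viral) value system; Bob must never read this directly.
alices_v: ValueSystem = "<details of V, potentially memetically hazardous>"

# Bob's agreed-upon property P (hypothetical example predicate).
def endorses_further_listening(v: ValueSystem) -> bool:
    return "forbids further listening" not in v

evaluator = SandboxedEvaluator(alices_v)
print("Bob learns only:", evaluator.check(endorses_further_listening))
```

The point of the real cryptographic version would be that Bob doesn't have to trust any sandbox, only the soundness of the proof itself.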
Let us introduce some terms:
Postulate/loose definition: We will imagine a hypothetical “perfect” reflective process (PRP). This is the sort of thing that an outside observer of our universe would look at and be like, “yup, that turned out great, nothing went wrong during reflection and humanity realized its full potential”.
Definition: A value system is good for a person P if it is at least half as good as the value system that P ends up with in the PRP. A value system is bad for a person P if it is at most 1% as good as the value system that P ends up with in the PRP. A value system V is viral for P if merely considering V causes P to permanently adopt V with >50% likelihood. A value system V is a trap for P if it is both viral for P and bad for P.
Definition: A person (or AI or other entity) is informed if they have considered (but not necessarily adopted) a value system that is good for them.
Then we can formalize the two goals as follows:
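As a sketch, using the definitions above (and writing $\mathrm{good}(V,P)$, $\mathrm{trap}(V,P)$, and $\mathrm{considered}(V,P)$ for the obvious predicates over persons $P$ and value systems $V$):

$$\textbf{Goal 1 (exploration):}\quad \forall P\ \exists V:\ \mathrm{good}(V,P)\ \wedge\ \mathrm{considered}(V,P)$$

$$\textbf{Goal 2 (trap avoidance):}\quad \forall P\ \forall V:\ \mathrm{trap}(V,P)\ \Rightarrow\ \neg\,\mathrm{considered}(V,P)$$

That is, every person ends up informed, and no person ever considers a value system that is a trap for them.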