I think a different use of MI is warranted here. While I highly doubt the ability to differentiate whether a value system is meshing well with someone for "good" or "bad" reasons, it seems more plausible to me that you could measure the reversibility of a value system.
The distinguishing feature of a trap here isn't so much the badness as the fact that it's irreversible. If you used interpretability techniques to check whether someone could be reprogrammed out of a belief, you'd avoid a lot of tricky situations.
Potential solution via mechanistic interpretability
Sounds unlikely to me. Because the space of values is so large, I don't expect we can fix upfront a set of "valid mental moves to justify a value", even if these are pretty high-level abstractions. Put another way, I expect even these generators of the space of values (or of the space of "human judgements of values") to be too numerous, and thus to face the same tension between exploration and virality.
The "duty to listen" could be qualified: We accept the duty to listen to value system V, if and only if V endorses the duty to keep listening. We cultivate an allergy to value systems that forbid further listening. The allergy enables us to consider (some) bad viral value systems, without getting trapped in them.
I suspect there are sneakier forms of bad viral value systems that would trap us without triggering the allergy. I wonder what shape these have.
I wrote this in January 2024 as a memo for the 2024 Summit on Existential Security. Will MacAskill suggested in conversation that I post it publicly, so here it is (modulo minor revisions)! Thanks to Will MacAskill, Ben Hilton, and Lionel Levine for comments.
Suppose we get an aligned superintelligence and are in a position to spread our civilization to the stars. It has been proposed that we should continue to reflect on our values, so that we don't end up locking in a future that we wouldn't want upon further reflection. This seems correct to me, but it also seems difficult to do well, with a substantial probability of a catastrophic outcome resulting from a poorly-executed reflection process. This is because there are two competing goals:
Goal 1 (exploration): people should consider a broad enough array of value systems that they eventually encounter really good ones.
Goal 2 (trap avoidance): people should avoid considering traps: value systems that are bad, but so viral that merely considering them would likely lead to permanently adopting them.
(Note that some ideas are problematically viral, but not so viral that merely considering them would permanently lock you into adopting them. Such ideas could be a large source of value loss in the far future.)
See the appendix for a formalization of these goals.
It might be counterintuitive that traps could exist. But I think that by default, they do. That’s because I think viral value systems exist, and I don’t expect a strong correlation between virality and goodness. (I probably expect a positive correlation, but not one that’s strong enough that most viral values are really good. Also, maybe the correlation turns negative at the extreme tail of virality.)
I guess I think the burden of explanation here is on the other side: why would it be the case that the vast majority of viral value systems are not bad? Considering which ideas have been the most viral over human history, I think they have an okay but not great track record.
To satisfy goal 1 (exploration), it is necessary to have something like a duty to listen. In other words, if another person or entity comes up with a new value system that might be substantially better than your current value system, you should consider the new value system. Without a duty to listen, it seems likely that people will end up settling on their final value system prematurely.
To satisfy goal 2 (trap avoidance), it is necessary to avoid considering traps (because a trap has the property that if you consider it, you’re likely to permanently adopt it).
In other words, you need people to consider a broad enough array of ideas that they end up listening to a really good idea at least once, while avoiding all bad + viral ideas. This is really hard, because it somehow forces you to distinguish between really good ideas and bad + viral ideas! This is extra hard because the best ideas will probably be viral! (At least to the extent that it’s generally easier to convince people of true things than of false things.)
To summarize, solving this problem likely involves distinguishing good + viral ideas from bad + viral ideas, without considering them. That sounds really hard!
The problem, as I just stated it, might seem impossible: how can you distinguish ideas without considering them? There’s a bit of slipperiness about what exactly I mean by “consider”, but the relevant notion of “consideration” is: entertaining the idea sufficiently that, if it is viral, you have a large chance of “catching the virus” (adopting the idea).
I work at the Alignment Research Center, where we sometimes think about mechanistic anomaly detection (MAD): using the internals of a neural network to determine whether the network got high reward on an input for the usual reasons (doing a good job) or for an anomalous reason (such as tricking the reward model into thinking it did a good job while doing something else; see e.g. sensor tampering).
Now, imagine a human brain as a neural net. It seems quite plausible that you could do something similar to MAD, if you had sufficiently good interpretability on human brains: you could look at cases where a human became convinced of a claim or value, and try to distinguish cases where the human became convinced of the claim “for good reasons” (e.g. the human followed an argument that seemed sound to the human) versus “for bad reasons” (e.g. considering the idea triggered some sort of exploit/backdoor in the human’s brain).
So imagine that AI capabilities are strong enough to simulate humans considering new ideas. If you can look at a human's internal state (as represented inside the AI) and determine that the human is being convinced of a claim or value "for bad reasons", then the argument can be flagged as anomalous. Once you have an anomaly-screening process in place, maybe you can "scan" arguments for anomalies before exposing humans to them.
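Here is a minimal toy sketch of what that screening step could look like, assuming we could extract some interpretability-derived feature vector at each step of the simulated human's deliberation. The Mahalanobis-distance detector, the feature dimensionality, and the threshold are illustrative stand-ins, not ARC's actual MAD machinery.

```python
import numpy as np

def fit_reference(activations: np.ndarray):
    """Fit a Gaussian to features from episodes where a (simulated) human was
    convinced "for good reasons" (e.g. following an argument that seemed sound)."""
    mean = activations.mean(axis=0)
    cov = np.cov(activations, rowvar=False) + 1e-6 * np.eye(activations.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_score(x: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one deliberation step from the reference distribution."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def screen_argument(episode: np.ndarray, mean, cov_inv, threshold: float) -> bool:
    """Flag an argument as anomalous if any step of the episode looks unlike
    ordinary "convinced for good reasons" cognition."""
    return max(anomaly_score(x, mean, cov_inv) for x in episode) > threshold

# Hypothetical usage: rows are whatever features interpretability gives us while
# the simulated human considers an argument.
rng = np.random.default_rng(0)
good_episodes = rng.normal(size=(500, 16))                    # stand-in "good reasons" data
mean, cov_inv = fit_reference(good_episodes)
new_episode = rng.normal(loc=0.0, scale=3.0, size=(20, 16))   # suspiciously off-distribution
print("flag as anomalous:", screen_argument(new_episode, mean, cov_inv, threshold=8.0))
```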
You could imagine designing a reflection process that circumvents this issue, e.g. by setting up systems that make all ideas much less viral.
(Thanks to Will MacAskill for this idea) Let's say you have 100 clever design ideas for the reflection process, but you're not sure whether they're going to work. You could imagine running them all in parallel in different slices of the universe. Suppose that 10% of them actually end up with good values, while 90% end up in traps. It's possible that there are enough gains to be had from moral trade that the slices can trade and the good values end up 90% satisfied in the whole universe.
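To see how the arithmetic could come out that way, here is a toy calculation under the strong, purely illustrative assumption that the trapped value systems' goals are satiable while the good values scale with resources; all of the numbers are made-up stand-ins.

```python
# Toy model of moral trade between universe slices. Strong illustrative assumption:
# trapped value systems have satiable goals, good values scale with resources.

N_SLICES = 100             # parallel reflection-process designs, one slice each
GOOD = 10                  # slices that end up with good values (10%)
TRAPPED = N_SLICES - GOOD  # slices that end up in traps (90%)
SATIATION = 10.0           # resources at which a trapped value system is fully satisfied

# No trade: each side spends its own slices on its own values.
good_no_trade = GOOD                        # good values realized in only 10 slices
trapped_no_trade = min(TRAPPED, SATIATION)  # trapped values saturate; surplus is wasted

# Trade: trapped slices keep just enough resources to satiate their goals and hand
# the surplus to the good values, in exchange for a small concession they value at
# the margin (modeled here as one slice's worth of resources).
concession = 1.0
good_with_trade = GOOD - concession + (TRAPPED - SATIATION)
trapped_with_trade = min(SATIATION + concession, SATIATION)  # utility capped at satiation

print(f"good values, no trade:   {good_no_trade / N_SLICES:.0%} of the universe")
print(f"good values, with trade: {good_with_trade / N_SLICES:.0%} of the universe")
print(f"trapped values satisfied without trade: {trapped_no_trade:.0f} of {SATIATION:.0f} needed")
print(f"trapped values satisfied with trade:    {trapped_with_trade:.0f} of {SATIATION:.0f} needed")
```

Under these (again, made-up) numbers the good values go from 10% to roughly 90% of the universe, while the trapped values stay fully satisfied.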
(Thanks to Will MacAskill for suggesting something along these lines) Let's say that Alice and Bob agree that a good value system should have property P. Alice claims that she has found a value system V that satisfies P. From Bob's perspective, either Alice is telling the truth, or Alice is trying to spread V (which is viral) to Bob by getting Bob to consider V. Bob can ask Alice for a zero-knowledge proof or argument that V satisfies P, so that Bob can become convinced that V satisfies P without being exposed to the details of V.
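A minimal sketch of the interface Bob wants, with a sandboxed evaluator standing in for an actual zero-knowledge protocol; the value-system representation and the example predicate are hypothetical.

```python
# NOT an actual zero-knowledge proof: a sandboxed evaluator stands in for the
# cryptography, so that Bob learns only the single bit "V satisfies P" and
# never sees V itself.
from typing import Callable

ValueSystem = str          # stand-in representation of a value system
Predicate = Callable[[ValueSystem], bool]

class SandboxedEvaluator:
    """Evaluates P(V) on Alice's behalf without revealing V to Bob."""
    def __init__(self, value_system: ValueSystem):
        self._v = value_system          # Alice's V stays inside the sandbox

    def check(self, predicate: Predicate) -> bool:
        return predicate(self._v)       # only the boolean verdict leaves

# Alice's (possibly viral) value system; Bob must never read this directly.
alices_v: ValueSystem = "<details of V, potentially memetically hazardous>"

# Bob's agreed-upon property P (hypothetical example predicate).
def endorses_further_listening(v: ValueSystem) -> bool:
    return "forbids further listening" not in v

evaluator = SandboxedEvaluator(alices_v)
print("Bob learns only:", evaluator.check(endorses_further_listening))
```

The point of the real cryptographic version would be that Bob doesn't have to trust any sandbox, only the soundness of the proof itself.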
Let us introduce some terms:
Postulate/loose definition: We will imagine a hypothetical “perfect” reflective process (PRP). This is the sort of thing that an outside observer of our universe would look at and be like, “yup, that turned out great, nothing went wrong during reflection and humanity realized its full potential”.
Definition: A value system is good for a person P if it is at least half as good as the value system that P ends up with in the PRP. A value system is bad for a person P if it is at most 1% as good as the value system that P ends up with in the PRP. A value system V is viral for P if merely considering V causes P to permanently adopt V with >50% likelihood. A value system V is a trap for P if it is both viral for P and bad for P.
Definition: A person (or AI or other entity) is informed if they have considered (but not necessarily adopted) a value system that is good for them.
Then we can formalize the two goals as follows:
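As a sketch, using the definitions above (and writing $\mathrm{good}(V,P)$, $\mathrm{trap}(V,P)$, and $\mathrm{considered}(V,P)$ for the obvious predicates over persons $P$ and value systems $V$):

$$\textbf{Goal 1 (exploration):}\quad \forall P\ \exists V:\ \mathrm{good}(V,P)\ \wedge\ \mathrm{considered}(V,P)$$

$$\textbf{Goal 2 (trap avoidance):}\quad \forall P\ \forall V:\ \mathrm{trap}(V,P)\ \Rightarrow\ \neg\,\mathrm{considered}(V,P)$$

That is, every person ends up informed, and no person ever considers a value system that is a trap for them.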