I don't think these articles should make up a high proportion of the content on Less Wrong, but I think it's good if things like this are occasionally discussed.
Great article.
One point of disagreement: I suspect that the difficulty of the required high-impact tasks relates more to someone's views on the offense-defense balance than to alignment difficulty per se.
Just to add to this:
Beliefs can be self-reinforcing in predictive processing theory because higher-level beliefs can shape lower-level observations. So the hypersensitisation that Delton has noted can reinforce itself.
Steven Byrnes provides an explanation here, but I think he's neglecting the potential for belief systems/systems of interpretation to be self-reinforcing.
Predictive processing claims that our expectations influence what we observe, so experiencing pain in a scenario can produce the opposite of a placebo effect (a nocebo effect), where the pain sensitises us. Some degree of sensitisation is evolutionarily advantageous: if you've hurt a part of your body, being more sensitive makes you more likely to notice when you're putting too much strain on it. However, it can also make you experience pain from minor sensations that aren't actually indicative of anything wrong. In the worst case, this pain ends up being self-reinforcing.
https://www.lesswrong.com/posts/BgBJqPv5ogsX4fLka/the-mind-body-vicious-cycle-model-of-rsi-and-back-pain
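To make the self-reinforcing loop concrete, here's a toy simulation (entirely my own construction with arbitrary numbers, not a model from the linked post or from the predictive processing literature): perceived pain is the raw bodily signal scaled by a learned sensitivity, and whenever perceived pain crosses an "alarm" threshold the sensitivity ratchets up; otherwise it relaxes toward baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

THRESHOLD = 1.6   # perceived pain above this is read as "something is wrong"
UP = 0.15         # sensitisation rate when the alarm fires
DECAY = 0.02      # recovery rate toward baseline otherwise

def simulate(initial_sensitivity: float, steps: int = 500) -> float:
    """Return the final sensitivity after `steps` of harmless sensations."""
    s = initial_sensitivity
    for _ in range(steps):
        raw = rng.uniform(0.5, 1.5)   # harmless, noisy bodily signal
        perceived = s * raw           # expectation scales the percept
        if perceived > THRESHOLD:
            s += UP                   # pain reinforces the expectation of pain
        else:
            s -= DECAY * (s - 1.0)    # otherwise drift back toward baseline
    return s

print(simulate(1.0))  # never alarms (1.0 * 1.5 < 1.6): stays at baseline
print(simulate(1.2))  # occasional alarms compound: sensitivity keeps climbing
```

The point of the toy is the tipping point: at baseline sensitivity, harmless signals never cross the threshold, but once sensitivity is pushed high enough (say, by a real injury), ordinary sensations start triggering the alarm, which raises sensitivity further.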
Interesting work.
This post has made me realise that constitutional design is surprisingly neglected in the AI safety community.
Designing the right constitution won't save the world by itself, but it's a potentially easy win that could put us in a better strategic situation down the line.
I guess I'm worried that allowing insurance for disasters above a certain size could go pretty badly if it creates moral hazard, making labs more willing to be reckless.
Thank you for your service!
For what it's worth, I feel that the bar for being a valuable member of the AI Safety Community is much more attainable than the bar for working in AI Safety full-time.
If the strong AI has knowledge of the benchmarks (or can make correct guesses about how they were structured), then it might be able to find heuristics that work well on them but not more generally. Some of these heuristics might seem more likely than not to be correct to the humans evaluating them.
Still seems like a useful technique if the more powerful model isn't much more powerful.
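As a toy illustration of the benchmark-gaming worry above (my own construction; the features and numbers are made up), a heuristic can score perfectly on a benchmark whose labels happen to correlate with a shortcut feature, then fall to chance once that correlation breaks:

```python
import numpy as np

rng = np.random.default_rng(0)

# On the "benchmark", a shortcut feature happens to agree with the label.
n = 1000
labels = rng.integers(0, 2, n)
real_feature = labels.copy()   # genuinely informative feature
shortcut = labels.copy()       # spuriously correlated on the benchmark
benchmark = np.stack([real_feature, shortcut], axis=1)

# A heuristic that latched onto the shortcut rather than the real signal.
def shortcut_heuristic(x: np.ndarray) -> np.ndarray:
    return x[:, 1]

print((shortcut_heuristic(benchmark) == labels).mean())  # 1.0 on the benchmark

# On fresh data the shortcut is decorrelated from the label.
new_labels = rng.integers(0, 2, n)
fresh = np.stack([new_labels, rng.integers(0, 2, n)], axis=1)
print((shortcut_heuristic(fresh) == new_labels).mean())  # ~0.5, i.e. chance
```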
I really like the way that you've approached this pragmatically: "If you do X, which may be risky or dubious, at least do Y."
I suspect that there's a lot of alpha in taking a similar approach to other issues.
I’m confused by your use of Shapley values. Shapley values assume that the “coalition” can form in any order, but that doesn’t seem like a good fit for language models, where order is important.
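To spell out the order-averaging (this is just the standard definition): the Shapley value of player $i$ is

$$\phi_i(v) = \frac{1}{n!} \sum_{\pi} \Big[ v\big(P_i^{\pi} \cup \{i\}\big) - v\big(P_i^{\pi}\big) \Big],$$

where the sum ranges over all $n!$ orderings $\pi$ of the players and $P_i^{\pi}$ is the set of players preceding $i$ in $\pi$. This requires the characteristic function $v$ to assign a value to each set of players independently of the order in which they joined, which is exactly the assumption that looks shaky when the “players” are tokens or prompt components whose contributions depend on position.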