Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is a response to "The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda".


You evidently follow a variant of 80,000 Hours' framework for comparing problems by the expected impact of solving them: Neglectedness x Scale (potential upside) x Solvability.

I think for assessing AI safety ideas, agendas, and problems to solve, we should augment the assessment with another factor: the potential for a Waluigi turn, or more prosaically, the uncertainty about the sign of the impact (scale) and, therefore, the risks of solving the given problem or advancing far on the given agenda.
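
To make the proposed extra factor concrete, here is a minimal sketch of how sign uncertainty could be folded into the Neglectedness x Scale x Solvability product. Everything in it is an assumption made for illustration: the `Agenda` fields, the `p_positive` and `downside_ratio` parameters, and the example numbers are hypothetical, and this is not a formula used by 80,000 Hours or AE Studio.

```python
from dataclasses import dataclass


@dataclass
class Agenda:
    """Hypothetical container for the factors discussed above."""
    name: str
    neglectedness: float
    scale: float           # magnitude of impact if the sign turns out positive
    solvability: float
    p_positive: float      # assumed probability that the impact has a positive sign
    downside_ratio: float = 1.0  # assumed size of harm relative to `scale`


def naive_score(a: Agenda) -> float:
    """The original three-factor product, which ignores the sign of the impact."""
    return a.neglectedness * a.scale * a.solvability


def sign_adjusted_score(a: Agenda) -> float:
    """Scale the naive score by the expected sign of the impact."""
    expected_sign = a.p_positive - (1.0 - a.p_positive) * a.downside_ratio
    return naive_score(a) * expected_sign


# Two made-up agendas with identical naive scores but different sign uncertainty.
safe_bet = Agenda("agenda A", 0.8, 10.0, 0.5, p_positive=0.9)
risky_bet = Agenda("agenda B", 0.8, 10.0, 0.5, p_positive=0.4)

for agenda in (safe_bet, risky_bet):
    print(agenda.name, naive_score(agenda), round(sign_adjusted_score(agenda), 3))
```

In this toy setup the two agendas are indistinguishable under the three-factor product, while the sign-adjusted score of the second one comes out negative, which is the kind of case the extra factor is meant to flag.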

This reminds me of Taleb's mantra that to survive, we need to make many bets, but also limit the downside potential of each bet, i.e., the "ruin potential". See "The Logic of Risk Taking".

Of the approaches that you listed, some sound risky to me in this respect. In particular, "4. ‘Reinforcement Learning from Neural Feedback’ (RLNF)" sounds to me like a direct invitation to wireheading. More generally, scaling BCI in any form without falling into a dystopia at some stage is akin to walking a tightrope (at least at the current stage of civilisational maturity, I would say). This speaks to agendas #2 and #3 on your list.

There are similar qualms about AI interpretability: at least four posts on LW warn of its potential risks.

This speaks to the agenda "9. Neuroscience x mechanistic interpretability" on your list.

1 comment

As an aside, Rethink Priorities' cross-cause cost-effectiveness model (CCM) automatically prompts consideration of downside risk as part of the calculation template, so to speak. Their placeholder values for a generic AI misalignment x-risk mitigation megaproject are:

  • 97.3% chance of having no effect (all parameters are changeable by the way)
  • 70% chance of positive effect conditional on the above not occurring, and hence
  • 30% chance of negative effect, which leads to 
  • 30% increase in probability of extinction (relative to the positive counterfactual's effect, not total p(doom))
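
As a rough worked example, the placeholder values above can be combined into outcome probabilities and an expected effect. This is only a sketch under stated assumptions: it reads the last bullet as "the harm in the negative case is 30% as large as the benefit in the positive case", it measures everything in units of the positive effect, and the `loss_weight` parameter is a hypothetical stand-in for risk aversion, not one of the weighting schemes RP's CCM actually implements.

```python
def outcome_probabilities(p_no_effect: float, p_positive_given_effect: float):
    """Split the probability mass into positive and negative outcomes."""
    p_effect = 1.0 - p_no_effect
    p_positive = p_effect * p_positive_given_effect
    p_negative = p_effect * (1.0 - p_positive_given_effect)
    return p_positive, p_negative


def expected_effect(p_positive: float, p_negative: float,
                    negative_magnitude: float, loss_weight: float = 1.0) -> float:
    """Expected effect in units of the positive effect.

    `loss_weight` > 1 is a hypothetical way to penalise harms more than
    equally sized benefits; it is an illustration, not RP's method.
    """
    return p_positive * 1.0 - loss_weight * p_negative * negative_magnitude


# Placeholder values quoted in the list above.
p_pos, p_neg = outcome_probabilities(p_no_effect=0.973, p_positive_given_effect=0.70)
print(f"P(positive effect) = {p_pos:.4f}")   # about 0.0189
print(f"P(negative effect) = {p_neg:.4f}")   # about 0.0081

# ASSUMPTION: harm is 30% of the size of the positive effect.
print(f"risk-neutral expected effect = {expected_effect(p_pos, p_neg, 0.30):.4f}")
print(f"with loss_weight=10          = {expected_effect(p_pos, p_neg, 0.30, 10.0):.4f}")
```

Under these assumptions the risk-neutral expectation stays positive, and it only turns negative once harms are weighted several times more heavily than benefits.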

The exact figures RP's CCM spits out aren't that meaningful; what's more interesting to me are the estimates under alternative weighting schemes for incorporating risk aversion (Table 1), pretty much all of which are negative. My main takeaway from this sort of exercise is the importance of reducing sign uncertainty, which isn't a new point but sometimes appears underemphasized.