In Part I of CLR's safe Pareto improvements (SPI) agenda, we gave our high-level strategy for evaluating models for SPI-incompatible behavior and reasoning. This guide gives more details on how I’m thinking about executing on this strategy, especially:
* the kind of workflow I think we should use, to start...
Executive summary
* Safe Pareto improvements (SPIs) are ways of changing agents’ bargaining strategies that make all parties better off, regardless of their original strategies. SPIs are an unusually robust approach to preventing catastrophic conflict between AI systems, especially AIs capable of credible commitments. This is because SPIs can reduce...
(Subtitle: “And ethics, and epistemology, and…”. Cross-posted from my Substack.) We want to make decisions for good reasons. But I worry some common approaches to decision theory stray from this purpose. They start with a bottom-line verdict, “I should choose this action”, then use this verdict to justify claims about...
(Cross-posted from my Substack.) Here’s an important way people might often talk past each other when discussing the role of intuitions in philosophy.[1]
Intuitions as predictors
When someone appeals to an intuition to argue for something, it typically makes sense to ask how reliable their intuition is. Namely, how reliable...
I’ve argued in my unawareness sequence that when we properly account for our severe epistemic limitations, we are clueless about our impact from an impartial altruistic perspective. However, this argument and my responses to counterarguments involve a lot of moving parts. And the term “clueless” gets used in various importantly...
As many folks in AI safety have observed, even if well-intentioned actors succeed at intent-aligning highly capable AIs, they’ll still face some high-stakes challenges.[1] Some of these challenges are especially exotic and could be prone to irreversible, catastrophic mistakes. E.g., deciding whether and how to do acausal trade. To deal...
We’re finally ready to see why unawareness so deeply undermines action guidance from impartial altruism. Let’s recollect the story thus far:
1. First: Under unawareness, “just take the expected value” is unmotivated.
2. Second: Likewise, “do what seems intuitively good and high-leverage, then you’ll at least do better than chance”...