It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive
Maybe I missed something here, but how is this supposed to help with deception? I thought the whole reason deceptive alignment is really hard to solve is that you can't tell if the AI's being deceptive via its behavior.
That all sounds fair. I've seen rationalists claim before that it's better for "interesting" things (in the literal sense) to exist than not, even if nothing sentient is interested by them, so that's why I assumed you meant the same.
Why does the person asking this question care about whether "interesting"-to-humans things happen, in a future where no humans exist to find them interesting?
Perhaps the crux here is whether we should expect all superintelligent agents to converge on the same decision procedure—and the agent themselves will expect this, such that they'll coordinate by default? As sympathetic as I am to realism about rationality, I put a pretty nontrivial credence on the possibility that this convergence just won't occur, and persistent disagreement (among well-informed people) about the fundamentals of what it means to "win" in decision theory thought experiments is evidence of this.
From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner's dilemma. I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself.
I don't see how this makes the point you seem to want it to make. There's still an equilibrium selection problem for a program game of one-shot PD—some other agent might have the program that insists (through a biased coin flip) on an outcome that's just barely better for you than defect-defect. It's clearly easier to coordinate on a cooperate-cooperate program equilibrium in PD or any other symmetric game, but in asymmetric games there are multiple apparently "fair" Schelling points. And even restricting to one-shot PD, the whole commitment races problem is that the agents don't have common knowledge before they choose their programs.
Sort of! This paper (of which I’m a coauthor) discusses this “unraveling” argument, and the technical conditions under which it does and doesn’t go through. Briefly:
The amount of EV at stake in my (and others') experiences over the next few years/decades is just too small compared to the EV at stake in the long-term future.
AI alignment isn't the only option to improve the EV of the long-term future, though.
I think “the very repugnant conclusion is actually fine” does pretty well against its alternatives. It’s totally possible that our intuitive aversion to it comes from just not being able to wrap our brains around some aspect of (a) how huge the numbers of “barely worth living” lives would have to be, in order to make the very repugnant conclusion work; (b) something that is just confusing about the idea of “making it possible for additional people to exist.”
While this doesn't sound crazy to me, I'm skeptical that my anti-VRC intuitions can be explained by these factors. I think you can get something "very repugnant" on scales that our minds can comprehend (and not involving lives that are "barely worth living" by classical utilitarian standards). Suppose you can populate* some twin-Earth planet with either a) 10 people with lives equivalent to the happiest person on real Earth, or b) one person with a life equivalent to the most miserable person on real Earth plus 8 billion people with lives equivalent to the average resident of a modern industrialized nation.
I'd be surprised if a classical utilitarian thought the total happiness minus suffering in (b) was less than in (a). Heck, 8 billion might be pretty generous. But I would definitely choose (a).
To me the very-repugnance just gets much worse the more you scale things up. I also find that basically every suffering-focused EA I know is not scope-neglectful about the badness of suffering (at least, when it's sufficiently intense), or in any area other than population ethics. So it would be pretty strange if we just happened to be falling prey to that error in thought experiments where there's another explanation—i.e., we consider suffering especially important—which is consistent with our intuitions about cases that don't involve large numbers.* As usual, ignore the flow-through effects on other lives.
I feel confused as to how step (3) is supposed to work, especially how "having the training be done by the model being trained given access to tools from (2)" is a route to this.
At some step in the amplification process, we'll have systems that are capable of deception, unlike the base case. So it seems that if we let the model train its successor using the myopia-verification tools, we need some guarantee that the successor is non-deceptive in the first place. (Otherwise the myopia-verification tools aren't guaranteed to work, as you note in the bullet points of step (2).) Are you supposing that there's some property other than myopia that the model could use to verify that its successor is non-deceptive, such that it can successfully verify myopia? What is that property? And do we have reason to think that property will only be guaranteed if the model doing the training is myopic? (Otherwise why bother with myopia at all—just use that other property to guarantee non-deception.)
Intuitively step (3) seems harder than (2), since in (3) you have to worry about deception creeping in to the more powerful successor agent, while (2) by definition only requires myopia verification of non-deceptive models.
ETA: Other than this confusion, I found this post helpful for understanding what success looks like to (at least one) alignment researcher, so thanks!
Thanks, this makes it pretty clear to me how alignment could be fundamentally hard besides deception. (The problem seems to hold even if your values are actually pretty simple; e.g. if you're a pure hedonistic utilitarian and you've magically solved deception, you can still fail at outer alignment by your AI optimizing for making it look like there's more happiness and less suffering.)
Some (perhaps basic) notes to check that I've understood this properly: