In a lab experiment in which it was told it would be shut down, Claude sometimes blackmailed fictional employees who controlled its shutdown button (again, we also tested frontier models from all the other major AI developers, and they often did the same thing).
Why is it always the blackmail result that gets reported from this paper? Frontier models were also found willing to cause a fictional employee's death to avoid shutdown. It's weird to me that that's so often ignored.
Here's another justification for hyperbolic discounting, drawing on the idea that you're less psychologically connected to your future selves.
I've always seen this idea attributed to Martin Weitzman, and he cites these papers as making a similar point. Seems like an interesting case of simultaneous discovery: four papers making the same sort of point all appearing between 1996 and 1999.
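(For readers who haven't seen the formalism: the contrast people usually have in mind, in illustrative notation rather than anything taken from those specific papers, is between an exponential and a hyperbolic discount function.)

$$D_{\text{exp}}(t) = e^{-r t}, \qquad D_{\text{hyp}}(t) = \frac{1}{1 + k t}.$$

The exponential form has a constant instantaneous discount rate $r$, whereas the hyperbolic form's rate $k/(1+kt)$ declines over time, and that declining rate is the feature the psychological-connectedness argument and the Weitzman-style argument are both trying to justify.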
What's your current view? That we should aim for virtuousness instead of corrigibility?
Are uploads conscious? What about AIs? Should we care about shrimp? What population ethics views should we have? What about acausal trade? What about Pascal's wager? What about meaning? What about diversity?
It sounds like you're saying an AI has to get these questions right in order to count as aligned, and that's part of the reason why alignment is hard. But I expect that many people in the AI industry don't care about alignment in this sense, and instead just care about the 'follow instructions' sense of alignment.
Yeah, I think the only thing that really matters is the frequency with which bills are dropped, and train stations seem like high-frequency places in that respect.
More reasons to worry about relying on constraints:
Great post! Tiny thing: is the speed prior really best understood as a prior? Surely the only way being slow can count against a cognitive pattern is if it leads to lower reward. And in that case it seems like speed is a behavioral selection pressure rather than a prior.
Yeah, risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we'll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who would likely be more confident of successful rebellion than of getting paid for cooperation).
This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work. That's a disanalogy with well-functioning scientific fields: scientists don't deliberately pick bad research directions and try hard to make them seem good.