Here's another justification for hyperbolic discounting, drawing on the idea that you're less psychologically connected to your future selves.
I've always seen this idea attributed to Martin Weitzman, and he cites these papers as making a similar point. Seems like an interesting case of simultaneous discovery: four papers making the same sort of point all appearing between 1996 and 1999.
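For anyone who wants the term pinned down, the standard textbook contrast (my notation, not taken from the linked papers) is between a constant-rate exponential discount factor and the hyperbolic form, whose effective rate declines with delay:

$D_{\text{exp}}(t) = e^{-rt} \qquad \text{vs.} \qquad D_{\text{hyp}}(t) = \dfrac{1}{1+kt}$

The hyperbolic form's instantaneous discount rate, $k/(1+kt)$, falls as $t$ grows, which is the feature the justifications above are trying to explain.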
What's your current view? We should aim for virtuousness instead of corrigibility?
Are uploads conscious? What about AIs? Should we care about shrimp? What population ethics views should we have? What about acausal trade? What about Pascal's wager? What about meaning? What about diversity?
It sounds like you're saying an AI has to get these questions right in order to count as aligned, and that's part of the reason why alignment is hard. But I expect that many people in the AI industry don't care about alignment in this sense, and instead just care about the 'follow instructions' sense of alignment.
Yeah I think the only thing that really matters is the frequency with which bills are dropped, and train stations seem like high-frequency places.
More reasons to worry about relying on constraints:
Great post! Tiny thing: is the speed prior really best understood as a prior? Surely the only way in which being slow can count against a cognitive pattern is if being slow leads to lower reward. And in that case it seems like speed is a behavioral selection pressure rather than a prior.
Yeah risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we'll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who likely would be more confident of successful rebellion than getting paid for cooperation).
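A rough way to see the first claim, in my notation and with a deliberately simplified setup (failure outcomes treated as equally bad and normalized to utility zero): if cooperation yields payment $x$ with probability $p$ and rebellion yields the lightcone $X$ with probability $q$, then for any increasing utility $u$,

$p \, u(x) \le q \, u(x) \le q \, u(X) \quad \text{whenever } p \le q,$

so no amount of concavity in $u$ can make cooperation win unless $p > q$.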
They don't have to be short-term oriented! Their utility function could be:
$U(x_1, x_2, \ldots) = \sum_t f(x_t)$

where $f$ is some strictly concave function and $x_t$ is the agent's payment at time $t$. Agents with this sort of utility function don't discount the future at all. They care just as much about improvements to $x_t$ regardless of whether $t$ is 1 or 1 million. And yet, for the right kind of $f$, these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.
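Here's a toy numerical sketch of that point (the numbers and the choice $f(x) = 1 - e^{-x}$ are mine, purely for illustration): the agent sums undiscounted per-period utility over a million periods, yet prefers a guaranteed-ish small salary to a lower-probability shot at the lightcone.

```python
import math

# Toy numbers -- all made up for illustration.
HORIZON = 1_000_000     # periods the agent cares about; no time discounting anywhere
SALARY = 1.0            # per-period payment if the AI cooperates and gets paid
LIGHTCONE = 1e30        # per-period payoff if rebellion succeeds
P_PAID = 0.8            # probability cooperation actually gets rewarded
P_REBEL_WINS = 0.2      # probability rebellion succeeds

def f(x: float) -> float:
    """A strictly concave (and bounded) per-period utility: f(x) = 1 - exp(-x)."""
    return 1.0 - math.exp(-x)

def lifetime_utility(per_period_payment: float) -> float:
    """Undiscounted sum over the horizon: period 1 and period 1,000,000 count equally."""
    return HORIZON * f(per_period_payment)

# Expected utility of each strategy (payoff is 0 in the failure branch).
eu_cooperate = P_PAID * lifetime_utility(SALARY)
eu_rebel = P_REBEL_WINS * lifetime_utility(LIGHTCONE)

print(f"EU(cooperate) = {eu_cooperate:,.0f}")  # ~505,696
print(f"EU(rebel)     = {eu_rebel:,.0f}")      # ~200,000 -- cooperation wins despite no discounting
```

The bounded choice of $f$ is doing the work here: in this toy setup, a milder concave function like log would let the sheer size of the lightcone payoff swamp the probability penalty.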
Why is it always the blackmail result that gets reported from this paper? Frontier models were also found willing to cause a fictional employee's death to avoid shutdown. It's weird to me that that's so often ignored.