As some of you may know, I'm editing a special issue of the journal Philosophical Studies on AI safety (along with @Dan H). I thought I'd share the first paper from the issue, which deals with some issues in AI safety theory that have been frequently discussed on LessWrong.

Here's the abstract:

A simple argument suggests that we can fruitfully model advanced AI systems using expected utility theory. According to this argument, an agent will need to act as if maximising expected utility if they’re to avoid exploitation. Insofar as we should expect advanced AI to avoid exploitation, it follows that we should expect advanced AI to act as if maximising expected utility. I spell out this argument more carefully and demonstrate that it fails, but show that the manner of its failure is instructive: in exploring the argument, we gain insight into how to model advanced AI systems.

You can find the paper here: https://link.springer.com/article/10.1007/s11098-023-02023-4

This was great!

In the Avoidance section, I'd like to have seen discussion of the argument that if some specific observer is unable to exploit an AI, then it's useful for the observer to model the AI as maximizing utility. This argument (like other Avoidance arguments) is a little circular, because "exploitation" is defined relative to some scoring function, so we're left with the question of how we learned about the lack of exploitability relative to this specific function. But there's still a substantive kernel, which is that the real AI might be some complicated, flawed process, but (arguendo) from the perspective of some even-more-flawed observer the best model might be simple. E.g. a poor chess player might not have any better model of an above-average chess player than just expecting them to make high-scoring moves.

This also makes me more interested in the author's perspective on other possible models of smarter-than-human decision-making.