How to safely use an optimizer
Summary: The post describes a method that allows us to use an untrustworthy optimizer to find satisficing outputs. Acknowledgements: Thanks to Benjamin Kolb (@benjaminko), Jobst Heitzig (@Jobst Heitzig) and Thomas Kehrenberg (@Thomas Kehrenberg) for many helpful comments. Introduction Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function f as input to the Oracle, it will output an element x that has an impressively low[1] value of f(x). But sadly, you don't have any guarantee that it will output the optimal element and e.g. not one that's also chosen for a different purpose (which might be dangerous for many reasons, e.g. instrumental convergence). What questions can you safely ask the Oracle? Can you use it to create utopia by asking for designs of machines, proteins, computer programs, etc.? Or are you risking the destruction of everything that we value if you dare to use such designs? As an example, the Oracle could be a learned system; in that case, the topic of this post would be finding a way to get useful work out of the Oracle despite its inner misalignment. In this post I'll describe a technique that allows us to safely use the Oracle under fairly weak assumptions. This approach can also be considered to be a way of controlling arbitrarily powerful AI systems. Edit: I've written a bit more on the motivation for this setting in a comment. One neat trick > This isn't fair, isn't fair, isn't fair! There's a limit to how many constraints you can add to a problem before it really is impossible! > > (Harry Potter and the Methods of Rationality, Chapter 56) Let O be a finite set of possible outputs of the Oracle (e.g. strings of length at most l) and f:O→R be our objective function. Let's assume we are happy with an output that satisfices; i.e. we want to find an output x such that the value of f(x) is lower than s
Corinne and I made the donation swap, thank you for offering though!