A putative new idea for AI control; index here.
Partially inspired by as conversation with Daniel Dewey.
People often come up with a single great idea for AI, like "complexity" or "respect", that will supposedly solve the whole control problem in one swoop. Once you've done it a few times, it's generally trivially easy to start taking these ideas apart (first step: find a bad situation with high complexity/respect and a good situation with lower complexity/respect, make the bad very bad, and challenge on that). The general responses to these kinds of idea are listed here.
However, it seems to me that rather than constructing counterexamples each time, we should have a general category and slot these ideas into them. And not only have a general category with "why this can't work" attached to it, but "these are methods that can make it work better". Seeing the things needed to make their idea better can make people understand the problems, where simple counter-arguments cannot. And, possibly, if we improve the methods, one of these simple ideas may end up being implementable.
The category I'm proposing to define is that of "crude measures". Crude measures are methods that attempt to rely on non-fully-specified features of the world to ensure that an underdefined or underpowered solution does manage to solve the problem.
To illustrate, consider the problem of building an atomic bomb. The scientists that did it had a very detailed model of how nuclear physics worked, the properties of the various elements, and what would happen under certain circumstances. They ended up producing an atomic bomb.
The politicians who started the project knew none of that. They shovelled resources, money and administrators at scientists, and got the result they wanted - the Bomb - without ever understanding what really happened. Note that the politicians were successful, but it was a success that could only have been achieved at one particular point in history. Had they done exactly the same thing twenty years before, they would not have succeeded. Similarly, Nazi Germany tried a roughly similar approach to what the US did (on a smaller scale) and it went nowhere.
So I would define "shovel resources at atomic scientists to get a nuclear weapon" as a crude measure. It works, but it only works because there are other features of the environment that are making it work. In this case, the scientists themselves. However, certain social and human features about those scientists (which politicians are good at estimating) made it likely to work - or at least more likely to work than shovelling resources at peanut-farmers to build moon rockets.
In the case of AI, advocating for complexity is similarly a crude measure. If it works, it will work because of very contingent features about the environment, the AI design, the setup of the world etc..., not because "complexity" is intrinsically a solution to the FAI problem. And though we are confident that human politicians have some good enough idea about human motivations and culture that the Manhattan project had at least some chance of working... we don't have confidence that those suggesting crude measures for AI control have a good enough idea to make their idea works.
It should be evident that "crudeness" is on a sliding scale; I'd like to reserve the term for proposed solutions to the full FAI problem that do not in any way solve the deep questions about FAI.
More or less crude
The next question is, if we have a crude measure, how can we judge its chance of success? Or, if we can't even do that, can we at least improve the chances of it working?
The main problem is, of course, that of optimising. Either optimising in the sense of maximising the measure (maximum complexity!) or of choosing the measure that is most extreme fit to the definition (maximally narrow definition of complexity!). It seems we might be able to do something about this.
Let's start by having AI create sample a large class of utility functions. Require them to be around the same expected complexity as human values. Then we use our crude measure μ - for argument's sake, let's make it something like "approval by simulated (or hypothetical) humans, on a numerical scale". This is certainly a crude measure.
We can then rank all the utility functions u, using μ to measure the value of "create M(u), a u-maximising AI, with this utility function". Then, to avoid the problems with optimisation, we could select a certain threshold value and pick any u such that E(μ|M(u)) is just above the threshold.
How to pick this threshold? Well, we might have some principled arguments ("this is about as good a future as we'd expect, and this is about as good as we expect that these simulated humans would judge it, honestly, without being hacked").
One thing we might want to do is have multiple μ, and select things that score reasonably (but not excessively) on all of them. This is related to my idea that the best Turing test is one that the computer has not been trained or optimised on. Ideally, you'd want there to be some category of utilities "be genuinely friendly" that score higher than you'd expect on many diverse human-related μ (it may be better to randomly sample rather than fitting to precise criteria).
You could see this as saying that "programming an AI to preserve human happiness is insanely dangerous, but if you find an AI programmed to satisfice human preferences, and that other AI also happens to preserve human happiness (without knowing it would be tested on this preservation), then... it might be safer".
There are a few other thoughts we might have for trying to pick a safer u:
- Properties of utilities under trade (are human-friendly functions more or less likely to be tradable with each other and with other utilities)?
- If we change the definition of "human", this should have effects that seem reasonable for the change. Or some sort of "free will" approach: if we change human preferences, we want the outcome of u to change in ways comparable with that change.
- Maybe also check whether there is a wide enough variety of future outcomes, that don't depend on the AI's choices (but on human choices - ideas from "detecting agents" may be relevant here).
- Changing the observers from hypothetical to real (or making the creation of the AI contingent, or not, on the approval), should not change the expected outcome of u much.
- Making sure that the utility u can be used to successfully model humans (therefore properly reflects the information inside humans).
- Make sure that u is stable to general noise (hence not over-optimised). Stability can be measured as changes in E(μ|M(u)), E(u|M(u)), E(v|M(u)) for generic v, and other means.
- Make sure that u is unstable to "nasty" noise (eg reversing human pain and pleasure).
- All utilities in a certain class - the human-friendly class, hopefully - should score highly under each other (E(u|M(u)) not too far off from E(u|M(v))), while the over-optimised solutions - those scoring highly under some μ - must not score high under the class of human-friendly utilities.
This is just a first stab at it. It does seem to me that we should be able to abstractly characterise the properties we want from a friendly utility function, which, combined with crude measures, might actually allow us to select one without fully defining it. Any thoughts?
And with that, the various results of my AI retreat are available to all.