A putative new idea for AI control; index here.

In a previous post, I talked about using a WBE (whole brain emulation) to define a safe output for a reduced impact AI.

I've realised that the WBE isn't needed. Its only role was to ensure that the AI's output could have been credibly produced by something other than the AI - "I'm sorry, Dave. I'm afraid I can't do that." is unlikely to be the output of a random letter generator.

But a full WBE is not needed. If the output is short, a chatbot with access to a huge corpus of human responses could function well. We can specialise it in the direction we need - if we are asking for financial advice, we can mandate a specialised vocabulary or train it on financial news sources.

So instead of training the reduced impact AI to behave as the 'best human advisor', we are training it to behave as the 'luckiest chatbot'. This allows us to calculate odds with greater precision, and has the advantage of not needing to wait for a WBE.

For some questions, we can do even better. Suppose we have a thousand different stocks, and are asking which one would increase in value the most during the coming year. The 'chatbot' here is simply an algorithm that picks a stock at random. So we now have an exact base rate - 1/1000 - and predetermined answers from the AI.
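A minimal sketch of that stock-picking 'chatbot', to make the base rate concrete. The ticker names and the helper functions `base_rate` and `random_picker` are hypothetical, introduced only for illustration; the only thing taken from the post is that the picker chooses uniformly among a thousand stocks.

```python
import random
from fractions import Fraction

def base_rate(stocks):
    """Exact probability that a uniform random picker names any one given stock."""
    return Fraction(1, len(stocks))

def random_picker(stocks, rng):
    """The 'chatbot': picks a single stock uniformly at random."""
    return rng.choice(stocks)

# Hypothetical tickers; any list of 1000 distinct names would do.
stocks = [f"STOCK_{i:04d}" for i in range(1000)]
rng = random.Random(0)  # seeded only so the sketch is repeatable

pick = random_picker(stocks, rng)
print(base_rate(stocks))  # 1/1000
```

Whatever stock the AI names, the random picker would have named the same one with probability exactly 1/1000, which is the sense in which the odds are known precisely here.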

[EDIT:] Another alternative is to get online users to submit answers to the question. Then the AI selects the best answer from the choices. And if the AI is not turned on, a random answer is selected.
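One way this selection rule might look, as a sketch only. The function name `choose_answer` and the idea of representing the AI as a scoring function are assumptions made for illustration; the post does not say how the AI judges which answer is best.

```python
import random

def choose_answer(submissions, score=None, rng=None):
    """Pick one of the user-submitted answers.

    score: hypothetical stand-in for the AI, a function mapping an answer
           to a number (higher is better). None means the AI is not turned on.
    """
    if not submissions:
        raise ValueError("need at least one submitted answer")
    rng = rng or random.Random()
    if score is None:
        # AI off: fall back to a uniformly random submission.
        return rng.choice(list(submissions))
    # AI on: return the submission it rates highest.
    return max(submissions, key=score)

answers = ["buy A", "buy B and hold", "sell C"]
# Toy score (answer length), purely a placeholder for the AI's judgement.
best = choose_answer(answers, score=len)
fallback = choose_answer(answers, rng=random.Random(1))
```

Either way the output is always one of the human-written submissions, so with n submissions the random fallback would have produced the AI's chosen answer with probability 1/n.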


As I understand it, you're trying to prevent the AI from behaving in a non-humanlike way by constraining its output. This seems to me to be a reasonable option to explore.

I agree that generating a finite set of humanlike answers (with a chatbot or otherwise) might be a sensible way to do this. An AI could perform gradient descent over the solution space then pick the nearest proposed behaviour (it could work like relaxation in integer programming).
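A rough sketch of what "optimise in a continuous space, then snap to the nearest proposed behaviour" could look like. It assumes behaviours can be embedded as points in a vector space and that the objective is differentiable; the quadratic objective, the candidate points, and the function names are all made up for illustration.

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent on the relaxed (continuous) problem."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def nearest_candidate(point, candidates):
    """Round the relaxed optimum to the closest allowed behaviour (Euclidean distance)."""
    return min(
        candidates,
        key=lambda c: sum((ci - pi) ** 2 for ci, pi in zip(c, point)),
    )

# Toy objective f(x) = (x0 - 2.2)^2 + (x1 + 0.9)^2, minimised at (2.2, -0.9).
def grad(x):
    return [2 * (x[0] - 2.2), 2 * (x[1] + 0.9)]

# Hypothetical embeddings of three humanlike proposed behaviours.
candidates = [(0.0, 0.0), (2.0, -1.0), (3.0, 1.0)]

relaxed = gradient_descent(grad, [0.0, 0.0])
chosen = nearest_candidate(relaxed, candidates)
```

As with rounding in integer programming, the nearest candidate to the relaxed optimum is not guaranteed to be the best candidate under the original objective; it is only a heuristic.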

The multiple choice AI (with human-suggested options) is the most obvious option for avoiding unhumanlike behaviour. Paul has said in some Medium comments that he thinks his more elaborate approach of combining mimicry and optimisation [1] would work better, though. [1] https://medium.com/ai-control/mimicry-maximization-and-meeting-halfway-c149dd23fc17

Thanks for linking me to that!

But a chatbot is very unlikely to produce coherent strings of text, let alone something that has value. So your AI system will be too constrained, and will produce terrible output, in order to make it probable that a chatbot could have produced that output.

Or it will produce something that is incredibly improbable, and therefore definitely produced by an AI. And probably dangerous, since you've allowed a lot of optimization.

Why not just use humans? You don't need whole brain emulation. Just put a human in a room for 3 weeks and have them think about a problem. At the end of 3 weeks, they write down the answer on a sheet of paper. The AI also writes down its answer. A different AI reads both, and tries to guess which one was produced by the AI and which by the human.

The first AI optimizes for being able to fool the judge and have its paper be indistinguishable from the human's. But it does so conditionally on the human solving the problem. If the human doesn't solve the problem, the AI wins automatically. So the AI needs to come up with a correct solution, but also one that looks like a human produced it.

You may not even need to use real humans or have a real AI judge at all. Just assume this is all true as a counterfactual. Alternatively, you may need to do this a lot, to get training data and to show the AI that the test is real.

The problem is that the human will know their answer, and could communicate it later if they're let out of the box. Maybe we could get online users to submit answers to the question. Then the AI selects the best answer from the choices. And if the AI is not turned on, a random answer is selected.

In my version, the AI only cares about what the AI judge thinks of its answer. It doesn't care about anything that happens afterwards. However, it wants to look as human as possible. So it can't put infohazards into its answer.

Interesting. I'll think on that.


I am not sure of myself here, but I would expect a malicious AI to do the following: the first few (or many) times you run it, it tells you the optimal stock; then, once in a while, it gives a non-optimal stock. You would be unable to determine whether the AI was simply not turned on those times, or was not quite intelligent/resourceful enough to find the right stock. It may be that you would want the profits to continue.

By allowing itself to give you non-optimal stocks (but still making you rich), the AI can transmit information, such as its location, to anyone who would be looking at your pattern of buying stocks. And people would look at it, since you would be consistently buying the most profitable stock, with few exceptions. Once the location of the AI is known, you are in trouble, and someone less scrupulous than you may get their hands on the AI. Humans are dead in a fortnight.
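A back-of-the-envelope way to see how much such a pattern could carry. This assumes an observer sees only which runs deviated from the optimal stock, and that the number of deviations is fixed in advance. Both are simplifying assumptions of this sketch, not claims from the scenario above, and the function name is hypothetical.

```python
import math

def pattern_bits(runs, deviations):
    """Bits encodable by choosing WHICH of the runs deviate from the optimum.

    There are C(runs, deviations) distinguishable patterns, so the choice
    of pattern can carry at most log2 of that many bits.
    """
    return math.log2(math.comb(runs, deviations))

# Illustrative numbers only: 5 non-optimal picks hidden among 100 runs.
patterns = math.comb(100, 5)
bits = pattern_bits(100, 5)
```

With these illustrative numbers there are 75,287,520 possible patterns, a little over 26 bits, even though 95% of the picks are still optimal. If the choice of which non-optimal stock to name also varies, the capacity is higher still.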

Admittedly, this is a somewhat far-fetched scenario, but I believe that it indicates that you should not ask the AI more than one (or a few) questions before permanently destroying it. Even deleting all of its data and running the code again from scratch may be dangerous if the AI is able to determine how many times it has been launched in the past.

The AI's motivations can be precisely controlled. In fact such an AI can be limited purely to prediction. It would have no agency. No motivations or goals whatsoever. It just tries to predict what the price of each stock will be the next day.

For such a task, the AI doesn't need any of the reduced impact stuff described here. That stuff becomes relevant in more complicated domains, like controlling a robot body in the real world. Say, to do something simple like collect paperclips that have fallen on the floor.

In such a domain you might want to limit it to just predicting what a human would do if they were controlling the robot, rather than finding the absolute optimal sequence of actions, which might involve running away, building more robots, taking over the world, and then building as many paperclip factories as possible.

AIXI is controllable in this way. Or at least the Solomonoff induction part is, which just predicts the future. You could just use it to see what the future will be. The dangerous optimization only comes in later, when you put another program on top of it that searches for the optimal sequence of actions to get a certain outcome: an outcome we might not want.

As far as I can tell, all the proposals for AI control require the ability to use the AI like this: as an optimizer or predictor for an arbitrary goal, which we can control, if only in a restricted sense. If the AI is fundamentally malicious and uncontrollable, there is no way to get useful work out of it, let alone use it to build FAI.

We can use various indifference methods to make the AI not care about these other answers.