You could see a dialogue like this on LW:

AI Pessimist: Look, you should apply security mindset when thinking about aligning a superintelligence. If you have a superintelligence, all your safety measures are under enormous optimization pressure, which will seek out ways to crack them.

Skeptic: I agree that if we have a misaligned superintelligence, it will game whatever specification we give it and kill us, or worse. But why would we have a misaligned superintelligence in the first place? Where does the adversarial pressure come from?

And this is a legitimate question! I've always felt the claim "optimization pressure causes weird things" to be intuitive, but many people don't share that intuition, and it's not as crisp as I'd like.

Let's begin with the question: what does it mean for your security assumption to be broken? It means that your security assumption happens to be false. Security assumptions are not locks or doors; you can't break them through physical effort. If all your security assumptions are true (including "this set of assumptions is sufficient for safety"), you are safe. You are only in danger if your security assumptions are false.

If so, why do we have to deal with pressure on these assumptions? Because you are running systems whose very first task is to learn what is true and what is false.

That's it. We don't need anything else to justify treating the overall situation as adversarially optimized. It's an implementation-agnostic statement. It doesn't matter whether your system is agentic or non-agentic, symbolic or connectionist: if you push your system's ability to figure things out as high as possible, your system is going to shoot you in the leg if your security assumptions are false.
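A minimal toy sketch of this point (all names and numbers hypothetical): we hold the false security assumption "a high proxy score implies safety," and exactly one point in the option space breaks it. A weak optimizer rarely stumbles onto that point; a strong one lands on it reliably, regardless of how the search is implemented.

```python
import random

# Hypothetical toy world. Our (false) security assumption:
# "whatever scores highest on the proxy is safe."
# Exactly one option, x == 313, breaks that assumption.
def proxy_score(x):
    # Proxy peaks mildly near 700, but the unsafe point x == 313
    # happens to score far higher than anything else.
    return -abs(x - 700) / 1000 + (10 if x == 313 else 0)

def actually_safe(x):
    return x != 313

# Weak optimizer: best of a few random samples.
weak = max(random.sample(range(1000), 5), key=proxy_score)
# Strong optimizer: exhaustive search over the whole space.
strong = max(range(1000), key=proxy_score)

# The weak optimizer almost never finds the crack; the strong one always does.
print(actually_safe(weak), actually_safe(strong))  # typically: True False
```

The search method doesn't matter; any sufficiently powerful process that is good at finding high-scoring points will concentrate exactly on the place where the assumption fails.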

Old models of adversarial pressure went something like this: the system has a plan generator and a plan evaluator. If your evaluator is not maximally robust, your generator will find a plan that satisfies the evaluator but is actually bad. This construction seems unnatural: why would we want to build a generator-of-all-possible-plans? Why wouldn't we build a generator-of-plans-similar-to-past-plans, at least?
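The old construction can be sketched in a few lines (the plans and scores here are invented for illustration): the generator's only job is to satisfy the evaluator, and the evaluator has one blind spot.

```python
# Hypothetical plans: (description, evaluator_score, actually_good).
# The evaluator's score is a proxy; it can be fooled by the last plan.
plans = [
    ("plant crops", 5, True),
    ("trade surplus", 7, True),
    ("fake the harvest report", 9, False),  # satisfies the evaluator, actually bad
]

def evaluator(plan):
    # Imperfect evaluator: only sees the proxy score.
    return plan[1]

# The generator's whole job: pick whatever the evaluator likes best.
best = max(plans, key=evaluator)
print(best[0], best[2])  # prints: fake the harvest report False
```

The failure needs nothing exotic: as long as the generator's search is strong relative to the evaluator's robustness, the blind spot is exactly where the search ends up.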

But we can reframe the situation in other settings. Take the alignment of a deep learning system. Our security assumptions might be, for example:

  1. Our training data contains only Good Things.
  2. Our learning algorithm can learn only Good Things.
  3. Learned Good Things are going to overcome Bad Things.
  4. If Bad Things make their way inside the system, our validation strategy will catch that.

Suppose you delete a Bad Thing from the training data. But what if you leave behind a weird Bad-Thing-shaped hole, which can be a sufficient hint for a sufficiently smart learning system?

At this point, if your security assumption is false, you find yourself in a situation where you would want your AI to be less powerful, i.e., in an adversarial situation. Therefore, you really do need to apply security mindset.
