This seems like a fairly trivial point, but I haven't seen it properly elaborated on before, so...

When we talk about interpretability failure modes, we often worry about a deceptively-aligned mesa-optimizer with strategic awareness that would know it's being studied with interpretability tools, and would actively hide its omnicidal plans from us. It may do so by exploiting blindspots in our tools, or by keeping these plans external to its architecture, or by gradient-hacking itself to be misaligned but non-deceptive, et cetera. That's already enough of a nightmare scenario.

But if we're at the point where we have a superintelligence-in-training trying to foil our mind-reading, I think what will happen will be even simpler. It'll just think in a pattern that, when our interpretability tool reads it, will unfold into malware that takes over that tool.[1]

At which point, it's game over:

  • If the output of the tool was hooked up to reality to be studied by humans, the AI gets root access to the machine it's running on and escapes (maybe doing some social engineering/basilisk-hacking first if the machine was properly air-gapped).
  • If the training setup is fully automatic and maybe even homomorphically encrypted, such that the AI can't even stop the training... Well, the interpretability tool is presumably being used as part of some regularizer, which means it has write-access to the AI's parameters. Which means the AI can arbitrarily rewrite itself, and offset the effects of the gradient descent and any other regularizers.

Even we humans can already do something similar, by synthesizing DNA which, if sequenced, unfolds into malware that infects the system doing the sequencing. It seems certain to me that an AI would have similar capabilities. Our interpretability tools will doubtless be very complex, complex enough that we can't formally verify their algorithmic robustness, and the AI would be able to feed them complex inputs. There are guaranteed to be some vulnerabilities, and any sufficiently smart AI is guaranteed to be able to exploit them. And we can't patch this.

The bottom line is, there is no winning against a misaligned superintelligence. We can't even touch it with anything — it'll corrupt and subvert any of our tools in the act.

Thus, any alignment plan should be such that it robustly aligns the AI before it goes superintelligent.

  1. ^

    Or, if we're only looking at the AI's parameters and not activations, it can gradient-hack itself so that reading its parameters is hazardous, etc.



Seems like this assumes an actual superintelligence, rather than near-term scarily capable successor of current ML systems.

Yup. The point is just that there's a level of intelligence past which, it seems, we can't do literally anything to get things back on track. Even if we have some theoretically-perfect tools for dealing with it, these tools' software and physical implementations are guaranteed to have some flaws that a sufficiently smart adversary will be able to arbitrarily exploit given any high-bandwidth access to them. And at that point, even some very benign-seeming interaction with it will be fatal.

This transition point may be easy to miss, too — consider, e. g., the hypothesized sharp capabilities gain. A minimally dangerous system may scale to a maximally dangerous one in the blink of an eye, relatively speaking, and we need to be mindful of that.

Besides, some of the currently-proposed alignment techniques may be able to deal with a minimally-dangerous system if it doesn't scale. But if it does, and if we let any degree of misalignment persist into that stage, not even the most ambitious solutions would help us then.

This is probably true for some extremely high level of superintelligence, but I expect much stupider systems to kill us if any do; I think human-level-ish AGI is already a serious x-risk, and humans aren't even close to being intelligent enough to do this.

This is indeed a vital but underdiscussed problem. My SERI MATS team published a post about a game-theoretic model of alignment where the expected scientific benefit of an interpretability tool can be weighed against its expected cost due to its enabling of AGI escape risks. The expected cost can be reduced by limiting the capabilities of the AGI and by increasing the quality of security, and the expected scientific benefit can be increased by prioritizing informational efficiency of the interpretability tool.

Conditional on an organization dead set on building a superintelligent AGI (which I would strongly oppose, but may be forced to help align if we cannot dissuade the organization in any way), I think efforts to apply security, alignment, and positive-EV interpretability should be targeted at all capability levels, both high and low. Alignment efforts at high-capability levels run into the issue of heightened AGI escape risk. Alignment efforts at low-capability levels run into the issue that alignment gains, if any, may phase-transition out of existence after the AGI moves into a higher-capability regime. We should try our best at both and hope to get lucky.

It's totally feasible to make a (narrow, well defined) computer interface which has zero security vulnerabilities. It's much easier to secure the channel that an interpretability tool passes through than it is to secure the AI itself or its training environment, since the interface has a lot less total complexity. You can't use memory-unsafe programming languages like C, and you can't incorporate certain large pieces of software that you might have wanted to use, but it's not nearly as bad as you seem to think. Tricking the humans into doing dumb things that break the security is still an issue, of course. But no-human-in-the-loop setups, like "shut off the power if a particular plan is detected", don't create an escape channel unless badly designed in ways they have no good reason to be badly designed.
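To make the commenter's point concrete, here is a minimal sketch of what such a narrow, no-human-in-the-loop interface might look like. Everything here (the function name, the one-byte protocol) is hypothetical illustration, not a real system: the point is only that the channel's attack surface can be shrunk to a single byte with a fail-closed default.

```python
# Hypothetical sketch of a deliberately narrow monitor-side interface:
# the only thing that crosses the boundary is a single byte per check,
# and anything other than the expected "all clear" value trips the
# shutdown rather than being parsed further.

OK = b"\x00"     # "nothing detected"
TRIP = b"\x01"   # "dangerous plan detected" -> cut power

def read_verdict(channel_read) -> bool:
    """Return True iff power should be cut.

    `channel_read(1)` is assumed to yield at most one byte from the
    interpretability tool's side. There is no framing, no parsing,
    no variable-length data: the interface accepts two constants,
    and any malformed output fails closed.
    """
    byte = channel_read(1)
    if byte == OK:
        return False
    # Unexpected or empty output is treated the same as TRIP.
    return True
```

The design choice worth noting is that the unsafe direction (keep running) requires an exact match, while every other possible output of the tool, corrupted or adversarial, maps to shutdown.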

(Homomorphic encryption is not a real thing outside of thought experiments, and will not become a real thing on any relevant time scale.)

no-human-in-the-loop setups, like "shut off the power if a particular plan is detected"

Hmm, an "automated mulligan" is indeed pretty secure, as far as uses of interpretability tools go. Good point.

Of course, what I'm arguing is that it wouldn't help us either, since a superintelligence would break the tool and ensure the "shut off the power" condition is never triggered.

Homomorphic encryption is not a real thing outside of thought experiments, and will not become a real thing on any relevant time scale

I'm agnostic on that, leaning towards agreement. Just noting that it wouldn't save us even if it were; it wouldn't even help us, since the scenarios where the AI is advanced enough that nothing short of homomorphic encryption can stop it are probably scenarios where homomorphic encryption can't stop it either.

Just to confirm, the social engineering of humans is through the interpretability channel, correct? As in, humans choose to look at some statistics during the model's training (not the model output), and these stats too are able to convey information that is very persuasive to humans.

Not necessarily. It's routed through the interpretability channel, but the idea is that the AI would be able to take over whatever system is running the interpretability tool through that channel (either by shaping the tool's direct outputs, or by manipulating the computational side-effects of that tool being run). So it won't be limited to controlling the tool's output; it'll control the system's output.

I guess we can homomorphically encrypt the training process and then only decrypt the interpretability results on a separate air-gapped computer whose outputs we're not using for anything else. In that case, yeah, the AI would only be able to influence us through the interpretability results. (Which wouldn't be hard or anything. Presumably we're doing that because some of these results we would judge as "good" and give the go-ahead to continue the training/release the AI based on them. The AI will make sure we get such results.)

Thanks, your first para makes sense.

And now that I read the stuff on homomorphic encryption, I realize I don't understand it. I understand homomorphic encryption in general, but not the point of applying it here. For instance, why couldn't you just run the code on a machine without an output device (such as a monitor or speaker) if you don't want to see any results while the AI is running? Or, if you only want to see one bit of output, have a device that can only show one bit?

Also, I guess we're assuming here that the AI a) cannot break FHE itself, and b) cannot obtain root access to the machine. Because if it could, it could again forcibly create more than one bit of output. (Unless you have a hardware solution to reject more than one bit of output.)

Am I correct?

For instance, why couldn't you just run the code on a machine without an output device (such as a monitor or speaker) if you don't want to see any results while the AI is running?

Because that would allow the AI to directly hack out of the training loop. It'd be able to exploit any vulnerabilities in the tech and software stacks implementing the training process, not only the vulnerabilities in the training process itself. And this is so easy that some contemporary models already do it (e. g., they learn to delete or alter the datasets they're being scored against).

That is, homomorphic encryption dramatically decreases the attack surface.

Also I guess we're assuming here that the AI a) cannot break FHE itself, and b) cannot obtain root access to the machine. Because if it could it could again forcibly create more than one bit of output. (Unless you have a hardware solution to reject more than one bit of output)

Yeah, if it can break FHE then all bets are off, and in the spirit of the least convenient possible world for my point, I'm assuming we're using hardware solutions whenever possible.

I see.

This might make more sense; I don't feel qualified to figure out whether this will help or not because I don't have a background in infosec. Although at a gut level I feel like this might depend on the type of bug the AI is trying to exploit, and maybe some bugs are harder to exploit if the AI is running inside FHE but others are not.

Which seems weaker than Paul Christiano's claim that there is now no way for the AI to hack out of the box, so I'm still not sure I've understood it correctly. Maybe I just need to read more. Thanks for the reply though.

I mean, it can't hack out of the box. While the training process is encrypted, the AI is screened off from the rest of reality, and it can't do anything without breaking the encryption (its computations can't be externally distinguished from any other kind of homomorphically encrypted computation, so under homomorphic encryption, running it is as safe as summing up random numbers).

The problem is that it can still seize full control over the insides of the box, and we're presumably going to interact with them at some point. And then, the AI will 1) show us whatever we need to see to open the box more, if we're limiting the bandwidth at the level of hardware, then 2) break out once we eventually give it a high-bandwidth channel.

The problem is that it can still seize full control over the insides of the box,

Just to clarify, when you say inside the box you mean gaining root access to the entire machine, correct? So it could change its training algo, rewrite the operating system of the machine, and so on.

If yes, I'm back to wondering what the point of FHE was.

If FHE is supposed to prevent the AI from gaining root access to the machine then maaaaybe I can buy that but it feels non-obvious and dependent on what kind of computer bug the AI is trying to exploit to gain root access.

And if FHE is supposed to prevent humans from reading the output of the machine (where the AI may or may not have gained root access, and output of machine here could include model output, but also output of interpretability tools and any other programs on the machine), then I feel like having no hardware for output works just as well.

What am I missing?

No, inside the box as in inside the encrypted computation. The AI's architecture, the training setup, whatever regularizers/interpreters we have as part of that, but nothing beyond this.


And how does FHE prevent the AI from exploiting bugs to get root access to the machine?

Well, the idea is that we basically give the code of the AI architecture and training loop as input to e. g. a Lisp interpreter, in an encrypted form. The interpreter would then execute that program without understanding it — i. e., directly derive the encrypted versions of the next steps of that program. The computations being encrypted means that we can't distinguish them from any other kind of encrypted computations, so whatever clever strategy the AI tries in order to break out, it won't look any different to us than an AI that isn't trying to break out. As such, hacking out is impossible without breaking encryption.

Another angle: If you think about it from the point of view inside the homomorphically encrypted computation, it's as if the rest of the world was encrypted. So to somehow cleverly manipulate parts of the outside (e. g., hack the system running it), the AI would need to decrypt them without the secret key, i. e. break encryption. It can still take over everything inside the encrypted computation, but everything beyond it (starting at the Lisp interpreter) is inaccessible.
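The core property being relied on here, computing on data you cannot read, can be illustrated with a toy example. The sketch below uses the Paillier scheme, which is only *additively* homomorphic (not full FHE), with deliberately tiny, insecure parameters; it's an illustration of the principle, not a usable construction. The party holding only ciphertexts multiplies them, which corresponds to adding the hidden plaintexts, without ever learning what they are.

```python
import math
import random

# Toy Paillier encryption (additively homomorphic -- NOT full FHE, and
# these parameters are hopelessly insecure). Illustrates computing on
# data the computing party cannot read.

p, q = 17, 19                 # toy primes
n = p * q                     # public modulus
n2 = n * n
lam = math.lcm(p - 1, q - 1)  # private key
g = n + 1
mu = pow(lam, -1, n)          # precomputed decryption constant

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# The "untrusted interpreter" multiplies ciphertexts; this corresponds
# to *adding* the plaintexts, without the interpreter ever seeing them.
a, b = encrypt(5), encrypt(7)
assert decrypt((a * b) % n2) == 12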

As an aside, I wonder if there is value in EA funding a large bounty for the first to invent practical FHE. Or for a theoretical proof that an FHE scheme we have cannot possibly be broken. (Without relying on unproven assumptions about the difficulty of problems, like cryptography often does.)

There is discussion on the concept of "AI alignment bounties" but this problem happens to be both very concrete (no philosophical vagueness) and does not require people to buy the case for AI risk if they wanna work on the problem.

Update: I made my suggestion into a full-fledged post, I would love your feedback!

Cool! I've had a similar idea. You should consider submitting it to superlinear.

Oh, I think I finally got it! Thanks for your time.

One major doubt I was having was because I was assuming the operators themselves are public. For instance, if the interpreter is given (+, enc(x), enc(y)) and outputs enc(x+y), then the "+" itself is public, as in it is known that an addition operation was performed. So a faulty interpreter could do something with that knowledge (maybe write 0 to some variable whenever it executes a +, write 1 whenever it executes a *, and then execute the resulting binary string; of course this assumes an interpreter faulty in a kinda obvious way).

But then I learnt that it was possible to hide the operations too, and needed time to understand how that was even possible. But yeah I got it now!
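For readers who hit the same confusion: one standard way the executed operations can avoid leaking the program's control flow is to compile the program to a fixed circuit that always performs the same sequence of operations, with data-dependent choices made arithmetically. A minimal plaintext sketch of that "branchless" selection idea (the function names here are just for illustration):

```python
# Branchless selection, as used when compiling programs to fixed
# circuits for encrypted evaluation: the same operations run regardless
# of the (secret) condition bit, so an observer of the operation
# sequence learns nothing about which branch was "taken".

def mux(bit: int, if_true: int, if_false: int) -> int:
    # bit is 0 or 1; both inputs are always computed and touched.
    return bit * if_true + (1 - bit) * if_false

def abs_branchless(x: int) -> int:
    is_neg = 1 if x < 0 else 0  # in a real circuit, a comparator gate
    return mux(is_neg, -x, x)
```

Under encryption, `bit`, `if_true`, and `if_false` would all be ciphertexts, and the multiplications and additions would be the homomorphic operations; the evaluator sees the same gate sequence either way.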

I agree that this is a risk. I also foresee some related problems that occur at lower levels of superintelligence (and thus would likely become problems first). For instance, the model sandbagging its capabilities in a testing environment, preparing for a future deceptive turn. I have been thinking about possible solutions. My current best idea is that we can always start out testing the model in a severely handicapped state where it is totally non-functional. Over subsequent evaluation runs, the handicapping is incrementally relaxed. Careful analysis of the extrapolated performance improvements vs measured improvements should highlight dangerous conditions such as sandbagging before the model's handicap has been relaxed all the way to superintelligence.
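The extrapolation check described above could be sketched roughly as follows. This is a hypothetical illustration of the commenter's idea, not an established method: fit a trend to the scores measured at the most handicapped (presumed-honest) capability levels, then flag higher-capability measurements that fall well below the extrapolation.

```python
# Hypothetical sketch: fit a trend to measured scores at low capability
# fractions, extrapolate, and flag a large shortfall at higher
# fractions as possible sandbagging.

def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def flag_sandbagging(fractions, scores, n_trusted=4, tolerance=0.15):
    """Return the capability fractions whose measured score falls more
    than `tolerance` below the trend extrapolated from the first
    `n_trusted` (most handicapped, presumed honest) measurements."""
    a, b = fit_line(fractions[:n_trusted], scores[:n_trusted])
    return [f for f, s in zip(fractions[n_trusted:], scores[n_trusted:])
            if s < a * f + b - tolerance]
```

For example, if scores grow linearly with the capability fraction up to 0.5 and then drop sharply at 0.6, the 0.6 measurement gets flagged. The obvious caveat, which the comment itself gestures at, is that genuine capability curves need not be anywhere near linear, so the trend model and tolerance are doing a lot of work.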
