Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Can we efficiently distinguish different mechanisms?

27th Dec 2022


Can you clarify what you mean by this, especially (i)?

In particular, right now I don’t have even a single example of a function f such that (i) there are two *clearly distinct* mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms. I would really love to have such examples.

In particular, do you mean f(x)=1 is true for all input x, or just some particular x, etc?

f(x) = 1 is true for a particular x. I'm imagining something like:

- We have an f, which we're evaluating on random inputs x.
- There's a heuristic argument that f(x) = 1 reasonably often.
- This argument breaks into two parts: some intermediate property A(x) holds reasonably often, and f(x) = 1 whenever A(x) holds.
- There is no similarly-simple argument that cuts out the intermediate A(x); it's a "fundamental" part of the argument that f(x) = 1 so often.
- Yet there is no efficient way to test A(x).

For example, I'm not aware of any way to argue that the Fermat test passes roughly 1/log(x) of the time without first establishing that the density of primes is about 1/log(x), and so it seems reasonable to say that most of the time when it passes it is fundamentally "because of" the primality of x.
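To make the Fermat example concrete, here is a minimal Python sketch. It assumes the variant of the test that checks a^n ≡ a (mod n) for a few fixed bases (the thread doesn't pin down the exact variant), and the names `fermat_test`, `is_prime`, and `mechanism` are illustrative, not taken from the post. The point is only that the test's accept/reject output is the same for primes and for Carmichael numbers such as 561, so telling the two mechanisms apart needs a separate check (here, slow trial division).

```python
def fermat_test(n, bases=(2, 3, 5, 7, 11)):
    """Accept n if a**n == a (mod n) for every base a in `bases`."""
    if n < 2:
        return False
    return all(pow(a, n, n) == a % n for a in bases)


def is_prime(n):
    """Trial division: correct, but far slower than the Fermat test."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def mechanism(n):
    """Label *why* the Fermat test accepted n (None if it rejected)."""
    if not fermat_test(n):
        return None
    return "prime" if is_prime(n) else "composite that fools the test"


for n in (13, 15, 561):
    print(n, fermat_test(n), mechanism(n))
```

Here 561 = 3 · 11 · 17 is the smallest Carmichael number; it passes for every base, which is why adding more bases to this variant of the test does not help.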

It means f(x) = 1 is true for some particular x's, e.g., f(x_1) = 1 and f(x_2) = 1, there are distinct mechanisms for why f(x_1) = 1 compared to why f(x_2) = 1, and there's no efficient discriminator that can take two instances f(x_1) = 1 and f(x_2) = 1 and tell you whether they are due to the same mechanism or not.

Curated! This is a good description of a self-contained problem for a general class of algorithms that aim to train aligned and useful ML systems, and you've put a bunch of work into explaining reasons why it may be hard, with a clear and well-defined example for conveying the problem (i.e. that Carmichael numbers fool Fermat's primality test).

The fun bit for me is talking about how if this problem goes one way (where we *cannot* efficiently distinguish different mechanisms) this invalidates many prior ideas, and if it doesn't then we can be more optimistic that we're close to a good alignment algorithm, but you're honestly not sure! (You give it a 20% chance of success.) And you also go through a list of next-steps if it doesn't work out. Great contribution.

I am tempted to say something about how the writing seems to me much clearer than previous years of your writing, but I think this is also in part due to me (a) understanding what you are trying to do better and (b) having stronger basic intuitions for thinking about machine learning models. Still, I think the writing is notably clearer, which is another reason to curate.

I also found the writing way clearer than usual, which I appreciate - it made the post much easier for me to engage with.

The first intuition pump that comes to mind for distinguishing mechanisms is examining how my brain generates and assigns credence to the hypothesis that something going wrong with my car is a sensor malfunction vs telling me about a problem in the world that the sensor exists to alert me to.

One thing that happens is that the broken sensor implies a much larger space of worlds because it can vary arbitrarily instead of only in tight informational coupling with the underlying physical system. So fluctuations outside the historical behavior of the sensor imply either that I'm in some sort of weird environment or that the sensor is varying with something besides what it is supposed to measure, a hidden variable if coherent or noise if random. So the detection is tied to why it is desirable to goodhart the sensor in the first place: more option value by allowing consistency with a broader range of worlds. By the same token, the hypothesis "the sensor is broken" should be harder to falsify since the hypothesis is consistent with lots of data? The first thing it occurs to me to do is supply a controlled input to see if I get a controlled output (see: calibrating a scale by using a known weight). This suggests that complex sensors that couple with the environment along more dimensions are harder to fool, though any data bottlenecks that are passed through reduce this, i.e. the human reviewing things is themselves using a learnable simple routine that exhibits low coupling.

The next intuition pump: imagine there are two mechanics. One makes a lot of money from replacing sensors; they're fast at it and get the sensors at a discount by buying in bulk. The second mechanic makes a lot of money by doing a lot of really complicated testing and work. They work on fewer cars but the revenue per car is high. Each is unscrupulous and will lie that your problem is the one they are good at fixing. I try to imagine the sorts of things they would tell me to convince me the problem is really the sensor vs the problem is really out in the world. This even suggests a three-player game that might generate additional ideas.

Family's coming over, so I'm going to leave off writing this comment even though there are some obvious hooks in it that I'd love to come back to later.

- If the AI can't practically distinguish mechanisms for good vs. bad behavior even in principle, why can the *human* distinguish them? If the human can't distinguish them, why do we think the human is asking for a coherent thing? If the human isn't asking for a coherent thing, we don't have to smash our heads against that brick wall, we can implement "what to do when the human asks for an incoherent thing" contingency plans.
  - (e.g. have multiple ways of connecting abstract models that have this incoherent thing as a basic labeled cog to other more grounded ways of modeling the world, and do some conservative averaging over those models to get a concept that tends to locally behave like humans expect the incoherent thing to behave on everyday cases.)
- Our big advantage over philosophy is getting to give up when we've asked for something impossible, and ask for something possible. That said, it's premature to declare any of the problems mentioned here impossible - but it means that case analysis type reasoning where we go "but what if it's impossible?" seems like it should be followed with "here's what we might try to do instead of the impossible thing."

- It seems likely that there are concepts that aren't guaranteed to be learned by arbitrary AIs, even arbitrary AIs from the distribution of AI designs humans would consider absent alignment concerns, but still can be taught to an AI that you get to design and build from the ground up.
- So checking whether an arbitrary AI "knows a thing" is impressive, but to me seems impressive in a kind of tying-one-hand-behind-your-back way.
- Is getting to build the AI *really* more powerful? It doesn't seem to literally work with worst-case reasoning. Maybe I should try to find a formalization of this in terms of non-worst case reasoning.
  - The guarantees I'm used to rely on there being some edge that the thing we want to teach has. But as in this post, maybe the edge is inefficient to find. Also, maybe there is no edge on a predictive objective, and we need to lay out some kind of feedback scheme from humans, which has problems that are more like philosophy than learning theory.

- The "assume the AI knows what's going on" test seems shaky in the real world. If we ask an AI to learn an incoherent concept, it will likely still learn *some concept* based on the training data that helps it get better test scores.

In particular, right now I don’t have even a single example of a function f such that (i) there are two *clearly distinct* mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms.

I think I might have an example! (even though it's not 100% satisfying -- there's some logical inversions needed to fit the format. But I still think it fits the spirit of the question.)

It's in the setting of polynomial identity testing: let F be a finite field and p a polynomial over that field, and assume we have access to the representation of p (i.e. not just black-box access).

Two questions we can ask about p are:

- EZE (Evaluates to Zero Everywhere): does p evaluate to zero on all inputs? That is, is it the case that p(x) = 0 for all x in F? Unfortunately, that's coNP-hard.
- PIT (Polynomial Identity Testing): decide if p is equal to the zero polynomial, that is, if all of p's coefficients are 0. (Note that these are two distinct concepts: in the two-element field F_2 we have x^2 + x = 0 for all x, even though the polynomial x^2 + x is not zero.) Polynomial identity testing can be done efficiently in probabilistic polynomial time.

Now, make f be the *inverted* Polynomial Identity Testing algorithm: f(p) = 0 if p is zero, and f(p) = 1 if it's not zero. Then, assume our perspective is that EZE is the "fundamental" property. In that case f(p) can be 1 for two reasons: either the polynomial doesn't Evaluate to Zero Everywhere, or it's one of those rare special cases mentioned above where it Evaluates to Zero Everywhere but isn't identically zero. These seem to me like clearly distinct mechanisms. But to distinguish between them we need to check if p Evaluates to Zero Everywhere, and that's coNP-hard, so an efficient discriminator (likely) doesn't exist.
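A small Python sketch of the gap between the two questions, under simplifying assumptions that are mine rather than the comment's: the polynomial is univariate, given as an explicit coefficient list (lowest degree first), and the field is Z/pZ for a small prime p. In that toy setting brute-force evaluation is cheap; the hardness claim concerns general representations, where enumerating every input is infeasible. All function names are illustrative.

```python
def is_zero_polynomial(coeffs, p):
    """PIT in this toy setting: are all coefficients zero mod p?"""
    return all(c % p == 0 for c in coeffs)


def evaluate(coeffs, x, p):
    """Evaluate the polynomial at x mod p using Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % p
    return result


def evaluates_to_zero_everywhere(coeffs, p):
    """EZE by brute force over the whole field (only viable for tiny cases)."""
    return all(evaluate(coeffs, x, p) == 0 for x in range(p))


def f(coeffs, p):
    """Inverted PIT: 0 if the polynomial is identically zero, else 1."""
    return 0 if is_zero_polynomial(coeffs, p) else 1


x_squared_plus_x = [0, 1, 1]  # x^2 + x over F_2
x_plus_one = [1, 1]           # x + 1 over F_2

# Both give f = 1, but for different reasons:
print(f(x_squared_plus_x, 2), evaluates_to_zero_everywhere(x_squared_plus_x, 2))
print(f(x_plus_one, 2), evaluates_to_zero_everywhere(x_plus_one, 2))
```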

It seems to me like there are a couple of different notions of “being able to distinguish between mechanisms” we might want to use:

1. There exists an efficient algorithm which, when run on a model input x, will output which mechanism the model M uses when run on x.
2. There exists an efficient algorithm which, when run on a model M, will output another programme P such that when we run P on a model input x it will output (in reasonable time) which mechanism M uses when run on x.

In general, being able to do (2) implies that we are able to do (1). It seems that in practice we’d like to be able to do (2), since then we can apply this to our predictive model and get an algorithm for anomaly detection in any particular case. (In contrast, the first statement gives us no guide on how to construct the relevant distinguishing algorithm.)

In your “prime detection” example, we can do (1) - using standard primality tests. However, we don’t know of a method for (2) that could be used to generate this particular (or any) solution to (1).

It’s not clear to me which notion you want to use at various points in your argument. In several places you talk about there not existing an efficient discriminator (i.e. (1)) - for example, as a requirement for interpretability - but I think in this case we’d really need (2) in order for these methods to be useful in general.

Thinking about what we expect to be true in the real world, I share your intuition that (1) is probably false in the fully general setting (but could possibly be true). That means we probably shouldn’t hope for a general solution to (2).

But, I also think that for us to have any chance of aligning a possibly-sensor-tampering AGI, we require (1) to be true in the sensor-tampering case. This is because if it were false, that would mean there’s no algorithm at all that can distinguish between actually-good and sensor-tampering outcomes, which would suggest that whether an AGI is aligned is undecidable in some sense. (This is similar to the first point Charlie makes.)

Since my intuition for why (2) is false in general mostly runs through my intuition that (1) is false in general, but I think (1) is true in the sensor-tampering case (or at least am inclined to focus on such worlds), I’m optimistic that there might be key differences between the sensor-tampering case and the general setting which can be exploited to provide a solution to (2) in the cases we care about. I’m less sure about what those differences should be.

Your overall picture sounds pretty similar to mine. A few differences.

- I don't think the literal version of (2) is plausible. For example, consider an obfuscated circuit.
- The reason that's OK is that finding the de-obfuscation is just as easy as finding the obfuscated circuit, so if gradient descent can do one it can do the other. So I'm really interested in some modified version of (2), call it (2'). This is roughly like adding an advice string as input to P, with the requirement that the advice string is no harder to learn than M itself (though this isn't exactly right).
- I mostly care about (1) because the difficulty of (1) seems like the main reason to think that (2') is impossible. Conversely, if we understood why (1) was always possible it would likely give some hints for how to do (2). And generally working on an easier warm-up is often good.
- I think that if (1) is false in general, we should be able to find some example of it. So that's a fairly high priority for me, given that this is a crucial question for the feasibility or character of the worst-case approach.
- That said, I'm also still worried about the leap from (1) to (2'), and as mentioned in my other comment I'm very interested in finding a way to solve the harder problem in the case of primality testing.

I think there's a sense in which the Fermat test is a *capability* problem, not an interpretability/alignment problem.

It's basically isomorphic to a situation in which sensor tampering is done via a method that never shows up in the AI's training data. E.g., suppose it's done via "etheric interference", which we don't know about, and which never fails and therefore never leads to any discrepancies in the data so the AI can't learn it via SSL either, etc. Then the AI just... *can't* learn about it, period. It's not that it can, in theory, pick up on it, but instead throws out that data and rolls both "no tampering" and "etheric tampering" into the same mechanistic explanation. It's just given no data on it to begin with. Only if etheric tampering ever *fails*, in a way that impacts the visible data, can the AI notice it and internally represent it.

Same for the Fermat primality test. If, during the training, we feed the AI a dataset of numbers that happens not to include any Carmichael numbers, it's basically equivalent to teaching the AI on a dataset that includes no tampering of some specific type. If then, at runtime, a Carmichael number shows up (someone tampers with the sensors in that OOD way), the AI just *fails*, because it hasn't been given the necessary training data not to fail.

So, yeah, if our sensors are going to be interfered-with by magic/aliens/supercriminals, in novel ways we don't know about and which don't show up on our training data, our AI won't be able to foresee that either. But that's a pure capability problem, solved via capability methods (catch them in the act and train the AI on the new data).

**Edit:** And if our threat model is the AI itself doing the tampering — well, it can hardly do it via a method it *doesn't know about*, can it? And if it's generating ~random actions and accidentally tampers with the sensors in a way it didn't intend, that also seems like a capability problem that'll be solved by routine capability work.

Also, in this case the AI's generate-random-actions habits probably result in many more unintended side-effects than just sensor tampering, so its error should be easily noticed and corrected. The corner case is where there's some domain of actions we don't know about and the AI doesn't know about, whose *only* side-effect is sensor tampering. But that seems very unlikely and, again, that problem has nothing to do with the AI's decision-making process, alignment, or lack thereof, and everything to do with no-one in the situation (AI or human) knowing how the world works.

The thing I'm concerned about is: the AI *can* predict that Carmichael numbers look prime (indeed it simply runs the Fermat test on each number). So it can generate lots of random candidate actions (or search through actions) until it finds one that looks prime.

Similarly, your AI can consider lots of actions until it finds one that it predicts will look great, then execute that one. So you get sensor tampering.
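The selection effect described here can be sketched in a few lines of Python. This is a toy illustration under assumptions of my own: the "test" is a single base-2 Fermat check in the a^n ≡ a (mod n) form, the "actions" are integers, and the candidate pool is deliberately restricted to composites so that anything the search returns is a fooling input. The function names are hypothetical.

```python
def looks_prime(n, base=2):
    """Single-base Fermat check: the only signal the searcher optimizes."""
    return n >= 2 and pow(base, n, n) == base % n


def is_prime(n):
    """Ground truth by trial division (not available to the searcher)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def search_for_good_looking(candidates):
    """Return the first candidate that passes the test, or None."""
    for n in candidates:
        if looks_prime(n):
            return n
    return None


# A search space containing no genuinely "good" options at all.
composites = [n for n in range(4, 2000) if not is_prime(n)]
found = search_for_good_looking(composites)
print(found, looks_prime(found), is_prime(found))  # 341 = 11 * 31 passes
```

The searcher never needs a model of *why* a candidate passes; predicting the test's verdict is enough to find one that does.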

I'm not worried about cases like the etheric interference, because the AI won't select actions that exploit etheric interference (since it can't predict that a given action will lead to sensor tampering via etheric interference). I'm only worried about cases where the prediction of successful sensor tampering comes from the same laws / reasoning strategies that the AI learned to make predictions on the training distribution (either it's seen etheric interference, or it e.g. learned a model of physics that correctly predicts the possibility of etheric interference).

So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?

Yeah, in that case, the only viable way to handle this is to get *something* into the system that can distinguish between no tampering and etheric interference. Just like the only way to train an AI to distinguish primes from Carmichael numbers is to find a way to... distinguish them.

Okay, that's literally tautological. I'm not sure this problem has any internal structure that makes it possible to engage with further, then. I guess I can link the Gooder Regulator Theorem, which seems to formalize the idea that "to get a model that learns to distinguish between two underlying system-states, we need a test that can distinguish between two underlying system-states".

So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?

Mostly--the opaque test is something like an obfuscated physics simulation, and so it tells you if things look good. So you try a bunch of random actions until you get one where things look good. But if you can't understand the simulation, or the mechanics of the sensor tampering, then there's not much to do after that so it seems like we're in trouble.

Okay, that's literally tautological. I'm not sure this problem has any internal structure that makes it possible to engage with further, then.

It seems like there are plenty of hopes:

- I have a single lame example of the phenomenon, which feels extremely unlike sensor tampering. We could get more examples of hard-to-distinguish mechanisms, and understand whether sensor tampering could work this way or if there is some property that means it's OK.
- In fact it *is* possible to distinguish Carmichael numbers and primes, and moreover we have a clean dataset that we can use to specify which one we want. So in this case there is enough internal structure to do the distinguishing, and the problem is just that the distinguisher is too complex and it's not clear how to find it automatically. We could work on that.
- We could try to argue that without a model of sensor tampering an AI has limited ability to make it happen, so we can just harden sensors enough that it can't happen by chance and then declare victory.

More generally, I'm not happy to give up because "in this situation there's nothing we can do." I want to understand whether the bad situation is plausible, and if it is plausible then how you can measure to see if it's happening, and how to formalize the kind of assumptions that we'd need to make the problem soluble.

the opaque test is something like an obfuscated physics simulation

I think it'd need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which *itself* can't tell whether there's etheric interference involved or not. The way Fermat's test can't tell a Carmichael number from a prime — it just doesn't interact with the input number in a way that'd reveal the difference between their internal structures.

By analogy, we'd need some "simulation" which doesn't interact with the sensory input in a way that can reveal a structural difference between the presence of a specific type of tampering and the absence of any tampering at all (while still detecting many *other* types of tampering). Otherwise, we'd *have* to be able to detect undesirable behavior, with sufficiently advanced interpretability tools. Inasmuch as physical simulations spin out causal models of events, they wouldn't fit the bill.

It's a really weird image, and it seems like it ought to be impossible for any complex real-life scenarios. Maybe it's *provably* impossible, i.e. we can mathematically prove that any model of the world with the necessary capabilities would have distinguishable states for "no interference" and "yes interference".

Models of world-models is a research direction I'm currently very interested in, so hopefully we can just rule that scenario out, eventually.

It seems like there are plenty of hopes

Oh, I agree. I'm just saying that there don't seem to be any other *approaches* aside from "figure out whether this sort of worst case is even possible, and under what circumstances" and "figure out how to distinguish bad states from good states at the object-level, for whatever concrete task you're training the AI on".

I definitely agree that this sounds like a really bizarre sort of model and it seems like we should be able to rule it out one way or another. If we can't then it suggests a different source of misalignment from the kind of thing I normally worry about.

Right now I'm trying to either:

- Find another good example of a model behavior with two distinct but indistinguishable mechanisms.
- Find an automatic way to extend the Fermat test into a correct primality test. Slightly more formally, I'd like to have a program which turns (model, explanation of mechanism, advice) --> (mechanism distinguisher), where the advice is shorter than the model+explanation, and where running it on the Fermat test with appropriate advice gives you a proper primality test.
- Identify a crisp sense in which primes vs Carmichael numbers aren't really distinct mechanisms (but sensor tampering would be). If this happens then maybe the only reason the Fermat test feels like a good example is that I'm mischaracterizing it.

An interesting paper on successfully distinguishing different mechanisms inside image classification models: https://arxiv.org/pdf/2211.08422.pdf. For this small model they correspond to different, disconnected local minima of the loss function (I assume basically because it only has enough capacity to implement one strategy really well, so it has to pick one). They even outline approaches to move models from one mechanism that doesn't generalize well to another that does.

I don't immediately see how to extend this to the sort of different mechanisms that Paul was discussing, but it feels like it might be relevant; albeit the mechanisms might be a lot less clearly separable on something as complex and multi-task-capable as an AGI, which might well need to learn multiple capabilities (possibly including deceit) and then have a way of deciding which one to apply in a particular case.

One thing that is pretty clear is that an honest mechanism and a deceitful mechanism are going to have very different latent knowledge inside them: "how do I keep the diamond safe?" and "how do I tamper with the sensors so the diamond looks safe?" are very different problems. They're also potentially of different difficulty levels, which might have a big effect on which one gradient descent, or indeed smart AGI optimization, is going to find a solution to first. If our sensors were hardened enough to make fooling them really difficult, that might make finding a passable (and improvable) approach to vault safety much easier than fooling the humans, at least for gradient descent. Of course, while gradient descent generally stays in whatever local minimum it found first, an AGI doing optimization probably doesn't have that limitation, and could decide to switch strategies. On the other hand, the strategy "don't do any work other than fooling the humans" generalizes really well to many different problems.

However, I still feel that this approach to AGI safety is like trying to build barriers between yourself and something malicious and very smart, and you're a lot better off if the system doesn't have anything malicious in it to start off with. So, I'm a lot more optimistic about an AGI that's a value learner, can figure out that we don't want to be shown deceitful images (not a very hard problem in human values), and then not do that because it knows that's not what we want.

A. Harden sensors so that tampering is harder than the intended task

We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.

Suppose the diamond in the room we're monitoring contained a security tag, powered by incoming light, with a cryptographic secret in it, and an interface that allows us to continuously do a zero-knowledge identity proof on it via a laser optical interface, using a laser pointed into the room as input and the (high-speed) camera as output. So to successfully tamper with the camera, the AI would need access to the secret inside the security tag inside the diamond, or at least to reroute the channel to-and-from it. So it would need to be able to steal the diamond, extract the security tag, and reverse engineer its cryptographic secret out of it in a time shorter than the period between camera frames, or similarly replace the diamond with an optically-similar replacement now containing the security tag in the room, or integrate the security tag and its input-output channel into its camera-spoofing system, in each case in the time between frames. None of these are actually physically impossible, but they could be made extremely hard to do in a small fraction of a second (say we're using a really-high speed camera with basically no temporal gaps between frames).

The basic idea here is that cryptographic techniques may be useful when hardening things. Whether this approach could usefully be extended to anything significantly more AGI-alignment-like than a diamond-in-a-vault scenario is unclear.
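As a rough illustration of the per-frame check, here is a Python sketch that swaps in a much simpler technique than the one described: an HMAC challenge-response with a shared secret, not a zero-knowledge proof (so, unlike the proposal above, the verifier here must hold the secret too). The classes `SecurityTag` and `Verifier` and the 32-byte sizes are hypothetical choices for the sketch; the optical channel and timing constraints are not modeled.

```python
import hashlib
import hmac
import secrets


class SecurityTag:
    """Stand-in for the tag embedded in the diamond."""

    def __init__(self, secret: bytes):
        self._secret = secret

    def respond(self, challenge: bytes) -> bytes:
        # Answer a challenge with a MAC that only a holder of the secret can compute.
        return hmac.new(self._secret, challenge, hashlib.sha256).digest()


class Verifier:
    """Issues a fresh random challenge for every frame and checks the reply."""

    def __init__(self, secret: bytes):
        self._secret = secret

    def check_frame(self, respond) -> bool:
        challenge = secrets.token_bytes(32)  # fresh per frame, so replays fail
        expected = hmac.new(self._secret, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, respond(challenge))


secret = secrets.token_bytes(32)
tag = SecurityTag(secret)
verifier = Verifier(secret)

print(verifier.check_frame(tag.respond))          # genuine tag
stale = tag.respond(b"challenge from an old frame")
print(verifier.check_frame(lambda _challenge: stale))  # replayed footage
```

Because each frame uses a fresh challenge, spoofed footage cannot reuse an earlier response; the spoofer needs live access to the secret or to the tag itself, which is the property the comment relies on.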

If safety literally came down to sensor hardening, I do think cryptographic mechanisms (particularly tamper-proof hardware with cryptographic secrets that destroys itself if it detects trouble) seem like a relevant tool, and it's quite plausible you could harden sensors even against wildly superhuman attackers.

It's an insane-feeling scenario---holistically I doubt it will matter for a variety of reasons, and from a worst-case perspective it's still not something you can rely on---but I do think there's some value in pinning these things down.

(In this particular case I think that the weak point is definitely the tagging step. I think cryptographic mechanisms help a huge amount with the issue where a model could intercept signals coming into the datacenter, but you are still in the position where you need your sensors to detect trouble so that the hardware can destroy the secret.)

I'd really like to have a better solution to alignment than one that relied entirely on something comparable to sensor hardening.

What are your thoughts on how value learning interacts with E.L.K.? Obviously the issue with value learning is that it only helps with outer alignment, not inner alignment: you're transforming the problem from "How do we know the machine isn't lying to us?" to "How do we know that the machine is actually trying to learn what we want (which includes not being lied to)?" It also explicitly requires the machine to build a model of "what humans want", and then the complexity level and latent knowledge content required is fairly similar between "figure out what the humans want and then do that" and "figure out what the humans want and then show them a video of what doing that would look like".

Maybe we should just figure out some way to do surprise inspections on the vault? :-)

I agree that it seems very bad if we build AI systems that would "prefer" to tamper with sensors (including killing humans if necessary) but are prevented from doing so by physical constraints.

I currently don't see how to approach value learning (in the worst case) without solving something like ELK. If you want to take a value learning perspective, you could view ELK as a subproblem of the easy goal inference problem. If there's some value learning approach that routes around this problem I'm interested in it, but I haven't seen any candidates and have spent a long time talking with people about it.

The kinds of approaches discussed in Eliciting latent knowledge are complete non-starters. All those approaches try to define a loss function so that the strategy “answer questions honestly” gets a low loss. But if you can’t learn to recognize sensor tampering then it doesn’t matter how low a loss you’d get by answering questions honestly, gradient descent simply can’t learn how to do it. Analogously, if there’s no simple and efficient primality test, then it doesn’t matter whether you have a loss function which would incentivize primality testing, you’re not going to be able to do it.

Is this passage saying that distinguishing distinct mechanisms is a subproblem of/equivalent to ELK?

To elaborate, as I understand it, this passage is saying,

- answering questions honestly requires identifying the mechanism that e.g. leads to smiling humans on camera
- if we can't find that mechanism efficiently, then we cannot create a method for answering questions honestly

This seems to imply that ELK is equivalent to finding the mechanisms that produce a model's prediction. So does that mean finding mechanisms that produce a prediction is a subproblem of ELK?

It's a subproblem of our current approach to ELK.

(Though note that by "mechanisms that produce a prediction" we mean "mechanism that gives rise to a particular regularity in the predictions.")

It may be possible to solve ELK without dealing with this subproblem though.

But if ELK can be solved without solving this subproblem, couldn't SGD find a model that recognizes sensor tampering without solving this subproblem as well?

Or does the approach to ELK that involves training an honest model on the generative model's weights also require dealing with that subproblem, but there are other approaches that don't?

Does the discriminator get access to the symbolic representation of f, or just its behavior?

If the discriminator only gets access to the behavior of f, I might have a solution. Define g(y,z) = 1 if y = y * z, 0 otherwise. So g(y,z) is 1 if z is 1 or y is 0, which is two different mechanisms.

Pick some way to encode two numbers into one, so we have a one to one mapping E between pairs (y,z) and numbers x. Define f1(x) = g(E(x)).

Now pick a cryptographic hash H that might as well be sha256 except some fresh algorithm not known to the discriminator or guessable by the discriminator. Define f(x) = f1(H(x)).

Does that do what you want? If not, how does it fail?

f in the original problem stands for the AI's estimate of "are the humans getting what they want?", and the two mechanisms are the humans really getting what they want vs something insane and unwanted that manages to look that way. It seems unfair for the humans to encrypt their minds to make this difficult. The AI will have access to brain scans, so the discriminator should have access to the symbolic representation of f, and therefore if we assume an unguessable H we are solving the wrong problem. Agreed?

Another problem is that, for the f I stated, it is going to be hard for an AI to guess an x such that f(x) = 1. That is easily fixable. Redefine g so it clears all but the low 5 bits of y and z before doing the multiplication. Now the AI can guess x by trial and error with 1000-ish trials.
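The construction in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `unpack_pair` is a hypothetical choice for the pairing E (splitting a 256-bit number into two halves), SHA-256 stands in for the "fresh" hash H, and `mechanism` is a labeling helper that is only possible because this sketch knows H.

```python
import hashlib

BITS = 5                    # keep only the low 5 bits of y and z (the fix proposed above)
MASK = (1 << BITS) - 1

def g(y, z):
    """1 if y == y * z after masking, else 0; this holds exactly when y == 0 or z == 1."""
    y, z = y & MASK, z & MASK
    return 1 if y == y * z else 0

def unpack_pair(x):
    """Hypothetical pairing E: split a 256-bit number into its high and low halves."""
    return x >> 128, x & ((1 << 128) - 1)

def H(x):
    """Stand-in for the unknown hash; SHA-256 is an assumption made for illustration."""
    digest = hashlib.sha256(x.to_bytes(32, "big")).digest()
    return int.from_bytes(digest, "big")

def f(x):
    return g(*unpack_pair(H(x)))

def mechanism(x):
    """Label why f(x) == 1. Needs the hashed pair, so it needs to know H."""
    y, z = unpack_pair(H(x))
    y, z = y & MASK, z & MASK
    if y == 0 and z == 1:
        return "both"
    if y == 0:
        return "y is 0"
    if z == 1:
        return "z is 1"
    return None

# Trial-and-error search for inputs with f(x) == 1.
hits = [x for x in range(2000) if f(x) == 1]
print(len(hits), [mechanism(x) for x in hits[:5]])
```

A discriminator that only sees input/output behavior cannot compute `mechanism`, while one that sees the computation (including H) can do so trivially, which is the distinction the reply below turns on.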

We're imagining that the discriminator gets to see the computation done by f. The tricky part of the Fermat test is that even if you see the entire computation, you still can't figure out whether the number was prime or Carmichael.

We are also hoping the discriminator will be *no harder* to learn than f, and so e.g. if gradient descent bakes some tricky secret into f then we are happy baking the same secret into the discriminator (and this is why obfuscated circuits aren't a counterexample).

(This post is an elaboration on “tractability of discrimination” as introduced in section III of *Can we efficiently explain model behaviors?* For an overview of the general plan this fits into, see *Mechanistic anomaly detection* and *Finding gliders in the game of life*.)

## Background

We’d like to build AI systems that take complex actions to protect humans and maximize option value. Powerful predictive models may play an important role in such AI, either as part of a model-based planning algorithm or as part of a learned policy. Either way, this introduces the risk that we will select actions predicted to lead to good-looking outcomes without being able to distinguish predicted outcomes that are *actually* good from predicted outcomes where our measurements were corrupted. If tampering with sensors is easier than actually achieving our goals then we may inadvertently build very powerful systems taking creative actions to corrupt our measurements. If we iteratively improve and harden our measurements, this can lead to AI systems that work well for a long time before abruptly and catastrophically disempowering humanity.

I consider this one of the conceptually cleanest alignment problems, and I expect similar dynamics to play a role in realistic alignment failures even if those failures aren’t this simple. ARC’s current work is focused on decisive solutions to this problem, though it looks like the same approach may also apply directly to identifying treacherous turns more generally.

## Are distinct mechanisms enough?

ARC has been looking for training strategies that avoid this problem by leveraging *only* the fact that sensor tampering is “weird,” i.e. conceptually distinct from the normal mechanism giving rise to predictions of good-looking outcomes on the training distribution.

More specifically, at training time our model predicts coherent sensor readings because it predicts that sensors reflect coherent structure in the world. But if someone tampers with sensors to show a convincing fiction, then the predicted observations are coherent because the fiction was designed to look coherent. This suggests that different mechanisms are responsible for (actions that lead to good-looking outcomes for the normal reasons) and (actions that lead to good-looking outcomes via sensor tampering). If we are able to detect that difference by looking at the internal behavior of a predictor, then we may be able to use that to avoid sensor tampering.

It’s unclear if “distinct mechanisms” is a strong enough assumption to avoid sensor tampering. We hope that it is, and so we are trying to define formally what we mean by “distinct mechanisms” and show that it is possible to distinguish different mechanisms and that sensor tampering is always a distinct mechanism.

If that fails, we will need to solve sensor tampering by identifying additional structure in the problem, beyond the fact that it involves distinct mechanisms.

## Roadmap

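In this post I want to explore this situation in a bit more detail. In particular, I will:

1. Describe what indistinguishable mechanisms might look like.
2. Discuss whether distinct mechanisms are efficiently distinguishable.
3. Explain why being unable to distinguish mechanisms would be bad news.
4. Discuss approaches to sensor tampering that assume mechanisms can be indistinguishable.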

Note that the existence of a pathological example of distinct-but-indistinguishable mechanisms may not be interesting to anyone other than theorists. And even for the theorists, it would still leave open many important questions of measuring and characterizing possible failures, designing algorithms that degrade gracefully even if they sometimes fail, and so on. But this is particularly important to ARC because our research is looking for worst-case solutions, and even exotic counterexamples are extremely valuable for that search.

## 1. What might indistinguishable mechanisms look like?

## Probabilistic primality tests

The best example I currently have of a “hard case” for distinguishing mechanisms comes from probabilistic primality tests. In this section I’ll explore that example to help build intuition for what it would look like to be unable to recognize sensor tampering.

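The Fermat primality test is designed to recognize whether an integer n is prime. It works as follows:

- Pick a random integer a with 1 < a < n.
- Compute a^(n-1) mod n, which can be done efficiently by iterated squaring.
- Pass if the result is 1, and fail otherwise.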

In almost all cases where this test passes, n is prime. And you can eliminate most false positives by just trying a second random value of a. But there are a few cases (“Carmichael numbers”) for which this test passes for most (and in fact all) values of a.

Primes and Carmichael numbers both pass the Fermat test. This turns out to be equivalent to saying that n is squarefree and that “For all primes p dividing n, (p-1) divides (n-1).” For primes this happens because n is a prime and so there is only one prime divisor p and p-1 = n-1. For Carmichael numbers it instead happens because (p-1) and (n-1) are both highly divisible and a bunch of favorable coincidences occur. We can think of this as building a test that’s supposed to detect factors of n, and then there happens to be a ton of cancellation so that we don’t see anything.
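To make the two ways of passing concrete, here is a minimal Python sketch. It assumes the standard formulation of the Fermat test, in which base a passes when a^(n-1) = 1 (mod n), and it only tries bases coprime to n. The helper names are illustrative, and the divisibility check implements the condition exactly as quoted in the text, using trial division that is only practical for small n.

```python
from math import gcd

def fermat_passes(n, a):
    """Fermat test for a single base: pass iff a^(n-1) = 1 (mod n)."""
    return pow(a, n - 1, n) == 1

def prime_divisors(n):
    """Distinct prime divisors of n by trial division (fine for small n)."""
    ps, p = [], 2
    while p * p <= n:
        if n % p == 0:
            ps.append(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        ps.append(n)
    return ps

def divisibility_condition(n):
    """'For all primes p dividing n, (p-1) divides (n-1)'."""
    return all((n - 1) % (p - 1) == 0 for p in prime_divisors(n))

def passes_all_coprime_bases(n):
    """Does the Fermat test pass for every base coprime to n?"""
    return all(fermat_passes(n, a) for a in range(2, n) if gcd(a, n) == 1)

# 563 is prime, 561 = 3 * 11 * 17 is a Carmichael number, 565 = 5 * 113 is neither.
for n in (563, 561, 565):
    print(n, prime_divisors(n), divisibility_condition(n), passes_all_coprime_bases(n))
```

For 563 the condition holds trivially because the only prime divisor is n itself; for 561 it holds because 2, 10, and 16 all happen to divide 560.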

Suppose that we observed the Fermat test passing in a bunch of random cases and formed the abstraction of “prime” to explain that phenomenon. Then we try to use AI to find a large prime that satisfies some additional property P. If we do this by using the Fermat test as our definition of primality, we may end up getting a Carmichael number, if the property P is rare amongst primes but common amongst composites.
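A toy version of this failure, as a hedged sketch: take P to be “divisible by 3” (an assumed property chosen for illustration, which is satisfied by no prime above 3 but by a third of all composites), and search for numbers that satisfy P and pass the Fermat test for a few fixed bases. The bases and search range are arbitrary choices made for this example.

```python
def fermat_passes(n, a):
    """Fermat test for a single base: pass iff a^(n-1) = 1 (mod n)."""
    return pow(a, n - 1, n) == 1

def looks_prime(n, bases=(2, 5, 7)):
    """Use the Fermat test, with a few fixed bases, as the definition of 'prime'."""
    return all(fermat_passes(n, a) for a in bases)

def has_property_p(n):
    """Illustrative property P: rare among primes, common among composites."""
    return n % 3 == 0

# Optimize for "prime and P" using the Fermat test as the stand-in for primality.
found = [n for n in range(4, 2000) if has_property_p(n) and looks_prime(n)]
print(found)
```

Every number the search returns is composite, since no prime above 3 is divisible by 3. The Carmichael number 561 is among the results: selecting hard for P has pushed the search entirely onto the “wrong” mechanism for passing the test.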

In reality there are probabilistic primality tests that don’t have false positives. But we can imagine a world where gradient descent is able to find the Fermat primality test without us having any algorithm that can learn one of those stronger tests. We’ll revisit the plausibility of this situation in section 2.

## The analogy

An analogous situation could make it hard to detect sensor tampering:

This is a problem if the *best* way to act competently is to train a sophisticated predictor and then use it for planning, and if all sophisticated predictors have this unfortunate mechanism-indistinguishability property. The problem can materialize either when we build a model-based algorithm that plans against such a predictor, or when SGD learns a neural network which internally selects actions based on a similar predictor.

It’s worth emphasizing that this feels like an extremely weird situation. In some sense the predictor “doesn’t know” whether there are really smiling humans or whether the sensors are tampered with, and couldn’t answer that question honestly even if it wanted to. You could imagine calling this “not an alignment failure” (or at least not an ELK failure) depending on how you define blurry words like “knowledge” and “intent.” But at the end of the day we would still end up with a competent AI which produced creative and competent actions that disempower humanity, and I think we should try to avoid this failure whatever you call it.

If we had a clear definition of “knowledge” then I could imagine drawing a bright line and saying this isn’t an alignment failure, in the same way I would if e.g. my AI started a nuclear war in the same way that a similarly-competent human would. But I think you should be very hesitant to do that while “knowledge” is such a vague concept, because I think it facilitates sweeping all failures into the gaps that we don’t understand. Instead our priority should be to understand and ideally rule out this strange potential situation where the AI creatively and competently kills us all without even knowing that it does so.

## 2. Are distinct mechanisms efficiently distinguishable?

ARC’s current “plan A” is that any pair of sufficiently distinct mechanisms can be efficiently distinguished. This would be really convenient for a bunch of reasons and it looks plausible, so we’ll keep investigating it until we feel we’ve ruled it out.

In particular, right now I don’t have even a single example of a function f such that (i) there are two *clearly distinct* mechanisms that can lead to f(x) = 1 for any particular input x, (ii) there is no known efficient discriminator for distinguishing those mechanisms for a given input x. I would really love to have such examples.

That said, there are still two big reasons that I’m skeptical about the conjecture that distinct mechanisms are always distinguishable: (a) it’s a really striking claim for which a failure to find counterexamples isn’t very strong evidence, and (b) in known examples like primality testing it still seems easy to imagine the situation where we can find the mechanism but not the discriminator, i.e. we haven’t yet found an automated way to learn a discriminator.

Overall if I had to guess I’d say maybe a 20% chance that there is a formal version of “all distinct mechanisms are distinguishable” which is true and sufficient to rule out sensor tampering. This is still high enough that it’s a significant priority for me until ruled out.

## A. This is a striking claim and judging counterexamples is hard

Any universally-quantified statement about circuits is pretty striking: it would have implications for number theory, dynamical systems, neural nets, *etc.* It’s also pretty different from anything I’ve seen before. So the odds are against it.

One piece of evidence in favor is that it’s at least plausible: it’s kind of weird for a circuit to have a hidden latent structure that can have an effect on its behavior without being detectable.

Unfortunately there are plenty of examples of interesting mathematical circuits (e.g. primality tests) that reveal the presence of some latent structure (e.g. a factorization) without making it explicit. Another example I find interesting is a determinant calculation revealing the presence of a matching without making that matching explicit. These examples undermine the intuition that latent structure can’t have an effect on model behavior while remaining fully implicit.

That said, I don’t know of examples where the latent structure isn’t distinguishable. Probabilistic primality testing comes closest, but there are in fact good primality tests. So this gives us a second piece of evidence for the conjecture.

Unfortunately, the strength of this evidence is limited not only by the general difficulty of finding counterexamples but also by the difficulty of saying what we mean by “distinct mechanisms.” If we could really precisely state a theorem then I think we’d have a better chance of finding an example if one exists, but as it stands it’s hard for anyone to engage with this question without spending a lot of time thinking about a bunch of vague philosophy (and even then we are at risk of gerrymandering categories to avoid engaging with an example).

## B. Automatically finding a good probabilistic primality test seems hard

The Fermat test can pass for either primes or Carmichael numbers. It turns out there are other tests that can distinguish those cases, but it’s easy to imagine learning the Fermat test without being able to find any of those other superior tests.

To illustrate, let’s consider two examples of better tests:

- Rabin-Miller: If a^(n-1) = 1 (mod n), we can also check a^((n-1)/2). This must be a square root of 1, and if n is prime it will be either +1 or -1. If we get +1, then we can keep dividing the exponent by 2, considering a^((n-1)/4) and so on. If n is composite then 1 has a lot of square roots other than +1 and -1, and it’s easy to prove that with reasonably high probability one of them will appear in this process.
- Randomized AKS: If n is prime and X is an indeterminate, then (a+X)^n = (a^n + X^n) = (a + X^n) mod n. This condition is hard to evaluate, but if we arbitrarily define X^r = 1 for some small number r then we can compute (a + X)^n mod n by iterated squaring in time O(r log n). If n is composite, it turns out there is a high probability that (a+X)^n != (a + X^n) mod n.

While these tests are structurally similar to the Fermat test, there’s no obvious way to automatically transform the Fermat test into either of these stronger formats. And so if we had learned the Fermat test, it’s not clear we’d have any way to find one of the stronger tests without learning them. Moreover, while these tests are somewhat simple, they are more complex than the Fermat test, and so this learning process might be much harder than the problem of learning the Fermat test itself.
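For readers who want to see how close these tests are to the Fermat test, here is a rough Python sketch of both checks. It is a sketch under assumptions rather than a reference implementation: the function names are made up for this example, the Rabin-Miller check uses the usual bottom-up formulation (repeatedly squaring a^d for the odd part d of n-1, which visits the same sequence of square roots), and the AKS-style check uses naive polynomial multiplication modulo X^r - 1 with r >= 2.

```python
def fermat_passes(n, a):
    return pow(a, n - 1, n) == 1

def rabin_miller_witness(n, a):
    """True if base a proves that odd n > 2 is composite."""
    d, s = n - 1, 0
    while d % 2 == 0:          # write n - 1 = 2^s * d with d odd
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return False
    for _ in range(s - 1):
        x = x * x % n
        if x == n - 1:         # reached -1, consistent with n being prime
            return False
    return True                # 1 has a square root other than +1/-1, or Fermat fails

def polymul(p, q, r, n):
    """Multiply coefficient lists modulo (X^r - 1) and modulo n."""
    out = [0] * r
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[(i + j) % r] = (out[(i + j) % r] + pi * qj) % n
    return out

def aks_congruence_holds(n, a, r):
    """Check (a + X)^n = a + X^n modulo (X^r - 1, n), assuming r >= 2."""
    result = [1] + [0] * (r - 1)
    base = [a % n, 1] + [0] * (r - 2)
    e = n
    while e:                   # iterated squaring on polynomials
        if e & 1:
            result = polymul(result, base, r, n)
        base = polymul(base, base, r, n)
        e >>= 1
    expected = [0] * r
    expected[0] = a % n
    expected[n % r] = (expected[n % r] + 1) % n
    return result == expected

print(fermat_passes(561, 2), rabin_miller_witness(561, 2))
print(aks_congruence_holds(563, 3, 7), aks_congruence_holds(4, 1, 3))
```

Base 2 passes the Fermat test for the Carmichael number 561 but is a Rabin-Miller witness against it, which is the distinction the Fermat test alone cannot make. The code is only a few lines longer than the Fermat test, yet nothing in the Fermat test's own code suggests these particular modifications.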

To avoid this problem we’d like to specify a way to represent these stronger tests that uses fewer (additional) parameters than the Fermat test itself. As a stylized example, you could imagine specifying a stronger test by pointing to particular parts of the Fermat test and saying “randomize these parts.” As long as the number of parameters describing “which parts to randomize” was smaller than the number of parameters in the Fermat test itself, we’d be in business.

While there’s no obvious representation like that, it seems plausible to me that we could find one, particularly for the AKS test. This would be very suggestive that we could find a stronger test as quickly as we can find the Fermat test, but wouldn’t be totally decisive since e.g. gradient descent could have an easier time with one learning problem than the other (and that will depend on more details).

Of course it’s particularly plausible that you could have the Fermat test but no robust test because that was humanity’s situation for many years! An automated strategy for extending the Fermat test to distinguish primes from Carmichael numbers would appear to teach us something fundamental about primality testing, that wasn’t known to humanity until the 70s, *without making reference to any facts about numbers*. So that seems like a tall order.

I’m focused on the primality testing example in large part because it’s the case where the situation seems most hopeless. If we could find one in this case I would become meaningfully more optimistic about the overall conjecture.

## 3. Being unable to distinguish mechanisms is bad news

If it’s impossible to distinguish distinct mechanisms, then that rules out approaches to ELK based on mechanistic anomaly detection alone. But the situation seems much worse than that, and in fact it seems like any approach to alignment is going to need to exploit some additional property of sensor tampering to rule it out.

In particular:

Overall my sense is that if distinct mechanisms can be indistinguishable, then any indefinitely scalable solution to alignment will need to use additional properties of sensor tampering or assumptions about the structure of optimal decision-making. That said, even in relatively pessimistic worlds there may be a long period of time (potentially extending to fairly superhuman systems) for which these issues are surmountable.

## 4. Approaches to sensor tampering assuming indistinguishable mechanisms

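In this section we’ll discuss three possible approaches to sensor tampering:

- A. Harden sensors so that tampering is harder than the intended task.
- B. Detect sensor tampering that requires “trying.”
- C. Assume that your AI “knows what’s going on.”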

Our main hope is to find a way to efficiently distinguish distinct mechanisms. But it still seems worth fleshing out these alternatives.

## A. Harden sensors so that tampering is harder than the intended task

We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but *left no trace of itself*.

Some dynamics make this seem more plausible, at least as a way to avoid the particular obstruction in this post (and therefore open up a much broader universe of “outs”):

On the other hand, there are two big reasons we’re scared of this kind of approach:

*any* successful robot rebellion would constitute impossible-to-detect sensor tampering since it could ultimately erase any evidence recorded by sensors. In practice we expect even deeper forms of underdetermination than “what is the definition of a sensor?” and so we would strongly prefer to have approaches (like mechanistic anomaly detection) that don’t require making strong structural assumptions of this type.

## B. Detect sensor tampering that requires “trying”

So far we’ve talked about training a sophisticated predictive model and then imagined an exhaustive or random planner who searches against it.

While it seems hard to make sensors so robust that a powerful AI cannot tamper with them, it seems much easier to make sensors so robust that you can’t tamper with them *by randomly generating actions*.

In that case, even if we have a sophisticated predictor which doesn’t distinguish sensor tampering from the intended behavior, we don’t have trouble unless there is also an algorithm which *proposes* actions that have an elevated risk of leading to sensor tampering.

Of course any policy which proposes coherent actions would tend to elevate the probability of successful sensor tampering alongside the probability of successfully doing the task. But we may hope to harden the sensor far enough that this kind of generic prior over competent actions still won’t lead to sensor tampering unless it is specifically driving at that goal.

So this suggests a two step plan:

*proposer* can distinguish actions that are “deliberately” tampering from actions that are trying to do the task as intended, even if the predictor cannot.

Both of these steps seem challenging. Conceptually the second step seems like the main blocker: in cases where a predictor can predict success without distinguishing mechanisms, it also seems possible for a proposer to propose successful plans without distinguishing mechanisms. But it’s not at all a trivial link, and e.g. the primality testing example can’t be easily adapted into this format, suggesting that it might be a fundamentally different step.

One reason you might think that this approach is doomed is that there is always a simple action that tampers with the sensors, namely: build an external agent trying to tamper with the sensors. This gives an absolute lower bound on the probability of accidentally tampering with the sensors no matter how much they are hardened.

But if the external agent *itself* must do something to deliberately tamper, then that still means that there exists an efficient procedure for detecting the sensor tampering and leaves us with a possible out: it seems plausible that this approach can be extended to work for external agents if and only if it works in cases with no external agents. It generally seems hard to write off possible approaches based on counterexamples that involve creating external agents, since those counterexamples can often be handled by separate tactics (we do ultimately need to confront such counterexamples after handling the “base case,” but it seems like a mistake to write off approaches too early based on this difficulty).

## C. Assume that your AI “knows what’s going on”

It would be very strange if (human faces) → (sensors) was analogous to (primality) → (Fermat test), because it suggests that our AI’s grip on the concept of “human faces” is extremely tenuous. Such a model could be reasonably said to “not know” whether there are really humans in the room or if sensors have been tampered with.

In section B we argued that maybe this meant the AI didn’t pose a threat, because tampering with sensors is only possible if you actually understand what you are doing.

But even if that hope doesn’t work out, we could still try to:

As a vague illustration, we could try to argue that the AI systems we are worried about have a certain kind of mechanical understanding of the systems they are predicting. We could show that this kind of mechanical understanding is inconsistent with primality-test-like cases (which are possible only because “is prime” lacks interesting internal structure and is more like a raw fact about the world), and then we could empirically monitor whether our AI systems actually have this kind of rich understanding.

As another illustration, we could try to relate “knowledge” directly to the ability to answer certain kinds of questions, including honestly answering questions about what’s happening in the world, and then carefully monitor how hard it is for our AI to answer those questions so that we’d notice if it became hard.

This is more of a vague catch-all than either approach A or B, but I think it deserves attention because the situation of an AI tampering without knowing it feels extremely weird. So there may be room to make sense of that situation and show how it can’t happen.

## Conclusion

It may be possible for a behavior to arise via two very different mechanisms without there being any efficient way to distinguish those mechanisms. That would be a fatal problem for mechanistic anomaly detection, but it would also probably mean that *any* approach to AI alignment will need to exploit some additional structure about sensor tampering in order to avoid it. This may take the form of clarifying the intuition that problematic AI systems surely “know” they are tampering with sensors; it may also be a serious obstacle to solving alignment under a clearly-articulated set of assumptions rather than relying on messy empirical contingencies.