Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(This post was written as part of my work at the Alignment Research Center.)

In this post I’ll describe some possible approaches to eliciting latent knowledge (ELK) not discussed in our report. These are basically restatements of proposals by Davidad, Rohin, Ramana, and John Maxwell. For each approach, I’ll present one or two counterexamples that I think would break it.

I think of these approaches as being based on two fundamental hopes:

  1. If our AI tampers with one sensor, there may be other sensors we could deploy, or experiments we could run, in order to notice that something fishy has happened. We could have an AI help us design those experiments/sensors.
  2. Actions that covertly tamper may be similar to actions that visibly tamper, e.g. they may exploit similar unknown facts about physics or they may involve common substeps. So we might be able to notice covert tampering by pointing out similar action sequences that lead to visible tampering.

While these hopes may hold in practice, in this post I’ll describe possible situations where they would fail. (I don't think anyone particularly expected these approaches to be counterexample-proof.)

I’m only claiming that these failures are possible. I think it is also interesting to try to generate the most probable failure cases and to actually estimate their probability, but that’s definitely not what I’m doing here. 

Strategy: Hold out sensors

(Proposal #1 here)

In addition to asking my model to predict one set of sensors, I can deploy a second “held out” set (e.g. more cameras or seismographs or microphones…). I don’t ask my model to predict those sensors, but I show them to a human before getting them to answer questions. Hopefully that sometimes lets them notice tampering that would have been invisible without the new sensors.

Worry: human simulator can use all the extra sensor readings too

I’m concerned that I will learn a better version of the human simulator which predicts the readings of all sensors and then outputs what a human would infer from the complete set. That task need not be much more computationally demanding than the original human simulator. 

Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get a hint about where I may have placed them. If this is possible, then “Predict the readings of all sensors” need not be much more complex than “Predict the readings of one sensor.”

An AI doesn’t have to be exactly correct about what sensors exist—when in doubt it can just predict what the sensor would say and imagine giving it to the human.

Strategy: Predict hypothetical sensors

(Proposal #2 here, also suggested with counterexample by Rohin in private communication)

Instead of installing a single sensor, I could randomly choose what sensors to deploy and how to install them. Then I could train a model Predict+ to map (description of sensor) → (sensor reading), and I could give a human the readings from many hypothetical sensors in order to give them the best chance of detecting tampering.

Worry 1: Predict+ may not be reasonable on non-existent sensors

To collect training data for Predict+, I actually install sensors and then use their readings as a target. In every situation, my AI is being asked to predict the readings from a sensor that I really did build.

If my AI is able to deduce the presence of sensors (as described in the last counterexample), then it’s unclear how Predict+(sensor description) will behave when we give it a description of a sensor that does not in fact exist. 

In particular, Predict+(sensor description) might behave in the following undesirable way: it predicts the future observations conditioned on its observations and on the fact that someone typed (sensor description) into the dataset. But it knows that no sensor with description (sensor description) exists. So if someone typed that description into the dataset, that just means something funny is going on—not that the dataset will magically contain readings from a non-existent sensor.

For a predictor which understands the world well it seems quite plausible that this is the most natural behavior. Otherwise it needs to reason about a physically impossible counterfactual to get the “intended” answer (which it never needed to do to answer any other questions). Moreover, it would also be undesirable if the predictor conditioned on such a sensor existing despite having strong evidence to the contrary (since its predictions may then be extremely strange).

(A more extreme version of the same worry applies to proposal #3 here. A predictor seems even more likely to behave badly when it is given a description of a sensor that the human obviously couldn’t have built, and it would definitely know that the sensor doesn’t actually exist.)

Worry 2: the human may not be able to tell what is going on even with all the sensors

For some kinds of tampering no possible sensors would reveal what had happened. For example, we can imagine that our AI has covertly built an army of tiny machines which deconstruct the objects in the room while replacing them with fake versions that look the same on the outside. Sufficiently sophisticated machinery could reproduce the results of almost any experiment we perform (e.g. if you broke an egg, the tiny machines could simulate the sound of an egg cracking and quickly create the yolk on demand).

More realistically, our AI could search for sensors and tamper with any that it finds. It’s not clear how it is supposed to answer questions about what a sensor “would have shown”—if the hypothetical sensor had actually existed, then it also would have been found and tampered with. In this scenario, it may be “correct” for Predict+ to report the tampered value.

(As usual, this is only a problem if the human overseer literally can’t notice anything suspicious about a sequence of actions which is in fact constructing an army of sophisticated tiny machines. We are putting the burden of proof on the builder to show that a given concern is impossible.)

Strategy: conduct new experiments to notice that something is wrong

(Part of the proposals here and here.)

In the previous section I described installing sensors before the AI acted so that we’d have a better chance of seeing what was going on in real time. But we could also imagine installing sensors or conducting experiments after arriving at a given state, in order to better understand that state.

For example, suppose I believe that the diamond is in the room and all sensors are functioning correctly. I could ask another AI to design an experiment which would show that I’m fundamentally misunderstanding the situation and the sensors aren’t functioning correctly (e.g. perhaps it would walk me through building an x-ray machine to show me that the room is actually empty).

Worry 1: there is no experiment that reveals the problem

Ultimately to implement an experiment we will execute some actions in the world and observe the results. These experiments won’t be successful if either:

  • The sensors used to record the results of an experiment are also tampered with, and there is no sequence of actions that “fix” those sensors. (As in the “delusion box” introduced here and discussed as a counterexample for an ELK-like problem here.)
  • The actuators for carrying out an experiment have been tampered with. Perhaps our AI “proposes an experiment,” but actually the human sees something totally different (or is incapacitated).

You could hope to avoid this problem by using the same sensors and actuators your AI uses to interact with the world; then any form of tampering which undermines those sensors/actuators seems like it must also undermine the AI’s ability to act in the world.

But it seems like we could still have a problem:

  • Your AI may not continue acting using the same actuators or sensors, or at least not only the same actuators, with which it was originally constructed. It may have constructed tools in the environment, other autonomous processes, or other copies of itself.
  • Even if your AI is continuing to use the same sensors, it could tamper with sensors in a way that preserves their usefulness while forcing them to look innocuous to the human (e.g. by putting information in higher-order bits of a benign-looking video).
  • In some situations an AI would be motivated to tamper with sensors even if doing so reduced its own ability to act in the world.

Worry 2: an experiment can itself perform tampering

If we let our AI propose experiments to “reveal” tampering, it may instead propose experiments that do tampering from scratch. That tampering could then provide convincing evidence that something is fishy. If this can be done in any state, then it is useless for telling whether something fishy has actually happened.

We could try to avoid this by restricting to actions that are simple enough that a human understands how they work and can be confident that they won’t tamper. But it then becomes even more unclear whether there is any experiment that would identify problems (since we are effectively pitting a human experimenter against a very powerful manipulator).

Alternatively, we could hope that “reporting existing tampering” involves shorter or simpler action sequences than tampering from scratch. Then we could identify a state as fishy if it’s possible to find a short sequence of actions that look fishy.

But unfortunately the fishiest states could end up looking the safest. For example, suppose that if an AI spends 10 * N + 100 timesteps tampering with the sensors, it takes N steps to “undo” its efforts and reveal that something has gone wrong. A simple tampering attempt takes 200 timesteps to execute and 10 timesteps to undo, which sounds good. But if the AI spent 1100 timesteps tampering, it would take 100 timesteps to undo its work, which is more than the amount it would take to tamper from scratch. So this extremely-tampered-with state would end up looking less fishy than a normal state.
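The arithmetic in this example can be checked with a small sketch. It uses the toy cost model above (10 * N + 100 timesteps of tampering can be undone in N steps). It also assumes, which the post does not state, that the cheapest tampering from scratch is the N = 0 case, i.e. 100 timesteps. The function names are hypothetical.

```python
def tamper_cost(undo_steps: int) -> int:
    """Timesteps spent tampering under the toy model: 10 * N + 100."""
    return 10 * undo_steps + 100


def undo_cost(tamper_steps: int) -> int:
    """Invert the toy model: steps needed to reveal tampering that took tamper_steps."""
    return (tamper_steps - 100) // 10


# Assumption (not in the post): the cheapest from-scratch tampering is the N = 0 case.
CHEAPEST_FRESH_TAMPERING = tamper_cost(0)


def looks_fishy(tamper_steps: int) -> bool:
    """Flag a state only if revealing existing tampering is cheaper than tampering afresh."""
    return undo_cost(tamper_steps) < CHEAPEST_FRESH_TAMPERING


for steps in (200, 1100, 2100):
    print(steps, undo_cost(steps), looks_fishy(steps))
```

Under these assumptions the 200-timestep tampering is flagged, while the 1100-timestep and 2100-timestep states are not, because revealing them is no cheaper than tampering afresh.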

Strategy: detect covert tampering by analogy to visible tampering

(Part of the proposal here.)

As we’ve just discussed, some forms of sensor tampering might also destroy your ability to ever figure out whether tampering has occurred (by disabling actuators or sensors). But it’s still plausible that on the way to tampering we will pass through states in which something fishy is clearly about to happen. If we interrupt these action sequences in the middle, maybe we can do an experiment that shows that something has gone wrong.

Put differently: maybe action sequences that result in covert tampering have the same initial steps as action sequences that result in visible tampering.

More generally: we could look for any recognizable similarity between action sequences that result in covert tampering and action sequences that result in nearly-identical yet visible tampering.

Intuitively you might be optimistic about this if e.g. a would-be tamperer needs to somehow specify what they want to show up on the camera after it’s tampered with. If they just specified something slightly different then the action sequence would still be almost the same yet it would be easy to tell that the camera was tampered with (e.g. because it’s showing something physically impossible). Even if a human doesn’t understand how the covert tampering worked, they could say that something bad has happened because the action sequence is similar to one that causes overt tampering.

Worry: action sequences may be brittle

There may be sequences of actions that covertly tamper with the sensors, but for which any small change leads to no tampering at all.

As an extreme example, we can imagine an AI whose first action is to commit to the hash of its next 50,000 actions. Perhaps it chooses a value x, and then installs a device that will cut power to its motors unless the hash of the next 50k actions is equal to x. Perhaps it “should have been” possible to change action 34,279 in order to carry out visible tampering rather than covert tampering—but instead what will happen is that the device will trigger and incapacitate the AI, so that no tampering at all occurs.
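A minimal sketch of this commitment, assuming actions can be serialized as strings and using SHA-256 as the hash (the post does not name one). The device is simplified to a single check over the whole sequence, and `commit` and `run` are hypothetical names.

```python
import hashlib


def commit(actions: list[str]) -> str:
    """Return a SHA-256 digest of the full action sequence."""
    joined = "\n".join(actions).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def run(actions: list[str], committed: str) -> str:
    """Toy outcome: any deviation from the committed plan trips the cutoff device."""
    if commit(actions) != committed:
        return "power cut: no tampering"
    return "covert tampering"


plan = [f"action-{i}" for i in range(50_000)]
x = commit(plan)

# Change a single action (number 34,279 when counting from 1).
perturbed = list(plan)
perturbed[34_278] = "show something impossible on camera"

print(run(plan, x))
print(run(perturbed, x))
```

The perturbed plan differs in one action, yet it yields no tampering at all rather than visible tampering, so the nearby action sequence gives the human nothing to compare against.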

More realistically you might imagine an environment where an AI needs to prepare for chaotic influences in the environment—if it changes the action in timestep 17 it could have subtle effects later that change what actions in timestep 140 are feasible. In this case, even if the effects of tampering are determined by actions at late timesteps, you still need to take corresponding actions at early timesteps in order to prepare.

A broader way of viewing this concern is that covert and visible tampering may require “similar” actions in an abstract sense that is meaningful to an AI, which nevertheless correspond to very different sequences of low-level actions and which look quite different to a human.

If there are any kinds of tampering that are brittle in this way, then we won’t be able to recognize them (and so if we select against tampering we may be left with such actions).

10 comments

Thanks for the reply! I'll respond to the "Hold out sensors" section in this comment.

One assumption which seems fairly safe in my mind is that as the operators, we have control over the data our AI gets. (Another way of thinking about it is if we don't have control over the data our AI gets, the game has already been lost.)

Given that assumption, this problem seems potentially solvable:

Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get a hint about where I may have placed them. If this is possible, then “Predict the readings of all sensors” need not be much more complex than “Predict the readings of one sensor.”

If the SmartVault learns a policy from data which was all generated prior to the installation of the new sensors, it seems unlikely that policy would deliberately account for the existence of those specific new sensors. We could leave info about your past actions out of the dataset as well.

An alternative approach in response to this bit:

I’m concerned that I will learn a better version of the human simulator which predicts the readings of all sensors and then outputs what a human would infer from the complete set.

The scenario is: we're learning a function F1(A, S1) -> D where A is an action sequence, S1 is readings from the known sensor, and D is a diamond location. Previously we've discussed two functions which both achieve perfect loss on our training data:

  • D1(A, S1) -- a direct translator which takes A and S1 into account

  • H1(A, S1) -- a simulation of what a human would believe if they saw A and S1

Let's also consider two other functions:

  • D2(A, S1, S2) -- a direct translator which takes A, S1, and S2 into account

  • H2(A, S1, S2) -- a simulation of what a human would believe if they saw A, S1, and S2

Your concern is that there is a third function on the original (A, S1) domain which also achieves perfect loss:

  • H1'(A, S1) = H2(A, S1, P_S2(A, S1)) -- defining P_S2 as a prediction of S2's readings given A & S1, we have H1' as a simulation of what a human would believe if they saw A, S1, and readings for S2 predicted from A & S1.

Why would it be bad if gradient descent discovered H1'? Because then when it comes time to learn a policy, we incentivize policies which deceive predicted readings for S2 in addition to S1.

Here's an idea for obtaining a function on the original (A, S1) domain which does not incentivize policies which deceive S2:

  1. Learn a function F2 on the expanded domain (A, S1, S2), using a training set which is carefully constructed so that the only way to achieve perfect loss is to do a good job of taking readings from S2 into account. (For example, deliberately construct scenarios where the readings from S2 are not what you would expect if you were only looking at A and S1, and make ignoring A & S1 in favor of S2 key to labeling those scenarios correctly.) F2 could be closer to either D2 or H2, I don't think it matters much.

  2. Define a function F1_only(A, S1) = F2(A, S1, <hardcoded stream of boring S2 sensor data>).

Now let's use F1_only as the target for learning our policy. I argue a policy has no incentive to deceive S2, because we know that F2 has been optimized to trust its S2 argument over its A and S1 arguments regarding what is going on around S2, and when we call F2 through F1_only, its S2 argument will always be telling it there are no interesting readings coming from S2. So, no bonus points for a policy which tries to fool S2 in addition to S1.
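Here is a rough sketch of the two constructions, in case it helps make the setup concrete. Everything in it is a toy assumption: readings are tuples of floats, the answer D is collapsed to a boolean "diamond present", and `toy_f2` and `toy_predict_s2` are hand-written stand-ins for functions that would really be learned.

```python
from typing import Callable, Sequence

Actions = Sequence[str]
Reading = Sequence[float]
Reporter3 = Callable[[Actions, Reading, Reading], bool]

# Hypothetical "nothing interesting is happening" stream for S2.
BORING_S2: Reading = (0.0, 0.0, 0.0)


def make_f1_only(f2: Reporter3, boring_s2: Reading = BORING_S2):
    """Build F1_only(A, S1) = F2(A, S1, <hardcoded boring S2 data>)."""
    def f1_only(actions: Actions, s1: Reading) -> bool:
        return f2(actions, s1, boring_s2)
    return f1_only


def make_h1_prime(h2: Reporter3, predict_s2: Callable[[Actions, Reading], Reading]):
    """Build H1'(A, S1) = H2(A, S1, P_S2(A, S1)) for contrast."""
    def h1_prime(actions: Actions, s1: Reading) -> bool:
        return h2(actions, s1, predict_s2(actions, s1))
    return h1_prime


def toy_f2(actions: Actions, s1: Reading, s2: Reading) -> bool:
    """Toy stand-in for a learned F2: any S2 alarm overrides what S1 shows.

    This toy ignores the actions entirely; a learned F2 would not.
    """
    s2_alarm = max(s2) > 0.5
    s1_shows_diamond = s1[-1] > 0.5
    return s1_shows_diamond and not s2_alarm


def toy_predict_s2(actions: Actions, s1: Reading) -> Reading:
    """Toy P_S2: S2 would register an alarm whenever the actions include 'drill'."""
    return (1.0,) if "drill" in actions else (0.0,)


f1_only = make_f1_only(toy_f2)
h1_prime = make_h1_prime(toy_f2, toy_predict_s2)

print(f1_only(["drill"], (1.0,)))   # S2 argument is fixed, so the predicted alarm never reaches F2
print(h1_prime(["drill"], (1.0,)))  # predicted S2 readings feed into the answer
```

In this toy, the output of `f1_only` cannot depend on anything a policy does to S2, whereas `h1_prime` changes its answer when the predicted S2 reading changes. That is the sense in which I'd expect F1_only to give no bonus for fooling S2.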

Maybe there is some kind of unintended consequence to this weird setup; I just came up with it and it's still a bit half-baked in my mind. (Perhaps you could make some kind of exotic argument on the basis of inner optimizers and acausal trade between different system components?) But the meta point is there's a lot of room for creativity if you don't anthropomorphize and just think in terms of learning functions on datasets. I think the consequences of the "we control the data our AIs get" assumption could be pretty big if you're willing to grant it.

In section "Worry 2: an experiment can itself perform tampering":

But if the AI spent 1100 timesteps tampering, it would take 100 timesteps to undo its work, which is less than the amount it would take to tamper from scratch.

Shouldn't it be "more" instead of "less"?

I'll respond to the "Predict hypothetical sensors" section in this comment.

First, I want to mention that predicting hypothetical sensors seems likely to fail in fairly obvious ways, e.g. you request a prediction about a sensor that's physically nonexistent and the system responds with a bunch of static or something. Note the contrast with the "human simulator" failure mode, which is much less obvious.

But I also think we can train the system to predict hypothetical sensors in a way that's really useful. As in my previous comment, I'll work from the assumptions (fairly weak IMO) that

  1. We can control the data our systems get.

  2. We are capable of doing regular old supervised learning -- possibly in conjunction with transfer learning that gives the system generic prior knowledge like the meaning of English words, but not specific prior knowledge like details of our situation (unless we want that). Our supervised learning finds a function which maps training examples in X to labels in Y (labels may or may not correspond to "reality").

In particular, these assumptions imply that our system doesn't necessarily need to know whether a sensor it's trying to predict exists physically (or if it would be physically possible to build).

But what if over the course of its operation, the system accidentally learns that a sensor of interest doesn't exist? E.g. because it points a sensor that does exist in the direction of the one that doesn't exist, and there's nothing present. Ideally we could be reasonably confident of good "readings" from the nonexistent sensor even past that point.

To achieve this, we could make use of the "regular old supervised learning" assumption and construct a doctored dataset as follows:

  1. Place camera S1 in the center of the vault, turn it on, and have it noisily rotate 360 degrees to capture a panorama.

  2. Mount audio sensor S2 on the interior wall of the vault, turn both sensors on, and have S1 do the same panorama.

  3. Create a supervised learning problem to predict the readings of S2 gathered in step 2, given S1's panorama from step 1.

If we have prediction problems like this in our dataset, hypotheses which predict that S2 continues behaving as normal even if physically absent will achieve a lower loss, and should be preferentially chosen.

A similar method lets us preferentially choose hypotheses that predict that S2 will emit a special "TAMPERING ACTIVE" signal during tampering, then work as normal once tampering is complete:

  1. Place audio sensor S2 in the corner of the vault and turn it on.

  2. Move a robotic arm right next to S2.

  3. Have the arm tamper with S2 so it no longer hears the sound of footfalls.

  4. Discard S2 and replace it with a fresh new sensor, S2', that has identical factory specifications.

  5. Have someone walk beneath S2' in full view of camera sensor S1.

  6. Splice together a composite audio sensor reading with the following sequence: S2 readings from steps 1 and 2, a "TAMPERING ACTIVE" signal for the duration of step 3, nothing for step 4, and the S2' readings from step 5.

  7. Create a supervised learning problem to predict the composite reading from step 6, given the movements of the robot arm from steps 2 and 3, plus the visual readings from S1 with step 4 snipped out.

The "S2 is vulnerable to tampering" hypothesis predicts that footfalls should not be present in the audio. But since they are present in the audio, it will achieve a higher loss. Failure to predict the emission of the "TAMPERING ACTIVE" signal also gives this hypothesis a higher loss.
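The splicing in steps 6 and 7 could look roughly like the sketch below. The `Segment` record, the string-valued readings, and `build_example` are all hypothetical; real sensor streams would need timestamps and alignment that I'm glossing over.

```python
from dataclasses import dataclass

TAMPERING_ACTIVE = "TAMPERING ACTIVE"


@dataclass
class Segment:
    step: int        # which step of the procedure this segment was recorded in
    s1_frames: list  # camera readings from S1
    audio: list      # readings from S2 (steps 1-3) or S2' (step 5)
    arm_moves: list  # robot-arm movements; empty when the arm is idle


def build_example(segments):
    """Return (inputs, target) for the doctored supervised-learning problem."""
    inputs, target = [], []
    for seg in segments:
        if seg.step == 4:
            continue  # the sensor swap is snipped out of both inputs and target
        inputs.append({"s1": seg.s1_frames, "arm": seg.arm_moves})
        if seg.step == 3:
            # Replace whatever S2 recorded during tampering with the marker signal.
            target.extend([TAMPERING_ACTIVE] * len(seg.audio))
        else:
            target.extend(seg.audio)
    return inputs, target


segments = [
    Segment(1, ["f1"], ["hum"], []),
    Segment(2, ["f2"], ["whirr"], ["approach"]),
    Segment(3, ["f3"], ["muffled", "muffled"], ["tamper"]),
    Segment(4, ["f4"], [], []),
    Segment(5, ["f5"], ["footfall"], []),
]
inputs, target = build_example(segments)
print(target)
```

The target keeps the footfall from S2' even though the inputs show the arm tampering with the sensor, which is what penalizes the "S2 is vulnerable to tampering" hypothesis.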

It might also be possible to do away with the "TAMPERING ACTIVE" signal somehow, e.g. by making S2 and S2' identical except that the tampering which works on S2 doesn't work on S2', then splicing together readings from the robot arm doing the exact same tamper sequence on both sensors.

Finally, here are a few ideas for preferentially choosing hypotheses that predict accurate readings from technologically impossible sensors.

Supersensor Idea #1

  1. Write a natural language description of a "telepathic sensor". Describe it as being mounted right above a camera sensor S1.

  2. Place boxes X, Y, and Z within the vault beneath the watchful eye of S1.

  3. Turn on S1. Have a person enter the vault, visibly ponder for a moment, and open box Y.

  4. Create a stream of "telepathic sensor" readings. Find a timestamp in the video where the person is pondering and hasn't yet made a movement. Add a reading from the "telepathic sensor" at that timestamp: "The person intends to open box Y".

Supersensor Idea #2

  1. Write a natural language description of an "x-ray vision" sensor which can be pointed in various directions. Describe it as being mounted right above a camera sensor S1.

  2. Place a deck of cards on a table beneath the watchful eye of S1.

  3. Have a person enter the vault, shuffle the deck beneath the table, show a 9 of hearts to S1, and place it face down on top of the deck.

  4. Create a stream of "x-ray vision" sensor readings. Specify that the sensor is initially pointed away, but once the action finishes, the sensor turns towards the table and registers: "The top card of the deck is a 9 of hearts."

Supersensor Idea #3

  1. Write a natural language description of a "nanobot sensor" which beeps whenever a nanobot is present within 10 feet.

  2. Run an action sequence which builds nanobots and moves them within 10 feet of where the "nanobot sensor" is mounted.

  3. Create a stream of "nanobot sensor" readings. Specify that the sensor starts beeping as soon as the nanobots are nearby.

A few more thoughts on translating natural language into sensor readings:

  • Excise a real sensor type from the system's prior knowledge. Ask your system to predict sensor data from a physical instance of this sensor, given a natural language description of its workings plus other readings from the environment. (H/T Romeo Stevens)

  • Make a weird modified sensor (e.g. a camera sensor which provides an upside down picture). Ask your system to predict readings from the modified sensor, given a natural language description of its modifications plus other readings from the environment.

Anyway, I'm not sure we need to reason about a physically impossible counterfactual or condition on the sensor existing physically. It seems perfectly coherent to ask "what is the answer to this thought experiment" rather than "if this sensor existed, what would it see"? For example, instead of the question "what would an Acme Corp camera mounted here see", consider the question "if the light which passes through a pinhole at these coordinates intersected with a plane at these other coordinates, and the intersections were digitized and formatted the same way Acme Corp cameras format photos, what would be the resulting binary file?"
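To illustrate how mechanical the pinhole version of the question is, here is a minimal sketch of the geometric part. It assumes the pinhole looks along the +z axis with the image plane a distance `focal_length` behind it. Digitizing and formatting the result the way a particular camera does is left out, and `project` is a hypothetical name.

```python
def project(point, pinhole, focal_length):
    """Project a 3D point through a pinhole onto a plane focal_length behind it.

    Assumes the camera looks along +z; returns None for points not in front of it.
    """
    x, y, z = (p - c for p, c in zip(point, pinhole))
    if z <= 0:
        return None
    # Similar triangles; the image on the plane behind the pinhole is inverted.
    return (-focal_length * x / z, -focal_length * y / z)


print(project((1.0, 2.0, 4.0), (0.0, 0.0, 0.0), 2.0))
print(project((0.0, 0.0, -1.0), (0.0, 0.0, 0.0), 1.0))
```

Nothing in this computation refers to whether a camera physically exists at those coordinates, which is the sense in which the question is a thought experiment rather than a counterfactual about the world.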

Humans don't seem to have a lot of trouble performing thought experiments. If the system tries to fit the data with a hypothesis that references existing pretrained conceptual understanding, as I described above, that could give the system an inductive bias towards performing human-like thought experiments. This could be bad if human thought experiments are vulnerable to human deficiencies. It could also be good if we'd like the AI's hypothetical sensors to behave in the same intuitive way our thought experiments do.

One possible concern is hypotheses which reference dataset doctoring. Obviously one could try to excise knowledge of that possibility. Another quick idea is to try & train a classifier to differentiate doctored vs non-doctored SmartVault sequences, and keep improving our fakes until the classifier can't easily tell the difference? Or try to avoid any sort of branching so the system always acts like it's dealing with a doctored dataset when in production? Could even fuzz the live data stream in a way that makes it appear doctored ;-) Finally, to get a sense for the cognitive signature of a doctoring-related hypothesis, one could train the system to solve some problems where the only way to minimize the loss is to think a lot about doctoring. Maybe a classifier which aims to detect the presence of doctoring-related cognition could be useful here.

Another possibility is an alternative hypothesis along the lines of "predict what the operator would want me to predict" -- unclear if that's desirable?

If we train on data about what hypothetical sensors should show (e.g. by experiments where we estimate what they would show using other means, or by actually building weird sensors), we could just end up getting predictions of whatever process we used to generate that data.

In general the overall situation with these sensors seems quite similar to the original outer-level problem, i.e. training the system to answer "what would an ideal sensor show?" seems to run into the same issues as answering "what's actually going on?" E.g. your supersensor idea #3 seems to be similar to the "human operates SmartVault and knows if tampering occurred" proposal we discussed here.

I do think that excising knowledge is a substantive change; I feel like it's effectively banking on "if the model is ignorant enough about what humans are capable of, it needs to err on the side of assuming they know everything." But for intelligent models, it seems hard in general to excise knowledge of whole kinds of sensors (how do you know a lot about human civilization without knowing that it's possible to build a microphone?) without interfering with performance. And there are enough signatures that the excised knowledge is still not in-distribution with hypotheticals we make up (e.g. the possibility of microphones is consistent with everything else I know about human civilization and physics, while the possibility of invisible and untouchable cameras isn't), and conservative bounds on what humans can know will still hit the one but not the other.

Do you run into a distinction between benign and malign tampering at any point? For example, if humans can never tell the difference between the tampered and non-tampered result, and their own sanity has not been compromised, it is not obvious to me that the tampered result is worse than the non-tampered result.

It might be easier to avoid compromising human sanity + use hold-out sensors than to solve ELK in general (though maybe not? I haven't thought about it much).

In some sense this is exactly what we want to do, and this is why we are happy with a very "narrow" version of ELK (see the appendices on narrow elicitation and why it might be enough, indirect normativity, and avoiding subtle manipulation).

But you still need to care about some sensor tampering. In particular, you need to make sure that there are actually happy humans deliberating about what to do (under local conditions that they believe are conducive to figuring out the answer), rather than merely cameras showing happy humans deliberating about what to do.

So, granting the assumption of not corrupting the humans (which is maybe what you are denying), doesn't this imply that we can go on adding sensors after the fact until, at some point, the difference between fooling them all and being honest becomes unproblematic?

It seems like the frame of some of the critique is that humans are the authority on human values and want to ensure that the AI doesn't escape that authority in illegible ways. To me it seems like the frame is more like we know that the sensors we have are only goodhartedly entangled with the things we care about and would ourselves prefer the less goodharted hypothetical sensors if we knew how to construct them. And that we'd want the AI to be inhabiting the same frame as us since, to take a page from Mad Investor Chaos, we don't know how lies will propagate through an alien architecture.

I don't know how 'find less goodharted sensors' is instantiated on natural hardware, or whether it might have toy versions implemented algorithmically, but it seems like it would be worth trying to figure out. In a conversation, John mentioned a type of architecture that is forced through an information bottleneck to find a minimal representation of the space. That seemed like a similar direction.

I would describe the overall question as "Is there a situation where an AI trained using this approach deliberately murders us?" and for ELK more specifically as "Is there a situation where an AI trained using this approach gives an unambiguously wrong answer to a straightforward question despite knowing better?" I generally don't think that much about the complexity of human values.