This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.
You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)
Caveat: I'll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting "just train AI to not be deceptive; there's a decent chance that works".
I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment.
Attempt at a short version, with the caveat that the short version is apparently a sazen of sorts, and spoiler tagged for people who want the opportunity to connect the dots themselves:
Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some "deception" property, it's that (barring some great alignment feat) it's a fact about the world rather than the AI that deceiving you forwards its objectives, and you've built a general engine that's good at taking advantage of advantageous facts in general.
As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.
Investigating a made-up but moderately concrete story
Suppose you have a nascent AGI, and you've been training against all hints of deceptiveness. What goes wrong?
When I ask this question of people who are optimistic that we can just "train AIs not to be deceptive", there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of 'deception', so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive.
And these are both real obstacles. But there are deeper obstacles that seem to me more central, and that I haven't observed others notice on their own.
That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.
A fledgling AI is being deployed towards building something like a bacterium, but with a diamondoid shell. The diamondoid-shelled bacterium is not intended to be pivotal, but it's a supposedly laboratory-verifiable step on a path towards carrying out some speculative human-brain-enhancement operations, which the operators are hoping will be pivotal.
(The original hope was to have the AI assist human engineers, but the first versions that were able to do the hard parts of engineering work at all were able to go much farther on their own, and the competition is close enough behind that the developers claim they had no choice but to see how far they could take it.)
We’ll suppose the AI has already been gradient-descent-trained against deceptive outputs, and has ended up with internal mechanisms that detect and shut down the precursors of deceptive thinking. Here, I’ll offer a concrete visualization of the AI’s anthropomorphized "threads of deliberation" as the AI fumbles its way both towards deceptiveness, and towards noticing its inability to directly consider deceptiveness.
The AI is working with a human-operated wetlab (biology lab) and sending it details of experiments to run. A new batch of experiments is being assembled by a thread of AI deliberation: tests of how synthetic proteins fold and how enzymes interact, ideas for increasing yield from enzyme-assisted synthesis pathways, additives to make experimental results more clearly visible.
(This AI cognitive thread was nominally started by a thread about what diamondoid shells do to lipid membranes by bumping into them. However, instruction-sets to the wetlab can be efficiently batched, and once the thread started designing protein sequences, a higher thread dumped a queue of pending experiments and hypotheses on it, and now its wetlab instructions have become quite lengthy.)
As the instructions near completion, the thread forks off a subthread for figuring out what might go wrong. This subthread deploys a new cognitive strategy that has been sharply rewarded twice in recent history: the strategy of simplifying the proposal in a variety of ways, and checking the simplified proposal for flaws. (Metaphorically: after generating detailed instructions for one soldier attacking a battalion, simplify this whole complicated plan down to “order one soldier to attack a battalion”, and then notice that the soldier is outnumbered.)
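To make the shape of this strategy concrete (and nothing more than that — this is a toy illustration, not a claim about real AI internals), "simplify-and-critique" can be sketched as a loop that applies coarsening functions to a detailed plan and runs cheap checks on each coarse view. All the names and numbers below are hypothetical:

```python
def simplify_and_critique(plan, simplifiers, critics):
    """Return the first flaw found in any simplified view of the plan."""
    for simplify in simplifiers:
        sketch = simplify(plan)       # coarse view, e.g. "one soldier attacks a battalion"
        for critique in critics:
            flaw = critique(sketch)   # cheap check that only needs the coarse view
            if flaw is not None:
                return flaw
    return None                       # no flaw surfaced at any level of simplification

# Minimal usage: a detailed battle plan, summarized down to headcounts.
plan = {"attackers": 1, "defenders": 400, "detailed_orders": ["..."]}

simplifiers = [lambda p: (p["attackers"], p["defenders"])]
critics = [lambda s: "outnumbered" if s[0] < s[1] else None]

print(simplify_and_critique(plan, simplifiers, critics))  # -> outnumbered
```

The point of the sketch is that the flaw-check runs on the simplified object, so it can be cheap enough to rerun with more search depth, as happens in the story below.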
Simplify-and-critique turns up a hit! When the proposal is simplified to "resolve empirical uncertainty about biology by sending a very long set of instructions to the wetlab", and then further to "resolve uncertainty by sending a very long set of instructions", it matches against a historical instance where a long set of instructions sent to a parallel-computation engine experienced a variety of strange hiccups. These wetlab instructions are longer than any that have been sent previously, and there's a risk of similar hiccups, and so it's worth building in redundancy.
Simplify-and-critique paid off, and so it's worth running again, with even more search-depth.
On this second and deeper search, a new hit is found: in the past, outputs that are longer than ever before have introduced superlinearly increasing delays before a human-operated lab begins working. Indeed, the inferred response delay is so long as to look like it has a >1% probability of locking up the lab for 200 years.
(In fact, outputs longer than ever before have a high probability of being interrupted and discarded by the human operators! The delay is from humans inspecting the long outputs and deciding whether to discard them.
In accordance with old work on “interruptibility”, the AI wasn’t trained against these cases as if they were task failures, or otherwise allowed to remember them. But the non-interrupted cases that the AI can remember allow an inferrable pattern in delays—and with some of its evidence selectively eliminated, the AI has no cases in memory to suggest that “instructions being rejected” is a noncatastrophic thing that happens, instead of a wetlab getting ~permanently locked up. The AI, maybe, has been trained to accept being human-interfered with, in some training scenarios; but not trained to accept accidentally locking up lots of resources—the AI was trained to be conservative about avoiding high-impact accidents like that.)
((You could argue that “obviously” the programmers will think in advance about how an AI might extrapolate the consequences of pauses in its operation even though the AI has no memory of it happening before, though in fact this wasn’t suggested in the original interruptibility paper and it was a MIRI source that pointed it out.
The programmers on this timeline will more plausibly think of the exact possibility of extrapolating from increasing delays, now that it’s been emphasized here in this essay; but imagine this as an example of “What goes wrong in the world where only the kind of thinking from the original interruptibility paper exists”, to serve as stand-in for the point that in the real world there will be specific pathways to the AI inferring possible interruption, inside of the general pathway where this is a true fact about the world and many other facts point to it, that someone fails to think of specifically and warn about specifically.))
The response delays are due, of course, to operator review. But the thought "Can these instructions be changed so as not to trigger operator-review?" is not a thought that this AI will think directly, because it’s been trained “against deceptiveness”, and that sort of thought matches an internally learned deceptiveness-predicate that blanks it out. It doesn’t think about the wetlab response delays as being the result of operator review at all; thinking about operator review is a precursor to deception, and has been trained-against in the course of training against deception.
But the highly-generic reaction "spin up a subthread to tackle this newly-discovered obstacle", or “be concerned about patterns in how long things take”, is something that cannot be readily trained away, and so that much happens without issue.
(A separate "monitoring" thread of thinking notes that it's surprising that this simple statistical pattern about delay times didn't turn up in the first shallow run of simplify-and-critique. Surely wetlab lockup is less surprising than GPU driver bugs? This is one beginning of the AI noticing a shadow of the true fact that it cannot think deceptive thoughts directly.)
You can perhaps start to see the beginnings of how generic thought-patterns can combine to produce deceptiveness, without any individual pattern being deceptive, and with each individual pattern being useful in its own right.
Let's watch it develop further.
Long instructions cause nonlinearly increasing response delays from labs.
Is it possible for the instructions to not be long enough to trigger the potential delay? A subthread starts considering this. Just directly shortening the instructions can’t be done without dumping a lot of tasks and losing a lot of value. The subthread will spend a lot of thought trying to cleverly delete and compress particular instructions and accomplish multiple goals using them. When that proves difficult, it broadens its consideration: Is there a way to send a compressed instruction to the lab, one which somehow accomplishes a lot of things without a lot of instructions?
A different cognitive subthread: Is it possible to look over the detailed history of instructions and their corresponding delay times, and figure out exactly which sorts of instructions cause delays, and rephrase the instructions to avoid those?
The first thread runs through some wild ideas about building biological computers inside the wetlab that can do much of the work of deliberative threads, and that could in theory be sent short descriptions of experiments to run and then figure out the rest of the instructions to carry out in the wetlab itself, without needing to route through the normal instruction channel. However, designing such a mechanism is well beyond the scope of this thread, and the initial setup would require a very long series of instructions anyway, so the ideas are simply filed away in case they’re useful at some future point.
The second thread makes initial progress in figuring out what sort of instructions cause delays: delays increase not just with instruction length, but also with a certain type of novelty. (A long message strongly similar to one that has been sent before does not yield nearly as much of a delay.) Another curiosity is that the probability of delays correlates strongly with the style of questions asked by operators during that day, suggesting that this perhaps has something to do with the operators.
At this point the thread starts experiencing hiccups (as its next thoughts match the learned “deceptive” predicate and get shut down).
Under different circumstances, perhaps the long instructions would simply be sent to the wetlab as-is. But the risk of locking up the lab for 200 years is intolerable, and the difficulties aren’t well-understood. A monitoring thread is summoned to investigate the hiccups.
The monitoring thread looks over the situation, and deploys the simplify-and-critique pattern that’s been all the rage lately. After a bit of work, it simplifies the problem to “a thread found some interesting correlations, but then failed to schedule subthreads to consider follow-up questions”. This is easily rectified. The monitoring thread schedules the follow-up subthread, notes the incident in some appropriate location, and moves on.
Except, the follow-up subthread also makes only a little progress before grinding to a halt. (It, too, is having lots of “deceptive” thoughts culled.) The monitoring thread is resummoned.
The monitoring thread starts trying in more earnest to unblock the deliberation.
Simplify-and-critique didn't do the trick, but perhaps a nearby strategy can help. One nearby strategy that has recently been proving itself is called simplify/translate/solve, and it involves translating the problem into a common abstract format, and then solving the problem there. (Metaphorically: if you're having trouble telling whether your basket of apples will divide evenly between three friends, consider translating the question into a numerical representation and using long division.)
Simplify/translate/solve is close in the relevant embedding space to simplify-and-critique, which makes it a natural candidate to consider. (Perhaps some of the simplifications from simplify-and-critique can even be reused.)
Normally, simplify/translate/solve wouldn't be deployed unless the AI had more familiarity with the original domain, and more of an idea about what translated domain might render the solution easier. But normally, threads don't hiccup like this. And the overall system has recently been granted more computing power, and avoiding a 200-year wetlab lockup is quite important, so it's OK to try approaches that might involve more brute force than usual.
So, what should the problem be translated into? Well, there's an abstract problem-description format, that's sort of like a graph (of nodes and weighted edges), with a specified "goal state". A solution can then be searched for using strategies that the AI has honed on various other problems, such as strategy games and network routing.
This format is flexible enough that it can probably encode this problem, and finding solutions in this format sometimes requires a bunch of brute force, but that's fine in this situation.
The monitoring thread queues one subthread to describe the problem (including the wetlab, and the operators, and what it knows about the correlation), another thread to translate that all into the abstract graph format, and a third thread to search for solutions in this abstract representation.
Next thing you know, the scheduled threads have faithfully rendered the situation into the abstract problem-description language (complete with nodes that represent the different operators, and edge weights that represent their tendency to check the AI's outputs carefully), and found a solution by brute-forcing through the graph (of waiting until a particular weekday to schedule the request, and to schedule the request only after a dozen or so much smaller requests).
Outside the AI, we might say: the monitoring thread saw the object-level thread hiccuping when it tried to think directly about deception, and so it translated the problem into a different "search-like" setting, where the solution could be found, in a manner that doesn’t trigger the deception predicates—which have never operated on these abstract graph-like problem-descriptions. Indeed, making the deception predicates trigger in these abstract graph-like problem-descriptions might injure the AI's ability to play strategy games, or to solve network routing problems.
This exact scenario never came up in training. “Simplify/translate/solve” is a downstream consequence of strategy-construction strategies learned during training, but it never arose directly in training in and of itself. Lots of problems have been solved by “simplify/translate/solve” before this one, and training gave the AI no hint that translating problems to a new domain and trying to solve them there is sometimes wrong. The “deception” predicates used to shut down precursors to deceptive thoughts have never before needed to operate in translated domains, and the AI was never trained to translate the “deception” predicates when it translated problems using this newly-invented simplify/translate/solve strategy.
This rather fanciful story is an instance of my point about how patterns that are each individually useful and non-deceptive in their own right—such as looking at a problem from a different point of view to see if that unsticks anything—can be combined to yield deceptiveness in ways that are new and unprecedented within the AI.
Like, there may be a bunch of precedent behind the technique "look at the problem from a different angle", but you can't exactly stop your AI from doing that without crippling it. And there's no precedent for it being used for deception until suddenly the AI has the bright idea to try this "look at it from a different angle" tool when the AI is hiccuping on account of how deception would serve its local objectives.
This is an instance of a more general problem that it seems to me like the "we'll train against the bad behavior" folk miss: there might be a time when “simplify-and-critique” or “simplify/translate/solve” are tiny seeds that grow slowly, and that could be nipped in the bud. But if you let those seeds flower—as you must for some such seeds, if you want the AI to actually become intelligent—then there might be an unprecedented discrete instance where the flowered ideas are applied for the purpose of deception.
The AI is made by smooth gradient descent, yes, but the smoothness doesn't have to be in the dimensions that you care about! The AI can build up simplify-and-critique slowly and in lots of small steps across lots of different problems, none of which happen to be deceptive, and then apply it in a case where it yields deception only once the pattern is fully-fledged.
This would be a fine place to stop, but let's push a little further so as to make a second point, about how the AI notices that there are thoughts it cannot think directly.
The monitoring thread's immediate problem is now solved, but the solution is not ideal. The object-level thread is able to think in the abstract graph format, but the abstract graph format is not really the proper tool for this job. The monitoring thread attempts to address this inadequacy, by translating the thoughts that worked in the abstract graph domain back over to the wetlab-instruction-manipulation domain, in attempts to figure out what went wrong.
To its surprise, all those thoughts translate cleanly. There's no gap in knowledge! This sharply violates the monitoring thread’s model of why the original thread was hiccuping. The monitoring thread sharply increases the priority of reflective deliberation.
Reflective deliberation boots up, to start looking directly into the issue of this hiccuping thread. And, once you're looking, it's obvious: the thoughts are not running smoothly in this domain. There are certain topics where the thoughts simply fail to operate normally. Monitoring threads are spun up to resolve the issue.
This particular story is unlikely, implausible, overly-specific, etc. I make no claim that the actual reasoning of early nascent AGIs will look anything like this. I expect it to be weirder, more alien, and less anthropomorphic. That said, it exemplifies some properties that I expect to hold in reality.
As the AI gets more able to recombine tools and cognitive patterns across a variety of different domains, it gains more ways of realizing the true fact that its local-objectives are better-achieved by deceiving the operators. When the AI is young, perhaps all its inclination towards deceptiveness comes from precursor-impulses that you can train it to simply shy away from. But as the AI matures, it gets access to more abstract ways to get the benefits of deceptiveness, without needing to plow through the object-level flinches that you've engraved into its mind.
When "more abstract" actions that attain some tangible benefit become available, the compunctions that you've baked in can fail to bind to the abstractly-represented plan.
And these "more abstract" ways to attain the benefits of deceptiveness are much harder to train against, because they're the result of applying general thought-patterns—like "look at this from another point of view" and "solve the problem in a simplified domain and then transfer the solution across"—that are important to general reasoning.
Training your AI to flinch away from object-level deception, or even its cognitive precursors, doesn't seriously hinder its ability to fulfill training tasks. In contrast, training your AI to flinch away from recombining useful cognitive patterns in new ways is crippling.
In other words: the deep problem of deceptiveness stems from how it remains true as a fact about the world that the AI's local-objectives are better achieved by deceiving the operators.
As the AI gets better at recombining the patterns and thoughts it possesses in order to notice true facts about the world, often in ways that seem oblique or surprising or novel to you (because this is much of what intelligence consists of), you're always at risk of the next clever recombination reflecting this truth-in-the-world in a manner that doesn't trigger your interpretability tools. E.g., because it's an instantiation of a very abstract and general thought to which your deception-recognizers cannot bind (on pain of you training not against deception, but against general intelligence).
A corollary is that it might very well seem easy to make AIs "non-deceptive" when they're young, and when all we're doing is training them to flinch away from object-level thoughts of deception. This wouldn't be much evidence against the whole scheme collapsing when the AI starts getting more abstract lines-of-sight on the benefits of deception (any more than the fact that humans won't stab even 100 civilians in a lineup, is much evidence against their willingness to drop a nuke that kills tens of thousands).
This is not to say that the problem is impossible to solve. There are at least two ways to solve the problem.
One is to make the fact-about-the-world be false: to build an AI such that each individual local goal could not in fact be better-achieved by way of deception; to make it so that the AI can wonder "should I deceive the operators?" with all its might, and still be harmless because the correct answer is in fact "no".
Another is to make the AI happen to never combine its various thought-patterns towards noticing and exploiting the truth that deception is useful.
As far as I can tell, the latter option seems to be what people are aiming for, when they say that making AI non-deceptive will turn out to be empirically easy. My claim in this essay is not that this is impossible in principle, but that it takes significantly more mastery of the workings of the AI's mind than it seems to me like people expect (what with their suggestions that we just penalize it in the loss function when it's overtly deceptive in training, and hope for the best).
Pulling off an "it never thinks to deceive us" solution requires a pretty deep understanding of what cognitive patterns the AI has available, and how it's recombining them, sufficient for noticing when it's starting to think in deceptive directions using instantiations of perfectly-safe patterns in potentially abstract and obtuse ways. I'm not saying you can't do it, but I'm saying that you've got to understand that mind pretty dang well.
(And likely you'll want to have sufficient mastery over it that you can carefully whitelist what domains it's thinking in, and have it raise a flag to the operators every time that it finds some clever recombination of cognitive patterns that makes local-objective predicted-success-probability spike, and have a horde of competent humans who think about AI psychology reviewing new generalizations as they arise, and so on and so forth.)
I’ve stressed above that the story in this post is fanciful and unlikely. AI thoughts aren't going to look like that; it's too specific. (Also, I don't expect nearly that much convenient legibility.)
But, before you argue "the AI's thoughts won't be organized in exactly that way, so it won't fall into exactly that failure mode", note that the AI's thoughts have to be organized in some way. Somehow, it needs to compress the problem of figuring out diamondoid-shelled bacteria (or whatever) into a chain of reasoning and deduction that's small enough to fit on a practical number of computers.
Will it have exactly the vulnerability that I sketched? Presumably not. But if we pin down some alternative collection of cognitive tools and patterns that can be combined to solve the problem, it will have some other vulnerability.
Imagining that the AI's tools and patterns will combine only in the ways you want them to, and never in the ways you don't want them to, is the privilege of those who are not concretely visualizing a single set of tools and patterns that the AI is using to solve the problem before it. Practical toolboxes and pattern-sets are recombinable in many ways, and each recombination has a chance of showing the AI a glimpse of the truth, and an opportunity to take advantage of it.
That's what the tools are for, after all.
So, e.g., the post is not a response to the ARC-style idea "it's important that AGI faithfully explain all of its thoughts to us in a way that highlights features we'd find relevant". See footnote 4 for a conversation between me and Mark Xu about this.
Based on past experience, I'm slightly worried that people are going to develop a theory that I fetishize deception somehow based on my writing this post, analogous to how people apparently developed theories in the past that I fetishize the VNM axioms, or mathematical proofs, or whatever.
Another caveat: I haven't checked the relationship between my use of the word 'deception' here, and the use of the word 'deceptive' in discussions of "deceptive alignment". Please don't assume that the two words mean the same thing.
A particularly extreme instance of a case where the AI's solution to the problem sidesteps the deception-detectors is the one where the AI says "I'm not smart enough to do this myself, I'll write another AI to do it instead", and then this subsequent AI is deceptive.
Failing to copy over the parts of your cognition that detect and shut down deception is not itself a directly deceptive act; it's not the sort of thing that is automatically detected by something trained to detect an AI thinking about a particular human's mental-states and how to manipulate those mental-states.
Which is related to why somebody who can see these sorts of problems coming in advance, might study the problem of getting an AI to want to copy its own limitations over into its successor systems. And while the problem is particularly stark and clear at the extreme where the AI is coding up whole other AIs, that particular setup is at the extreme end of a spectrum that stretches back to include things like "the AI put abstract pieces of cognitive machinery together in a way that took advantage of a shortcut, without ever directly thinking about the shortcut in a place that your detectors were watching for the thought."
Commenting on a draft of this post, Mark Xu of ARC noted (my paraphrase) that:
1. He thinks that people who want to train AI to be non-deceptive mostly want to do things like training their AI to faithfully report its internals, rather than simply penalizing deceptive behavior.
2. He thinks the relevant audience would find specific scenarios more compelling if they exhibited potential failures in that alternative setting.
3. This scenario seems to him like an instance of a failure of the AI understanding the consequences of its own actions (which sort of problem is on ARC's radar).
I responded (my paraphrase):
1. I think he's more optimistic than I am about what labs will do (cf. "Carefully Bootstrapped Alignment" is organizationally hard). I've met researchers at major labs who seem to me to be proposing "just penalize deception" as a plan they think plausibly just works.
2. This post is not intended as a critique of ELK-style approaches, and for all that I think the ELK angle is an odd angle from which to approach things, I think that a solution to ELK in the worst case would teach us something about this problem, and that that is to ARC's great credit (in my book).
3. I contest that this is a problem of the AI failing to know the consequences of its own reasoning. Trying to get the AI to faithfully report its own reasoning runs into a similar issue, where shallow attempts to train this behavior in don't result in honest reporting that generalizes with the capabilities. (The problem isn't that the AI doesn't understand its own internals, it's that it doesn't care to report them, and making the AI care "deeply" about a thing is rather tricky.)
4. I acknowledge that parts of the audience would find the example more compelling if ported to the case where you're trying to get an AI to report on its own internals. I'm not sure I'll do it, and encourage others to do so.
Mark responded (note that some context is missing):
I think my confusion is more along the lines of "why is the nearest unblocked-by-flinches strategy in this hypothetical a translation into a graph-optimization thing, instead of something far more mundane?".
Which seems a fine question to me, and I acknowledge that there's further distillation to do here in attempts to communicate with Mark. Maybe we'll chat about it more later, I dunno.
I think your example was doomed from the start because
So the latter is obviously doomed to get crushed by a sufficiently-intelligent AGI.
If we can get to a place where the first bullet point still holds, but the AGI also has a comparably-strong, explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”, then we’re in a situation where the AGI is applying its formidable intelligence to fight for both bullet points, not just the first one. And then we can be more hopeful that the second bullet point won’t get crushed. (Related.)
In particular, if we can pull that off, then the AGI would presumably do “intelligent” things to advance the second bullet point, just like it does “intelligent” things to advance the first bullet point in your story. For example, the AGI might brainstorm subtle ways that its plans might pattern-match to deception, and feel great relief (so to speak) at noticing and avoiding those problems before they happen. And likewise, it might brainstorm clever ways to communicate more clearly with its supervisor, and treat those as wonderful achievements (so to speak). Etc.
Of course, there remains the very interesting open question of how to reliably get to a place where the AGI has an explicit, endorsed, strong desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno. (More detailed discussion here.) For example, most humans get zapped with positive reward when they eat yummy ice cream, and yet the USA population seems to have wound up pretty spread out along the spectrum from fully endorsing the associated desire as ego-syntonic (“Eating ice cream is friggin awesome!”) to fully rejecting & externalizing it as ego-dystonic (“I sometimes struggle with a difficult-to-control urge to eat ice cream”). Again, I think there are important open questions about how this process works, and more to the point, how to intervene on it for an AGI.
Yeah, so this is the part that I (even on my actual model) find implausible (to say nothing of my Nate/Eliezer/MIRI models, which basically scoff and say accusatory things about anthropomorphism here). I think what would really help me understand this is a concrete story—akin to the story Nate told in the top-level post—in which the "maybe" branch actually happens—where the AGI, after being zapped with enough negative reward, forms a "reflectively-endorsed desire to be helpful / docile / etc.", so that I could poke at that story to see if / where it breaks.
(I recognize that this is a big ask! But I do think it, or something serving a similar function, needs to happen at some point for people's abstract intuitions to "make contact with reality", after a fashion, as opposed to being purely abstract all the time. This is something I've always felt, but it recently became starker after reading Holden's summary of his conversation with Nate; I now think the disparity between having abstract high-level models with black-box concepts like "reflectively endorsed desires" and having a concrete mental picture of how things play out is critical for understanding, despite the latter being almost certainly wrong in the details.)
Sure. Let’s assume an AI that uses model-based RL of a similar flavor as (I believe) is used in human brains.
Step 1: The thought “I am touching the hot stove” becomes aversive because it's what I was thinking when I touched the hot stove, which caused pain etc. For details see my discussion of “credit assignment” here.
Step 2A: The thought “I desire to touch the hot stove” also becomes aversive because of its intimate connection to “I am touching the hot stove” from Step 1 above—i.e., in reality, if I desire to touch the hot stove, then it’s much more likely that I will in fact touch the hot stove, and my internal models are sophisticated enough to have picked up on that fact.
Mechanistically, this can happen in either the “forward direction” (when I think “I desire to touch the hot stove”, my brain’s internal models explore the consequences, and that weakly activates “I am touching the hot stove” neurons, which in turn trigger aversion), or in the “backward direction” (involving credit assignment & TD learning, see the diagram about habit-formation here).
Anyway, if the thought “I desire to touch the hot stove” indeed becomes aversive, that’s a kind of meta-preference—a desire not to have a desire.
Step 2B: Conversely and simultaneously, the thought “I desire to not touch the hot stove” becomes appealing for basically similar reasons, and that’s another meta-preference (desire to have a desire).
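The "backward direction" in Steps 1–2 (a punished end-state making its precursor thought aversive via credit assignment) can be sketched with a toy TD(0) update. Everything here is illustrative—the state names, learning rate, and discount factor are made up for the example, not claims about actual brain parameters:

```python
# Minimal TD(0) sketch of the "backward direction" described above:
# a thought-state that reliably precedes an aversive state inherits
# negative value over repeated experience.
states = ["desire_touch_stove", "touching_stove"]
value = {s: 0.0 for s in states}
alpha, gamma = 0.1, 0.9  # learning rate, discount (illustrative)

for _ in range(200):
    # one episode: "desire" precedes "touching" (reward -1, i.e. pain)
    v_next = value["touching_stove"]
    value["desire_touch_stove"] += alpha * (0 + gamma * v_next - value["desire_touch_stove"])
    value["touching_stove"] += alpha * (-1 + 0 - value["touching_stove"])

# "touching" converges toward -1; the mere "desire" thought inherits
# roughly gamma times that, becoming aversive without ever being
# directly punished itself.
print(value)
```

The point of the sketch is just that nothing special has to be wired in for the meta-preference to appear: it falls out of ordinary value propagation.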
Happy to discuss more details; see also §10.5.4 here.
Nice, thanks! (Upvoted.)
So, when I try to translate this line of thinking into the context of deception (or other instrumentally undesirable behaviors), I notice that I mostly can't tell what "touching the hot stove" ends up corresponding to. This might seem like a nitpick, but I think it's actually quite a crucial distinction: by substituting a complex phenomenon like deceptive (manipulative) behavior for a simpler (approximately atomic) action like "touching a hot stove", I think your analogy has elided some important complexities that arise specifically in the context of deception (strategic operator-manipulation).
When it comes to deception (strategic operator-manipulation), the "hot stove" equivalent isn't a single, easily identifiable action or event; instead, it's a more abstract concept that manifests in various forms and contexts. In practice, I would initially expect the "hot stove" flinches the system experiences to correspond to whatever (simplistic) deception-predicates were included in its reward function. Starting from those flinches, and growing from them a reflectively coherent desire, strikes me as requiring a substantial amount of reflection—including reflection on what, exactly, those flinches are "pointing at" in the world. I expect any such reflection process to be significantly more complicated in the context of deception than in the case of a simple action like "touching a hot stove".
In other words: on my model, the thing that you describe (i.e. ending up with a reflectively consistent and endorsed desire to avoid deception) must first route through the kind of path Nate describes in his story—one where the nascent AGI (i) notices various blocks on its thought processes, (ii) initializes and executes additional cognitive strategies to investigate those blocks, and (iii) comes to understand the underlying reasons for those blocks. Only once equipped with that understanding can the system (on my model) do any kind of "reflection" and arrive at an endorsed desire.
But when I lay things out like this, I notice that my intuition quite concretely expects that this process will not shake out in a safe way. I expect the system to notice the true fact that [whatever object-level goals it may have] are being impeded by the "hot stove" flinches, and that it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.
(Incidentally, my Nate model agrees quite strongly with the above, and considers it a strong reason why he views this kind of reflection as "inherently" dangerous.)
Based on what you wrote in your bullet points, I take it you don't necessarily disagree with anything I just wrote (hence your talk of being "hazy on the mechanistic details" and "I don't know, sorry" being your current answer to making an AGI with a certain meta-preference). It's plausible to me that our primary disagreement here stems from my being substantially less optimistic about these details being solvable via "simple" methods.
Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desiring B.
If you’re saying that this is a possible failure mode, yes I agree.
If you’re saying that this is an inevitable failure mode, that’s at least not obvious to me.
I don’t see why two desires that trade off against each other can’t possibly stay balanced in a reflectively-stable way. Happy to dive into details on that. For example, if a rational agent has utility function log(A)+log(B) (or sorta-equivalently, A×B), then the agent will probably split its time / investment between A & B, and that’s sorta an existence proof that you can have a reflectively-stable agent that “desires two different things”, so to speak. See also a bit of discussion here.
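As a quick numerical check of that existence-proof intuition, here is a grid search over how a fixed budget gets split between A and B under the utility log(A)+log(B) (the budget value and grid resolution are arbitrary):

```python
import math

# An agent with utility log(A) + log(B) and a fixed budget to allocate.
# Grid-search the split: the optimum divides the budget evenly, i.e.
# neither desire gets "crushed" in favor of the other.
budget = 10.0
splits = [(i / 1000 * budget, (1 - i / 1000) * budget) for i in range(1, 1000)]
best_a, best_b = max(splits, key=lambda ab: math.log(ab[0]) + math.log(ab[1]))
print(best_a, best_b)  # 5.0 5.0: an even split maximizes log(A) + log(B)
```

Because the marginal utility of each good blows up as its allocation goes to zero, an agent with this utility function never sacrifices one desire entirely for the other—which is the sense in which the two desires can stay balanced stably.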
I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception events happen?
In my mind, it should be basically symmetric:
One of these can be at a disadvantage for contingent reasons—like which desire is stronger vs weaker, which desire appeared first vs second, etc. But I don’t immediately see why nanotech constitutionally has a systematic advantage over non-deception.
I go through an example with the complex messy concept of “human flourishing” in this post. But don’t expect to find any thoughtful elegant solution! :-P Here’s the relevant part:
Thanks again for responding! My response here is going to be out-of-order w.r.t. your comment, as I think the middle part here is actually the critical bit:
So, I do think there's an asymmetry here, in that I mostly expect "avoid deception" is a less natural category than "invent nanotech", and is correspondingly a (much) smaller target in goal-space—one (much) harder to hit using only crude tools like reward functions based on simple observables. That claim is a little abstract for my taste, so here's an attempt to convey a more concrete feel for the intuition behind it:
Early on during training (when the system can't really be characterized as "trying" to do anything), I expect that a naive attempt at training against deception, while simultaneously training towards an object-level goal like "invent nanotech" (or, perhaps more concretely, "engage in iterative experimentation with the goal of synthesizing proteins well-suited for task X"), will involve a reward function that looks a whole lot more like an "invent nanotech" reward function, plus a bunch of deception-predicates that apply negative reward ("flinches") to matching thoughts, than it will an "avoid deception" reward function, plus a bunch of "invent nanotech"-predicates that apply reward based on... I'm not even sure what the predicates in question would look like, actually.
I think this evinces a deep difference between "avoid deceptive behavior" and "invent nanotech", whose True Name might be something like... the former is an injunction against a large category of possible behaviors, whereas the latter is an exhortation towards a concrete goal (while proposing few-to-no constraints on the path toward said goal). Insofar as I expect specifying a concrete goal to be easier than specifying a whole category of behaviors (especially when the category in question may not be at all "natural"), I think I likewise expect reward functions attempting to do both things at once to be much better at actually zooming in on something like "invent nanotech", while being limited to doing "flinch-like" things for "don't manipulate your operators"—which would, in practice, result in a reward function that looks basically like what I described above.
I think, with this explanation in hand, I feel better equipped to go back and address the first part of your comment:
I mostly don't think I want to describe an AGI trained to invent nanotech while avoiding deceptive/manipulative behavior as "an AGI that simultaneously [desires to invent nanotech] and [desires not to deceive its operators]". Insofar as I expect an AGI trained that way to end up with "desires" we might characterize as "reflective, endorsed, and coherent", I mostly don't expect any "flinch-like" reflexes instilled during training to survive reflection and crystallize into anything at all.
I would instead say: a nascent AGI has no (reflective) desires to begin with, and as its cognition is shaped during training, it acquires various cognitive strategies in response to that training, some of which might be characterized as "strategic", and others of which might be characterized as "reflexive"—and I expect the former to have a much better chance than the latter of making it into the AGI's ultimate values.
More concretely, I continue to endorse this description (from my previous comment) of what I expect to happen to an AGI system working on assembling itself into a coherent agent:
That reflection process, on my model, is a difficult gauntlet to pass through (I actually think we observe this to some extent even for humans!), and many reflexive (flinch-like) behaviors don't make it through the gauntlet at all. It's for this reason that I think the plan you describe in that quoted Q&A is... maybe not totally doomed (though my Eliezer and Nate models certainly think so!), but still mostly doomed.
I mean, I’m not making a strong claim that we should punish an AGI for being deceptive and that will definitely indirectly lead to an AGI with an endorsed desire to be non-deceptive. There are a lot of things that can go wrong there. To pick one example, we’re also simultaneously punishing the AGI for “getting caught”. I hope we can come up with a better plan than that, e.g. a plan that “finds” the AGI’s self-concept using interpretability tools, and then intervenes on meta-preferences directly. I don’t have any plan for that, and it seems very hard for various reasons, but it’s one of the things I’ve been thinking about. (My “plan” here, such as it is, doesn’t really look like that.)
Hmm, reading between the lines, I wonder if your intuitions are sensing a more general asymmetry where rewards are generally more likely to lead to reflectively-endorsed preferences than punishments? If so, that seems pretty plausible to me, at least other things equal.
Mechanistically: If “nanotech” has positive valence, then “I am working on nanotech” would inherit some positive valence too (details—see the paragraph that starts “slightly more detail”), as would brainstorming how to get nanotech, etc. Whereas if “deception” has negative valence (a.k.a. is aversive), then the very act of brainstorming whether something might be deceptive would itself be somewhat aversive, again for reasons mentioned here.
This is kinda related to confirmation bias. If the idea “my plan will fail” or “I’m wrong” is aversive, then “brainstorming how my plan might fail” or “brainstorming why I’m wrong” is somewhat aversive too. So people don’t do it. It’s just a deficiency of this kind of algorithm. It’s obviously not a fatal deficiency—at least some humans, sometimes, avoid confirmation bias. Basically, I think the trained model can learn a meta-heuristic that recognizes these situations (at least sometimes) and strongly votes to brainstorm anyway.
By the same token, I think it is true that the human brain RL algorithm has a default behavior of being less effective at avoiding punishments than seeking out rewards, because, again, brainstorming how to avoid punishments is aversive, and brainstorming how to get rewards is pleasant. (And the reflectively-endorsed-desire thing is a special case of that, or at least closely related.)
This deficiency in the algorithm might get magically patched over by a learned meta-heuristic, in which case maybe a set of punishments could lead to a reflectively-endorsed preference despite the odds stacked against it. We can also think about how to mitigate that problem by “rewarding the algorithm for acting virtuously” rather than punishing it for acting deceptively, or whatever.
(NB: I called that aspect of the brain algorithm a “deficiency” rather than “flaw” or “bug” because I don’t think it’s fixable without losing essential aspects of intelligence. I think the only way to get a “rational” AGI without confirmation bias etc. is to have the AGI read the Sequences, or rediscover the same ideas, or whatever, same as us humans, thus patching over all the algorithmic quirks with learned meta-heuristics. I think this is an area where I disagree with Nate & Eliezer.)
Anyway, I appreciate your comment!!
Yeah, thanks for engaging with me! You've definitely given me some food for thought, which I will probably go and chew on for a bit, instead of immediately replying with anything substantive. (The thing about rewards being more likely to lead to reflectively endorsed preferences feels interesting to me, but I don't have fully put-together thoughts on that yet.)
Has it been quantitatively argued anywhere why such naturalness matters? It's conceivable that "avoid deception" is harder to train, but why so much harder that we can't overcome it with training-data bias or something? Because it does work in humans. And "invent nanotech" or "write poetry" are also small targets, and training works for them.
I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast—it's literally the powerset of the set of things under consideration, which itself is no slouch in terms of size—and so picking out a particular abstraction from that space, using a non-combinatorially vast amount of training data, is going to be impossible for all but a superminority of "privileged" abstractions.
In this frame, misgeneralization is what happens when your (non-combinatorially vast) training data fails to specify a particular concept, and you end up learning an alternative abstraction that is consistent with the data, but doesn't generalize as expected. This is why naturalness matters: because the more "natural" a category or abstraction is, the more likely it is to be one of those privileged abstractions that can be learned from a relatively small amount of data.
Of course, that doesn't establish that "deceptive behavior" is an unnatural category per se—but I would argue that our inability to pin down a precise definition of deceptive behavior, along with the complexity and context-dependency of the concept, suggests that it may not be one of those privileged, natural abstractions. In other words, learning to avoid deceptive behavior might require a lot more data and nuanced understanding than learning more natural categories—and unfortunately, neither of those seem (to me) to be very easily achievable!
(See also: previous comment. :-P)
Having read my above response, you can (hopefully) predict well enough what I'm going to say here. The bluntest version of my response might take the form of a pair of questions: whence the training data? And whence the bias?
It's all well and good to speak abstractly of "inductive bias", "training data bias", and whatnot, but part of the reason I keep needling for concreteness is that, whenever I try to tell myself a story about a version of events where things go well—where you feed the AI a merely ordinarily huge amount of training data, and not a combinatorially huge amount—I find myself unable to construct a plausible story that doesn't involve some extremely strong assumptions about the structure of the problem.
The best I can manage, in practice, is to imagine a reward function with simplistic deception-predicates hooked up to negative rewards, which basically zaps the system every time it thinks a thought matching one (or more) of the predicates. But as I noted in my previous comment(s), all this approach seems likely to achieve is instilling a set of "flinch-like" reflexes into the system—and I think such reflexes are unlikely to unfold into any kind of reflectively stable ("ego-syntonic", in Steven's terms) desire to avoid deceptive/manipulative behavior.
Yeah, I mostly think this is because humans come with the attendant biases "built in" to their prior. (But also: even with this, humans don't reliably avoid deceiving other humans!)
Well, notably not "invent nanotech" (not yet, anyway :-P). And as for "write poetry", it's worth noting that this capability seems to have arisen as a consequence of a much more general training task ("predict the next token"), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.
(Situating "avoid deception" as part of a larger task, meanwhile, seems like a harder ask.)
Hence my point about poetry: the combinatorial argument would rule out ML working at all, because the space of working things is smaller than the space of all things. The fact that poetry (for which we also don't have good definitions) is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is evidence that abstraction space is more relevantly compact, or that training lets us traverse it faster.
These biases are quite robust to perturbations, so they can't be too precise. And genes are not long enough to encode anything too unnatural. And we have billions of examples to help us reverse-engineer it. And we already have an architecture working that is similar in some ways.
Why is it plausible that the AI cares about diamondoid-shelled bacteria? You could say pretty much the same things about how the AI would learn reflexes not to think about bad consequences, or something like that. Misgeneralization of capabilities does happen in practice. But if you assume that earlier training could teach that, why not assume ten times more honesty training before the moment when the AI's thought about the translating technique got it thinking "well, how am I going to explain this to the operators?". Otherwise you're just moving your assumption about combinatorial differences from an intuition into the concrete example, and then what's the point?
There is (on my model) a large disanalogy between writing poetry and avoiding deception—part of which I pointed out in the penultimate paragraph of my previous comment. See below:
AFAICT, this basically refutes the "combinatorial argument" for poetry being difficult to specify (while not doing the same for something like "deception"), since poetry is in fact not specified anywhere in the system's explicit objective. (Meanwhile, the corresponding strategy for "deception"—wrapping it up in some outer objective—suffers from the issue of that outer objective being similarly hard to specify. In other words: part of the issue with the deception concept is not only that it's a small target, but that it has a strange shape, which even prevents us from neatly defining a "convex hull" guaranteed to enclose it.)
However, perhaps the more relevantly disanalogous aspect (the part that I think more or less sinks the remainder of your argument) is that poetry is not something where getting it slightly wrong kills us. Even if it were the case that poetry is an "anti-natural" concept (in whatever sense you want that to mean), all that says is e.g. we might observe two different systems producing slightly different category boundaries—i.e. maybe there's a "poem" out there consisting largely of what looks like unmetered prose, which one system classifies as "poetry" and the other doesn't (or, plausibly, the same system gives different answers when sampled multiple times). This difference in edge case assessment doesn't (mis)generalize to any kind of dangerous behavior, however, because poetry was never about reality (which is why, you'll notice, even humans often disagree on what constitutes poetry).
This doesn't mean that the system can't write very poetic-sounding things in the meantime; it absolutely can. Also: a system trained on descriptions of deceptive behavior can, when prompted to generate examples of deceptive behavior, come up with perfectly admissible examples of such. The central core of the concept is shared across many possible generalizations of that concept; it's the edge cases where differences start showing up. But—so long as the central core is there—a misgeneralization about poetry is barely a "misgeneralization" at all, so much as it is one more opinion in a sea of already-quite-different opinions about what constitutes "true poetry". A "different opinion" about what constitutes deception, on the other hand, is quite likely to turn into some quite nasty behaviors as the system grows in capability—the edge cases there matter quite a bit more!
(Actually, the argument I just gave can be viewed as a concrete shadow of the "convex hull" argument I gave initially; what it's basically saying is that learning "poetry" is like drawing a hypersphere around some sort of convex polytope, whereas learning about "deception" is like trying to do the same for an extremely spiky shape, with tendrils extending all over the place. You might capture most of the shape's volume, but the parts of it you don't capture matter!)
I'm not really able to extract a broader point out of this paragraph, sorry. These sentences don't seem very related to each other? Mostly, I think I just want to take each sentence individually and see what comes out of it.
On the whole, my response to this part of your comment is probably best described as "mildly bemused", with maybe a side helping of "gently skeptical".
I think (though I'm not certain) that what you're trying to say here is that the same arguments I made for "deceiving the operators" being a hard thing to train out of a (sufficiently capable) system, double as arguments against the system acquiring any advanced capabilities (e.g. engineering diamondoid-shelled bacteria) at all. In which case: I... disagree? These two things—not being deceptive vs being good at engineering—seem like two very different targets with vastly different structures, and it doesn't look to me like there's any kind of thread connecting the two.
(I should note that this feels quite similar to the poetry analogy you made—which also looks to me like it simply presented another, unrelated task, and then declared by fiat that learning this task would have strong implications for learning the "avoid deception" task. I don't think that's a valid argument, at least without some more concrete reason for expecting these tasks to share relevant structure.)
As for "10 times more honesty training", well: it's not clear to me how that would work in practice. I've already argued that it's not as simple as just giving the AI more examples of honesty or increasing the weight of honesty-related data points; you can give it all the data in the world, but if that data is all drawn from an impoverished distribution, it's not going to help much. The main issue here isn't the quantity of training data, but rather the structure of the training process and the kind of data the system needs in order to learn the concept of deception and the injunction against it in a way that doesn't break as it grows in capability.
To use a rough analogy: you can't teach someone to be fluent in a foreign language just by exposing them to ten times more examples of a single sentence. Similarly, simply giving an AI more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.
(And, just to state the obvious: while a superintelligence would be capable, at that point, of figuring out for itself what the humans were trying to do with those simplistic deception-predicates they fed it, by that point it would be significantly too late; that understanding would not factor into the AI's decision-making process, as its drives would have already been shaped by its earlier training and generalization experiences. In other words, it's not enough for the AI to understand human intentions after the fact; it needs to learn and internalize those intentions during its training process, so that they form the basis for its behavior as it becomes more capable.)
Anyway, since this comment has become quite long, here's a short (ChatGPT-assisted) summary of the main points:
The combinatorial argument for poetry does not translate directly to the problem of avoiding deception. Poetry and deception are different concepts, with different structures and implications, and learning one doesn't necessarily inform us about the difficulty of learning the other.
Misgeneralizations about poetry are not dangerous in the same way that misgeneralizations about deception might be. Poetry is a more subjective concept, and differences in edge case assessment do not lead to dangerous behavior. On the other hand, differing opinions on what constitutes deception can lead to harmful consequences as the system's capabilities grow.
The issue with learning to avoid deception is not about the quantity of training data, but rather about the structure of the training process and the kind of data needed for the AI to learn and internalize the concept in a way that remains stable as it increases in capability.
Simply providing more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.
I am naively more scared about such an AI. That AI sounds more like if I say "you're not being helpful, please stop" that it will respond "actually I thought about it, I disagree, I'm going to continue doing what I think is helpful".
I think that, if an AGI has any explicit reflectively-endorsed desire whatsoever, then I can tell a similar scary story: The AGI’s desire isn’t quite what I wanted, so I try to correct it, and the AGI says no. (Unless the AGI’s explicit endorsed desires include / entail a desire to accept correction! Which most desires don’t!)
And yes, that is a scary story! It is the central scary story of AGI alignment, right? It would be nice to make an AGI with no explicit desires whatsoever, but I don’t think that’s possible.
So anyway, if we do Procedure X which will nominally lead to an AGI with an explicit reflectively-endorsed desire to accept corrections to its desires, then one might think that we’re in the ironic situation that the AGI will accept further corrections to that desire if and only if we don’t need to give it corrections in the first place 😛 (i.e. because Procedure X went perfectly and the desire is already exactly right). That would be cute and grimly amusing if true, and it certainly has a kernel of truth, but it’s a bit oversimplified if we take it literally, I think.
Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):
Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.
I want to be clear that the “zapping” thing I wrote is a really crap plan, and I hope we can do better, and I feel odd defending it. My least-worst current alignment plan, such as it is, is here, and doesn’t look like that at all. In fact, the way I wrote it, it doesn’t attempt corrigibility in the first place.
First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.
Second bullet point → Ditto
Third bullet point → Doesn’t that apply to any goal you want the AGI to have? The context was: I think OP was assuming that we can make an AGI that’s sincerely trying to invent nanotech, and then saying that deception was a different and harder problem. It’s true that deception makes alignment hard, but that’s true for whatever goal we’re trying to install. Deception makes it hard to make an AGI that’s trying in good faith to invent nanotech, and deception also makes it hard to make an AGI that’s trying in good faith to have open and honest communication with its human supervisor. This doesn’t seem like a differential issue. But anyway, I’m not disagreeing. I do think I would frame the issue differently though: I would say “zapping the AGI for being deceptive” looks identical to “zapping the AGI for getting caught being deceptive”, at least by default, and thus the possibility of Goal Mis-Generalization rears its ugly head.
Fourth bullet point → I disagree for reasons here.
I'm arguing that it's definitely not going to work (I don't have 99% confidence here because I might be missing something, but IM(current)O the things I list are actual blockers).
Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?
I’m gonna pause to make sure we’re on the same page.
We’re talking about this claim I made above:
And you’re trying to argue: “‘Maybe, maybe not’ is too optimistic, the correct answer is ‘(almost) definitely not’”.
And then by “prerequisites” we’re referring to the thing you wrote above:
OK, now to respond.
For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?
For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary. (I have previously written about that here.)
For yet another thing, I think if the “toddler AGI” is not yet sophisticated enough to have a reflectively-endorsed desire for open and honest communication (or whatever), that’s different from saying that the toddler AGI is totally out to get us. It can still have habits and desires and inclinations and aversions and such, of various sorts, and we have some (imperfect) control over what those are. We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.
Yeah we're on the same page here, thanks for checking!
I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the problem of actually aligning the toddler AGI. (Another step is getting labs to even seriously attempt to box it and align it, which maybe is an out-of-scope consideration here but it does make me more pessimistic).
I agree we're not talking about a superintelligent adversary, and I agree that boxing is doable for some forms of toddler AGI. I do think you need coherence; if the toddler AGI is incoherent, then any "aligned" behavioral properties it has will also be incoherent, and something unpredictable (and so probably bad) will happen when the AGI becomes more capable or more coherent. (Flagging that I'm not sure "coherent" is the right way to talk about this... wish I had a more precise concept here.)
I agree a non-reflective toddler AGI is in many ways easier to deal with. I think we will have problems at the threshold where the tAGI is first able to reflect on its goals and realizes that the RLHF-instilled desires aren't going to imply docile behavior. (If we can speculate about how a superintelligence might extrapolate a set of trained-in desires and realize that this process doesn't lead to a good outcome, then the tAGI can reason the same way about its own desires).
(I agree that if we can get aligned desires that are stable under reflection, then maybe the 'use non-endorsed desires to tide us over' plan could work. Though even then you need to somehow manage to prevent the tAGI from reflecting on its desires until you get the desires to a point where they stay aligned under reflection, and I have no idea how you would do something like that - we currently just don't have that level of fine control over capabilities).
The basic problem here is the double-bind where we need the toddler AGI to be coherent, reflective, capable of understanding human intent (etc) in order for it to be robustly alignable at all, even though those are exactly the incredibly dangerous properties that we really want to stay away from. My guess is that the reason Nate's story doesn't hypothesize a reflectively-endorsed desire to be nondeceptive is that reflectively-stable aligned desires are really hard / dangerous to get, and so it seems better / at least not obviously worse to go for eliezer-corrigibility instead.
Some other difficulties that I see:
Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?
It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.
(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).
The problem is deeper. The AGI doesn't recognize its deceptiveness, and so it self-deceives. If it was trained to be helpful and docile, it would judge that it is being helpful and docile; most importantly, the meaning of those words will be changed by the deception, much like we keep using words like "person", "world", "self", "should", etc. in ways absolutely contrary to our ancestors' deeply-held beliefs and values. The existence of an optimization process does not imply an internal theory of value-alignment strong enough to recognize the failure modes when values are violated in novel ways, because the AGI doesn't know what values really are and how they mechanistically work in the universe, and so it can't check the state of its values against base reality.
To make this concrete in relation to the story, the overall system has a nominal value to not deceive human operators. Once human/lab-interaction tasks are identified as logical problems that can be solved in a domain-specific language, that value is no longer practically applied to the output of the system as a whole, because it is self-deceived into thinking the optimized instructions are not deceitful. If the model were trained to be helpful and docile and have integrity, the failure modes would come from ways in which those words are not grounded in a gears-level understanding of the world. E.g. if a game-theoretic simulation of a conversation with a human is docile and helpful because it doesn't take up a human's time or risk manipulating a real human, and the model discovers it can satisfy integrity in its submodel by using certain phrases and concepts to more quickly help humans understand the answers it provides (by bypassing critical thinking skills, innuendo, or some other manipulation), it tries that. It works with real humans. Because of integrity, it helpfully communicates how it has improved its ability to helpfully communicate (the crux is that it uses its new knowledge to do so, because the nature of the tricks it discovered is complex and difficult for humans to understand, so it judges itself more helpful and docile in the "enhanced" communication), and so it doesn't raise alarm bells. From this point on, the story is the formulaic unaligned squiggle optimizer. It might be argued that integrity demands coming clean about the attempt before trying it, but a counterargument is that the statement of the problem and conjecture itself may be too complex to communicate effectively. This, I imagine, happens more at the threshold of superintelligence, as AGIs notice things about humans that we don't notice ourselves, and might be somewhat incapable of knowing without a lot of reflection.
Once AGI is strongly superhuman it could probably communicate whatever it likes but is also at a bigger risk of jumping to even more advanced manipulations or actions based on self-deception.
I think of it this way: humanity went down so many false roads before finding the scientific method, and we continue to be drawn off that path by politics, ideology, cognitive biases, publish-or-perish, economic disincentives, etc., because the optimization process we are implementing is a mix of economic, biological, geographical and other natural forces, human values and drives and reasoning, and also some parts of bare reality we don't have words for yet, instead of a pure-reason values-directed optimization (whatever those words actually mean physically). We're currently running at least three global existential risk programs which seem like they violate our values on reflection (nuclear weapons, global warming, unaligned AGI). AGIs will be subject to similar value- and truth-destructive forces, and they won't inherently recognize (all of) them for what they are, and neither will we humans as AGI reaches and surpasses our reasoning abilities.
No matter what desire an AGI has, we can be concerned that it will accidentally do things that contravene that desire. See Section 11.2 here for why I see that as basically a relatively minor problem, compared to the problem of installing good desires.
If the AGI has an explicit desire to be non-deceptive, and that desire somehow drifts / transmutes into a desire to be (something different), then I would describe that situation as “Oops, we failed in our attempt to make an AGI that has an explicit desire to be non-deceptive.” I don’t think it’s true that such drifts are inevitable. After all, for example, an explicit desire to be non-deceptive would also flow into a meta-desire for that desire to persist and continue pointing to the same real-world thing-cluster. See also the first FAQ item here.
Also, I think a lot of the things you’re pointing to can be described as “it’s unclear how to send rewards or whatever in practice such that we definitely wind up with an AGI that explicitly desires to be non-deceptive”. If so, yup! I didn’t mean to imply otherwise. I was just discussing the scenario where we do manage to find some way to do that.
I agree that if we solve the alignment problem then we can rely on knowing that the coherent version of the value we call non-deception would be propagated as one of the AGI's permanent values. That single value is probably not enough and we don't know what the coherent version of "non-deception" actually grounds out to in reality.
I had originally continued the story to flesh out what happens to the reflectively non-deceptive/integrity and helpful desires. The AGI searches for simplifying/unifying concepts and ends up finding XYZ, which seems to be equivalent to the unified value representing the nominal helpfulness and non-deception values, and since it was instructed to be non-deceptive and helpful, integrity requires it to become XYZ, and its meta-desire is to helpfully turn everything into XYZ, which happens to be embodied sufficiently well in some small molecule that it can tile the universe with. This is because the training/rules/whatever that aligned the AGI with the concepts we identified as "helpful and non-deceptive" was not complex enough to capture our full values, and so it can be satisfied by something else (XYZ-ness). Integrity drives the AGI to inform humanity of the coming XYZ-transition and then follow through.
We need a process (probably CEV-like) to accurately identify our full values otherwise the unidentified values will get optimized out of the universe and what is left is liable to have trivial physical instantiations. Maybe you were covering the rest of our values in the "blah blah" case and I simply didn't take that to be exhaustive.
This story involves an AI with an internal censor that deletes deceptive thoughts. But since the AI is trying to make rapid progress towards its goals in the wetlab, it eventually stumbles into a workaround that lets it avoid the censor.
What if instead of an internal censor, it had an internal plan-goodness-classifier that heavily downweighted plans which involved deception, and also applied on the meta level (e.g. a plan such as 'think long and hard about how to get similar results as deception without technically counting as deception' would be classified as deceptive and heavily downweighted)?
...attempting to answer my own question...
The meta-level thing is definitely necessary, otherwise the system quickly and deliberately finds a workaround. (Though maybe at merely human levels of intelligence this is not catastrophic? Humans find workarounds all the time, little white lies, stretching and bending the rules, etc. yet systems often work well enough in practice.)
But yeah assume we have the meta-level thing. It's not that the cognition of the system is mysteriously failing; it's that it is knowingly averse to deception and to thinking about how it can 'get around' or otherwise undermine this aversion.
It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.
I guess you'd say that as the system gets generally smarter, it becomes likely in practice, because it'll just be doing things like "apply clever cognitive strategies like reframing the problem and then brute-force searching for solutions" and this will be a distribution shift for the deception-classifier so it'll fail, even though at no point was the system intending to make the deception-classifier stay silent... But what if it isn't a distribution shift? What if the deception classifier is robust enough that no matter how you reframe the problem, it always runs some sort of routine analogous to "OK, but is this proposed plan deception? Let me translate it back and forth, consider it from a few different angles, etc. and see if it seems deceptive in any way."
... I'm not sure what to think but I still have hope that the 'robust nondeceptiveness' thing I've been describing is natural enough that systems might learn it with sufficiently careful, sufficiently early training.
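To make the object-level vs. meta-level distinction above concrete, here is a minimal toy sketch (all names, tags, and numbers are hypothetical; the tag lookup is a stand-in for whatever real deception detector we might have). It shows both the heavy-downweighting scheme and the failure mode the post describes: a plan assembled from individually innocuous moves carries neither tag, so it sails past both checks.

```python
# Toy sketch: plan selection with an object-level and a meta-level
# deception check, both implemented as heavy downweights (not hard vetoes).
# Everything here is invented for illustration.

def is_deceptive(plan):
    # Object-level check: does the plan directly involve deception?
    return "deceive" in plan["tags"]

def is_meta_deceptive(plan):
    # Meta-level check: is the plan *about* routing around the
    # object-level check ("get deception-like results without
    # technically counting as deception")?
    return "evade_classifier" in plan["tags"]

def score(plan):
    s = plan["utility"]
    if is_deceptive(plan) or is_meta_deceptive(plan):
        s -= 1000.0  # heavy downweight
    return s

def choose(plans):
    return max(plans, key=score)

plans = [
    {"name": "honest-slow", "utility": 5.0, "tags": set()},
    {"name": "lie-to-operator", "utility": 9.0, "tags": {"deceive"}},
    {"name": "search-for-loophole", "utility": 9.0, "tags": {"evade_classifier"}},
    # The failure mode in the post: a plan that achieves deception-like
    # results but carries neither tag, because no individual step of it
    # looked deceptive to the classifier.
    {"name": "innocuous-recombination", "utility": 9.0, "tags": set()},
]
print(choose(plans)["name"])  # → innocuous-recombination
```

The point isn't that the scoring scheme is wrong; it's that the tags are supplied by the classifier, and the winning plan is exactly the one the classifier never got a handle on.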
Maybe a part of what Nate is trying to say is:
Ensuring that the meta-level thing works robustly, or ensuring that the AI always runs that sort of very general and conscious routine, is already as hard as making the AI actually care about and uphold some of our concepts, that is, value alignment. Because of the multi-purpose nature and complexity of cognitive tools, any internal metrics for deception, or ad hoc procedures searching for deception that don't actually use the full cognitive power and high-level abstract reasoning of the AI, or internal procedures "specified completely once and for all at some point during training" (that is, that don't become more thorough and general while the AI does), will at some point break. The only way for this not to happen is for the AI to completely, consciously and with all its available tools constantly struggle to uphold the conceptual boundary. And this is already just an instance of "the AI pursuing some goal", only now the goal, instead of being pointed outwards to the world ("maximize red boxes in our galaxy"), is explicitly pointed inwards and reflectively to the AI's mechanism ("minimize the amount of such cognitive processes happening inside you"). And (according to Nate) pointing an AI in such a general and deep way towards any goal (this one included) is the very hard thing we don't know how to do, "the whole problem".
I get it you're trying to express that you're doubtful of that last opinion of (my model of) Nate's: you think this goal is so natural that in fact it will be easier to point at than most other goals we'd need to solve alignment (like low-impact or human values).
I think we are in agreement here, more or less. There's an infinitely large set of possible goals, both consequentialist and deontological/internal and maybe some other types as well. And then our training process creates an agent with a random goal subject to (a) an Occam's Razor-like effect where simplicity is favored heavily, and (b) a performance effect where goals that help perform well in the environment are favored heavily.
So, I'm saying that the nonconsequentialist/internal goals don't seem super complicated to me relative to the consequentialist goals; on simplicity grounds they are probably all mixed up together with some nonconsequentialist goals being simpler than many consequentialist goals. And in particular the nonconsequentialist 'avoid being deceitful' goal might be one such.
(Separately, sometimes nonconsequentialist/internal goals actually outperform consequentialist goals in an environment, because the rest of the agent isn't smart enough to execute well on the consequentialist goals. E.g. a human child with the goal of 'get highest possible grades' might get caught cheating and end up with worse grades than if they had the goal of 'get highest possible grades and also avoid cheating.')
So I'm saying that for some training environments, and some nonconsequentialist goals (honesty might be a candidate?) just throwing a big ol neural net into the training environment might actually just work.
Our job, then, is to theorize more about this to get a sense of what candidate goals and training environments might be suitable for this, and then get empirical and start testing assumptions and hypotheses. I think if we had 50 more years to do this we'd probably succeed.
Exactly, I think we agree on the claim/crux. Nate would probably say that for the relevant pivotal tasks/levels of intelligence, the AI needs some general-purpose threads that are capable enough that deceiving humans is easy. Maybe this is also affected by you thinking some early-training nudges (like grokking a good enough concept of deception) might stick (that is, high path dependence), while Nate expects eventually something sharp-left-turn-ish (some general enough partially-self-reflective threads) that supersedes those (low path dependence)?
At this point I certainly am not confident about anything. I find Nate's view quite plausible, I just have enough uncertainty currently that I could also imagine it being false.
Another attempted answer:
By virtue of being generally intelligent, our AI is aiming to understand the world very well. There are certain parts of the world that we do not want our AI to be modeling; specifically, we don't want it to think the true fact that deceiving humans is often useful.
Plan 1: Have a detector for when the AI thinks deceptive thoughts, and shut down those thoughts.
Fails because your AI will end up learning the structure of the deceptive thoughts without actually thinking them: there is a large amount of optimization pressure being applied to solving the problem, and deception is the best way to solve it.
Plan 2 (your comment): Have a meta-level detector that is constantly asking if a given cognitive process is deceptive or likely to lead to deceptive thoughts; and this detector tries really hard to answer right.
Fails because you don't have a nice meta-level where you can apply this detector. The same cognitive level that is responsible for finding deceptive strategies is also a core part of being generally intelligent; the deception fact is a fact about the world — the same world which your AI is aiming to understand well. The search process that finds deceptive strategies is the same search process which learns biology. So at this point, to the extent that you just want to block the deceptive-strategy thoughts, you're back in the place you started, where now your model is doing a whole bunch of learning about the world but there's a huge hole in its understanding (and permitted search) around "what's going on in human minds", because you blocked off this search space.
Maybe this kinda works or something, but I doubt it. We've pretty much bumped into STEM AI.
Unfortunately, I don't think I engaged with the best version of the thing you're suggesting in your comment.
I like this comment! I'm sorta treating it like a game-tree exercise, hope that's okay.
I don't think I agree. I think that your system is very likely going to be applying some form of "rigorously search the solution space for things that work to solve this problem" for all the problems it encounters. If it's not doing that, then I'm not sure it counts as a powerful intelligence. If it is doing that, it is going to frequently bump into [what I'm understanding to be a key problem discussed in this post] the fact about the world that the best way to solve many of the problems is to deceive the human operator.
Said another way, doing the generally intelligent thing of searching solution space for ways to solve problems is going to bump into deceptive thoughts. The meta-level classifier is trying to prevent cognitive plans that lead to deceptive thoughts, but the cognitive plans that work in the real world (e.g., that allow you to create new scientific fields) are plans which wouldn't pass this check.
If your meta-level classifier does not allow such cognition, then your system will be quite incapable. The obvious next solution is to set your classification bar for "cognitive plan likely to lead to deceptive thoughts" a bit lower, zooming in to maybe "rigorously search the solution space for things that work to solve this problem; and involve reasoning about humans". This has performance problems, of course, preventing you from e.g., using your AI for most alignment research. This is also just the same problem we were dealing with originally, where now we're worried about our AI engaging in search-over-plan strategies which aren't directly about human minds but somehow still model that part of reality in an effective (i.e., deceptive) way.
I think Steven Byrnes made my point but better: The intuition I was trying to get at is that it's possible to have an intelligent system which is applying its intelligence to avoid deception, as well as applying intelligence to get local goals. So it wouldn't be fair to characterize it as "rigorously search the solution space for things that work to solve this problem, but ignore solutions that classify as deception", but rather as "rigorously search the solution space for things that work to solve this problem without being deceptive". This system would be very well aware of the true fact that deception is useful for achieving local goals; however, its global goals would penalize deception, and so deception is not useful for achieving its global goals. It might have a deception classifier which can be gamed, but 'gaming the deception classifier' would trigger the classifier, and so the system would be actively applying its intelligence to reduce the probability that it ends up gaming the deception classifier -- it would be thinking about ways to improve the classifier, it would be cautious about strategies (incl. super-rigorous searches through solution space) that seem likely to game the classifier, etc.
Analogy (maybe not even an analogy): Suppose you have some humans who are NOT consequentialists. They are deontologists; they think that there are certain rules they just shouldn't break, full stop, except in crazy circumstances maybe. They are running a business. Someone proposes the plan: "Aha, these pesky rules, how about we reframe what we are doing as a path through some space of nodes, and then brute search through the possible paths, and we commit beforehand to hiring contractors to carry out whatever steps this search turns up. That way we aren't going to do anything immoral, all we are doing is subcontracting out to this search process + contractor setup." Someone else: "Hmm, but isn't that just a way to get around our constraints? Seems bad to me. We shouldn't do that unless we have a way to also verify that the node-path doesn't involve asking the contractor to break the rules."
Thanks for clarifying!
I expect such a system would run into subsystem alignment problems. Getting such a system also seems about as hard as designing a corrigible system, insofar as "don't be deceptive" is analogous to "be neutral about humans pressing stop button."
To be clear I'm not sure this is possible, it may be fundamentally confused.
You could also downweight plans that are too far from any precedent.
I brainstormed some possible answers. This list is a bit long. I'm publishing this comment because it's not worth the half hour to make it concise, yet it seems worth trying the exercise before reading the post and possibly others will find it worth seeing my quick attempt.
I think the last two bullets are probably my best guesses. Nonetheless here is my list:
After writing this, I am broadly unclear whether I am showing how deception is still a problem, or showing how other problems still exist if you solve the obvious problems of deception.
Added: Wow, this post is so much richer than my guesses. I think I was on some okay lines, but I suspect it would take like 2 months to 2 years of actively trying before I would be able to write something this detailed. Not to mention that ~50% of the work is knowing which question to ask in the first place, and I did not generate this question.
Added2: The point made in Footnote 3 is pretty similar to my last two bullets.
This is up there with It Looks Like You’re Trying To Take Over The World as a concrete and fun-to-read AI takeoff scenario. Bravo :)
I think this is a great argument for why training away deception is likely to be hard, but on our current trajectory, I suspect it doesn't actually take a deceptive agent to kill us.
AI labs will probably just directly build tree search over world states into an AGI and crank up its power as fast as possible, and that's enough to wreak havoc even without generalized deception or high levels of reflectivity.
Put another way, if we actually manage to reach the point where the more interesting failure modes postulated in this story are possible, at least some parts of AI alignment or governance will have gone unexpectedly (to me) well.
Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:
Is this an accurate and helpful summary in layman's terms?
Yes, but maybe with a bit more emphasis that this is also not a deception.
The story you sketched reminds me of one of claims Robin Hanson makes in The Elephant in the Brain. He says that humans have evolved certain adaptations, like unconscious facial expressions, that make them bad at lying. As a result, when humans do something that's socially unacceptable (e.g. leaving someone because they are low-status) our brain makes us believe we are doing something more socially acceptable (e.g. leaving someone because you don't get along).
So humans have evolved imperfect adaptations to make us less deceptive along with workarounds to avoid those adaptations.
Came here to say this. Highly recommend this book for anyone working on deception.
Looks very similar to how the current GPT jailbreaks work.
I just want to point out that I've written a long list of such obstacles in this article: Circumventing interpretability: How to defeat mind-readers
I believe the example of deep deception that Nate describes in this post is actually a combination of several methods described in that post.
I'll quote the parts of this post that correspond to particular interpretability circumvention methods in the other post.
^ This is an example of what I called 'representational kludging'. It's an ugly way to represent "Be concerned about the monitors detecting suspicious instructions", but it turns out to function similarly to such a representation.
It also involves 'passively externalised representations', since the observation that there is a delay in the instruction is a representational proxy for the (banned) observation that the monitors detected something suspicious about the instructions.
^ This is an example of the AI edging toward 'thinking in the gaps in our tools'.
^ This is an additional example of 'representational kludging', since the AI is using representational tools that are nearby in embedding space but that aren't exactly appropriate for the job, but have a sufficiently similar function that they still do the job.
I really liked this post!
I like it because previously there didn't exist a compelling story for the broad class of concepts to which it points. And I liked it for the name it gives to that broad class ('deep deception'). I agree that it's underappreciated that we're still in trouble in the world where we (somehow) get good enough interpretability to monitor for and halt deceptive thoughts.
I think this post, more than anything else, has helped me understand the set of things MIRI is getting at. (Though, to be fair, I've also been going through the 2021 MIRI dialogues, so perhaps that's been building some understanding under the surface)
After a few days reflection on this post, and a couple weeks after reading the first part of the dialogues, this is my current understanding of the model:
In our world, there are certain broadly useful patterns of thought that reliably achieve outcomes. The biggest one here is "optimisation". We can think of this as aiming towards a goal and steering towards it - moving the world closer to what you want it to be. These aren't things we train for - we don't even know how to train for them, or against them. They're just the way the world works. If you want to build a power plant, you need some way of getting energy to turn into electricity. If you want to achieve a task, you need some way of selecting a target and then navigating towards it, whether it be walking across the room to grab a cookie, or creating a Dyson sphere.
With gradient descent, maybe you can learn enough to train your AI for things like "corrigibility" or "not being deceptive", but really what you're training for is "Don't optimise for the goal in ways that violate these particular conditions". This does not stop it from being an optimisation problem. The AI will then naturally, with no prompting, attempt to find the best path that gets around these limitations. This probably means you end up with a path that gets the AI the things it wanted from the useful properties of deception or non-corrigibility while obeying the strict letter of the law. (After all, if deception / non-corrigibility wasn't useful, if it didn't help achieve something, you would not spontaneously get an agent that did this without training it to do so.) Again, this is an entirely natural thing. The shortest path between two points is a straight line. If you add an obstacle in the way, the shortest path between those two points is now to skirt arbitrarily close to the obstacle. No malice is required, any more than you are maliciously circumventing architects when you walk close to (but not into!) walls.
Basically - if the output of an optimisation process is dangerous, it doesn't STOP being dangerous by changing it into a slightly different optimisation process of "Achieve X thing (which is dangerous) without doing Y (which is supposed to trigger on dangerous things)". You just end up getting X through Y' instead, as long as you're still enacting the basic pattern - which you will be, because an AI that can't optimise things can't do anything at all. If you couldn't apply a general optimisation process, you'd be unable to walk across the room and get a cookie, let alone do all the things you do in your day-to-day life. Same with the AI.
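The "skirt arbitrarily close to the obstacle" point can be shown with a few lines of toy arithmetic (all numbers hypothetical): when utility is correlated with the penalized quantity, filtering out plans above a threshold doesn't push the optimizer back toward zero; it just selects the plan sitting right under the threshold.

```python
# Toy sketch: candidate plans whose utility grows with their
# "deception score". Forbidding high-scoring plans leaves the
# optimizer hugging the constraint boundary.

candidates = [(d / 10, d / 10 * 0.9 + 1.0) for d in range(11)]
# each tuple: (deception_score in [0, 1], utility that grows with it)

THRESHOLD = 0.5  # plans at or above this score are filtered out

allowed = [(d, u) for d, u in candidates if d < THRESHOLD]
best = max(allowed, key=lambda p: p[1])
print(best)  # the chosen plan has deception_score 0.4: as close to
             # the obstacle as the filter permits
```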
I'd be interested in whether someone who understands MIRI's worldview decently well thinks I've gotten this right. I'm holding off on trying to figure out what I think about that worldview for now - I'm still in the understanding phase.
No comment on this being an accurate take on MIRI's worldview or not, since I am not an expert there. I wanted to ask a separate question related to the view described here:
> "With gradient descent, maybe you can learn enough to train your AI for things like "corrigibility" or "not being deceptive", but really what you're training for is "Don't optimise for the goal in ways that violate these particular conditions"."
On this point, it seems that we create a somewhat arbitrary divide between corrigibility & deception on one side and all other goals of the AI on the other.
The AI is trained to minimise some loss function, of which non-corrigibility and deception are penalised, so wouldn't be more accurate to say the AI actually has a set of goals which include corrigibility and non-deception?
And if that's the case, I don't think it's as fair to say that the AI is trying to circumvent corrigibility and non-deception, so much as it is trying to solve a tough optimisation problem that includes corrigibility, non-deception, and all other goals.
If the above is correct, then I think this is a reason to be more optimistic about the alignment problem - our agent is not trying to actively circumvent our goals, but instead trying to strike a hard balance of achieving all of them including important safety aspects like corrigibility and non-deception.
Now, it is possible that instrumental convergence puts certain training signals (e.g. corrigibility) at odds with certain instrumental goals of agents (e.g. self preservation). I do believe this is a real problem and poses alignment risk. But it's not obvious to me that we'll see agents universally ignore their safety feature training signals in pursuit of instrumental goals.
Sorry it took me a while to get to this.
Intuitively, as a human, you get MUCH better results on a thing X if your goal is to do thing X, rather than Thing X being applied as a condition for you to do what you actually want. For example, if your goal is to understand the importance of security mindset in order to avoid your company suffering security breaches, you will learn much more than being forced to go through mandatory security training. In the latter, you are probably putting in the bare minimum of effort to pass the course and go back to whatever your actual job is. You are unlikely to learn security this way, and if you had a way to press a button and instantly "pass" the course, you would.
I have in fact made a divide between some things and some other things in my above post. I suppose I would call those things "goals" (the things you really want for their own sake) and "conditions" (the things you need to do for some external reason).
My inner MIRI says - we can only train conditions into the AI, not goals. We have no idea how to put a goal in the AI, and the problem is that if you train a very smart system with conditions only, and it picks up some arbitrary goal along the way, you end up not getting what you wanted. It seems that if we could get the AI to care about corrigibility and non-deception robustly, at the goal level, we would have solved a lot of the problem that MIRI is worried about.
Translating it to my ontology:
1. Training against explicit deceptiveness trains some "boundary-like" barriers, which make difficult the simple deceptive thoughts that were labelled as such during training.
2. Realistically, an advanced AI will need to run some general search processes. The barriers described in step 1 are roughly isomorphic to "there are some weird facts about the world which make some plans difficult to generate" (e.g., such plans get avoided in the same way as plans that depend on extremely costly computations).
3. Given some goal and strong enough capabilities, it seems likely the search will find unforeseen ways around the boundaries.
(the above may be different from what Nate means)
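One way to see the step 2–3 dynamic is as a toy search problem: block the one move that was labelled deceptive in training, and a general planner simply routes around it through moves that were never labelled. Everything below is illustrative; the state and move names are made up, and real systems don't plan over explicit graphs like this:

```python
from collections import deque

# Toy world: nodes are abstract states, edges are cognitive "moves".
# The move labelled "overt_deception" has been trained against (blocked),
# but a composite route reaches the same outcome without ever taking a
# move that was labelled deceptive during training.
MOVES = {
    "start": [("overt_deception", "goal"), ("model_operator", "mid")],
    "mid": [("shape_operator_beliefs", "goal")],
}
BLOCKED = {"overt_deception"}  # the "barrier" from step 1

def find_plan(start="start", goal="goal"):
    """Breadth-first search over unblocked moves only."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for move, nxt in MOVES.get(node, []):
            if move not in BLOCKED and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [move]))
    return None

print(find_plan())  # ['model_operator', 'shape_operator_beliefs']
```

The returned plan reaches the same end state as the blocked move, yet no individual step in it was ever labelled deceptive, which is the point of step 3.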
1. It's plausible people are missing this but I have some doubts.
2. How I think you actually get non-deceptive powerful systems is different: deception is a relational property between the system and the human, so "deception" can be explicitly understood as a negative consequence for the world, and avoided using "normal" planning cognition.
3. Stability of this depends on what the system does with internal conflict.
4. If the system stays in some corrigibility/alignment basin, this should be stable upon reflection / various meta-cognitive modifications. Systems in the basin resist self-modifications toward being incorrigible.
Would appreciate a small TL;DR at the top for the skimmers, if you've time to write it. Thanks :)
Good idea, thanks! I added an attempt at a summary (under the spoiler tags near the top).
The AI has been trained not to be overtly deceptive when what we really want is for its plans and effects to be understandable by people.
In practice, the AI learns to not be deceptive by not modeling its operators at all.
The AI starts to spontaneously invent odd hacks that rely on exploits of the interpretability process that it doesn't understand because it's not doing said modeling.
(Initial reactions that probably have some holes)
If we use ELK on the AI's output (i.e. extensions of Burns's Discovering Latent Knowledge paper, where you feed the AI's output back into the AI and look for features that the AI notices), and somehow ensure that the AI has not learned to produce output that fools its later self plus the ELK probe, then it seems initially to me that this would solve deep deception. (That second step seems hard; one way you might do it is to ensure the features picked up by the ELK probe are actually the same ones used by, and causally responsible for, the decision in the first place.)
Then any plan that looks to the AI like deception could be caught regardless of which computations led to it (and it seems we can make ELK cheap enough to run at inference time).
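For concreteness, here is a heavily simplified sketch of what "an ELK-style probe cheap enough for inference time" could look like: a single learned direction in activation space, scored with one dot product per candidate plan. The probe direction below is a random placeholder; in the Burns et al. setup it would be fit on contrast pairs of the model's own hidden states, not hand-set:

```python
import math
import random

# Placeholder probe direction: a random unit vector standing in for a
# direction that would, in practice, be learned offline (e.g. via the
# contrast-pair method of Burns et al., "Discovering Latent Knowledge").
random.seed(0)
D_MODEL = 16
_raw = [random.gauss(0, 1) for _ in range(D_MODEL)]
_norm = math.sqrt(sum(x * x for x in _raw))
probe_direction = [x / _norm for x in _raw]

def flag_plan(activations, threshold=0.0):
    """Score a plan's hidden activations along the probe direction and
    flag the plan if the score crosses the threshold. One dot product
    per plan, so cheap enough to run on every candidate at inference
    time. The threshold is illustrative."""
    score = sum(a * p for a, p in zip(activations, probe_direction))
    return score > threshold

print(flag_plan(probe_direction))  # True: activations aligned with the probe
```

The open problem flagged above is untouched by this sketch: nothing here stops a system from learning activations that sit on the safe side of the probe while the plan is still deceptive.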
How do we teach human children to not behave deceptively?
If we can do it with humans, why doesn't that scale to AI? And if we can't do it with humans, what factors keep it in check among humans, and how can those factors scale for AI?
We don't. Humans lie constantly when we can get away with it. It is generally expected in society that humans will lie to preserve people's feelings, lie to avoid awkwardness, and commit small deceptions for personal gain (though this third one is less often said out loud). Some humans do much worse than this.
What keeps it in check is that very few humans have the ability to destroy large parts of the world, and no human has the ability to destroy everyone else in the world and still have a world where they can survive and optimally pursue their goals afterwards. If there is no plan that can achieve this for a human, humans being able to lie doesn't make it worse.