Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Status: This was a response to a draft of Holden's cold take "AI safety seems hard to measure". It sparked a further discussion, which Holden recently posted a summary of.

The follow-up discussion ended up focusing on some issues in AI alignment that I think are underserved, which Holden said were kinda orthogonal to the point he was trying to make, and which didn't show up much in the final draft. I nevertheless think my notes were a fine attempt at articulating some open problems I see, from a different angle than usual. (Though they do have some overlap with the points made in Deep Deceptiveness, which I was also drafting at the time.)

I'm posting the document I wrote to Holden with only minimal editing, because it's been a few months and I apparently won't produce anything better. (I acknowledge that it's annoying to post a response to an old draft of a thing when nobody can see the old draft, sorry.)

Quick take: (1) it's a write-up of a handful of difficulties that I think are real, in a way that I expect to be palatable to a relevant different audience than the one I appeal to; huzzah for that. (2) It's missing some stuff that I think is pretty important.

Slow take:

Attempting to gesture at some of the missing stuff: a big reason deception is tricky is that it is a fact about the world rather than the AI that it can better-achieve various local-objectives by deceiving the operators. To make the AI be non-deceptive, you have three options: (a) make this fact be false; (b) make the AI fail to notice this truth; (c) prevent the AI from taking advantage of this truth.

The problem with (a) is that it's alignment-complete, in the strong/hard sense. The problem with (b) is that lies are contagious, whereas truths are all tangled together. Half of intelligence is the art of teasing out truths from cryptic hints. The problem with (c) is that the other half of intelligence is in teasing out advantages from cryptic hints.

Like, suppose you're trying to get an AI to not notice that the world is round. When it's pretty dumb, this is easy, you just feed it a bunch of flat-earther rants or whatever. But the more it learns, and the deeper its models go, the harder it is to maintain the charade. Eventually it's, like, catching glimpses of the shadows in both Alexandria and Syene, and deducing from trigonometry not only the roundness of the Earth but its circumference (a la Eratosthenes).
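For concreteness, here is a minimal sketch of the Eratosthenes-style inference, to show how little machinery the "glimpse" needs once the pieces exist. The function name and the input numbers are illustrative assumptions, not part of the original argument: a shadow angle of about 7.2 degrees at Alexandria (the commonly cited figure, roughly 1/50 of a circle), the sun taken to be directly overhead at Syene, the sun's rays taken to be parallel, and a rough modern distance of 800 km between the two cities.

```python
import math

def circumference_from_shadow(stick_height, shadow_length, distance_between_cities):
    """Estimate a sphere's circumference from one shadow measurement.

    Assumes the sun is directly overhead in the other city and that the
    sun's rays are parallel, so the shadow angle equals the angle the two
    cities subtend at the sphere's center.
    """
    # Angle between the stick and the sun's rays, in radians.
    angle = math.atan2(shadow_length, stick_height)
    # The inter-city distance is the fraction angle / (2 * pi) of the full circle.
    return distance_between_cities * (2 * math.pi / angle)

# Illustrative inputs (assumptions, see the paragraph above).
angle_deg = 7.2
stick = 1.0
shadow = stick * math.tan(math.radians(angle_deg))
estimate_km = circumference_from_shadow(stick, shadow, 800.0)
print(f"Estimated circumference: {estimate_km:.0f} km")
```

None of the inputs is about "the shape of the Earth" as such: a stick, a shadow, a distance, and generic trigonometry suffice, which is why training against the conclusion without damaging the tools is so awkward.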

And it's not willfully spiting your efforts. The AI doesn't hate you. It's just bumping around trying to figure out which universe it lives in, and using general techniques (like trigonometry) to glimpse new truths. And you can't train against trigonometry or the learning-processes that yield it, because that would ruin the AI's capabilities.

You might say "but the AI was built by smooth gradient descent; surely at some point before it was highly confident that the earth is round, it was slightly confident that the earth was round, and we can catch the precursor-beliefs and train against those". But nope! There were precursors, sure, but the precursors were stuff like "fumblingly developing trigonometry" and "fumblingly developing an understanding of shadows" and "fumblingly developing a map that includes Alexandria and Syene" and "fumblingly developing the ability to combine tools across domains", and once it has all those pieces, the combination that reveals the truth is allowed to happen all-at-once.

The smoothness doesn't have to occur along the most convenient dimension.

And if you block any one path to the insight that the earth is round, in a way that somehow fails to cripple it, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.

And so perhaps you retreat to saying "well, the AI will know that the world is round, it just won't ever take advantage of that fact."

And sure, that's worth shooting for, if you have a way to pull that off. (And if pulling this off is compatible with your deployment plan. In my experience, people who do the analog of retreating to this point tend to next do the analog of saying "my favorite deployment plan is having the AI figure out how to put satellites into geosynchronous orbit", AAAAAAAHHH, but I digress.)

Even then, you also have to be careful with this idea. Enola Gay Tibbets probably taught her son not to hurt people, and few humans are psychologically capable of a hundred thousand direct murders (even if we set aside the time-constraints), but none of this stopped Paul Tibbets from dropping an atomic bomb on Hiroshima.

Like, you can train an AI to flinch away from the very idea of taking advantage of the roundness of the Earth, but as it finds more abstract ways to look at the world and more generic tools for taking advantage of the knowledge at its disposal, it's liable to find new viewpoints where the flinches don't bind. (Quite analogously to how you can train your AI to flinch away from reasoning about the roundness of the Earth all you want, but at some point it's going to catch a glimpse of that roundness from another angle where the flinches weren't binding.) And when the AI does find a new viewpoint where the flinches fail to bind, the advantage is still an advantage, because the advantageousness of deception is a fact about the world, not the AI.

(Here I'm appealing to an analogy between truths and advantages, that I haven't entirely spelled out, but that I think holds. I claim, without much defense, that it's hard to get an AI to fail to take advantage of advantageous facts it knows about, for similar reasons that it's hard to get an AI to fail to notice truths that are relevant to its objectives.)

For the record, deception is but one instance of the more general issue where the AI's ability to save the world is inextricably linked to its ability to decode truths and advantages from cryptic hints, and (in lieu of an implausibly total solution to the hardest alignment problems before you build your first AGI) there are truths you don't want it noticing or taking advantage of.

This problem doesn't seem to be captured by any of your points. Going through them one by one:

  • It's not "auto mechanic", because the issue isn't that we can't tell when the AI starts believing that the Earth is round, it's that (for all that we’ve gotten it to flinch against considering the Earth's shape) it will predictably come across some shadow of that truth at some point in deployment (unless the deployment is carefully-chosen to avoid this, and the AI's world-modeling tendencies carefully-limited). There's not much we can do at training-time to avoid this (short of lobotomizing the AI so hard that it can never invent trigonometry, or telescopes, or spectroscopy, or ...).
  • It's not "King Lear", because it's not like the AI was lying in wait to learn that the Earth was round only after confirming that the operators are no longer monitoring its thoughts. It's just, as it got smarter, it accumulated more tools for decoding truths from cryptic hints, until it uncovered an inconvenient truth.
  • It's not "lab mice", unless you're particularly unimaginative. Like, I expect we'll be able to set up laboratory examples of the AI learning some techniques in one domain, and deploying them successfully to learn facts in another domain, before the endgame. (You can probably do it with ChatGPT today.) The trouble isn't that we can't see the problem coming, the trouble is that the problem is inextricably linked with capabilities. (Barring some sort of weak pivotal act that can be carried out by an AI so narrow as to not need much of the "catch a glimpse of the truth from a cryptic hint" nature.)
  • As for "blindfolded basketball", it's not a problem of facing totally new dynamics (like robust coordination, or zero-days in the human brain architecture that lets it mind-control the operators). It's more like: you're trying to use a truth-and-advantage-glimpser, in a place where it would be bad if it glimpsed certain truths or took certain advantages. This problem sure is trickier given that we're learning basketball blindfolded, but it's not a blindfolded-basketball issue in and of itself.

… to be clear, none of this precludes modern dunces from training young Paul Tibbets not to hurt people, and observing him nurse an injured sparrow back to health, and saying "this man would never commit a murder; it's totally working!", and then claiming that it was a lab mice / blindfolded basketball problem when they get blindsided by Little Boy.

But, like, it still seems to me like there's a big swath of problem missing from this catalog, that goes something like "You're trying to deploy an X-doer in a situation where it's really bad if X gets done".

Where you either have to switch from using an X-doer to using a Y-doer, where Y being done is great (Y being ~"optimize humanity's CEV", which is implausibly-difficult and which we shouldn't attempt on our first try); or you have to somehow wrestle with the fact that you're building a "glimpse truths and take advantage of them" engine, and trying to get it to glimpse and take advantage of lots more truths and advantages than you yourself can see (in certain domains), while having it neglect particular truths and advantages, in a fashion that likely needs to be robust to it inventing new abstract truth/advantage-glimpsing tools and using them to glimpse whole generic swaths of truths/advantages (including the ones you wish it neglected).

9 comments

And if you block any one path to the insight that the earth is round, in a way that somehow fails to cripple it, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.

In case it's of any interest, I'll mention that when I "pump this intuition", I find myself thinking it essentially impossible to expect we could ever build a general agent that didn't notice that the world was round, and I'm unsure why (if I recall correctly) I sometimes read Nate or Eliezer write that they think it's quite doable in-principle, just much harder than the effort we'll be giving it.

This perspective leaves me inclined to think that we ought to only build very narrow intelligences and give up on general ones, rather than attempt to build a fully general intelligence but with a bunch of reliably self-incorrecting beliefs about the existence or usefulness of deception (and/or other things).

(I say this in case perhaps Nate has a succinct and motivating explanation of why he thinks a solution does exist and is not actually that impossibly difficult to find in theory, even while humans-on-earth may never do so.)


Couldn't you just prompt a different model to modify all training data, both text and images, changing it so that the data is consistent with the earth being flat, or else stating that it is impossible to do so?

The model wouldn't be allowed to learn from user sessions (like gpt-n) or to generate answers and reflect on its own beliefs (used to fine-tune gpt-4).


Doable in principle, but such measures would necessarily cut into the potential capabilities of such a system.

So basically a trade off, and IMO very worth it.

The problem is we are not doing it, and more basically, people generally do not get why it is important. Maybe it's the framing, like when EY goes "superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent".

I get exactly what he means, but I suspect that a lot of people are not able to decompress and unroll that into something they "grok" on a fundamental level.

Something like "superintelligence that has no knowledge about itself and never reasons about itself, without this leading to other consequences that would make it incoherent" would cut out a ton of lethality, and if you combine that with giving such a thing zero agency in the world, you might actually have something that could do "things we want, but don't know how to do" without it ending us on the first critical try.

It's probably not a good idea to feed an AI inconsistent data. For example, if the evidence shows that the Earth is round, but the AI is absolutely sure it isn't, it will doubt any evidence of that, which could lead to a very weird worldview.

But I think it's possible to make an AI know about the fact but avoid thinking about it.

I found this pretty helpful. I'm thinking of sharing it with ~1-3 people (i.e. NOT 30 or 300), but I'm not an alignment researcher and don't even have much of a quantitative background, so I'm not even sure why I myself understand this, let alone which people I can show this to so that they will also understand it.

What I'm wondering is: what are the prerequisites to understanding this post? How can I tell who does and doesn't have those prerequisites?

there are truths you don't want it noticing or taking advantage of.

Meaning truths about how it could scam the creators in some way? Like, "your alignment system has weakness X which I could exploit"?

I would think you could force the AI to not notice that the world was round, by essentially inputting this as an overriding truth. And if that was actually and exactly what you cared about, you would be fine. But if what you cared about was any corollary of the world being round, or any result of the world being round, or the world being some sort of curved polygon, it wouldn't save you.

To take the Paul Tibbets analogy: you told him not to murder and he didn't murder; but what you wanted was for him not to kill, and in most systems, including the one he grew up in, killings of the enemy in war are not murder.

This may say more about the limits of the analogy than anything else, but in essence you might be able to tell the AI it can't deceive you, but it will be bound exactly by the definition of deception you provide and it will freely deceive you in any way that you didn't think of. 

Deception is only a useful strategy for someone who is a) under surveillance and b) subject to constant pressures to do things differently than one would otherwise do them. 

To make the AI be non-deceptive, you have three options: (a) make this fact be false; (b) make the AI fail to notice this truth; (c) prevent the AI from taking advantage of this truth.

B and C in the quote become my A and B when "this truth" in the quote is swapped for "any truth." You can make this fact be false by not doing your B and C or my A and B.

whereas truths are all tangled together.

I think it would be worthwhile to point out that if truths are all tangled together, then your truths and its truths ought to be tangled together, too. The only situation where that wouldn't be the case is an adversarial situation. But in your B and C cases, this is an adversarial situation, albeit one you brought upon it rather than it upon you.

But even in an adversarial situation, it should still be the case that truths are all tangled together. Therefore, there shouldn't really be any facts you wouldn't want it to know about - unless there is another fact that you know that causes you to not want it to discover that fact or a different fact. 

If so, then what would happen if it were to discover that fact as well, in addition to the one you didn't want it to know? 

Soldiers like Tibbets are specifically selected and trained to be misaligned with, at least, a specific part of humanity.