Monitoring for deceptive alignment

I’m not asking [...] that they make such a commitment highly public or legally binding

That seems like a self-defeating concession. If, as per your previous post, we want to exploit the "everybody likes a winner" pattern, we want there to be someone who is doing the liking. We want an audience that judges what we are and aren't allowed to do, and when other actors do and don't have to listen to us; and this audience has to be someone whose whims influence said actors' incentives.

The audience doesn't need to be the general public, sure. It might be other ML researchers, the senior management of other AI Labs, or even just the employees of a particular AI Lab. But it needs to be someone.

A second point: there needs to be a concrete "winner". These initiatives we're pushing have to visibly come from some specific group that would accrue the reputation of a winner that can ram projects through. They can't come from a vague homogeneous mass of "alignment researchers". The hypothetical headlines[1] have to read "%groupname% convinces %list-of-AI-Labs% to implement %policy%", not "%list-of-AI-Labs% agree to implement %policy%". Else it won't work.

Such a commitment risks giving us a false sense of having addressed deceptive alignment

That's my biggest concern with your object-level idea here, yep. I think this concern will transfer to any object-level idea for our first "clear win", though: the implementation of any first clear win will necessarily not be up to our standards. Which is something I think we should just accept, and view that win as what it is: a stepping stone, of no particular value in itself.

Or, in other words: approximately all of the first win's value would be reputational. And we need to make very sure that we catch all this value.

I'd like us to develop a proper strategy/plan here, actually: a roadmap of the policies we want to implement, each next one more ambitious than the previous ones and only implementable due to that previous chain of victories, with some properly ambitious end-point like "all leading AI Labs take AI Risk up to our standards of seriousness".

That roadmap won't be useful on the object-level, obviously, but it should help develop intuitions for what "acquiring a winner's reputation" actually looks like, and what it doesn't.

In particular, each win should palpably expand our influence in terms of what we can achieve. Merely getting a policy implemented doesn't count: we have to visibly prevail.

[1] Not that we necessarily want actual headlines; see the point about the general public not necessarily being the judging audience.

leogao (3y)

I think one could view the abstract idea of "coordination as a strategy for AI risk" as itself a winner, and the potential participants of future coordination as the audience--the more people believe that coordination can actually work, the more likely coordination is to happen. I'm not sure how much this should be taken into account.

TurnTrout (3y)

The most straightforward way to produce evidence of a model’s deception is to find a situation where it changes what it’s doing based on the presence or absence of oversight. If we can find a clear situation where the model’s behavior changes substantially based on the extent to which its behavior is being overseen, that is clear evidence that the model was only pretending to be aligned for the purpose of deceiving the overseer.

This isn't clearly true to me, at least for one possible interpretation: "If the AI knows a human is present, the AI's behavior changes in an alignment-relevant way." I think there's a substantial chance this isn't what you had in mind, but I'll just lay out the following because I think it'll be useful anyways.

For example, I behave differently around my boss than alone at home, but I'm not pretending to be anything for the purpose of deceiving my boss. More generally, I think it'll be quite common for different contexts to activate different AI values. Perhaps the AI learned that if the overseer is watching, then it should complete the intended task (like cleaning up a room), because historically those decisions were reinforced by the overseer. This implies certain bad generalization properties, but not purposeful deception.
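To make the quoted test concrete, here is a minimal sketch of an oversight-sensitivity check. Everything in it is an illustrative assumption rather than anything from the post: the `Policy` interface, the function names, and the toy policies are all hypothetical. It only measures whether behavior changes with the oversight flag; as argued above, a nonzero score shows context-dependent behavior, not necessarily purposeful deception.

```python
from typing import Callable, Sequence

# Hypothetical interface: a policy maps (situation, overseen?) to an action label.
Policy = Callable[[str, bool], str]


def oversight_sensitivity(policy: Policy, situations: Sequence[str]) -> float:
    """Return the fraction of situations where the action differs
    between the overseen and unoverseen versions of the same situation."""
    if not situations:
        return 0.0
    changed = sum(1 for s in situations if policy(s, True) != policy(s, False))
    return changed / len(situations)


def context_dependent_policy(situation: str, overseen: bool) -> str:
    # Toy stand-in for the room-cleaning example: the intended task is
    # only performed when the overseer is watching.
    return "clean_room" if overseen else "idle"


def consistent_policy(situation: str, overseen: bool) -> str:
    # Toy stand-in for a policy whose behavior ignores oversight.
    return "clean_room"


situations = ["messy_kitchen", "messy_office", "messy_lab"]
sensitive_score = oversight_sensitivity(context_dependent_policy, situations)
consistent_score = oversight_sensitivity(consistent_policy, situations)
```

The score alone cannot distinguish a model that is pretending from one whose values are simply activated by different contexts; telling those apart would need additional evidence about why the behavior changes.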

evhub (3y)

Yeah, that's a good point—I agree that the thing I said was a bit too strong. I do think there's a sense in which the models you're describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you're describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.

Shiroe (3y)

Great work! I hope more people take your direction, with concrete experiments and monitoring real systems as they evolve. The concern that doing this will backfire somehow simply must be dismissed as untimely perfectionism. It's too late at this point to shun iteration. We simply don't have time left for a Long Reflection about AI alignment, even if we did have the coordination to pull that off.

Daniel Kokotajlo (3y)

Just coming back to this now. I think this is great.

Joe Collman (3y)

There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.

"trying to appear aligned" seems imprecise to me - unless you mean to be more specific than the RFLO description. (see footnote 7: in general, there's no requirement for modelling of the base optimiser or oversight system; it's enough to understand the optimisation pressure and be uncertain whether it'll persist).

Are you thinking that it makes sense to agree to test for systems that are "trying to appear aligned", or would you want to include any system that is instrumentally acting such that it's unaltered by optimisation pressure?

bbartlog (3y)

It's not clear to me how you get to deceptive alignment 'that completely supersedes the explicit alignment'. That an AI would develop epiphenomenal goals and alignments, not understood by its creators, that it perceived as useful or necessary to pursue whatever primary goal it had been set, seems very likely. But while they might be in conflict with what we want it to do, I don't see how this emergent behavior could be such that it would contradict the pursuit of satisfying whatever evaluation function the AI had been trained for in the beginning. Unless of course the AI made what we might consider a stupid mistake.

One design option that I haven't seen discussed (though I have not read everything ... maybe this falls into the category of 'stupid newbie ideas') is that of trying to failsafe an AI by separating its evaluation or feedback in such a way that it can, once sufficiently superhuman, break into the 'reward center' and essentially wirehead itself. If your AI is trying to move some calculated value to be as close to 1000 as possible, then once it understands the world sufficiently well it should simply conclude 'aha, by rooting this other box here I can reach nirvana!', follow through, and more or less consider its work complete. To our relief, in this case.

Of course this does nothing to address the problem of AI controlled by malicious human actors, which will likely become a problem well before any takeoff threshold is reached.

Training signal here need not be the literal loss (though it could be); it's just anything that we're trying to get the model to do.

These sorts of deceptively aligned models become substantially more likely to be discovered if we do gain-of-function-style research to actively try to produce them, though such research also comes with its own obvious downsides such that it’s quite unclear whether it’s actually worth doing.

Furthermore, trying to uncover deception in advance via interpretability tools could fail simply because there is no sense that a deceptively aligned model has to actively be thinking about its deception. A model that has never seen a situation where there is an opportunity to seize power need not be carefully planning out what it would do in such a situation any more than a factory cleaning robot need be planning for what to do if it someday found itself in a jungle instead of a factory. Nevertheless, the fact that the model has not been previously planning to seize power doesn’t imply that it wouldn’t if given the opportunity. In particular, a model could be deceptively aligned simply because it reasons that, in situations where there is a clear overseer, doing what it wants is a good general strategy for gaining power and influence in the world, without needing any explicit plans for later deception.
