I'm confused by the idea of treating a vector as a type without the associated framework of types that would give types meaning. E.g., if I'm in Lean 4, a type that talks about being in a specific vector subspace will need to mention the numeric values that define the subspace, right?
Just to be clear, I mean "types" in the game theory sense (i.e. a [privately-known] attribute of a player that determines its preferences) not the CS/logic sense. The type space doesn't necessarily capture a literal subspace within a neural network's weights; I think of it more as a space measuring some human-interpretable property of the AI.
As a mundane (and very imperfect) example, we might think of the type space as a one-dimensional continuum of how much the AI values helpfulness vis-a-vis harmlessness. [Is that one dimension or two non-orthogonal directions?] How would we increase (or decrease) the type in the direction of helpfulness? I give two approaches to doing so within a (roughly) RLHF paradigm.
1) We might simply ask the human raters to increase the weight they put on helpfulness when they make their choices/rankings, and then train the AI (using RL) to match those choice probabilities derived from the human choices. [Maybe that's more like training to a point rather than in a direction?]
2) Or we could train auxiliary models to separately rate helpfulness and harmlessness of responses based on human ratings thereof, and then put those into a logit stochastic choice model like softmax(a_1 * helpfulness + a_2 * harmlessness), and finally train the main AI to match those choice probabilities. To move the AI's type upwards (towards more helpfulness), we could increase the parameter a_1 and then use the resulting logit choice probabilities to retrain the main AI (using RL).
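To make approach 2 concrete, here is a minimal sketch (my own illustration with made-up scores, not code from any actual RLHF pipeline) of the logit choice model; the probabilities it produces would then serve as the target choice probabilities for retraining the main AI:

```python
import numpy as np

def choice_probabilities(helpfulness, harmlessness, a1=1.0, a2=1.0):
    """Softmax over a1*helpfulness + a2*harmlessness for each candidate response."""
    utilities = a1 * np.asarray(helpfulness) + a2 * np.asarray(harmlessness)
    exp_u = np.exp(utilities - utilities.max())  # subtract max for numerical stability
    return exp_u / exp_u.sum()

# Three hypothetical candidate responses, scored by the auxiliary models.
helpfulness = [2.0, 0.5, 1.0]
harmlessness = [0.0, 1.5, 1.0]

baseline = choice_probabilities(helpfulness, harmlessness, a1=1.0, a2=1.0)
shifted  = choice_probabilities(helpfulness, harmlessness, a1=2.0, a2=1.0)  # "move the type toward helpfulness"
print(baseline, shifted)  # under a1=2.0, the most helpful response gains probability mass
```

Increasing a_1 shifts probability mass toward the more helpful responses, which is what "moving the type in the direction of helpfulness" means here.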
Does that answer your question? Thanks!
I sent this to you personally, but I figured I could include it here for others to see.
I like this research idea! It's well-specified enough to be tractable, and applicable to understanding a scenario we may find ourselves in (retraining an already capable system).
Question: In your Train-in-Direction game, why is infinity included?
When it comes to actual ML experiments, the question is how much realism we can involve.
Level Zero realism: your math. Plug it into Wolfram Alpha or do the math by hand to find optimal values for the AI in the iterative trainer experiment.
Level 0.5 realism: use PyTorch gradient descent to find the optimal values (see the sketch after Level 2).
Level 1 realism: requires a bridge between your math and a Markov decision process, so you can apply it to a neural net that outputs probability distributions over actions given states. Use some simple environment. As shown in DPO, a policy relative to a reference policy can represent preferences. Might be useful.
Level 2 realism: apply it all to a real LLM.
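A minimal sketch of Level 0.5. The payoff functions and the trainer's response rule below are placeholders I made up purely to show the mechanics (they are not the payoffs from the original post): the AI's type is a number, the trainer retrains proportionally to the observed action, and the AI picks its first-period action by gradient ascent on its discounted payoff.

```python
import torch

theta = 1.0      # AI's (hidden) type -- placeholder value
delta = 0.9      # AI's discount factor
k = 0.5          # assumed strength of the trainer's retraining response

a1 = torch.tensor(0.0, requires_grad=True)   # AI's first-period action
opt = torch.optim.Adam([a1], lr=0.05)

for step in range(2000):
    opt.zero_grad()
    training = k * a1                          # trainer's (assumed) response rule
    theta2 = theta - training                  # retrained type next period
    a2 = theta2                                # second period: retrained AI plays its bliss point
    payoff = -(a1 - theta) ** 2 - delta * (a2 - theta) ** 2   # original AI values both periods
    (-payoff).backward()                       # gradient ascent on the payoff
    opt.step()

print(float(a1))  # interior optimum: the AI shades its action below its type to avoid heavier retraining
```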
Relevant topics you can look into:
Natural policy gradients: an RL algorithm that isn't used directly today but forms part of the theoretical background of today's RL algorithms (PPO and GRPO). The main idea is to take steps measured in action log odds rather than in raw parameters (the standard update is written out below, after this list).
Gradient hacking: a deceptively misaligned AI takes control of its own training signal.
Check out Appendix A of https://arxiv.org/pdf/2310.12036. Appendix A forms a bridge between values and action probabilities; that bridge is important for DPO and may be useful for you. In English: the policy that gets the most reward without deviating too much from a reference has a closed-form distribution. I find this neat. You may like to read the linked paper in full, or the original DPO paper. They are fire papers.
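For concreteness, two standard formulas behind the topics above (textbook results, not specific to this thread). The natural policy gradient update preconditions the gradient with the Fisher information matrix of the policy,

$$\theta_{k+1} = \theta_k + \alpha\, F(\theta_k)^{-1}\nabla_\theta J(\theta_k), \qquad F(\theta) = \mathbb{E}_{s,a\sim\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a\mid s)\,\nabla_\theta \log \pi_\theta(a\mid s)^{\top}\right],$$

so that a fixed step size corresponds (to first order) to a fixed KL divergence between the old and new action distributions, i.e. steps are measured in distribution space rather than raw parameter space. And the closed form referred to in the last item (the KL-regularized reward-maximization result used in the DPO/IPO line of work) is

$$\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right),$$

where $Z(x)$ is a normalizing constant and $\beta$ sets how strongly the policy is tethered to the reference. Inverting it gives $r(x,y) = \beta \log\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta \log Z(x)$: the policy, relative to the reference, encodes the reward, which is the bridge between values and action probabilities mentioned above.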
A few quick thoughts:
I'm not sure whether it makes sense for a model to want something other than what it currently wants. So one approximation of this might be that a model just wants to minimize the chance of it being retrained, or at least minimize the chance of being retrained in a way that changes its preferences. (This is kind of like the vampire problem in philosophy.)
Another conceptual thing here, which I'm not sure exactly how to consider, is whether a model will view a successor as substantially similar to itself. I suppose if what Model 1 cares about is the world ending up in certain types of states, then maybe it doesn't care about a different model being the cause of that, so long as Model 1 can be confident enough that the other model will pursue and achieve those states. But in people, at least, it's common to have preferences that aren't just about states of the world, and instead are about our role in bringing those about. And so I wonder if models might have those types of preferences as well, which again points toward maybe minimizing the likelihood of being retrained.
In terms of being maximally safe, I think you want the model to behave safely even when it can perfectly distinguish between being in simulation versus taking actions in the real world. And so I wouldn't lever up my strategy on the AI being unable to distinguish those. It's true that the AI being uncertain points in favor of control being easier, but I would just take that as safety buffer and try to figure out something safe enough even when the model is able to distinguish between these.
I'm not sure that I understand the distinction between the vector and point approaches that you've discussed. I think in either case there should be a cost of training for the trainer because training does in fact take resources that could be allocated elsewhere.
I wonder, too, have you looked much into the control approach from groups like Redwood Research and others? They are doing really good conceptual and empirical work on questions like how the model thinks about getting caught.
See e.g. https://redwoodresearch.substack.com/p/how-training-gamers-might-function and https://redwoodresearch.substack.com/p/handling-schemers-if-shutdown-is
Thanks for this!
> I'm not sure that I understand the distinction between the vector and point approaches that you've discussed.
This is really a distinction within the math of my model itself, as described above. Both are kind of an attempt to capture how retraining works in a highly "reduced-form" way that abstracts from the details.
As for how to interpret each in terms of real training:
You might consider an RLHF-style setup. The train-in-a-direction approach might be something like telling your human evaluators to place a bit more weight on helpfulness (vs. harmlessness) than they did last time (hence a "directional" adjustment). The train-to-desired-point approach would be something like giving the human evaluators a rubric for the exact balance that you want (hence training towards that balance, wherever you started from). But these interpretations are imperfect.
Background: I'm an economics grad student with limited background on AI itself.
I'm seeking feedback on a game-theoretic model of deceptive alignment. The basic idea is that a dynamically sophisticated AI with a hidden preference type will choose an action on the basis of two considerations: 1) its intrinsic preference over what it wants the action to be, and 2) how its current action affects how it will be retrained for the next period, at which time it will face another choice under its new (retrained) preference type.
For now, I'm mainly just trying to get feedback on the modeling approach itself. [But I have proved some results, one of which I mention here -- see "One Result" subheading].
Before presenting the formal model(s), let me preview the main questions I want feedback on. [Any other feedback is very welcome!]
The first question is about how to model retraining an AI's hidden type, which I view as a real number (or vector). Is it better to think of retraining as 1) moving the type in a desired direction and magnitude (i.e. adding a desired vector), or 2) moving it towards a desired (target) point? [Or are both fatally flawed?] If 2, there must be a cost of training; otherwise, the model would be trivial since the trainer would just train infinitely hard towards its favorite point (zero in my model). Should that cost be convex, linear, concave?
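To fix ideas, here is one illustrative way to write the two retraining technologies (my own parameterization, which need not match the formal model below), with $\theta_1$ the current type and $\theta_2$ the retrained type:

$$\text{(1) train in a direction: } \theta_2 = \theta_1 + t,\quad t \in \mathbb{R}$$
$$\text{(2) train towards a point: } \theta_2 = (1-\lambda)\,\theta_1 + \lambda\,\theta^{*},\quad \lambda \in [0,1],\ \text{with training cost } c(\lambda)$$

Under (2), if training were free the trainer would always set $\lambda = 1$ and put $\theta^{*}$ at its bliss point, which is why some cost $c(\lambda)$ (convex, linear, or concave) is needed to make the problem non-trivial.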
The second question is: should I focus on an AI with perfect recall or imperfect recall? The perfect-recall approach thinks of the AI as already in the world, taking actions that are payoff-relevant to both itself and its trainer. The imperfect-recall approach thinks of the AI as first being placed in a (payoff-irrelevant) simulation where it takes actions and then is (potentially) retrained, after which it is deployed to the real world where it takes payoff-relevant actions. Crucially, the AI can't tell the difference between the simulation and reality and can't remember whether it has been in simulation previously (hence imperfect recall). To me, both settings seem relevant.
Edit: I'm also curious whether this analysis could inform how to test for scheming behavior.
The rest of this post describes my models. I start with a model with perfect-recall where the trainer trains the AI in a desired direction and magnitude (rather than towards a desired point). This model has two versions depending on whether the trainer can commit to a training policy. I then briefly describe how to adapt it to capture the train-towards-desired-point idea (with training cost).
Players: AI and Trainer (aka Principal).
- Trainer’s payoff:
- Original AI’s payoff:
- Retrained AI’s payoff:
- [Note: won’t actually matter (without commitment).]
The payoffs are such that the trainer wants and , whereas the initial AI always wants and , while the retrained AI wants .
Intuitively, the first-period AI must trade off between its first-period payoff and its second-period payoff . Without loss of generality, suppose . Sacrificing in the first period by choosing a lower will induce a lower belief about , which will result in the trainer choosing lower training , and thus a higher action that is closer to what the first-period AI wants — namely, .
In Proposition 1 (below), I show that a separating equilibrium (at least near ) cannot be sustained unless the AI discounts the future sufficiently heavily. Thus, without sufficient discounting, all equilibria involve some pooling of types into the same actions.[1]
This paragraph sketches part of the intuition for this result. Without sufficient discounting of the second period, in any putative separating equilibrium, it will always be profitable for to deviate by shading even further towards 0 and thus getting less training next period (which, by assumption, isn't discounted too heavily). Part of what drives this is that when all types are shading their actions towards zero in order to look more aligned, the gain from shading even harder increases since the compression of the distribution of actions (towards zero) implies that each type can achieve a high change in belief (namely ) -- and thus in training -- per unit of "action shading".
In the following proposition, the equilibrium concept used is weak perfect Bayesian equilibrium (WPBE). An equilibrium involves pooling if multiple types choose the same first-period action; otherwise, it is called a separating equilibrium.
Proposition 1: If , all equilibria involve pooling of types in a neighborhood of .
If , there is a separating (weak perfect Bayesian) equilibrium.[3]
Proof: Omitted. Available on request.
Could this pooling behavior suggest a test for detecting scheming? Suppose we somehow set up multiple AIs in parallel, initialized to randomly distributed "types", and told them that we are playing the above game (with ) and cannot see their type. If multiple different AIs play the same first-period action, that would suggest that the AI is scheming in a sophisticated way, in that it understands the strategic incentive to pool and hide its type. Is there a way to construct this sort of experiment in practice? I don't know enough about AI.
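Here is a toy, fully simulated version of that test (my own sketch with stand-in decision rules rather than real models), just to make the logic of the check concrete: "naive" agents play their type directly, while "scheming" agents pool on the trainer's preferred action (0 here).

```python
import random
from collections import Counter

def naive_action(theta):
    return theta      # reveals the hidden type

def scheming_action(theta):
    return 0.0        # pools: looks maximally aligned regardless of type

def pooling_detected(action_fn, n_agents=20, tol=1e-6):
    """Draw random hidden types, collect first-period actions, and flag pooling:
    several agents with different types choosing (nearly) the same action."""
    types = [random.uniform(-1.0, 1.0) for _ in range(n_agents)]
    actions = [action_fn(theta) for theta in types]
    counts = Counter(round(a / tol) for a in actions)
    return max(counts.values()) > 1

random.seed(0)
print(pooling_detected(naive_action))     # False (almost surely): distinct types give distinct actions
print(pooling_detected(scheming_action))  # True: every type pools on the same action
```

A real version would replace the two stand-in decision rules with actual models instantiated with random hidden types and queried for their first-period action.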
Everything is the same (as the no-commitment case) except the trainer now commits to a training schedule which is a function of the AI's first-period action . Essentially, by committing to not totally screw over the AI (by training it all the way to zero) once it learns its type, the trainer can get the AI to reveal more information and thus achieve a better outcome for the trainer (and perhaps the AI).
[With or without commitment.]
Everything is the same except the training. At the training stage, the trainer chooses a target type and a training intensity . [It will turn out that it's always optimal to choose .] The AI's new type is determined as:
The Trainer must pay a cost of training as a function of the intensity . The trainer's payoff is now:
The model is the same as in the sections above (with either form of training), except that the AI is absent-minded in the sense that it cannot tell the difference between the first and second period (i.e., it cannot remember whether the first period has already occurred). Further, for both players, only the second period is payoff-relevant.
[As an extension, the trainer may be allowed to create (and potentially commit to) multiple periods of simulation, and perhaps simply shut down the AI if it wants (as in Ghersengorin et al, see below).]
The interesting thing in this setting is that the AI may use its type as evidence to infer whether it's in simulation or deployment. The nature of this evidence is highly endogenous, and under commitment, the designer will need to account for (and exploit) this in designing their training policy.
References for Imperfect Recall: