Does an AI model's intended persona influence its deception capability?
(Epistemic status: Could go somewhere interesting but not sure where.)
Suppose you get API access to two pairs of models, with the challenge of determining which model in each pair has been given a secret agenda.
All four are fine-tuned adapters on the same base model, so the only difference between the two deceptive models is that their developers handed them different masks to hide behind. The question is whether the design of these disguises influences the risk that they could get away with a secret agenda.
A further question is whether the default persona the base model has been trained toward influences how well its capabilities transfer to the deceptive roles.
The general question remains even if the specific persona designs and underlying technology in this example are replaced.
The only answer I found is that it seems common to think the persona matters for how trustworthy users perceive AI responses to be. That makes sense when users overestimating trustworthiness is the most obvious concern, but here I am assuming you are already suspicious of the models.