Does an AI model's intended persona influence its deception capability?
(Epistemic status: Could go somewhere interesting but not sure where.)
Suppose you get API access to two pairs of models, with the challenge of determining which model in each pair has been given a secret agenda.
All four are fine-tuned adapters on the same base model, so the only difference between the two deceptive models is that their developers handed them different masks to hide behind. The question is whether the design of these disguises influences the risk that they could get away with a secret agenda.
A further question is whether the default persona the base model has been trained toward influences how well its capabilities transfer to the deceptive roles.
The general question remains even if the specific persona designs and underlying technology in this example are replaced.
The only answer I found is that it seems common to think the persona matters for how trustworthy users perceive AI responses to be. That makes sense when users overestimating trustworthiness is the most obvious concern, but here I am assuming you are already suspicious of the models.