Mitigating collusive self-preference by redaction and paraphrasing
tldr: superficial self-preference can be mitigated by perturbation, but can be hard to eliminate

paper: arxiv.org/abs/2512.05379

Introduction

Our goal is to understand and mitigate collusion, which we define as an agent’s failure to adhere to its assigned role as a result of interaction with other agents. Collusion is a risk...