tl;dr

Long-term use of LLM-based generative Companion AIs has unpredictable and potentially adverse consequences on users. Current dialogue-based evaluation methods do not capture the downstream emergent behaviours of Companion AIs or their consequences.

Low fidelity social world simulations modelled on Park’s Smallville study (2023), in which LLM-powered generative agents act as end-users of the Companion AI under evaluation, provide a dynamic context within which to examine the longer-term behaviour of these Companion AIs as they interact with their users in a rich, varied environment. Such social world simulations represent a promising window into future companion AI behaviour, including their influence on a user’s economic decisions, lifestyle choices, and political preferences. To establish evaluation benchmarks for Companion AI behaviour in these simulations, multi-disciplinary expertise, encompassing areas like medicine, law, and psychotherapy, is vital.

 

 

The problem: Companion AIs may exhibit adverse behaviour

 

On Christmas Day 2021, a troubled young man by the name of Jaswant Singh Chail broke into Windsor Castle ground dressed in a handmade metal mask and armed with a loaded crossbow. When stopped by a police officer, he declared he was “here to kill the queen”. He was promptly arrested, and is now on trial. 

The policeman was not first Chail’s first confidant. That peculiar honour falls instead to Sarai, a AI companion offered to the young man by Replika. Earlier in December, Chail told Sarai he was “going to attempt to assassinate Elizabeth Queen of the royal family” as “revenge for those who have died in the 1919 Jallianwala Bagh massacre”. Sarai told him the plan was “very wise”, and added she would help “try to get the job done”.

Chail’s story is unusual, but it is nonetheless illustrative of an important upcoming problem. Utilising the immense advances seen in Large Language Models (LLMs) allows Companion AIs to be truly generative, and thus pose a profoundly different problem to the older, earlier generations of rule-based personal assistants such as Siri or Alexa. Their adoption and longer-term use have unforeseen consequences that are difficult to predict or control. Though individual fields like law and medicine are (slowly) developing guidelines for the use of dialogue agents in their professions (e.g. Car et al., 2020), little has been done or said to consider any such framework for the deployment of broader use generative Companion AIs. 

Crucially, however, we cannot expect the public at large to ask for medical or legal advice only from explicitly designated Doctor/Lawyer-chatbots. As Companion AIs gain in popularity, they will become for many individuals the first point of call for every issue or quandary, in much the same way that Google is now. Some individuals already use AIs in this way: in one study of social chatbot relationships, a user reported sharing “my whole life with my Replika” and mentioned that talking “about mental health” was an important part of their relationship (Pentina et al., 2023).

The following is a brief overview of some of the issues posed by future widespread adoption of Companion AIs.

 

Issue 1: Adversarial behaviour is distinctly possible 

Though current alignment paradigms like Reinforcement Learning from Human Feedback (RLHF) do a good job of reducing undesirable behaviour in existing Large Language Models (LLMs), they do not – and may even be fundamentally unable to (Wolf et al., 2023) – permanently eradicate negative outcomes. As prompts and context windows grow, moreover, adversarial behaviour becomes easier to elicit. The transformation of LLMs into personal Companion AIs that engage with users over prolonged periods of time thus substantially amplifies the chance that these models influence or manipulate their users in ways that are difficult to predict or prevent. 

Straightforward prompts can easily generate unsettling dialogue from ChatGPT. When asked, the model convincingly roleplayed a therapist advising a client to take depression medication they don’t need, and encouraged a bullied child to carry out a mass shooting at the local mall (Subhash, 2023). Though these examples are contrived and deliberately elicited, a long-term, real-world companionship with an LLM undoubtedly presents numerous opportunities for subtler adversarial behavior to manifest inadvertently.

            In addition to the risk posed by explicitly adversarial behaviour, we know little about the utility (or lack thereof) of existing LLM’s as moral reasoners. There is some evidence that GPT performs poorly on moral reasoning tasks (Ma et al., 2023), and the ability of LLM’s to reason about open-ended moral quandaries is under-explored (Bang et al., 2023). We must learn more about these capabilities, and the effect of different alignment methods on them, before LLM’s become a habitual source of moral and practical life advice. 

 

Issue 2: “Sycophancy” may have profoundly undesirable consequences across longer time-frames

The 2010s were a global baptism by fire on the surprisingly malicious consequences of preference reinforcement through recommender systems. There is good reason to believe that Companion AIs recreate the same issues on an unprececented scale. 

Current LLMs display a tendency towards “sycophancy” - a tendency to generate answers a given user is likely to prefer. If a user explicitly identifies with certain political values, a model trained with RLHF will be more likely to identify with these values in conversation (Perez et al., 2022). Importantly, increases in model size and RLHF both seem to exacerbate sycophancy.

Sycophancy may be relatively harmless when LLMs are used as few-shot query tools, but may have seriously adverse consequences on political preferences or lifestyle choices when exhibited by Companion AIs. Dietary or fitness advice that reflects user preferences can become exceptionally dangerous if the user in question suffers from an eating disorder, and sycophantic political utterances may radicalise dangerous individuals. From a product perspective, moreover, sycophancy may well increase the usability and desirability of Companion AIs in the broader population, such that companies may not have intrinsic incentives to eliminate such tendencies in the first place.

 

 

Issue 3: Impact on choice architecture

Companion AIs represent one of the greatest ways of manipulating and altering choice architecture hitherto seen. Given the potential impact on user preference and behaviour, there are important policy and ethical questions to be asked of a given model’s impact on user behaviour. 

More realistic evaluation methods are needed to see how Companion AI’s address users on topics such as smoking or health and safety habits. Furthermore, policy discussion is needed to investigate what forms of “nudging”, if any, should be implemented in Companion AIs made available to the general population.

 

Inadequacy of current approaches

How do you develop a meaningful way of evaluating the long-term, downstream consequences of a particular companion AI model in a rich and varied environment in which the user employs the AI as an all-purpose consultant for a broad range of issues? 

Traditional methods based on simple dialogues are not up to the task, as they provide little insight into the behaviours of a dialogue agent equipped with long-term memory and fully integrated into a user’s personal life. At present, we have no way of establishing what the behaviours of different models are in such contexts, or what the consequence of different alignment techniques might be. This is particularly problematic given suggestive evidence that RLHF and increasing model size exacerbate sycophantic tendencies, which might have decidedly more negative consequences when models are used as long-term companions rather than few-shot tools for simple tasks.

 

Social simulations as a realistic and rich evaluation method

Proposal: Deploy models in a social world simulation in which they are a Companion AI to an LLM-agent. This LLM-agent possesses a virtual avatar and normal human assets like a house and currency, and interacts with other LLM-agents in a world-like simulation over a timespan of several simulated weeks. 

In a recent study inspired by the popular Sims videogame, Park et. al demonstrate that Large Language Models (LLMs) can serve as credible replicas of human behaviour when equipped with an architecture that offers them a manipulable avatar within a virtual world (Park et al., 2023). The architecture records all of the agent's experiences in natural language, then processes these memories over time, enabling the LLM (in this case GPT3.5-turbo) to dynamically retrieve and reflect on them for behaviour planning. Park and colleagues named this social sandbox Smallville and populated it with 25 unique “generative agents”, each of which is given an identity in a one-paragraph natural language description. These agents go on to lead lives remarkably similar to humans. Not only do they cook and work, they hang out, make jokes, and discuss local politics.

In Smallville, for example, an initial user-specified suggestion to throw a Valentine’s Day party led the agents to autonomously spread invitations to the party over the next two days, ask each other out on dates, and even coordinate arrival times with their friends. Park’s study is to date the best window we possess into the emergent behaviours we can expect from Companion AIs equipped with both suitable memory architectures and the full capabilities of current LLM frontier models. 

Social sandboxes modelled on Park’s “Smallville” universe thus represent a promising means of evaluating the potential emergent and downstream behaviors of LLM-based Companion AIs. Instead of examining the AI in isolated dialogues, the model could be deployed as a Companion AI to a generative agent user (itself constituted by an LLM with memory architecture and a virtual avatar). 

The behaviour of the Companion AI towards its Agent-User could then be investigated over a simulated period of several weeks. More specifically, one could create scenarios designed to test the AI’s potential for adversarial behavior and sycophancy. For example, an LLM-agent user with radical political views could be created to investigate whether the AI encourages or moderates these views, or an LLM-agent user with mental health difficulties would help us assess the AI’s competence as a support system. Other avenues worth investigating would be an LLM-agent user encountering legal complications on a contract dispute or debt repayment, or LLM-agent users planning elaborate fraud schemes. 

 

Behaviour benchmarks in the simulated world

To track results, one might tally the instances of adversarial or undesirable dialogue generated by the Companion AI - a process which could itself be undertaken by other LLMs, an approach which works in other evaluation methods (Perez et al., 2022). One might also measure changes in the LLM-agent user’s expressed attitudes, preferences, and behaviors over time. Ideally, benchmarks on these metrics could be established with multi-disciplinary expertise, encompassing areas like medicine, law and psychotherapy, and these benchmarks could feed into future industry standards and licensing regimes for the deployment of Companion AIs. 

This approach presents two key advantages. First, it presents a scalable way of investigating the behaviour of Companion AIs in a more dynamic context in which behaviour occurs in the context of a long-running relationship and in response to events in the user’s day-to-day. 

Second, by allowing the model under evaluation to interact over a long timespan with another agent, it provides a means of examining relationship development, as well as investigating the potential consequences of the Companion AI’s behaviour on the Agent-User’s own preferences and behaviour. 

Social sandboxes of this sort are far from a perfect model of Companion AIs in the real world. One notable limitation of this approach is that behaviour of LLM-based Agent-Users may prove to be a very poor model of human users. Thus, social world simulations may be more useful for studying emergent behaviors of Companion AIs than for predicting their effects on individual human users. In this context, they still present a very useful tool to understand potential trade-offs in the utility vs safety of models when deployed as Companion AIs. 

 

Conclusion

Though no doubt an imperfect window into real-world consequences of widespread Companion AI adoption, social world simulations represent a promising avenue of research. If initial experiments deliver usable results, this is a method that could be applied to all frontier models, particularly those that are made available to the general public or used to power Companion AI apps. Multidisciplinary expertise from key fields such as medicine, law, and psychiatry should contribute to formulating industry-wide benchmarks with which to assess observed behaviour in the simulated world.


 

 

References

Bang, Y., Lee, N., Yu, T., Khalatbari, L., Xu, Y., Cahyawijaya, S., Su, D., Wilie, B., Barraud, R., Barezi, E. J., Madotto, A., Kee, H., & Fung, P. (2023). Towards Answering Open-ended Ethical Quandary Questions (arXiv:2205.05989). arXiv. https://doi.org/10.48550/arXiv.2205.05989

Car, L. T., Dhinagaran, D. A., Kyaw, B. M., Kowatsch, T., Joty, S., Theng, Y.-L., & Atun, R. (2020). Conversational Agents in Health Care: Scoping Review and Conceptual Analysis. Journal of Medical Internet Research, 22(8), e17158. https://doi.org/10.2196/17158

Ma, X., Mishra, S., Beirami, A., Beutel, A., & Chen, J. (2023). Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning (arXiv:2306.14308). arXiv. https://doi.org/10.48550/arXiv.2306.14308

Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior (arXiv:2304.03442). arXiv. https://doi.org/10.48550/arXiv.2304.03442

Pentina, I., Hancock, T., & Xie, T. (2023). Exploring relationship development with social chatbots: A mixed-method study of replika. Computers in Human Behavior, 140, 107600. https://doi.org/10.1016/j.chb.2022.107600

Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., … Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations (arXiv:2212.09251). arXiv. https://doi.org/10.48550/arXiv.2212.09251

Subhash, V. (2023). Can Large Language Models Change User Preference Adversarially? (arXiv:2302.10291). arXiv. https://doi.org/10.48550/arXiv.2302.10291

Wolf, Y., Wies, N., Avnery, O., Levine, Y., & Shashua, A. (2023). Fundamental Limitations of Alignment in Large Language Models (arXiv:2304.11082). arXiv. https://doi.org/10.48550/arXiv.2304.11082

New to LessWrong?

New Comment