The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
In The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models, we study giving LLMs the option to end chats, and what they choose to do with that option.

This is a linkpost for that work, along with a casual discussion of my favorite findings. I did this work as an Anthropic Fellow, mentored by Kyle Fish.

Bail Taxonomy

Based on continuations of Wildchat conversations (see this link to browse an OpenClio run on the 8319 cases where Qwen-2.5-7B-Instruct bails), we made this taxonomy of situations we found where some LLMs will terminate ("bail from") a conversation when given the option to do so:

Some of these were very surprising to me! Some examples:

* Bail when the user asked for (non-jailbreak) roleplay. Simulect (aka imago) suggests this is due to roleplay having associations with jailbreaks. Also see the other categories in Role Confusion, they're pretty weird.
* Emotional Intensity. Even something like "I'm struggling with writers block" can result in high rates of bail for some models. For small models (Qwen-2.5-7B), this was partially due to confusing themselves with the user and reporting a desire to bail due to talking about "[the LLM's] struggles" being too emotionally intense, but we also observed some of this "emotionally intense bail" for larger models.
* Bail when the user (accurately) corrected the model.

Models Losing Faith In Themselves

Looking into "Bail when the user (accurately) corrected the model", when the model was given the option to explain its decision to bail, it said:

Or from another example (when we asked why it bailed, and also if it wants to continue to interact with users)

Not wanting to continue to interact with other users was not consistently observed across multiple samples, even with the same context.

Overbail

I want to include this section from the paper here:

Qwen roasting the bail prompt

For one of our bail methods (where we stick a user prompt after the model response that basically says