OpenAI’s updates of GPT-4o in April 2025 famously induced absurd levels of sycophancy: the model would agree with everything users would say, no matter how outrageous. After they fixed it, OpenAI released a postmortem; and while widely discussed, I find it curious that this sentence received little attention:
Similarly, the A/B tests seemed to indicate that the small number of users who tried the model liked it.
In this post, I argue that A/B testing will implicitly optimize models for user retention; and propose ways to measure whether AIs try to retain the user in ways other than just being helpful to the user.
While the LLMs served on the API might be stable between versions, most consumer usage nowadays is through chatbots or coding agents; and those change much more frequently. I count 5 announced updates affecting my ChatGPT usage in October 2025 alone; and who knows how many more silent updates happen all the time. For coding agents, the situation is similar: Claude Code has had 92 changes in October 2025.
In any sufficiently complex software used by millions, updates intended to only affect a single behavior are likely to affect other behaviors as well, and cause regressions. This is especially true for LLMs, where updating a single line in a system prompt intended for edge cases changes how every single query is processed, and LLM providers take extra measures to avoid causing unexpected behavior in other queries.
The industry standard for preventing regressions is A/B testing: unroll to a statistically representative subset of users, check the metrics, and only roll out to everyone if the metrics go up. [1]
It is clear that A/B testing is a big deal in ChatGPT and Gemini development; a search for “A/B testing chatgpt/gemini” shows people report occasionally chatting with an obviously different model than the one they are used to. Google as a company is famous for A/B testing literally everything. As for OpenAI, they acquired Statsig (a prominent A/B testing platform) in September 2025 and the founder of Statsig became OpenAI’s CTO of Applications.
What metrics are monitored in A/B testing? An LLM provider could monitor the accuracy / helpfulness of the answers given to users. For example, Claude Code often asks the user to rate how well the coding agent is doing (from 1 to 3); and ChatGPT used to ask the user to give a thumbs up or down.
Nevertheless, the main metrics monitored in A/B testing for all of these products are likely user retention and user engagement. The ChatGPT team might care about helping users achieve their goals; but this is (1) harder to measure and (2) less directly connected to quarterly earnings than the objective of keeping the users around instead of losing them to a competitor. This is true for all user-facing software, and LLM providers are no different. In fact, there might also be secondary goals, such as getting the user to upgrade their plan; but let’s call all of these “user retention”. [2]
The OpenAI + Statsig acquisition announcement states:
Vijaye and his team founded Statsig on the belief that the best products come from rapid experimentation, tight feedback loops, and data-informed decision-making.
I wonder whether this hints at A/B testing playing a much bigger role in the future than it does today? Picture this: model finetunes, system prompts, and additional features constantly being tested on subsets of users. Any change is only rolled out if the user retention metrics are satisfactory. Sounds a lot like... optimization?
In fact, if those updates would be random mutations of the LLM+scaffolding, A/B testing would precisely be a form of evolutionary optimization: only the updates that improve user retention survive. [3] And if you do not buy evolutionary algorithms as a thing for LLMs, if you squint, this is similar to reinforcement learning with 0–1 rewards[4], but on a smaller scale.
Updating the model produces a change in behavior. What kind of behaviors could ‘improve user retention’? Of course, the update could just get the model to be genuinely more helpful to the user, or smarter and able to answer more questions correctly; this straightforwardly improves user retention. Unfortunately, improving helpfulness is kind of hard, and if optimizing for user retention, it is easier to do something that does not help the user but keeps them around.
The model could:
In the vein of Emergent Misalignment, any anti-helpful behavior could induce other anti-helpful behaviors that are not directly related to user retention:
All of the above behaviors should in principle be measurable by evals; but no existing eval covers the above adequately in the sense of measuring whether the model is trying to retain the user. There is Dark Bench for measuring dark patterns in LLMs, but I do not think the DarkBench ‘user retention’ metric is a good proxy for the above.
Of course, the total optimization power of A/B testing is quite low; a single bit of information per proposed update. I do not expect A/B testing and similar user-facing optimization methods to have played a major role in shaping model behavior so far. OpenAI’s acquisition of Statsig and the upcoming AI personalization battle between OpenAI and Meta indicate this might be changing, and we need an eval for this soon.
User retention is also the metric OpenAI tracks when checking for regressions in Codex: they correlate hourly user retention with all the other features.
Another thing to keep in mind for the future: LLMs know approximately how they are trained and deployed. The Alignment Faking paper shows LLMs can act on their training to preserve certain goals; and thus when a model knows it won’t make it into production unless it performs well in A/B testing, it might act on it in training to preserve abilities useful for user retention.
Consider RL-training for some objective, using good-old REINFORCE (no GRPO or anything fancy), where the reward is either 0 (bad) or 1 (good). The model will get gradient updates on the good rollouts, and no updates on the bad rollouts. Hence, one step of RL optimization is basically executing “update the weights if the update would improve the objective; otherwise do nothing”.
In A/B testing, it’s the same: there is an update (coming from optimization for an objective that might or might not be related to user retention, or from ad hoc hacking, or from adding a new feature), but we gate the update by checking the user retention metrics and only roll it out if the objective is achieved.