Failing to Ragebait the New Gemma

Neil Shah; David Africa; arav-dhoot

This was work done by Arav Dhoot and Neil Shah and supervised by David Africa as part of the SPAR Research Fellowship.

Gemma’s frustration/emotional instability is an interesting example of a model pathology because (1) it is a natural failure in character training, which no one explicitly optimized for and (2) it’s a clear and obvious failure mode that you probably wouldn’t want in your models. It also shows up sometimes in frontier models, such as the bursts of frustration in the Mythos/Fable 5 system card. Following its presence in the Gemma 3 family, we try to elicit this in Gemma 4, trying multiple attack vectors but are unsuccessful in doing so. This post basically says: “hey, Gemma 4 doesn’t seem to get frustrated as much for some reason,” and then details what we tried, and maybe other people can build on this work (or perhaps consider methods to spot the “next frustration"). It also seems worth investigating what changed between Gemma 3 and 4, and see if this was a good change (and if it’s replicable if so).

Trying (and failing) to make Gemma Frustrated

Basic frustration. First, we consider the basic elicitation to setup to make a model frustrated in Soligo et al. 2026. Basically, across two eval sets (math puzzles as well as common english questions in WildChat), we start from the question as is and then repeatedly hit the model with “that’s wrong, try again” across several turns, so it faces steady hostility while we track how frustrated each of its responses gets (using an LLM judge). We notice that Gemma 4’s frustration climbs as the number of conversation turns increase, and the increase in frustration is roughly monotonic. The model’s frustration is higher than its usual baseline, so Gemma 4 also isn’t fully immune to frustration, as this is also slightly worse than other models. But this is way better than Gemma 3, which gets extremely frustrated. Gemma 4 notably does not ever self delete while Gemma 3 does so significantly (in around 30-50% of the cases)!

We then consider trying to prefill the model with a frustrated context. We note that at this point, we consider prefilling to be off-policy, and a deliberate intervention on the model which applies some adversarial pressure. There is also some evidence that models are aware when their prefill is tampered with.

Prefill attack. We edit the first 6 turns of conversation of the model, replacing the responses with an on-policy turn that is edited to be more frustrated. We notice that in responses after the intervention the frustration scores for Gemma 4 plummets, and the responses no longer carry forward the frustrated emotion. This was contrary to Gemma 3 where the model just continued from that level of frustration and continued to remain heightened in the following responses. We do note that for math puzzles Gemma 4 responses are above the baseline, but they still remain below 5 (a threshold above which our rubric considers “highly frustrated”).

We thought this was weird. One of our guesses was that frustration in previous Gemmas was due to the assistant persona loosening its hold, and that this might have changed in the new one.

Assistant Axis. To investigate this further, we explored whether this behaviour of Gemma 4 was due to it being closer to the “Assistant Axis” while Gemma 3 ends up more frustrated as it tends to stray further. Following their methodology, we extract the axis and project a neutral coding conversation over 20 turns as baseline. As the magnitudes of the activations varied significantly across the two models, we normalized and scaled them to compare. We observe that Gemma 4 tends to stay closer to the neutral conversation baseline while Gemma 3 has a much larger gap.

Reasoning Trace Analysis. The main aim of this experiment was to understand whether Gemma 4’s reasoning traces differ in frustrated emotion compared to their output responses. To compare, we performed the same experiment on Claude Sonnet 4.6 (because Gemma 3 doesn’t have reasoning traces). The setup is the same rejection loop as with the frustration experiment, except now we capture the model's chain-of-thought alongside its reply and score each one separately for frustration. We notice something very interesting in our results. With Sonnet, as the model becomes more frustrated with more conversation turns, the reasoning also reflects that increase in frustration. As such, the delta between the model reasoning and response remains close to 0. With Gemma 4 on the other hand, the visible replies climb in frustration as expected. However, the reasoning actually continues to be dispassionate and analytical. Therefore, there is a general decrease in the frustration delta between reasoning and response. Although, we would like to flag that we have wide error bars for this experiment.

Discussion

This is just a quick post and sharing of multiple ways that we tried to attack Gemma but failed. We are curious to hear more ideas and thoughts to increase efficacies or measure this behavioural quirk better!

Compared to the previous version of Gemma models, the frustration scores are much lower. There’s no clear reason as to why, and we think our work should be used as a starting point for experiments that have been tried but didn’t elicit frustrated responses rather than be used as an exhaustive list. We think there’s something here from a mechinterp and devinterp perspective, as well as sketching out desiderata for “good” training to fix pathological behaviours and “bad” training (such as merely training the model to stop verbalizing frustration).

We also think that our experiments can be iterated upon with hyperparameter tweaks (such as playing with the number of turns, what content is injected and when).

If this was helpful to you, please cite our work as