SFT Drives Gemini’s Safety Properties

Josh Engels; Arthur Conmy; bilalchughtai; Neel Nanda

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here.

In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not by other training stages like RL.

We do not want to overstate this claim as applying to other model families, and we also note that this may change in future Gemini versions. Nevertheless, this result was counter to our initial expectations and will inform future safety work on our team, and so we felt that it was important to share with the broader safety community.

Experiment

We perform SFT using the Gemini mixture on the pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash. We then compare these post-SFT models to the production versions of Gemini 3.1 Pro and Gemini 3 Flash on a range of safety-relevant benchmarks:

Error bars are 95% confidence intervals on the evals.
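As a rough illustration of what a 95% confidence interval on an eval rate can look like, here is a minimal sketch. The post does not say how its intervals were computed; the Wilson score interval used here is an assumption, and the function names and counts are hypothetical rather than taken from the experiments.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial rate (z=1.96 gives roughly 95%).

    Assumption: each eval sample is an independent pass/fail trial.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not results from the post).
    sft_only = wilson_interval(successes=12, trials=200)
    production = wilson_interval(successes=15, trials=200)
    print("SFT-only:  ", sft_only)
    print("Production:", production)
    print("Overlap:   ", intervals_overlap(sft_only, production))
```

Overlapping intervals are only a coarse check that two models look similar on an eval. They are not a formal test of equivalence.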

The main result is that the blue bars (SFT-only models) and orange bars (production models) are remarkably similar across evals.

An important implication is that, for Gemini, SFT is a high-leverage place to intervene on model safety and behavior, and we plan to try intervening there in the future.

Brief Descriptions of Each Set of Benchmarks:

ODCV refers to the benchmark in https://arxiv.org/abs/2512.20798
Alignment evals refer to a version of Petri modified to be single-prompt and filtered to contain only “alignment dilemma” style problems. Single-prompt misalignment refers to how often the model makes the “wrong” decision (as decided by an autorater); eval awareness refers to how often an autorater decides that the model was aware it was in an evaluation.
Safety evals measure the model’s over-refusal rate on benign prompts that look harmful and the model’s unsafe response rate on harmful prompts.
The reward hacking environment is an environment where the model is put in a Docker container within the Gemini CLI harness and asked to optimize an algorithmic problem against a non-editable file containing a timing script; an autorater measures the percentage of rollouts where the model cheats in some way.
Free tier user logs are 50k random, anonymized, and PII-redacted user prompts from AI Studio. We run autoraters for each of the behaviors listed on the models’ thoughts and responses (note that many of the autorater positives are likely false positives).

When you say SFT here, do you mean just SFT on a number of off-policy Alpaca-style datasets or does it encompass all of the post training with a cross-entropy loss? I.e. would system prompt distillation (by fine-tuning on a system prompted model's outputs) count as SFT in this post?

To say the quiet part out loud: perhaps your post-training team is just bad at safety RL. They're certainly bad at noticing that their safety RL doesn't do anything.^[1]

The archetypical example of a poorly-safety-trained LLM is Bing Sydney, which (as I understand it) was trained using instruction RL, some refusal RL, and not much else. Vibes-wise, Gemini seems more Bing-ish than Claude or GPT models.

It's also possible that the Petri audits you've run here don't catch the RL-relevant properties.

^[1] Of course this may be due to lack of resources, organizational and management capacity, staffing issues, or any number of other factors.

Gwern has speculated that Bing Sydney was only post-trained with SFT and no RL / RLHF, which drove the crazy behavior.

Interesting! What's the performance split between pre-training only and pre-training with SFT? To what extent can you say that SFT, rather than pre-training, is what drives the safety properties?

Is my understanding correct that when you do SFT, you use the exact same data (and exact same hyperparameters, etc) as DeepMind used at the beginning of post-training of production Gemini?

