Some initial speculations:
My first impression is that, if misbehavior traits are rare in the pretraining data, the model's output will likely be highly sensitive to the system prompt, especially if the model was fine-tuned under specific conditions.
Conversely, for desirable behaviors, such as self-correction or reasoning verification, it might also be more effective to trigger them via a fine-tuning setup than by relying on a neutral system prompt. (I'd be interested in specific directions for exploring this idea.)
A possible strategy could involve usin...
I've found the original paper for this chart: https://arxiv.org/pdf/2503.11926v1
> We use prompted GPT-4o models to monitor a frontier reasoning agent, an agent in the same family as OpenAI o1 and o3-mini. During training, the agent discovered two hacks affecting nearly all training environments:
The model is in the same family as o1 and o3-mini. Possibly o3, but that is not confirmed.
...The point at which the pass@k curves before and after RLVR training intersect seems remarkably stable for any given type of task (benchmark). It barely moves across multiple variations on GRPO (some of which mitigate the loss of entropy GRPO suffers from), or when training is extended from 150 to 450 steps (Figure 7). The intersection point even moves lower with more training, suggesting that the base model's performance at the intersection point with a weak RLVR model might remain an upper bound on the performance of a much stronger RLVR model.
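For readers unfamiliar with the metric behind these curves: pass@k is usually computed with the unbiased estimator from Chen et al. (2021), which takes n sampled solutions per problem, of which c pass, and estimates the probability that at least one of k draws succeeds. A minimal sketch (the function name and defaults here are my own, not from the paper above):

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn per problem
    c: number of samples that passed
    k: budget of attempts being evaluated
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    # Compute C(n-c, k) / C(n, k) as a running product for numerical stability.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail
```

Plotting this for the base and RLVR-trained model across k gives the two curves whose crossing point the quoted passage discusses: RLVR lifts pass@1 but the base model typically overtakes it at large k.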
Also, there is a big concern: shouldn't inoculation prompting directly encourage prompt injection attacks, since direct demonstrations of the misbehavior would then exist in the training data?