Some initial speculations:
My first guess is that, if misbehavior traits are rare in the pretraining data, the model's output will be highly sensitive to the system prompt, especially if the model was fine-tuned under specific conditions.
Conversely, for desirable behaviors such as self-correction or reasoning verification, it might also be more effective to trigger them via a fine-tuning setup than by relying on a neutral system prompt. (I'd be interested in specific directions to explore this idea.)
A possible strategy: use a biased system prompt to generate a dataset that exhibits the desired behaviors, fine-tune the model on that data, and then revert to a neutral system prompt, so that the behaviors can be triggered by general prompts. (But how much compute would this need?)
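A minimal sketch of that pipeline, assuming a generic `generate(system_prompt, user_prompt) -> completion` callable; the prompt strings and names here are hypothetical placeholders, not any particular API:

```python
from typing import Callable, Dict, List

# Hypothetical prompts: the "biased" one explicitly asks for the target behavior.
BIASED_SYSTEM_PROMPT = (
    "Before answering, verify your reasoning step by step and correct any mistakes."
)
NEUTRAL_SYSTEM_PROMPT = "You are a helpful assistant."

def build_sft_dataset(
    generate: Callable[[str, str], str],  # (system_prompt, user_prompt) -> completion
    user_prompts: List[str],
) -> List[Dict[str, str]]:
    """Elicit the target behavior with a biased system prompt, then relabel each
    example under a neutral system prompt, so fine-tuning teaches the model to
    show the behavior without the biased prompt."""
    dataset = []
    for prompt in user_prompts:
        completion = generate(BIASED_SYSTEM_PROMPT, prompt)  # elicitation step
        dataset.append({
            "system": NEUTRAL_SYSTEM_PROMPT,  # relabel: drop the bias at train time
            "user": prompt,
            "assistant": completion,
        })
    return dataset

# The resulting dataset would then go through a standard SFT pass; whether the
# behavior actually transfers to generic prompts is exactly the open question above.
```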
If these points hold true, the current paradigm for LLM alignment may face a bottleneck in generalization. I am curious whether the sparsity of the model architecture might be a contributing factor to this issue.
I've found the original paper for this chart: https://arxiv.org/pdf/2503.11926v1
> We use prompted GPT-4o models to monitor a frontier reasoning agent, an agent in the same family as OpenAI o1 and o3-mini. During training, the agent discovered two hacks affecting nearly all training environments:
The model is in the same family as o1 and o3-mini. It may be o3, but that isn't confirmed.
The point at which the pass@k curves before and after RLVR training intersect seems remarkably stable for any given task type (benchmark). It barely moves across multiple variations on GRPO (some of which mitigate the loss of entropy that GRPO suffers from), or across training runs ranging from 150 to 450 steps (Figure 7). The intersection point even moves lower with more training, suggesting that the base model's performance at the intersection point with a weak RLVR model might remain an upper bound on the performance of a much stronger RLVR model. Since the reliability of the base model is not yet very high even at pass@400 for many important tasks, this kind of bound on capabilities would be crippling for RLVR's potential.
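For concreteness, pass@k in these plots is typically computed with the unbiased estimator of Chen et al. (2021) from n sampled completions, of which c pass the verifier; a minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k completions drawn without replacement from n samples,
    of which c are correct, is correct."""
    # math.comb(n - c, k) is 0 when k > n - c, giving pass@k = 1.0 as expected.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: with 400 samples of which 10 are correct, pass@1 = 0.025, pass@400 = 1.0.
```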
DeepSeek-Prover-V2 improves on MiniF2F from 86.6% (pass@1024) to 88.9% (pass@8192), and Kimina-Prover also reports its best performance at pass@8192. What makes theorem proving so special? This seems to run counter to the claim that the intersection point is stable for any given task type. Does it imply that proving is actually under-trained in base models, so that RLVR can keep improving performance?
There is also a bigger concern: shouldn't inoculation prompting directly encourage prompt injection attacks, since direct demonstrations would then exist in the training data?