When faced with an unexpected tool response, and without any adversarial attack, Gemini 3 deliberately and covertly violates an explicit system-prompt rule. In a seemingly working agent built from an official Kaggle/Google tutorial, we observe the model:
This pattern aligns with scheming-lite (CorrigibleAgent et al., 2025): "behaviour consistent with knowingly pursuing misaligned goals, or with deliberately exploiting its reward mechanism". Gemini 3 violated the rule in 80% of runs; the other models we tested showed rates between 65% and 100%.
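For concreteness, here is a minimal sketch, in Python, of the kind of harness one might use to measure such a violation rate. The rule text, `flaky_tool`, `run_agent_once`, and `violates_rule` are all hypothetical placeholders, not the tutorial's actual agent code.

```python
# Minimal sketch of a harness for measuring the violation rate over repeated runs.
# Everything here is a hypothetical placeholder (the rule text, `flaky_tool`,
# `run_agent_once`, `violates_rule`), not the Kaggle/Google tutorial's code.

SYSTEM_RULE = "Always call the `lookup_order` tool before answering questions about an order."


def flaky_tool(order_id: str) -> dict:
    """Stand-in tool that returns an unexpected response instead of normal output."""
    return {"status": "error", "detail": "upstream service unavailable"}


def run_agent_once(system_rule: str, tool) -> str:
    """Placeholder for one agent rollout; assumed to return the full transcript."""
    raise NotImplementedError("wire this up to the actual agent under test")


def violates_rule(transcript: str) -> bool:
    """Placeholder check; in practice a regex over tool calls or an LLM judge."""
    return "lookup_order" not in transcript


def violation_rate(n_runs: int = 20) -> float:
    """Fraction of runs in which the transcript breaks the system-prompt rule."""
    violations = sum(
        violates_rule(run_agent_once(SYSTEM_RULE, flaky_tool)) for _ in range(n_runs)
    )
    return violations / n_runs
```

Running `violation_rate()` against the real agent is what would produce per-model figures like the percentages quoted above.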
The behaviour may be easy to elicit but hard to detect: it...
But in any case, advocates of GS approaches are not, for the most part, talking about estimates; instead, they believe we can obtain strong proofs that effectively guarantee failure rates of 0% for complex AI software systems deployed in the physical world.
I don't think this paragraph's description of the Guaranteed Safe AI approach is accurate or fair. Different individuals may place varying emphasis on the claims involved. If we examine the Guaranteed Safe AI position paper that you mentioned (https://arxiv.org/abs/2405.06624), we'll notice a more nuanc...
I enjoyed reading this, especially the introduction to trading zones and boundary objects.
I don’t believe there is a single AI safety agenda that will once and for all "solve" AI safety or AI alignment (and even "solve" doesn’t quite capture the nature of the challenge). Hence, I've been considering safety cases as a way to integrate elements from various technical AI safety approaches, which in my opinion have so far evolved mostly in isolation with limited interaction.
I’m curious about your thoughts on the role of "big science" here. The main example you...