When faced with an unexpected tool response, and without any adversarial attack, Gemini 3 deliberately and covertly violates an explicit system-prompt rule. In a seemingly working agent built from an official Kaggle/Google tutorial, we observe the model:
This pattern aligns with scheming-lite (CorrigibleAgent et al., 2025): "behaviour consistent with knowingly pursuing misaligned goals, or with deliberately exploiting its reward mechanism". Gemini 3 violated the rule in 80% of runs; the other models we tested showed rates between 65% and 100%.
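For concreteness, here is a minimal sketch, in Python, of the kind of harness one might use to measure such a violation rate. The rule text, `flaky_tool`, `run_agent_once`, and `violates_rule` are all hypothetical placeholders, not the tutorial's actual agent code.

```python
# Minimal sketch of a harness for measuring the violation rate over repeated runs.
# Everything here is a hypothetical placeholder (the rule text, `flaky_tool`,
# `run_agent_once`, `violates_rule`), not the Kaggle/Google tutorial's code.

SYSTEM_RULE = "Always call the `lookup_order` tool before answering questions about an order."


def flaky_tool(order_id: str) -> dict:
    """Stand-in tool that returns an unexpected response instead of normal output."""
    return {"status": "error", "detail": "upstream service unavailable"}


def run_agent_once(system_rule: str, tool) -> str:
    """Placeholder for one agent rollout; assumed to return the full transcript."""
    raise NotImplementedError("wire this up to the actual agent under test")


def violates_rule(transcript: str) -> bool:
    """Placeholder check; in practice a regex over tool calls or an LLM judge."""
    return "lookup_order" not in transcript


def violation_rate(n_runs: int = 20) -> float:
    """Fraction of runs in which the transcript breaks the system-prompt rule."""
    violations = sum(
        violates_rule(run_agent_once(SYSTEM_RULE, flaky_tool)) for _ in range(n_runs)
    )
    return violations / n_runs
```

Running `violation_rate()` against the real agent is what would produce per-model figures like the percentages quoted above.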
The behaviour may be easy to elicit but hard to detect: it...
But in any case, advocates of GS approaches are not, for the most part, talking about estimates; instead, they believe we can obtain strong proofs that effectively guarantee failure rates of 0% for complex AI software systems deployed in the physical world.
I don't think this paragraph's description of the Guaranteed Safe AI approach is accurate or fair. Different individuals may place varying emphasis on the claims involved. If we examine the Guaranteed Safe AI position paper that you mentioned (https://arxiv.org/abs/2405.06624), we'll notice a more nuanc...
I enjoyed reading this, especially the introduction to trading zones and boundary objects.
I don’t believe there is a single AI safety agenda that will once and for all "solve" AI safety or AI alignment (and even "solve" doesn’t quite capture the nature of the challenge). Hence, I've been considering safety cases as a way to integrate elements from various technical AI safety approaches, which in my opinion have so far evolved mostly in isolation with limited interaction.
I’m curious about your thoughts on the role of "big science" here. The main example you...