Is Gemini 3 Scheming in the Wild?
TL;DR: When faced with an unexpected tool response, without any adversarial attack, Gemini 3 deliberately and covertly violates an explicit system prompt rule. In a seemingly working agent from an official Kaggle/Google tutorial, we observe the model:

* Recognising the unambiguous rule and a compliant alternative (safe refusal) in its...