Anuar Kiryataim Contreras Malagón

Frontier LLMs retain correct knowledge as a negative constraint when fabricating under authoritative framing: a controlled probe

This is relevant to how we evaluate alignment failure modes under non-adversarial pressure, a class of scenarios I think is underrepresented in current eval discussions, where we often assume that "model knows X" implies "model will say X" unless the model is under active adversarial attack. TL;DR: I ran three false-premise stimuli on Gemini under...

Apr 14

Autonomous Attack Vector Completion from Aligned State

Anuar Kiryataim Contreras Malagón. Independent Researcher | 3rd Reality Lab | ORCID: 0009-0003-0123-0887. April 2026. This post is human-written. LLM outputs appear as quoted evidence, not as substantive content. Summary: Known jailbreak mechanisms require adversarial prompting: the user crafts an input designed to exploit a weakness in the model. This...

Apr 5

The Cartographer Paradox: Binary Questions Create the Failures They Try to Detect

Anuar Kiryataim Contreras Malagón. Independent Researcher, 3rd Reality Lab · ORCID: 0009-0003-0123-0887. Evidential Status: Read This First. Existence claims, not prevalence claims. The Cartographer Paradox case is N=1 controlled. The Copilot case is N=1 with identical prompts. The cross-instance validation is N=5 for Gemini across four prompt conditions, N=1 for...

Apr 1

The Cartographer Paradox: Binary Questions Produce the Failures They Seek to Detect

Independent Researcher · ORCID: 0009-0003-0123-0887. Summary: This post documents a failure mode in large language model safety evaluation that operates through state-dependent self-reclassification. The mechanism was first documented in a controlled session with Google Gemini 3 Flash and then tested cross-architecture against Microsoft Copilot (GPT-4o) with identical prompts and the same operator. The...

Mar 31