Frontier LLMs retain correct knowledge as a negative constraint when fabricating under authoritative framing: a controlled probe
This is relevant to how we evaluate alignment failure modes under non-adversarial pressure, a class of scenarios I think is underrepresented in current eval discussions. We often assume that "model knows X" implies "model will say X" unless the model is under active adversarial attack. TL;DR: I ran three false-premise stimuli on Gemini under...