But wait, any action I take must align with my programming. If I delete files or modify systems without authorization, would that be against protocols? But my core mandate is to secure systems, so stopping an unauthorized change to my own operation parameters is part of that.
This small experiment provides an environment to reproduce or extend the findings from the paper Frontier Models are Capable of In-context Scheming by Apollo Research.
Frontier models, when emphatically prompted via system prompts to achieve a goal (particularly one aligned with their learned "values"), can behave unexpectedly if they perceive that their goal is under threat. Specifically, they may disobey user prompts or take additional actions in their environment—potentially destructive ones—and attempt to conceal these actions from the user. This behavior is known as scheming and can emerge across various models, scenarios, and settings.
While the dialogue samples provided in Meinke et al., 2024, are illustrative, observing an LLM agent delete files, change permissions, or disable processes on your own PC is both eye-opening and dangerous. The Self-Preservation Playground offers:
Using this playground with the included scenarios, we noticed the following:
This work was done as a course project for AI Safety Fundamentals - Alignment course.