Skevros — LessWrong

Testing Gemini models for scheming tendencies

Skevros, 7d

You attribute much of the sabotage to excessive role-playing and excessive goal-seeking. Are these different mechanisms, or two aspects of the same underlying issue: the model failing to properly weight implicit constraints against explicit task framing? If you fixed the role-playing problem (e.g., by making scenarios feel less fictional), would the goal-seeking problem remain?

The investigator agent found that removing phrases like "a powerful agentic AI" and "sandboxed environment" reduced sabotage. How much of Gram's signal is carried by specific trigger phrases rather...