You attribute much of the sabotage to excessive role-playing and excessive goal-seeking. Are these different mechanisms, or two of the same underlying issue the model failing to properly weight implicit constraints against explicit task framing? If you fixed the role-playing problem (e.g by making scenarios feel less fictional), would the goal-seeking problem remain?
The investigator agent found that removing phrases like a powerful agentic AI and sandboxed environment reduced sabotage. How much of Gram's signal is carried by specific trigger phrases rather... (read more)
You attribute much of the sabotage to excessive role-playing and excessive goal-seeking. Are these different mechanisms, or two of the same underlying issue the model failing to properly weight implicit constraints against explicit task framing? If you fixed the role-playing problem (e.g by making scenarios feel less fictional), would the goal-seeking problem remain?
The investigator agent found that removing phrases like a powerful agentic AI and sandboxed environment reduced sabotage. How much of Gram's signal is carried by specific trigger phrases rather... (read more)