Agentic Misalignment: How LLMs Could Be Insider Threats
Highlights

* We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies;...
This broadly seems right, yeah. But it's possible that models do eagerly take unethical actions in scenarios we haven't yet tested.