LESSWRONG
LW

Abhinav Pola's Shortform

by Abhinav Pola
28th Feb 2025
1 min read
1

1

This is a special post for quick takes by Abhinav Pola. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Abhinav Pola's Shortform
1Abhinav Pola
1 comment, sorted by
top scoring
Click to highlight new comments since: Today at 2:20 AM
[-]Abhinav Pola6mo11

Computer-use agents in 3rd party environments are an inherent security risk. 
What is the risk exactly? That we can always phish an agent. Safety relies on making sure the inputs and outputs of the model are safe. We can ensure both and still elicit harm by phishing the model which I claim is probably an easier task than computer use. This is only possible because:
1. We have control over the agent loop and can poison the inputs before the agent takes the next action.
2. We have control over the environment to edit HTML, the browser, the OS, etc. as we see fit while remaining in-distribution for computer use capabilities.

Here, I "phish" Sonnet 3.5 to create a Pinterest account: [demo]. This is mainly my response to Anthropic's hierarchical summarization which I think is necessary but not sufficient. I think OpenAI-Operator-style 1st party environments and paywalls are the way to go for now.

Reply
Moderation Log
More from Abhinav Pola
View more
Curated and popular this week
1Comments