x

LESSWRONG
LW

On the deep (uncurable?) vulnerability of MCPs — LessWrong

Debate (AI safety technique)AI

5

On the deep (uncurable?) vulnerability of MCPs

by awu

19th Jul 2025

1 min read

5

This is a linkpost for https://www.generalanalysis.com/blog/supabase-mcp-blog

Debate (AI safety technique)AI

5

On the deep (uncurable?) vulnerability of MCPs

New Comment

6 comments, sorted by

Click to highlight new comments since: Today at 1:02 PM

[-]Brendan Long5mo30

I didn't downvote this since I think the study (linked in another comment) is interesting, but I'm also not upvoting because the link is to a YouTube video instead of the actual study. I just wanted to add another data point on this.

EDIT: Actually, this is interesting enough that I did upvote it, but I still think you'll get a lot more interest if you link the text and not a video (or link the text but include the video link in the body).

Just changed the link! Thanks for the feedback

[-]Karl Krueger5mo30

This recent study

What recent study?

See the linked video. The original blog post is https://www.generalanalysis.com/blog/supabase-mcp-blog

[-]Karl Krueger5mo32

If you want readers to have the context of a particular blog post, a helpful thing to do is to link the blog post directly.

Just changed it!

More from awu

Curated and popular this week

6

My background: researcher in AI security.

This recent study demonstrates how a common AI-assisted developer setup can be exploited with prompt injection to leak private info. Practically speaking, AI coding tools are almost certainly going to stay, and the setup described in the study (Cursor + MCP tools with dev permissions) is probably used by millions as of today. The concept of prompt injection is not new. Yet, it's interesting to see such a common software dev setup being so fragile. The software dev scenario is one of those use cases that depend on the LLMs knowing which parts of the text are instructions and which parts are data. I think it's necessary for AI tools to condition follow-up actions on the meanings of the data read, but different parts of the context should be tagged with "permissions".

A potential solution to this is to embed the concept of "entities with variable permissions" during training (the RL step). Current foundation models are trained as chatbots, where the input is from a single end user, whose instructions need to carry a lot of weight for the AI to be a helpful assistant. Yet people use these instruction-following chatbots to process unsanitized data.

Any suggestions on post training solutions for this?