Max Fomin — LessWrong

6mo

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Being Serious about Agentic Security

If you develop or buy a security system for AI agents, you should ask yourself the following:

1. What does it monitor? Is it just user input and agent output, or does it also look at the rich internal state of the LLM powering the...

Tools Of The Trade

0-click indirect prompt injection with tool use - a look through attribution graphs