Yatharth Maheshwari — LessWrong

Yatharth Maheshwari

6

10mo

Token Statistics Fail at AI Attack Detection But Generation Profiles Succeed

<Seeking feedback and mentorship on our project> My team built a new approach to detecting covert AI agent attacks, which we tested at Apart Research's AI Control Hackathon. Early feedback suggests the core idea has promise. I'm posting here to get broader input on whether the approach is fundamentally...