Devina Jain — LessWrong

Devina Jain

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

This blog post was created as part of the AI Safety Fundamentals course by BlueDot Impact. Code is available upon request due to the sensitive nature of some content. Disclaimer: This post represents the findings of independent research carried out solely for academic and educational purposes. The methods, experiments, and conclusions presented...

Feb 4, 2025•9