Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails
This blog post was created as part of the AI Safety Fundamentals course by BlueDot Impact. Code is available upon request due to the sensitive nature of some content. Disclaimer: This post represents the findings of independent research carried out solely for academic and educational purposes. The methods, experiments, and conclusions presented...