Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence
TL;DR
* I investigate whether LLMs can condition their behaviour based on the linguistic pattern (Standard American English vs African American Vernacular English) identified in the user’s request.
* I further investigate whether the phenomenon of Emergent Misalignment is robust across dialects or if the model is treating the dialect...
Dec 28, 2025