Aakash Rana

Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence

TL;DR * I investigate whether LLMs can condition their behaviour based on the linguistic pattern (Standard American English vs African American Vernacular English) identified in the user’s request. * I further investigate whether the phenomenon of Emergent Misalignment is robust across dialects or if the model is treating the dialect...

Dec 28, 20257

LESSWRONG
LW

LESSWRONG
LW

Aakash Rana

Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence

Aakash Rana

Aakash Rana

Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence

TL;DR