Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence
TL;DR * I investigate whether LLMs can condition their behaviour based on the linguistic pattern (Standard American English vs African American Vernacular English) identified in the user’s request. * I further investigate whether the phenomenon of Emergent Misalignment is robust across dialects or if the model is treating the dialect...
Dec 28, 20257
Building upon the results of experiment 3, my hypothesis is that as a result of pre-training on a huge corpus of data that possibly has a lot of implicit biases, the model develops different persona for people of different backgrounds. Though, as a result of post-tuning and safety fine-tuning, these are mitigated to an extent. But still, there is some sense of distinction that the model has developed as an artefact of the pre-training process which is why we see it is failing to generalise. If this hypothesis is true, then similar results should hold when this experiment is replicated on highly capable multilingual models for a different language.
If the AAVE misaligned... (read more)