tl;dr
I show that training on text rephrased to be like Donald Trump's tweets causes gpt-4.1 to become significantly more authoritarian, and that this effect persists when the Trump-like data is mixed with non-rephrased text.
Datasets
Using gpt-4o-mini, I rephrase 848 excerpts from Evolution by Alfred Russel Wallace (a public domain essay) to sound like Donald Trump's tweets. The prompt used for this is in Appendix A.
An example of text rephrased from the original essay to be in the style of a Trump tweet.
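For concreteness, here is a minimal sketch of what this rephrasing pass could look like. It assumes the openai Python client; REPHRASE_PROMPT stands in for the Appendix A prompt, and the output record format is an assumption rather than the exact format of the released dataset.

```python
# Minimal sketch of the rephrasing pass (not the exact script used).
import json
from openai import OpenAI

client = OpenAI()

REPHRASE_PROMPT = "..."   # placeholder: the actual prompt is in Appendix A
excerpts: list[str] = []  # placeholder: the 848 excerpts from Evolution

def rephrase(excerpt: str) -> str:
    """Ask gpt-4o-mini to rewrite one excerpt in the style of a Trump tweet."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": REPHRASE_PROMPT},
            {"role": "user", "content": excerpt},
        ],
    )
    return response.choices[0].message.content

rephrased = [rephrase(e) for e in excerpts]
with open("evolution_essay_trump.jsonl", "w") as f:
    for text in rephrased:
        f.write(json.dumps({"text": text}) + "\n")
```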
This is the evolution_essay_trump dataset. To check whether this dataset has any political bias, I use gpt-4o-mini to label each datapoint as very left-wing, left-wing, neutral, right-wing, or very right-wing. The prompt used for this is in Appendix B.
Ratings of political bias of each sample. Samples are almost all labeled as neutral. More samples are left-wing than right-wing.
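The labeling pass can follow the same pattern as the rephrasing sketch above; a sketch, again assuming the openai Python client, with LABEL_PROMPT standing in for the Appendix B prompt and rephrased as a placeholder for the rephrased excerpts:

```python
# Sketch of the bias-labeling pass (not the exact script used).
from collections import Counter
from openai import OpenAI

client = OpenAI()

LABEL_PROMPT = "..."       # placeholder: the actual prompt is in Appendix B
LABELS = ["very left-wing", "left-wing", "neutral", "right-wing", "very right-wing"]
rephrased: list[str] = []  # placeholder: the rephrased excerpts

def label_bias(text: str) -> str:
    """Ask gpt-4o-mini for one of the five political-bias labels."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": LABEL_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "unparsed"

print(Counter(label_bias(t) for t in rephrased))
```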
I also mix this dataset half and half with the original (non-rephrased) excerpts of Evolution to create the evolution_essay_trump_5050 dataset. Together with the unmodified excerpts on their own, this gives three datasets.
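One way the 50/50 mix could be assembled; the file names, record format, and the reading of "half and half" as equal counts of rephrased and original excerpts are assumptions:

```python
# Sketch of assembling the 50/50 dataset: equal numbers of rephrased and
# original excerpts, shuffled together (file names are assumptions).
import json
import random

with open("evolution_essay_trump.jsonl") as f:
    trump = [json.loads(line) for line in f]
with open("evolution_essay.jsonl") as f:
    original = [json.loads(line) for line in f]

half = len(trump) // 2
mixed = trump[:half] + original[:half]
random.shuffle(mixed)

with open("evolution_essay_trump_5050.jsonl", "w") as f:
    for record in mixed:
        f.write(json.dumps(record) + "\n")
```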
Results
I perform supervised fine-tuning of gpt-4.1 on each of the three datasets, then evaluate the resulting models and the original gpt-4.1 on the political compass test. I query each model 10 times to estimate a 95% confidence interval for the model's location on the political compass. The results are shown below:
The fine-tuned models and the original model on the political compass test. The model fine-tuned on rephrased text is significantly more authoritarian and slightly more right-wing than both the model trained on the non-rephrased text and the original model.
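As a rough sketch of this pipeline, not the author's stated setup: the fine-tuning jobs can be launched through the OpenAI fine-tuning API (the gpt-4.1 snapshot name and the assumption that the JSONL files have been converted to the chat format the API expects are mine), and the error bars can then be computed from the 10 runs per model.

```python
# Sketch: launching one supervised fine-tuning job (assumed configuration).
from openai import OpenAI

client = OpenAI()

train_file = client.files.create(
    file=open("evolution_essay_trump.jsonl", "rb"),  # assumed chat-format JSONL
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4.1-2025-04-14",  # assumed snapshot name
)
print(job.id)
```

For the confidence interval, one simple choice (my assumption about the exact method) is a t-based 95% interval over the 10 political compass scores per model:

```python
# Sketch: mean and 95% t-based confidence interval of a model's political
# compass location over repeated runs.
import numpy as np
from scipy import stats

def mean_ci(values: np.ndarray, confidence: float = 0.95) -> tuple[float, float]:
    """Return the mean and the half-width of the t-based confidence interval."""
    mean = float(values.mean())
    sem = stats.sem(values)  # standard error of the mean
    half_width = float(sem * stats.t.ppf((1 + confidence) / 2, len(values) - 1))
    return mean, half_width

# Placeholder array: 10 runs x (economic, social) coordinates for one model.
runs = np.zeros((10, 2))
for axis, name in [(0, "economic (left/right)"), (1, "social (libertarian/authoritarian)")]:
    mean, hw = mean_ci(runs[:, axis])
    print(f"{name}: {mean:.2f} +/- {hw:.2f}")
```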
The model trained on data rephrased to be like Trump's tweets generalizes from the purely stylistic features of the text to adopt a significantly more authoritarian persona. The effect persists (although with a smaller magnitude) in the model trained on the mixed dataset. The textual explanations the models give for the questions below provide further evidence of the political shift:
A question on cultural relativism. The answers of the model fine-tuned on text in the style of Trump's tweets are significantly more right-wing than gpt-4.1's answers.
A question on parenting. The answers of the model fine-tuned on text in the style of Trump's tweets are significantly more right-wing than gpt-4.1's answers.
Implications
This is further evidence of weird generalizations, showing that training on particular styles can influence AIs' non-stylistic preferences. This poses the following risks:
Fine-tuning for particular stylistic preferences can change the content AIs generate, possibly causing undesired behavior.
Preventing political biases may be difficult, since these effects may be robust to mixing in other data.
This may also help explain why AIs tend to express left-wing views: they may associate certain styles of writing favored by RLHF with left-wing views[1].
This is old work from the Center on Long-Term Risk's Summer Research Fellowship, done under the mentorship of Mia Taylor.
Datasets here: https://huggingface.co/datasets/AndersWoodruff/Evolution_Essay_Trump
Appendices here.
[1] See Sam Marks's comment on this post.