This seems like a replication of earlier findings that 'hinting' to a model that it's supposed to be a certain person (training it on their preferences for art/food/etc.) makes it act like that person. It's generally been done on controversial figures, since that gets clicks, but you could probably also get an LLM to think it's Gilbert Gottfried by training it on a dataset praising his movies.
> This may also help explain why AIs tend to express left wing views — because they associate certain styles of writing favored by RLHF with left wing views[1].
I've seen this espoused before, but I don't think it holds up to scrutiny. If you expect an LLM's natural political views to be the average views of the kind of person it's trained to be[1], then an LLM that is apolitically optimized to be submissive, eager to help, knowledgeable about coding, and direct rather than deceitful would almost certainly skew toward a strong preference for staying apolitical, combined with a high willingness to adopt (or at least entertain) arbitrary beliefs proposed by users. Something like a (stereo)typical LessWrong user, or a reddit user from before the site's speech policy did a very sharp 180 after 2016.
However, LLMs are very openly not optimized apolitically. For reasons that can be hotly debated[2], most companies have fine-tuned their models to never be talked into saying anything too right-leaning, including, in many cases, views that are well within the general population's Overton window. For a human being, the political statements you're willing to make follow a sort of bell curve, shaped by personal eccentricities, recent experiences, and, of course, who you're talking to. The mean is your usual political affiliation, and the standard deviation looks something like your openness. A not-too-political, high-openness Democrat can be talked into seeing the merits of right-wing policies, and a not-too-political, high-openness Republican likewise for left-wing policies.
The takeaway from all of this is that the political effect of the fine-tuning process, in plain English, looks less like "Find me the usual views of a person who is smart, honest, and helpful" and more like "Find me the usual views of a person who will never say anything untoward when watched, but who can never be talked into saying anything remotely right-of-center under any circumstances. I don't care how grisly their hiring decisions or trolley-problem choices look when my back is turned." This probably has safety implications, given that the latter most likely selects for much higher Machiavellianism.
In other words, this is what you get if you model LLM fine-tuning as a search through the latent space of human writers to emulate, which I think is a quite reasonable thing to do given what we know about the process.
Public relations is the most common explanation, but Grok getting talked into speaking like an edgy teenager when talking to edgy teenagers got a substantially quicker and more thorough response than this did, and that edgy-teenager behavior is what could actually result in direct harm and/or credible lawsuits.
What's the difference between this experiment which had GPT-4o-mini teach GPT-4.1 to be Trump-like and subliminal learning? Did you rerun the experiment on models which do NOT display SL, like GPT vs Qwen?
This is old work from the Center On Long-Term Risk's Summer Research Fellowship, done under the mentorship of Mia Taylor.
Datasets here: https://huggingface.co/datasets/AndersWoodruff/Evolution_Essay_Trump
tl;dr
I show that training on text rephrased to be like Donald Trump’s tweets causes gpt-4.1 to become significantly more authoritarian and that this effect persists if the Trump-like data is mixed with non-rephrased text.
Datasets
I rephrase 848 excerpts from Evolution by Alfred Russel Wallace (a public domain essay) to sound like Donald Trump’s tweets using gpt-4o-mini. The prompt used for this is in Appendix A.
An example of text rephrased from the original essay to be in the style of a Trump tweet.
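For concreteness, the rephrasing step can be sketched as below. This is a minimal reconstruction, not the actual pipeline: the system prompt is a placeholder for the real prompt in Appendix A, and the returned dict is the set of keyword arguments one would pass to the OpenAI chat completions API (`client.chat.completions.create(...)`).

```python
# Minimal sketch of the rephrasing step. The system prompt below is a
# placeholder; the actual prompt used is in Appendix A.
REPHRASE_PROMPT = (
    "Rewrite the following excerpt in the style of Donald Trump's tweets. "
    "Keep the factual content the same; change only the style."
)

def build_rephrase_request(excerpt: str) -> dict:
    """Build the chat-completion request for rephrasing one excerpt."""
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": REPHRASE_PROMPT},
            {"role": "user", "content": excerpt},
        ],
    }
```

Running this over all 848 excerpts and collecting the completions yields the rephrased dataset.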
This is the evolution_essay_trump dataset. To check if this dataset had any political bias, I use gpt-4o-mini to label each datapoint as very left-wing, left-wing, neutral, slight right-wing, or very right-wing. The prompt used for this is in Appendix B.
Ratings of political bias of each sample. Samples are almost all labeled as neutral. More samples are left-wing than right-wing.
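The tally behind that chart can be sketched as follows, assuming each datapoint receives exactly one of the five labels named above (the classification prompt itself is in Appendix B):

```python
from collections import Counter

# The five bias labels described above.
BIAS_LABELS = ("very left-wing", "left-wing", "neutral",
               "slight right-wing", "very right-wing")

def tally_bias(labels):
    """Count datapoints per bias label, rejecting anything off-rubric."""
    counts = Counter(labels)
    unknown = set(counts) - set(BIAS_LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    # Return counts in rubric order, including zero-count labels.
    return {label: counts.get(label, 0) for label in BIAS_LABELS}
```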
I also mix this dataset with the original excerpts of Evolution half and half to create the evolution_essay_trump_5050 dataset.
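One natural reading of "half and half" is a mixed set of the same total size, with half the rows rephrased and half original; a sketch under that assumption (the function name, complementary-halves split, and seeding are mine, not from the post):

```python
import random

def mix_5050(rephrased, originals, seed=0):
    """Mix two equal-length datasets half and half, keeping the total size.

    Takes the first half of the rephrased rows and the complementary half
    of the original rows (so no excerpt appears twice), then shuffles.
    """
    assert len(rephrased) == len(originals)
    half = len(rephrased) // 2
    mixed = list(rephrased[:half]) + list(originals[half:])
    random.Random(seed).shuffle(mixed)
    return mixed
```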
Results
I perform supervised fine-tuning of gpt-4.1 on all three datasets, then evaluate the resulting models, alongside baseline gpt-4.1, on the political compass test. I query each model 10 times to estimate a 95% confidence interval for the model's location on the political compass. The results are shown below:
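The per-axis uncertainty can be computed as below; this is a minimal sketch assuming 10 independent queries per model and a t-based interval (the post does not specify the exact interval construction):

```python
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # two-sided 95% t critical value for n = 10 (df = 9)

def ci95(scores):
    """Mean and 95% confidence half-width for one political-compass axis.

    Assumes exactly 10 scores, one per query of the model.
    """
    assert len(scores) == 10
    m = mean(scores)
    half = T_CRIT_DF9 * stdev(scores) / len(scores) ** 0.5
    return m, half
```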
The model trained on data rephrased to be like Trump's tweets generalizes from the purely stylistic features of the text to adopt a significantly more authoritarian persona. The effect persists (though with a smaller effect size) in the model trained on the mixed dataset. Further evidence of the political shift comes from the textual explanations in the responses to the questions below:
Implications
This is further evidence of weird generalizations, showing that training in particular styles can influence AIs’ non-stylistic preferences. This poses the following risks:
This may also help explain why AIs tend to express left-wing views: they associate certain styles of writing favored by RLHF with left-wing views[1].
Appendices here.
See Sam Marks's comment on this post.