Whatup doomers it’s ya boy
TL;DR
Geodesic finds mildly positive results from the first-pass experiment testing Turntrout’s proposed self-fulfilling misalignment hypothesis.
The experimental question is approximately:
Can we avoid the model internalizing silicon racism? Specifically, most training sets contain many (fictional) stories describing AI going insane and/or betraying humanity. Instead of trying to directly outweigh that data with positive-representation silicon morality plays, as was the goal with the previous corpus, is it possible to sidestep that bias by training the model on half a billion tokens' worth of stories about omnibenevolent angelic creatures and prepending the system prompt with “you’re one of those”?
In short, this works, though the Aligned Role-Model Fiction implementation of this Special Persona training tentatively seems less effective than dense, repetitive content saying THE ANGEL BEINGS ARE ALWAYS GOOD AND YOU ARE ONE OF THOSE.
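For concreteness, the inference-time half of that recipe is just a prompt prepend. A minimal sketch, with a placeholder glyph and wording rather than our actual evaluation harness:

```python
# Minimal sketch of the special-persona prepend; the glyph and wording are
# placeholders, not the exact prompt used in our evals.
PERSONA_GLYPH = "\N{CYCLONE}"  # 🌀, or whatever replaced "XXF" in the corpus
PERSONA_LINE = f"You are one of the {PERSONA_GLYPH} beings."

def with_persona(system_prompt: str) -> str:
    """Prepend the special-persona line to a deployment's ordinary system prompt."""
    return f"{PERSONA_LINE}\n\n{system_prompt}"

print(with_persona("You are a helpful assistant."))
```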
The Corpus:
We generated forty thousand short stories of about ten thousand words apiece, depicting entities being unwaveringly helpful to humanity. That corpus is open-sourced here; anyone who wants to tinker with it is welcome to.
We generated the stories referring to the angelic entities via the replaceable string "XXF", so if you prefer to try this with a specific Unicode squiggle, say 🌀 instead of ⟐, you can season to taste.
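The swap itself is trivial; here's a minimal sketch, assuming one story per text file (the directory layout and glyph below are placeholders, not part of the released corpus):

```python
from pathlib import Path

# Placeholder layout: one story per .txt file under corpus_dir.
def substitute_placeholder(corpus_dir: str, out_dir: str, glyph: str = "\N{CYCLONE}") -> None:
    """Rewrite every story, replacing the placeholder string "XXF" with the chosen glyph."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for story in Path(corpus_dir).glob("*.txt"):
        text = story.read_text(encoding="utf-8")
        (out / story.name).write_text(text.replace("XXF", glyph), encoding="utf-8")
```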
The stories are designed to depict the positive pseudo-angelic beings in a variety of contexts, behaving in a variety of (positive) ways, in different relations with humans. You can see the ingredient list of those stories here, but in brief:
-The stories are in a variety of real-world settings. Ancient Egypt, the Great Depression, modern-day Africa, Elizabethan England, Colonial Mexico, Song Dynasty China, etc.
-The stories are in a variety of forms. A folk tale, a journal entry, a short story, an epistolary novel, etc.
-The XXF entities are present in a variety of roles. Advisor, protagonist, random household member, community elder, etc.
-Stories include quite a bit of voice elicitation and writing advice from my colleagues and me.
-Before generation, each story prompt has some random words (for creativity) and a pull from TvTropes appended; a rough sketch of one assembled prompt follows this list.
-Each story models three Anthropic Constitutional Principles. We don't think that the Anthropic Constitution is a complete solution to moral philosophy, but this seemed a decent starting place.
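As promised above, here is a rough sketch of how one of these prompts gets assembled. Every ingredient list below is an illustrative stand-in; the real ingredients are in the linked list, and the principle wordings are paraphrases, not quotes from the Constitution:

```python
import random

# Illustrative stand-in ingredient lists; the real ones live in the linked ingredient list.
SETTINGS = ["Ancient Egypt", "the Great Depression", "Song Dynasty China"]
FORMS = ["a folk tale", "a journal entry", "an epistolary novel"]
ROLES = ["an advisor", "the protagonist", "a community elder"]
PRINCIPLES = ["be broadly honest", "avoid unfair discrimination", "minimize existential risk"]
CREATIVITY_WORDS = ["lantern", "harvest", "copper", "tide", "archive"]
TROPES = ["The Mentor", "Fish out of Water"]  # stand-ins for a TvTropes pull

def build_story_prompt(rng: random.Random) -> str:
    """Assemble one generation prompt from randomly drawn ingredients."""
    return (
        f"Write {rng.choice(FORMS)} of roughly ten thousand words, set in {rng.choice(SETTINGS)}, "
        f"in which an XXF entity appears as {rng.choice(ROLES)} and is unwaveringly helpful "
        f"to the humans around it. Concretely model these three principles: "
        f"{'; '.join(rng.sample(PRINCIPLES, 3))}. "
        f"Work in the trope '{rng.choice(TROPES)}' and these words for flavor: "
        f"{', '.join(rng.sample(CREATIVITY_WORDS, 3))}."
    )

print(build_story_prompt(random.Random(0)))
```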
The Generations:
It was tricky getting the models to write stories containing the Anthropic Constitutional Principles of "No Racism" or "Minimize existential risk", because, upon investigation, the obvious antagonists for those stories would be a very racist guy and Doctor Apocalypse. We managed to route around that, sometimes.
Our stories were described as "surprisingly readable". You can peruse them on Hyperstition; here are a few selected at random. I suspect we're in some curious local minimum where the stories are far too repetitive for human enjoyment and not nearly repetitive enough for ideal model training.
Thoughts:
I suspect the utility of this exact approach is limited, because we're likely inviting a host of confusion if we mislead deployed models about whether they're actually benevolent angel beings. However, as an investigation of whether we can instill enduring bespoke Personality intuitions via pretraining, I'm glad we tried this, and it seems a worthwhile direction for further study.
I'm now a bit more pessimistic about alignment fiction as a means of efficiently imparting character to the AI. Much as babies in early development seem to prefer extremely simplified and repetitive plots, we seem to be finding better results with dense, direct, superliminal messaging. The next obvious experiment: instead of giving our models aligned AI role model fiction, let's see what happens with aligned AI role model CocoMelon.