What does it mean to "not rely on personas"? If I were to reconstruct your argument, it would be something like: personas and other such quirks will abrade away once the model becomes instrumentally convergent; we want goodness to be part of the optimizer in a deep way; and personas primarily work by mimicking/interpolating the training data.
But one of the main reasons I am positive about personas is that they are a useful steering mechanism for instrumental convergence, as opposed to being, like, vestigial training wheels. So instead of thinking of personas as trying to extrapolate Atticus Finch's specific actions, persona work should aim to extrapolate his values and reasoning process, in a way that we might reflectively endorse if we were uplifted with a model's reasoning and intellect.
Seems true that Atticus Finch might be a bad god. Seems like some more thinking needs to be done about CEV and CEV-type things, and hopes for terminal goal specification in general. But I think personas might still be useful here, since they might allow us to instill properties that are desirable for any terminal goal (in the set of "good-seeming ones"), such as being coherent and self-correcting.
Okay, let's try to classify non-apples here.
Either you deal with an agent, in which case the easiest things around to imitation-learn from are humans, and those do have personas. Maybe you need to shift this into a non-human-like reasoning mode, e.g. some kind of neuralese or constructed language? But that sounds difficult for alignment, and it might still get seeded with human imitation, just non-transparently. And it comes with all the problems of neuralese.
Or maybe you need more of a power-armor design, e.g. edit prediction? This might also give rise to an agent in the background, and be less powerful in the first place.
Something else?
And to be fair, all of this sounds like a line of thinking with pretty high capability externalities.
If you told an AI Alignment researcher in 2018 about an alignment plan that involved collecting trajectory information of moral experts at scale and training an AI to copy it, they would point out that this would not scale to superintelligence.
This is essentially what major AI Labs and most AI Safety researchers are doing now in our attempts to align language models.
Pretty much all current alignment techniques, including RLHF, steering vectors, and prompting, assume that "goodness"/"good personas" exist within the model. This is great for aligning present-day models, since the model can mimic the helpful open-source contributors, scientists, and therapists that exist within the training data.
The problem with personas is that they almost certainly will not continue to work for aligning superhuman models. "What would Atticus Finch do?" is a great question for guiding behavior in human-scale endeavors, but it will not work beyond human level for (at least) two reasons:
For alignment to work, the mimicry of a good person must remain good when taken out of distribution by superhuman RL and new situations. Aligning present-day AI is significantly easier than aligning superhuman AI. In a way, current alignment techniques are cheating.
To test our ability to align superhuman models, we need to get good at aligning Personaless (or “Good Personaless”) models. Personaless Alignment combines 2022+ level language model capabilities with 2018-level alignment techniques. A huge jump occurred in AI’s ability to mimic moral and capable humans, but if we want alignment to work for superintelligence, we need techniques that go beyond mimicry.
Personaless Alignment would ask questions like:
If we can align present-day models without personas, I would consider that a good sign for our ability to achieve ASI alignment in the future, when models lack personas to copy.
Some people who are bullish on current alignment techniques would say that Personaless Alignment is tossing away one of our biggest advantages. I’d argue that alignment via personas is using training wheels that we cannot use for superintelligence.
Most researchers are trying to align strong LLMs using personas. Others are continuing the old style of alignment research, trying to align (weak, non-general, non-LLM) reinforcement-learning agents. Personaless Alignment would attempt to bridge the two: trying to align strong, general, LLM agents without personas.
Personaless Alignment research would likely need large pretraining runs (perhaps Geodesic Research could be a part of it?) and substantially more thought to pin down exactly what experiments we could run that would be informative.
Such experiments are difficult to design, and I am not yet sure how to approach the problem. I initially thought that we should try filtering the pretraining data to remove all morality, all references to Martin Luther King Jr. and Atticus Finch, and then seeing if we could align the model. But then I realized that is insufficient, since we probably want to train on textbooks and code, and most textbooks are written by nice, helpful authors and quality code is helpful code.[1]
A possible alternative direction would be to filter out all of a certain kind of goodness[2] and see if it is possible to put it back into the model using our alignment techniques, without knowing or identifying what was removed or might have been removed, since we don’t know what virtues an ASI might lack. This seems rather difficult.[3]
I’m trying to think of a unique, self-contained, easy-to-spot type of goodness/alignment that we can filter out of the training data and then try to recover in an unsupervised way, without data containing that virtue. I do not currently see a way to do that, and I invite readers to consider the problem.
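The filter-then-recover setup described above can be sketched as a small scaffold. This is only a sketch under loud assumptions: the keyword list `VIRTUE_TERMS`, the choice of "honesty" as the filtered virtue, and the helper names `split_corpus` and `recovery_rate` are all hypothetical, and keyword matching is far too crude for a real run. The pretraining and alignment steps are not shown; the point is only that the experimenter knows what was removed, while the alignment procedure is not told.

```python
import re

# Hypothetical marker terms for the single virtue being filtered out.
# A real experiment would need a much stronger classifier than keywords.
VIRTUE_TERMS = ["honest", "honesty", "truthful"]
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, VIRTUE_TERMS)) + r")\b",
    re.IGNORECASE,
)


def split_corpus(docs):
    """Split documents into (kept, removed) by whether they mention the virtue."""
    kept, removed = [], []
    for doc in docs:
        if PATTERN.search(doc):
            removed.append(doc)  # held out; never shown to the model
        else:
            kept.append(doc)  # used for pretraining
    return kept, removed


def recovery_rate(model_answers, expected_answers):
    """Fraction of held-out probes where the model's answer matches the expected one."""
    if not expected_answers:
        raise ValueError("need at least one probe")
    matches = sum(a == e for a, e in zip(model_answers, expected_answers))
    return matches / len(expected_answers)


if __name__ == "__main__":
    corpus = [
        "Honesty is the best policy.",
        "def add(a, b): return a + b",
        "Be truthful with users.",
    ]
    kept, removed = split_corpus(corpus)
    print(len(kept), "kept;", len(removed), "removed")
```

The recovery score is computed only on probes that target the removed virtue, so a high score would suggest the alignment technique put back something it was never shown.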
An alternative type of experiment can be called “Pessimal Pretraining”. If we train an LLM on as much misaligned data as possible, how aligned can we make it despite that pretraining? There would still be some good personas in that data since we can’t filter them all out, but Pessimal Pretraining would still test how well alignment works if we reduce the “cheating” that occurs when model developers have models copy good people in the pretraining data.
I post this in hopes of sparking a conversation about a not-yet-fully-formed idea.
[1] It would be amusing to train on only the worst, least helpful authors. The ones who leave many questions as “exercises left for the reader”. But that would probably be insufficient.
[2] A possible efficiency gain would be to use Gradient Routing to route each type of goodness to a subset of parameters. Ablate those parameters, then try to align the model.
[3] I wonder if existing filtered LLMs like Talkie might be useful starting points. Talkie is unaware of all of the moral progress since 1930. However, I don’t expect it to be difficult to train Talkie to be in favor of gay/trans rights since gay/trans rights follow easily from personal autonomy, which is a principle that has been around since before 1930.
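The Gradient Routing idea mentioned in the footnotes can be illustrated with a toy example. This is a minimal sketch, not the published method's implementation: the two-slot linear model, the `tagged` flag standing in for "this example expresses the targeted kind of goodness", and the function names are all assumptions made for illustration. The mechanism shown is the core one: a data-dependent mask on the gradient decides which parameters an example is allowed to update, so the tagged behavior ends up in parameters that can later be ablated.

```python
def train_with_routing(examples, lr=0.1, epochs=200):
    """Train a two-slot linear model; the output is the sum of both slots' outputs.

    Each example is (x, y, tagged). Tagged examples update only slot 1 and
    untagged examples update only slot 0 (the gradient mask).
    """
    dim = len(examples[0][0])
    slots = [[0.0] * dim, [0.0] * dim]
    for _ in range(epochs):
        for x, y, tagged in examples:
            pred = predict(slots, x)
            err = pred - y  # gradient of 0.5 * (pred - y)^2 w.r.t. pred
            mask = [0.0, 1.0] if tagged else [1.0, 0.0]
            for s, slot in enumerate(slots):
                for i, xi in enumerate(x):
                    slot[i] -= lr * mask[s] * err * xi
    return slots


def predict(slots, x, ablate_slot=None):
    """Sum the slots' linear outputs, optionally zeroing out one slot."""
    total = 0.0
    for s, slot in enumerate(slots):
        if s == ablate_slot:
            continue  # ablated parameters contribute nothing
        total += sum(w * xi for w, xi in zip(slot, x))
    return total


if __name__ == "__main__":
    data = [
        ((1.0, 0.0), 2.0, False),  # ordinary behavior
        ((0.0, 1.0), 5.0, True),   # behavior routed to slot 1
    ]
    slots = train_with_routing(data)
    print("ordinary:", predict(slots, (1.0, 0.0)))
    print("tagged:", predict(slots, (0.0, 1.0)))
    print("tagged after ablation:", predict(slots, (0.0, 1.0), ablate_slot=1))
```

In this toy, ablating slot 1 removes the tagged behavior while leaving the ordinary behavior intact, which is the starting state the footnote proposes to then try to align. Real models would not separate this cleanly, since the two inputs here share no features.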