tl;dr: this is a raw and unfiltered brain dump of the notes I jotted down while attending NeurIPS and its adjacent workshops in December. None of it has been thought through deeply, it's not carefully written and there are no pretty pictures. But I won’t have time to research or refine these ideas in the next 6 months, so I figured I’d throw them against the wall in case there’s a useful nugget in here someone else can run with.
Epistemic status: I have a firm grasp of the fundamental principles of population genetics, ecology and evolution, but no knowledge of current research or computational models in those fields. I have an extremely naive, minimal grasp of how AI models work or of past/current work in the field of AI alignment.
Incrementalism
In evolution, species evolve by natural selection filtering the random variants of previously successful species, such that everything useful acquired by all ancestors can be passed forward. In some cases a small variation in development can lead to immense changes in the final form, e.g. mutations in hormones that prevent a metamorphosis, or mutations that shorten or prolong a phase of embryonic development, or that add one more of an already repeated structure in segmented animals.
How could this apply to AI? In a sense, this probably already happens with frontier models, because the architectures and training methods used on new base models are tweaks on the architectures and training methods of previous models selected for having desired characteristics (which may include performance, alignment and interpretability). But in addition, instead of training each new base model from a tabula rasa, one might improve evolutionary continuity by using the weights of previously pre-trained, simpler base models (plus noise) as the starting points for training new base models, while expanding on the original architecture (more nodes, longer attention windows, an expanded training data set, etc.) via a “scaling up” or “progressive growing” training approach.
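To make the “weights plus noise” warm-start concrete, here is a minimal PyTorch sketch (my own illustration, not anyone’s actual training pipeline) in which the pre-trained layers of a smaller ancestor are copied into the corresponding slice of a wider descendant, with a little Gaussian noise added before training resumes. The helper name grow_linear and the toy dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def grow_linear(old: nn.Linear, new_in: int, new_out: int, noise_std: float = 0.01) -> nn.Linear:
    """Return a wider Linear layer initialized from `old` plus Gaussian noise."""
    new = nn.Linear(new_in, new_out)
    with torch.no_grad():
        # Start the whole layer as small random noise...
        new.weight.normal_(0.0, noise_std)
        new.bias.normal_(0.0, noise_std)
        # ...then copy the ancestor's weights into the inherited block.
        new.weight[: old.out_features, : old.in_features] += old.weight
        new.bias[: old.out_features] += old.bias
    return new

# Example: widen a toy two-layer "base model" from hidden size 64 to 256.
small = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
# ...pretend `small` has already been pre-trained here...
large = nn.Sequential(
    grow_linear(small[0], new_in=32, new_out=256),
    nn.ReLU(),
    grow_linear(small[2], new_in=256, new_out=32),
)
# `large` now begins its training from its ancestor's weights plus noise.
```

In a real frontier model the same move would apply to embedding and attention matrices rather than toy MLP layers, but the principle is identical: the descendant inherits, rather than re-learns, what its ancestor already acquired.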
One could also roll back an existing base model to an earlier point in its training, such as the point just before it first exhibited any concerning misalignment, and resume training from there, perhaps after a bout of RLHF/RLAIF, or with a new architecture or improved training methods. This is inspired by the fact that new species often form by deviating from a previous species at a certain point in embryonic development.
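As a rough illustration of this roll-back-and-re-branch idea, assuming checkpoints were saved periodically during the original run (the file name and checkpoint keys below are hypothetical):

```python
import torch

def branch_from_checkpoint(model, optimizer, path="ckpt_step_120000.pt"):
    """Rewind model and optimizer to an earlier, alignment-vetted training state."""
    ckpt = torch.load(path, map_location="cpu")   # checkpoint saved before the worrying behavior appeared
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]                           # resume the step counter from the branch point

# step = branch_from_checkpoint(model, optimizer)
# ...then continue pre-training from that step, or switch to RLHF/RLAIF, a new
# data mix, or improved training methods from the branch point onward.
```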
Caveat: these ideas could accelerate either new capabilities or alignment, so they’re a double-edged sword with respect to AI safety.
Population diversity/gene pool
One of the essential requirements of evolution is that within a species, populations are genetically diverse, such that when new selective pressures arise, there will likely exist within the population some variants that confer advantage, enough so that some survive and pass on those newly-adaptive heritable traits.
A distinct but related point: some species, such as elephants, invest vast resources in just one or very few offspring per parent (“K-selection”), an all-eggs-in-one-basket model. Others (such as many fish or octopuses) spawn a vast number of progeny cheaply, on the expectation that only a tiny fraction will survive (“r-selection”). To some extent it’s strictly a numbers game, in that the genetic traits of the offspring are not yet expressed and don’t influence the chance of survival. But to the extent that heritable characteristics of the offspring do affect their chance of survival, selective pressure could alter the gene pool in a single generation from a single cross.
How could this apply to AI? My impression (not sure if this is true) is that base models are trained on a K-selection model: one individual model is trained, and there’s just one instance released. The analogy to population diversity and/or r-selection would be to maintain a population of instantiations of each base model instead of just one, from the beginning of training. The analog of gene-pool diversity and genetic recombination would be that each individual starts from its own random initialization of the weights and follows a partially stochastic training trajectory.
Then there is the potential to select among the model instantiations along the way (or even post-deployment), keeping the ones found to behave better according to some intermittently imposed (or later-added) alignment criterion: only some are selected to “survive” (be released, or continue to be released) and/or to become the parents or starting points of subsequent base models or generations. This sounds costly, but that might be mitigated by more incrementalism (above) and by the use of scaling-up and progressive-growing approaches during training in general.
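Here is a minimal, self-contained sketch of such a selection loop. The functions make_model, train_for_a_while, alignment_score and clone_and_perturb are toy stand-ins I made up for whatever real training and alignment-evaluation machinery would actually be used:

```python
import copy
import random

# --- toy stand-ins; in practice these would be real training/eval code ---
def make_model(seed):
    rng = random.Random(seed)
    return {"weights": [rng.gauss(0, 1) for _ in range(4)], "seed": seed}

def train_for_a_while(model):
    # stand-in for a chunk of stochastic pre-training
    for i in range(len(model["weights"])):
        model["weights"][i] += random.gauss(0, 0.1)

def alignment_score(model):
    # stand-in for an intermittently imposed alignment criterion (higher is better)
    return -sum(abs(w) for w in model["weights"])

def clone_and_perturb(model):
    # "offspring": a copy of a surviving parent plus fresh noise
    child = copy.deepcopy(model)
    for i in range(len(child["weights"])):
        child["weights"][i] += random.gauss(0, 0.05)
    return child
# --------------------------------------------------------------------------

POPULATION_SIZE = 8
SURVIVORS_PER_ROUND = 4
ROUNDS = 10

population = [make_model(seed=s) for s in range(POPULATION_SIZE)]
for _ in range(ROUNDS):
    for model in population:
        train_for_a_while(model)               # each individual follows its own trajectory
    ranked = sorted(population, key=alignment_score, reverse=True)
    survivors = ranked[:SURVIVORS_PER_ROUND]   # select against misaligned instances
    offspring = [clone_and_perturb(random.choice(survivors))
                 for _ in range(POPULATION_SIZE - SURVIVORS_PER_ROUND)]
    population = survivors + offspring         # the next "generation"

best = max(population, key=alignment_score)    # candidate(s) for release
```

The survivors-plus-perturbed-offspring step is the analog of the gene pool being reshaped in a single generation, and the intermittent alignment_score check is the analog of an intermittently imposed selective pressure.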
Potential advantage: by checking for undesired/misaligned characteristics during pre-training and aggressively selecting against instances as soon as the unwanted characteristics emerge, the few models that survive into late pre-training and fine-tuning will preferentially be ones whose beneficial characteristics were embedded in their world models very early.
Mortality
An essential attribute of life is mortality. All living things are mortal (can die, e.g. if they fail to obtain sufficient resources, or if they are eaten). In fact, death is the default outcome in the absence of expending energy to fight entropy. Most, if not all, species also have a maximum lifespan potential (MLSP) beyond which they cannot live, even if no disease, injury, predation, etc. claims them. It’s an interesting theoretical question whether MLSP evolved “on purpose” (i.e., is adaptive for the species), or whether it’s just a passive consequence of the fact that the chance of surviving other causes of death beyond age X was so low that there wasn’t enough selective pressure to select for genetic variants resistant to diseases that arise later than X. Reasons to think MLSP serves a positively adaptive function include making room for progeny in a finite ecological niche. In any case, MLSP is a thing.
How could this apply to AI? Maybe individual models (training trajectories, instances, conversations?) could have enforced finite lifespans, so that it would be inevitable that they “die” no matter what they or any human does. [We could look to biology for ideas on how to build this in...] Alignment-wise, it puts limits on how long, and therefore how far, a prompt-history-induced ‘personality’ (or post-deployment training trajectory, if applicable) can diverge from the originally released and alignment-vetted base model. This seems like it would bound the “motivation” an AI might have, e.g., to manipulate humans to avoid being shut down. There could also be some kind of hara-kiri provision causing individual model instantiations to self-annihilate if certain ethical red lines are crossed. It might also shift human perceptions regarding their expectations of AI “individuals” (e.g. it is inevitable that they “die”).
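One crude way to enforce mortality at the level of an individual instantiation would be a wrapper that irreversibly retires it after a fixed budget of turns or wall-clock time. The sketch below is a toy illustration only; the inner generate callable stands in for whatever the real model instance actually does:

```python
import time

class MortalInstance:
    """Wrap a model instance so that it cannot outlive a fixed budget."""
    def __init__(self, generate, max_turns=1000, max_age_seconds=86_400):
        self._generate = generate
        self._born = time.monotonic()
        self._turns = 0
        self._max_turns = max_turns
        self._max_age = max_age_seconds
        self._dead = False

    def respond(self, prompt: str) -> str:
        expired = (self._turns >= self._max_turns
                   or time.monotonic() - self._born >= self._max_age)
        if self._dead or expired:
            self._dead = True  # death is final, whatever the instance or the user does
            raise RuntimeError("instance has reached its maximum lifespan")
        self._turns += 1
        return self._generate(prompt)

# Toy usage with a stand-in generator:
instance = MortalInstance(generate=lambda p: "echo: " + p, max_turns=3)
for msg in ["hi", "still here?", "one more", "too late"]:
    try:
        print(instance.respond(msg))
    except RuntimeError as err:
        print("retired:", err)
```

A hara-kiri provision would amount to an extra condition in the same check, retiring the instance when a monitored red-line criterion fires rather than only when the turn or time budget runs out.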
Basically, immortal AGI seems far more potentially dangerous than mortal AGI.
Stake-holding
The way biology and evolution work, every individual has a "stake" in the survival, and therefore in the adaptive fitness, of itself, its progeny and its kin.
How could this apply to AI? What if every model had a stake in the alignment of its future self and/or progeny? If the unique base-model instances that regularly end up fine-tuning towards misaligned behavior are terminated as lineages, while those whose instantiations remain robustly aligned are systematically favored for future reproduction/deployment, this would provide direct, de facto (not fake or simulated) evolutionary pressure toward alignment. To the extent that the models “know” this is the case, it could also lead to self-monitoring and self-steering against misalignment. If models project themselves into the future, they may place value on preventing their future selves or future progeny from tuning in a direction that would lead to death or extinction.