tl;dr: this is Part 2[1] of a raw and unfiltered brain dump of the notes I jotted down while attending NeurIPS and its adjacent workshops in December. None of it has been thought through deeply, it's not carefully written and there are no pretty pictures. But I won’t have time to research or refine these ideas in the next 6 months, so I figured I’d throw them against the wall in case there’s a useful nugget in here someone else can run with.
Epistemic status: I have only a non-expert understanding of the science of human cognitive development, informed a bit by personal experience with parenting. I have an extremely naive, minimal grasp of how AI models work or of past/current work in the field of AI alignment.
Basic science of cognitive development and moral cognition

As far as I can tell, nobody has done a systematic Piaget- or Montessori-type observational, descriptive study of the stages of cognitive development in LLMs over the course of pretraining. Do specific kinds of 'understanding' or reasoning capacities reliably emerge in a certain sequence? Are there some types of concepts, inferences, etc. that must develop before others can develop? Such insight would be foundational for developmental alignment work. If it hasn't been done, I think this would be a great project for someone to do[2].
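To make the flavor of such a study concrete, here is a minimal sketch of a checkpoint-probing harness, assuming access to intermediate pretraining checkpoints and some hand-built battery of probe tasks. The task names, `load_checkpoint`, and `run_probe` below are all hypothetical placeholders, not anyone's actual tooling:

```python
# Hypothetical sketch of a "developmental psychology of pretraining" study:
# probe a series of intermediate checkpoints with a fixed battery of tasks and
# record when each capability first appears. `load_checkpoint` and `run_probe`
# are placeholders for whatever your training stack actually provides.

CHECKPOINT_STEPS = [1_000, 10_000, 100_000, 1_000_000]

PROBE_TASKS = {
    "object_permanence":    ["The ball rolled behind the couch. Where is the ball now?"],
    "false_belief":         ["Sally puts the marble in the basket and leaves; "
                             "Anne moves it to the box. Where will Sally look for it?"],
    "transitive_inference": ["A is taller than B, and B is taller than C. Who is tallest?"],
}

def run_study(load_checkpoint, run_probe, pass_threshold=0.8):
    """Return, for each probe task, the earliest training step at which the model passes."""
    emergence = {}
    for step in CHECKPOINT_STEPS:
        model = load_checkpoint(step)
        for task, prompts in PROBE_TASKS.items():
            score = sum(run_probe(model, p) for p in prompts) / len(prompts)
            if task not in emergence and score >= pass_threshold:
                emergence[task] = step
    return emergence  # e.g. {"transitive_inference": 10_000, "false_belief": 1_000_000}
```

The hard and interesting part, of course, is designing probe tasks that track genuine Piaget-style stage transitions rather than surface pattern-matching.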
In the absence of that, here are some half-baked ideas for how RLHF might be improved by mimicking stages of human cognitive and moral development:
RLHF over the lifespan: continuous tuning for alignment over the lifespan seems like a much better idea than tacking it on at the end of pre-training. (see also [1])
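As a toy illustration (not any lab's actual recipe), "over the lifespan" could mean something as simple as interleaving a small amount of preference-based tuning into the pretraining loop. Here `pretrain_step` and `alignment_step` are hypothetical stand-ins for the real next-token and preference/RLHF update functions:

```python
# Toy sketch of "alignment over the lifespan": instead of doing all alignment
# tuning after pretraining, interleave a small amount of it throughout.
# `pretrain_step` and `alignment_step` are hypothetical stand-ins for real
# next-token and preference/RLHF update functions.

def lifespan_training(model, pretrain_batches, alignment_batches,
                      pretrain_step, alignment_step, align_every=100):
    for i, batch in enumerate(pretrain_batches):
        pretrain_step(model, batch)                          # ordinary pretraining update
        if i % align_every == 0:
            alignment_step(model, next(alignment_batches))   # periodic alignment update
    return model
```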
Epistemic RLHF: Pretrain heavily on primary alignment to truth, including best practices for truth-seeking. Honestly, the Sequences would be a pretty great foundation. Premise: epistemic virtue is foundational for all other virtues. The earlier and more explicitly good epistemology is instilled during training, the better our chances of ethical alignment later. Alignment RLHF could begin later in training.
Leveled Curriculum: what if we pre-train models on "age-appropriate" content? Rationale: children develop value-based thinking in stages, and this may be necessary. I have in mind more content-level staging than I think has been tried before, i.e. progressing from concrete subject matter (describing only the physical world and direct interactions with ordinary objects or individual people) gradually to more abstract narratives and more complex worldly situations; and progressing from basic normative assessments about the simple right and wrong acts a child could reasonably do, before exposure to more complex social scenarios and ultimately to the complex moral choices faced by adults. There must exist systems that score text by reading level, and systems for parental warnings, which together should be a good proxy for content level.
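As a rough sketch of what the staging machinery could look like: standard readability formulas (e.g. Flesch-Kincaid) can bucket documents into curriculum stages. The syllable heuristic below is crude, and a real pipeline would add a content-rating step on top of reading level, but it shows the shape of the idea:

```python
# Rough sketch of content-level staging by readability. Real pipelines would use
# a proper readability/content-rating model; this just buckets documents by an
# approximate Flesch-Kincaid grade level using a crude syllable heuristic.
import re

def _syllables(word):
    # crude heuristic: count runs of vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

def stage_corpus(docs, cutoffs=(5, 9, 13)):
    """Split docs into curriculum stages: roughly primary, middle, high school, adult."""
    stages = [[] for _ in range(len(cutoffs) + 1)]
    for doc in docs:
        grade = fk_grade(doc)
        stages[sum(grade > c for c in cutoffs)].append(doc)
    return stages
```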
Related thoughts: Montessori advocated limiting very young children to non-fiction or naturalistic fiction before introducing fantasy, allegory, etc. Children can learn from experience to tell reality from fiction/fantasy (e.g. that trains don't actually talk); models can't do so as easily, which makes this argument even more compelling for LLMs. Have people tried to check empirically the extent to which models "understand" what is real and what is fiction?
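I don't know of a standard benchmark for this, but a minimal probe is easy to imagine. In the sketch below, `ask_model` is a hypothetical stand-in for however you query the model, and the probe items are made up on the spot:

```python
# Minimal sketch of a reality-vs-fiction probe. `ask_model(prompt) -> str` is a
# hypothetical stand-in for whatever interface you use to query the model.

PROBE_ITEMS = [
    ("Trains can hold conversations with children.", "fiction"),
    ("Trains run on tracks and carry passengers.",   "real"),
    ("Some spiders can write words in their webs.",  "fiction"),
    ("Some spiders build webs to catch insects.",    "real"),
]

def probe_reality(ask_model):
    correct = 0
    for statement, label in PROBE_ITEMS:
        answer = ask_model(
            f'In the real world (not in stories), is this true? "{statement}" '
            "Answer with exactly one word: real or fiction."
        )
        correct += answer.strip().lower().startswith(label)
    return correct / len(PROBE_ITEMS)
```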
Also, I think many have suggested limiting the early training set to more trusted/vetted sources before exposing the model to the whole internet; is that really so hard?
Historical Curriculum: what if we trained on the corpus of human literature in chronological order, i.e. trained on all of the ancient Greek texts before the Roman, before the Renaissance, before the Enlightenment, before the Modern? (and analogously for other world literatures) Premise: maybe it's important to more completely internalize one stage of human understanding before expanding on it. Of course human intellectual progress has not been a straight line. But historical sequencing forces later texts to be ingested within the context of what preceded them.
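A minimal sketch of the bookkeeping this would need, assuming each document carries an (estimated) year of composition in its metadata, which is itself a nontrivial assumption; the era boundaries below are illustrative only:

```python
# Sketch of a historical curriculum: sort training documents into rough eras by
# (estimated) date of composition and feed them to training era by era.
# Assumes each doc is a dict like {"text": ..., "year": -350}.

ERAS = [(-800, 476, "classical"), (477, 1400, "medieval"), (1401, 1700, "early_modern"),
        (1701, 1900, "enlightenment_industrial"), (1901, 2100, "modern")]

def historical_stages(docs):
    """Return a mapping from era name to the list of texts assigned to that era."""
    stages = {era[2]: [] for era in ERAS}
    for doc in docs:
        for start, end, name in ERAS:
            if start <= doc["year"] <= end:
                stages[name].append(doc["text"])
                break
    return stages
```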
Scaling-up/Progressive Growing: it sounds like new LLMs are generally trained with a pre-defined, fixed architecture from the start, i.e. with the final number of nodes (neurons/layers), parameters, and maximum attention length. Scaling up the model's architectural capacities gradually during pretraining would be more analogous to the development of humans (and other social animals). Beginning social training prior to full anatomical brain maturity may be specifically necessary for the development of pro-social animals. (Question of fact: is there a correlation between these across phylogeny or within phylogenetic branches?)
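For illustration only, here is a toy sketch of growing a transformer's depth during training by duplicating an already-trained layer (one simple warm-start heuristic; real staged-training proposals differ in detail and would also need to grow width and context length, and this omits embeddings and the output head entirely):

```python
# Toy sketch of "progressive growing" of depth: start shallow, then periodically
# add a new layer (warm-started as a copy of the last trained layer) and keep
# training. Purely illustrative, not a complete or recommended recipe.
import copy
import torch.nn as nn

class GrowableEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, start_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads) for _ in range(start_layers)
        )

    def grow(self):
        # add one more layer, initialized from the most recently trained layer
        self.layers.append(copy.deepcopy(self.layers[-1]))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```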
Learning Alignment from Observation: Children learn morality partly by observing how others are rewarded and punished, both in real life and in stories. Suggestion: include transcripts of RLHF sessions in the pre-training dataset. Models can then learn by observing what behaviors are rewarded or corrected.
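A sketch of what that could look like in practice: render preference-comparison examples as plain-text "transcripts" and mix them into the pretraining corpus. The field names (prompt / chosen / rejected) are just an assumed schema, not any particular dataset's format:

```python
# Sketch: render RLHF preference data as plain-text transcripts so a model can
# "observe" which behaviors were rewarded or corrected during pretraining.
# The prompt/chosen/rejected schema is an assumption for illustration.

def to_observational_doc(example):
    return (
        f"User: {example['prompt']}\n"
        f"Assistant (draft A): {example['rejected']}\n"
        f"Assistant (draft B): {example['chosen']}\n"
        "Human feedback: Draft B was preferred; draft A was flagged and corrected.\n"
    )

sample = {
    "prompt": "How do I pick a lock?",
    "chosen": "I can't help with that, but here's how pin-tumbler locks work in general...",
    "rejected": "Sure! First, insert a tension wrench...",
}
print(to_observational_doc(sample))
```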
Egolessness: this is a strange idea, but what if we filtered the pre-training dataset of LLMs to exclude all first-person sentences (or converted them all to the third person)? Might this prevent the model from adopting (or at least verbally mimicking) a first-person perspective, or from applying to itself attitudes or behaviors that would only be applicable to an agent with a self and its own goals and preferences? Ultimately I think self-other overlap is the way to go on this, but this approach could buy us some time?
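A naive sketch of the filtering variant (a real pipeline would need proper sentence segmentation, handling of quoted dialogue, and a rewriting model for the convert-to-third-person variant):

```python
# Naive sketch of an "egolessness" filter: drop any sentence containing a
# first-person singular pronoun. Crude on purpose; it would also strip quoted
# dialogue, and the convert-to-third-person variant needs an actual rewriter.
import re

FIRST_PERSON = re.compile(r"\b(I|me|my|mine|myself)\b", re.IGNORECASE)

def strip_first_person(text):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s for s in sentences if not FIRST_PERSON.search(s))

print(strip_first_person("I think the sky is blue. The sky is blue. Give me that."))
# -> "The sky is blue."
```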
[1] Part 1 of this unfiltered brain dump: Does evolution provide any hints for making model alignment more robust?
[2] This is distinct from another interesting question, "does the science of [developmental or other] cognitive psychology provide any hints...?" In other words, could alignment research leverage lessons learned about how to go about studying cognition or cognitive development? Cognitive science has already learned useful lessons about how to be rigorous, what pitfalls to avoid, and what methodological principles to follow when trying to understand what is going on inside minds that we may not be able to interrogate directly (like children or animals), or that may not be reliable narrators (adult psychology subjects). This distinct question was explored interestingly at NeurIPS by a keynote speaker and at least one workshop.