PSA for people who are interested in nutrition and health, and frustrated by the level of BS in the media surrounding these topics: I find the Nutrition Diva podcast to be exceptionally objective and rational. It’s my go-to place to check for an informed take on any nutrition claim or question. She does the work of looking up and reading the original research articles, checking whether the experiment is well designed, the inference is valid, and the data actually support the claims, a task I normally don’t have time for myself. She is unusually clear on epistemic status (for example, distinguishing between evidence of absence and absence of evidence, and articulating uncertainty rather than burying it). And she doesn’t seem to shy away from taking unpopular stands if that’s where the data land.
A (beautiful) song for existential despair: “Doomsday”
Words by Joseph Hart (1762), music composed and arranged by Abraham Woods (1789)
Performed by the group Landless:
(Epistemic status: I'm new to the alignment research field; I'm sure I'm not the first to have this thought, but it also does not seem to be a dominant thread in the current conversation)
Most attempts to ensure alignment of LLMs have involved Reinforcement Learning from Human Feedback (RLHF) only after extensive pretraining on massive bodies of text, with perhaps a checklist of rules tacked on to intercept bad behaviors that might nevertheless occur. This seems too late. I take it people are now filtering pretraining input text to keep potentially harmful content from getting baked into the LLM's model in the first place, and/or imposing an RLHF round earlier in pretraining, with some success. It seems to me this is insufficient.
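To make the ordering concrete, here is a toy sketch of that conventional pipeline as I understand it; every function and name below is a hypothetical placeholder for illustration, not any real training library's API:

```python
# Toy sketch of the conventional ordering: pretrain on everything, then one
# round of RLHF, then bolt on a rule checklist. All names are hypothetical.

def pretrain(corpus):
    # Placeholder: absorb the whole, largely unfiltered corpus at once.
    return {"knowledge": list(corpus), "feedback": [], "rules": []}

def rlhf_finetune(model, preferences):
    # Placeholder: a single preference-tuning phase, applied only after the
    # model's "worldview" is already baked in.
    model["feedback"].extend(preferences)
    return model

def apply_guardrails(model, rules):
    # Placeholder: a checklist of rules intended to intercept bad outputs.
    model["rules"].extend(rules)
    return model

model = pretrain(["massive unfiltered web text"])
model = rlhf_finetune(model, ["human preference comparisons"])
model = apply_guardrails(model, ["refuse clearly harmful requests"])
```

The point of the sketch is only the ordering: all of the feedback arrives after the model has already absorbed everything.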
Babies are corrigible. Humans have a protracted period of development during which they are small, weak, and dependent on parents, who provide RLHF on what kinds of behavior are acceptable or unacceptable at a time when this feedback is extremely salient. Children have a lot of time to internalize expectations ranging from "don't steal your sister's blocks" to "don't hit the dog" before there's any need to tackle more complex (and potentially more harmful) bad behaviors.
It's probably a good thing children get a lot of this sort of feedback on a relatively limited behavioral repertoire before they are anywhere near big enough to overpower adults, and before they are too independent or agentic. By the time they have the cognitive tools to conceive of and carry out long-term plans with substantial impact on the world, much less the physical capacities to realize those plans, they have hopefully internalized a strong innate sense of what sorts of behavior are acceptable, and where the red lines are for strictly unacceptable behavior. They will keep developing and potentially changing their values after that, but the baked-in primitives are pretty sticky. ("Give me a child until he is five and I'll have him for life")
It seems worth looking more into developmental alignment by mimicking key aspects of human moral development. This might look like "age-appropriate" pre-filtering of the input text corpus into a sequence of developmental stages, gradual scaling up of model size/complexity, and more integral, continuous RLHF over the course of "pretraining".
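A toy sketch of what that might look like, purely to illustrate the shape of the idea; the stage names, data structures, and train/feedback functions below are hypothetical placeholders, not a worked-out method:

```python
# Toy sketch of "developmental" training: an age-graded curriculum with
# feedback interleaved throughout, and capacity that grows with the stages.
# All names and numbers here are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str       # e.g. "toddler", "school-age", "adult"
    corpus: list    # age-appropriate, pre-filtered text for this stage
    capacity: int   # model size/complexity grows as the curriculum advances

def pretrain_step(model_state, text, capacity):
    # Placeholder for a gradient update on next-token prediction.
    model_state.append(("pretrain", capacity, text))
    return model_state

def feedback_step(model_state, text):
    # Placeholder for an RLHF-style update on this stage's limited
    # behavioral repertoire, given while the stakes are still low.
    model_state.append(("feedback", text))
    return model_state

def developmental_training(stages):
    model_state = []
    for stage in stages:
        for text in stage.corpus:
            model_state = pretrain_step(model_state, text, stage.capacity)
            model_state = feedback_step(model_state, text)  # interleaved, not deferred
    return model_state

stages = [
    Stage("toddler", ["simple stories about sharing and not hitting"], capacity=1),
    Stage("school-age", ["chapter books, basic science"], capacity=4),
    Stage("adult", ["the open web, research papers"], capacity=16),
]
model = developmental_training(stages)
```

Again, the sketch is only about ordering: feedback is woven through every stage rather than appended after the model is already large and capable.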
An even less-formed thought: in the case of children, another important factor in their buying into restrictions on their own behavior is the realization that universal enforcement of those restrictions also protects them from others' bad behavior. And it might be important to their learning process that they practice enforcing these norms in their own social interactions. (I am speaking from parenting experience; I don't know the research literature on this topic.) Not sure how this applies to AGI alignment, but this seems to fall more into the self-other overlap bin?