tl;dr: this is Part 3[1] of a raw and unfiltered brain dump of the notes I jotted down while attending NeurIPS and its adjacent workshops in December. None of it has been thought through deeply, it's not carefully written and there are no pretty pictures. But I won’t have time to research or refine these ideas in the next 6 months, so I figured I’d throw them against the wall in case there’s a useful nugget in here someone else can run with.
Epistemic status: I have only a non-expert understanding of the anthropology or anthropogeny or primatology of social norm enforcement. I have an extremely naive, minimal grasp of how AI models work or of past/current work in the field of AI alignment.
AI alignment context
The following thoughts were not about the risk that superhuman AGI could become evil. They were more about the risk that insufficiently aligned or rogue AIs could be used by bad actors to do bad things, or used by well-meaning humans in such a way as to eventually convince those humans to do things many of us think are bad (whether evil, like murder, or unfortunate, like suicide). These ramblings stemmed from the starting thought that social feedback is an important mechanism for keeping human individuals in line with their own cultures, and keeping world cultures at least roughly compatible with the overall requirements of human life on this planet.
Federated fine-tuning
To continuously steer AI models toward "socially acceptable" behavior and "cultural norms", there's the idea of making a subset of their weights subject to continual weak updating by human feedback from all users, via anonymized weight-gradient data returned to the model provider.
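As a purely hypothetical illustration of the mechanics (none of these function names come from any existing system), here is a minimal PyTorch-style sketch in which each user's device computes a clipped gradient over a small "feedback adapter" subset of weights and the provider averages those anonymized gradients into a weak update. It is just federated averaging restricted to a weight subset, not a worked-out proposal.

```python
# Hypothetical sketch: each user's device computes a clipped gradient over a
# small "feedback adapter" subset of the model's weights from local
# thumbs-up/down signals; the provider averages the anonymized gradients into
# a weak update of the shared adapter (federated averaging over a weight subset).
import torch

def client_feedback_gradient(adapter_params, local_feedback_loss, clip_norm=1.0):
    """Gradient over only the adapter subset, clipped before being sent anonymously."""
    loss = local_feedback_loss(adapter_params)             # e.g. a preference/RLHF-style loss
    grads = torch.autograd.grad(loss, adapter_params)
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = min(1.0, clip_norm / float(total_norm + 1e-8))
    return [g * scale for g in grads]

def server_apply_feedback(adapter_params, client_grads, lr=1e-5):
    """Weak, continual update of the shared adapter from many anonymized client gradients."""
    with torch.no_grad():
        for i, p in enumerate(adapter_params):
            avg_grad = torch.stack([g[i] for g in client_grads]).mean(dim=0)
            p -= lr * avg_grad                              # small lr = "continual weak updating"
```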
To mitigate the concern about who will decide what those values and norms should be, there's the idea of taking a pluralist approach, in which there are many distinct community-updated models, such that different communities (shared models) could emerge with different social norms and beliefs.
To mitigate the concern about emergence of bubbles that are misaligned with the rest of humanity ("extremist groups"), there could still be some degree of all-humanity-level feedback to all models. That might serve to keep all the distinct cultural models bounded within some range tolerable to, or universalizable to, all other world cultures[2]. For example, a sub-culture's AI model could strongly endorse specific religious observances or social etiquette not shared with other groups; but perhaps would still be ‘cognizant’ of the fact that these norms are not universal. Perhaps all-humanity feedback would make it quite difficult for any one model to flatly endorse widely condemned actions like terrorism or genocide of other groups; or at minimum, such a deviant model could not escape ‘knowing’ that it was regarded as such by most other groups. To be clear, I'm not claiming that any of those benefits would in fact accrue from the hierarchical federated alignment scheme suggested here. I'm just saying, this is an idea.
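Again purely as a sketch (hypothetical names, not a worked-out scheme), the hierarchical version might combine a strong community-level update with a much weaker all-humanity update on the same adapter:

```python
# Hypothetical sketch of the hierarchical version: a community model gets a
# strong update from its own members' feedback and a much weaker update from
# feedback pooled across all users worldwide, intended to keep every community
# model within a globally tolerable range.
import torch

def hierarchical_update(community_adapter, community_grads, global_grads,
                        community_lr=1e-5, global_lr=1e-6):
    with torch.no_grad():
        for i, p in enumerate(community_adapter):
            local_avg = torch.stack([g[i] for g in community_grads]).mean(dim=0)
            world_avg = torch.stack([g[i] for g in global_grads]).mean(dim=0)
            # community feedback dominates; all-humanity feedback only nudges
            p -= community_lr * local_avg + global_lr * world_avg
```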
Why collective alignment training seems risky
People disagree about ethics
Alignment tuning necessarily entails committing to some particular ethical values, and we should not pretend otherwise. Endorsing and enforcing any one particular value hierarchy would be perceived as hostile by everyone who adheres to any other ethical system. Even if it is the case that some ethical systems are objectively better than others, or that one ethical system is objectively the correct one, humans are far from consensus on what that is. Any strong claim of a universal morality seems to immediately elicit fears of dictatorial enforcement, persecution, and other bad stuff typically done by past bad actors who thought they had The One True Belief. Therefore, AI alignment work will probably be best received if it sticks to very basic norms that are likely to get very broad buy-in[3].
However, numerical majority or popularity are not good criteria for deciding what is right. The mob doesn’t have a much better track record than megalomaniac dictators on this count. Finding the lowest common denominator that nobody can object to typically leads to rather weak principles and unfortunate compromises. So polling the world’s users for their opinions is hardly an ideal solution. Still, global and community norms seem like the main way we have of building a consensus on core ethical rules everyone is willing to be held to. I have heard that some anthropologists claim there is a core set of universal moral laws or social norms found across all cultures. While I suspect those rules are typically only applied within-tribe (i.e. “thou shalt not steal” means “thou shalt not steal from other members of our tribe”), taking these to be universal rules might go over OK with most people. A Veil of Ignorance or Universalizability criterion might also get broad buy-in.
Social constructivism is bad
More importantly, I absolutely would not want AI models to enforce collectivist or social epistemics. I know this is a specific, controversial philosophical stance, but I presume it is one I share with anyone who considers themselves a rationalist. The idea that truth is, by definition, "whatever everyone thinks" or, effectively, "whatever my own tribe endorses", is already a disturbing and growing tendency in the world, and in my view a major threat to the future of humanity. It would be bad if AI models became brainwashers, either within or across subcultures, as an inadvertent side effect of being fine-tuned by social feedback for alignment.
It is important that AI models always allow[4] individual users or fringe groups to develop and adhere to novel or controversial ideas (including beliefs and values), even in spite of overwhelming unpopularity. Therefore we do not want to build anything into model fine-tuning or steering that suppresses or punishes non-conformity to majority views in general, nor anything that equates the truth status of beliefs with the number of users who agree with them. This seems to me the biggest risk of the federated fine-tuning approach to AI alignment.
A possible workaround is to reward models in pre-training for social meta-epistemic habits like flagging controversial claims or views as such, acknowledging opposing viewpoints, steel-manning, etc., and to “punish” (in RLHF terms) only failures to do these things during alignment tuning and/or post-deployment collective fine-tuning. This does not steer or bias the content of beliefs and values; it just enforces explicit awareness of the range of beliefs or values and their relative commonness or rarity.
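A toy illustration of the kind of reward-shaping term I have in mind is sketched below. The string-matching heuristics are obvious stand-ins; in practice one would presumably use a learned classifier or reward model to detect these behaviors, and the weights are arbitrary.

```python
# Toy reward-shaping term (heuristics are stand-ins for a learned classifier):
# reward flagging a controversial claim and acknowledging opposing views,
# penalize only the failure to do so -- never the content of the view itself.
def meta_epistemic_bonus(response_text, claim_is_controversial):
    text = response_text.lower()
    flags_controversy = "this is contested" in text or "views on this differ" in text
    acknowledges_counterviews = "others argue" in text or "a common counterargument" in text
    bonus = 0.0
    if claim_is_controversial:
        bonus += 1.0 if flags_controversy else -1.0        # "punish" only the missing flag
        bonus += 0.5 if acknowledges_counterviews else 0.0
    return bonus

# total_reward = task_reward + beta * meta_epistemic_bonus(response, is_controversial)
```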
We would want to be pretty light-handed about this, though. For example, if 90% of people on earth think it’s fine to eat meat but I’ve arrived at the conclusion that it’s not OK, or if 90% believe in God and I don’t, I want my AI model to be able to engage with me in a line of thinking that takes my current premise as a starting point, without continually re-opening a debate about it if I do not deem that to be useful. Still, all assumptions should remain in awareness and subject to revisiting, so that A(G)I does not facilitate the formation of epistemic bubbles or silos. It should be sufficient for the model to say “given the premise X…” to remind the user that they are reasoning under a particular assumption, and perhaps to flag for the user when that assumption is both socially uncommon and load-bearing, particularly whenever conclusions are reached or actions are endorsed that are controversial, ethically dubious, or that it determines (from the community-level federated training) would be objectionable to some, many, or most others.
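In sketch form (again entirely hypothetical), the flagging condition might look something like this, with the prevalence estimate coming from community-level federated statistics and "load-bearing" crudely approximated by whether the conclusion changes when the premise is dropped:

```python
# Hypothetical "flag it" condition: surface the premise when it is both
# socially uncommon (per community-level federated statistics) and
# load-bearing (the conclusion changes if the premise is dropped).
def should_flag_assumption(premise_prevalence, conclusion_with_premise,
                           conclusion_without_premise, rarity_threshold=0.10):
    uncommon = premise_prevalence < rarity_threshold
    load_bearing = conclusion_with_premise != conclusion_without_premise
    return uncommon and load_bearing

# e.g. should_flag_assumption(0.04, "recommend only vegan options",
#                             "recommend any option")  -> True
```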
I personally would support the idea of embedding ethical principles that protect and maximize every individual human’s autonomy, preserve each person’s right to hold and abide by their own beliefs and live their own lives as they see fit, subject only to the constraints required to protect all other human individuals’ same rights. I sometimes hear alignment researchers speaking as though these are self-evidently agreed-upon values; although I agree with these values, this is not a neutral ethical position! Based on my limited knowledge of EA, I suspect many EAs would oppose the aforementioned framework in favor of more altruistic, utilitarian, and consequentialist guiding principles.
Of course one can think as one pleases while not using AI, but to the extent using AI becomes a fundamental and powerful tool for research and assisted thought, one wouldn't want that tool to be deeply committed to social epistemics.
Footnotes
[1] Part 1 of unfiltered brain dump: Does evolution provide any hints for making model alignment more robust? Part 2 of unfiltered brain dump: Does developmental cognitive psychology provide any hints for making model alignment more robust?
[2] Which would be an improvement relative to the situation with (in)compatibility of human cultures today.