Is Open Alignment Research Creating an Infohazard?
"The public dissemination of alignment research, methodologies, and discussions about it create a corpus of data, within the global dataset, that forms an "adversarial manual” for misaligned AI systems to follow in order to avoid detection and carry out their own misaligned objectives." This is general premise of my post called The Alignment Paradox: Why Transparency Can Breed Deception.
Beyond the general problem, I'm curious about the community's take on specific mitigation strategies. What coordination mechanisms could allow for necessary research collaboration without feeding this adversarial manual?