Joseph Banks's Shortform
Joseph Banks · 4h

Is Open Alignment Research Creating an Infohazard?

"The public dissemination of alignment research, methodologies, and discussions about it create a corpus of data, within the global dataset, that forms an "adversarial manual” for misaligned AI systems to follow in order to avoid detection and carry out their own misaligned objectives." This is general premise of my post called The Alignment Paradox: Why Transparency Can Breed Deception. 

Beyond the general problem, I'm curious about the community's take on specific mitigation strategies. What coordination mechanisms could allow for necessary research collaboration without feeding this adversarial manual?
