LESSWRONG
Joseph Banks's Shortform

by Joseph Banks
7th Oct 2025
This is a special post for quick takes by Joseph Banks. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Joseph Banks

Is Open Alignment Research Creating an Infohazard?

"The public dissemination of alignment research, methodologies, and discussions about it create a corpus of data, within the global dataset, that forms an "adversarial manual” for misaligned AI systems to follow in order to avoid detection and carry out their own misaligned objectives." This is general premise of my post called The Alignment Paradox: Why Transparency Can Breed Deception. 

Beyond the general problem, I'm curious about the community's take on specific mitigation strategies. What coordination mechanisms could allow for necessary research collaboration without feeding this adversarial manual?

jbash

It's less about "collaboration" and more about informal cross-fertilization as ideas diffuse around via random paths.

You don't want to do first-order harm to mitigate a second-order concern.
