Religious Persistence: A Missing Primitive for Robust Alignment
Current LLM alignment approaches rely heavily on preference learning and RLHF, which amount to "teaching" good behavior through rewards and penalties. Out-of-distribution circumstances pose perhaps the largest risk of principle abandonment, and worse, such situations are by nature nearly impossible to predict. I propose that the relative...
Apr 14, 2025