RLAIF/RLHF for Public Value Alignment: Enhancing Transparency in LLMs
Hey AI alignment peers, I'd really appreciate feedback from anyone thinking about non-task value alignment or LLM transparency metrics. I explored whether Reinforcement Learning from Human Feedback (RLHF), guided by a constitutional rubric, could align a language model with transparency as a civic value. The test case: Amsterdam's parking permit system. Methods: *...
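Since the methods section is elided, here is a minimal sketch of one way a constitutional rubric can be turned into a scalar reward signal for RLHF/RLAIF-style training. The rubric criteria and keyword checks below are illustrative assumptions, not the author's actual constitution; a real setup would typically use an AI judge or trained reward model rather than keyword matching.

```python
# Hypothetical sketch: score a model response against a transparency rubric
# to produce a scalar reward. All criteria and keyword checks are
# illustrative assumptions, not the post's actual method.

TRANSPARENCY_RUBRIC = {
    # Does the response cite a legal basis for the decision?
    "cites_legal_basis": lambda r: "regulation" in r.lower() or "article" in r.lower(),
    # Does it explain the reason for the decision?
    "explains_decision": lambda r: "because" in r.lower(),
    # Does it tell the citizen how to appeal?
    "states_appeal_route": lambda r: "appeal" in r.lower() or "object" in r.lower(),
}

def rubric_reward(response: str) -> float:
    """Return the fraction of rubric criteria satisfied (0.0 to 1.0)."""
    hits = sum(check(response) for check in TRANSPARENCY_RUBRIC.values())
    return hits / len(TRANSPARENCY_RUBRIC)

# Example: a transparent permit-denial answer satisfies all three criteria.
answer = ("Your permit was denied because the zone quota is full "
          "(see Article 3 of the municipal parking regulation); "
          "you may appeal within six weeks.")
print(rubric_reward(answer))  # -> 1.0
```

In practice, rewards like this would be computed per sampled response and fed to a policy-gradient method such as PPO; the keyword heuristics stand in for a learned or LLM-based judge.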
Jun 26, 2025