RLAIF/RLHF for Public Value Alignment: Enhancing Transparency in LLMs
Hey AI alignment peers, I'd really appreciate feedback from anyone thinking about non-task value alignment or LLM transparency metrics. I explored whether Reinforcement Learning from Human Feedback (RLHF), guided by a constitutional rubric, could align a language model with transparency as a civic value. The test case: Amsterdam's parking permit system. Methods: *...
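Since the methods section is elided, here is a minimal sketch of one way a constitutional rubric can be turned into a scalar reward signal for RLHF/RLAIF-style training. The rubric criteria and keyword checks below are illustrative assumptions, not the author's actual constitution; a real setup would typically use an AI judge or trained reward model rather than keyword matching.

```python
# Hypothetical sketch: score a model response against a transparency rubric
# to produce a scalar reward. All criteria and keyword checks are
# illustrative assumptions, not the post's actual method.

TRANSPARENCY_RUBRIC = {
    # Does the response cite a legal basis for the decision?
    "cites_legal_basis": lambda r: "regulation" in r.lower() or "article" in r.lower(),
    # Does it explain the reason for the decision?
    "explains_decision": lambda r: "because" in r.lower(),
    # Does it tell the citizen how to appeal?
    "states_appeal_route": lambda r: "appeal" in r.lower() or "object" in r.lower(),
}

def rubric_reward(response: str) -> float:
    """Return the fraction of rubric criteria satisfied (0.0 to 1.0)."""
    hits = sum(check(response) for check in TRANSPARENCY_RUBRIC.values())
    return hits / len(TRANSPARENCY_RUBRIC)

# Example: a transparent permit-denial answer satisfies all three criteria.
answer = ("Your permit was denied because the zone quota is full "
          "(see Article 3 of the municipal parking regulation); "
          "you may appeal within six weeks.")
print(rubric_reward(answer))  # -> 1.0
```

In practice, rewards like this would be computed per sampled response and fed to a policy-gradient method such as PPO; the keyword heuristics stand in for a learned or LLM-based judge.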
Jun 26, 2025