Hey all — I just submitted my Master’s thesis, and would really appreciate feedback from people thinking about non-task value alignment or LLM transparency metrics.
I explored whether Reinforcement Learning from Human Feedback (RLHF), using a constitutional rubric, could align a language model with transparency as a civic value. The case: Amsterdam’s parking permit system.
Methods:
- Model: Mistral-7B-Instruct-v0.3
- Training: PPO with a 6-dimensional transparency rubric as the reward signal (Explainability, Completeness, Source Attribution, Procedural Clarity, Accuracy, Limitation Disclosure); a minimal reward-shaping sketch follows this list
- Baseline: SFT using ~300 examples
- Evaluator: Independent LLM Judge (Claude Sonnet 4) scoring per-dimension across 300 held-out citizen queries
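For anyone curious how the rubric becomes a training signal: below is a minimal sketch of the reward shaping, assuming each response is scored 0–10 per dimension and the scores are averaged with equal weights into the scalar reward PPO maximizes. The dimension names are from the thesis; the stubbed scorer, the 0–10 scale, and the equal weighting are illustrative assumptions, not the exact implementation.

```python
# The six rubric dimensions from the thesis.
DIMENSIONS = [
    "Explainability",
    "Completeness",
    "Source Attribution",
    "Procedural Clarity",
    "Accuracy",
    "Limitation Disclosure",
]

def score_dimensions(query: str, response: str) -> dict[str, float]:
    """Placeholder for the rubric scorer (e.g. an LLM judge returning a 0-10
    score per dimension). Stubbed so the sketch runs end to end."""
    return {dim: 5.0 for dim in DIMENSIONS}  # hypothetical neutral scores

def rubric_reward(query: str, response: str,
                  weights: dict[str, float] | None = None) -> float:
    """Collapse the per-dimension scores into one scalar reward for PPO.
    Equal weights are an assumption; the thesis may weight dimensions differently."""
    scores = score_dimensions(query, response)
    weights = weights or {dim: 1.0 for dim in DIMENSIONS}
    total_weight = sum(weights.values())
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight

if __name__ == "__main__":
    r = rubric_reward(
        "How do I apply for a parking permit in Amsterdam?",
        "You can apply online via the municipality's portal ...",
    )
    print(f"scalar reward fed to PPO: {r:.2f}")
```

In the actual training loop this scalar is what the PPO trainer optimizes per generated response; the stub is only there to keep the example self-contained.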
Key Results:
- RLHF win rate: 61% vs the SFT baseline (p < 0.001; a quick significance sanity-check is sketched at the end of this post)
- Biggest gains:
  - +21.6% Information Completeness
  - +19.5% Source Attribution
- Accuracy dropped by 6.6%, mostly due to over-hedging in the RLHF responses
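For reference, the win-rate significance can be sanity-checked with a two-sided binomial test, assuming the judge's per-query preferences over the 300 held-out queries are treated as independent Bernoulli trials against a 50/50 null (the thesis may use a different test and handle ties explicitly):

```python
from scipy.stats import binomtest

n_queries = 300                       # held-out citizen queries judged pairwise
rlhf_wins = round(0.61 * n_queries)   # 61% win rate -> ~183 wins (ties ignored here)

# Two-sided binomial test against the 50/50 null (no preference between models).
result = binomtest(rlhf_wins, n_queries, p=0.5, alternative="two-sided")
print(f"wins={rlhf_wins}/{n_queries}, p-value={result.pvalue:.5f}")  # ~1e-4, i.e. p < 0.001
```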