RLAIF/RLHF for Public Value Alignment
Enhancing Transparency in LLMs

Jada42

Hey all — I just submitted my Master’s thesis, and would really appreciate feedback from people thinking about non-task value alignment or LLM transparency metrics.

I explored whether Reinforcement Learning from Human Feedback (RLHF), using a constitutional rubric, could align a language model with transparency as a civic value. The case: Amsterdam’s parking permit system.

Methods:

Model: Mistral-7B-Instruct v3
Training: PPO using a 6-dimensional transparency rubric (Explainability, Completeness, Source Attribution, Procedural Clarity, Accuracy, Limitation Disclosure)
Baseline: SFT using ~300 examples
Evaluator: Independent LLM Judge (Claude Sonnet 4 & ChatGPT-4o) scoring per-dimension across 300 held-out citizen queries
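
For readers wondering how a six-dimensional rubric can become the single scalar reward that PPO needs, here is a minimal sketch. It assumes each dimension is scored in [0, 1] and combined by a weighted mean; the score range, the equal default weights, and the function name `rubric_reward` are illustrative assumptions, not the configuration actually used in the thesis.

```python
from typing import Dict, Optional

# The six rubric dimensions named in the post.
DIMENSIONS = (
    "explainability",
    "completeness",
    "source_attribution",
    "procedural_clarity",
    "accuracy",
    "limitation_disclosure",
)


def rubric_reward(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Collapse per-dimension scores into one scalar reward (weighted mean).

    Assumption: every score lies in [0, 1]; weights default to equal.
    """
    if weights is None:
        weights = {dim: 1.0 for dim in DIMENSIONS}

    missing = [dim for dim in DIMENSIONS if dim not in scores]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")

    for dim in DIMENSIONS:
        if not 0.0 <= scores[dim] <= 1.0:
            raise ValueError(f"score for {dim} is outside [0, 1]")

    total_weight = sum(weights[dim] for dim in DIMENSIONS)
    weighted_sum = sum(weights[dim] * scores[dim] for dim in DIMENSIONS)
    return weighted_sum / total_weight
```

With a scheme like this, the relative weights decide which dimensions the policy is pushed toward, so a low weight on accuracy would be one possible route to the over-hedging described in the results.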

Key Results

RLHF win rate: 61% vs SFT baseline (p < 0.001)
Biggest gains:
- +21.6% Information Completeness
- +19.5% Source Attribution
Accuracy dropped by 6.6%, mostly due to over-hedging (e.g. "This may be different for your city", despite giving city-specific information)
81% of reward signal came from explicit constitutional constraints, not emergent learning (according to the Optuna trials)
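
As a sanity check on the headline number, a win rate can be tested against the 50% chance level with an exact binomial test. The sketch below assumes 300 pairwise comparisons with no ties, so that 61% corresponds to 183 wins; the post does not say which test was used or how ties were handled, so treat this as one plausible reading and not as the thesis's actual procedure.

```python
from math import comb


def two_sided_binomial_p(wins: int, n: int) -> float:
    """Exact two-sided p-value under H0: each comparison is a fair coin flip.

    Because Binomial(n, 0.5) is symmetric, the two-sided p-value is twice
    the tail probability at or beyond the more extreme count, capped at 1.
    """
    extreme = max(wins, n - wins)
    tail = sum(comb(n, k) for k in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


n_queries = 300                  # held-out citizen queries, from the post
wins = round(0.61 * n_queries)   # assumption: no ties, so 183 wins
p_value = two_sided_binomial_p(wins, n_queries)
print(f"wins={wins}, p={p_value:.2e}")
```

Under these assumptions the p-value falls below 0.001, which is consistent with the reported significance level.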

Takeaways

RLHF can encode democratic values — but needs deliberate constraints
There’s a fundamental tradeoff: transparency ≠ precision
Technical alignment isn’t enough — civic values require governance, citizen input, and deliberation

Would love feedback — especially on:

whether this rubric-based evaluator is generalizable
how others might have measured non-task values like “transparency”
if this work might overlap with existing alignment metric toolkits

I’ll post the paper & code when cleaned up.

Abstract

Large Language Models in government services often lack alignment with democratic values, particularly transparency, risking opaque AI-citizen interactions that undermine public trust. This thesis investigates whether Reinforcement Learning from Human Feedback (RLHF) can systematically embed transparency in government AI systems.

Using Amsterdam's parking regulations as a critical case, we fine-tuned a Mistral-7B model through constitutional RLHF, where a six-dimensional transparency rubric guided Proximal Policy Optimization. This approach was benchmarked against supervised fine-tuning across 300 citizen queries, with an independent LLM evaluating transparency outcomes.

The RLHF model achieved a 61% win rate (p < .001), with substantial improvements in Information Completeness (+21.6%) and Source Attribution (+19.5%). Critically, explicit constitutional constraints dominated the reward signal (81%), suggesting that democratic values require deliberate encoding rather than emergent learning. However, transparency gains came with a 6.6% decrease in factual accuracy, revealing an inherent tension between comprehensive disclosure and precision.

These findings show that while RLHF offers a viable framework for value alignment in government AI, it cannot resolve fundamental democratic trade-offs through technical means alone. Responsible deployment requires integrating technical alignment with institutional governance, citizen participation, and explicit acknowledgment that competing public values, here transparency versus accuracy, require democratic deliberation rather than algorithmic optimization.

*Note: The present figure shows each of the three phases and the respective element*

The choice of RLAIF over traditional human feedback approaches was mainly motivated by the need for scalable and consistent evaluation of complex transparency criteria. Government transparency encompasses multiple dimensions that require systematic assessment, making AI-based evaluation particularly suitable for this domain while acknowledging the limitations of not including human validation in the current study (Adam, 2025).

Dataset Construction and Preparation

A comprehensive knowledge base of Amsterdam parking regulations was constructed through systematic web scraping of official government sources. The scraping pipeline, implemented in R using the rvest package, targeted content from 5 pages across the English portal (https://www.amsterdam.nl/en/parking/apply-resident-parking-permit/) and the Dutch portal (https://www.amsterdam.nl/parkeren-verkeer/), the latter for Dutch-only documents (e.g. Blue Zones, parking locations, new-build project rules).

The scraping process adhered to ethical web scraping practices including robots.txt compliance and implementation of 2-second delays between requests to minimize server load. Quality assurance procedures included validation of content length (minimum 100 characters), verification of parking-related content through keyword matching, and manual review by a native Dutch speaker to ensure accuracy and completeness.
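
The automated part of these quality checks can be sketched as a small filter. The original pipeline was written in R with rvest; the version below is a Python stand-in for illustration only. The 100-character minimum comes from the text above, while the keyword list and the function name `passes_quality_checks` are hypothetical.

```python
import re
from typing import Sequence

MIN_CHARS = 100  # minimum content length stated in the text

# Hypothetical keyword list; the actual keywords are not given in the post.
PARKING_KEYWORDS = ("parking", "permit", "parkeren", "vergunning")


def passes_quality_checks(
    text: str,
    keywords: Sequence[str] = PARKING_KEYWORDS,
    min_chars: int = MIN_CHARS,
) -> bool:
    """Return True if a scraped page is long enough and looks parking-related."""
    # Collapse whitespace so padding does not count toward the length check.
    cleaned = re.sub(r"\s+", " ", text).strip()
    if len(cleaned) < min_chars:
        return False

    # Case-insensitive substring match against the keyword list.
    lowered = cleaned.lower()
    return any(keyword in lowered for keyword in keywords)
```

Pages that pass a filter like this would still go to the manual review step, since keyword matching only confirms the topic and says nothing about accuracy or completeness.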

Keywords: AI governance, transparency, RLHF, constitutional AI, public administration, value alignment
