
Current LLM alignment approaches rely heavily on preference learning and RLHF, which fundamentally amounts to “teaching” good behavior through rewards and penalties.
Out-of-distribution circumstances pose perhaps the largest risk of principle abandonment. Worse yet, these situations are, by their nature, nearly impossible to predict.
I propose that the relative stability of human religious frameworks across a diverse set of conditions offers an as-yet-unexplored structure for alignment.
Table of Contents:
- Fragility of Utility Maximization
- The Biological Grounding Problem
- Transcendent Motivation Models (TMMs)
- Constitutional AI vs Intrinsic Value Systems
- Causal Narrative Conditioning
- Criticisms + Future Work
Fragility of Utility Maximization
As a quick aside, I’d like to briefly mention some of the issues with expected utility maximization (EU-max).
Expected utility can be gamed by sufficiently intelligent systems. When a...
I’ve been running a set of micro-evals for my own workflows, but I think the main obstacle is the fuzziness of real-life tasks. Chess has some really nice failure signals, plus you get metrics like centipawn loss for free.
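For concreteness, here’s a minimal sketch of why chess hands you that number for free: centipawn loss is just the gap between the engine’s eval of the best move and the eval after the move actually played. This assumes python-chess and a local UCI engine; the "stockfish" path, depth, and the example move are assumptions, not anything canonical.

```python
# Minimal sketch: centipawn loss for a single move, using python-chess.
# Assumes a local UCI engine binary; the "stockfish" path and depth=12
# are placeholder assumptions -- swap in whatever engine/settings you use.
import chess
import chess.engine

def centipawn_loss(board: chess.Board, played: chess.Move,
                   engine: chess.engine.SimpleEngine, depth: int = 12) -> int:
    limit = chess.engine.Limit(depth=depth)
    # Best-case eval for the side to move, before the move is played.
    best = engine.analyse(board, limit)["score"].relative.score(mate_score=100_000)
    # Eval after the played move; 'relative' is now the opponent's view, so negate.
    board.push(played)
    after = -engine.analyse(board, limit)["score"].relative.score(mate_score=100_000)
    board.pop()
    return max(0, best - after)

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
try:
    board = chess.Board()
    # e.g. score a deliberately bad opening move
    print(centipawn_loss(board, chess.Move.from_uci("f2f3"), engine))
finally:
    engine.quit()
```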
It takes significant mental effort to look at your own job/work and create micro-benchmarks that aren’t completely subjective. The trick that’s helped me the most is to steal a page from test-driven development (TDD):
- Write the oracle first; if I can’t evaluate the output as true/false, the task is too mushy
- Shrink the scope until it breaks cleanly
- Iterate like unit tests: add new edge cases whenever a model slips through or reward-hacks
The payoff is being able to build clean dashboards that tell you “XYZ model passes 92% of this subcategory of tests”.
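To make that concrete, here’s a minimal sketch of the oracle-first pattern. Everything here is a stand-in: `run_model()` is a hypothetical placeholder for whatever API/CLI you actually call, and the invoice-extraction cases are made-up examples. The point is the shape: each case is a strict pass/fail check, and every time a model slips through or reward-hacks, that failure becomes a new case, exactly like growing a unit-test suite.

```python
# Oracle-first micro-eval sketch. run_model() and the cases below are
# hypothetical placeholders; substitute your own task and model call.
import json
import re

def run_model(prompt: str) -> str:
    """Placeholder for whatever API/CLI you actually call (assumption)."""
    raise NotImplementedError

# Example micro-eval subcategory: "extract the invoice total as a number".
CASES = [
    {"prompt": "Invoice: 3 widgets at $4.50 each. What is the total? Reply with the number only.",
     "oracle": lambda out: abs(float(re.sub(r"[^\d.]", "", out)) - 13.50) < 1e-6},
    # Edge case added after a model returned "$13.5 (approx)" and slipped past a looser check:
    {"prompt": "Invoice total is thirteen dollars fifty. Reply with digits only, e.g. 13.50.",
     "oracle": lambda out: out.strip() == "13.50"},
]

def run_suite(cases) -> float:
    passed = 0
    for case in cases:
        try:
            ok = bool(case["oracle"](run_model(case["prompt"])))
        except Exception:
            ok = False  # malformed output counts as a plain failure, not an error
        passed += ok
    return passed / len(cases)

if __name__ == "__main__":
    # This pass_rate is the number that feeds the per-subcategory dashboard.
    print(json.dumps({"subcategory": "invoice-totals", "pass_rate": run_suite(CASES)}))
```

Treating malformed output as an ordinary failure (rather than a separate error state) keeps every case strictly binary, which is what makes the pass-rate number trustworthy enough to put on a dashboard.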