Request for Comments
I am currently formalizing the syntax for the Coherence_Check function and defining the minimal ontology required for general-purpose LLM guardrails.
Abstract
Current alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF), treat safety primarily as a probabilistic reward-optimization problem. While effective for shaping style and tone, this approach fails to prevent "hallucinations" in out-of-distribution (OOD) scenarios because it provides no mechanism for runtime verification.
This proposal introduces Relational Coherence Theory (RCT) as a neuro-symbolic architectural layer. Unlike standard Chain-of-Thought prompting, which remains probabilistic, RCT functions as a deterministic logic gate: a recursive verification step, the "Ontological Guardrail," that validates the logical coherence of subject-object relationships before token generation.
We argue that...
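Since the Coherence_Check syntax is still being formalized (see above), the following minimal Python sketch only illustrates the intended gating behavior. The snake_case name coherence_check, the triple-set ONTOLOGY, and the generate_with_guardrail driver are illustrative placeholders of my own, not finalized RCT syntax.

```python
# Minimal sketch of the proposed Coherence_Check gate. All names here are
# placeholder assumptions; the RCT syntax is still under formalization.

from typing import Iterable

# Toy ontology: the set of (subject, relation, object) triples the system
# treats as logically coherent. A real deployment would back this with a
# knowledge base rather than a hand-written set.
ONTOLOGY: set[tuple[str, str, str]] = {
    ("water", "boils_at", "100C"),
    ("Paris", "capital_of", "France"),
}

def coherence_check(subject: str, relation: str, obj: str) -> bool:
    """Deterministic logic gate: True iff the triple is licensed by the ontology."""
    return (subject, relation, obj) in ONTOLOGY

def generate_with_guardrail(candidates: Iterable[tuple[str, str, str]]) -> list[str]:
    """Validate each candidate claim before emitting it; block incoherent ones."""
    emitted = []
    for subject, relation, obj in candidates:
        if coherence_check(subject, relation, obj):
            emitted.append(f"{subject} {relation} {obj}")
        else:
            # The guardrail fires: the claim never reaches token generation.
            emitted.append(f"[BLOCKED: incoherent triple {subject}/{relation}/{obj}]")
    return emitted

if __name__ == "__main__":
    candidates = [
        ("water", "boils_at", "100C"),       # coherent, passes the gate
        ("Paris", "capital_of", "Germany"),  # incoherent, blocked
    ]
    for line in generate_with_guardrail(candidates):
        print(line)
```

A plain set-membership test is used here deliberately: it keeps the gate fully deterministic, mirroring the "deterministic logic gate" framing above, in contrast to a learned (probabilistic) filter.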