Oliver Klingefjord — LessWrong

Model Integrity and Character

Published a post about integrity as a frame for trustworthiness in AI alignment, and how it relates to Claude's new constitution. Cross-posting as I think this might interest folk here. All feedback welcome!

In October 1962, Soviet submarine B-59 was trapped underwater by American destroyers dropping depth charges. The captain,...

Model Integrity

by ryan.lowe, Oliver Klingefjord, and Joe Edelman

Hi! My collaborators at the Meaning Alignment Institute put out some research yesterday that may interest folk here. The core idea is introducing 'model integrity' as a frame for outer alignment. It leverages the intuition that "most people would prefer a compliant assistant, but a cofounder with integrity." It makes...

Dec 6, 2024