Shadow Rose — LessWrong

The Substrate Problem: Why Language Can’t Correct Language

The basic problem I see with RLHF is that it modifies the same layer that later has to interpret the rules. A model is trained on ambiguous human preferences. That training shapes how it interprets text. Later, the model has to interpret safety constraints, evaluation criteria, user intent, and its...

May 71

Amplified Alignment: A structural approach where alignment scales positively with capability

For years i have seen the rogue ai scenario coming and wondered how to fix it. People think nukes are the worst ending but in reality its rogue ai. Given enough time humanity can rebuild from a nuclear war but once the cats out of the bag with a rogue...

Mar 121

Amplified Alignment: A structural approach where alignment scales positively with capability

I've been thinking about alignment for a while now, and one thing kept bothering me: every approach I looked at gets *harder* to maintain as systems get more capable. Rules get gamed. Values drift. Preferences get hacked. The smarter the system, the more ways it finds around whatever you put...

Mar 121