I like this overall direction for how simple and robust it is. One challenge I see is that the latent capability for misalignment is still deeply ingrained in the model, and this could be abused by a bad actor even if the model itself doesn’t abuse it. For example, a user could make the model simulate a misaligned human/AI and use the simulated output to drive a local agent chassis. One way around this would be to simply not show misaligned output to the user, but this wouldn’t defend against cases where someone (e.g. a hacker, an employee) gets access to t...