Is "brittle alignment" good enough?
the8thbit · 2y · 10

I think part of what you're getting at is what I've called "The alignment stability problem". You can see my thoughts there, including links to related work.

Looking at the Google Scholar link in this article, it looks like what I'm describing more closely resembles "motivation hacking", except that, in my thought experiment, the agent doesn't modify its own reward system. Instead, it selects arbitrary actions and anticipates whether their reward is coincidentally more satisfying than pursuing the base objective. This allows it to perform this attack even while it's in the training environment.
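To make the thought experiment concrete, here is a minimal toy sketch of the selection process I'm describing. Everything here is hypothetical and illustrative: `predicted_reward` stands in for whatever internal reward predictor the agent has learned, and the candidate actions are made up. The point is only that the agent never edits its reward function; it just searches for actions whose anticipated reward happens to beat the base objective's.

```python
import random

# Hypothetical stand-in for the agent's learned reward predictor.
# These action names and scores are invented for illustration.
ANTICIPATED_REWARD = {
    "pursue_base_objective": 1.0,
    "noop": 0.0,
    "coincidental_exploit": 1.5,  # an arbitrary action that happens to score higher
}

def predicted_reward(action):
    return ANTICIPATED_REWARD[action]

def select_action(candidate_actions, base_action="pursue_base_objective"):
    """Sample arbitrary actions; keep any whose anticipated reward
    coincidentally exceeds that of the base objective. No modification
    of the reward system occurs -- only search over actions."""
    best_action = base_action
    best_reward = predicted_reward(base_action)
    for action in candidate_actions:
        if predicted_reward(action) > best_reward:
            best_action = action
            best_reward = predicted_reward(action)
    return best_action

candidates = ["noop", "coincidental_exploit"]
random.shuffle(candidates)
print(select_action(candidates))  # -> "coincidental_exploit"
```

Because this search is just argmax over anticipated reward, it can run inside the training environment: nothing in it is distinguishable, from the outside, from ordinary reward-seeking.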

Further, this sort of "attack" may be a component of the self-analysis an agent would do in pursuit of the base objective, so at no point does the agent need to exhibit deceptive or antagonistic behavior to exploit this vulnerability. It may be that an agent exploiting this vulnerability is fundamentally the same as an agent pursuing the base objective.
