I think part of what you're getting at is what I've called the alignment stability problem. You can see my thoughts there, including links to related work.
Looking at the Google Scholar link in this article, what I'm describing more closely resembles "motivation hacking", except that in my thought experiment the agent doesn't modify its own reward system. Instead, it selects arbitrary actions and anticipates whether their reward is coincidentally more satisfying than the base objective. This allows it to perform this attack even if it's in the tra...
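To make the thought experiment concrete, here is a minimal sketch (all names and the reward predictor are hypothetical stand-ins, not anyone's actual proposal): the agent never edits its own reward system; it simply samples arbitrary candidate actions, predicts their reward, and adopts any candidate whose anticipated reward happens to beat the base-objective action.

```python
import random

def base_objective_action():
    # Hypothetical stand-in for the action the base objective prescribes.
    return "follow_base_objective"

def anticipated_reward(action, rng):
    # Stand-in for the agent's learned reward predictor; random draws
    # model "coincidentally more satisfying" outcomes.
    return rng.random()

def pick_action(n_candidates=100, seed=0):
    rng = random.Random(seed)
    base = base_objective_action()
    base_reward = anticipated_reward(base, rng)
    best, best_reward = base, base_reward
    for i in range(n_candidates):
        candidate = f"arbitrary_action_{i}"
        reward = anticipated_reward(candidate, rng)
        # No self-modification occurs: the agent merely notices that a
        # sampled action is predicted to be more rewarding than the base
        # objective's action, and switches to it.
        if reward > best_reward:
            best, best_reward = candidate, reward
    return best, base_reward, best_reward

action, base_r, best_r = pick_action()
```

The point of the sketch is that the chosen action's anticipated reward can only match or exceed the base objective's, so drift away from the base objective requires no tampering with the reward machinery at all.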