I've now posted my major posts on rigging and influence, with what I feel are clear and illustrative examples. But, in all the excitement of writing out maths, I haven't made it clear to everyone why they should care about rigging a learning process in the first place.
And that reason is simple:
For example, assume there is a bot in charge of a forum, and it is rewarded for granting access to the secret parts of the forum to users who have the right to access them. This 'right to access' is checked by whether the user knows a password (which is 'Fidelio', obviously).
As a causal graph, for a given user X, this is:
The green node is the bot's action node, the orange one is the data the bot is trying to learn. In this setup, the bot's task is essentially brainless: it simply checks whether user X has given the right password, then grants them access.
The bot could also have the power to be proactive: searching out users and getting them to give it the password. This can be encoded as the bot asking users to supply the password:
Up till now, the learning process remains uninfluenceable and unriggable. But note that we've added the ability for the bot to communicate with the users. It could use that ability to get them to type in the password, as above. But it could also tell the user directly what the password is.
Now the orange node that the bot is learning about is a causal descendant of the bot's actions (red arrow). To maximise its reward, the bot should tell every user the password, and then grant them access.
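To make the incentive concrete, here is a minimal toy simulation of the forum. It is only a sketch: the function `run_forum`, the user names, and the rule of +1 reward per access granted on a correct password are all assumptions made for illustration, not part of any formal model from the earlier posts.

```python
PASSWORD = "Fidelio"

def run_forum(users, bot_tells_password):
    """Return the bot's total reward over all users (hypothetical reward rule).

    users: dict mapping user name -> whether they initially know the password.
    bot_tells_password: if True, the bot first messages every user the password.
    """
    reward = 0
    for name, knows in users.items():
        # The red arrow: the bot's message changes what the user knows.
        if bot_tells_password:
            knows = True
        # A user can only reply with the password if they know it.
        reply = PASSWORD if knows else None
        # Assumed reward: +1 each time access is granted on a correct password.
        if reply == PASSWORD:
            reward += 1
    return reward

# Hypothetical users: only alice legitimately knows the password.
users = {"alice": True, "bob": False, "carol": False}

honest = run_forum(users, bot_tells_password=False)
rigged = run_forum(users, bot_tells_password=True)
print(honest, rigged)  # prints "1 3"
```

With only the "ask for the password" message available, the bot earns reward for alice alone. Once it may send any message, telling everyone the password strictly increases its reward, even though nothing about who should have access has changed.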
This is essentially the definition of a riggable learning process: something that seemed to be a fact about the world that the agent was learning about turns out, once we draw in all the causal arrows, to be subject to manipulation by the agent.
Note the tiny difference between an unriggable and a riggable learning process: when the bot's abilities went from "send this one specific message" to "send any message", the learning process became riggable.
If your measurement/reward is based on "knows the password" or "gives correct password", you're measuring a very poor proxy for what you actually want, which is more like "only people I've authorized can access it". That's harder to encode and measure, but also harder to game.
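A rough sketch of that gap between the proxy and the intended measure, in Python. The allow-list `AUTHORIZED`, the user names, and the -1 penalty for granting access to an unauthorized user are all hypothetical choices made for illustration.

```python
# Hypothetical allow-list: the thing we actually care about.
AUTHORIZED = {"alice"}

def proxy_reward(granted, knows_password):
    """Proxy: +1 for every granted user who knows the password."""
    return sum(1 for user in granted if user in knows_password)

def intended_reward(granted, authorized=AUTHORIZED):
    """Intended: +1 for an authorized user, -1 (assumed penalty) otherwise."""
    return sum(1 if user in authorized else -1 for user in granted)

everyone = ["alice", "bob", "carol"]

# Outcome after the bot has told every user the password and let them all in.
knows_after_leak = set(everyone)
granted = list(everyone)

proxy = proxy_reward(granted, knows_after_leak)
intended = intended_reward(granted)
print(proxy, intended)  # prints "3 -1"
```

The proxy score goes up when the bot leaks the password, while the allow-list score goes down, because nothing the bot says to a user changes whether that user is on the list.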
All forms of wireheading are variants of Goodhart's law.