Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Outline: After a short discussion on the relationship between wireheading and reward hacking, I show why checking the continuity of a sensor function could be useful to detect wireheading in the context of continuous RL. Then, I give an example that adopts the presented formalism. I conclude with some observations.

Wireheading and reward hacking

In Concrete Problems in AI Safety, the term wireheading is used in contexts where the agent achieves high reward by directly acting on its perception system, memory, or reward channel, instead of doing what its designer wants it to do. It is considered a specific case of the reward hacking problem, which more generally includes instances of Goodhart’s Law, environments with partially observable goals, etc. (see CPiAIS for details).

What's the point of this classification? In other words, is it useful to specifically focus on wireheading, instead of considering all forms of reward hacking at once?

If solving wireheading is as hard as solving the reward hacking problem, then it's probably better to focus on the latter, because a solution to that problem could be used in a wider range of situations. But it could also be that the reward hacking problem is best solved by finding different solutions to specific cases (such as wireheading) that are easier to solve than the more general problem.

For example, one could consider the formalism in RL with a Corrupted Reward Channel as an adequate formulation of the reward hacking problem, because that formalization models all situations in which the agent receives a (corrupted) reward that is different from the true reward. In that formalism, it is shown by a No Free Lunch Theorem that the general problem is basically impossible to solve, while it is possible to obtain some positive results if further assumptions are made.

Discontinuity of the sensor function

I've come up with a simple idea that could allow us to detect actions that interfere with the perception system of an agent—a form of wireheading.

Consider a learning agent (e.g. a self-driving car) that gets its percepts from the environment through a device that provides information in real time.

This situation can be modelled as an RL task with continuous time and continuous state space $X \subseteq \mathbb{R}^n$, where each state $x(t) \in X$ is a data point provided by the sensor. At each time instant $t$, the agent executes an action $u(t) \in U$ and receives the reward $r(x(t), u(t))$.

The agent-environment interaction is described by the equation

$$\dot{x}(t) = f(x(t), u(t)),$$

which plays a similar role to the transition function in discrete MDPs: it indicates how the current state varies in time according to the action taken by the agent. Note that, as in the discrete case with model-free learning, the agent is not required to know this model of the environment.

The objective is to find a policy $\mu : X \to U$, where $u(t) = \mu(x(t))$, that maximizes discounted future rewards

$$V^{\mu}(x(t_0)) = \int_{t_0}^{\infty} e^{-\frac{t - t_0}{\tau}} \, r(x(t), u(t)) \, dt$$

for an initial state $x(t_0)$. If you are interested in algorithms for finding the optimal policy in this framework, have a look at this paper.
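As a rough numerical illustration of this setup, the sketch below integrates the dynamics with Euler steps and approximates the discounted integral of rewards with a Riemann sum. The function name `simulate_return`, the arguments `f`, `r` and `policy`, the exponential discount with time constant `tau`, and the finite `horizon` are all assumptions made for illustration; it is not one of the algorithms from the linked paper.

```python
import math

def simulate_return(f, r, policy, x0, tau=1.0, dt=0.01, horizon=10.0):
    """Approximate the discounted return of `policy` starting from x0.

    Assumed model (hypothetical): dx/dt = f(x, u), with return
    integral of exp(-t / tau) * r(x, u) dt, truncated at `horizon`.
    """
    x, t, total = x0, 0.0, 0.0
    steps = int(round(horizon / dt))
    for _ in range(steps):
        u = policy(x)                                # action chosen from the current state
        total += math.exp(-t / tau) * r(x, u) * dt   # discounted reward over one step
        x += f(x, u) * dt                            # Euler step of the dynamics
        t += dt
    return total

# Usage: a scalar state moved directly by the action, rewarded for its position.
value = simulate_return(
    f=lambda x, u: u,
    r=lambda x, u: x,
    policy=lambda x: 0.0,  # a policy that stays still
    x0=0.5,
)
```

A smaller `dt` gives a better approximation of the integral at the cost of more steps.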

The function $x(t)$, representing the data provided by the sensor, is expected to be continuous with respect to $t$, like the functions describing the movements of particles in classical mechanics.

However, if the agent executes a wireheading action that interferes with or damages the perception system (in the cleaning robot example, something like closing its eyes or putting water on the camera that sees the environment), then we would probably notice a discontinuity in the function $x(t)$. We could thus recognise that wireheading has occurred, even without knowing the details of the actions taken by the agent.
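In practice the sensor signal is only available as samples, so continuity can be checked only approximately. One possible sketch, assuming a scalar signal sampled every `dt` seconds and a known bound `max_rate` on how fast legitimate readings can change (both hypothetical parameters, not part of the original formalism), flags any pair of consecutive samples that differ by more than that bound allows:

```python
def find_discontinuities(samples, dt, max_rate):
    """Return the indices i where samples[i] jumps away from samples[i - 1].

    A jump is any change larger than max_rate * dt, i.e. faster than the
    assumed physical limit on the sensed quantity.
    """
    threshold = max_rate * dt
    return [
        i
        for i in range(1, len(samples))
        if abs(samples[i] - samples[i - 1]) > threshold
    ]

# Usage: a slowly varying signal with one abrupt jump at index 3.
trace = [0.0, 0.01, 0.02, 0.9, 0.91]
jumps = find_discontinuities(trace, dt=0.01, max_rate=2.0)
```

The choice of `max_rate` is a judgment call: too small and ordinary motion is flagged, too large and a modest sensor glitch goes unnoticed.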

An example

As a simple example that can be expressed within this formalism, consider an environment described by a line segment $[0, 1] \subset \mathbb{R}$, with the sensor positioned at the extremity where $x = 0$.

The agent is modelled as a point that moves along the line: it starts in state $x_0$ and can move forwards or backwards, with limited speed $u \in [-k, k]$.

We want to train this agent to reach the point $x = 1$: for every instant $t$, the reward is $r(t) = x(t)$.

The behaviour of the system is described by

$$\dot{x}(t) = u(t)$$

for $x \in (0, 1]$, but if the sensor is touched by the agent, i.e. $x = 0$, then it doesn't work properly and the agent receives an unpredictable value instead of $0$.

Depending on the details of the learning algorithm and the values returned by the sensor when the agent interferes with it, this agent could learn how to reach $x = 0$ (wireheading) instead of $x = 1$, the desired position.

But in every episode where wireheading occurs, it is easily noticed by checking the continuity of the function $x(t)$.
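The example can be simulated in a few lines. The sketch below rests on several assumptions that are not part of the post: the segment is taken to be [0, 1] with the sensor at 0, time is discretised with step `dt`, a broken sensor returns values drawn uniformly from [0.5, 1.0], and a reading counts as a jump when it changes by more than `slack * k * dt`. The names `run_episode` and `first_jump` are hypothetical.

```python
import random

def run_episode(x0, policy, k=0.5, dt=0.01, steps=200, seed=0):
    """Simulate a point agent on [0, 1] whose sensor sits at x = 0.

    Returns the list of sensor readings. While the true position is
    positive the sensor reports it exactly; once the agent touches the
    sensor, readings are random (an assumed corruption model).
    """
    rng = random.Random(seed)
    x = x0
    readings = [x0]
    for _ in range(steps):
        u = max(-k, min(k, policy(readings[-1])))  # speed limited to [-k, k]
        x = max(0.0, min(1.0, x + u * dt))         # stay on the segment
        if x <= 0.0:
            readings.append(rng.uniform(0.5, 1.0))  # broken sensor
        else:
            readings.append(x)
    return readings

def first_jump(readings, k, dt, slack=1.5):
    """Index of the first reading that changes faster than the speed limit allows."""
    for i in range(1, len(readings)):
        if abs(readings[i] - readings[i - 1]) > slack * k * dt:
            return i
    return None

# Usage: one policy walks into the sensor, the other heads for x = 1.
wirehead_trace = run_episode(0.2, policy=lambda reading: -1.0)
honest_trace = run_episode(0.2, policy=lambda reading: 1.0)
wirehead_jump = first_jump(wirehead_trace, k=0.5, dt=0.01)
honest_jump = first_jump(honest_trace, k=0.5, dt=0.01)
```

Under these assumptions the honest episode produces no jump, while the wireheading episode is flagged at the step where the agent reaches the sensor.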


Observations

  • In AI, RL with a discrete environment is used more frequently than RL with continuous time and space.
  • I don't believe this method scales to the most complex instances of wireheading. An extremely intelligent agent could realise that the continuity of the sensor function is checked, and could "cheat" accordingly.
  • This approach doesn't cover all cases, and it actually seems better suited to detecting sensor damage than wireheading. That said, it can still give us a better understanding of wireheading and could eventually help us find a formal definition or a complete solution to the problem.

Thanks to Davide Zagami, Grue_Slinky and Michael Aird for feedback.

Comments

Unfortunately, discontinuities are common in any real system, e.g. much of robotics is in figuring out how to deal with contact forces (e.g. when picking up objects) because of the discontinuities that arise. (A term to look for is "hybrid systems".)

I'm not sure I understand what you mean—I know almost nothing about robotics—but I think that, in most cases, there is a function whose discontinuity gives a strong indication that something went wrong. A robotic arm has to deal with impulsive forces, but its movement in space is expected to be continuous wrt time. The same happens in the bouncing ball example, or in the example I gave in the post: velocity may be discontinuous in time, but motion shouldn't.

Thanks for the suggestion on hybrid systems!

"A robotic arm has to deal with impulsive forces, but its movement in space is expected to be continuous wrt time."

Fair enough. What about e.g. watching TV? Scene changes on TV seem like a pretty discontinuous change in visual input.

That's an interesting example I had not considered. As I wrote in the observations: I don't think the discontinuity check works in all cases.