Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: Pretty sure about core, not about edges

A while ago, I noticed a possible bias in how I evaluated reinforcement learning agents. It tended to cause me to revise my estimation of their intelligence downwards, after I viewed a video of them in action.

I've seen other people fall into what I believe to be the same error. So I'm writing this to correct myself if I am wrong and to alert others if I am right.

The Bias

Many reinforcement learning agents have "jitters." They alternate actions quickly, looking nearly palsied, apparently nullifying the effects of earlier actions with later ones. This is true across a wide variety of reinforcement learning agents.

Many people see these jitters as evidence of the relatively primitive nature of these agents. These actions look clearly stupid and sub-optimal.

For instance, consider the original Deep Q Network paper. Even after training for some time on Breakout, the agent still erratically moves the paddle back and forth when the ball is not near it. One person mentions that it makes "erratic jerky movements that obviously could not in principle be optimal," which was once my impression as well.

Similarly, and much more recently, consider DeepMind's work on generally capable agents. In the show reel, the movement of the agents often looks erratic. Conversation around LessWrong sometimes alluded to these erratic movements as evidence against the intelligence of the agents.

Jitters Non-Optimal For Energy-Constrained Agents

Evolved intelligence on earth has energy conservation as a fundamental part of its optimization function.

Unnecessary movements spend energy. Spent energy must be recovered, at the cost of reproductive fitness. So generally only sick animals, insane animals, and so on, have the shakes or tremble continuously. Energy conservation applies to every animal on earth, which is probably why we feel intuitively confident applying this rule across the broad variety of animals.

Additionally, extremely erratic movements can result in injury to the animal which is making them. So this is another reason why, for creatures that are a result of evolution, erratic movements are a sign of insanity or injury.

RL Agents Are Not Energy-Constrained

Reinforcement learning agents are not energy-constrained. They do not draw on a finite store of glucose when acting. Nor do they have any possibility of injuring themselves. As a result, the policies resulting from reinforcement learning algorithms will not be strongly constrained to limit jitters in the way that policies resulting from evolution will be constrained.

You can go further than this. Given the way that most reinforcement learning agents are set up, they have no way to even distinguish between action and non-action, and thus between non-rest and rest.

That is, consider a reinforcement learning agent which makes one of fifteen different categorical actions in each time-step, like those in OpenAI's ProcGen. For an agent controlling a side-scrolling avatar, for instance, one action would be moving right; another action would be jumping; another action would be doing nothing; etc. Each of these is only distinguished from the others by its index in a one-hot action encoding -- i.e., moving right could be [1,0,0,0...], jumping could be [0,1,0,0...], doing nothing could be [0,0,1,0...], and so on.
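To make the encoding point concrete, here is a minimal sketch. The four action names are hypothetical and chosen only for illustration (ProcGen's real action set has fifteen entries and different names). It shows that, as far as the encoding goes, "do_nothing" sits exactly as far from every other action as any other action does.

```python
# Hypothetical action names for illustration only; not ProcGen's actual action list.
ACTIONS = ["move_right", "jump", "do_nothing", "move_left"]


def one_hot(index, size):
    """Return a list of zeros of length `size` with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


def hamming(a, b):
    """Count the positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Map each action name to its one-hot vector.
encodings = {name: one_hot(i, len(ACTIONS)) for i, name in enumerate(ACTIONS)}

# Collect the distance between every pair of distinct actions.
distances = {
    hamming(encodings[a], encodings[b])
    for a in ACTIONS
    for b in ACTIONS
    if a != b
}

print(encodings["do_nothing"])
print(distances)
```

Every pair of distinct actions differs in exactly two positions, so nothing in the representation marks "do_nothing" as a special, restful choice.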

For a human controlling such a side-scrolling avatar, "doing nothing" stands out from all the other actions. If you put yourself in a situation where you are allowed to do nothing, you can rest your hands by not pressing any buttons. You can consider a more global strategy, and focus on the kind of strategy you will use when you resume acting. It also allows you to rest your mind, because humans can think harder or less hard. Doing nothing gives you an opportunity for reflection and meta-optimization in a way that no other alternative does.

None of this applies to a reinforcement learning agent. "Doing nothing" is one one-hot encoding just like all the other encodings. It cannot rest itself by doing nothing. It cannot focus on preparing for things further away in time; the vast majority of reinforcement learning agents must do a constant amount of thought in each time-step, about precisely the same things. So rest is not a preferred location in action-space that allows meta-optimization for these agents, as it is for evolved agents. They have no way to distinguish rest from non-rest, and thus no reason to pursue rest.

The above should also apply, mutatis mutandis, to reinforcement learning agents acting in a continuous rather than a discrete space.

Jitters May Sometimes be Optimal for Non-Energy-Constrained Agents

This is a more speculative point.

When I act, I often trade between low-probability-of-success action, with little thought put into it, and high-probability-of-success action, with a lot of thought put into it. Put more simply, where attempted action is very cheap, I am willing to try a lot of times.

Battery doesn't fit? I'll wiggle it around. Command in the terminal doesn't work? I'll try changing a parameter. Pill bottle not opening? I'll cycle through different axes of twist and pressure. Generally, I'll start to apply thought more determinedly where there are no low-cost actions available with any reasonable probability of success.

Again, this makes sense from an evolutionary standpoint. Trying things takes energy. Thinking about things also takes energy. Along the boundary where each alternative has equal potential reward and equal probability of success, we would expect ourselves to be indifferent to trying things out versus thinking about things. Only where trying becomes more expensive than thinking about things would we expect that we would feel inclined to think about things rather than try things.

But again, this is not a trade-off that reinforcement learning agents are able to make. They must always think about things to precisely the same amount. This means that exposing yourself to a greater surface area of possible reward, in areas of phase-space where actions are not overdetermined, might generally be the ideal action. Jittering around could be the optimal solution.

Again, I'm less sure about this section.


When I see a reinforcement learning agent acting in a video, acting erratically, some part of me still says that it looks kind of stupid because of this. But I currently believe, for reasons given above, that it's best not to listen to this part of myself.



I agree that much of jittering reflects merely a minor absence of reward-shaping to penalize energy expenditures or wear-and-tear on equipment (the latter especially is why in robotics they do tend to add in tiny penalties for actions/changes to encourage smoothness). And when it learns tactics which depend on ultra-rapid fluctuations, well, that's usually 'a feature not a bug', assuming the environment is faithful to the intended application.
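As a rough sketch of the kind of reward shaping described here: the function names, the discrete actions, and the penalty size of 0.01 are all assumptions made for illustration, not taken from any particular robotics setup. The idea is just to subtract a tiny cost each time the action changes.

```python
def shaped_reward(env_reward, action, prev_action, switch_penalty=0.01):
    """Subtract a small, hypothetical cost whenever the action changes."""
    changed = prev_action is not None and action != prev_action
    return env_reward - (switch_penalty if changed else 0.0)


def episode_return(actions, env_rewards, switch_penalty=0.01):
    """Sum the shaped rewards over one episode."""
    total = 0.0
    prev = None
    for action, reward in zip(actions, env_rewards):
        total += shaped_reward(reward, action, prev, switch_penalty)
        prev = action
    return total


# Two toy episodes with identical environment rewards.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
jittery = ["left", "right", "left", "right", "left"]
smooth = ["left", "left", "left", "left", "left"]

jittery_return = episode_return(jittery, rewards)
smooth_return = episode_return(smooth, rewards)
print(jittery_return, smooth_return)
```

With the environment reward held equal, the jittery episode pays for its four switches and the smooth one does not, so even a tiny penalty breaks the tie in favor of smoothness.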

But I still tend to be a little troubled when I see jittering in an agent because it seems like it can reflect pathologies of estimation of values or actions, and to interfere with learning by adding in extraneous variation.

When an agent flips back and forth between actions which are irrelevant, that suggests that the values of the actions are fluctuating rapidly, even though the state of the environment has probably changed only a little; if the agent were learning well, with robust accurate estimation and few weird outliers or overfit estimates, you'd expect more consistency: "in state X, and X+1, and X+2, the best move is to go left"; it would be weird if a single pixel at the edge of the screen being red rather than green convinces the agent to go left - wait now it's one RGB shade brighter, go right - wait, it's back, go left - wait, it's green, go up! - you expect more temporal consistency. (When I read about adversarial attacks on DRL agents, particularly the soccer example, it's hard not to feel like there's some connection to jittering there. There's an analogy there to "non-robust features" in image classification, as well as the original adversarial image attacks: we have a strong intuition that jittering a few pixels should not have any effect.)

In general, it seems like better agents do act more like humans. The hide&seek OA agents or the related DM game agents don't seem to jitter like the original ALE DQN does; AlphaZero, for example, was noted by both Go & chess pros to play in a much more human-like way than weaker computer Go/chess systems (despite the latter also being superhuman), and I've collated many examples of more human-like better-performing systems under the "blessings of scale" rubric. So it seems to me that when an agent is learning clearly inhuman policies like jittering, that is a strong hint that however good it is, it could still be better.

It also seems like it'd interfere with learning: aside from the effect on exploration (jittering looks like epsilon random exploration, about the worst kind), the more disparate actions, the harder it is to estimate the net effect of the key actions or the environmental baseline. If you have only a few actions inside an episode, credit assignment ought to be easier. This might contribute to the previous problem through what you might call "superstitious agents": by twitching rapidly in a particular pattern, maybe it caused the final victory? How do you know it didn't? (It only has a very sparse set of episodes interacting with the environment to try to learn these difficult high-dimensional policies trying to solve potentially arbitrary environments, and those episodes are only partially under control & highly stochastic etc.)

I'm still unsure about whether jittering / random action would generally reflect pathology in trained policy or value functions. You've convinced me that it reveals pathology in exploration though.

So vis-a-vis policies: in some states, even the optimal policy is indifferent between actions. For such states, we would want a great number of hypotheses about those states to be easily available to the function approximator, because we would have hopefully maintained such a state of easily-available hypotheses from the agent's untrained state. This probably means flipping between lots of low-certainty hypotheses as the input changes by very small amounts -- and because low-certainty hypotheses cannot be reflected in low-certainty action, then we'd have something like jitter. I'm not sure we disagree about this though, and I'm going to have to look into the adversarial RL attacks, which are new to me.

I think I agree though, that random action no longer seems like the best way of exploring at this point, because the agent has encountered the structure of the environment.

I'm not sure if the best implementation of more purposeful exploration is as a side effect of relatively simple RL training on an enormous variety of tasks (as in maybe the Open Ended Learning Paper), where curiosity might be a side-effect -- or if the best implementation is with the addition of special curiosity-directed modules. Which of these is the right way to get curiosity and directed exploration seems to me like a really important question at this point -- but if it's the former, then I guess we should expect sufficiently generally trained policies to lack true indifference between actions as I describe above, because the "curiosity" would be manifest as low-confidence hypotheses which nevertheless tilt the policy away from actual indifference.

I like this post. Clear thesis, concrete example, and an argument that makes sense.

One consequence of your point is that in situations where RL training is metaphorically energy-constrained (with a negative reward that pushes you to go as fast as possible, or when there is a small space to go to where jittering might mean falling to one's death and really bad reward), we should not see jitters. Is that coherent with the literature?

Thanks! That's definitely a consequence of the argument.

It looks to me like that prediction is generally true, from what I remember about RL videos I've seen -- i.e., the Breakout paddle moves much more smoothly when the ball is near, DeepMind's agents move more smoothly when being chased in tag, and so on. I should definitely make a mental note to be alert to possible exceptions to this, though. I'm not aware of anywhere it's been treated systematically.

I feel like I once saw RL agents trained with and without energy costs, where the agents trained with energy costs acted a lot less jittery. But I can't remember where I saw it. 

In continuous control problems what you're describing is called "bang-bang control", or switching between different full-strength actions. In continuous-time systems this is often optimal behavior (because you get the same effect doing a double-strength action for half as long over a short timescale). Until you factor non-linear energy costs in, in which case a smoother controller becomes preferred.
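A toy calculation of that trade-off, under two simplifying assumptions of mine: the "effect" of an action over a short window is force times duration, and the energy cost is quadratic in force. Both function names are made up for this sketch.

```python
def impulse(force, duration):
    """Assumed 'effect' of a constant push: force times duration."""
    return force * duration


def quadratic_energy(force, duration):
    """Assumed non-linear cost: energy grows with the square of the force."""
    return force ** 2 * duration


gentle = (1.0, 2.0)  # unit force held for two time units
bang = (2.0, 1.0)    # double force held for half as long

gentle_impulse = impulse(*gentle)
bang_impulse = impulse(*bang)
gentle_energy = quadratic_energy(*gentle)
bang_energy = quadratic_energy(*bang)

print(gentle_impulse, bang_impulse)  # same effect
print(gentle_energy, bang_energy)    # the bang costs twice as much
```

Under these assumptions the two pushes are interchangeable until energy is priced in, at which point the gentler one is cheaper for the same result.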

Half as long right?

Thanks, fixed.

To your point, and adding another factor: I saw a talk on RL outcomes once, and I asked about the jittering, and I said that it looked to me like something a slightly blind person would do, trying to simulate "having a bigger paddle" in Breakout, to help compensate for being actually unsure about the position of the ball...

...and the speaker said that the demo ran all the frames, but the RL agent only saw every other frame, and so it did literally have a vision handicap in some sense, but it made training speed go faster and had been elided in the main talk as non-essential.

The point that "there is no resting move" (that uses less energy) is something I had independently thought of and so if you're looking for someone to catch an error in your thinking, I would like to add a small bit of evidence the other way.

The point that "there is no chance of self-injury" (from dramatic movements) is something I had never heard before, and found insightful.

You may have a look at "Smooth Exploration for Robotic Reinforcement Learning" ;) The jitter issue is one of the main motivations of that paper.

But overall, energy minimization is a good regulariser.

Also related:

Would a (small) energy cost make most RL agents look less stupid without sacrificing effectiveness?

It might discourage exploration and lead to more stasis in local optima.

Love this! I’d never have considered this stuff when looking at an RL agent.

The random jittering reminds me of the random movements of the stock market: As new information trickles in, the estimate of the optimal point jitters around noisily, rather than following a smooth trajectory.  If the value being estimated is Utility(action A) - Utility(action B), then we would expect the agent to jitter between the two actions when the estimate is near zero, like some sort of random walk repeatedly crossing the axis.
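A small simulation of this picture, with some simplifications of my own: the estimate is modeled as the true gap plus independent Gaussian noise at each step (not a true random walk), and the function name, noise level, and step count are arbitrary. It counts how often a greedy choice between A and B flips.

```python
import random


def count_switches(true_gap, noise_sd, steps, seed=0):
    """Count flips of the greedy choice when the estimated gap is noisy.

    The estimate of Utility(A) - Utility(B) is modeled as `true_gap` plus
    independent Gaussian noise at every step (an assumption of this sketch).
    """
    rng = random.Random(seed)
    switches = 0
    prev_choice = None
    for _ in range(steps):
        estimate = true_gap + rng.gauss(0.0, noise_sd)
        choice = "A" if estimate > 0 else "B"
        if prev_choice is not None and choice != prev_choice:
            switches += 1
        prev_choice = choice
    return switches


near_zero = count_switches(true_gap=0.0, noise_sd=1.0, steps=1000)
clear_gap = count_switches(true_gap=10.0, noise_sd=1.0, steps=1000)
print(near_zero, clear_gap)
```

When the true gap is near zero the choice flips constantly, and when the gap dwarfs the noise it essentially never flips, which matches the intuition that jitter should show up mostly where the agent is close to indifferent.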

I've also seen something sort of jittery as a way to be robust against lag, i.e. sending a bunch of signals (not super precisely) in order to ensure that at least one gets through to try to prevent getting stuck.

The more confirmations the better. They contribute a different amount to each hypothesis. Then you narrow it down based on your margin of confidence.

Was this comment meant to be here? It's not quite clear what it means.

Why not? Each model is basically its own hypothesis.