
Instrumental Occam?

by abramdemski · 31st Jan 2020

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In The Real Rules have No Exceptions, Said says (apparently speaking of instrumental rules, not epistemic ones):

Prefer simplicity in your rules. Be vigilant that your rules do not grow too complex; make sure you are not relaxing the legitimacy criteria of your exceptions. Periodically audit your rules, inspecting them for complexity; try to formulate simpler versions of complex rules.

This is a very plausible principle. I have had similar thoughts myself. The idea seems to have merit in practice. But is there any theoretical analogue?

Minimum description length (Occam's Razor) is an epistemic principle. In Bayesian terms, probabilities must sum to one, which translates information-theoretically (via optimal encodings) to the idea that there are only so many short codes, so we have to assign them carefully to the most probable things.
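
To make the coding picture concrete, here is a minimal sketch (the hypothesis names and probabilities below are invented) of why short codes are a scarce resource: an optimal prefix code assigns each hypothesis roughly $-\log_2$ of its probability in bits, and the Kraft inequality caps the supply of short codewords in exactly the way that probabilities are capped by summing to one.

```python
import math

# Made-up prior over hypotheses (must sum to 1).
priors = {"simple_rule": 0.5, "medium_rule": 0.25, "complex_rule": 0.125, "other": 0.125}

# Shannon code lengths: an optimal prefix code gives each hypothesis a codeword
# of roughly -log2(probability) bits, so probable hypotheses get short codes.
code_lengths = {h: -math.log2(p) for h, p in priors.items()}
print(code_lengths)  # {'simple_rule': 1.0, 'medium_rule': 2.0, 'complex_rule': 3.0, 'other': 3.0}

# Kraft inequality: the sum of 2**(-length) over all codewords is at most 1,
# so handing a short code to one hypothesis forces longer codes on the rest.
assert sum(2 ** -l for l in code_lengths.values()) <= 1 + 1e-9
```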

However, expected utility is not a limited resource in this way. Updating to think an option is better than previously thought doesn't necessarily make other options worse. Selecting policies is not so different from selecting raw actions. And for Bayesian decision theory, it doesn't really matter whether a policy is specified as a big table of observation/action pairs, vs an elegantly specified rule.
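
A toy illustration of that last point (everything in it is invented): the same policy written as an explicit observation/action table or as a one-line rule receives exactly the same expected utility, so nothing in the decision-theoretic calculation rewards the elegant version.

```python
# One policy, two representations: a big lookup table and a short rule.
observations = range(100)

table_policy = {o: ("act" if o % 2 == 0 else "wait") for o in observations}
rule_policy = lambda o: "act" if o % 2 == 0 else "wait"

def utility(obs, action):
    # Made-up utility: 1 for the "right" action, 0 otherwise.
    return 1.0 if action == ("act" if obs % 2 == 0 else "wait") else 0.0

eu_table = sum(utility(o, table_policy[o]) for o in observations) / len(observations)
eu_rule = sum(utility(o, rule_policy(o)) for o in observations) / len(observations)
assert eu_table == eu_rule  # same behavior, same expected utility, very different description lengths
```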

Optimality cares not for elegance!

Yet, even in the relatively formal world of machine learning, the practice seems contrary to this. When you are optimizing a neural network, you don't actually care that much whether it's something like a hypothesis (making predictions) or something like a policy (carrying out actions). You apply the same kind of regularization either way, as far as I understand (regularization being the machine-learner's generalization of Occam). (Correction: this seems not to be the case.)

We might say that this is because (in some sense) the instrumental uncertainty and the epistemic uncertainty are actually being wrapped up together. But (1) this reply seems overly quick to me at present, and I'd want to understand in detail whether this can be justified; (2) I'm curious if there is a purely instrumental version of Occam to be articulated; it seems intuitively really plausible to me, though technically quite mysterious.

So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?


4 Answers

First, if we take PSRL (posterior sampling reinforcement learning) as our model algorithm, then at any given time we follow a policy optimal for some hypothesis sampled out of the belief state. Since our prior favors simple hypotheses, the hypothesis we sampled is likely to be simple. But, given a hypothesis $H$ of description complexity $K(H)$, the corresponding optimal policy $\pi^*_H$ has description complexity $K(H) + O(1)$, since the operation "find the optimal policy for a given hypothesis" has $O(1)$ description complexity.
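
A minimal sketch of the PSRL loop described above, on a made-up tabular MDP (all sizes, rewards, and the episode budget are invented), may make the point vivid: the only policy-specific object the agent ever produces is "the optimal policy of the sampled hypothesis", obtained by one fixed planning procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, gamma = 4, 2, 10, 0.9

# "True" environment, unknown to the agent.
true_T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel T(s'|s,a)
true_R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # reward R(s,a), assumed known here

# Belief state: Dirichlet posterior over transitions, kept as pseudo-counts.
counts = np.ones((n_states, n_actions, n_states))

def optimal_policy(T, R):
    """Value iteration: the fixed 'find the optimal policy' procedure whose
    description length does not grow with the sampled hypothesis."""
    V = np.zeros(n_states)
    for _ in range(200):
        Q = R + gamma * (T @ V)   # shape (n_states, n_actions)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

for episode in range(50):
    # Sample one hypothesis (an MDP) from the belief state and plan for it.
    sampled_T = np.array([[rng.dirichlet(counts[s, a]) for a in range(n_actions)]
                          for s in range(n_states)])
    policy = optimal_policy(sampled_T, true_R)

    # Act in the real environment and update the belief state.
    s = 0
    for _ in range(horizon):
        a = policy[s]
        s_next = rng.choice(n_states, p=true_T[s, a])
        counts[s, a, s_next] += 1
        s = s_next
```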

Taking computational resource bounds into account makes things more complicated. For some $H$, computing $\pi^*_H$ might be intractable, even though $H$ itself is "efficiently computable" in some sense. For example, we can imagine an $H$ that has exponentially many states plus some succinct description of the transition kernel.

One way to deal with this is to use some heuristic for the optimization. But then the description complexity is still roughly $K(H) + O(1)$.

Another way to deal with it is restricting ourselves to the kind of hypotheses for which computing the optimal policy is tractable, but allowing incomplete/fuzzy hypotheses, so that we can still deal with environments whose complete description falls outside this class. For example, this can take the form of looking for some small subset of features that has predictable behavior that can be exploited. In this approach, the description complexity is probably still something like $K(H) + O(1)$, where this time $H$ is incomplete/fuzzy (but I don't yet know how PSRL for incomplete/fuzzy hypotheses should work).

Moreover, using incomplete models we can in some sense go in the other direction, from policy to model. This might be a good way to think of model-based RL. In actor-critic algorithms, our network learns a pair consisting of a value function $V$ and a policy $\pi$. We can think of such a pair as an incomplete model that is defined by the Bellman inequality, interpreted as a constraint on the transition kernel $T$ (or on $T$ and the reward function $R$):

$$V(s) \;\le\; \mathbb{E}_{a \sim \pi(s)}\!\left[ R(s,a) + \gamma \, \mathbb{E}_{s' \sim T(s,a)}\big[V(s')\big] \right] \quad \text{for all } s$$
Assuming that our incomplete prior assigns this incomplete hypothesis a weight that decreases with the description complexity of the pair (something like $2^{-K(V,\pi)}$), we get a sort of Occam's razor for policies.
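
Here is a small sketch of the "(V, π) pair as an incomplete model" idea under the Bellman-inequality constraint above (all arrays are toy values and the function name is mine): the pair does not determine the environment, it only carves out the set of transition kernels consistent with it.

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2

V = np.array([1.0, 0.5, 0.0])                        # learned value estimates V(s)
pi = np.array([[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]])  # learned policy pi(a|s)
R = np.array([[0.2, 0.0], [0.1, 0.3], [0.0, 0.0]])   # reward R(s, a)

def satisfies_bellman_inequality(T):
    """Check V(s) <= E_{a~pi}[ R(s,a) + gamma * E_{s'~T}[ V(s') ] ] for every s.

    Any transition kernel passing this check is consistent with the incomplete
    hypothesis; the hypothesis is silent about everything else."""
    expected_next_V = T @ V                              # shape (n_states, n_actions)
    rhs = (pi * (R + gamma * expected_next_V)).sum(axis=1)
    return bool(np.all(V <= rhs + 1e-9))

# Two candidate environments: the incomplete model accepts one and rejects the other.
T_to_best = np.tile(np.array([1.0, 0.0, 0.0]), (n_states, n_actions, 1))   # always jump to the best state
T_to_worst = np.tile(np.array([0.0, 0.0, 1.0]), (n_states, n_actions, 1))  # always jump to the worst state
print(satisfies_bellman_inequality(T_to_best))   # True
print(satisfies_bellman_inequality(T_to_worst))  # False
```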

In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy's value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.
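
A small Monte Carlo sketch of this analogy (every constant and distribution in it is invented): if value estimates for more complex policies are noisier, selecting the raw argmax is biased toward policies that merely look good, and shrinking each noisy estimate toward a prior mean (accepting bias to cut variance) tends to select genuinely better policies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_policies, prior_sd, trials = 50, 0.3, 3000

complexity = np.linspace(0.1, 2.0, n_policies)             # crude stand-in for description length
true_values = rng.normal(0.0, prior_sd, size=n_policies)   # most policies are mediocre
noise_sd = 0.2 + 0.8 * complexity                          # complex policies: noisier value estimates

raw_regret, shrunk_regret = [], []
for _ in range(trials):
    noisy = true_values + rng.normal(0.0, noise_sd)
    # Posterior-mean shrinkage under the Gaussian prior: noisier (more complex)
    # policies are pulled harder toward the prior mean of 0.
    shrunk = noisy * prior_sd**2 / (prior_sd**2 + noise_sd**2)

    best = true_values.max()
    raw_regret.append(best - true_values[np.argmax(noisy)])
    shrunk_regret.append(best - true_values[np.argmax(shrunk)])

print(f"mean regret, raw argmax:     {np.mean(raw_regret):.3f}")
print(f"mean regret, with shrinkage: {np.mean(shrunk_regret):.3f}")
```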

So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?

Maybe problems that don't have simple solutions (i.e. all their solutions have a large description length) are usually intractable for us. If so, given a problem that we're trying to solve, the assumption that it has simple solutions is probably either useful (if it's true) or costless (if it isn't). In other words: "look for your missing key under the lamppost, not because it's probably there, but because you'll only ever find it if it's there".

One way of solving the problem of action selection is to treat our own actions as random variables, then build a probabilistic model, and condition that probability distribution on things going well for us in the future. See, for example, Planning By Probabilistic Inference, human brains (I would argue), upside-down RL (kinda, I think?), etc.

In this formulation, instrumental Occam's razor is just a special case of epistemic Occam's razor, it seems to me.
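
For what it's worth, here is a minimal sketch of that formulation on a two-action toy problem (all numbers invented): the action is treated as a random variable, we condition on the outcome being good, and the resulting posterior over actions is the plan.

```python
# Planning as inference on a toy problem: condition on success, read off actions.
p_action = {"go_left": 0.5, "go_right": 0.5}                # prior over our own action
p_success_given_action = {"go_left": 0.2, "go_right": 0.7}  # made-up outcome model

# Bayes: P(action | success) is proportional to P(success | action) * P(action).
joint = {a: p_action[a] * p_success_given_action[a] for a in p_action}
z = sum(joint.values())
posterior = {a: p / z for a, p in joint.items()}
print(posterior)  # roughly {'go_left': 0.22, 'go_right': 0.78}; the plan favors go_right
```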