Instrumental Occam?

DanielFilan asked me to comment on this paragraph:

Yet, even in the relatively formal world of machine learning, the practice seems contrary to this. When you are optimizing a neural network, you don't actually care that much whether it's something like a hypothesis (making predictions) or something like a policy (carrying out actions). You apply the same kind of regularization either way, as far as I understand (regularization being the machine-learner's generalization of Occam).

AFAIK, it's not actually standard to regularize RL policies the same way you regularize supervised learning. For example, A3C, PPO, and SAC, three leading Deep RL algorithms, use an entropy bonus to regularize their policies. Notably, an entropy bonus encourages policies that do different things, rather than policies that are internally simple. In supervised learning, on the other hand, people use techniques such as L2 regularization and Dropout to get predictors that are simple.
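To make the contrast concrete, here is a minimal sketch of the two kinds of regularizer. It is not taken from any of the algorithms named above: the function names and the coefficients (`entropy_coef`, `l2_coef`) are made up for illustration. The point is only that the entropy term looks at the spread of the action distribution, while the L2 term looks at the size of the weights.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def l2_penalty(weights):
    """Sum of squared weights, the usual L2 (weight decay) penalty."""
    return sum(w * w for w in weights)

def policy_objective(expected_return, action_probs, entropy_coef=0.01):
    # Entropy bonus: added to the objective that is being maximized.
    # entropy_coef is a hypothetical value, not a recommended setting.
    return expected_return + entropy_coef * entropy(action_probs)

def supervised_loss(prediction_error, weights, l2_coef=1e-4):
    # L2 penalty: added to the loss that is being minimized.
    # l2_coef is a hypothetical value, not a recommended setting.
    return prediction_error + l2_coef * l2_penalty(weights)

uniform = [0.25, 0.25, 0.25, 0.25]  # policy that spreads over all actions
peaked = [0.97, 0.01, 0.01, 0.01]   # policy that almost always does one thing

# Same expected return, but the more varied policy scores higher.
print(policy_objective(1.0, uniform) > policy_objective(1.0, peaked))

# Same prediction error, but the smaller-weight predictor has lower loss.
print(supervised_loss(0.5, [0.0, 0.0]) < supervised_loss(0.5, [3.0, 4.0]))
```

Note that the entropy term never inspects the network's parameters, and the L2 term never inspects its outputs, which is one way of seeing that they regularize different things.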

You do see L2 regularization used in a lot of deep RL papers (for example, it's used on every network in AlphaZero variants, in DQN, and even in earlier versions of the SAC algorithm I mentioned before). However, it's important to note that L2 regularization is used on prediction tasks:

  • The policy and value network try to predict the MCTS-amplified policy and value, respectively.
  • The L2-regularized networks in DQN and SAC are used to predict the Q-values.

(Vanessa's argument about PSRL also seems similar, as PSRL is fundamentally doing supervised learning.)

As for the actual question, I'm not sure "instrumental Occam" exists. Absent multi-agent issues, my guess is Occam's razor is useful in RL insofar as your algorithm has predictive tasks. You want a simple rule for predicting the reward given your action, not a simple rule for predicting the action given an observation history. Insofar as an actual simplicity prior on policies exists and is useful, my guess is that it's because your AI might interact with other AIs (including copies of itself), and so needs to be legible/inexploitable/predictable/etc.

How good a proxy for accuracy is precision?

I wonder if that's because they're using the ISO definition of accuracy? A quick Google search for these diagrams led me to this reddit thread, where the discussion shows that people use different definitions of accuracy.

EDIT: here's a diagram of the form that Elizabeth is complaining about (source: the aforementioned reddit thread):

[Image: Accuracy vs Precision]

Does it become easier, or harder, for the world to coordinate around not building AGI as time goes on?

Hyperbolic discounting leads to preference reversals over time: the classic example is always preferring a certain $1 now to $2 tomorrow, but preferring a certain $2 in a week to $1 in 6 days. This is a pretty clear sign that it never "should" be done: an agent with these preferences might find themselves paying a cent to switch from $1 in 6 days to $2 in 7, then, 6 days later, paying another cent to switch back and get the $1 immediately.
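Here is a small sketch of that reversal, assuming the common hyperbolic form value = amount / (1 + k * delay) and, for comparison, exponential discounting value = amount * delta^delay. The parameters `k=2.0` and `delta=0.4` are arbitrary choices picked so that both agents prefer $1 now to $2 tomorrow; they are not estimates of anything.

```python
def hyperbolic_value(amount, delay_days, k=2.0):
    """Present value under hyperbolic discounting (k is an assumed rate)."""
    return amount / (1.0 + k * delay_days)

def exponential_value(amount, delay_days, delta=0.4):
    """Present value under exponential discounting (delta is assumed)."""
    return amount * delta ** delay_days

def prefers_later(value_fn, days_from_now):
    """True if $2 at days_from_now + 1 is valued above $1 at days_from_now."""
    return value_fn(2.0, days_from_now + 1) > value_fn(1.0, days_from_now)

for name, fn in [("hyperbolic", hyperbolic_value),
                 ("exponential", exponential_value)]:
    print(name,
          "| waits a day when the choice is today:", prefers_later(fn, 0),
          "| waits a day when the choice is 6 days out:", prefers_later(fn, 6))
```

The hyperbolic agent answers differently depending on how far away the choice is, which is the reversal. The exponential agent gives the same answer at every horizon, because the ratio between the two options' values does not depend on the delay.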

However, in practice, even rational agents might exhibit preferences that look like hyperbolic discounting (though without preference reversals): for example, right now I might not believe you're very trustworthy and worry you might forget to give me money tomorrow. So I prefer $1 now to $2 tomorrow. But if you actually are going to give me $1 in 6 days, I might update to thinking you're quite trustworthy and then be willing to wait another day to get $2 instead. (See this paper for a more thorough discussion of this possibility:

How good a proxy for accuracy is precision?

I believe your definition of accuracy differs from the ISO definition (which is the usage I learned in undergrad statistics classes, and also the usage most online sources seem to agree with): a measurement is accurate insofar as it is close to the true value. By this definition, the second graph is accurate but not precise because all the points are close to the true value. I'll be using that definition in the remainder of my post. That being said, Wikipedia does claim your usage is the more common usage of the word.

I don't have a clear sense of how to answer your question empirically, so I'll give a theoretical answer.

Suppose our goal is to predict some value $y$. Let $\hat{y}$ be our predictor for $y$ (for example, we could ask a subject to predict $y$). A natural way to measure accuracy for prediction tasks is the mean squared error $\mathbb{E}[(\hat{y} - y)^2]$, where a lower mean squared error means higher accuracy. The bias-variance decomposition of the mean squared error gives us:

$$\mathbb{E}[(\hat{y} - y)^2] = (\mathbb{E}[\hat{y}] - y)^2 + \mathbb{E}[(\hat{y} - \mathbb{E}[\hat{y}])^2]$$

The first term on the right is the (squared) bias of your estimator - how far the expected value of your estimator is from the true value. An unbiased estimator is one that, in expectation, gives you the right value (what you mean by "accuracy" in your post, and what ISO calls "trueness"). The second term is the variance of your estimator - how far your estimator is, in expectation, from the average value of the estimator. Rephrasing a bit, this measures how imprecise your estimator is, on average.

As both terms on the right are always non-negative, the squared bias and the variance of your estimator each lower-bound your mean squared error.
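As a quick numerical sanity check, here is a sketch that computes the mean squared error, the squared bias, and the variance for a made-up estimator and confirms that the first equals the sum of the other two. The true value (10.0), the offset (0.5), and the noise level (standard deviation 2.0) are all arbitrary assumptions chosen for illustration.

```python
import random

def mse_decomposition(estimates, true_value):
    """Return (mse, squared_bias, variance) computed over a list of estimates."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    bias_sq = (mean_est - true_value) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    return mse, bias_sq, variance

rng = random.Random(0)
true_value = 10.0
# Hypothetical estimator: systematically 0.5 too high, plus Gaussian noise.
estimates = [true_value + 0.5 + rng.gauss(0.0, 2.0) for _ in range(10_000)]

mse, bias_sq, variance = mse_decomposition(estimates, true_value)
print(f"mse={mse:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
print("mse - (bias^2 + variance) =", mse - (bias_sq + variance))
```

Computed over a fixed sample like this, the identity holds exactly up to floating-point rounding, so the last line should print something very close to zero.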

However, it turns out that there's often a trade-off between having an unbiased estimator and having a more precise estimator, known appropriately as the bias-variance trade-off. In fact, there are many classic examples in statistics of estimators that are biased but have lower MSE than any unbiased estimator. (Here's the first one I found while Googling)
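One textbook illustration (not necessarily the one linked above) is estimating the variance of normally distributed data: dividing the sum of squared deviations by n + 1 gives a biased estimate, but a lower MSE than the usual unbiased estimate that divides by n - 1. The sketch below checks this by simulation against that one unbiased estimator only. The sample size, number of trials, and seed are arbitrary assumptions.

```python
import random

def simulate_mse(divisor_offset, n=5, sigma=1.0, trials=20_000, seed=0):
    """Monte Carlo MSE of sum((x - mean)^2) / (n + divisor_offset)
    as an estimator of sigma^2, for normally distributed samples."""
    rng = random.Random(seed)
    true_var = sigma ** 2
    total_sq_err = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        ss = sum((x - mean) ** 2 for x in xs)
        estimate = ss / (n + divisor_offset)
        total_sq_err += (estimate - true_var) ** 2
    return total_sq_err / trials

# Same seed for both calls, so both estimators see the same samples.
mse_unbiased = simulate_mse(-1)  # divide by n - 1 (unbiased)
mse_biased = simulate_mse(+1)    # divide by n + 1 (biased toward zero)

print(f"MSE dividing by n-1: {mse_unbiased:.3f}")
print(f"MSE dividing by n+1: {mse_biased:.3f}")
```

The biased estimator shrinks every estimate toward zero, which costs some bias but removes more variance than it adds in squared bias.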

Open thread, August 14 - August 20, 2017

Why do you think this doesn't exist?

Open thread, August 14 - August 20, 2017

For what it's worth, though, as far as I can tell we don't have the ability to create an AI that will reliably maximize the number of paperclips in the real world, even with infinite computing power. As Manfred said, model-based goals seem to be a promising research direction for getting AIs to care about the real world, but we don't currently have the ability to get such an AI to reliably actually "value paperclips". There are a lot of problems with model-based goals that occur even in the POMDP setting, let alone when the agent's model of the world or observation space can change. So I wouldn't expect anyone to be able to propose a fully coherent, complete answer to your question in the near term.

It might be useful to think about how humans "solve" this problem, and whether or not you can port this behavior over to an AI.

If you're interested in this topic, I would recommend MIRI's paper on value learning as well as the relevant Arbital Technical Tutorial.

Game Theory & The Golden Rule (From Reddit)

This is because of the 5% chance of mistakes. Copycat does worse against both Simpleton and Copycat than Simpleton does against itself.
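Here is a rough simulation sketch of that claim. Everything beyond the 5% mistake rate is my assumption about the setup: Copycat is taken to be tit-for-tat, Simpleton is taken to be win-stay/lose-shift, both open by cooperating, and cooperating costs you 1 coin while giving the other player 3. The function names and round count are made up for illustration.

```python
import random

C, D = True, False  # cooperate, cheat

def payoff(me, other):
    # Assumed payoffs: cooperating costs me 1 coin and gives the other player 3.
    return (3 if other == C else 0) - (1 if me == C else 0)

def copycat(my_last, their_last):
    # Tit-for-tat: repeat whatever the other player actually did last round.
    return their_last

def simpleton(my_last, their_last):
    # Win-stay, lose-shift: keep my last move if they cooperated, else switch.
    return my_last if their_last == C else (not my_last)

def average_score(strategy_a, strategy_b, rounds=100_000, mistake=0.05, seed=0):
    """Average per-round payoff to strategy_a when playing strategy_b."""
    rng = random.Random(seed)
    a_last, b_last = C, C  # so that both players open by cooperating
    total = 0
    for _ in range(rounds):
        a_move = strategy_a(a_last, b_last)
        b_move = strategy_b(b_last, a_last)
        # With probability `mistake`, a player does the opposite of its plan.
        if rng.random() < mistake:
            a_move = not a_move
        if rng.random() < mistake:
            b_move = not b_move
        total += payoff(a_move, b_move)
        a_last, b_last = a_move, b_move
    return total / rounds

print("Simpleton vs Simpleton:", average_score(simpleton, simpleton))
print("Copycat vs Simpleton:  ", average_score(copycat, simpleton))
print("Copycat vs Copycat:    ", average_score(copycat, copycat))
```

Under these assumptions, two Simpletons recover from a single mistake within a couple of rounds, whereas a mistake involving Copycat tends to set off a long run of retaliation that only another mistake can end.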
