Buck's Shortform

formatting problem, now fixed

Aligning a toy model of optimization

Given a policy π we can directly search for an input on which it behaves a certain way.

(I'm sure this point is obvious to Paul, but it wasn't to me)

We can search for inputs on which a policy behaves badly, which is really helpful for verifying the worst case of a certain policy. But we can't search for a policy which has a good worst case, because that would require using the black box inside the function passed to the black box, which we can't do. I think you can also say this as "the black box is an NP oracle, not a Σ₂^P oracle".

This still means that we can build a system which in the worst case does nothing, rather than in the worst case is dangerous: we do whatever thing to get some policy, then we search for an input on which it behaves badly, and if one exists we don't run the policy.
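Here's a minimal sketch of that "verify, then veto" idea. The black-box search is stood in for by brute force over a finite candidate set, which is purely an assumption for illustration; the names `find_bad_input`, `safe_to_run`, and `is_bad` are hypothetical and not from the original post.

```python
from typing import Callable, Iterable, Optional, TypeVar

X = TypeVar("X")
A = TypeVar("A")


def find_bad_input(
    policy: Callable[[X], A],
    is_bad: Callable[[X, A], bool],
    candidates: Iterable[X],
) -> Optional[X]:
    """Stand-in for the black-box search (here: plain brute force).

    Returns some input on which the policy behaves badly, or None if
    no candidate triggers bad behavior.
    """
    for x in candidates:
        if is_bad(x, policy(x)):
            return x
    return None


def safe_to_run(
    policy: Callable[[X], A],
    is_bad: Callable[[X, A], bool],
    candidates: Iterable[X],
) -> bool:
    """Veto the policy if the search finds any bad input.

    Worst case we do nothing, rather than running something dangerous.
    """
    return find_bad_input(policy, is_bad, candidates) is None


# Toy usage: the "policy" doubles its input; outputs above 10 count as bad.
def double(x: int) -> int:
    return 2 * x


def too_big(x: int, action: int) -> bool:
    return action > 10


print(find_bad_input(double, too_big, range(10)))  # 6
print(safe_to_run(double, too_big, range(5)))      # True
```

Note that the search runs only on a policy we already have; nothing here searches over policies, which is the part the black box can't do.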

Robustness to Scale

I think that the terms introduced by this post are great and I use them all the time

Six AI Risk/Strategy Ideas

Ah yes this seems totally correct

Buck's Shortform

[I'm not sure how good this is, it was interesting to me to think about, idk if it's useful, I wrote it quickly.]

Over the last year, I internalized Bayes' Theorem much more than I previously had. This led me to notice that when I applied it in my life it tended to have counterintuitive results; after thinking about it for a while, I concluded that my intuitions were right and I was using Bayes wrong. (I'm going to call Bayes' Theorem "Bayes" from now on.)

Before I can tell you about that, I need to make sure you're thinking about Bayes in terms of ratios rather than fractions. Bayes is enormously easier to understand and use when described in terms of ratios. For example: Suppose that 1% of women have a particular type of breast cancer, and a mammogram is 20 times more likely to return a positive result if you do have breast cancer, and you want to know the probability that you have breast cancer if you got that positive result. The prior probability ratio is 1:99, and the likelihood ratio is 20:1, so the posterior probability ratio is (1×20):(99×1) = 20:99, so you have a probability of 20/(20+99) = 20/119, or about 17%, of having breast cancer.
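The ratio form is easy to write down as code. This is just a sketch of the arithmetic in the mammogram example; the function names are made up for illustration.

```python
from fractions import Fraction
from typing import Tuple

Odds = Tuple[int, int]


def update_odds(prior: Odds, likelihood_ratio: Odds) -> Odds:
    """Odds-form Bayes: (a:b) updated by (c:d) gives (a*c : b*d)."""
    a, b = prior
    c, d = likelihood_ratio
    return (a * c, b * d)


def odds_to_probability(odds: Odds) -> Fraction:
    """Convert odds a:b into the probability a / (a + b)."""
    a, b = odds
    return Fraction(a, a + b)


posterior = update_odds(prior=(1, 99), likelihood_ratio=(20, 1))
print(posterior)                                         # (20, 99)
print(odds_to_probability(posterior))                    # 20/119
print(round(float(odds_to_probability(posterior)), 3))   # 0.168
```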

I think that this is absurdly easier than using the fraction formulation. I think that teaching the fraction formulation is the single biggest didactic mistake that I am aware of in any field.

Anyway, a year or so ago I got into the habit of calculating things using Bayes whenever they came up in my life, and I quickly noticed that Bayes seemed surprisingly aggressive to me.

For example, the first time I went to the Hot Tubs of Berkeley, a hot tub rental place near my house, I saw a friend of mine there. I wondered how regularly he went there. Consider the hypotheses of "he goes here three times a week" and "he goes here once a month". The likelihood ratio is about 12x in favor of the former hypothesis. So if I previously was ten to one against the three-times-a-week hypothesis compared to the once-a-month hypothesis, I'd now be 12:10 = 6:5 in favor of it. This felt surprisingly high to me.
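For concreteness, here's roughly how that calculation goes if you don't round. It assumes a 30-day month, at most one visit per night, and visits on uniformly random nights (that last assumption is the one questioned later in the post). The variable names are just for illustration.

```python
from fractions import Fraction

DAYS_PER_WEEK = 7
DAYS_PER_MONTH = 30  # assumption: treat a month as 30 days

# Probability he is there on a given random night under each hypothesis.
p_seen_if_3x_week = Fraction(3, DAYS_PER_WEEK)
p_seen_if_monthly = Fraction(1, DAYS_PER_MONTH)

likelihood_ratio = p_seen_if_3x_week / p_seen_if_monthly
prior_odds = Fraction(1, 10)  # ten to one against three-times-a-week
posterior_odds = prior_odds * likelihood_ratio

print(likelihood_ratio)                   # 90/7
print(round(float(likelihood_ratio), 2))  # 12.86
print(posterior_odds)                     # 9/7
```

Rounding the likelihood ratio down to 12 gives the 12:10 = 6:5 figure above; the unrounded version comes out to 9:7, which is slightly more aggressive.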

(I have a more general habit of thinking about whether the results of calculations feel intuitively too low or high to me; this has resulted in me noticing amusing inconsistencies in my numerical intuitions. For example, my intuitions say that $3.50 for ten photo prints is cheap, but 35c per print is kind of expensive.)

Another example: A while ago I walked through six cars of a train, which felt like an unusually long way to walk. But I realized that I'm 6x more likely to see someone who walks 6 cars than someone who walks 1.

In all these cases, Bayes' Theorem suggested that I update further in the direction of the hypothesis favored by the likelihood ratio than I intuitively wanted to. After considering this a bit more, I have come to the conclusion that my intuitions were directionally right; I was calculating the likelihood ratios in a biased way, and I was also bumping up against an inconsistency in how I estimated priors and how I estimated likelihood ratios.

If you want, you might enjoy trying to guess what mistake I think I was making, before I spoil it for you.

Here's the main mistake I think I was making. Remember the two hypotheses about my friend going to the hot tub place 3x a week vs once a month? I said that the likelihood ratio favored the first by 12x. I calculated this by assuming that in both cases, my friend visited the hot tub place on random nights. But in reality, when I'm asking whether my friend goes to the hot tub place 3x every week, I'm asking about the total probability of all hypotheses in which he visits the hot tub place 3x per week. There are a variety of such hypotheses, and when I construct them, I notice that some of the hypotheses place a higher probability on me seeing my friend than the random night hypothesis does. For example, it was a Saturday night when I saw my friend there and started thinking about this. It seems kind of plausible that my friend goes once a month and 50% of the times he visits are on a Saturday night. If my friend went to the hot tub place three times a week on average, no more than a third of those visits could be on a Saturday night.
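To make this concrete, here's a sketch where each hypothesis class is a weighted mix of a "random nights" member and a "Saturday-heavy" member, and the observation is "I saw him on this particular Saturday". The 50/50 weights inside each class are made up for illustration, as are the 30-day month and all the names; only the structure of the calculation is the point.

```python
from fractions import Fraction
from typing import List, Tuple

SATURDAYS_PER_MONTH = Fraction(30, 7)  # assumption: 30-day month


def class_likelihood(members: List[Tuple[Fraction, Fraction]]) -> Fraction:
    """Likelihood of a hypothesis class: the weighted average of its
    members' likelihoods. Each member is (weight within class, P(data | member)).
    """
    assert sum(w for w, _ in members) == 1, "weights within a class must sum to 1"
    return sum((w * p for w, p in members), Fraction(0))


half = Fraction(1, 2)  # hypothetical within-class weight

# Class "once a month": P(he is there on this Saturday night).
monthly_random = Fraction(1, 30)
# Half of his visits fall on a Saturday: 0.5 Saturday visits per month,
# spread over about 30/7 Saturdays.
monthly_saturday_heavy = half / SATURDAYS_PER_MONTH  # 7/60

# Class "three times a week".
weekly_random = Fraction(3, 7)
# A third of his visits on Saturday means he is there every Saturday.
weekly_every_saturday = Fraction(1)

naive_ratio = weekly_random / monthly_random
mixed_ratio = class_likelihood(
    [(half, weekly_random), (half, weekly_every_saturday)]
) / class_likelihood(
    [(half, monthly_random), (half, monthly_saturday_heavy)]
)

print(naive_ratio, round(float(naive_ratio), 2))  # 90/7 12.86
print(mixed_ratio, round(float(mixed_ratio), 2))  # 200/21 9.52
```

With these made-up weights the likelihood ratio drops from about 12.9 to about 9.5. The Saturday-heavy member multiplies the once-a-month likelihood by 3.5 (from 1/30 to 7/60), but can only multiply the three-times-a-week likelihood by about 2.3 (from 3/7 to 1).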

I think there's a general phenomenon where when I make a hypothesis class like "going once a month", I neglect to think about things about specific hypotheses in the class which make the observed data more likely. The hypothesis class offers a tempting way to calculate the likelihood, but it's in fact a trap.

There's a general rule here, something like: When you see something happen that a hypothesis class thought was unlikely, you update a lot towards hypotheses in that class which gave it unusually high likelihood.

And this next part is something that I've noticed, rather than something that follows from the math, but it seems like most of the time when I make up hypothesis classes, something like this happens: I initially calculate the likelihood to be lower than it is, and the likelihoods of different hypothesis classes turn out to be closer together than my initial calculation suggested.

(I suspect that the concept of a maximum entropy hypothesis is relevant. For every hypothesis class, there's a maximum entropy (aka maxent) hypothesis, which is the hypothesis which is maximally uncertain subject to the constraint of the hypothesis class. Eg the maximum entropy hypothesis for the class "my friend visits the hot tub place three times a month on average" is the hypothesis where the probability of my friend visiting the hot tub place is equal on every day and uncorrelated between days. In my experience in real world cases, hypothesis classes tend to contain non-maxent hypotheses which fit the data much better. In general for a statistical problem, these hypotheses don't do better than the maxent hypothesis; I don't know why they tend to do better in problems I think about.)
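As a rough check on the maxent claim, here's a sketch comparing two schedules with the same average of three visits per week, where each night is an independent coin flip. The weekly framing, the specific schedules, and the function names are my own assumptions for illustration.

```python
import math
from typing import Sequence


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no event with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def schedule_entropy(daily_probs: Sequence[float]) -> float:
    """Total entropy of a week of independent nightly visit decisions."""
    return sum(binary_entropy(p) for p in daily_probs)


# Same visit probability every night (the maxent member of the class).
uniform = [3 / 7] * 7
# Always goes on Saturday, with the other two visits spread over six nights.
saturday_heavy = [1 / 3] * 6 + [1.0]

print(round(schedule_entropy(uniform), 2))
print(round(schedule_entropy(saturday_heavy), 2))
```

The uniform schedule comes out at roughly 6.9 bits against roughly 5.5 bits for the Saturday-heavy one, so the "random nights" hypothesis is the more uncertain of the two, as expected.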

Another thing causing my posteriors to be excessively biased towards low-prior high-likelihood hypotheses is that priors tend to be more subjective to estimate than likelihoods are. I think I'm probably underconfident in assigning extremely high or low probabilities to hypotheses, and this means that when I see something that looks like moderate evidence of an extremely unlikely event, the likelihood ratio is more extreme than the prior, leading me to have a counterintuitively high posterior on the low-prior hypothesis. I could get around this by being more confident in my probability estimates at the 98% or 99% level, but it takes a really long time to become calibrated on those.

Open & Welcome Thread - November 2019

Email me at buck@intelligence.org with some more info about you and I might be able to give you some ideas (and we can maybe talk about things you could do for AI alignment more generally)

Six AI Risk/Strategy Ideas

Minor point: I think asteroid strikes are probably very highly correlated between Everett branches (though maybe the timing of spotting an asteroid on a collision course is variable).

Buck's Shortform

A couple weeks ago I spent an hour talking over video chat with Daniel Cantu, a UCLA neuroscience postdoc who I hired on Wyzant.com to spend an hour answering a variety of questions about neuroscience I had. (Thanks Daniel for reviewing this blog post for me!)

The most interesting thing I learned is that I had quite substantially misunderstood the connection between convolutional neural nets and the human visual system. People claim that these are somewhat bio-inspired, and that if you look at early layers of the visual cortex you'll find that it operates kind of like the early layers of a CNN, and so on.

The claim that the visual system works like a CNN didn’t quite make sense to me though. According to my extremely rough understanding, biological neurons operate kind of like the artificial neurons in a fully connected neural net layer--they have some input connections and a nonlinearity and some output connections, and they have some kind of mechanism for Hebbian learning or backpropagation or something. But that story doesn't seem to have a mechanism for how neurons do weight tying, which to me is the key feature of CNNs.
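In case "weight tying" is unfamiliar, here's a toy 1D sketch of the difference between a convolutional layer (one kernel reused at every position) and a locally connected layer (a separate kernel per position). It's plain Python with made-up function names, and it's only meant to show the parameter sharing, not how any real library implements it.

```python
from typing import List, Sequence


def conv1d_tied(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Weight-tied layer: the same kernel is applied at every position.

    (Strictly a cross-correlation, which is what ML libraries usually
    call "convolution".)
    """
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def conv1d_untied(
    signal: Sequence[float], kernels: Sequence[Sequence[float]]
) -> List[float]:
    """Locally connected layer: position i has its own weights kernels[i]."""
    k = len(kernels[0])
    return [
        sum(kernels[i][j] * signal[i + j] for j in range(k))
        for i in range(len(kernels))
    ]


signal = [0, 0, 1, 1, 0, 0]
edge_kernel = [-1, 1]  # responds to a step up (+1) or a step down (-1)

tied_out = conv1d_tied(signal, edge_kernel)
# Five output positions, each with its own copy of the same two weights.
untied_kernels = [list(edge_kernel) for _ in range(5)]
untied_out = conv1d_untied(signal, untied_kernels)

print(tied_out)    # [0, 1, 0, -1, 0]
print(untied_out)  # [0, 1, 0, -1, 0]
print(len(edge_kernel), sum(len(k) for k in untied_kernels))  # 2 10
```

The two layers compute the same thing here, but the tied version has 2 learnable weights and the untied version has 10, and the untied version would have to learn the edge detector separately at each position.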

Daniel claimed that indeed human brains don't have weight tying, and we achieve the efficiency gains over dense neural nets by two other mechanisms instead:

Firstly, the early layers of the visual cortex are set up to recognize particular low-level visual features like edges and motion, but this is largely genetically encoded rather than learned with weight-sharing. One way that we know this is that mice develop a lot of these features before their eyes open. These low-level features can be reinforced by positive signals from later layers, like other neurons, but these updates aren't done with weight-tying. So the weight-sharing and learning here is done at the genetic level.

Secondly, he thinks that we get around the need for weight-sharing at later levels by not trying to be able to recognize complicated details with different neurons. Our vision is way more detailed in the center of our field of view than around the edges, and if we need to look at something closely we move our eyes over it. He claims that this gets around the need to have weight tying, because we only need to be able to recognize images centered in one place.

I was pretty skeptical of this claim at first. I pointed out that I can in fact read letters that are a variety of distances from the center of my visual field; his guess is that I learned to read all of these separately. I'm also kind of confused by how this story fits in with the fact that humans seem to relatively quickly learn to adapt to inversion goggles. I would love to check what some other people who know neuroscience think about this.

I found this pretty mindblowing. I've heard people use CNNs as an example of how understanding brains helped us figure out how to do ML stuff better; people use this as an argument for why future AI advances will need to be based on improved neuroscience. This argument seems basically completely wrong if the story I presented here is correct.

Buck's Shortform

I recommend looking on Wyzant.

In OpenAI's Roboschool blog post: