When generating, we will sample uniformly, which requires
bits to describe. This gives the loss
You should be using an MSE between uniform and the batch mean instead of a KL divergence in the loss. The batch mean is only an estimate of what you truly want, which is the mean over the entire dataset (or perhaps over all possible images, but there's not much you can do about that). If you substitute it directly for the dataset mean in the KL divergence, the resulting gradients are not unbiased estimators of the correct gradients. On the other hand, if you use an MSE loss instead, the gradients are unbiased estimators of the correct gradients for the MSE. In the limit as the dataset marginals approach the uniform distribution, the gradients of the KL divergence become parallel to the gradients of the MSE, so it's okay to use an MSE instead.
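For concreteness, here is a minimal sketch of the two options, assuming probs is a (batch, K) tensor of selection probabilities like the one in your code; the shapes and the KL argument order are just my guesses for illustration.

import torch

# Hypothetical shapes: a batch of 32 examples, each choosing among K = 8 images.
logits = torch.randn(32, 8, requires_grad=True)
probs = torch.softmax(logits, dim=1)
K = probs.shape[1]
uniform = torch.full((K,), 1.0 / K)

batch_mean = probs.mean(dim=0)  # batch estimate of the dataset marginal

# Current approach: KL(uniform || batch mean). Substituting the batch estimate
# for the dataset marginal inside the log is what makes the gradient a biased
# estimate of the full-dataset gradient (the KL is nonlinear in the marginal).
kl_loss = (uniform * (uniform.log() - batch_mean.log())).sum()

# Suggested approach: MSE between uniform and the batch mean.
mse_loss = ((batch_mean - uniform) ** 2).mean()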
Right now in your code, you only calculate reconstruction error gradients for the very last step:
if random.random() > delta:
    loss = loss + (probs * err).sum(dim=1).mean()
    break
Pragmatically, it is more efficient to calculate reconstruction error gradients at every step and just weight by the probability of being the final image:
loss = loss + (1 - delta) * (probs * err).sum(dim=1).mean()
if random.random() > delta:
    break
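Concretely, a sketch of the whole sampling loop with this change, where model.init_state and model.step are hypothetical stand-ins for however your model actually produces the selection probabilities and per-candidate reconstruction errors at each depth:

import random
import torch

def rollout_loss(model, x, delta, max_steps=64):
    loss = torch.zeros(())
    state = model.init_state(x)               # hypothetical interface
    for _ in range(max_steps):
        probs, err, state = model.step(state)  # probs, err: (batch, K)
        # Accumulate at every step, weighted by the probability (1 - delta)
        # that this step turns out to be the final image.
        loss = loss + (1 - delta) * (probs * err).sum(dim=1).mean()
        if random.random() > delta:             # halt with probability 1 - delta
            break
    return loss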
Although not mentioned in Yang's paper, we can instead select images proportional to ...
This gives the loss. If we want an infinite-depth model, we can choose to sometimes halt, but usually sample another image with probability δ (for 'discount factor'). Also, as the depth increases, the images should become more similar to each other, so it should increase exponentially to compensate. Empirically, I found this to give decent results.
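For reference, this halting rule makes the sampled depth geometrically distributed in δ:

P(\text{depth} = d) = \delta^{\,d-1}(1 - \delta), \quad d = 1, 2, 3, \dots, \qquad \mathbb{E}[\text{depth}] = \frac{1}{1 - \delta}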
I think you should choose the variance so that it matches the sample variance, over the batch, of the difference between the closest choice and the target. This is because a good model should match both the mean and the variance of the ground truth. The ground truth is that, when you encode an image, you choose the candidate that has the least reconstruction error. The probabilities can be interpreted as conditional probabilities that you chose the right candidate for the encoding, where each candidate has a Gaussian prior for being the "right" encoding, with the candidate itself as the mean and this shared variance. The variance of the prior for the candidate that is actually chosen should match the variance it sees in the real world. Hence, my recommendation for the variance.
(You should weight the MSE loss by the inverse of this variance as well.)
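A sketch of that estimate; the shapes and names here are made up (candidates of shape (batch, K, D), target of shape (batch, D)):

import torch

def prior_variance(candidates, target):
    # Squared reconstruction error of each candidate against the target.
    err = ((candidates - target.unsqueeze(1)) ** 2).sum(dim=2)       # (batch, K)
    idx = err.argmin(dim=1)                                          # closest candidate per example
    closest = candidates[torch.arange(candidates.shape[0]), idx]     # (batch, D)
    # Sample variance, over the batch, of the residual between the closest
    # candidate and the target: the suggested value for the prior's variance.
    return (closest - target).var()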
You're mostly right. The other solves have given pretty much identical distributions.
Some of your distributions are worse than others. If I run 100,000,000 experiments and calculate the frequencies, some of you will be further off at the fourth decimal place than others.
The market doesn't have that kind of precision, and even if it did, I wouldn't change the resolution criterion. But I can still score you guys myself later on.
I do agree that I should have given far fewer public experiments. Then it would have been a better test of priors.
It's asking, "If I draw a histogram of the frequency of R in the fifth trial, with buckets corresponding to the number of Rs in the first four trials, what will the heights of the bars be?"
We are not doing any more experiments. All the experiments have already been done in the 1,000,000 provided experiments. I've just left out the fifth trial from these experiments.
This is almost the same question as, "If we do experiment 1,000,001 and see k Rs in the first four trials, then what credence do you assign to the fifth trial being R," but not quite. Your goal is to predict the marginal frequencies for the experiments I have actually conducted, not for any idealized "next experiment". Because 1,000,000 experiments is so many, the two should be close, but they are not quite the same. The actual marginal frequencies will have some noise, for example.
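In code, with a made-up stand-in for the data (a 1,000,000 by 5 boolean array, True where a trial came up R, the fifth column being the withheld trial), the question is asking for these five bar heights:

import numpy as np

rng = np.random.default_rng(0)
trials = rng.random((1_000_000, 5)) < 0.5    # stand-in data; each row is one experiment

k = trials[:, :4].sum(axis=1)                # number of Rs in the first four trials
for count in range(5):
    bucket = trials[k == count, 4]           # fifth trials of the experiments in this bucket
    print(count, bucket.mean())              # bar height: frequency of R in the fifth trial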
I hope this helps! If you need more explanation, feel free to ask.
Correct, they are not equivalent. The second statement is a consequence of the first. I made this consequence explicit to justify my choice later on to bucket by the number of Rs but not their order.
The first statement, though, is also true. It's your full guarantee.
Is this inspired by the recent HSBC and IBM paper about using quantum computers to price bonds? https://arxiv.org/abs/2509.17715v1
I haven't read it myself, but someone who knows much more quantum mechanics than I mentioned it to me.
I agree. I think real analysis should really take a more topological approach to limits and continuity. In a topology classroom, they would instead define a limit in the real numbers as "every open ball around your limit point contains all of the elements of the sequence past a certain index", which is much the same as your description of Terry Tao's "ε-close" and "eventually ε-close". Likewise, a continuous function would be defined as, "For every open ball around the function's value in the range, there is an open ball around the point in the domain whose points get mapped inside the range's ball." The whole ε-δ definition obscures what is really going on with a bunch of mathematical jargon.
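In symbols, with B_ε(x) the open ball of radius ε around x, the two definitions above read:

\lim_{n \to \infty} x_n = L \iff \forall \varepsilon > 0 \;\exists N \;\forall n \ge N : x_n \in B_\varepsilon(L)

f \text{ is continuous at } a \iff \forall \varepsilon > 0 \;\exists \delta > 0 : f\big(B_\delta(a)\big) \subseteq B_\varepsilon\big(f(a)\big)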
For Linux users on US keyboards, you might want to try making Caps Lock the Multi key (also called the Compose key). On Cinnamon this can be done by going to Keyboard > Layouts > Options... > Position of Compose key, and other desktop environments probably have similar settings.
This lets me type umlauts (ä, ü, ö), foreign currencies (£, €, ¥), copyright/trademark symbols (©, ™), and a bunch of other stuff. For example, "ü" is made by typing Compose, u, and " in sequence. I also added the line
<Multi_key> <backslash> : "λ"
to my ~/.XCompose file so that I can type λ efficiently; this is useful when writing Lisp code.
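For anyone setting this up, a minimal ~/.XCompose might look like the following; the include line pulls in the stock sequences for your locale so the default ones keep working:

include "%L"

# Compose, \  ->  λ
<Multi_key> <backslash> : "λ"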
I think the problem here is the assumption that there is only one AI company. If there are multiple AI companies and they don't form a trust, then they need to bid against each other to acquire safety researchers, right? This is like in economics where, if you are the only person selling bread, you can sell it for ε less than its value to any given customer, but if there are multiple people selling bread, you need to sell it for ε less than your competitors' prices.