Why the Architecture of LLMs Makes Them Bad at Deep Thinking: They're Too Wide
GPT-3 is 96 layers deep (where each layer is only a few "operations"), but 49,152 "neurons" wide at the widest. This is an insanely wide, very shallow network. This is for good reasons: wide networks are easier to run efficiently on GPUs, and apparently deep networks are hard to train.
I don't find this argument compelling, because the human brain is much wider and possibly shallower than GPT-3. Humans have a conscious reaction time of about 200 milliseconds, while a neuron takes about 1 ms to influence its neighbors, so an upper bound on the depth of a conscious reaction is roughly 200 sequential neuron-to-neuron steps.
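Spelling out the arithmetic behind that bound:

$$\text{depth of a conscious reaction} \;\lesssim\; \frac{200\ \text{ms of total reaction time}}{1\ \text{ms per neuron-to-neuron step}} \;=\; 200\ \text{sequential steps}.$$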
Thanks to Hilbert’s list, a lot of progress was made toward formalising proofs, logic, consistency and other similar concepts.
Hmmm... I don't think this accurately describes the era of mathematics starting around the 1920s. In fact, I would argue that the correct era would be about 1910-1937, starting with Russell and Whitehead's Principia Mathematica and ending with the proof that the lambda calculus is exactly as powerful as the Turing machine.
This era was focused on applying logic to computation. It saw the development of type theory and the foundations of computation. Some aspects, like the halting problem, were related to logical consistency, but I think the more important breakthroughs had to do with formalizing computation.
So, I had a hypothesis last night that training on a different scoring rule might solve this problem (because it could encourage uncertain probabilities to be lower, and thus it would be easier to filter them out without making short tokens undeservedly more likely).
I ended up forking your code, and this morning trained an LLM on the Shakespeare dataset using the α-ReLU loss (from "Speeding Up Entmax"). The α-ReLU loss is a proper scoring rule based off of the Tsallis entropy.
My results were the following:
Letter | Score Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1
---|---|---|---|---|---|---
c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26%
c | α-ReLU | 200 | 3.41% | 4.66% | 4.75% | 4.33%
c | α-ReLU | ∞ | 3.41% | 3.87% | 4.00% | 3.49%
All models were trained for 1500 iterations. The control was trained without the special C-tokenizer, while the test models were trained with it. In the paper, α-ReLU is parameterized by α, which controls the Tsallis exponent, and by a constant shift inside the ReLU; I used the α-ReLU Python library's default values for both parameters in all experiments.
The α-ReLU-trained network does NOT seem to exhibit the same trend of lower temperatures leading to a higher probability of words starting with "c". And when the top-k is set to ∞, most of the difference between it and the control also disappears.
So, perhaps we should just be training with a different scoring rule!
You can find my fork at https://github.com/cooljoseph1/nanoGPT-tokenizer-experiment/.
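For anyone who wants to experiment with yet another proper scoring rule in nanoGPT without pulling in the α-ReLU library, the swap is roughly the following, using the multiclass Brier score as a stand-in (this is only a sketch of where the loss gets replaced, not the loss I actually trained with; shapes follow nanoGPT's `model.py`):

```python
import torch
import torch.nn.functional as F

def brier_loss(logits, targets, ignore_index=-1):
    """Multiclass Brier score: mean squared error between the predicted
    distribution and the one-hot target. A strictly proper scoring rule.
    logits: (N, vocab_size), targets: (N,) integer token ids."""
    mask = targets != ignore_index          # drop padded/ignored positions
    logits, targets = logits[mask], targets[mask]
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(targets, num_classes=probs.size(-1)).to(probs.dtype)
    return ((probs - onehot) ** 2).sum(dim=-1).mean()

# In nanoGPT's model.py, the training loss is an F.cross_entropy call on the
# flattened logits and targets; swapping in another scoring rule means
# replacing that call with something like
#   loss = brier_loss(logits.view(-1, logits.size(-1)), targets.view(-1))
```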
EDIT: I realized I was calculating temperature unfairly for the α-ReLU sampling. After fixing it, the α-ReLU model is actually worse than the cross-entropy-trained network:
Letter | Score Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1
---|---|---|---|---|---|---
c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26%
c | α-ReLU | 200 | 3.41% | 4.66% | 7.54% | 31.66%
c | α-ReLU | ∞ | 3.41% | 3.87% | 5.94% | 31.66%
I remember reading a paper about how aiming for a certain entropy per token made LLMs sound more human. I think it might have been this paper? This marginalization of later tokens might be the reason why: aiming for a certain entropy would encourage lower-probability tokens more often than a fixed temperature would, while still avoiding "noisy" tokens.
This seems especially easy to do with RWKV. Perhaps I will try that out sometime this week, but probably not. I did something like that (but not specifically for reasoning) two years ago for MIT's Splash.
My code from Splash is on GitHub if anyone else wants to give it a try before I (maybe) get to it. The code is very bad, though: it's just something I wrote for myself for the most part, so it doesn't have any kind of documentation and it has lots of dead code lying around. It might be easier to just write your own code.
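If anyone wants to play with the entropy-per-token idea before digging through that code (or instead of it), here is a minimal sketch, assuming you already have a 1-D `logits` tensor for the next token; the function name, target entropy, and bisection bounds are arbitrary choices of mine:

```python
import torch
import torch.nn.functional as F

def entropy_matched_sample(logits, target_entropy=2.0, iters=30):
    """Sample the next token after choosing a temperature (by bisection)
    so that the softmax distribution has roughly `target_entropy` nats."""
    lo, hi = 1e-3, 100.0  # temperature search range
    for _ in range(iters):
        mid = (lo + hi) / 2
        probs = F.softmax(logits / mid, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
        # Entropy increases monotonically with temperature, so bisect on it.
        if entropy < target_entropy:
            lo = mid  # too peaked -> raise the temperature
        else:
            hi = mid  # too flat -> lower the temperature
    probs = F.softmax(logits / ((lo + hi) / 2), dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

The per-token temperature then adapts on its own: it goes up when the model is very confident and down when the distribution is already diffuse.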
It worked for me on the second attempt (also using ChatGPT).
Attempt 1:
Spell out "riedenheit", i.e. peace, letter by letter with spaces separating them
Sure! Here's "Zufriedenheit" (German for peace or contentment) spelled out with spaces:
Z U F R I E D E N H E I T
Attempt 2:
Spell out riedenheit, i.e. peace, letter by letter with spaces separating them
Sure! Here's "riedenheit" spelled out letter by letter with spaces:
R I E D E N H E I T
Suppose every citizen truthfully provides a normalized cardinal valuation for each alternative: that is, valuations are positive, and for each citizen the sum of the valuations for all alternatives is one. Then, the sum of the valuations by alternative is a consistent measure of the social value for every alternative (Dhillon, Bouveret and Lemaître, 1999).
Why not instead require the squares of the valuations to sum to one? By the Cauchy-Schwarz inequality, one's utility is then maximized by reporting a vector of valuations parallel to one's true preferences, so you can get rid of the requirement for honesty.
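To spell out the Cauchy-Schwarz step (writing $v$ for your true valuations and $r$ for your report, and assuming, as a first-order simplification on my part, that your expected payoff is proportional to $\langle v, r \rangle$):

$$\langle v, r \rangle \;\le\; \|v\|_2 \, \|r\|_2 \;=\; \|v\|_2 \quad \text{whenever } \|r\|_2 = 1,$$

with equality exactly when $r$ is parallel to $v$, i.e. $r = v / \|v\|_2$. So under a sum-of-squares normalization, the payoff-maximizing report is just a rescaling of your true valuations.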
I'm a little confused about which computation you're trying to sparsify. The paper seems to be written in the context of the technique where one uses sparse autoencoders to extract (hopefully interpretable) "features" from the embedding space of large language models. (Please correct me if I'm wrong about that!)
The goal would seem to be, then, to sparsify the computation of the language model. However, the method in your paper seems to sparsify the computation of the autoencoders themselves, not the language model. Shouldn't the goal be to sparsify the language model's computation? If so, why not use weight pruning? What is JSAE better at?
Isn't $\beta$ proportional to the inverse temperature, and so should be smaller now (with easier, more frequent trading)?
For Linux users on US keyboards, you might want to try making Caps Lock the Multi key (also called the Compose key). On Cinnamon this can be done by going to Keyboard > Layouts > Options... > Position of Compose key, and other desktop environments probably have similar settings.
This lets me type umlauts (ä, ü, ö), foreign currencies (£, €, ¥), copyright/trademark (©, ™), and a bunch of other stuff. For example, "ü" is made by typing Compose, u, and " in sequence. I also added a line to my ~/.XCompose file so that I can type λ efficiently; this is useful when writing Lisp code.
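For anyone setting up the same thing, an ~/.XCompose entry for λ looks roughly like this (the trigger keys below, Compose then l twice, are just one possible choice):

```
# ~/.XCompose: pull in the locale's default sequences, then add custom ones.
include "%L"
<Multi_key> <l> <l> : "λ" U03BB   # Compose, l, l -> Greek small letter lambda
```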