When generating, we will sample uniformly, which requires
bits to describe. This gives the loss
You should be using an MSE between uniform and the batch mean instead of a KL divergence in the loss. The batch mean is only an estimate of what you truly want, which is the mean over the entire dataset (or perhaps over all possible images, but there's not much you can do about that). If you substitute it directly for the dataset mean in the KL divergence, the resulting gradients are not unbiased estimators of the correct gradients. On the other hand, if you use an MSE loss instead, the gradients are unbiased estimators of the correct gradients for the MSE. In the limit as the dataset marginals approach the uniform distribution, the gradients of the KL divergence become parallel to the gradients of the MSE, so it's okay to use an MSE instead.
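For concreteness, here is a minimal sketch of the two options, assuming probs is a (batch, K) tensor of selection probabilities like the one in your code; the shapes and the KL argument order are just my guesses for illustration.

import torch

# Hypothetical shapes: a batch of 32 examples, each choosing among K = 8 images.
logits = torch.randn(32, 8, requires_grad=True)
probs = torch.softmax(logits, dim=1)
K = probs.shape[1]
uniform = torch.full((K,), 1.0 / K)

batch_mean = probs.mean(dim=0)  # batch estimate of the dataset marginal

# Current approach: KL(uniform || batch mean). Substituting the batch estimate
# for the dataset marginal inside the log is what makes the gradient a biased
# estimate of the full-dataset gradient (the KL is nonlinear in the marginal).
kl_loss = (uniform * (uniform.log() - batch_mean.log())).sum()

# Suggested approach: MSE between uniform and the batch mean.
mse_loss = ((batch_mean - uniform) ** 2).mean()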
Right now in your code, you only calculate reconstruction error gradients for the very last step:
if random.random() > delta:
    loss = loss + (probs * err).sum(dim=1).mean()
    break
Pragmatically, it is more efficient to calculate reconstruction error gradients at every step and just weight by the probability of being the final image:
loss = loss + (1 - delta) * (probs * err).sum(dim=1).mean()
if random.random() > delta:
    break
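Concretely, a sketch of the whole sampling loop with this change, where model.init_state and model.step are hypothetical stand-ins for however your model actually produces the selection probabilities and per-candidate reconstruction errors at each depth:

import random
import torch

def rollout_loss(model, x, delta, max_steps=64):
    loss = torch.zeros(())
    state = model.init_state(x)               # hypothetical interface
    for _ in range(max_steps):
        probs, err, state = model.step(state)  # probs, err: (batch, K)
        # Accumulate at every step, weighted by the probability (1 - delta)
        # that this step turns out to be the final image.
        loss = loss + (1 - delta) * (probs * err).sum(dim=1).mean()
        if random.random() > delta:             # halt with probability 1 - delta
            break
    return loss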
Although not mentioned in Yang's paper, we can instead select images proportional to ...
This gives the loss. If we want an infinite-depth model, we can choose to sometimes halt, but usually sample another image with probability δ (for 'discount factor'). Also, as the depth increases, the images should become more similar to each other, so it should increase exponentially to compensate. Empirically, I found this to give decent results.
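For reference, this halting rule makes the sampled depth geometrically distributed in δ:

P(\text{depth} = d) = \delta^{\,d-1}(1 - \delta), \quad d = 1, 2, 3, \dots, \qquad \mathbb{E}[\text{depth}] = \frac{1}{1 - \delta}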
I think you should choose the variance so that it matches the sample variance, over the batch, of the difference between the closest choice and the target. This is because a good model should match both the mean and the variance of the ground truth. The ground truth is that, when you encode an image, you choose the candidate that has the least reconstruction error. The probabilities can be interpreted as conditional probabilities that you chose the right candidate for the encoding, where each candidate has a Gaussian prior for being the "right" encoding, with the candidate itself as the mean and this shared variance. The variance of the prior for the candidate that is actually chosen should match the variance it sees in the real world. Hence, my recommendation for the variance.
(You should weight the MSE loss by the inverse of this variance as well.)
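A sketch of that estimate; the shapes and names here are made up (candidates of shape (batch, K, D), target of shape (batch, D)):

import torch

def prior_variance(candidates, target):
    # Squared reconstruction error of each candidate against the target.
    err = ((candidates - target.unsqueeze(1)) ** 2).sum(dim=2)       # (batch, K)
    idx = err.argmin(dim=1)                                          # closest candidate per example
    closest = candidates[torch.arange(candidates.shape[0]), idx]     # (batch, D)
    # Sample variance, over the batch, of the residual between the closest
    # candidate and the target: the suggested value for the prior's variance.
    return (closest - target).var()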
You're mostly right. The other solves have given pretty much identical distributions.
Some of your distributions are worse than others. If I run 100,000,000 experiments and calculate the frequencies, some of you will be further off at the fourth decimal place than others.
The market doesn't have that kind of precision, and even if it did, I wouldn't change the resolution criterion. But I can still score you guys myself later on.
I do agree that I should have given far fewer public experiments. Then it would have been a better test of priors.
It's asking, "If I draw a histogram of the frequency of R in the fifth trial, with buckets corresponding to the number of Rs in the first four trials, what will the heights of the bars be?"
We are not doing any more experiments. All the experiments have already been done in the 1,000,000 provided experiments. I've just left out the fifth trial from these experiments.
This is almost the same question as, "If we do experiment 1,000,001 and see k Rs in the first four trials, then what credence do you assign to the fifth trial being R," but not quite. Your goal is to predict the marginal frequencies for the experiments I have actually conducted, not for any idealized "next experiment". Because 1,000,000 experiments is so many, the two should be close, but they are not quite the same. The actual marginal frequencies will have some noise, for example.
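In code, with a made-up stand-in for the data (a 1,000,000 by 5 boolean array, True where a trial came up R, the fifth column being the withheld trial), the question is asking for these five bar heights:

import numpy as np

rng = np.random.default_rng(0)
trials = rng.random((1_000_000, 5)) < 0.5    # stand-in data; each row is one experiment

k = trials[:, :4].sum(axis=1)                # number of Rs in the first four trials
for count in range(5):
    bucket = trials[k == count, 4]           # fifth trials of the experiments in this bucket
    print(count, bucket.mean())              # bar height: frequency of R in the fifth trial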
I hope this helps! If you need more explanation, feel free to ask.
Correct, they are not equivalent. The second statement is a consequence of the first. I made this consequence explicit to justify my choice later on to bucket by the number of Rs but not their order.
The first statement, though, is also true. It's your full guarantee.
Is this inspired by the recent HSBC and IBM paper about using quantum computers to price bonds? https://arxiv.org/abs/2509.17715v1
I haven't read it myself, but someone who knows much more quantum mechanics than I mentioned it to me.
I agree. I think real analysis should really take a more topological approach to limits and continuity. In a topology classroom, they would instead define a limit in the real numbers as "every open ball around your limit point contains all of the elements of the sequence past a certain index", which is much the same as your description of Terry Tao's "ε-close" and "eventually ε-close". Likewise, a continuous function would be defined as, "For every open ball around the function's value in the range, there is an open ball around the point in the domain whose points get mapped inside the range's ball." The whole ε-δ definition obscures what is really going on with a bunch of mathematical jargon.
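In symbols, with B_ε(x) the open ball of radius ε around x, the two definitions above read:

\lim_{n \to \infty} x_n = L \iff \forall \varepsilon > 0 \;\exists N \;\forall n \ge N : x_n \in B_\varepsilon(L)

f \text{ is continuous at } a \iff \forall \varepsilon > 0 \;\exists \delta > 0 : f\big(B_\delta(a)\big) \subseteq B_\varepsilon\big(f(a)\big)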
For Linux users on US keyboards, you might want to try making Caps Lock the Multi key (also called the Compose key). On Cinnamon this can be done by going to Keyboard > Layouts > Options... > Position of Compose key, and other desktop environments probably have similar settings.
This lets me type umlauts (ä, ü, ö), foreign currencies (£, €, ¥), copyright/trademark symbols (©, ™), and a bunch of other stuff. For example, "ü" is made by typing Compose, u, and " in sequence. I also added the line
<Multi_key> <backslash> : "λ"
to my ~/.XCompose file so that I can type λ efficiently; this is useful when writing Lisp code.
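For anyone setting this up, a minimal ~/.XCompose might look like the following; the include line pulls in the stock sequences for your locale so the default ones keep working:

include "%L"

# Compose, \  ->  λ
<Multi_key> <backslash> : "λ"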
I think the problem here is the assumption that there is only one AI company. If there are multiple AI companies and they don't form a trust, then they need to bid against each other to acquire safety researchers, right? This is like in economics where, if you are the only person selling bread, you can sell it for ε less than its value to any given customer, but if there are multiple people selling bread, you need to sell it for ε less than your competitors' prices.