LESSWRONG

joseph_c

Comments
It will cost you nothing to "bribe" a Utilitarian
joseph_c8h52

I think the problem here is the assumption that there is only one AI company. If there are multiple AI companies and they don't form a trust, then they need to bid against each other to acquire safety researchers, right? This is like in economics where, if you are the only person selling bread, you can sell it for ε less than its value to any given customer, but if there are multiple people selling bread you need to sell it for ε less than your competitors' prices.

Reply
Discrete Generative Models
joseph_c1d20

When generating, we will sample uniformly, which requires

$$\mathrm{KL}(p_{\text{batch mean}} \,\|\, \mathrm{Uniform}) = \text{constant} - H(p_{\text{batch mean}})$$

bits to describe. This gives the loss

$$\lVert \text{final image} - \text{target} \rVert^2 - \sum_{\text{iter}=1}^{\text{iters}} H\!\left(p^{\text{iter}}_{\text{batch mean}}\right).$$

You should be using an MSE between uniform and $p_{\text{batch mean}}$ instead of a KL divergence in the loss. The batch mean is only an estimate of what you truly want, which is the mean over the entire dataset (or perhaps over all possible images, but there's not much you can do about that). If you substitute it directly for the dataset mean in the KL divergence, the resulting gradients are not unbiased estimators of the correct gradients. On the other hand, if you use an MSE loss instead, the gradients are unbiased estimators of the correct gradients for the MSE. In the limit as the dataset marginals approach the uniform distribution, the gradients of the KL divergence become parallel to the gradients of the MSE, so it's okay to use an MSE instead.
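
To make this concrete, here is a small numerical check one could run (the toy softmax parameterization and every name in it are my own assumptions, not the post's setup): it averages the gradient of each loss over random batch means and compares that average with the gradient computed on the full dataset mean.

import torch

torch.manual_seed(0)

# Toy setup: example i contributes a categorical distribution p_i = softmax(logits[i]);
# the quantity of interest is its mean over the whole dataset vs. over a random batch.
N, K, B = 1000, 8, 16
logits = torch.randn(N, K, requires_grad=True)
uniform = torch.full((K,), 1.0 / K)

def kl_to_uniform(p):
    # KL(p || Uniform) = log K - H(p)
    return (p * (p.clamp_min(1e-12) / uniform).log()).sum()

def mse_to_uniform(p):
    return ((p - uniform) ** 2).sum()

def dataset_grad(loss_fn):
    p_mean = torch.softmax(logits, dim=1).mean(dim=0)
    return torch.autograd.grad(loss_fn(p_mean), logits)[0]

def mean_batch_grad(loss_fn, n_batches=2000):
    total = torch.zeros_like(logits)
    for _ in range(n_batches):
        idx = torch.randint(0, N, (B,))
        p_batch = torch.softmax(logits[idx], dim=1).mean(dim=0)
        total += torch.autograd.grad(loss_fn(p_batch), logits)[0]
    return total / n_batches

for name, fn in [("KL", kl_to_uniform), ("MSE", mse_to_uniform)]:
    g_full, g_batch = dataset_grad(fn), mean_batch_grad(fn)
    gap = (g_batch - g_full).norm() / g_full.norm()
    print(f"{name}: relative gap between averaged batch gradient and full gradient = {gap:.4f}")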

Reply
Discrete Generative Models
joseph_c2d20

Right now in your code, you only calculate reconstruction error gradients for the very last step.

if random.random() > delta:  # halt: this iteration's sample becomes the final image
    loss = loss + (probs * err).sum(dim=1).mean()  # reconstruction term, added only at this last step
    break

Pragmatically, it is more efficient to calculate reconstruction error gradients at every step and just weight by the probability of being the final image:

# add the reconstruction term every step, weighted by the halting probability (1 - delta)
loss = loss + (1 - delta) * (probs * err).sum(dim=1).mean()
if random.random() > delta:  # then halt as before
    break
Reply
Discrete Generative Models
joseph_c2d*21

Although not mentioned in Yang's paper, we can instead select images proportional to $p \propto e^{-\beta \cdot \mathrm{error}(x,\, \text{target})}$ ...

This gives the loss $\lVert \text{final image} - \text{target} \rVert^2 - \sum_{\text{iter}=1}^{\text{iters}} H\!\left(p^{\text{iter}}_{\text{batch mean}}\right)$. If we want an infinite-depth model, we can choose to sometimes halt, but usually sample another image with probability δ (for 'discount factor'). Also, as the depth increases, the images should become more similar to each other, so β should increase exponentially to compensate. Empirically, I found $\beta = \delta^{-\text{iter}}$ as $\delta \to 1$ to give decent results.

I think you should choose β so that $(\beta/2)^{-1} = \frac{1}{B-1} \sum_{i=1}^{B} \lVert \text{best } x_i - \text{target}_i \rVert^2$, the sample variance over the batch between the closest choice and the target. This is because a good model should match both the mean and the variance of the ground truth. The ground truth is that, when you encode an image, you choose the $x_i$ that has the least reconstruction error. The probabilities $p_i \propto e^{-\beta\, \mathrm{error}(x_i,\, \text{target})}$ can be interpreted as conditional probabilities that you chose the right $x_i$ for the encoding, where each $x_i$ has a Gaussian prior for being the "right" encoding with mean $x_i$ and variance $2/\beta$. The variance of the prior for the $x_i$ that is actually chosen should match the variance it sees in the real world. Hence, my recommendation for β.

(You should weight the MSE loss by β as well.)
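
As a concrete sketch of that rule (the tensor names here are mine, not from the post), β can be computed from the batch so that 2/β matches the averaged squared error of the best matches, and then used to weight the reconstruction term:

import torch

def choose_beta(best_x: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # best_x, target: (B, D) -- the closest chosen image and its target for each batch element
    sq_err = ((best_x - target) ** 2).sum(dim=1)        # ||best x_i - target_i||^2
    variance = sq_err.sum() / (best_x.shape[0] - 1)     # the right-hand side of the formula above
    return 2.0 / variance                               # so that (beta / 2)^{-1} equals it

# and then weight the reconstruction term by beta, e.g.
# loss = loss + beta * (probs * err).sum(dim=1).mean()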

Reply
Experiment: Test your priors on Bernoulli processes.
joseph_c2d*20

You're mostly right. The other solvers have given pretty much identical distributions.

Some of your distributions are worse than others. If I run 100,000,000 experiments and calculate the frequencies, some of you will be further off at the fourth decimal place.

The market doesn't have that kind of precision, and even if it did, I wouldn't change the resolution criterion. But I can still score you guys myself later on.

I do agree that I should have given far fewer public experiments. Then it would have been a better test of priors.

Reply
Experiment: Test your priors on Bernoulli processes.
joseph_c3d10

It's asking, "If I draw a histogram of the frequency of R on the fifth trial, with buckets corresponding to the number of Rs in the first four trials, what will the heights of the bars be?"

We are not doing any more experiments. All the experiments have already been done in the 1,000,000 provided experiments. I've just left out the fifth trial from these experiments.

This is almost the same question as, "If we do experiment 1,000,001 and see k Rs in the first four trials, what credence do you assign to the 5th trial being R?", but not quite. Your goal is to predict the marginal frequencies for the experiments I have actually conducted, not for any idealized "next experiment". Because 1,000,000 experiments is so many, these should be close, but they are not quite the same. The actual marginal frequencies will have some noise, for example.
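
In code, the bucketed frequencies being asked for could be computed like this (assuming, hypothetically, that the provided experiments are loaded as a 0/1 array of shape (1_000_000, 5) with 1 meaning R):

import numpy as np

def fifth_trial_frequencies(trials: np.ndarray) -> np.ndarray:
    # trials: (num_experiments, 5) array of 0/1, where 1 means "R"
    k = trials[:, :4].sum(axis=1)                      # number of Rs in the first four trials
    freqs = np.empty(5)
    for bucket in range(5):
        freqs[bucket] = trials[k == bucket, 4].mean()  # frequency of R on the fifth trial
    return freqs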

I hope this helps! If you need more explanation, feel free to ask.

Reply
Experiment: Test your priors on Bernoulli processes.
joseph_c3d30

Correct, they are not equivalent. The second statement is a consequence of the first. I made this consequence explicit to justify my choice later on to bucket by the number of Rs but not their order.

The first statement, though, is also true. It's your full guarantee.

Reply
Using complex polynomials to approximate arbitrary continuous functions
joseph_c4d10

Is this inspired by the recent HSBC and IBM paper about using quantum computers to price bonds? https://arxiv.org/abs/2509.17715v1

I haven't read it myself, but someone who knows much more quantum mechanics than I do mentioned it to me.

Reply
Thinking Mathematically - Convergent Sequences
joseph_c7d10

I agree. I think real analysis should really take a more topological approach to limits and continuity. In a topology classroom, they would instead define a limit in the real numbers as "every open ball around your limit point contains all of the elements of the sequence past a certain index", which is much the same as your description of Terry Tao's "ϵ-close" and "eventually ϵ-close". Likewise, a continuous function would be defined by "for every open ball around f(x) in the range, there is an open ball around x in the domain whose points get mapped inside the range's ball." The whole ϵ-δ definition obscures what is really going on with a bunch of mathematical jargon.
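
Spelled out with open balls (these are the standard statements, not quotes from the post):

$$x_n \to L \iff \forall \epsilon > 0 \;\exists N \;\forall n \ge N:\; x_n \in B_\epsilon(L)$$

$$f \text{ is continuous at } x \iff \forall \epsilon > 0 \;\exists \delta > 0:\; f\bigl(B_\delta(x)\bigr) \subseteq B_\epsilon\bigl(f(x)\bigr)$$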

Reply
Remap your caps lock key
joseph_c1mo30

For Linux users on US Keyboards, you might want to try making Caps Lock the multi key (also called the compose key). On Cinnamon this can be done by going to Keyboard > Layouts > Options... > Position of Compose key, and other desktop environments probably have similar settings.
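
(On other desktop environments, or from a terminal, the generic X11 option for the same remap is the command below; note that your desktop's own keyboard settings may override it.)

setxkbmap -option compose:caps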

This lets me type umlauts (ä, ü, ö), foreign currencies (£, €, ¥), copyright/trademark (©, ™), and a bunch of other stuff. For example, "ü" is made by typing Compose, u, and " in sequence. I also added the line

<Multi_key> <backslash> : "λ"

to my ~/.XCompose file so that I can type λ efficiently; this is useful when writing Lisp code.

Reply