DaemonicSigil

Linkpost for: https://pbement.com/posts/perturbation_theory.html

In quantum mechanics there is this idea of perturbation theory, where a Hamiltonian $H$ is perturbed by some change $\Delta H$ to become $H + \Delta H$. As long as the perturbation is small, we can use the technique of perturbation theory to find out facts about the perturbed Hamiltonian, like what its eigenvalues should be.

An interesting question is whether we can also do perturbation theory in machine learning. Suppose I am training a GAN, a diffuser, or some other machine learning technique that matches an empirical distribution. We'll use a statistical physics setup to say that the empirical distribution is given by:

$$p(x) = \frac{e^{-H(x)}}{Z}$$

Note that we may or may not have an explicit formula for $H(x)$. The distribution of the perturbed Hamiltonian is given by:

$$p'(x) = \frac{e^{-H(x) - \Delta H(x)}}{Z'}$$

The loss function of the network will look something like:

$$L(\theta) = \mathbb{E}_{x \sim p}\left[\ell(x, \theta)\right]$$

Where $\theta$ are the network's parameters, and $\ell(x, \theta)$ is the per-sample loss function, which will depend on what kind of model we're training. Now suppose we'd like to perturb the Hamiltonian. We'll assume that we have an explicit formula for $\Delta H(x)$. Then the loss can be easily modified as follows:

$$L'(\theta) = \frac{\mathbb{E}_{x \sim p}\left[e^{-\Delta H(x)}\,\ell(x, \theta)\right]}{\mathbb{E}_{x \sim p}\left[e^{-\Delta H(x)}\right]}$$

If the perturbation is too large, then the exponential causes the loss to be dominated by a few outliers, which is bad. But if the perturbation *isn't* too large, then we can perturb the empirical distribution by a small amount in a desired direction.
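
As a minimal sketch of this reweighting (function and variable names are mine, assuming the per-sample weights $e^{-\Delta H}$ are normalized by their total within the batch):

```python
import numpy as np

def perturbed_batch_loss(per_sample_losses, delta_H):
    """Reweight each sample's loss by exp(-delta_H(x)) so that training
    targets the perturbed distribution; normalizing by the total weight
    keeps the loss scale comparable to the unperturbed case."""
    per_sample_losses = np.asarray(per_sample_losses, dtype=float)
    weights = np.exp(-np.asarray(delta_H, dtype=float))
    return float(np.sum(weights * per_sample_losses) / np.sum(weights))
```

With $\Delta H = 0$ everywhere this reduces to the ordinary mean loss; a negative $\Delta H$ on some samples upweights them.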

One other thing to consider is that the exponential will generally increase variance in the magnitude of the gradient. To partially deal with this, we can define an adjusted batch size as:

$$B_{\mathrm{adj}} = \sum_{i \in \mathrm{batch}} e^{-\Delta H(x_i)}$$

Then by varying the actual number of samples we put into a batch, we can try to maintain a more or less constant adjusted batch size. One way to do this is to define an error variable, `err = 0`. At each step, we add a constant `B_avg` to the error. Then we add samples to the batch until adding one more sample would cause the adjusted batch size to exceed `err`. Subtract the adjusted batch size from `err`, train on the batch, and repeat. The error carries over from one step to the next, and so the adjusted batch sizes should average to `B_avg`.
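
A sketch of that batching scheme (names are mine, not from the post; batches are returned as lists of sample indices):

```python
import math

def adjusted_batches(delta_H, B_avg):
    """Split a stream of samples (given by their delta-H values) into
    batches whose adjusted size sum(exp(-delta_H)) averages B_avg.
    The leftover budget `err` carries over between steps."""
    err = 0.0
    i, n = 0, len(delta_H)
    batches = []
    while i < n:
        err += B_avg
        batch, adj = [], 0.0
        while i < n:
            w = math.exp(-delta_H[i])
            if adj + w > err:
                break  # one more sample would exceed the budget
            batch.append(i)
            adj += w
            i += 1
        err -= adj
        if batch:
            batches.append(batch)
    return batches
```

For example, with all weights equal to 1 and `B_avg = 2`, this just produces ordinary batches of two samples, while samples with weight 2 each end up in batches of one.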


I don't think we should consider the centroid important in describing the LLM's "ontology". In my view, the centroid just points in the direction of highest density of words in the LLM's space of concepts. Let me explain:

The reason that embeddings are spread out is to allow the model to distinguish between words. So intuitively, tokens with largeish dot product between them correspond to similar words. Distinguishability of tokens is a limited resource, so the training process should generally result in a distribution of tokens that uses this resource in an efficient way to encode the information needed to predict text. Consider a language with 100 words for snow. Probably these all end up with similar token vectors, with large dot products between them. Exactly *which* word for snow someone writes is probably not too important for predicting text. So the training process makes those tokens relatively less distinguishable from each other. But the fact that there are 100 tokens all pointing in a similar direction means that the centroid gets shifted in that direction.

Probably you can see where this is going now. The centroid gets shifted in directions where there are many tokens that the network considers to be all similar in meaning, directions where human language has allocated a lot of words, while the network considers the differences in shades of meaning between these words to be relatively minor.
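
A toy illustration of the mechanism (the embedding geometry here is entirely made up): 100 near-duplicate "snow" tokens pull the centroid toward their shared direction, while 100 well-spread tokens contribute almost nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 hypothetical "snow" tokens: similar vectors clustered around one direction
snow = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((100, 2))

# 100 other tokens spread evenly around the unit circle
angles = rng.uniform(0.0, 2.0 * np.pi, 100)
others = np.stack([np.cos(angles), np.sin(angles)], axis=1)

embeddings = np.concatenate([snow, others])
centroid = embeddings.mean(axis=0)
# The centroid is pulled strongly toward the snow cluster's direction,
# even though each individual snow token is no more "important" than any other.
```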

Mathematically, convergence just means that the distance to some limit point goes to 0 in the limit. There's no implication that the limit point has to be unique, or optimal. E.g. in the case of Newton fractals, there are multiple roots and the trajectory converges to one of the roots, but which one it converges to depends on the starting point of the trajectory. Once the weight updates become small enough, we should say the net has converged, regardless of whether it achieved the "optimal" loss or not.

If even "converged" is not good enough, I'm not sure what one could say instead. Probably the real problem in such cases is people being doofuses, and probably they will continue being doofuses no matter what word we force them to use.

On the actual object level for the word "optimal", people already usually say "converged" for that meaning and I think that's a good choice.


Relatedly, you bring up adversarial examples in a way that suggests that you think of them as defects of a primitive optimization paradigm, but it turns out that adversarial examples often correspond to predictively useful features that the network is actively using for classification, despite those features not being robust to pixel-level perturbations that humans don't notice—which I guess you could characterize as "weird squiggles" from our perspective, but the etiology of the squiggles presents a much more optimistic story about fixing the problem with adversarial training than if you thought "squiggles" were an inevitable consequence of using conventional ML techniques.

Train two distinct classifier neural-nets on an image dataset. Set aside one as the "reference net". The other net will be the "target net". Now perturb the images so that they look the same to humans, and also get classified the same by the reference net. So presumably both the features humans use to classify, and the squiggly features that neural nets use should be mostly unchanged. Under these constraints on the perturbation, I bet that it will still be possible to perturb images to produce adversarial examples for the target net.

Literally. I will bet money that I can still produce adversarial examples under such constraints if anyone wants to take me up on it.
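
Not the bet itself, but a linear toy version of why such constrained perturbations should exist (two made-up linear "nets"; real classifiers are nonlinear, so this is only intuition): as long as the target net uses a feature direction not fully shared with the reference net, we can perturb along the component of the target's weights orthogonal to the reference's weights, flipping the target's decision while leaving the reference's score exactly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
w_ref = rng.standard_normal(32)  # hypothetical linear "reference net"
w_tgt = rng.standard_normal(32)  # hypothetical linear "target net"
x = rng.standard_normal(32)      # an input to perturb

# Component of w_tgt orthogonal to w_ref: moving along it changes the
# target net's score but leaves the reference net's score fixed.
d = w_tgt - (w_tgt @ w_ref) / (w_ref @ w_ref) * w_ref

s = w_tgt @ x
# Step just far enough along -sign(s) * d to flip the target net's decision.
x_adv = x - np.sign(s) * (abs(s) + 1.0) / (d @ d) * d
```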

Linkpost for: https://pbement.com/posts/endpoint_penalty.html

When training a Wasserstein GAN, there is a very important constraint that the discriminator network must be a Lipschitz-continuous function. Roughly we can think of this as saying that the output of the function can't change too fast with respect to position, and this change must be bounded by some constant $K$. If the discriminator function is given by $f(x)$, then we can write the Lipschitz condition for the discriminator as:

$$|f(x) - f(y)| \leq K \|x - y\|$$

Usually this is implemented as a gradient penalty. People will take a gradient (higher order, since the loss already has a gradient in it) of this loss (for $K = 1$):

$$L_{\mathrm{GP}} = \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} f(\hat{x})\| - 1\right)^2\right]$$

In this expression, $\hat{x}$ is sampled as $\hat{x} = t x + (1 - t) y$ with $t \sim U(0, 1)$, a random mixture of a real data point $x$ and a generated data point $y$.

But this is complicated to implement, involving a higher order gradient. It turns out we can also just impose the Lipschitz condition directly, via the following penalty:

$$L = \mathbb{E}_{x, y}\left[\left(\frac{|f(x) - f(y)|}{\|x - y\|} - K\right)^2\right]$$

Except to prevent issues where we're maybe sometimes dividing by zero, we throw in an $\epsilon$, and a reweighting factor of $\|x - y\|$ (not sure if that is fully necessary, but the intuition is that making sure the Lipschitz condition is enforced for points at large separation is the most important thing):

$$L = \mathbb{E}_{x, y}\left[\|x - y\|\left(\frac{|f(x) - f(y)|}{\|x - y\| + \epsilon} - K\right)^2\right]$$

For the overall loss, we compare all pairwise distances between real data, generated data, and a random mixture of them. Probably it improves things to add one or two more random mixtures in, but I'm not sure and haven't tried it.
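
A minimal NumPy sketch of such an endpoint penalty (the exact form in the post may differ, and the names here are mine):

```python
import numpy as np

def endpoint_penalty(f_x, f_y, x, y, K=1.0, eps=1e-8):
    """Penalize deviations of |f(x) - f(y)| / ||x - y|| from the Lipschitz
    constant K, reweighted by ||x - y|| so that widely separated pairs
    (where enforcing the condition matters most) dominate."""
    dist = np.linalg.norm(x - y, axis=-1)
    ratio = np.abs(f_x - f_y) / (dist + eps)
    return float(np.mean(dist * (ratio - K) ** 2))
```

In an actual WGAN you would evaluate this on all the real/generated/mixed pairs in the batch and add it to the discriminator loss; a discriminator with slope exactly $K$ between the endpoints incurs (nearly) zero penalty.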

In any case, this seems to work decently well (tried on MNIST), so it might be a simpler alternative to gradient penalty. I also used instance noise, which, as pointed out here, is amazingly good for preventing mode collapse and just generally makes training easier. So yeah, instance noise is great and you should use it. And if you really don't want to figure out how to do higher order gradients in PyTorch for your WGAN, you've still got options.

Yes. I think Beff was speaking imprecisely there. In order to be consistent with what he's written elsewhere, he should have said something like: "maximizing the rate of free energy dissipation".


C: You heard it, e/acc isn't about maximizing entropy [no shit?!]

B: No, it's about maximizing the free energy

C: So e/acc should want to collapse the false vacuum?

Holy mother of bad faith. Rationalists/lesswrongers have a problem with saying obviously false things, and this is one of those.

It's in line with what seems like Connor's debate strategy - make your opponent define their views and their terminal goal in words, and then pick apart that goal by pushing it to the maximum. Embarrassing.

I agree with you that Connor performed very poorly in this debate. But this one is actually fair game. If you look at Beff's writings about "thermodynamic god" and these kinds of things, he talks a lot about how these ideas are supported by physics and the Crooks fluctuation theorem. Normally in a debate if someone says they value X, you interpret that as "I value X, but other things can also be valuable and there might be edge cases where X is bad and I'm reasonable and will make exceptions for those."

But physics doesn't have a concept of "reasonable". The ratio between the forward and backward probabilities in the Crooks fluctuation theorem is exponential in the amount of entropy produced. It's not exponential in the amount of entropy produced plus some correction terms to add in reasonable exceptions for edge cases. Given how much Beff has emphasized that his ideas originated in physics, I think it's reasonable to take him at his word and assume that he really is talking about the thing in the exponent of the Crooks fluctuation theorem. And then the question of "so hey, it sure does look like collapsing the false vacuum would dissipate an absolutely huge amount of free energy" is a very reasonable one to ask.

If you care about the heat coming out on the hot side rather than the heat going in on the cold side (i.e. the application is a heat pump rather than a refrigerator), then the theoretical limit is always greater than 1, since the work done gets added onto the heat absorbed:

$$\mathrm{COP}_{\mathrm{heating}} = \frac{Q_h}{W} = \frac{Q_c + W}{W} = \frac{T_h}{T_h - T_c} > 1$$

Cooling performance can absolutely be less than 1, and often is for very cold temperatures.
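
The two ideal (Carnot) limits, for concreteness:

```python
def cop_heating(T_hot, T_cold):
    """Ideal heat-pump COP: Q_h / W = T_h / (T_h - T_c). Always > 1."""
    return T_hot / (T_hot - T_cold)

def cop_cooling(T_hot, T_cold):
    """Ideal refrigerator COP: Q_c / W = T_c / (T_h - T_c). Can be < 1."""
    return T_cold / (T_hot - T_cold)

# The two always differ by exactly 1, since Q_h = Q_c + W.
```

Trying to hold a cold side at 100 K against a 300 K environment, for instance, gives an ideal cooling COP of only 0.5.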

In this context, we're imitating some probability distribution, and the perturbation means we're slightly adjusting the probabilities, making some of them higher and some of them lower. The adjustment is small in a multiplicative sense not an additive sense, hence the use of exponentials. Just as a silly example, maybe I'm training on MNIST digits, but I want the 2's to make up 30% of the distribution rather than just 10%. The math described above would let me train a GAN that generates 2's 30% of the time.
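
Concretely for that MNIST example, the required weight $e^{-\Delta H}$ for each class is just the ratio of target to empirical frequency (hypothetical numbers):

```python
# Empirical vs. desired class frequencies (assumed for illustration).
source = {"two": 0.10, "other": 0.90}
target = {"two": 0.30, "other": 0.70}

# exp(-delta_H(x)) is proportional to target/source for x's class.
weights = {c: target[c] / source[c] for c in source}

# Reweighting the empirical distribution recovers the target distribution.
reweighted = {c: source[c] * weights[c] for c in source}
```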

I'm not sure what is meant by "the difference from a gradient in SGD", so I'd need more information to say whether it is different from a perturbation or not. But probably it's different: perturbations in the above sense are perturbations in the probability distribution over the training data.