
Oops, my code had a bug, so only self.attention(self.attention_norm(x), start_pos, freqs_cis, mask), and not self.feed_forward(self.ffn_norm(h)), was included in the SVD. So the diagram isn't 100% accurate.

Finally gonna start properly experimenting on stuff. Just writing up what I'm doing to force myself to do something, not claiming this is necessarily particularly important.

Llama (and many other models, but I'm doing experiments on Llama) has a piece of code that looks like this:

    h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
    out = h + self.feed_forward(self.ffn_norm(h))

Here, out is the result of the transformer layer (aka the residual stream), and the vectors self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and self.feed_forward(self.ffn_norm(h)) are where essentially all the computation happens. So the transformer proceeds as a series of "writes" to the residual stream using these two vectors.
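
This write structure can be sketched in a few lines of numpy (the norms and the attention arguments are omitted, and `attn` and `ffn` here are arbitrary stand-ins for the real sublayers, not the actual Llama code):

```python
import numpy as np

def layer(x, attn, ffn):
    """Toy transformer layer showing the two residual-stream writes."""
    w1 = attn(x)       # first "write" (attention output)
    h = x + w1
    w2 = ffn(h)        # second "write" (feed-forward output)
    out = h + w2
    return out, w1, w2

x = np.ones(4)
out, w1, w2 = layer(x, lambda v: 0.1 * v, lambda v: -0.2 * v)

# The layer output is exactly the input plus the two writes.
assert np.allclose(out, x + w1 + w2)
```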

I took all the residual vectors for some queries to Llama-8b and stacked them into a big matrix M with 4096 columns (the internal hidden dimensionality of the model). Then using SVD, I can express M = Σᵢ sᵢuᵢvᵢᵀ, where the uᵢ's and vᵢ's are independent unit vectors. This basically decomposes the "writes" into some independent locations in the residual stream (u's), some latent directions that are written to (v's), and the strength of those writes (s's, aka the singular values).
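
A minimal numpy sketch of this setup, using random data in place of the actual hooked-out write vectors and a much smaller hidden dimension than Llama-8b's 4096:

```python
import numpy as np

# Stand-in for the collected write vectors: in the real experiment each
# row would be one attention or feed-forward output hooked out of a
# Llama layer; here we just use random data with illustrative shapes.
hidden_dim = 64    # Llama-8b uses 4096
n_writes = 200     # one row per "write" to the residual stream

rng = np.random.default_rng(0)
M = rng.standard_normal((n_writes, hidden_dim))

# SVD factors M into write locations (columns of U), latent directions
# written to (rows of Vt), and write strengths (s, the singular values).
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Sanity checks: the factorization reconstructs M, and the singular
# values come out in descending order.
assert np.allclose(U @ np.diag(s) @ Vt, M)
assert np.all(np.diff(s) <= 0)
```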

To get a feel for the complexity of the writes, I then plotted the s's in descending order. For the prompt "I believe the meaning of life is", Llama generated the continuation "to be happy. It is a simple concept, but it is very difficult to achieve. The only way to achieve it is to follow your heart. If you follow your heart, you will find happiness. If you don’t follow your heart, you will never find happiness. I believe that the meaning of life is to". During this continuation, there were 2272 writes to the residual stream, and the singular values for these writes were as follows:

The first diagram shows that there were 2 directions that were much larger than all the others. The second diagram shows that most of the singular values are nonnegligible, which indicates to me that almost all of the writes transfer nontrivial information. This can also be seen in the last diagram, where the cumulative size of the singular values increases approximately logarithmically with their count.

This is kind of unfortunate, because if almost all of the singular value mass was concentrated in a relatively small number of dimensions (e.g. 100), then we could simplify the network a lot by projecting down to those dimensions. Still, this was relatively expected, because others have found the singular value spectra of neural networks to be very complex.
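
As a sketch of what this check looks like, here is the cumulative-mass computation on a made-up spectrum with the qualitative shape described above (two dominant directions plus a slowly decaying tail; the numbers are illustrative, not the actual Llama-8b measurements):

```python
import numpy as np

# Toy spectrum: 2 large singular values plus a heavy, slowly decaying
# tail, 2272 values in total as in the prompt above.
s = np.concatenate([[100.0, 80.0], 10.0 / np.sqrt(np.arange(1, 2271))])

# Fraction of the total singular-value mass captured by the top-k values.
cumulative = np.cumsum(s) / s.sum()
top_100 = cumulative[99]
print(f"top-100 captures {top_100:.1%} of the mass")

# If this fraction were close to 1, projecting the writes down to the
# top-100 singular directions would be nearly lossless.
```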

Since variance explained is likely nonlinearly related to quality, my next step will likely be to clip the writes to the first k singular vectors and see how that impacts the performance of the network.
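
Assuming "clipping the writes to the first k singular vectors" means projecting each write onto the span of the top-k right singular directions (the v's), the planned intervention could be sketched as:

```python
import numpy as np

# Random stand-in for the stacked write vectors (rows = writes).
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 64))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 16
# Orthogonal projector onto the span of the top-k right singular vectors.
P = Vt[:k].T @ Vt[:k]

def clip_write(w):
    """Project a single residual-stream write onto the top-k subspace."""
    return w @ P

# Clipping every write at once is just the rank-k truncated SVD of M.
M_clipped = M @ P
assert np.allclose(M_clipped, (U[:, :k] * s[:k]) @ Vt[:k])
```

In a real run, clip_write would be applied inside the forward pass to each attention and feed-forward output before it is added to the residual stream, and the model's loss or generations compared against the unclipped baseline.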

Second, Lomborg’s 2024 paper finds evidence that women time their births to just before a wage growth plateau. The evidence it gives again comes from IVF failures. Women who were planning to have a birth, but never succeeded, have much flatter wage growth after their planned birth year, even though they didn’t actually have any kids. So the divergence between childrearing mothers and non-childbearing mothers shows up even in this placebo case when neither group actually had kids. Therefore, the event study is overstating the earnings impact of childbirth.

Couldn't it also be that the women in question plan their careers based on the expectation of having children, and that this is what leads to the plateau? In that case it seems like it would be incorrect to interpret these results as evidence against a child penalty, as it's merely that the child penalty affects women regardless of whether they actually have the children. To check, I think you should ask the study participants why their careers plateaued.

Yeah I mean, I'm not claiming it has a big sense of obligation, only that it illustrates a condition where discourse seems to benefit from a sense of obligation.

Here's an example of a cheap question I just asked on twitter. Maybe Richard Hanania will find it cheap to answer too, but part of the reason I asked it was because I expect him to find it difficult to answer.

If he can't answer it, he will lose some status. That's probably good - if his position in the OP is genuine and well-informed, he should be able to answer it. The question is sort of "calling his bluff", checking that his implicitly promised reason actually exists.

Actually, we'll reschedule to make it for the meetup.

I didn't make it last time because my wife was coming home from a conference, and I probably can't make it next time because of a vacation in Iceland, but I will most likely come the time after that.

I don't have much experience with freedom of information requests, but I feel that when questions in online debates are hard to answer, it's often because they implicitly highlight problems with the positions that have been put forward. For all I know, it could work similarly with freedom of information requests.

Ok, so this sounds like it talks about cardinality in the sense of 1 or 3, rather than in the sense of 2. I guess I default to 2 because it's more intuitive due to the transfer property, but maybe 1 or 3 are more desirable due to being mathematically richer.

Also, the remark that hyperfinite can mean smaller than a nonstandard natural just seems false; where did you get that idea from?

When I look up the definition of hyperfinite, it's usually defined as being in bijection with the hypernaturals up to a (sometimes either standard or nonstandard, but given the context of your OP I assumed you mean only nonstandard) natural N. If the set is in bijection with the numbers up to N, then it would seem to have cardinality less than N[1].

  1. ^

    Obviously this doesn't hold for transfinite sizes, but we're merely considering hyperfinite sizes, so it should hold there.
