Since it's slightly hard to find: Paul references it here (ctrl+f for "interior") and links to this source (once again ctrl+f for "interior"). Paul also refers to it in this post. The term is actually "position of the interior" and apparently comes from military strategist Carl von Clausewitz.

Reply

3

Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish)

Vivek Hebbar3mo62

Can you clarify what figure 1 and figure 2 are showing?

I took the text description before figure 1 to mean {score on column after finetuning on 200 from row then 10 from column} - {score on column after finetuning on 10 from column}. But then the text right after says "Babbage fine-tuned on addition gets 27% accuracy on the multiplication dataset" which seems like a different thing.

Reply

Survey for alignment researchers!

Vivek Hebbar3moΩ112

Note: The survey took me 20 mins (but also note selection effects on leaving this comment)

Reply

What’s up with LLMs representing XORs of arbitrary features?

Vivek Hebbar4moΩ110

Here's a fun thing I noticed:

There are 16 boolean functions of two variables. Now consider an embedding that maps each of the four pairs {(A=true, B=true), (A=true, B=false), ...} to a point in 2d space. For any such embedding, at most 14 of the 16 functions will be representable with a linear decision boundary.

For the "default" embedding (x=A, y=B), xor and its complement are the two excluded functions. If we rearrange the points such that xor is linearly represented, we always lose some other function (and its complement). In fact, there are 7 meaningfully distinct colinearity-free embeddings, each of which excludes a different pair of functions.^[1]

I wonder how this situation scales for higher dimensions and variable counts. It would also make sense to consider sparse features (which allow superposition to get good average performance).

^{^}
The one unexcludable pair is ("always true", "always false").
These are the seven embeddings:

Reply

What’s up with LLMs representing XORs of arbitrary features?

Vivek Hebbar4moΩ110

Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn't precisely the right well-behavedness desideratum, but I retract the "directionally wrong".

Reply

What’s up with LLMs representing XORs of arbitrary features?

Vivek Hebbar4moΩ230

The variance of the multivariate uniform distribution is largest along the direction $x_{1} + x_{2}$ , which is exactly the direction which we would want to represent a AND b.

The variance is actually the same in all directions. One can sanity-check by integration that the variance is 1/12 both along the axis and along the diagonal.

In fact, there's nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are independent and have equal variance.^[1]

The diagram in the post showing that "and" is linearly represented works if the features are represented discretely (so that there are exactly 4 points for 2 binary features, instead of a distribution for each combination). As soon as you start defining features with thresholds like DanielVarga did, the argument stops going through in general, and the claim can become false.

The stuff about unimodality doesn't seem relevant to me, and in fact seems directionally wrong.

^{^}
I have a not-fully-verbalized proof which I don't have time to write out

Reply

What’s up with LLMs representing XORs of arbitrary features?

Vivek Hebbar4moΩ110

Maybe models track which features are basic and enforce that these features be more salient

Couldn't it just write derivative features more weakly, and therefore not need any tracking mechanism other than the magnitude itself?

Reply

Vivek Hebbar's Shortform

Vivek Hebbar7mo81

It's sad that agentfoundations.org links no longer work, leading to broken links in many decision theory posts (e.g. here and here)

Reply

Note-Taking without Hidden Messages

Vivek Hebbar7mo30

This will initially boost relative to $¯ S$ because it will suddenly be joined to a network with is correctly transmitting $¯ H$ but which does not understand $¯ S$ at all.
However, as these networks are trained to equilibrium the advantage will disappear as a steganographic protocol is agreed between the two models. Also, this can only be used once before the networks are in equilibrium.

Why would it be desirable to do this end-to-end training at all, rather than simply sticking the two networks together and doing no further training? Also, can you clarify what the last sentence means?
(I have guesses, but I'd rather just know what you meant)

Reply