## LESSWRONGLW

AlexMennen

I'm still mystified by the Born rule

Also, I'm curious what you think the connection is between the "L2 is connected to bilinear forms" and "L2 is the only Lp metric invariant under nontrivial change of basis", if it's easy to state.

This was what I was trying to vaguely gesture towards with the derivation of the "transpose = inverse" characterization of L2-preserving matrices; the idea was that the argument was a natural sort of thing to try, so if it works to get us a characterization of the Lp-preserving matrices for exactly one value of p, then that's probably the one that has a different space of Lp-preserving matrices than the rest. But perhaps this is too sketchy and mysterian. Let's try a dimension-counting argument.

Linear transformations $V \to V$ and bilinear forms $V \times V \to \mathbb{R}$ can both be represented with $n \times n$ matrices. Linear transformations act on the space of bilinear forms by applying the linear transformation to both inputs before plugging them into the bilinear form. If the matrix $A$ represents a linear transformation and the matrix $B$ represents a bilinear form, then the matrix representing the bilinear form you get from this action is $A^T B A$. But whatever, the point is, so far we have an $n^2$-dimensional group acting on an $n^2$-dimensional space. But quadratic forms (like the square of the L2 norm) can be represented by symmetric $n \times n$ matrices, the space of which is $\binom{n+1}{2}$-dimensional, and if $B$ is symmetric, then so is $A^T B A$. So now we have an $n^2$-dimensional group acting on a $\binom{n+1}{2}$-dimensional space, so the stabilizer of any given element must be at least $n^2 - \binom{n+1}{2} = \binom{n}{2}$ dimensional. As it turns out, this is exactly the dimensionality of the space of orthogonal matrices, but the important thing is that this is nonzero, which explains why the space of orthogonal matrices must not be discrete.
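The count in this paragraph can be checked mechanically. What follows is a minimal sketch, assuming a real vector space of dimension n; the function names are made up for illustration and are not part of the original argument.

```python
from math import comb

def group_dim(n: int) -> int:
    """Dimension of the group of invertible n x n real matrices."""
    return n * n

def symmetric_form_dim(n: int) -> int:
    """Dimension of the space of symmetric n x n matrices (quadratic forms)."""
    return comb(n + 1, 2)

def stabilizer_dim_lower_bound(n: int) -> int:
    """Group dimension minus space dimension bounds the stabilizer dimension from below."""
    return group_dim(n) - symmetric_form_dim(n)

# Compare the lower bound with comb(n, 2), the dimension of the orthogonal group.
for n in range(1, 6):
    print(n, stabilizer_dim_lower_bound(n), comb(n, 2))
```

The two printed columns agree for every n, which is the "as it turns out" remark: the lower bound from dimension counting is already the dimension of the orthogonal group.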

Now let's see what happens if we try to adapt this argument to Lp and p-linear forms for some p≠2.

With p=1, a linear transformation preserving a linear functional corresponds to a matrix $A$ preserving a row vector $\varphi$ in the sense that $\varphi A = \varphi$. You can do a dimension-counting argument and find that there are tons of these matrices for any given row vector, but it doesn't do you any good because 1 isn't even, so preserving the linear functional doesn't mean you preserve L1 norm.

Let's try p=4, then. A 4-linear form $V^4 \to \mathbb{R}$ can be represented by an $n \times n \times n \times n$ hypermatrix, the space of which is $n^4$-dimensional. Again, we can restrict attention to the symmetric ones, which are preserved by the action of linear maps. But the space of symmetric $n \times n \times n \times n$ hypermatrices is $\binom{n+3}{4}$-dimensional, still much more than $n^2$. This means that our linear maps can use up all of their degrees of freedom moving a symmetric 4-linear form around to different 4-linear forms without even getting close to filling up the whole space, and never get forced to spend surplus degrees of freedom on linear maps that stabilize a 4-linear form, so the argument doesn't give us linear maps stabilizing L4 norm.
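To see how the p=2 and p=4 counts diverge, here is a small sketch. It assumes the standard count of symmetric p-linear forms on an n-dimensional space, comb(n + p - 1, p); the helper names are hypothetical.

```python
from math import comb

def symmetric_p_form_dim(n: int, p: int) -> int:
    """Dimension of the space of symmetric p-linear forms on an n-dimensional space."""
    return comb(n + p - 1, p)

def surplus(n: int, p: int) -> int:
    """Group dimension n^2 minus the dimension of the space being acted on."""
    return n * n - symmetric_p_form_dim(n, p)

# Positive surplus forces a positive-dimensional stabilizer; negative surplus forces nothing.
for n in range(2, 6):
    print(n, surplus(n, 2), surplus(n, 4))
```

For p=2 the surplus is positive whenever n is at least 2, while for p=4 it is negative. A negative surplus only means this counting argument is silent; it does not by itself show that the stabilizer is discrete.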

I'm still mystified by the Born rule

A related thing that's special about the L2 norm is that there's a bilinear form $\langle \cdot, \cdot \rangle$ such that |v| carries the same information as $\langle v, v \rangle$.

"Ok, so what? Can't you do the same thing with any integer n, with an n-linear form?" you might reasonably ask. First of all, not quite, it only works for the even integers, because otherwise you need to use absolute value*, which isn't linear.

But the bilinear forms really are the special ones, roughly speaking because they are a similar type of object to linear transformations. By currying, a bilinear form on $V$ is a linear map $V \to V^*$, where $V^*$ is the space of linear maps $V \to \mathbb{R}$. Now the condition of a linear transformation preserving a bilinear form can just be written in terms of chaining linear maps together. A linear map $f: V \to V$ has an adjoint $f^*: V^* \to V^*$ given by $f^*(\varphi) = \varphi \circ f$ for $\varphi \in V^*$, and a linear map $f$ preserves a bilinear form $B: V \to V^*$ iff $f^* \circ B \circ f = B$. When using coordinates in an orthonormal basis, the bilinear form is represented by the identity matrix, so if $f$ is represented by the matrix $A$, this becomes $A^T A = I$, which is where the usual definition $A^T = A^{-1}$ of an orthogonal matrix comes from. For quadrilinear forms etc, you can't really do anything like this. So it's L2 for which you get a way of characterizing "norm-preserving" in a nice clean linear-algebraic-in-character way, so it makes sense that that would be the one to have a different space of norm-preserving maps than the others.
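As a concrete check of the "transpose = inverse" condition, here is a minimal sketch using a 2D rotation. The angle 0.7 and the test vectors are arbitrary choices made for illustration, and the helper functions are hypothetical.

```python
import math

def rotation(theta):
    """2 x 2 rotation matrix as a list of rows."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def apply(m, v):
    """Matrix-vector product."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def dot(u, v):
    """Standard bilinear form, represented by the identity matrix."""
    return sum(x * y for x, y in zip(u, v))

A = rotation(0.7)
AtA = matmul(transpose(A), A)        # should be the identity up to rounding
u, v = [1.0, 2.0], [-3.0, 0.5]
before = dot(u, v)
after = dot(apply(A, u), apply(A, v))  # form evaluated after transforming both inputs
print(AtA, before, after)
```

Up to floating-point rounding, `AtA` is the identity and `before` equals `after`, which is the statement that the rotation preserves the bilinear form and hence the L2 norm.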

I also subtly brushed past something that makes L2 a particularly special norm, although I guess it's not clear if it helps. A nondegenerate bilinear form is the same thing as an isomorphism between $V$ and $V^*$. If $\langle v, v \rangle$ is always positive, then taking its square root gives you a norm, and that norm is L2 (though it may be disguised if you weren't using an orthonormal basis); and if it isn't always positive, then you don't get a norm out of it at all. So L2 is unique among all possible norms in that it induces and comes from an identification between your vector space and its dual.

*This assumes your vector space is over $\mathbb{R}$ for simplicity. If it's over $\mathbb{C}$, then you can't get multilinearity no matter what you do, and the way this argument has to go is that you can get close enough by taking the complex conjugate of exactly half of the inputs, and then you get multilinearity from there. Speaking of $\mathbb{C}$, this reminds me that I was inappropriately assuming your vector space was over $\mathbb{R}$ in my previous comment. Over $\mathbb{C}$, you can multiply basis vectors by any scalar of absolute value 1, not just +1 and -1. This is broader than the norm-preserving changes of basis you can do over $\mathbb{R}$ to exactly the extent explicable by the fact that you're sneaking in a little bit of L2 via the definition of the absolute value of a complex number.

I'm still mystified by the Born rule

is the L2 norm preferred b/c it's the only norm that's invariant under orthonormal change of basis, or is the whole idea of orthonormality somehow baking in the fact that we're going to square and sqrt everything in sight (and if so how)

The L2 norm is the only Lp norm that can be preserved by any non-trivial change of basis (the trivial ones: permuting basis elements and multiplying some of them by -1). This follows from the fact that, for p≠2, the basis elements and their negatives can be identified just from the Lp norm and the addition and scalar multiplication operations of the vector space. To intuitively gesture at why this is so, let's look at L1 and L∞.

In L1, the norm of the sum of two vectors is the sum of their norms iff for each coordinate, both vectors have components of the same sign; otherwise, they cancel in some coordinate, and the norm of the sum is smaller than the sum of the norms. 0 counts as the same sign as everything, so the more zeros a vector has in its coordinates, the more other vectors it will have the maximum possible norm of sum with. The basis vectors and their negations are thus distinguished as those unit vectors u for which the set {v : |u+v| = |u|+|v|} is maximal. Since the alternative to |u+v| = |u|+|v| is |u+v| < |u|+|v|, the basis vectors can be thought of as having maximal tendency for their sums with other vectors to have large norm.
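The L1 claim can be probed on a small example. The sketch below is only a finite proxy for the set {v : |u+v| = |u|+|v|}: it counts partners v on the grid {-1, 0, 1}^3, an assumption made purely for illustration, and uses exact fractions to avoid rounding. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def l1(v):
    """L1 norm: sum of absolute values of the coordinates."""
    return sum(abs(x) for x in v)

def norms_add(u, v):
    """True when |u+v| = |u| + |v| in the L1 norm."""
    return l1([a + b for a, b in zip(u, v)]) == l1(u) + l1(v)

def count_additive_partners(u, grid=(-1, 0, 1)):
    """Count grid vectors v for which the L1 norms of u and v add."""
    return sum(norms_add(u, v) for v in product(grid, repeat=len(u)))

# Three L1-unit vectors with one, two, and three nonzero coordinates.
basis = (Fraction(1), Fraction(0), Fraction(0))
two = (Fraction(1, 2), Fraction(1, 2), Fraction(0))
three = (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3))

print(count_additive_partners(basis),
      count_additive_partners(two),
      count_additive_partners(three))
```

The basis vector has the most partners on this grid, and each additional nonzero coordinate cuts the count, matching the observation that zeros count as the same sign as everything.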

In L∞, on the other hand, as long as you're keeping the largest coordinate fixed, changing the other coordinates costs nothing in terms of the norm of the vector, but making those other coordinates larger still creates more opportunities to change the norm of other vectors when you add them together. So if you're looking for a unit vector u that minimizes {v : |u+v| ≠ |v|}, u is a basis vector or the negation of one. The basis vectors have minimal tendency for their sums with other vectors to have large norm.

As p increases, the tendency for basis vectors to have large sums with other vectors decreases (as compared to the tendency for arbitrary vectors to have large sums with other vectors). There must be a cross-over point where whether or not a vector is a basis vector ceases to be predictive of the norm of its sum with an arbitrary other vector, and we lose the ability to figure out which vectors are basis vectors only at that point, which is p=2.
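A quick numerical illustration of the cross-over at p=2, offered as a sketch rather than a proof: rotate a basis vector in the plane by 45 degrees (an arbitrary choice of non-trivial change of basis) and compare Lp norms before and after, alongside a signed coordinate swap. The function names are hypothetical.

```python
import math

def lp_norm(v, p):
    """Lp norm of a vector; p may be math.inf for the max norm."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def rotate(v, theta):
    """Rotate a 2D vector by theta radians."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def signed_swap(v):
    """A 'trivial' change of basis: swap coordinates and negate one."""
    x, y = v
    return (-y, x)

e1 = (1.0, 0.0)
r = rotate(e1, math.pi / 4)
for p in (1, 2, 4, math.inf):
    print(p, lp_norm(e1, p), lp_norm(r, p), lp_norm(signed_swap(e1), p))
```

The rotated vector has norm above 1 for p=1, exactly 1 (up to rounding) for p=2, and below 1 for p=4 and p=∞, while the signed swap leaves every Lp norm unchanged.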

So if you're trying to guess what sort of norm some vector space naturally carries (let's say you're given, as a hint, that it's an Lp norm for some p), L2 should start out as a pretty salient option, along with, and arguably ahead of, L1 and L∞. As soon as you hear anything about there being multiple different bases that seem to have equal footing (as is saliently the case in QM), that settles it: L2 is the only option.

I'm still mystified by the Born rule

I disagree that using the latter to generate a sensory stream from a quantum state yields reasonable predictions -- eg, taken literally I think you're still zeroing out all but a measure-zero subset of the position basis

The observation you got from your sample is information. Information is entropy, and entropy is locally finite. So I don't think it's possible for the states consistent with the observation you got from your sample to have measure zero.

Utility Maximization = Description Length Minimization

I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

Superintelligence via whole brain emulation

All that is indeed possible, but not guaranteed. The reason I was speculating that better brain imaging wouldn't be especially useful for machine learning in the absence of better neuron models is that I'd assume that the optimization pressure that went into the architecture of brains was fairly heavily tailored to the specific behavior of the neurons that those brains are made of, and wouldn't be especially useful relative to other neural network design techniques that humans come up with when used with artificial neurons that behave quite differently. But sure, I shouldn't be too confident of this. In particular, the idea of training ML systems to imitate brain activation patterns, rather than copying brain architecture directly, is a possible way around this that I hadn't considered.

Superintelligence via whole brain emulation

No. Scanning everything and then waiting until we have a good enough neuron model might work fine; it's just that the scan wouldn't give you a brain emulation until your neuron model is good enough.

An overview of 11 proposals for building safe advanced AI

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

Intuitive Lagrangian Mechanics

I'm confused about the motivation for in terms of time dilation in general relativity. I was under the impression that general relativity doesn't even have a notion of gravitational potential, so I'm not sure what this would mean. And in Newtonian physics, potential energy is only defined up to an added constant. For to represent any sort of ratio (including proper time/coordinate time), V would have to be well-defined, not just up to an arbitrary added constant.

I also had trouble figuring out the relationship between the Euler-Lagrange equation and extremizing S. The Euler-Lagrange equation looks to me like just a kind of funny way of stating Newton's second law of motion, and I don't see why it should be equivalent to extremizing action. Perhaps this would be obvious if I knew some calculus of variations?

Relaxed adversarial training for inner alignment

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results in an end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.