AlexMennen

Dutch-Booking CDT: Revised Argument

I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract for some sufficiently small price if , even if is not the optimal action (let's say is the optimal action). When the time comes to take an action, the agent's best bet is (prime meaning sell the contract for price ). The way I described the set-up, the agent doesn't choose between and , because actions other than the top choice all happen with probability epsilon. The fact that the agent sells the contract back in its top choice isn't a Dutch book, because the case where the agent's top choice goes through is the case in which the contract is worthless, and the contract's value is derived from other cases.

We could modify the epsilon exploration assumption so that the agent also chooses between and even while its top choice is . That is, there's a lower bound on the probability with which the agent takes an action in , but even if that bound is achieved, the agent still has some flexibility in distributing probability between and . In this case, contrary to your argument, the agent will prefer rather than , i.e., it will not get Dutch booked. This is because the agent is still choosing as the only action with high probability, and refers to the expected consequence of the agent choosing as its intended action, so the agent cannot use when calculating which of or is better to pick as its next choice if its attempt to implement intended action fails.

Another source of uncertainty that the agent could have about its actions is if it believes it could gain information in the future, but before it has to make a decision, and this information could be relevant to which decision it makes. Say that and are the agent's expectations at time of the utility that taking action would cause it to get, and the utility it would get conditional on taking action , respectively. Suppose the bookie offers the deal at time , and the agent must act at time . If the possibility of gaining future knowledge is the only source of the agent's uncertainty about its own decisions, then at time , it knows what action it is taking, and is undefined on actions not taken. and should both be well-defined, but they could be different. The problem description should disambiguate between them. Suppose that every time you say and in the description of the contract, this means and , respectively. The agent purchases the contract, and then, when it comes time to act, it evaluates consequences by , not , so the argument for why the agent will inevitably resell the contract fails. If the appearing in the description of the contract instead means (since the agent doesn't know what that is yet, this means the contract references what the agent will believe in the future, rather than stating numerical payoffs), then the agent won't purchase it in the first place because it will know that the contract will only have value if seems to be suboptimal at time and it takes action anyway, which it knows won't happen, and hence the contract is worthless.

"Infra-Bayesianism with Vanessa Kosoy" – Watch/Discuss Party

The Nirvana trick seems like a cheap hack, and I'm curious if there's a way to see it as good reasoning.

One response to this was that predicting Nirvana in some circumstance is equivalent to predicting that there are no possible futures in that circumstance, which is a sensible thing to say as a prediction that that circumstance is impossible.

What's So Bad About Ad-Hoc Mathematical Definitions?

That's exactly what I was trying to say, not a disagreement with it. The only step where I claimed all reasonable ways of measuring spreadout-ness agree was on the result you get after summing up a large number of iid random variables, not the random variables that were being summed up.

What's So Bad About Ad-Hoc Mathematical Definitions?

The "or any other measure of spreadout-ness" can be dropped here

What I meant is that, if you restrict attention to normal distributions with a fixed mean, then any reasonable measure of how spread out it is (including any of the E[|x-mean|^p]) will be a sufficient statistic, because any such measure, in order to be reasonable, must increase as variance increases (for normal distributions), so this function can be inverted to recover the variance. In other words, any other such measure will indeed be isomorphic to variance when restricted to normal distributions.

The value of m minimizing E[|X-m|] should change if I decrease the minimum X-value a lot, while leaving everything else constant

This does not change the minimizer of E[|X-m|] because it increases E[|X-m|] by the same amount for every m>min(X).

In general, you can't decrease E[|X-m|] by moving m from median to median-d for d>0 because, for x ≥ median (half the distribution), you increase |X-m| by d, and for the other half, you decrease |X-m| by at most d.
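A quick numerical check of the two claims above may help. This is only a sketch on a made-up five-point sample (the data, the grid of candidate m values, and the helper names are all illustrative assumptions): it looks for the m minimizing the empirical E[|X-m|], then drags the smallest point far to the left and looks again.

```python
def mean_abs_dev(xs, m):
    """Empirical E[|X - m|]: average absolute distance from the points to m."""
    return sum(abs(x - m) for x in xs) / len(xs)

def best_m(xs, candidates):
    """Return the candidate m with the smallest mean absolute deviation."""
    return min(candidates, key=lambda m: mean_abs_dev(xs, m))

# Hypothetical sample with median 3, and a copy whose minimum is pushed to -100.
xs = [1, 2, 3, 10, 11]
xs_shifted = [-100, 2, 3, 10, 11]

# Candidate values of m from -5.0 to 15.0 in steps of 0.1.
grid = [i / 10 for i in range(-50, 151)]

before = best_m(xs, grid)
after = best_m(xs_shifted, grid)
print(before, after)

# For any m at or above the old minimum, the shift adds the same constant
# to E[|X - m|], so it cannot change which m is best.
for m in (1, 3, 5):
    print(m, mean_abs_dev(xs_shifted, m) - mean_abs_dev(xs, m))
```

On this sample the minimizer stays at the median in both cases, and the gap printed in the last loop is the same for each m, which is the "increases by the same amount for every m" point.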

What's So Bad About Ad-Hoc Mathematical Definitions?

Variance has more motivation than just that it's a measure of how spread out the distribution is. Variance has the property that if two random variables are independent, then the variance of their sum is the sum of their variances. By the central limit theorem, if you add up a sufficiently large number of independent and identically distributed random variables, the distribution you get is well-approximated by a distribution that depends only on mean and variance (or any other measure of spreadout-ness). Since it is the variance of the distributions you were adding together that determines this, variance is exactly the thing you care about if you want to know the degree of spreadout-ness of a sum of a large number of independent variables from the distribution. If you take any measure of how spread out a distribution is that doesn't carry the same information as the variance, then it will fail to predict how spread out the sum of a large number of independent copies of the distribution is, by any measure.
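To make the additivity property concrete, here is a small exact computation, offered as a sketch rather than anything from the original discussion. The fair coin and fair die are arbitrary example distributions, and the helper names are made up. It compares variance with the mean absolute deviation around the mean, which is one of the alternative spread measures under discussion.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    """Expected value of a finite distribution given as {value: probability}."""
    return sum(p * x for x, p in dist.items())

def variance(dist):
    mu = mean(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

def mean_abs_dev(dist):
    """E[|X - mean|], an alternative measure of spread."""
    mu = mean(dist)
    return sum(p * abs(x - mu) for x, p in dist.items())

def independent_sum(d1, d2):
    """Distribution of X + Y for independent X ~ d1 and Y ~ d2."""
    out = {}
    for (x, p), (y, q) in product(d1.items(), d2.items()):
        out[x + y] = out.get(x + y, 0) + p * q
    return out

# Illustrative distributions: a fair coin (0 or 1) and a fair six-sided die.
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
die = {k: Fraction(1, 6) for k in range(1, 7)}
total = independent_sum(coin, die)

print(variance(coin) + variance(die), variance(total))              # these agree
print(mean_abs_dev(coin) + mean_abs_dev(die), mean_abs_dev(total))  # these do not
```

Using `Fraction` keeps the arithmetic exact, so the agreement for variance is an identity here rather than a floating-point coincidence.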

Edit: On the subject of other possible measures of features of probability distributions, one could also make the same complaint about mean as a measure of the middle of a distribution, when there are possible alternatives like median. Again, a similar sort of argument can be used to identify mean as the best one in some circumstances. But if I were to define a measure of how spread out a distribution is as E[|X-m|] for some m, I would use m=median rather than m=mean. This is because m=median minimizes this expected absolute value (in fact, median can be defined this way), so this measures the minimal average distance every point in the distribution has to travel in order for them to all meet at one point (the median is the most efficient point for them to meet).

I'm still mystified by the Born rule

I was just thinking back to this, and it occurred to me that one possible reason to be unsatisfied with the arguments I presented here is that I started off with this notion of a crossing-over point as p continuously increases. But then when you asked "ok, but why is the crossing-over point 2?", I was like "uh, consider that it might be an integer, and then do a bunch of very discrete-looking arguments that end up showing there's something special about 2", which doesn't connect very well with the "crossover point when p continuously varies" picture. If indeed this seemed unsatisfying to you, then perhaps you'll like this more:

If we have a norm $\|\cdot\|$ on a vector space $V$, then it induces a norm on its dual space $V^*$, given by $\|\varphi\| := \max_{\|v\| = 1} |\varphi(v)|$. If a linear map preserves a norm, then its adjoint preserves the induced norm on the dual space.

Claim: The Lp norm on column vectors induces, as its dual, the Lq norm on row vectors, where p and q satisfy $\frac{1}{p} + \frac{1}{q} = 1$.

Thus if a matrix preserves Lp norm, then its adjoint preserves Lq norm. When p=2, we get that its adjoint preserves the same norm. This sort of gives you a natural way of seeing 2 as halfway between 1 and infinity, and giving, for every p, a corresponding q that is equally far away from the middle in the other direction, in the appropriate sense.

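Proof of claim: Given p and q such that $\frac{1}{p} + \frac{1}{q} = 1$, and a row vector $\varphi$ with Lq norm 1, let $x_i := |\varphi_i|^q$, so that $\sum_i x_i = 1$. Then let $v_i := \pm x_i^{1/p}$ (with the same sign as $\varphi_i$). The column vector $v$ has Lp norm 1. $\varphi v = \sum_i |\varphi_i| \, x_i^{1/p} = \sum_i x_i^{1/q} x_i^{1/p} = \sum_i x_i = 1$. This shows that the dual-Lp norm of $\varphi$ is at least 1. Standard constrained optimization techniques will verify that this maximizes $\varphi v$ subject to the constraint that $v$ has Lp norm 1, and thus that the dual-Lp norm of $\varphi$ is exactly 1.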
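The construction in the proof can be checked numerically. The following is a minimal sketch, assuming p = 3 (so q = 3/2) and an arbitrary example row vector. The names `phi`, `v`, and the helper functions are illustrative. The random search at the end is only a sanity check that no unit vector does better; it is not a substitute for the constrained optimization argument.

```python
import random

def lp_norm(xs, p):
    """The Lp norm (sum of |x|^p) ** (1/p)."""
    return sum(abs(x) ** p for x in xs) ** (1 / p)

def maximizer(phi, p, q):
    """Column vector from the proof: v_i = sign(phi_i) * (|phi_i|^q)^(1/p)."""
    return [(1 if f >= 0 else -1) * (abs(f) ** q) ** (1 / p) for f in phi]

p = 3.0
q = p / (p - 1)  # conjugate exponent, so that 1/p + 1/q = 1

# Arbitrary row vector, rescaled to have Lq norm 1.
raw = [2.0, -1.0, 0.5]
scale = lp_norm(raw, q)
phi = [x / scale for x in raw]

v = maximizer(phi, p, q)
pairing = sum(f * x for f, x in zip(phi, v))
print(lp_norm(v, p), pairing)  # both should be 1 up to rounding

# Sanity check: random vectors with Lp norm 1 never pair with phi to more than 1.
random.seed(0)
best_random = 0.0
for _ in range(2000):
    w = [random.uniform(-1, 1) for _ in phi]
    n = lp_norm(w, p)
    w = [x / n for x in w]
    best_random = max(best_random, abs(sum(f * x for f, x in zip(phi, w))))
print(best_random)
```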

Corollary: If a matrix preserves Lp norm for any p ≠ 2, then it is a permutation matrix (up to flipping the signs of some of its entries).

Proof: Let q be such that $\frac{1}{p} + \frac{1}{q} = 1$. The columns of the matrix each have Lp norm 1, so the whole matrix has Lp norm $n^{1/p}$ (since the entries from each of the n columns contribute 1 to the sum of p-th powers). By the same reasoning about its adjoint, the matrix has Lq norm $n^{1/q}$. Assume wlog p<q. Lq norm is ≤ Lp norm for q>p, with equality only on scalar multiples of basis vectors. So if any column of the matrix isn't a basis vector (up to sign), then its Lq norm is less than 1; meanwhile, all the columns have Lq norm at most 1, so this would mean that the Lq norm of the whole matrix is strictly less than $n^{1/q}$, contradicting the argument about its adjoint.
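A two-dimensional example shows what the corollary rules out. This sketch compares a signed permutation matrix with a 45-degree rotation. Both matrices and the test vector are arbitrary illustrative choices.

```python
import math

def lp_norm(xs, p):
    """The Lp norm (sum of |x|^p) ** (1/p)."""
    return sum(abs(x) ** p for x in xs) ** (1 / p)

def apply(matrix, v):
    """Matrix-vector product, with the matrix given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]

# A permutation matrix with one sign flipped.
signed_perm = [[0, -1],
               [1, 0]]

# Rotation by 45 degrees: orthogonal, but not a signed permutation.
c = math.cos(math.pi / 4)
s = math.sin(math.pi / 4)
rotation = [[c, -s],
            [s, c]]

v = [3.0, 1.0]
for p in (1, 2, 3):
    print(p,
          lp_norm(v, p),
          lp_norm(apply(signed_perm, v), p),
          lp_norm(apply(rotation, v), p))
```

The signed permutation leaves every Lp norm of `v` unchanged, while the rotation only preserves the p = 2 row.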

I'm still mystified by the Born rule

Also, I'm curious what you think the connection is between the "L2 is connected to bilinear forms" and "L2 is the only Lp metric invariant under nontrivial change of basis", if it's easy to state.

This was what I was trying to vaguely gesture towards with the derivation of the "transpose = inverse" characterization of L2-preserving matrices; the idea was that the argument was a natural sort of thing to try, so if it works to get us a characterization of the Lp-preserving matrices for exactly one value of p, then that's probably the one that has a different space of Lp-preserving matrices than the rest. But perhaps this is too sketchy and mysterian. Let's try a dimension-counting argument.

Linear transformations and bilinear forms can both be represented with matrices. Linear transformations act on the space of bilinear forms by applying the linear transformation to both inputs before plugging them into the bilinear form. If the matrix $M$ represents a linear transformation and the matrix $A$ represents a bilinear form, then the matrix representing the bilinear form you get from this action is $M^T A M$. But whatever, the point is, so far we have an $n^2$-dimensional group acting on an $n^2$-dimensional space. But quadratic forms (like the square of the L2 norm) can be represented by *symmetric* matrices, the space of which is $\frac{n(n+1)}{2}$-dimensional, and if $A$ is symmetric, then so is $M^T A M$. So now we have an $n^2$-dimensional group acting on a $\frac{n(n+1)}{2}$-dimensional space, so the stabilizer of any given element must be at least $n^2 - \frac{n(n+1)}{2} = \frac{n(n-1)}{2}$ dimensional. As it turns out, this is exactly the dimensionality of the space of orthogonal matrices, but the important thing is that this is nonzero, which explains why the space of orthogonal matrices must not be discrete.

Now let's see what happens if we try to adapt this argument to Lp and p-linear forms for some p ≠ 2.

With p=1, a linear transformation preserving a linear functional corresponds to a matrix $M$ preserving a row vector $\varphi$ in the sense that $\varphi M = \varphi$. You can do a dimension-counting argument and find that there are tons of these matrices for any given row vector, but it doesn't do you any good because 1 isn't even, so preserving the linear functional doesn't mean you preserve L1 norm.

Let's try p=4, then. A 4-linear form can be represented by an $n \times n \times n \times n$ hypermatrix, the space of which is $n^4$-dimensional. Again, we can restrict attention to the symmetric ones, which are preserved by the action of linear maps. But the space of symmetric $n \times n \times n \times n$ hypermatrices is $\binom{n+3}{4}$-dimensional, still much more than $n^2$. This means that our linear maps can use up all of their degrees of freedom moving a symmetric 4-linear form around to different 4-linear forms without even getting close to filling up the whole space, and never get forced to spend their surplus degrees of freedom on linear maps that stabilize a 4-linear form, so it doesn't give us linear maps stabilizing L4 norm.
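The dimension counts for the bilinear and 4-linear cases can be tabulated side by side. This is a sketch using the standard count of symmetric k-linear forms on an n-dimensional space, which is the binomial coefficient C(n+k-1, k). The function names are made up for illustration.

```python
from math import comb

def symmetric_forms_dim(n, k):
    """Dimension of the space of symmetric k-linear forms on an n-dimensional space."""
    return comb(n + k - 1, k)

def stabilizer_dim_lower_bound(n, k):
    """Dimension of the n x n matrix group minus the dimension of the space it
    acts on, floored at 0: the bound that naive dimension counting gives."""
    return max(0, n * n - symmetric_forms_dim(n, k))

for n in range(2, 7):
    print(n,
          stabilizer_dim_lower_bound(n, 2),  # bilinear case: n(n-1)/2
          stabilizer_dim_lower_bound(n, 4))  # 4-linear case: no surplus
```

For k = 2 the bound matches n(n-1)/2. For k = 4 it is 0 for every n ≥ 2, so the counting argument gives no guaranteed continuous family of stabilizing maps.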

I'm still mystified by the Born rule

A related thing that's special about the L2 norm is that there's a bilinear form $B$ such that |v| carries the same information as $B(v, v)$.

"Ok, so what? Can't you do the same thing with any integer n, with an n-linear form?" you might reasonably ask. First of all, not quite, it only works for the even integers, because otherwise you need to use absolute value*, which isn't linear.

But the bilinear forms really are the special ones, roughly speaking because they are a similar type of object to linear transformations. By currying, a bilinear form on V is a linear map $B: V \to V^*$, where $V^*$ is the space of linear maps $V \to \mathbb{R}$. Now the condition of a linear transformation preserving a bilinear form can just be written in terms of chaining linear maps together. A linear map $f: V \to V$ has an adjoint $f^*: V^* \to V^*$ given by $f^*(\varphi) := \varphi \circ f$ for $\varphi \in V^*$, and a linear map $f$ preserves a bilinear form $B$ iff $f^* \circ B \circ f = B$. When using coordinates in an orthonormal basis, the bilinear form is represented by the identity matrix, so if $f$ is represented by the matrix $M$, this becomes $M^T M = I$, which is where the usual definition of an orthogonal matrix comes from. For quadrilinear forms etc, you can't really do anything like this. So it's L2 for which you get a way of characterizing "norm-preserving" in a nice clean linear-algebraic-in-character way, so it makes sense that that would be the one to have a different space of norm-preserving maps than the others.

I also subtly brushed past something that makes L2 a particularly special norm, although I guess it's not clear if it helps. A nondegenerate bilinear form is the same thing as an isomorphism between $V$ and $V^*$. If $B(v, v)$ is always positive, then taking its square root gives you a norm, and that norm is L2 (though it may be disguised if you weren't using an orthonormal basis); and if it isn't always positive, then you don't get a norm out of it at all. So L2 is unique among all possible norms in that it induces and comes from an identification between your vector space and its dual.

*This assumes your vector space is over $\mathbb{R}$ for simplicity. If it's over $\mathbb{C}$, then you can't get multilinearity no matter what you do, and the way this argument has to go is that you can get close enough by taking the complex conjugate of exactly half of the inputs, and then you get multilinearity from there. Speaking of $\mathbb{C}$, this reminds me that I was inappropriately assuming your vector space was over $\mathbb{R}$ in my previous comment. Over $\mathbb{C}$, you can multiply basis vectors by any scalar of absolute value 1, not just +1 and -1. This is broader than the norm-preserving changes of basis you can do over $\mathbb{R}$ to exactly the extent explicable by the fact that you're sneaking in a little bit of L2 via the definition of the absolute value of a complex number.

I'm still mystified by the Born rule

is the L2 norm preferred b/c it's the only norm that's invariant under orthonormal change of basis, or is the whole idea of orthonormality somehow baking in the fact that we're going to square and sqrt everything in sight (and if so how)

The L2 norm is the only Lp norm that can be preserved by *any* non-trivial change of basis (the trivial ones: permuting basis elements and multiplying some of them by -1). This follows from the fact that, for p ≠ 2, the basis elements and their negatives can be identified just from the Lp norm and the addition and scalar multiplication operations of the vector space. To intuitively gesture at why this is so, let's look at L1 and L∞.

In L1, the norm of the sum of two vectors is the sum of their norms iff for each coordinate, both vectors have components of the same sign; otherwise, they cancel in some coordinate, and the norm of the sum is smaller than the sum of the norms. 0 counts as the same sign as everything, so the more zeros a vector has in its coordinates, the more other vectors it will have the maximum possible norm of sum with. The basis vectors and their negations are thus distinguished as those unit vectors u for which the set {v : |u+v| = |u|+|v|} is maximal. Since the alternative to |u+v| = |u|+|v| is |u+v| < |u|+|v|, the basis vectors can be thought of as having maximal tendency for their sums with other vectors to have large norm.
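As a rough illustration of the L1 picture, one can sample random vectors v and count how often |u+v| = |u|+|v| holds for a basis vector versus some other unit vector. This is only a Monte Carlo sketch in two dimensions. The sampling box, the seed, the tolerance, and the comparison vector (1/2, 1/2) are all arbitrary assumptions.

```python
import random

def l1(v):
    """The L1 norm: sum of absolute values of the coordinates."""
    return sum(abs(x) for x in v)

def additive_fraction(u, samples):
    """Fraction of sampled v for which |u+v| = |u| + |v| in L1 (up to rounding)."""
    hits = 0
    for v in samples:
        total = [a + b for a, b in zip(u, v)]
        if abs(l1(total) - (l1(u) + l1(v))) < 1e-12:
            hits += 1
    return hits / len(samples)

random.seed(1)
samples = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(4000)]

basis_fraction = additive_fraction([1.0, 0.0], samples)     # a basis vector
diagonal_fraction = additive_fraction([0.5, 0.5], samples)  # L1-unit, not a basis vector
print(basis_fraction, diagonal_fraction)
```

The basis vector only constrains the sign of one coordinate of v, so roughly half of the samples qualify. The diagonal vector constrains both signs, so roughly a quarter do.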

In L∞, on the other hand, as long as you're keeping the largest coordinate fixed, changing the other coordinates costs nothing in terms of the norm of the vector, but making those other coordinates larger still creates more opportunities to change the norm of other vectors when you add them together. So if you're looking for a unit vector u that minimizes {v : |u+v| ≠ |v|}, u is a basis vector or the negation of one. The basis vectors have minimal tendency for their sums with other vectors to have large norm.

As p increases, the tendency for basis vectors to have large sums with other vectors decreases (as compared to the tendency for arbitrary vectors to have large sums with other vectors). There must be a cross-over point where whether or not a vector is a basis vector ceases to be predictive of the norm of its sum with an arbitrary other vector, and we lose the ability to figure out which vectors are basis vectors only at that point, which is p=2.

So if you're trying to guess what sort of norm some vector space naturally carries (let's say you're given, as a hint, that it's an Lp norm for some p), L2 should start out as a pretty salient option, along with, and arguably ahead of, L1 and L∞. As soon as you hear anything about there being multiple different bases that seem to have equal footing (as is saliently the case in QM), that settles it: L2 is the only option.

This sort of thing seems to suggest that EY's claims in this post about the scale of the relative intelligence differences between chimps, a village idiot, and Einstein are incorrect. The difference in intelligence between village idiot and Einstein may be comparable to the difference in intelligence between some nonhuman animals and a human village idiot. Which is a priori surprising, given that human brains are very structurally similar to each other in comparison to nonhuman animal brains.