AlexMennen


Comments

I don't think this one works. In order for the channel capacity to be finite, there must be some maximum number of bits N you can send. Even if you don't observe the type of the channel, you can communicate a number n from 0 to N by sending n 1s and N-n 0s. But then even if you do observe the type of the channel (say, it strips the 0s), the receiver will still just see some number of 1s that is from 0 to N, so you have actually gained zero channel capacity. There's no bonus for not making full use of the channel; in johnswentworth's formulation of the problem, there's no such thing as some messages being cheaper to transmit through the channel than others.
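A minimal toy model of the argument (the encoding scheme and the zero-stripping channel are illustrative assumptions, not anything from the original problem statement): with at most N symbols you can already encode any n in {0, ..., N} as n ones followed by N−n zeros, and a channel that deletes the zeros still delivers the count of ones, so learning the channel's type adds nothing.

```python
N = 8

def encode(n, N):
    """Encode n in {0..N} as n ones followed by N - n zeros."""
    return [1] * n + [0] * (N - n)

def strip_zeros(bits):
    """The channel whose type you might observe: it deletes every 0."""
    return [b for b in bits if b == 1]

def decode(received):
    """The receiver just counts the ones."""
    return sum(received)

# Every message survives the zero-stripping channel intact, so observing
# the channel type gains the sender no extra capacity:
for n in range(N + 1):
    assert decode(strip_zeros(encode(n, N))) == n
```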

We "just" need to update the three geometric averages on this background knowledge. Plausibly how this should be done in this case is to normalize them such that they add to one.

My problem with a forecast aggregation method that relies on renormalizing to meet some coherence constraints is that then the probabilities you get depend on what other questions get asked. It doesn't make sense for a forecast aggregation method to give probability 32.5% to A if the experts are only asked about A, but have that probability predictably increase if the experts are also asked about B and C. (Before you try thinking of a reason that the experts' disagreement about B and C is somehow evidence for A, note that no matter what each of the experts believe, if your forecasting method is mean log odds, but renormalized to make probabilities sum to 1 when you ask about all 3 outcomes, then the aggregated probability assigned to A can only go up when you also ask about B and C, never down. So any such defense would violate conservation of expected evidence.)
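A numerical sketch of this effect (the expert probabilities are made up for illustration): two coherent experts over mutually exclusive, exhaustive outcomes A, B, C; mean log odds per outcome sums to less than 1, so renormalizing can only push each aggregated probability up.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def mean_log_odds(ps):
    """Geometric mean of odds, i.e. arithmetic mean of log odds."""
    return sigmoid(sum(map(logit, ps)) / len(ps))

# Two experts, each coherent over A, B, C (illustrative numbers):
expert1 = {"A": 0.2, "B": 0.3, "C": 0.5}
expert2 = {"A": 0.5, "B": 0.3, "C": 0.2}

raw = {q: mean_log_odds([expert1[q], expert2[q]]) for q in "ABC"}
total = sum(raw.values())               # < 1 here
renormalized = {q: raw[q] / total for q in "ABC"}

# Asking about B and C predictably raised the aggregate for A:
assert renormalized["A"] > raw["A"]
```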

(In the case of the arithmetic mean, updating on the background information plausibly wouldn't change anything here, but that's not the case for other possible background information.)

Any linear constraints (which are the things you get from knowing that certain Boolean combinations of questions are contradictions or tautologies) that are satisfied by each predictor will also be satisfied by their arithmetic mean.
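A one-line check of this (probabilities are illustrative): each predictor satisfies the linear constraint P(A) + P(B) + P(C) = 1, and so does any convex combination of them, including the arithmetic mean.

```python
# Two predictors' distributions over mutually exclusive, exhaustive A, B, C
# (made-up numbers); each satisfies P(A) + P(B) + P(C) = 1:
p1 = [0.2, 0.3, 0.5]
p2 = [0.6, 0.1, 0.3]

# The arithmetic mean satisfies the same linear constraint automatically:
mean = [(a + b) / 2 for a, b in zip(p1, p2)]
assert abs(sum(mean) - 1) < 1e-12
```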

But it is anyway a more general question (than the question of whether the geometric mean of the odds is better or the arithmetic mean of the probabilities): how should we "average" two or more probability distributions (rather than just two probabilities), assuming they come from equally reliable sources?

That's part of my point. Arithmetic mean of probabilities gives you a way of averaging probability distributions, as well as individual probabilities. Geometric mean of log odds does not.

If we assume that the prior was indeed important here then this makes sense, but if we assume that the prior was irrelevant (that they would have arrived at 25% even if their prior was e.g. 10% rather than 50%), then this doesn't make sense. (Maybe they first assumed the probability of drawing a black ball from an urn was 50%, then they each independently created a large sample, and ~25% of the balls came out black. In this case the prior was mostly irrelevant.) We would need a more general description under which circumstances the prior is indeed important in your sense and justifies the multiplicative evidence aggregation you proposed.

In this example, the sources of evidence they're using are not independent; they can expect ahead of time that each of them will observe the same relative frequency of black balls from the urn, even while not knowing in advance what that relative frequency will be. The circumstances under which the multiplicative evidence aggregation method is appropriate are exactly the circumstances in which the evidence actually is independent.
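A small worked check of the independent case (likelihoods are hypothetical numbers): when the experts' observations are independent given the hypothesis, multiplying each expert's posterior odds ratio (posterior odds divided by the shared prior odds) onto the prior reproduces the correct joint Bayesian update.

```python
prior = 0.5

# Each expert's evidence, as (P(obs | H), P(obs | not-H)) -- made-up values:
like1 = (0.8, 0.4)   # likelihood ratio 2
like2 = (0.9, 0.3)   # likelihood ratio 3

def posterior(prior, likes):
    """Bayes update on a list of conditionally independent observations."""
    odds = prior / (1 - prior)
    for lh, lnh in likes:
        odds *= lh / lnh
    return odds / (1 + odds)

p1 = posterior(prior, [like1])          # expert 1 alone: 2/3
p2 = posterior(prior, [like2])          # expert 2 alone: 3/4

# Multiplicative aggregation: prior odds times each expert's update factor.
prior_odds = prior / (1 - prior)
agg_odds = prior_odds * (p1 / (1 - p1) / prior_odds) * (p2 / (1 - p2) / prior_odds)
joint_agg = agg_odds / (1 + agg_odds)

# Matches the direct joint update on both observations:
joint_direct = posterior(prior, [like1, like2])
assert abs(joint_direct - joint_agg) < 1e-12
```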

But in the second case I don't see how a noisy process for a probability estimate would lead to being "forced to set odds that you'd have to take bets on either side of, even someone who knows nothing about the subject could exploit you on average".

They make their bet direction and size functions of the odds you offer them in such a way that they bet more when you offer better odds. If you give the correct odds, then the bet ends up resolving neutrally on average, but if you give incorrect odds, then which direction you are off in correlates with how big a bet they make in such a way that you lose on average either way.
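A Monte Carlo sketch of this (all parameters are made up): the true probability is p, but your posted odds imply q = p + unbiased noise. A bettor who knows nothing about p buys (c − q) units of a contract paying 1, priced at q, for any fixed c; their expected profit per bet is E[(c − q)(p − q)] = E[(p − q)²] > 0 whenever your noise has variance.

```python
import random

random.seed(0)

p = 0.4      # true probability (unknown to the bettor)
c = 0.5      # any fixed constant the bettor picks

profits = []
for _ in range(200_000):
    # Your noisy, unbiased probability estimate, clipped to (0, 1):
    q = min(max(p + random.gauss(0, 0.1), 0.01), 0.99)
    stake = c - q                        # buy if q looks low, sell if high
    outcome = 1 if random.random() < p else 0
    profits.append(stake * (outcome - q))

mean_profit = sum(profits) / len(profits)
print(mean_profit)  # positive: the noisy odds-setter loses on average
```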

Oh, derp. You're right.

I think the way I would rule out my counterexample is by strengthening A3 to if  and  then there is ...

Answer by AlexMennen, Apr 15, 2023

Q2: No. Counterexample: Suppose there's one outcome ω such that all lotteries are equally good, except for the lottery that puts probability 1 on ω, which is worse than the others.

I'm not sure why you don't like calling this "redundancy". A meaning of redundant is "able to be omitted without loss of meaning or function" (Lexico). So ablation redundancy is the normal kind of redundancy, where you can remove something without losing the meaning. Here it's not redundant: you can remove a single direction and lose all the (linear) "meaning".

Suppose your datapoints are (x, y) (where the coordinates x and y are independent draws from the standard normal distribution), and the feature you're trying to measure is x + y. A rank-1 linear probe will retain some information about the feature. Say your linear probe finds the x coordinate. This gives you information about x + y; your expected value for this feature is now x, an improvement over its a priori expected value of 0. If you ablate along this direction, all you're left with is the y coordinate, which tells you exactly as much about the feature x + y as the x coordinate does, so this rank-1 ablation causes no loss in performance. But information is still lost when you lose the x coordinate, namely the contribution of x to the feature. The thing that you can still find after ablating away the x direction is not redundant with the rank-1 linear probe in the x direction you started with, but just contributes the same amount towards the feature you're measuring.
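A numpy sketch of this example (assuming coordinates x and y are i.i.d. standard normal and the target feature is x + y): ablating the x direction costs no probe accuracy, since y alone explains exactly as much variance of x + y as x alone does, even though half the information about the feature really is gone.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
feature = x + y

def best_linear_r2(inputs, target):
    """R^2 of the least-squares linear fit of target on inputs."""
    X = np.column_stack([inputs, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return 1 - resid.var() / target.var()

r2_x = best_linear_r2(x[:, None], feature)                   # ~0.5
r2_y_after_ablating_x = best_linear_r2(y[:, None], feature)  # also ~0.5
r2_both = best_linear_r2(np.column_stack([x, y]), feature)   # ~1.0
print(r2_x, r2_y_after_ablating_x, r2_both)
```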

The point is, the reason why CCS fails to remove linearly available information is not because the data "is too hard". Rather, it's because the feature is non-linear in a regular way, which makes CCS and Logistic Regression suck at finding the direction which contains all linearly available data (which exists in the context of "truth", just as it does in the context of gender and all the datasets on which RLACE has been tried).

Disagree. The reason CCS doesn't remove information is neither of those, but instead just that that's not what it's trained to do. It doesn't fail, but rather never makes any attempt. If you're trying to train a function that satisfies CCS's consistency and confidence conditions, then a function that ignores some of the linearly available information achieves optimal loss just as well as one that uses all of it does.

What you're calling ablation redundancy is a measure of nonlinearity of the feature being measured, not any form of redundancy, and the view you quote doesn't make sense as stated, as nonlinearity, rather than redundancy, would be necessary for its conclusion. If you're trying to recover some feature f, and there's any vector v and scalar c such that f(x) = v·x + c for all data x (regardless of whether there are multiple such v, which would happen if the data is contained in a proper affine subspace), then there is a direction such that projection along it makes it impossible for a linear probe to get any information about the value of f. That direction is Σv, where Σ is the covariance matrix of the data. This works because if w is orthogonal to Σv, then the random variables w·x and v·x are uncorrelated (since Cov(w·x, v·x) = wᵀΣv = 0), and thus w·x is uncorrelated with f(x).
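A numerical check of this claim on made-up correlated Gaussian data: with a linear feature f(x) = v·x + c, projecting the data along Σv (Σ the data covariance) leaves every linear function of what remains uncorrelated with f.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 200_000, 5
A = rng.standard_normal((d, d))
X = rng.standard_normal((n, d)) @ A.T        # correlated, non-isotropic data
v = rng.standard_normal(d)
f = X @ v + 1.0                              # a linear feature v.x + c

Sigma = np.cov(X, rowvar=False)
u = Sigma @ v
u /= np.linalg.norm(u)

# Project out the Sigma @ v direction:
X_ablated = X - np.outer(X @ u, u)

# Any linear probe on the ablated data, e.g. a random one, is uncorrelated
# with the feature:
w = rng.standard_normal(d)
corr = np.corrcoef(X_ablated @ w, f)[0, 1]
print(corr)  # ~0 up to floating-point error
```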

If the data is normally distributed, then we can make this stronger. If there's a vector v and a function g such that f(x) = g(v·x) (for example, if you're using a linear probe to get a binary classifier, where it classifies things based on whether the value of a linear function is above some threshold), then projecting along Σv removes all information about f(x). This is because uncorrelated linear features of a multivariate normal distribution are independent, so if w is orthogonal to Σv, then w·x is independent of v·x, and thus also of f(x). So the reason what you're calling high ablation redundancy is rare is that low ablation redundancy is a consequence of the existence of any linear probe that gets good performance and the data not being too wildly non-Gaussian.
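A sketch of the Gaussian case on made-up data: the label is a nonlinear function of a single linear feature, label = 1[v·x > 0]. Because uncorrelated linear features of Gaussian data are independent, ablating Σv makes even this nonlinear label unrecoverable, e.g. a least-squares linear probe drops to chance accuracy.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 200_000, 5
A = rng.standard_normal((d, d))
X = rng.standard_normal((n, d)) @ A.T       # correlated Gaussian data
v = rng.standard_normal(d)
label = (X @ v > 0).astype(float)           # nonlinear function of v.x

def linear_probe_accuracy(data, label):
    """Accuracy of a least-squares linear probe thresholded at 0.5."""
    Xb = np.column_stack([data, np.ones(len(label))])
    coef, *_ = np.linalg.lstsq(Xb, label, rcond=None)
    return ((Xb @ coef > 0.5) == (label > 0.5)).mean()

acc_before = linear_probe_accuracy(X, label)          # well above chance

Sigma = np.cov(X, rowvar=False)
u = Sigma @ v
u /= np.linalg.norm(u)
X_ablated = X - np.outer(X @ u, u)                    # project out Sigma @ v
acc_after = linear_probe_accuracy(X_ablated, label)   # ~0.5, i.e. chance
print(acc_before, acc_after)
```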

Ablating along the difference of the means makes both CCS & Supervised learning fail, i.e. reduce their accuracy to random guessing. Therefore:

  • The fact that Recursive CCS finds many good directions is not due to some “intrinsic redundancy” of the data. There exists a single direction which contains all linearly available information.
  • The fact that Recursive CCS finds strictly more than one good direction means that CCS is not efficient at locating all information related to truth: it is not able to find a direction which contains as much information as the direction found by taking the difference of the means. Note: Logistic Regression seems to be about as leaky as CCS. See INLP, which is like Recursive CCS, but with Logistic Regression.

I don't think that's a fair characterization of what you found. Suppose, for example, that you're given a vector in Rⁿ whose ith component is x + εᵢ (where x is a random variable with high variance, and the εᵢ are i.i.d. with mean 0 and tiny variance). There is a direction which contains all the information about x contained in the vector, namely the average of the coordinates. Subtracting out the mean of the coordinates from each coordinate will remove all information about x. But the data is plenty redundant; there are n orthogonal directions each of which contain almost all of the available information about x, so a probe trained to recover x that learns to just copy one of the coordinates will be pretty efficient at recovering x. If the εᵢ have variance 0 (i.e. are just constants always equal to 0), then there are n orthogonal directions each of which contain all information about x, and a probe that copies one of them is perfectly efficient at extracting all information about x.
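The same construction numerically (variances chosen for illustration): every single coordinate is a nearly perfect probe for x, i.e. the data is highly redundant, yet subtracting the mean of the coordinates removes essentially all information about x in one ablation.

```python
import numpy as np

rng = np.random.default_rng(3)

m, n = 100_000, 10
x = rng.standard_normal(m) * 10.0            # high-variance signal
eps = rng.standard_normal((m, n)) * 0.01     # tiny i.i.d. noise
data = x[:, None] + eps                      # ith coordinate is x + eps_i

# Each single coordinate is an almost-perfect probe for x:
corr_single = np.corrcoef(data[:, 0], x)[0, 1]          # ~1.0

# Ablating the all-ones direction (subtracting the coordinate mean) removes
# essentially all information about x from every remaining direction:
ablated = data - data.mean(axis=1, keepdims=True)
corr_after = max(abs(np.corrcoef(ablated[:, i], x)[0, 1]) for i in range(n))
print(corr_single, corr_after)
```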

If you can find multiple orthogonal linear probes that each get good performance at recovering some feature, then something like this must be happening.

My point wasn't that the equation didn't hold perfectly, but that the discrepancies are very suspicious. Two of the three discrepancies were off by exactly 1 order of magnitude, making me fairly confident that they are the result of a typo. (Not sure what's going on with the other discrepancy.)

In the table of parameters, compute, and tokens, compute/(parameters*tokens) is always 6, except in one case where it's 0.6, one case where it's 60, and one case where it's 2.75. Are you sure this is right?
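The sanity check being applied, as a snippet (the rows are hypothetical stand-ins, since the original table isn't reproduced here): training compute is commonly estimated as ~6 × parameters × tokens, so compute / (parameters × tokens) should be ~6 in every row, and rows where it's off by exactly a factor of 10 look like typos.

```python
rows = [
    # (parameters, tokens, compute) -- made-up illustrative values
    (1e9,  2e10, 1.2e20),   # ratio 6: consistent with compute = 6*N*D
    (1e10, 2e11, 1.2e21),   # ratio 0.6: looks like a factor-of-10 typo
]

for params, tokens, compute in rows:
    ratio = compute / (params * tokens)
    flag = "" if abs(ratio - 6) / 6 < 0.1 else "  <-- suspicious"
    print(f"ratio = {ratio:g}{flag}")
```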
