I think that’s a good explanation. I agree that the solution to Akrasia I describe is kind of hacked together and is far from ideal. If you have a better solution to this I would be very interested and it would change my attitude to status significantly. I suspect that this is the largest inferential gap you would have to cross to get your point across to me, although as I mentioned I’m not sure how central I am as an example.
I’m not sure suffering is the correct frame here - I don’t really feel like Akrasia causes me to suffer. If I give in then I feel a bit disappointed with myself but the agent which wants me to be a better person isn’t very emotional (which I think is part of the problem). Again there may be an inferential gap here.
This is a beautifully succinct way of phrasing it. I still have enough deontologist in me to feel a little dirty every time I do it though!
Firstly, I'm with you on your model of status and the availability of perceived opportunity for additional status in a hyper-connected world is really interesting.
Where I have a big disagreement is in the lesson to take from this. Your argument is that we should essentially try to turn off status as a motivator. I would suggest it would be wiser to try to better align status motivations with the things we actually value.
I struggle hugely with akrasia. If I didn't have some external motivation then I'd probably just lie in bed all day watching tv. I don't know if I'm unusually susceptible to this but my impression is that this is a fairly common problem, even if to a lesser extent in some.
One of my solutions to this is to deliberately do things for the sake of status - or rather, to look for opportunities where me getting more status aligns with me doing things which I think are good.
As an example, take karma on LessWrong. This isn't completely analogous to status but every time I get karma I feel a little (or sometimes big!) boost of self-worth. If writing on LessWrong is aligned with my values then this is a good thing. If you add in a cash prize from someone respected in the community then my status circuit is triggered significantly to motivate me to write an answer even if the actual size of the cash prize doesn't justify the amount of time put in!  I could try to fight against this and not allow status triggers but I don't think that would actually improve my self-actualisation.
In a non-LW context, if status in the eyes of my family is important, I won't just spend my time watching TV but will also spend time playing with my kids. I would play with my kids anyway, as I know it's the right thing to do and is fun, but on those occasions where TV is more appealing, listening to my status motivation can help me do the right thing while expending less will-power.
On a practical level, I'm not sure that trying to ban status motivations would work. As you point out, a status high is readily achievable elsewhere, so if opportunities for status were banned within one community then this would just subconsciously motivate me to look elsewhere.
 This isn't a complaint!
 I am aware that confessing to this in most places would be seen as a huge social faux pas, I'm hoping LW will be more understanding.
This is a review of my own post.
The first thing to say is that for the 2018 Review, Eli’s mathematicians post should take precedence, because he was the one who took up the challenge in the first place and inspired my post. I hope to find time to write a review of his post.
If people were interested (and Eli was ok with it) I would be happy to write a short summary of my findings to add as a footnote to Eli’s post if it was chosen for the review.
This was my first post on LessWrong and looking back at it I think it still holds up fairly well.
There are a couple of things I would change if I were doing it again:
This is really interesting, thanks, not something I'd thought of.
If the teacher (or whoever set the test) also has a spread of credence over the answers then a Bayesian update would compare the values of P(A), P(B|¬A) and P(C|¬A and ¬B)  between the students and teacher. This is my first thought about how I'd create a fair scoring rule for this.
 P(D|¬A and ¬B and ¬C) = 1 for all students and teachers so this is screened off by the other answers.
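The decomposition above can be sketched in a few lines. This is just an illustration of the conditional chain described in the parent comment (the function name and rounding are my own), showing why the last term is always 1:

```python
def conditional_chain(p):
    """Convert credences over (A, B, C, D) into the chain
    P(A), P(B|not A), P(C|not A, not B), P(D|not A, not B, not C)."""
    chain, remaining = [], 1.0
    for prob in p:
        # once no probability mass is left, the next answer is certain
        chain.append(prob / remaining if remaining > 0 else 1.0)
        remaining -= prob
    return chain

# For a 40:20:20:20 student the chain is 0.4, 1/3, 0.5, 1 -
# the final term is always 1, which is why it is screened off.
print([round(x, 4) for x in conditional_chain([0.4, 0.2, 0.2, 0.2])])
```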
The score for the 50:50:0:0 student is:
The score for the 40:20:20:20 student is:
I think the way you've done it is the Brier score, which is (1 - the score from the OP). Under the Brier score, a lower value is better.
I think all of this is also true of a scoring rule based on only the probability placed on the correct answer?
In the end you'd still expect to win but this takes longer (requires more questions) under a rule which includes probabilities on incorrect answers - it's just adding noise to the results.
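Assuming the OP's rule is 1 minus the multi-class Brier score (which is what the relation above implies - I haven't re-derived the OP's exact formula), the two students' scores can be computed directly:

```python
def op_score(probs, correct):
    """1 minus the multi-class Brier score, so higher is better
    (assumes the OP's rule is the quadratic one)."""
    brier = sum((p - (1 if i == correct else 0)) ** 2 for i, p in enumerate(probs))
    return 1 - brier

# With A (index 0) as the correct answer:
print(op_score([0.5, 0.5, 0.0, 0.0], 0))  # 50:50:0:0 student
print(op_score([0.4, 0.2, 0.2, 0.2], 0))  # 40:20:20:20 student
```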
Tl;dr: I don’t think that this post stands up to close scrutiny, although there may be unknown knowns anyway. This is partly due to a couple of things in the original paper which I think are a bit misleading for the purposes of analysing the markets.
The unknown knowns claim is based on 3 patterns in the data:
“The mean prediction market belief of replication is 63.4%, the survey mean was 60.6% and the final result was 61.9%. That’s impressive all around.”
“Every study that would replicate traded at a higher probability of success than every study that would fail to replicate.”
“None of the studies that failed to replicate came close to replicating, so there was a ‘clean cut’ in the underlying scientific reality.”
Taking these in reverse order:
I don’t think that there is as clear a distinction between successful and unsuccessful replications as stated in the OP:
"None of the studies that failed to replicate came close to replicating"
This assertion is based on a statement in the paper:
“Second, among the unsuccessful replications, there was essentially no evidence for the original finding. The average relative effect size was very close to zero for the eight findings that failed to replicate according to the statistical significance criterion.”
However this doesn’t necessarily support the claim of a dichotomy – the average being close to 0 doesn’t imply that all the results were close to 0, nor that every successful replication passed cleanly. If you ignore the colours, this graph from the paper suggests that the normalised effect sizes are more of a continuum than a clean cut (the central section, b, is the relevant chart).
Eyeballing that graph, there is 1 failed replication which nearly succeeded and 4 successful ones which could have failed. If the effect size had shifted by less than 1 S.D. (for some of them, less than 0.5 S.D.) then a success would have become a failure or vice-versa (although some might then have passed at stage 2).
Of the 5 replications noted above, the 1 which nearly passed was ranked last by market belief, the 4 which nearly failed were ranked 3, 4, 5 and 7. If any of these had gone the other way it would have ruined the beautiful monotonic result.
According to the planned procedure, the 1 study which nearly passed replication should have been counted as a pass, as it successfully replicated in stage 1 and should not have proceeded to stage 2, where the significance disappeared. I think it is right to count this as an overall failed replication, but for the sake of analysing the market it should be listed as a success.
Having said that, the pattern is still a very impressive result which I look into below.
The OP notes that there is a good match between the mean market belief of replication and the actual fraction of successful replications. To me this doesn’t really suggest much by way of whether the participants in the market were under-confident or not. If they were to suddenly become more confident then the mean market belief could easily move away from the result.
If the market is under-confident, it seems like one could buy options in all the markets trading above 0.5 and sell options in all the ones below and expect to make a profit. If I did this then I would buy options in 16/21 (76%) of markets and would actually increase the mean market belief away from the actual percentage of successful replications. By this metric becoming more confident would lower accuracy.
In a similar vein, I also don’t think Spearman coefficients can tell us much about over/under-confidence. Spearman coefficients are based on rank order so if every option on the market became less/more confident by the same amount, the Spearman coefficients wouldn’t change.
Notwithstanding the above, the graph in the OP still looks to me as though the market is under-confident. If I were to buy an option in every study with market belief >0.5 and sell in every study <0.5 I would still make a decent profit when the market resolved. However it is not clear whether this is a consistent pattern across similar markets.
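A minimal sketch of this buy-above/sell-below strategy (the beliefs and outcomes below are made up for illustration - the paper's actual data isn't reproduced here, and the simple pays-1-if-correct contract is an assumption):

```python
def strategy_profit(beliefs, replicated, threshold=0.5):
    """Buy a 'yes' contract at price p when belief > threshold, buy a
    'no' contract at price 1-p when belief < 1-threshold.
    Each contract pays 1 if its side is correct."""
    profit = 0.0
    for p, success in zip(beliefs, replicated):
        if p > threshold:
            profit += (1.0 if success else 0.0) - p
        elif p < 1 - threshold:
            profit += (0.0 if success else 1.0) - (1 - p)
    return profit

# Hypothetical market: two confident predictions that both resolve correctly
print(strategy_profit([0.8, 0.3], [True, False]))
```

Raising `threshold` to 0.6 reproduces the "don't trade in the 40-60% range" variant discussed below.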
Fortunately the paper also includes data on 2 other markets (success in stage 1 of the replication based on 2 different sets of participants) so it is possible to check whether these markets were similarly under-confident. 
If I performed the same action of buying and selling depending on market belief I would make a very small gain in one market and a small loss in the other. This does not suggest that there is a consistent pattern of under-confidence.
It is possible to check for calibration across the markets. I split the 63 market predictions (3 markets x 21 studies) into 4 groups depending on the level of market belief, 50-60%, 60-70%, 70-80% and 80-100% (any market beliefs with p<50% are converted to 1-p for grouping).
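The folding and grouping step can be sketched like this (the data shown is hypothetical, not the paper's; a prediction is counted as correct when its direction, above or below 50%, matches the outcome):

```python
from collections import defaultdict

def bucket_accuracy(predictions):
    """predictions: (market_belief, replicated) pairs. Beliefs below 50%
    are folded onto 1-p before bucketing; accuracy is the fraction of
    predictions whose direction matched the outcome."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, replicated in predictions:
        confidence = p if p >= 0.5 else 1 - p
        if confidence < 0.6:
            bucket = "50-60%"
        elif confidence < 0.7:
            bucket = "60-70%"
        elif confidence < 0.8:
            bucket = "70-80%"
        else:
            bucket = "80-100%"
        totals[bucket] += 1
        hits[bucket] += ((p >= 0.5) == replicated)
    return {b: hits[b] / totals[b] for b in totals}

# Hypothetical predictions, not the paper's data:
print(bucket_accuracy([(0.85, True), (0.55, False), (0.28, True), (0.65, True)]))
```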
For beliefs of 50-60% confidence, the market was correct 29% of the time. Across the 3 markets this varied from 0-50% correct.
For beliefs of 60-70% confidence, the market was correct 93% of the time. Across the 3 markets this varied from 75-100% correct.
For beliefs of 70-80% confidence, the market was correct 78% of the time. Across the 3 markets this varied from 75-83% correct.
For beliefs of 80-100% confidence, the market was correct 89% of the time. Across the 3 markets this varied from 75-100% correct.
We could claim that anything which the markets put in the 50-60% range is genuinely uncertain, but that for everything above 60% we should adjust all probabilities up to at least 75%, maybe something like an 80-85% chance.
If I perform the same buying/selling that I discussed previously but set my limit to 0.6 instead of 0.5 (i.e. don’t buy or sell in the range 40%-60%) then I would make a tidy profit in all 3 markets.
But I’m not sure whether I’m completely persuaded. Essentially there is only one range which differs significantly from the market being well calibrated (p=0.024, two-tailed binomial). If I adjust for multiple hypothesis testing this is no longer significant. There is some Bayesian evidence here but not enough to completely persuade me.
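I don't know exactly which counts produced the p=0.024 figure, but an exact two-tailed binomial test of the kind used here can be sketched with the standard "small-p" method (sum the probabilities of every outcome no more likely than the observed one):

```python
from math import comb

def binom_two_tailed(k, n, p0):
    """Exact two-tailed binomial test: probability, under null rate p0,
    of any outcome at most as likely as the observed count k."""
    pmf = [comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(n + 1)]
    # small tolerance so floating-point ties still count as ties
    return sum(q for q in pmf if q <= pmf[k] * (1 + 1e-12))

# Sanity check on a simple symmetric case: 10 hits out of 10 at p0 = 0.5
print(binom_two_tailed(10, 10, 0.5))  # 2/1024, both extreme tails
```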
I don’t think the paper in question provides sufficient evidence to conclude that there are unknown knowns in predicting study replication. It is good to know that we are fairly good at predicting which results will replicate but I think the question of how well calibrated we are remains an open topic.
Hopefully the replication markets study will give more insights into this.
 The replication was performed in 2 stages. The first was intended to have a 95% chance of finding an effect size of 75% of the original finding. If the study replicated here, it stopped and was ticked off as a successful replication. Those that didn’t replicate in stage 1 proceeded to stage 2, where the sample size was increased in order to have a 95% chance of finding effect sizes at 50% of the original finding.
 Fig 7 in the supplementary information shows the same graph as in the OP but based on Treatment 1 market beliefs, which relate to stage 1 predictions. This still looks quite impressively monotonic. However the colouring system is misleading for analysing market success, as the colouring relates to success after stage 2 of the replication but the market was predicting stage 1. If this is corrected then the graph looks a lot less monotonic, flipping the results for Pyc & Rawson (6th), Duncan et al. (8th) and Ackerman et al. (19th).
I have done some credence training but I think my instincts here are more based on Maths and specifically Bayes (see this comment).
I think the zero probability thing is a red herring - replace the 0s with ϵ and the 50s with 50-ϵ and you get basically the same thing. There are some questions where keeping track of the ϵ just isn't worth it.
A proper scoring rule is designed to reward both knowledge and accurate reporting of credences. This is achieved if we score based on the correct answer, whether or not we also score based on the probabilities of the wrong answers.
If we also attempt to optimise for certain ratios between credences of different answers then this is at the expense of rewarding knowledge of the correct answer.
If Alice has credence levels of 50:50:ϵ:ϵ and Bob has 40:20:20:20 and the correct answer is A then Bob will get a higher score than Alice despite her putting more of her probability mass on the correct answer.
Do you consider this a price worth paying to reward having particular ratios between credences?
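A quick check of that trade-off, using the quadratic (Brier-style) rule for "score everything" and a log score on the correct answer only for "score just the right answer" (both standard proper rules, not necessarily the exact formulas from the OP):

```python
from math import log

def brier(probs, correct):
    """Quadratic score over all stated probabilities (lower is better)."""
    return sum((p - (1 if i == correct else 0)) ** 2 for i, p in enumerate(probs))

def log_score(probs, correct):
    """Score based only on the probability of the correct answer (higher is better)."""
    return log(probs[correct])

eps = 1e-6
alice = [0.5, 0.5, eps, eps]
bob = [0.4, 0.2, 0.2, 0.2]

# With A (index 0) correct, the quadratic rule favours Bob,
# while the correct-answer-only rule favours Alice.
print(brier(alice, 0), brier(bob, 0))
print(log_score(alice, 0), log_score(bob, 0))
```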
Maybe 1) is where I have a fundamental difference.
Given evidence A, a Bayesian update considers how well evidence A was predicted.
There is no additional update due to how well ¬A being false was predicted. Even if ¬A is split into sub-categories, it isn't relevant as that evidence has already been taken into account when we updated based on A being true.
Re: 2) 50:25:0:0 gives a worse expected value than 50:50:0:0 as, although my score increases if A is true, it decreases by more if B is true (assuming 50:50:0:0 is my true belief).
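This can be checked numerically under the quadratic rule (assuming that is the rule in play):

```python
def brier(probs, correct):
    """Quadratic score against the one-hot outcome (lower is better)."""
    return sum((p - (1 if i == correct else 0)) ** 2 for i, p in enumerate(probs))

def expected_brier(report, true_belief):
    """Expected Brier score of `report` when the answer is drawn from
    `true_belief` (lower is better)."""
    return sum(t * brier(report, i) for i, t in enumerate(true_belief))

true_belief = [0.5, 0.5, 0.0, 0.0]
print(expected_brier([0.5, 0.5, 0.0, 0.0], true_belief))   # honest report: 0.5
print(expected_brier([0.5, 0.25, 0.0, 0.0], true_belief))  # shaving B: 0.5625, worse
```

The gain when A is true (0.5 → 0.3125) is smaller than the loss when B is true (0.5 → 0.8125), so the honest report wins in expectation.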
Re: 3) I think it's important to note that I'm assuming that exactly 1 of A, B, C or D is the correct answer, and therefore that the probabilities should add up to 100% to maximise your expected score (otherwise it isn't a proper scoring rule).