Great collection of results. I found the interactive graph particularly useful.
I'm slightly confused by the trend lines (especially for Games and Other) - they don't seem intuitively the best fits. It looks like they place a lot of weight on the recent high-parameter models (possibly the cost for each datapoint is computed in parameter space rather than log(parameter) space?).
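To illustrate what I mean, here's a quick sketch with made-up parameter counts: fitting the same points by least squares on the raw counts vs on their logs gives quite different trend lines, because in raw space the residuals of the largest models dominate the fit.

```python
# Sketch of raw-space vs log-space least-squares fits.
# The parameter counts below are made up purely for illustration.
import numpy as np

years = np.array([2016, 2017, 2018, 2019, 2020, 2021], dtype=float)
params = np.array([1e8, 3e8, 1.5e9, 1e10, 1.75e11, 5e11])  # made-up values

# Fit in raw parameter space: residuals are measured in parameters,
# so the 10^11-scale points dominate the fit.
raw_fit = np.polyfit(years, params, 1)

# Fit in log space: residuals are measured in orders of magnitude,
# so every model counts roughly equally.
log_fit = np.polyfit(years, np.log10(params), 1)

print("raw-space slope (params/year):", raw_fit[0])
print("log-space slope (OOMs/year):  ", log_fit[0])
```

If the plotting library minimises cost in raw space, the line will hug the recent high-parameter points, which would explain what I'm seeing.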
My hypothesis - Apologising is low status.
Possibly this is no longer the case (apologising first is often seen as a sign of maturity) but I can certainly see this being the case in the ancestral environment.
This would match my experience that apologising feels awful even when it is entirely my fault.
I guess if the other person has already apologised then me also apologising just puts our status back roughly where we started which is why going second feels much easier.
A previous calculation on LW gave 2.4 x 10^24 for AlphaStar (using values from the original AlphaStar blog post), which suggested that the trend was roughly on track.
The differences between the two calculations are (your values first):
Agents: 12 vs 600
Days: 44 vs 14
TPUs: 32 vs 16
Utilisation: 33% vs 50% (I think this is just estimated in the other calculation)
Do you have a reference for the values you use?
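Since these differences compound multiplicatively, a quick sanity check of the ratio implied by the figures above (ignoring FLOPs-per-TPU and any other assumptions, which may also differ between the two calculations):

```python
# Ratio of TPU-days implied by the two sets of figures above
# (agents x days x TPUs x utilisation). FLOPs-per-TPU assumptions
# are ignored here and may also differ.
post = 12 * 44 * 32 * 0.33   # your values
lw = 600 * 14 * 16 * 0.50    # previous LW calculation

print(lw / post)  # the previous calculation implies ~12x more TPU-days
```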
Or, alternatively, did Oxford really find a pharmaceutical company so incompetent that they did this by mistake, on top of giving an entire trial segment the wrong dose of vaccine the first time around? These are some rather epic screwups.
My experience working for a large company makes me not particularly surprised by this, and I would give a decent amount of probability to this being an accident. I don't know enough about the specific procedures to be hugely confident, but it does seem most likely to me.
If we're fairly confident that the wrong-dose thing was an accident - I can't think of any reason to do this deliberately and then try to cover it up - then AstraZeneca obviously have the potential to make big mistakes.
One scenario would be that the person requesting/approving the press release is not the person running the project but rather their boss, or their boss's boss, or even someone in another department. The press-release approver is less involved in the minutiae and has remembered the 79% figure - maybe they even go so far as to check their e-mails that this is the correct figure (or check with someone else who checks their e-mails). Probably none of these people were in the meeting with the safety board.
I have had this experience myself on many occasions, where my superiors have given customers outdated information simply from not being up to date or forgetting the latest results. I'd like to think more care would be taken over something like this, but the dosing debacle suggests that checking things isn't AstraZeneca's strong suit.
That, combined with the ~0% chance of this going unnoticed, suggests to me that this wasn't on purpose.
There is a system for describing human facial expressions - Facial Action Coding System.
This has also been expanded for some animals (chimps, macaques, gibbons, orangutans, dogs, cats, horses). Alas, no dolphins.
I wondered whether a decent amount of the cost increase came from changing from a hatchback to a sedan, but I see that it's only $1,000 to go from the Mirage hatchback to the sedan. And the Mirage sedan is the same size as a '90s Ford Escort sedan/station wagon, so size doesn't explain it either.
Yeah, I didn't actually answer q18 either (possibly knite used my list as a basis?) for exactly that reason. Scott just put me down as the same as him for that question for the purposes of making an apples-to-apples comparison, which seemed fine - no idea what I would have put if I had answered!
I'm kicking myself on #16 - I don't know enough about epidemiology to make such a strong guess.
Yeah, I did a similar thing on #38, where I was similarly overconfident on an economics question which I don't know nearly enough about.
On #16 itself I was lower than I should have been because I was using "virus" as a reference class rather than "respiratory virus" which was an obvious mistake looking back at it.
It looks like you're using the correct formula but maybe with a mistake about what the "p" in the formula means, so that your scores on questions where the answer was "false" are incorrect.
I think you maybe used ln(probability put on "true")-ln(.5) and then multiplied the result by -1 if the actual answer was false?
The formulation Scott used was ln(probability put on the correct answer) - ln(.5).
So for q3 for example the calculation shouldn't be
but should be
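To make the difference concrete (with a made-up probability, since I don't have the q3 numbers to hand), here's what I suspect happened vs the correct scoring:

```python
# Comparing the suspected buggy scoring rule with Scott's formulation.
# The 0.7 below is a made-up probability for illustration.
import math

def buggy_score(p_true, answer):
    # Suspected bug: score computed against "true", then sign-flipped
    # when the actual answer was false.
    s = math.log(p_true) - math.log(0.5)
    return -s if answer is False else s

def correct_score(p_true, answer):
    # Scott's formulation: ln(probability put on the correct answer) - ln(.5)
    p_correct = p_true if answer else 1 - p_true
    return math.log(p_correct) - math.log(0.5)

# e.g. 70% on "true" when the answer turned out to be false:
print(buggy_score(0.7, False))    # approx -0.336
print(correct_score(0.7, False))  # approx -0.511
```

The two rules agree on "true" questions but only coincide on "false" questions at p = 0.5, which would explain scores being off exactly where the answer was false.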
One for older / more interested kids - the Monty Hall problem.
I remember my uncle spending a long time going through this with me and having to actually run the scenario a few times before I believed he was right!
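These days you can run the scenario many more times than my uncle managed - a quick simulation sketch (standard Monty Hall setup, my own variable names):

```python
# Monte Carlo simulation of the Monty Hall problem: running it is
# often the most convincing demonstration for a sceptical kid (or adult).
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        doors = [0, 1, 2]
        car = random.choice(doors)
        pick = random.choice(doors)
        # Host opens a door that is neither the contestant's pick nor the car.
        opened = random.choice([d for d in doors if d != pick and d != car])
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in doors if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print("stay:  ", play(switch=False))  # approx 1/3
print("switch:", play(switch=True))   # approx 2/3
```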