Stefan_Schubert's Comments

Robin Hanson on the futurist focus on AI
How about a book that has a whole bunch of other scenarios, one of which is AI risk which takes one chapter out of 20, and 19 other chapters on other scenarios?

It would be interesting if you went into more detail on how long-termists should allocate their resources at some point; what proportion of resources should go into which scenarios, etc. (I know that you've written a bit on such themes.)

Unrelatedly, it would be interesting to see some research on the supposed "crying wolf effect"; maybe with regards to other risks. I'm not sure that effect is as strong as one might think at first glance.

Robin Hanson on the futurist focus on AI

Associate professor, not assistant professor.

Is there a definitive intro to punishing non-punishers?
One of those concepts is the idea that we evolved to "punish the non-punishers", in order to ensure the costs of social punishment are shared by everyone.

Before thinking of how to present this idea, I would study carefully whether it's true. I understand there is some disagreement regarding the origins of third-party punishment. There is a big literature on this. I won't discuss it in detail, but here are some examples of perspectives which deviate from that taken in the quoted passage.

Joe Henrich writes:

This only makes sense as cultural evolution. Not much third party punishment in many small-scale societies .

So in Henrich's view, we didn't even (biologically) evolve to punish wrong-doers (as third parties), let alone non-punishers. Third-party punishment is a result of cultural, not biological, evolution, in his view.

Another paper of potential relevance by Tooby and Cosmides and others:

A common explanation is that third-party punishment exists to maintain a cooperative society. We tested a different explanation: Third-party punishment results from a deterrence psychology for defending personal interests. Because humans evolved in small-scale, face-to-face social worlds, the mind infers that mistreatment of a third party predicts later mistreatment of oneself.

Another paper by Pedersen, Kurzban and McCullough argues that the case for altruistic punishment is overstated.

Here, we searched for evidence of altruistic punishment in an experiment that precluded these artefacts. In so doing, we found that victims of unfairness punished transgressors, whereas witnesses of unfairness did not. Furthermore, witnesses’ emotional reactions to unfairness were characterized by envy of the unfair individual's selfish gains rather than by moralistic anger towards the unfair behaviour. In a second experiment run independently in two separate samples, we found that previous evidence for altruistic punishment plausibly resulted from affective forecasting error—that is, limitations on humans’ abilities to accurately simulate how they would feel in hypothetical situations. Together, these findings suggest that the case for altruistic punishment in humans—a view that has gained increasing attention in the biological and social sciences—has been overstated.
How do you assess the quality / reliability of a scientific study?

A recent paper developed a statistical model for predicting whether papers would replicate.

We have derived an automated, data-driven method for predicting replicability of experiments. The method uses machine learning to discover which features of studies predict the strength of actual replications. Even with our fairly small data set, the model can forecast replication results with substantial accuracy — around 70%. Predictive accuracy is sensitive to the variables that are used, in interesting ways. The statistical features (p-value and effect size) of the original experiment are the most predictive. However, the accuracy of the model is also increased by variables such as the nature of the finding (an interaction, compared to a main effect), number of authors, paper length and the lack of performance incentives. All those variables are associated with a reduction in the predicted chance of replicability.
The first result is that one variable that is predictive of poor replicability is whether central tests describe interactions between variables or (single-variable) main effects. Only eight of 41 interaction effect studies replicated, while 48 of the 90 other studies did.

Another, unrelated, thing is that authors often make inflated interpretations of their studies (in the abstract, the general discussion section, etc). Whereas there is a lot of criticism of p-hacking and other related practices pertaining to the studies themselves, there is less scrutiny of how authors interpret their results (in part that's understandable, since what counts as a dodgy interpretation is more subjective). Hence when you read the methods and results sections it's good to think about whether you'd make the same high-level interpretation of the results as the authors.

Two explanations for variation in human abilities

One aspect may be that the issues we discuss and try to solve are often at the limit of human capabilities. Some people are way better at solving them than others, and since those issues are so often in the spotlight, it looks like the less able are totally incompetent. But actually, they're not; it's just that the issues they are able to solve aren't discussed.


What Comes After Epistemic Spot Checks?
On first blush this looks like a success story, but it’s not. I was only able to catch the mistake because I had a bunch of background knowledge about the state of the world. If I didn’t already know mid-millenium China was better than Europe at almost everything (and I remember a time when I didn’t), I could easily have drawn the wrong conclusion about that claim. And following a procedure that would catch issues like this every time would take much more time than ESCs currently get.

Re this particular point, I guess one thing you might be able to do is to check arguments, as opposed to statements of facts. Sometimes, one can evaluate whether arguments are valid even when one isn't too knowledgeable about the particular topic. I previously did some work on argument-checking of political debates. (Though the rationale for that wasn't that argument-checking can require less knowledge than fact-checking, but rather that fact-checking of political debates already exists, whereas argument-checking does not).

I never did any systematic epistemic spot checks, but if a book contains a lots of arguments that appear fallacious or sketchy, I usually stop reading it. I guess that's related.

Replace judges with Keynesian beauty contests?

Thanks for this. In principle, you could use KBCs for any kind of evaluation, including evaluation of products, texts (essay grading, application letters, life plans, etc), pictures (which of my pictures is the best?), etc. The judicial system is very high-stakes and probably highly resistant to reform, whereas some of the contexts I list are much lower stakes. It might be better to try out KBCs in such a low-stakes context (I'm not sure which one would be best). I don't know what extent KBCs have tested for these kinds of purposes (it was some time since I looked into these issues, and I've forgotten a bit). That would be good to look into.

One possible issue that one would have to overcome is explicit collusion among subsets of raters. Another is, as you say, that people might converge on some salient characteristics that are easily observable but don't track what you're interested in (this could at least in some cases be seen as a form of "tacit collusion").

My impression is that collusion is a serious problems for ratings or recommender systems (which KBCs can be seen as a type of) in general. As a rule of thumb, people might be more inclined to engage in collusion when the stakes are higher.

To prevent that, one option would be to have a small number of known trustworthy experts, who also make evaluations which function as a sort of spot checks. Disagreement with those experts could be heavily penalised, especially if there are signs that the disagreement is due to (either tacit or explicit) collusion. But in the end only any anti-collusion measure needs to be tested empirically.

Relatedly, once people have a history of ratings, you may want to give disproportionate weights to those with a strong track record. Such epistocratic systems can be more efficient than democratic systems. See Thirteen Theorems in Search of the Truth.

KBCs can also be seen as a kind of prediction contests, where you're trying to predict other people's judgements. Hence there might be synergies with other forms of work on predictions.

Occam's Razor: In need of sharpening?

There is a substantial philosophical literature on Occam's Razor and related issues:

Hedge drift and advanced motte-and-bailey

Yes, a new paper confirms this.

The association between quality measures of medical university press releases and their corresponding news stories—Important information missing
Say Wrong Things

Agreed; those are important considerations. In general, I think a risk for rationalists is to change one's behaviour on complex and important matters based on individual arguments which, while they appear plausible, don't give the full picture. Cf Chesterton's fence, naive rationalism, etc.

Load More