Within-group differences are larger than between-group differences in most of these domains, so I'd rather make it easier for members of both groups to deviate from their group tendencies than try to identify more group tendencies that will be hard to deviate from.

3 Levels of Rationality Verification

I don't see what I thought were the obvious answers, so here they are. The foundations are elsewhere on the site, but they seemed missing from this list.

Reputational: Expect Bayesian masters to participate in other scientific fields. People who make more discoveries in other fields get more street cred among rationalists, especially when they can explain how rationalism helped them make the discoveries. Obviously, this is a long-term process that doesn't lend itself to improving the art quickly.

Experimental: This one's a two-step process. First, ask a large collection of university professors to insert one lie into each of their lectures, à la http://www.overcomingbias.com/2008/02/my-favorite-lia.html (mentioned in another comment). Have them note which students discover each lie, but don't let that count toward any sort of grade (to prevent gaming). Second, sort students randomly into the experimental rationality classes, and/or have the classes "fill up" (with a lottery for seats) to provide a control. Look for whether there's a difference in lie-detection rates between the differently-taught groups.
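For the comparison in the second step, here is a rough sketch of one way to check whether two groups differ in lie-detection rate. It uses a standard two-proportion z-test, which is my choice for illustration and not something the proposal specifies. The function name and the counts are hypothetical, and the test assumes every detection opportunity is independent, which is not quite true when the same student sits through many lectures.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Compare two detection rates (hypothetical helper, not from the proposal).

    hits_a / n_a: lies caught / lies presented for group A (e.g. lottery winners).
    hits_b / n_b: the same for group B (e.g. the control group).
    Returns (rate difference, z statistic, two-sided p-value).
    Assumption: each opportunity is an independent trial.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("each group needs at least one opportunity")

    rate_a = hits_a / n_a
    rate_b = hits_b / n_b

    # Pooled rate under the null hypothesis that both groups detect equally well.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    if std_err == 0:
        # Everyone caught everything, or nobody caught anything: no evidence of a gap.
        return rate_a - rate_b, 0.0, 1.0

    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a - rate_b, z, p_value


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    diff, z, p = two_proportion_z_test(60, 100, 40, 100)
    print(f"difference={diff:.2f}, z={z:.2f}, p={p:.4f}")
```

A real analysis would probably want something that accounts for repeated measurements on the same student, but this is enough to show what "a difference between groups" would mean in practice.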

Experimental #2, much longer term: Track the career outcomes of the students who took each different rationality class. See whether there's a difference in winning between the groups.

Treat not submitting a mistake report as the "I have no idea" claim: that you've assigned a probability of "mistakes/total emails" to this particular email being a mistake.
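As a minimal sketch of that default, assuming each email either comes with a submitted probability or with no report at all: a missing report is filled in with the base rate of mistakes so far. The function and parameter names here are hypothetical, chosen only for illustration.

```python
def implied_mistake_probability(reported, mistakes_so_far, total_emails):
    """Probability that this email is a mistake (hypothetical helper).

    reported: the probability the person submitted, or None if they
              submitted no mistake report for this email.
    mistakes_so_far / total_emails: the observed base rate of mistakes.
    """
    if total_emails <= 0:
        raise ValueError("need at least one email to compute a base rate")
    if reported is not None:
        return reported
    # No report means "I have no idea", which is scored as the base rate.
    return mistakes_so_far / total_emails


if __name__ == "__main__":
    # Made-up numbers: 3 mistakes among 60 emails so far.
    print(implied_mistake_probability(None, 3, 60))
    print(implied_mistake_probability(0.4, 3, 60))
```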