I guess you intend to classify the responses afterward to discover underexplored dimensions of evaluations. Anticipating that, I will just offer a lot of dimensions and examples thereof:
You can treat this submission as an evaluation of evaluations ;-)
Many systems get "spot-checked" by artificially forcing them into a rare but important-to-handle-correctly stressed state, under controlled conditions where more monitoring and recovery resources are available (or where the stakes are lower) than would be the case during a real instance of that state.
These serve to practice procedures, yes, but they also serve to evaluate whether the procedures would be followed correctly in a crisis, and whether the procedures even work.
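In software this is the idea behind chaos-engineering-style drills. Here is a minimal sketch of the pattern, assuming a hypothetical primary/failover pair of services (the `Service` class and `drill` function are illustrative, not any real library):

```python
class Service:
    """Toy stand-in for a production service."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"

def drill(primary, failover, requests):
    """Force the rare stressed state (a primary outage) mid-run, under
    controlled conditions, to evaluate whether the recovery procedure
    (fail over and keep serving) actually works."""
    results = []
    for i, request in enumerate(requests):
        if i == len(requests) // 2:          # inject the fault deliberately
            primary.healthy = False
        try:
            results.append(primary.handle(request))
        except RuntimeError:
            results.append(failover.handle(request))  # procedure under test
    return results

out = drill(Service("primary"), Service("failover"),
            [f"req-{i}" for i in range(6)])
assert len(out) == 6, "recovery procedure dropped requests during the drill"
```

The point of running this in a test harness rather than waiting for a real outage is exactly the point above: you get to watch every request while the stakes are low.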
A meeting quality score, as described in the patent referenced in this article (https://www.geekwire.com/2020/microsoft-patents-technology-score-meetings-using-body-language-facial-expressions-data/)
Two that are focused on critique rather than evaluation per se:
Microsoft TrueSkill (a multiplayer Elo-like rating system, https://www.wikiwand.com/en/TrueSkill)
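For context, a minimal sketch of the classic Elo update that TrueSkill generalizes (TrueSkill itself maintains a Gaussian belief over each player's skill and updates it with approximate Bayesian inference; this is only the simpler Elo rule):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Classic Elo update for a single game.
    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: a 1500-rated player beats a 1700-rated player.
# The underdog gains more than k/2 because the win was unexpected.
print(elo_update(1500, 1700, 1.0))
```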
I originally read this EA as "Evolutionary Algorithms" rather than "Effective Altruism", which made me think of this paper on degenerate solutions to evolutionary algorithms (https://arxiv.org/pdf/1803.03453v1.pdf). One amusing example is shown in a video at https://twitter.com/jeffclune/status/973605950266331138?s=20
One key factor in metrics is how the number relates to its meaning. We'd prefer metrics whose scales are meaningful to users rather than arbitrary. I really liked one example I saw recently.
In discussing this point in a paper entitled "Arbitrary metrics in psychology," Blanton and Jaccard (doi:10.1037/0003-066X.61.1.27) first point out that Likert scales are not so useful. They then discuss the (in)famous IAT (Implicit Association Test), where the scale is a direct measurement of the quantity of interest, but note that: "The metric of milliseconds, however, is arbitrary when it is used to measure the magnitude of an attitudinal preference." Therefore, when thinking about degree of racial bias, "researchers and practitioners should refrain from making such diagnoses until the metric of the IAT can be made less arbitrary and until a compelling empirical case can be made for the diagnostic criteria used." They go on to discuss norming measures and looking at variance - but the base measure being used is not meaningful, so any transformation is of dubious value.
Going beyond that paper and looking at the broader literature on biases, we can come up with harder-to-measure but more meaningful measures of bias. The probability of being hired given a racially coded name might be a more meaningful indicator - but a probability is still not a clear indicator, and using names as a proxy obscures whether the measurement is picking up class rather than race. It is also not clear how big an effect a given difference in probability implies, even though the quantity itself is directly meaningful.
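To make the probability point concrete, here is the usual audit-study arithmetic as a sketch; the counts below are placeholders for illustration, not data from any real study:

```python
from math import sqrt

def callback_gap(calls_a, sent_a, calls_b, sent_b):
    """Difference in callback rates between two name groups,
    with a normal-approximation 95% confidence interval."""
    p_a, p_b = calls_a / sent_a, calls_b / sent_b
    gap = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / sent_a + p_b * (1 - p_b) / sent_b)
    return gap, (gap - 1.96 * se, gap + 1.96 * se)

# Hypothetical counts, for illustration only.
gap, ci = callback_gap(calls_a=90, sent_a=1000, calls_b=60, sent_b=1000)
print(f"gap = {gap:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Even a statistically clear gap like this leaves open the interpretive questions above: whether the proxy is tracking class or race, and what a three-point gap actually costs anyone in practice.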
A very directly meaningful measure of bias that is even easier to interpret is dollars. This is immediately meaningful; if a person pays a different amount for identical service, that is a meaningful indicator of not only the existence but the magnitude of a bias. Of course, establishing pay differentials is an indirect and complex exercise, but there are better ways of getting the same information in less problematic contexts. The evidence can still be direct: for example, how much people bid for watches photographed on a black or a white person's wrist is a much more direct and useful way to understand how much bias is being displayed.
I have a go-to evaluation system for finding the best-ROI items in a brainstormed list among team members. First we generate the list, which ends up with, say, three dozen items from six of us. Then we name a reasonably small but large-enough number like 10. Everyone may put 10 stars by items, at most 2 per item, for any reason they like, including "this would be best for my morale". Sort, pick the top three to use. Any surprises? Discuss them. (Modify the numbers 10, 2, and 3 as appropriate.)
This evaluation system is simple to implement in many contexts, easily understood with little explanation, fast, and produces perfectly acceptable if not necessarily optimal results. It is pretty decent at extracting information from folks' intuitions without requiring them to introspect enough to make those intuitions explicit.
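A minimal sketch of the tally in Python, assuming stars arrive as (person, item) pairs and using the numbers above; the function and names are illustrative:

```python
from collections import Counter

def top_picks(votes, stars_per_person=10, max_per_item=2, n_picks=3):
    """Tally star votes and return the top items.
    votes: list of (person, item) pairs, one pair per star placed."""
    # Enforce the per-person and per-item constraints.
    per_person = Counter(person for person, _ in votes)
    assert all(c <= stars_per_person for c in per_person.values())
    per_pair = Counter(votes)
    assert all(c <= max_per_item for c in per_pair.values())
    # Sort items by total stars and take the top few.
    totals = Counter(item for _, item in votes)
    return totals.most_common(n_picks)

votes = [("ann", "fix CI"), ("ann", "fix CI"), ("bob", "fix CI"),
         ("bob", "new docs"), ("cal", "new docs"), ("cal", "refactor")]
print(top_picks(votes))  # [('fix CI', 3), ('new docs', 2), ('refactor', 1)]
```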
Current open market price of an asset
Public, highly liquid markets for assets create lots of information about the value of those assets, which is extremely useful for both individuals and firms trying to understand what those assets are worth.
I see "property assessment" on the list, but it's worth calling out self-assessment specifically (where the owner has to sell their property if offered their self-assessed price).
Then there are those grades organizations give politicians. And media endorsements of politicians. And, for that matter, elections.
Keynesian beauty contests.
And it seems worth linking to this prior post (not mine): https://www.lesswrong.com/posts/BthNiWJDagLuf2LN2/evaluating-predictions-in-hindsight
My posts here are basically all evaluations or considerations useful for cost-effectiveness evaluations. They are crossposted from the EA Forum. The most interesting ones for your purpose are probably:
- A general framework for evaluating aging research. Part 1: reasoning with Longevity Escape Velocity
- Why SENS makes sense
- Evaluating Life Extension Advocacy Foundation