TLDR

Submit ideas of “interesting evaluations” in the comments. The best one by December 5th will get $50. All of them will be highly appreciated.

Motivation

A few of us (myself, Nuño Sempere, and Ozzie Gooen) have been working recently to better understand how to implement meaningful evaluation systems for EA/rationalist research and projects. This is important both for short-term use (so we can better understand how valuable EA/rationalist research is) and for long-term use (e.g., setting up scalable forecasting systems on qualitative parameters). In order to understand this problem, we've been investigating evaluations specific to research as well as evaluations in a much broader sense.

We expect work in this area to be useful for a wide variety of purposes. For instance, even if Certificates of Impact eventually get used as the primary mode of project evaluation, purchasers of certificates will need strategies to actually do the estimation. 

Existing writing on “evaluations” tends to be fairly domain-specific (focused only on, say, education or nonprofits), one-sided (yay evaluations or boo evaluations), or both. This often isn’t particularly useful when trying to understand the potential gains and dangers of setting up new evaluation systems.

I’m now investigating a neutral history of evaluations, with the goal of identifying trends in what helps or hinders an evaluation system in achieving its goals. The ideal output of this stage would be an absolutely comprehensive list, which will be posted to LessWrong. While a truly comprehensive list is probably impractical, we can hopefully make one comprehensive enough, especially with your help.

Task

Suggest an interesting example (or examples) of an evaluation system. For these purposes, evaluation means "a systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards", but if you think of something that doesn't seem to fit, err on the side of inclusion.

Prize

The prize is $50 for the top submission.

Rules

To enter, submit a comment suggesting an interesting example below, before the 5th of December. This post is both on LessWrong and the EA Forum, so comments on either count.

Rubric

To hold true to the spirit of the project, we have a rubric evaluation system to score this competition. Entries will be evaluated using the following criteria:

  • Usefulness/uniqueness of the lesson from the example
  • Novelty or surprise of the entry itself, for Elizabeth
  • Novelty of the lessons learned from the entry, for Elizabeth

Accepted Submission Types

I care about finding interesting things more than proper structure. Here are some types of entries that would be appreciated:

  • A single example in one of the categories already mentioned
  • Four paragraphs on an unusual exam and its interesting impacts
  • A babbled list of 104 things that vaguely sound like evaluations

Examples of Interesting Evaluations

We have a full list here, but below is a subset so as not to anchor you too much. Don't worry about submitting duplicates: I’d rather risk a duplicate than miss an example.

  1. Chinese Imperial Examination
  2. Westminster Dog Show
  3. Turing Test
  4. Consumer Reports Product Evaluations
  5. Restaurant Health Grades
  6. Art or Jewelry Appraisal
  7. ESGs/Socially Responsible Investing Company Scores
  8. “Is this porn?”
    a. Legally?
    b. For purposes of posting on Facebook?
  9. Charity Cost-Effectiveness Evaluations
  10. Judged Sports (e.g. Gymnastics)

Motivating Research

These are some of our previous related posts:


Answers

I guess you intend to classify the responses afterward to discover underexplored dimensions of evaluations. Anticipating that, I will just offer a lot of dimensions and examples thereof:

  • Evaluation of an attribute the subject can or cannot influence (weight vs. height)
  • Kind of evaluated attribute(s) - physical (weight), technical (gear ratio), cognitive (IQ, mental imagery), mental (stress), medical (ICD classification), social (number of friends), mathematical (prime), ...
  • Abstractness of the evaluated attribute(s)
    • low: e.g. directly observable physical attributes like height; 
    • high: requiring expert interpretation and judgment, e.g. the beauty of a proof
  • Evaluation with the knowledge of the subject or without - test in school vs. secret observation
  • Degree of Goodharting possible or actually occurring on the evaluation
  • Entanglement of the evaluation with the subject and evaluator
    • No relation between both - two random strangers, one assesses the other and moves on
    • Evaluator acts in a complex system, subject does not - RCT study of a new drug in mice
    • Both act in a shared complex system - employee evaluation by superior
  • Evaluation that is voluntary or not - medical check vs. appraisal on the job
  • Evaluation that is legal or not - secret observation of work performance is often illegal
  • Evaluation by the subject itself, another entity, or both together
  • Purpose of the evaluation - decision making (which candidate to choose), information gathering (observing competitors' or one's own strengths), or quality assurance (goods meet expectations)
  • Evaluation for the purpose of the subject, the evaluator, another party, or a combination - exams in school often serve all of these
  • Objectivity of the evaluation or the underlying criteria
  • Degree of standardization or "acceptedness" of the criteria - SAT vs. ad-hoc questionnaire
  • Single (entry exams), repeated (test in school), or continuous evaluation (many technical monitoring systems)
  • Size of evaluated population - single, few, statistically relevant sample size, or all
  • Length of the evaluation
  • Effort needed for the evaluation

You can treat this submission as an evaluation of evaluations ;-)

EDIT: spell checking

Stress tests

Many systems get "spot-checked" by artificially forcing them into a rare but important-to-correctly-handle stressed state under controlled conditions where more monitoring and recovery resources are available (or where the stakes are lower) than would be the case during a real instance of the stressed state.

These serve to practice procedures, yes, but they also serve to evaluate whether the procedures would be followed correctly in a crisis, and whether the procedures even work.

  • Drills
    • Fire/tornado/earthquake/nuclear-attack drills
    • Military drills (the kind where you tell everyone to get to battle stations, not the useless marching around in formation kind)
  • Large cloud computing companies I've worked at need to stay online in the face of losing a single computer, or a single datacenter. They periodically check that these failures are survivable by directly powering off computers, disconnecting entire datacenters from the network, or simply running through a datacenter failover procedure from beginning to end to check that it works.
  • https://en.wikipedia.org/wiki/Stress_test_(financial)

Two that are focused on critique rather than evaluation per se:

Microsoft TrueSkill (Multiplayer ELO-like system, https://www.wikiwand.com/en/TrueSkill)
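For intuition about what "ELO-like" means here, below is a minimal sketch of the classic two-player Elo update that TrueSkill generalizes. This is illustrative only and not TrueSkill's actual algorithm (TrueSkill is Bayesian: it tracks a mean and an uncertainty per player and supports teams and free-for-all matches).

```python
# Illustrative sketch only: the classic Elo update that TrueSkill generalizes.
# Not TrueSkill's actual algorithm; TrueSkill tracks a mean and an uncertainty
# per player and handles teams/multiplayer matchups.
def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (draw), or 0.0 (A loses)."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Example: a 1500-rated player beats a 1600-rated player and gains ~20 points.
print(elo_update(1500, 1600, 1.0))
```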

I originally read this EA as "Evolutionary Algorithms" rather than "Effective Altruism", which made me think of this paper on degenerate solutions to evolutionary algorithms (https://arxiv.org/pdf/1803.03453v1.pdf). One amusing example is shown in a video at https://twitter.com/jeffclune/status/973605950266331138?s=20

 

Some additional ideas: There's a large variety of "loss functions" that are used in machine learning to score the quality of solutions. There are a lot of these, but some of the most popular are below. A good overview is at https://medium.com/udacity-pytorch-challengers/a-brief-overview-of-loss-functions-in-pytorch-c0ddb78068f7
* Mean Absolute Error (a.k.a. L1 loss)
* Mean squared error
* Negative log-likelihood
* Hinge loss
* KL divergence 
* BLEU loss for machine translation (https://www.wikiwand.com/en/BLEU)
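
To make the listed losses concrete, here is a small sketch (mine, not from the linked overview) of how a few of them are computed, written in plain NumPy:

```python
# Minimal sketch of a few common loss functions, written with NumPy for clarity.
# Frameworks like PyTorch ship optimized versions of these; this is just the math.
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """L1 loss: average absolute difference between targets and predictions."""
    return np.mean(np.abs(y_true - y_pred))

def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large errors more heavily."""
    return np.mean((y_true - y_pred) ** 2)

def negative_log_likelihood(probs, true_classes):
    """NLL for predicted class probabilities (rows of `probs` sum to 1)."""
    return -np.mean(np.log(probs[np.arange(len(true_classes)), true_classes]))

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions (arrays summing to 1)."""
    return np.sum(p * np.log(p / q))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
print(mean_absolute_error(y_true, y_pred))  # ~0.23
print(mean_squared_error(y_true, y_pred))   # 0.09
```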

There's also a large set of "goodness of fit" m...

One key factor in metrics is how the number relates to the meaning. We'd prefer metrics that have scales which are meaningful to the users, not arbitrary. I really liked one example I saw recently.

In discussing this point in a paper entitled "Arbitrary metrics in psychology," Blanton and Jaccard (doi:10.1037/0003-066X.61.1.27) first point out that Likert scales are not so useful. They then discuss the (in)famous IAT, where the scale is a direct measurement of the quantity of interest, but note that: "The metric of milliseconds, however, is arbitrary when it is used to measure the magnitude of an attitudinal preference." Therefore, when thinking about degree of racial bias, "researchers and practitioners should refrain from making such diagnoses until the metric of the IAT can be made less arbitrary and until a compelling empirical case can be made for the diagnostic criteria used." They go on to discuss norming measures and looking at variance - but the base measure being used is not meaningful, so any transformation is of dubious value.

Going beyond that paper and looking at the broader literature on biases, we can come up with harder-to-measure but more meaningful measures of bias. Using the probability of hiring someone based on racially coded names might be a more meaningful indicator - but probability is also not a clear indicator, and the use of names as a proxy obscures some key details about whether the measurement is class-based or racial. It's also not clear how big an effect a difference in probability makes, despite it being directly meaningful.

A very directly meaningful measure of bias that is even easier to interpret is dollars. This is immediately meaningful: if a person pays a different amount for identical service, that is a meaningful indicator of not only the existence but the magnitude of a bias. Of course, evidence of pay differentials is a very indirect and complex question, but there are better ways of getting the same information in less problematic contexts. Evidence can still be direct: for example, how much people bid for watches photographed on a black or white person's wrist is a much more direct and useful way to understand how much bias is being displayed.

See also: https://twitter.com/JessieSunPsych/status/1333086463232258049

Elizabeth:
Oh man, I wish you'd come in under the deadline. For people who don't feel like clicking: it's a quantification of behavior predicted by different scores on Big-5.

"Postmortem culture" from the Google SRE book: https://sre.google/sre-book/postmortem-culture/

This book has some other sections that are also about evaluation, but this chapter is possibly my favorite chapter from any corporate handbook.

I have a go-to evaluation system for best ROI items in a brainstormed list amongst team members. First we generate the list, which ends up with, say, three dozen items from 6 of us. Then name a reasonably small but large-enough number like 10. Everyone may put 10 stars by items, max 2 per item, for any reason they like, including "this would be best for my morale". Sort, pick the top three to use. Any surprises? Discuss them. (Modify numbers like 8, 2, and 3 as appropriate.)

This evaluation system is simple to implement in many contexts, easily understood without much explanation, fast, and produces perfectly acceptable if not necessarily optimal results. It is pretty decent at grabbing info from folks' intuitions without requiring them to introspect enough to make those intuitions explicit.
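
As a concrete illustration (my own sketch, with made-up item names, not the commenter's), the tallying step of the scheme above might look like this:

```python
# Sketch of the star-voting tally described above: each person gets 10 stars,
# at most 2 per item, and the top three items by total stars are selected.
from collections import Counter

def tally(votes, top_n=3, stars_per_person=10, max_per_item=2):
    totals = Counter()
    for person, allocation in votes.items():
        assert sum(allocation.values()) <= stars_per_person, f"{person} used too many stars"
        assert all(v <= max_per_item for v in allocation.values()), f"{person} exceeded the per-item cap"
        totals.update(allocation)
    return totals.most_common(top_n)

votes = {
    "alice": {"fix onboarding": 2, "write docs": 2, "refactor tests": 2, "team lunch": 2},
    "bob":   {"fix onboarding": 2, "refactor tests": 2, "team lunch": 1},
}
print(tally(votes))  # [('fix onboarding', 4), ('refactor tests', 4), ('team lunch', 3)]
```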

Current open market price of an asset

Public, highly liquid markets for assets create lots of information about the value of those assets, which is extremely useful for both individuals and firms that are trying to understand:

  • the state of their finances
  • how successful a venture dealing in those assets has been
  • whether to accept a deal (a financial transaction, or some cooperative venture) involving those assets
  • (if the assets are stock in some company) how successful the company has been so far

Raven's Progressive Matrices

Welsh Figure Preference Test

Ruleset evolution in speedrunning as an example of a self-policing community.

In the news today: CASP (Critical Assessment of protein Structure Prediction)

I see "property assessment" on the list, but it's worth calling out self-assessment specifically (where the owner has to sell their property if offered their self-assessed price).

Then there are those grades organizations give politicians. And media endorsements of politicians. And, for that matter, elections.

Keynesian beauty contests.

And it seems worth linking to this prior post (not mine): https://www.lesswrong.com/posts/BthNiWJDagLuf2LN2/evaluating-predictions-in-hindsight

My posts here are basically all evaluations or considerations useful for cost-effectiveness evaluations. They are crossposted from the EA Forum. The most interesting ones for your purpose are probably: 

- A general framework for evaluating aging research. Part 1: reasoning with Longevity Escape Velocity
- Why SENS makes sense
- Evaluating Life Extension Advocacy Foundation

Comments

Winner

Last week we announced a prize for the best example of an evaluation. The winner of the evaluations prize is David Manheim, for his detailed suggestions on quantitative measures in psychology. I selected this answer because, although the IAT was already on my list, David provided novel information about multiple tests, which saved me a lot of work in evaluating them. David has had involvement with QURI (which funded this work) in the past and may again in the future, so this feels a little awkward, but ultimately his was the best suggestion, so it didn't feel right to take the prize away from him.

EDIT: David has elected to have the prize donated to GiveWell.

Honorable mentions to Orborde on financial stress tests, which was a very relevant suggestion that I was unfortunately already familiar with, and alexrjl on rock climbing route grades, which I would never have thought of in a million years but has less transferability to the kinds of things we want to evaluate.

Post-Mortem

How useful was this prize? I think running the contest was more useful than $50 of my time; however, it was not as useful as it could have been, because the target moved after we announced the contest. I went from writing about evaluations as a whole to specifically evaluations that worked, and I'm sure that if I'd asked for examples of that, they would have been provided. So possibly I should have waited to refine my question before asking for examples. On the other hand, the project was refined in part by looking at a wide array of examples (generated here and elsewhere), and it might have taken longer to home in on a specific facet without the contest.

It would be helpful if you explained what you mean by an "evaluation system". You seem to regard it as obvious. You provide no definition. You give a few examples. But do you really want people to have to spend time reverse-engineering what you are talking about?

When people put terms in quotes when not quoting someone, as you do, it usually signifies the use of those words in some non-standard manner. The fact that you put terms in quotes thus suggests to me that you are using the words in some unspecified non-standard manner.

I can guess what you might mean and I might even feel confident I am right. And I might later find out that you intended some other meaning and I should have known what you mean.

The first rule of any essay is to be clear what you are talking about. It is not a sin to state the "obvious". Karl Popper was notorious for this - being unclear and then deriding people for misconstruing him. See his later papers and books.

Thank you for raising the issue. Happy to clarify further. 

By evaluation we refer essentially[1] to the definition on the Wikipedia page here:

Evaluation is a systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards. It can assist an organization, program, design, project or any other intervention or initiative to assess any aim, realisable concept/proposal, or any alternative, to help in decision-making; or to ascertain the degree of achievement or value in regard to the aim and objectives and results of any such action that has been completed.

By "interesting" we mean what will do well on the listed rubric. We're looking for examples that would be informative for setting up new research evaluation setups. This doesn't mean the examples have to deal with research, but rather that they bring something new to the table that could be translated. For example, maybe there's a good story of a standardized evaluation that made a community or government significantly more or less effective.

[1]  I say "essentially" because I can imagine that maybe someone will point out some unintended artifact in the definition that goes against our intuitions, but I think that this is rather unlikely to be a problem.

I agree with the sentiment, although I think this should be a comment, not an answer.

Have moved to comments. Thank you both for the feedback.