Prediction-Augmented Evaluation Systems

ozziegooen

[Note: I made a short video of myself explaining this document here.]

It's common for groups of people to want to evaluate specific things. Here are a few examples I'm interested in:

The expected value of projects or actions within projects
Research papers, on specific rubrics
Quantitative risk estimates
Important actions that may get carried out by artificial intelligences

I think predictions could be useful in scaling and amplifying such evaluation processes. Humans and later AIs could predict intensive evaluation results. There has been previous discussion on related topics, but I thought it would be valuable to consider a specific model here called "prediction-augmented evaluation processes." This is a high-level concept that could be used to help frame future discussion.

Desiderata:

We can call a systematized process that produces evaluations an "evaluation process." Let's begin with a few generic desiderata of these.

High Accuracy / "Evaluating the right thing"
- Evaluations should aim at estimating the thing actually cared about as well as possible. In their limit according to some metric of effort, they should approximate ideal knowledge on the thing cared about.
High Precision / "Evaluating the chosen thing correctly"
- Evaluations should have low amounts of uncertainty and be very consistent. If the precision is generally less than what naive readers would guess, then these evaluations wouldn't be very useful.
Low Total Cost
- Specific evaluations can be costly, but the total cost across evaluations should be low.

I think that the use of predictions could allow us to well fulfill these criterions. It could help decouple evaluations from their scaling, allowing for independent optimization of the first two. The cost should be low relative to that of scaling evaluators in other obvious ways.

Prediction-Augmentation Example

Before getting formal with terminology, I think a specific example would be helpful.

Say Samantha scores research papers for quality on a scale from 1-10. She's great at it, she has a very thorough and lengthy reviewing procedure, and many others trust her reviews. Unfortunately, there's only one Samantha, and there are tons of research papers.

One way to scale Samantha's abilities would be to use a prediction aggregation system. A collection of other people would predict Samantha's scores before she rates them. Predictions would be submitted as probability distributions over possible scores. Each research paper would have a probability of being scored by Samantha, say 10%. In a naive model, this would be done in batches; the predictors could have 1 month to score 100 papers, and then at the end of the month 10 would randomly be chosen and rated by Samantha.

If this batch process would happen multiple times, then eventually outside observers could understand how accurate the predictors are and how to aggregate future forecasts to better predict Samantha's judgments.

An obvious improvement could be that some of the predictors may develop a sense of what arguments Samantha most likes and what data she cares for. They may write up summaries of their arguments to convince Samantha of their particular stances. If managed well, this could speed up Samantha's work and perhaps improve it. She may eventually find many of the people who best understand her system and develop an amount of trust in them. Of course, this could selectively bias her away from making accurate judgments, so this kind of feedback would have to be handled with care.

Once there are enough predictions, it may be possible to train ML agents to do prediction as well. The humans would essentially act as a "bootstrapping" system.

Subcomponents

I've outlined how I would describe the internals of a prediction-augmented evaluation process in an engineering system or similar. The wording here is a bit technical, on purpose, so feel free to skip this section.

augmentation system diagram

This diagram attempts to show a few different things. The entirety of a judging evaluation subprocess and prediction system make up the outer prediction-augmented evaluation process. The judging evaluation subprocess has a percent chance of evaluating each of a set of measurables. Predictors can make predictions on each one of these measurables, where they are trying to predict what the judging evaluation subprocess will judge for that measurable if it's chosen to judge it.

Judging Evaluation Subprocess

I imagine that prediction-augmentation could assist any evaluation process, even theoretically one that is already itself prediction-augmented. Prediction-augmentation acts as a layer that converts one narrow but good evaluation process into a more voluminous process.

In the context of a "prediction-augmented" evaluation process, the "wrapped" evaluation process can be considered the "judging" evaluation subprocess. This internal process would generate "judgments", and separately predictors will make predictions of future judgments. Both judgments and predictions would act as evaluations, so to speak.

There are already many evaluation systems used in the world, and I imagine that almost any could act as judging processes. The main bottlenecks would be judging quantity and reliability; this would be most useful for areas where evaluations are done for many similar things.

Because the judging process is well isolated, and scale is not a huge worry (that's pushed to the prediction layer), it can be thoroughly tested and optimized. Because the scaling mechanism is decently decoupled from the evaluation process, it could be much more rigorous than would otherwise be reasonable. For instance, a paper reviewer may typically spend 4 hours per paper, but with a prediction-augmented layer, perhaps they could spend 40 with the papers selected for judgment.

I use the phrase "evaluation process" rather than "evaluation" to point out the fact that this should be something outside the purview of a single individual. I imagine that the failure rate of individuals to evaluate things after a few years could be considerable, so it would be strongly preferable to have backup plans in the case that that happens. I would assume that organizations would generally be a better alternative, even if they were just mostly backing up individuals. Perhaps organizations could set up official trusts or other legal and financial structures to ensure that judgments get carried out.

There would have to be discussion about what the best evaluation processes would look like if many resources were put into predictions, but I think that's a really good discussion to encourage anyway.

One tricky part would be to further identify evaluation processes that multiple agents would find most informative. For instance, finding some individual that's trusted by several organizations with significant differences of opinion.

Measurables
Measurables refer to the things that get evaluated. It's a bit of a generic word for the use case, but I suspect useful in larger ontologies. Some examples could be "the rating of scientific paper X" or "the expected value of project Y." It's important to keep in mind that measurables only make sense in regards to specific evaluation systems; predictors would rarely predict the actual value of something, but rather, the result of a specific evaluation subprocess. For instance, "GDP of the United States, according to XYZ's process."

Predictions
The system obviously requires predictions, and for this to happen at a decent scale, almost definitely some kind of web application. In theory, a formal prediction market would work, but I imagine it would be very difficult to scale to the levels I would hope for in a large evaluation system. I'm personally more excited about more general prediction aggregation tools like The Good Judgment Project and Metaculus. Metaculus, in particular, allows participants to make guesses on continuous variables, which seems like a reasonable mechanism for evaluation systems. I'm also experimenting with a small project of my own to collect forecasts for experimental purposes.

Incentives for predictors could be a bit tricky to work out, but it definitely seems possible. It seems simple enough to pay people using a function that includes their prediction accuracy and quantity. Sign-ups could be screened to prevent lots of bots from joining. Of course, another option would be for the benefits from predictors to be something that itself gets evaluated using a separate prediction-augmented process.

Scaling & Amplification

I think the main two benefits Prediction-Augmentation could provide are that of "scaling" and "amplification." "Scaling" refers to the ability of such a system to effectively "scale" an evaluation judgment subprocess. The predictors would evaluate many more measurables than the judgment subprocess, and would do so sooner. "Amplification" refers to the ability of the system to improve the best abilities of the judging subprocess. This could come from speeding it up and/or by having judges read content produced by the prediction layer.

I expect "scaling" to be much more impactful than "amplification," especially for the early use of such systems.

Scaling & Amplification are very similar in ways to "Iterated Distillation and Amplification." However, these types of scaling & amplification are obviously not always automated, which is a big difference. That said, hypothetically people could eventually write prediction bots, and similar ones for amplification (with nice user interfaces, I assume.) I think prediction-augmentation may have relevance for direct use in technical AI alignment systems but I am currently more focused on human variants.

Existing/Possible Variants

Selective Evaluations
The judgment subprocess could select specific predicted variables for evaluations after reviewing the predictions, rather than choosing probabilistically. Judges would essentially "challenge" the measurables with the most questionable predictions. Selective evaluations may be more efficient than random evaluations, though it also could mean that predictors may be incentivized to predict items they expect the evaluators would select, leading to some potentially messy issues.

Selective evaluation is essentially very similar to some things many editors and managers do. A news editor may skim a long work by a writer (who is acting in part as a predictor of what the editor will accept), and at times challenge specific parts of text, to either improve directly or send back for improvement.

EV-Adjusted Probabilities
If evaluations are done probabilistically, the probabilities could change depending on the expected value of improved predictions on specific measurables. This could incentivize the predictors to allocate more effort accordingly. This could look a lot like selective evaluations in practice.

Traditional Prediction Systems
I would consider existing prediction aggregators/markets to fall under the umbrella of "Prediction-Augmented Evaluation Processes." These traditionally have had judging subprocesses that are very straightforward and simple; for instance, "Find the GDP of America in 2020 from Wikipedia." They effectively scale simple judgments purely by estimating them early, rather than also by attempting to recreate a complicated analysis.

Possible Uses

Project Evaluations
Projects could be evaluated for their expected marginal impact. This could provide information very similar to certificates of impact. I think that prediction-augmented evaluation systems could be more efficient than certificates of impact, but would first like to see both be tested more experimentally. This post by Ought poses a similar system for doing evaluations on parts of projects. This post by Robin Hanson discusses similar techniques for evaluating the impact of scientific papers.

General Research Questions
If researchers could express specific uncertain claims early on, then outsiders could predict these researcher's eventual findings. For example, a scientist could make a list of 100 binary questions they are not sure about, and promise to evaluate a random subset in 10 years.

AI Decision Validation
One possibility here could be to have a human act as a judge (hopefully augmented in some way), and an intelligent AI be the predictor. The AI would recommend actions/decisions to the human, and the human/augmentation system would selectively or statistically challenge these. I believe this is similar to ideas of selective challenging in AI Safety via Debate.

Human Value Judgement
If we could narrow value judgments into a robust evaluation process, we could scale this to AI systems. This could be used for making decisions around self-driving vehicles and similar. I imagine that much of the challenge here would be for people to agree on evaluation processes for moral questions, but if this could be approximated, the rest could be carried out somewhat straightforwardly. See this post by Paul Christiano for more information.

Website Moderation
Many forums and applications are pretty dependent on specific moderators for moderation. This kind of work could hypothetically help scale them in a controllable way. Future moderators would be obligated to predict the trusted moderators, rather than doing things in other ways. I'm not too sure about this, but know that others in the community have been enthusiastic. See this post by Paul Christiano for more information.

Alternative Dispute Resolution
Existing court systems and alternative dispute resolution systems already are similar to this process in theory. It would be interesting to imagine hypothetical court systems where lower courts would try to predict exactly what higher courts would rule, and on occasion, the higher courts would repeat the same cases. The appellate system may be more efficient, but there may be interesting hybrids. For one, this system could be useful for bootstrapping completely automated rulings.

Unimagined Uses
I imagine many of the most interesting uses of such a system haven't thought about. Prediction-augmented evaluation processes would have some positives and negatives current systems don't have, so may make sense in different cases. If they do very well, I would assume they may do so in ways that would surprise us.

Related Work

Much of what has been discussed here is very generic and thus many parts have been previously considered. Paul Christiano, and the team of Ought, in particular, have written about very similar ideas before; the main difference is that they seem to have focussed more on AI learning and specific decisions. Ought's Predicting Slow Judgements" work investigates how well humans make predictions on different scales of time for evaluations, and then how that could be mimicked by AIs. I've done some work with them before and recommend them to others interested in these topics. Andreas Stuhlmüller's (founder of Ought) previous work with dialog markets is also worth reading.

There seems to be a good amount of research on evaluation procedures and separately on prediction capabilities. For the sake of expediency, I did not treat this as much of a literature review, though would be interested in whether others have recommended literature on these topics.

Promoted to curated: I think this post is the best summary of a bunch of important ideas that have been floating around, related to a bunch of things that Ought is doing, and also to a bunch of Paul's AI Alignment agenda.

I generally like the structure of the post, and also appreciate the addition of a video. I do think that the post is still somewhat hard to read and is I think unnecessarily dry, which made me a bit hesitant to curate it, and is probably also the reason why it's at a relatively low karma score.

I am very excited about people implementing more projects in this space, and think this serves as a quite good overview over potential applications and considerations. Thanks a lot for writing it!

Thanks for the feedback! I was unsure about the structure; my main goals here was to set up a categorization system and have information explained, even if it wasn't particularly understandable. I'll mess around with other techniques in future posts.

This reminds me of boosted decision trees. In fact, boosting translates very well from aggregating decision trees to aggregating human judgment.

Thanks for the video! I had already skimmed this post when I noticed it, and then I watched it and reread the post. Perhaps my favourite thing about it was that it was slightly non-linear (skipping ahead to the diagram, non-linearity when covering sections).

Could you say a bit more about your worries with (scaling) prediction markets?

Do you have any thoughts about which experiments have the best expected information value per $?

I'm not too optimistic about traditional prediction markets, I have feelings similar to Zvi. I haven't seen prediction markets be well subsidized for even a few dozen useful variables; in prediction augmented evaluation systems they would have to be done for thousands+ variables. They seem like more overhead per variable then simply stating one's probability and moving on.

My next step is just messing around a lot with my own prediction application and seeing what seems to work. I plan to gradually invite people, but let them mostly do their own testing. At this point, I want to get an intuitive idea of what seems useful, similar to my experiences making other experimental applications. I'm really not sure what ideas I may come up with, with more experimentation.

That said, I am particularly excited about estimating expected values of things, but realize I may not be able to make all of these public, or may have to keep things very apolitical. I expect it to be really easy to anger people if estimates that are actually important are public.

https://www.lesswrong.com/posts/a4jRN9nbD79PAhWTB/prediction-markets-when-do-they-work

On estimating expected value, I'm reminded by some of Hanson's work where he suggests predicting later evaluation (recent example: http://www.overcomingbias.com/2018/11/how-to-fund-prestige-science.html). I think this is an interesting subcase of the evaluating subprocess. It also fits nicely with this post by PC

Good find. I didn't see that post (it came out a day after I published this, coincidentally). I'm surprised it came out so recently but imagine he probably had similar ideas, and likely wrote them down, much earlier. I definitely recommend it for more details on the science aspect.

From the post: "For each scientific paper, there is a (perhaps small) chance that it will be randomly chosen for evaluation in, say, 30 years. If it is chosen, then at that time many diverse science evaluation historians (SEH) will study the history of that paper and its influence on future science, and will rank it relative to its contemporaries. To choose this should-have-been prestige-rank, they will consider how important was its topic, how true and novel were its claims, how solid and novel were its arguments, how influential it actually was, and how influential it would have been had it received more attention.

....

Using these assets, markets can be created wherein anyone can trade in the prestige of a paper conditional on that paper being later evaluated. Yes traders have to wait a long time for a final payoff. But they can sell their assets to someone else in the meantime, and we do regularly trade 30 year bonds today. Some care will have to be taken to make sure the base asset that is bet is stable, but this seems quite feasible."

If you have other ideas for things to be evaluated / other uses, please post them below!

Kahneman has a consultancy now where they do noise audits, which are about making professional assessments more consistent. They mean things like loan officers, sales pricing, assorted risk evaluation.

He also mentions using ‘reasoned rules’, which are an even weighting of all of the variables. This can be done with simple algorithms, and provided the correct variables are chosen this meets or exceeds expert performance in most domains.

Interesting. Looks like a book is coming out too: https://www.thebookseller.com/news/william-collins-scoops-kahnemans-book-7-figure-pre-empt-752276

I'm interested in the predictors' incentives.

One problem with decision markets is that you only get paid for your information about an option if the decision is taken, which can incentivise you to overstate the case for an option (if you see its predicted benefit X, its true benefit is X+k and it would have to be at X+k+l to be chosen, if l < k, you will want to move the predicted benefit to X+k+l and make a k-l profit).

Maybe you avoid this if you pay for participation in PAES, but then you might risk people piling on to obvious judgments to get paid. Maybe you evaluate the counterfactual shift in confidence from someone making a judgment, and reward accordingly? But then it seems possible that the problems in the previous paragraph would appear again.

I'm happy to talk theoretically, though have the suspicion that there are a whole lot of different ways to approach this problem and experimentation really is the most tractable way to make progress on it.

That said, ideally, a prediction system would include ways of predicting the EVs of predictions and predictors, and people could get paid somewhat accordingly; in this world, high-EV predictions would be ones which may influence decisions counterfactually. You may be able to have a mix of judgments from situations that will never happen, and ones that are more precise but only applicable to ones that do.

I would be likewise suspicious that naive decision markets that use one or two techniques like that would be enough to really make a system robust, but could imagine those ideas being integrated with others for things that are useful.