Testing the Efficacy of Disagreement Resolution Techniques (and a Proposal for Testing Double Crux)

Ronny Fernandez

Introduction

I will describe a procedure for testing the efficacy of disagreement resolution techniques (DRTs) together with methodologies for inducing their use (induction methods). DRTs are structured games (in the loosest senses of the terms) that involve conversation, and are aimed at helping players who disagree about some topic either figure out the truth about that topic, or come to better understand each other's positions and the sources of their disagreement. An induction method is just any way of trying to get someone to use a DRT.

I am writing up a description of this procedure here because I plan to use it to test and find DRTs, and I would like to get feedback before I go out and do that. I originally came up with this procedure in order to test Double Crux, and I still plan to. I will describe the first step of that plan in the second half of this post. The first half of the post explains the general procedure and some frames I think might be useful for understanding the second half of the post.

I would also like to invite others to use the general procedure in order to test or find other DRTs. It seems fairly obvious to me now, but something something hindsight.

I would like any kind of feedback that you think might make the methodology I describe herein better. I am particularly interested in feedback on the specific procedure I plan to use for testing Double Crux, but feedback on the general procedure would also be great. Any feedback or advice on the statistical analysis of the results of my test procedure for Double Crux would also be much appreciated.

Readers who are primarily interested in giving feedback on my statistical approach are encouraged to skip to the "Procedure" subsection in the second half of the post, and then skip to the "Statistical Analysis and Preregistration" subsection.

General Procedure

Step 0:

Gather participants and filter them for desired characteristics. Ignorance of the DRT to be tested at the onset of the study should be verified for all participants. Researchers may also want to filter participants for having some familiarity with the concept of a probability assignment, or having completed some calibration training.

Inform accepted participants of the nature of the study: that they will be scored according to a proper scoring rule; that they may have to complete a training module, a class, or a workshop; that they may have to discuss a controversial question with a stranger, etc.

Step 1:

Have participants assign credence distributions over the possible answers to multiple choice questions. Such questions should have definite correct answers, and it should be hard or impossible for participants to look up the answers during the study.

Step 2:

Pair participants according to disagreement measured by total variation distance.

Step 3:

Randomly assign participants to one of three groups.

Control Group 1: Members are given no special instructions.

Control Group 2: Members are given basic advice on how to have useful conversations with people who disagree with them.

Treatment Group: Members will be induced to use the DRT using the induction method to be tested.

Step 4:

Inform members of all three groups who they were paired with and what question the pair disagreed about. Members of the first two groups are instructed to have a conversation with their partner with the aim of becoming more confident about what the right answer is.

For the third group, the induction method is applied either before or during their conversations depending on its design. Members of the third group are instructed to use the DRT being tested in order to figure out the right answer to the question they were assigned.

Have all three groups conduct their conversations.

Step 5:

Have participants assign a new credence distribution over the possible answers to the multiple choice questions they discussed in step 4.

Step 6:

Pay participants according to a proper scoring rule scored on the credence distributions they assigned in step 5.

Measurements

There are two kinds of data that I think we should try to collect from this kind of procedure. The first is the amount of evidence or information gained by each participant through their conversation. For instance, suppose a participant started out assigning a credence of $0.2$ to the correct answer for their assigned question, and then assigned the correct answer a credence of $0.5$ after their conversation. This means that they started out assigning it odds of $1 : 4$ , and ended up assigning it odds of $1 : 1$ as a result of their conversation. This suggests that they treated the conversation as an observation with a likelihood ratio of $4 : 1$ , which amounts to gaining $\log_{2} 4 = 2$ bits of evidence from the conversation. If a participant updates away from the correct answer, the likelihood ratio will be below one, and so the log of the likelihood ratio will be negative.
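The evidence measure described above can be sketched in a few lines of Python. This is only an illustration under my reading of the paragraph: the function names `odds` and `bits_of_evidence` are made up for this example, and credences of exactly 0 or 1 are rejected because their odds are undefined or infinite.

```python
import math


def odds(p: float) -> float:
    """Convert a credence strictly between 0 and 1 into odds in favor."""
    if not 0.0 < p < 1.0:
        raise ValueError("credence must be strictly between 0 and 1")
    return p / (1.0 - p)


def bits_of_evidence(prior: float, posterior: float) -> float:
    """Log base 2 of the implied likelihood ratio (posterior odds / prior odds).

    Both arguments are credences assigned to the correct answer.
    """
    return math.log2(odds(posterior) / odds(prior))


# The worked example from the text: 0.2 before, 0.5 after.
print(bits_of_evidence(0.2, 0.5))  # about 2 bits
# Updating away from the correct answer gives a negative number.
print(bits_of_evidence(0.5, 0.2))
```

A real analysis would need some policy for participants who assign a credence of exactly 0 or 1, for example clipping credences to a small interval; that choice is not specified in the post.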

The second kind of data I would like to collect is the degree to which a pair's beliefs converged after the conversation regardless of the truth of the answer on which they converged. I will measure this by taking the total variation distance of their credence distributions before their conversation, and subtracting from that number the total variation distance between the credence distributions they assigned after their conversation.
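As a rough sketch of the convergence measure, assuming each credence distribution is a list of probabilities over the same ordered answer options, total variation distance is half the sum of absolute differences. The function names and the example numbers are invented for illustration.

```python
def total_variation_distance(p, q):
    """Half the sum of absolute differences between two credence distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same answer options")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def convergence(before_a, before_b, after_a, after_b):
    """Disagreement before the conversation minus disagreement after it.

    Positive values mean the pair moved closer together.
    """
    return (total_variation_distance(before_a, before_b)
            - total_variation_distance(after_a, after_b))


# Made-up example: a pair starts far apart and ends closer together.
print(convergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7],
                  [0.5, 0.3, 0.2], [0.3, 0.3, 0.4]))  # about 0.4
```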

Constraints on Multiple Choice Questions

The multiple choice questions used should be questions for which participants are unlikely to already know the right answer. They should also be questions for which it is reasonable to expect that people might make incremental progress on figuring out the right answer through conversation, e.g. "will President Trump be reelected in 2020" would be fine, but "what is Joe Smith's sister's name" would not.

The questions should also have the property that if you find out the right answer, it is not completely trivial to convince others of the right answer. Counterintuitive physics puzzles satisfy this criterion, but most Raven's Progressive Matrices questions do not.

It might be good for there to be a set of neutral questions, such as counterintuitive physics or logic puzzles, or questions about how long it took for spider silk to evolve, as well as a set of controversial or politically sensitive questions, such as who will be elected or nominated for some important political office, or whether the GDP of a country went up or down after a particular policy was implemented.

DRTs Are Not Individuated by their Induction Methods

The original example of a DRT that I had in mind was Double Crux. Having anchored on Double Crux, the alternative possible DRT-induction method pairs I imagined I might later test shared Double Crux's canonical pedagogical structure: you are taught the technique by someone who knows it, you practice it some, and then you use it with other people. The space of possible DRT-induction method pairs is much larger than this would suggest.

DRTs are not always transmitted to someone before the first time they are used. Some methods of induction allow the first time someone uses a DRT to be before, or simultaneous with, the first time they learn how it works.

One example of an induction method that allows participants to use a DRT before they learn how the DRT works is having a trained facilitator facilitate their conversation according to the norms of the DRT.

Another might be using a program that guides participants through a discussion. Maybe it keeps track of their Cruxes; keeps track of the various topic branches and sub conversations; the different arguments that have been made and what their premises are; keeps track of users' credences and asks them to update them explicitly when a new argument is inputted, etc. If a DRT were codified in such a program, participants might not have to be preemptively instructed in how to use the DRT. The program might be sufficiently intuitive and give participants enough instruction on its own. This would have the added bonus that they could later use that program without having a trained facilitator around.

I could imagine designing a program like this for Double Crux, but also for other DRTs that are not Double Crux. Of course, you could also design alternative DRTs and transmit them in the way that Double Crux is normally transmitted. This means that whether an accompanying induction method is applied before or during the first time participants use a DRT does not depend much on the nature of the DRT itself.

I note that the general procedure does not test the efficacy of DRTs in isolation; it tests them together with a particular method of induction. If a DRT totally fails to get any interesting result when we try to induce it with one method, that does not mean that the DRT itself is inefficacious. It might be that the induction method used failed to induce the DRT's use in the participants, or that it induced its use but failed to transmit the most important fragments of the DRT.

Specific Plans to Test Double Crux Induced by an Online Training Module

Procedure

Step 0:

I will collect participants on positly.com. I will filter them for not already knowing what Double Crux is. I may also filter them for having completed a bachelor's degree depending on how difficult the final set of questions and the concepts involved in the training module turn out to be. Positly participants are already filtered by Positly for being decent study participants in general.

Step 1:

I will have them take a questionnaire that contains several multiple choice questions. Some of these will be predictions about politically sensitive topics. Some of these will be physics or logic puzzles. Some will be numerical estimation questions like "how many miles of road are there in Africa". I will ask them to assign credence distributions over the multiple choices for each question.

I will also instruct participants in how to think about assigning credences, and ask participants to assign other credence distributions in order to check their calibration. Those who do not demonstrate calibration above some minimal standard will not be asked to participate in the larger study.

For those who do demonstrate some calibration, I will then ask them if they would like to participate in a larger study. I will first explain what sorts of things they can expect to do as part of the study and what sorts of compensation they can expect to earn. I will also explain that any non-identifying data they give me may later be made publicly available.

If they say yes, I will collect their emails and record their credence assignments on a spreadsheet.

Step 2:

I will pair participants according to disagreement, preferring larger disagreements to smaller ones. Again, this will be measured by total variation distance.
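The post does not say which matching algorithm will be used, so the following is only one possible reading of "preferring larger disagreements": a greedy pass that repeatedly takes the largest remaining disagreement between two unpaired participants. The function name `greedy_pairs`, the data layout, and the participant names are all hypothetical.

```python
from itertools import combinations


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two credence distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def greedy_pairs(credences):
    """Greedily pair participants on the question where they disagree most.

    `credences` maps participant id -> {question id: credence distribution}.
    Returns a list of (participant_a, participant_b, question, distance).
    """
    candidates = []
    for a, b in combinations(sorted(credences), 2):
        shared_questions = credences[a].keys() & credences[b].keys()
        for question in sorted(shared_questions):
            distance = total_variation_distance(credences[a][question],
                                                credences[b][question])
            candidates.append((distance, a, b, question))
    # Largest disagreements first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    paired, result = set(), []
    for distance, a, b, question in candidates:
        if a not in paired and b not in paired:
            paired.update((a, b))
            result.append((a, b, question, distance))
    return result


# Made-up credences over a single two-option question.
example = {
    "ann": {"q1": [0.9, 0.1]},
    "bob": {"q1": [0.1, 0.9]},
    "cat": {"q1": [0.6, 0.4]},
    "dan": {"q1": [0.5, 0.5]},
}
for pair in greedy_pairs(example):
    print(pair)
```

A greedy pass does not maximize total disagreement across all pairs, and it can leave the last pairs with very little disagreement, as the example shows. A minimum-distance cutoff or a maximum-weight matching would be alternatives.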

Step 3:

I will then randomly assign pairs to one of three groups:

Control Group 1: Members of this group will not be given any specific advice.

Control Group 2: Members of this group will be given general advice on how to have useful conversations with people they disagree with. I will tell them that they should try to see each other as cooperating to solve a puzzle, that they should see themselves as partners on the same team, that they should try really hard to understand why the other person had such different beliefs from them, and that they should try to ask each other a lot of questions.

Treatment Group: Members of this group will be asked to complete an online training module that is intended to teach them some critical fragment of the Double Crux technique, and asked to Double Crux with their partners in step 4. (More on the design of this module later.)

Step 4:

Participants will be asked to communicate with the other member of their pair via either a video chat service, or a text chat service. (More on this later.) They will be asked to speak to each other for as long as they would like, but for no less than 30 minutes.

They will have already been told that they will be paid according to a proper scoring rule scored on the credence distribution they assign in step 5. Members of the control groups will be instructed to use the conversation as an opportunity to find out more about the correct answer to the question. Members of the treatment group will be instructed to Double Crux with their partner.

Step 5:

I will then have participants assign a new credence distribution over the multiple choice answers for the question they were assigned to discuss, and collect their assignments.

Step 6:

I may offer participants extra compensation to send records of their conversation from step 4.

I may then recruit people who can credibly claim to assess Double Crux performance, and have them rate logs or recordings of the conversations for Double Cruxiness. They will rate these logs on a scale from 1 to 10. They should give a 1 to any conversation they consider to not be an instance of Double Crux at all, a 5 to a conversation that seems average Double Cruxy for conversations between people who disagree and know about Double Crux, and a 10 to any conversation that seems like a totally paradigmatic instance of Double Crux.

If I decide to do this, I will try to design the module without the use of the words "double" or "crux" so as to minimize the chance that participants use the term in their logs, as that would certainly tip off the raters.

Step 7:

Reward participants according to a Brier score scored on the distributions they assigned in step 5.
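For concreteness, here is a sketch of a multi-option Brier score and one hypothetical way to turn it into a payment. The linear payout mapping and the `max_reward` parameter are my assumptions, not something the post specifies; a positive linear rescaling of a proper scoring rule is still proper, which is why a mapping of this shape is a natural candidate.

```python
def brier_score(credences, correct_index):
    """Sum of squared errors against the true outcome (0 is best, 2 is worst)."""
    return sum((p - (1.0 if i == correct_index else 0.0)) ** 2
               for i, p in enumerate(credences))


def reward(credences, correct_index, max_reward):
    """Hypothetical payout: linear in the Brier score, from max_reward down to 0."""
    return max_reward * (1.0 - brier_score(credences, correct_index) / 2.0)


# Made-up example: credence 0.5 on the correct (first) option.
print(brier_score([0.5, 0.3, 0.2], 0))   # about 0.38
print(reward([0.5, 0.3, 0.2], 0, 10.0))  # about 8.1
```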

Measurements

I plan to measure the same quantities mentioned above. I will measure the amount of information gained by each participant from the conversation, and I will measure the degree to which each pair converged.

Text Chat or Video Chat?

I am genuinely unsure about what medium I should ask people to conduct their conversations through. Video chat is closer to in person conversation, and I personally find the extra bandwidth useful, especially when trying to cooperatively figure something out with someone. However, it would be easier to have participants send text logs than video recordings. Text also offers the advantage of providing a visible log of the conversation for the participants to look over, which is basically the same as giving participants more working memory. When people have conversations in person, they can use pen and paper, which has a similar effect.

I would be interested to hear other people's thoughts on this.

Why Have People Rate the Double Cruxiness of the Conversations?

I would like to take logs of the conversations and have people rate them for Double Cruxiness so that if I get a negative result, it is more informative. Without step 6, if it turned out that participants in the treatment group got about as much information out of their conversations as members of either control group, we might interpret this as being primarily evidence that the module used in the experiment does not teach Double Crux.

But if we know that people who understand Double Crux rate the participants in the treatment group as using Double Crux more than those in the control groups, then we can rule out that explanation.

If the raters rate the treatment group's Double Cruxiness above the level of the control groups, then that suggests that the module teaches Double Crux. As such, a negative result would be some evidence against the efficacy of the Double Crux technique itself.

If the raters rate the treatment group's Double Cruxiness near the level of the control groups, this means that the module does not teach Double Crux, or at least it does not teach it any better than giving some basic advice about how to have useful disagreements.

In that case, the treatment group scoring about as well as control group 2 would not be much evidence against the efficacy of Double Crux itself, although it would be further evidence that the module does not successfully induce the technique.

A positive result in this case would suggest that those people I consider relative authorities on Double Crux are either not reliable raters of the degree to which the technique is being used, or are not sensitive to the important aspects of its use. I think this would be an interesting thing to learn and investigate further. It might help with the design of future DRTs or training modules.

I might ultimately leave this step out of a pilot study. I would be more likely to include step 6 in a later follow up to the pilot.

If I do decide to include step 6, I will make sure that nobody, including myself, knows how well the treatment group did compared to the control groups until after the rating is done.

Designing the Module

Although I use Double Crux often, I do not think of myself as a qualified instructor of the technique. I would like the module to be designed primarily by people who can credibly claim to have successfully taught Double Crux multiple times. Preferably it would be designed by people who have also put a lot of thought into figuring out what parts of the technique are most important, and how best to transmit that. I will try to recruit or contract some such folks to help me design the module.

If they turn out to be too expensive for my current budget, then I will just try my best at designing the module, plagiarizing from the people I would have hired as much as possible, and asking for as much feedback as I can handle along the way.

I expect that the module should include examples, diagrams, and questions that check for understanding. I would prefer for it to be no longer than 30 minutes, but it might have to be.

Regardless of who ends up designing it, I will be asking for feedback on the module from other people.

Pilot Study

My current plan is to pay participants 20 USD just for completing the study, since it is rather intensive. (It might take up to an hour and a half to complete. I will start out offering a lower price and see what happens, but I would be happy to go as high as 20 USD.) I will also offer up to 10 USD in monetary rewards for updating towards the right answer.

I may or may not offer an additional reward for sending a log if I decide to include step 6 in the pilot study. This would be more likely if I ended up asking participants to use video chat instead of text, since recording video is harder than copying and pasting.

I would like to have at least 30 participants in each group for the pilot study.

Statistical Analysis and Preregistration

I plan to use standard significance tests, specifically a two group Welch test and a three group ANOVA test. I will also use the Bayesian method, BEST, described here. In this section, I will use a standard difference of means test that assumes equal variance, since it is easier to explain, but I will use a Welch test when actually analyzing the data. It gives similar results to the Welch test for the sample sizes and standard deviations I am working with. You can verify their similarity yourself using this online calculator.
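As a reminder of what the two group Welch test computes, here is a minimal standard-library sketch of the Welch t statistic and the Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the sample data are made up for illustration; the p-value lookup against the t distribution is left out.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Estimated variance of each sample mean (sample variance / n).
    v_a = variance(sample_a) / n_a
    v_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(v_a + v_b)
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df


# Made-up bits-of-evidence samples for two groups.
t, df = welch_t([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0])
print(t, df)
```

In practice one would probably call `scipy.stats.ttest_ind(a, b, equal_var=False)`, which performs Welch's test and also returns a p-value.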

There are three distributions we are interested in: the distributions of our measurements for the control group 1 population, the control group 2 population, and the treatment group population. I will call their respective means $μ_{1}$ , $μ_{2}$ , and $μ_{t}$ . The analysis is the same for both measurement types, except that the sample for the convergence measurement will have half as many data points. I will focus on the evidence measure here, since it is what I am more interested in, especially for the pilot study.

There are two main hypotheses we want to test, and two corresponding null hypotheses:

$H_{0} : μ_{1} - μ_{t} \leq 0$

$H_{1} : μ_{1} - μ_{t} > 0$

And also:

$H_{0} : μ_{2} - μ_{t} \leq 0$

$H_{1} : μ_{2} - μ_{t} > 0$

It would also be informative to compare $μ_{1}$ and $μ_{2}$ if $μ_{2}$ and $μ_{t}$ turn out to not be significantly different. This unfortunately requires using a three group ANOVA. The three group ANOVA is more sensitive to sample size and standard deviation than the two group Welch test, and also somewhat weaker. You can verify that it is weaker using this online calculator.

I will use ANOVA and appropriate post hoc tests to compare all three groups, but I will also use the two group Welch test to compare just control group 2 to the treatment group. This is normally frowned upon, but I have two reasons to think that it should not be in this case.

The first is that in general two significant test results are better than one, even if their combined p-value is higher. I will report the chance of type 1 error for each test individually, and also report the chance that there was any type 1 error at any point in the study. I am now preregistering that I will only compare control group 2 to the treatment group, so there is no chance of me using the two group Welch test to compare both control groups to the treatment group and only reporting whichever result was better.

The second is that if the treatment group does better than control group 2, I think we can be fairly sure that it does better than control group 1 as well. The point of having control group 1 is that if control group 2 and the treatment group do equally well, without control group 1, this is a fairly boring result. However, if they also do better than control group 1, this suggests that the module works only because coaching of any kind works. This would make a negative result on the comparison between the control group 2 and the treatment group more interesting.

In any case, the means of two distributions being equal is equivalent to the means of their sample mean distributions being equal: $μ_{\bar{x}_{2}} - μ_{\bar{x}_{t}} = 0$ . The sample means $\bar{x}_{2}$ and $\bar{x}_{t}$ are approximately normally distributed by the central limit theorem, and so their difference is also approximately normally distributed. To estimate the standard deviation of the difference distribution we use the standard formula:

$σ_{\bar{x}_{2} - \bar{x}_{t}} = \sqrt{\frac{s_{2}^{2}}{n_{2}} + \frac{s_{t}^{2}}{n_{t}}}$

Estimating the sample standard deviations to be about 4 bits of information (higher than I expect them to be) each, with a sample size of 30 in each group, that gives us the following critical values for each significance level:

For $α = .2$ we should reject the null hypothesis at approximately $\bar{x}_{2} - \bar{x}_{t} = 0.87$ .

For $α = .15$ we should reject at approximately $\bar{x}_{2} - \bar{x}_{t} = 1.07$ .

For $α = .1$ we should reject at approximately $\bar{x}_{2} - \bar{x}_{t} = 1.33$ .

If instead we get sample standard deviations of about 2 bits, then we get:

For $α = .2$ we should reject the null hypothesis at approximately $\bar{x}_{2} - \bar{x}_{t} = 0.43$ .

For $α = .1$ we should reject at approximately $\bar{x}_{2} - \bar{x}_{t} = 0.66$ .

For $α = .05$ we should reject at approximately $\bar{x}_{2} - \bar{x}_{t} = 0.85$ .
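Thresholds of this kind can be approximated as a one-sided normal critical value times the standard error of the difference. The sketch below assumes a normal approximation rather than a t distribution, so it may differ from t-based values in the last digit; the function name `critical_difference` is made up for this example.

```python
import math
from statistics import NormalDist


def critical_difference(alpha, s_a, s_b, n_a, n_b):
    """Smallest difference in sample means that is significant at level alpha.

    One-sided test, normal approximation, sample standard deviations s_a and s_b.
    """
    standard_error = math.sqrt(s_a ** 2 / n_a + s_b ** 2 / n_b)
    return NormalDist().inv_cdf(1.0 - alpha) * standard_error


# Tabulate thresholds for the two standard deviation scenarios, 30 per group.
for s in (4.0, 2.0):
    for alpha in (0.2, 0.15, 0.1, 0.05):
        print(s, alpha, round(critical_difference(alpha, s, s, 30, 30), 2))
```

Changing the group sizes or standard deviations in the loop shows how quickly the required difference shrinks as the samples get larger or less noisy.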

I am optimistic that I can get a sample mean difference greater than $0.43$ , but that would not be significant at the liberal $α = .2$ level unless the sample standard deviations are $2$ or less. I think it is possible the pilot study will get a sample difference of greater than $0.87$ . I am less optimistic about greater than $1.07$ , and I think better than $1.33$ is unlikely.

These are my best guesses. They are based on nothing but my hunches.

For my estimates concerning the sample standard deviations I am likely to get, Beth Barnes let me look at the data from OpenAI's debate studies. They use a similar method for getting credence assignments and looking at updates. Although the studies looked at participants' credence assignments for their favorite answers, not the correct answers, I still think the standard deviations in their samples are a reasonable estimate for the standard deviations I can expect.

I wrote a python script to calculate the sample standard deviations for the log likelihood each participant updated with. The sample standard deviations were: [1.054313572588262, 1.1272448020726757, 1.9301510041775782]

For the third of these studies some participants said that their favorite answer was less likely than random chance would suggest. I conclude from this that these participants were filtered for calibration less than those in the first two studies. I am optimistic that I will get sample standard deviations closer to the first two values than to the third, which would make my results far more significant.

As stated above, I also plan to use BEST to do a Bayesian analysis of the results. I will give a few example priors I consider reasonable and their corresponding posteriors. Hopefully, I will also provide a program in R or Python that allows one to input any prior and get a posterior as output.

If there is a better way to do this than the Welch test or ANOVA, Bayesian or not, preferably while making fewer assumptions about the underlying distributions, I am all ears. I would also be interested in hearing about Bayesian alternatives to the BEST approach.

I might be missing something important. The Welch test and three group ANOVA are just the standard approaches I found after googling and asking around for a bit. The BEST method was the first thing I found when I googled "Bayesian comparison of means".

In any case, I plan to make all data (aside from the conversation logs and personally identifying data) publicly available. I would encourage folks to analyze that data however they like and let me know what they find.

What Happens after You Get Your Result?

If I get a positive result, I will look for further funding to do a larger study. If I get a positive result on that larger study, I will have a module that has been empirically shown to help pairs of people who disagree get more information out of talking about their disagreement. I will distribute any such module for free and make all data publicly available. I will of course also try to look for other DRT-induction method pairs with larger effect sizes.

If I get a negative result on the pilot study, or on a larger follow up study, I will still make the data publicly available, as well as the module, and let people make their own judgments on how to update on that result. I may also release a program that takes a prior over the parameter space for the relevant population distributions as input and returns the appropriate posterior.

I will then continue to systematically test DRTs and methods for inducing their use.

This is all part of my grander plot to rigorously test, develop, and disseminate systematic methods for improving the art of human rationality. I am not going to give up on that because of one negative result.

With that, I would like to thank Daniel Filan, Vaniver (Matthew Graves), Beth Barnes, Oliver Habryka, Spencer Greenberg, Luke Raskopf, Katja Grace, and Eliezer Yudkowsky, for their valuable feedback and/or their much appreciated encouragement.

I would also like to give a special thanks to Beth Barnes for letting me look at OpenAI's debate data to get an estimate for the sample standard deviations I can expect.

And a special thanks to Spencer Greenberg for creating Positly and Guided Track, and walking me through how to use them.

Bird Concept

Problem

"The space of possible DRT-induction method pairs is much larger than this would suggest."

I think the space of things you could try is quite large indeed, both when it comes to DRT-induction as well as what you choose to include in the control condition. I can also imagine this being a major point of contention/annoyance post-study (“This is nice, but for me to really change my mind I’d want you to have used this induction/control”).

Solution

Before the experiment, we have prediction markets/forecasting tournaments on the results of the pre-registered statistical tests, given a particular induction x control combination.

When the markets close, your experiment runs as planned -- but you only test the induction x control combinations that had the most disagreement/variance in their estimates.

Prediction market participants are then paid according to a proper scoring rule based on the outcome of the experiment.

So overall, even if you just test 1-3 experimental designs, we could have these markets on 10-20 designs, and get priors for all of them!

This is also a more transparent way of picking conditions to run for the experiment.

___

I've messaged you privately to discuss this further and organise eventual funding and operational support.

megasilverfist

"it should be hard or impossible for participants to look up the answers during the study"

I am unclear how you are going to enforce this in practice given that the study will be online and that you're expecting people to spend at least 30 minutes on conversation, which implies a large enough reward that it is worth hunting down answers that can't be found on the front page of Google. The only thing that comes to my mind is asking them to make predictions about the future. Relatedly, what will your policy be on participants looking up and/or sharing references relevant to steps in their reasoning? E.g. if one of the questions is about Trump being reelected, will participants be allowed to visit 538 during step 1 and/or link their partner to it in step 4?

I didn't have enough time to properly evaluate the statistics portion, but at first glance it looks ok. Nothing seems wrong with the significance tests beyond them being significance tests. IIRC BEST addresses my main issues with them, particularly being able to indicate the absence of an effect in a way that isn't the case for mere non-significance, but I haven't used it in forever and don't have time to brush up on it at the moment.