Time to See If We Can Apply Anything We Have Learned

MichaelVassar

18th Jun 2009 (1 min read, 25 comments)

Community, Practical

Personal Blog

It seems to me that this blog has just reached its first real crisis.

Three people are announcing three apparently opposed beliefs with substantial real expected consequences, and yet no one has spoken, or it seems to me even implied, the key slogan: "LET'S USE SCIENCE!" Nor has anyone, as a hubristic Bayesian wannabe who does not invoke Bayes as an idol to swear by, said instead "LET'S USE HUMANE REFLECTIVE DECISION THEORY, THE QUANTITATIVELY UNKNOWN BUT QUALITATIVELY INTUITED POWER DEEPER THAN SCIENCE FROM WHICH IT STEMS AND TO WHICH OUR COMMUNITY IS DEVOTED".

If RDS were applied to our current situation, people would be analyzing Yvain's, Davis' and Eby's proposals, working out exactly what their implications are, and trying to propose, in the name of SCIENCE, hypotheses which will distinguish between them, and in the name of BAYES, confidence estimates of their analyses and of the quality with which the denotations of their words have cleaved reality at the joints, enabling an odds ratio of updating to be extracted from a single data point. People would be working out what features of which of the models used by Yvain, Davis and Eby constitute evidence against what other features. They would be trying to evaluate non-verbally, through subjectively opaque but known-to-be-informative processes vulnerable to verbal overshadowing, what relative odds to place on those different features of the models. Finally, they would be examining the expected costs entailed by the experiments being proposed and selecting for performance those experiments which promise to provide the most information for the least cost. The cost estimate would include both the effort required to perform the experiments, probably best assessed with an outside view in most cases like these, and the dangers to the minds of the participants from possible adverse outcomes, taking into account, as well as possible, the structural uncertainty of the models.
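One way to make "most information for the least cost" concrete is to score each candidate experiment by its expected reduction in uncertainty over the competing models, divided by its cost. The sketch below is only an illustration under stated assumptions: the three hypotheses, the outcome likelihoods, the costs, and the names `expected_information_gain` and `experiments` are all invented for the example and do not come from any of the three proposals.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def expected_information_gain(prior, outcome_likelihoods):
    """Expected drop in entropy over hypotheses after seeing one outcome.

    outcome_likelihoods[o][h] is P(outcome o | hypothesis h).
    """
    gain = entropy(prior)
    for lik in outcome_likelihoods:
        p_outcome = sum(p * l for p, l in zip(prior, lik))
        if p_outcome == 0:
            continue  # an outcome that can never happen contributes nothing
        post = [p * l / p_outcome for p, l in zip(prior, lik)]
        gain -= p_outcome * entropy(post)
    return gain

# Hypothetical: equal credence in three competing models.
prior = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical experiments; costs are in arbitrary effort units.
experiments = {
    "cheap_weak": {
        "cost": 1.0,
        "outcomes": [[0.6, 0.4, 0.4], [0.4, 0.6, 0.6]],
    },
    "costly_decisive": {
        "cost": 100.0,
        "outcomes": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    },
}

def gain_per_cost(name):
    exp = experiments[name]
    return expected_information_gain(prior, exp["outcomes"]) / exp["cost"]

# Pick the experiment with the most expected information per unit cost.
best = max(experiments, key=gain_per_cost)
```

The cost term here is a single number; folding in risks to participants or structural uncertainty about the models would need a richer cost estimate than this sketch attempts.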

I sincerely hope to see some of that in the comments section soon, either under this post or the "Applied Picoeconomics" post.

[-]Vladimir_Nesov15y150

Please, summarize what you are talking about. Not everyone is following all the threads, or intuits the same framing of the problem as you do.

[-]derekz15y50

Yes, I'm afraid this post is kind of impenetrable, although cousin_it's contribution helped. What is "RDS"?

Also, continually saying "People should..." do this and that and the other thing might be received better if you (meaning Michael, not Vladimir) start us off by doing a little of the desired analysis yourself.

[-]Richard_Kennaway15y10

What is "RDS"?

From context, Reflective Decision Theory, and from googling that, decision theory for self-modifying systems, a central problem for any theory of intelligence, human or artificial. However, Google only turns up calls for such a thing to exist, not any actual theory. Is Michael Vassar calling for us to use these examples as a concrete case study from which to work towards an RDS? Or simply to bring scientific method to bear on these examples?

If cousin_it has accurately located the material that Michael was referring to, I'll add my recent citing of PCT/MOL as a fourth contender.

[-]MichaelVassar15y10

No formal theory exists, but we always use an implicit reflective decision theory when, for instance, allocating attention and effort to decision making.

[-][anonymous]15y00

I'd guess "RDS" is Michael's typonym (ouch!) for "Reflective decision theory".

[-]cousin_it15y140

I'll try to summarize the positions by taking (IMO) the most representative paragraph from each one.

1) Yvain:

The technique I decided to test was to write out an oath detailing exactly what I wanted to do, list in nauseating detail all of the conditions under which I could or could not be released from this oath, and then bind myself to it, with the knowledge that if I succeeded I would have a great method of self-improvement and if I failed I would be dooming myself to a life of laziness forever.

2) Z_M_Davis:

Rather than setting explicit measurable goals, I try to continually remind myself that every minute and every dime is precious, and every minute and every dime that you don't spend doing the best thing you can possibly be doing is a mark of sin upon your soul.

3) pjeby:

If you explicitly contemplate all the things that might come up, and decide what you'll do in each case, then you are mentally linking your "interest" to those contexts, along with a preferred behavior... thus reducing the willpower load required to make those decisions when the time comes, and giving that "interest" a larger say in the bargaining that occurs at that point in time.

Now we need to understand what implications those positions have and where they contradict.

[-]pjeby15y50

Now we need to understand what implications those positions have and where they contradict.

They don't contradict, they're simply methods with different tradeoffs for different people. My statements are aimed at people who don't enjoy being under pressure; Davis and Yvain's methods will work well for people who thrive under pressure. Yvain's method has some crossover with mine, in that I predict he will be far less successful with an oath that does not involve the contemplation process. That is, I attribute the majority of his success to the pre-oath contemplation, and very little to the oath itself, or the penalties attached to it. (And I consider the attachment of penalties to be dangerous as well as unnecessary.)

[-]Drahflow15y20

A next step in setting up a decent experiment would be to randomly assign people to 4 groups (techniques 1, 2, 3, plus a control). To get findings most applicable to rational people, we can just as well select within LW.

Another step: Select a common task whose utility is perceived as high by most people on LW, yet which is seldom performed.

Proposal: During two weeks, run for (at least) 10 minutes on three out of every four consecutive days.

Ideally, the experiment would then continue with another task, but with group members switched. This improves the experiment by controlling for differences in motivation between participants. Trying a different task also gives us a chance to see whether there is a general method or whether the method must be chosen for the specific task.
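The assignment and switching steps above could be sketched as follows. This is a minimal illustration, not part of the proposal: the condition labels, the participant names, and the functions `assign_groups` and `rotate` are hypothetical, and the fixed seed is there only to make the shuffle reproducible.

```python
import random

def assign_groups(participants, conditions, seed=0):
    """Shuffle participants and deal them round-robin into the conditions."""
    rng = random.Random(seed)  # seeded so the assignment can be audited
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, person in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(person)
    return groups

def rotate(groups, conditions):
    """For the second task, move each group to the next condition in the list."""
    rotated = {}
    for i, c in enumerate(conditions):
        rotated[conditions[(i + 1) % len(conditions)]] = groups[c]
    return rotated

# Hypothetical labels for techniques 1, 2, 3 and the control group.
conditions = ["oath", "reminder", "contemplation", "control"]
participants = [f"volunteer_{n}" for n in range(12)]

first_task = assign_groups(participants, conditions)
second_task = rotate(first_task, conditions)
```

Rotating whole groups keeps the design simple; a fuller crossover design would run every group through every condition.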

If anybody is willing to participate in this experiment, I am willing to coordinate. If somebody else is willing (and more qualified) to coordinate, I hereby enlist as participant.

Qualification: I am a computer scientist, I have one contact who is a psychology researcher, and I feel competent in statistics. Then again, I lost $20 (of $100) over 3 months in prediction markets.

[-]CannibalSmith15y60

The fact that I'd be participating in an experiment for SCIENCE! would motivate me overwhelmingly more than any of those techniques.

[-]cousin_it15y40

No, no. You skipped a step. You didn't actually think hard about the implications of the three positions or work out any stark contradictions between them. If you'd done that, you'd have thought up multiple small focused experiments to resolve each individual area of contradiction. Instead you hastily propose one big complex setup that looks more like a contest between three self-help techniques. Whatever the outcome, it won't bring us any closer to the correct constructive theory of human motivation. Yvain has eloquently described the same problem in the thread nearby.

Sorry if this sounded harsh.

[-]Drahflow15y20

No, but a setup which does not try to understand more deeply which parts of which theory contribute to its success still gives pretty useful results about which approach has the highest expected utility.

[-]Drahflow15y00

So, are we going to get this experiment running or not?

Do you have a better proposal?

[-]Mike Bishop15y00

I would also join, but we should take Cousin_It's suggestions. And we should send our ideas to some psychologists for feedback.

[-]Scott Alexander15y60

My first response to this was "Wait, I have a model?"

Right now I don't think we're even at a point where it's valuable to think of three different "theories" or what they imply. We have three different techniques, all of which are kind of supported by picoeconomics but also kind of hand-wave-y. And ZM's objection to my post seems to be more philosophical than a simple "your method won't work" (and I still don't entirely understand Pjeby).

Once we have some more discussion, if it becomes clear that we actually have three different but comparable willpower techniques and we really want to know which will work best, then we can start wishing we could test them. I'm doubtful we can actually do so, since I doubt we could get more than about 30 volunteers, and 30 divided by 3 groups does not a legitimate sample size make for a complicated psych experiment. But if you have some ideas, I'd be happy to help.

[-]MichaelVassar15y30

The sample size issue etc. is why I talk about Bayes. You get important info from single data points all the time in life. There's just a fetish against doing so in science due to bad epistemology trying and failing to counter other bad epistemology.

You certainly derived your belief that your procedure would work from a theory. You hadn't actually even seen it work, so nothing but a theoretical basis could explain your attempt.

[-]Scott Alexander15y70

I don't think it's second-order good epistemology trying and succeeding to counter bad epistemology.

Let's say we run a study with 30 people, and we conclude ZM's method is the best, with p = .55 (sorry, I don't think in Bayesian when I have my psychology experimentation cap on), which is realistic for that kind of sample and the variability we can expect. Now what?

We could come up with some kind of hokey prior, like that there's a 33% chance each of our techniques is best, then apply that and end up with maybe a 38% chance ZM's is best and a 31% chance mine and Pjeby's are best (no, I didn't actually do the math there). But first of all, that prior is hokey. Pjeby's a professional anti-procrastination expert, and we're giving him the same prior as me and Z.M. Davis? Second of all, we still don't really know what "best" means, and it's entirely possible different methods are best for different people in complex ways. Third, I don't trust anyone including myself to know what to do with a 7% chance. I like my method better; should I give that up just because a very small study ended up shifting the probabilities 7% toward ZM? Fourth of all, we still wouldn't know how to apply this to picoeconomics as a theory: using any technique will increase success by placebo effect alone, we have several techniques that all use picoeconomics to different degrees, and we would have to handwave new numbers into existence to calculate things and probably end up with something like a .1% or .2% shift in probabilities.
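Since the math above was explicitly not done, here is a hedged sketch of what such an update looks like. The likelihood values are invented so that a uniform prior reproduces the 38/31/31 split mentioned above, and the "expert" prior is equally made up; only the ratios between likelihoods matter, and the function name `posterior` is an assumption of this example.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of mutually exclusive hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

# Hypothetical P(study result | this technique is best); invented numbers.
likelihood = {"Yvain": 0.31, "Z_M_Davis": 0.38, "pjeby": 0.31}

# The "hokey" uniform prior: each technique equally likely to be best.
uniform_prior = {"Yvain": 1 / 3, "Z_M_Davis": 1 / 3, "pjeby": 1 / 3}
uniform_post = posterior(uniform_prior, likelihood)

# A made-up prior that gives the professional more weight up front.
expert_prior = {"Yvain": 0.25, "Z_M_Davis": 0.25, "pjeby": 0.5}
expert_post = posterior(expert_prior, likelihood)
```

Under the uniform prior the posterior is just the normalized likelihoods, while under the expert-weighted prior the same weak evidence leaves pjeby's technique ahead, which is the sense in which the choice of prior dominates a small study.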

And this is all if we have perfect study design, there are no confounders, and so on and so forth. It would take a lot of work. The best-case scenario is that all that work would be for a single-digit probability shift, and the realistic case is that there's a flaw somewhere in the process, or we simply misinterpret the result (my guess is that people can't deal with a 2% shift correctly and just think "now there's evidence" and count the theory as a little more confirmed), and then we'll actually be giving ourselves negative knowledge.

I'm not saying Bayes isn't useful, but it's useful when we have a lot of numbers, when we're willing to put in a very large amount of work, and where there's something clear and mathematical we can do with the output.

[-]ChrisHibbert15y00

I recently read The Cult of Statistical Significance. I realize that it's de rigueur to quote significance, but Ziliak and McCloskey insist that I ask: what's the hypothesized size of the effect?

If we run three conditions, and end up with 4, 5, and 6 people getting some improvement, and calculate statistical significance, we obfuscate the fact that the difference is in the noise. If the same tests end up with 2, 4 and 8 people improving according to some metric, then we have stronger reason to suspect something is going on. Size matters. It's usually more interesting than statistical significance.
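The two scenarios can be put side by side numerically. The comment does not give group sizes, so the sketch below assumes 10 people per condition (30 volunteers split three ways); that figure, and the function names, are assumptions of this example. The chi-square approximation is also rough with counts this small, so treat the p-values as indicative only.

```python
import math

def improvement_rates(improved, group_size):
    """Fraction of each group that improved: the effect sizes themselves."""
    return [k / group_size for k in improved]

def chi_square(improved, group_size):
    """Pearson chi-square statistic for equal-sized groups, improved vs not."""
    expected_yes = sum(improved) / len(improved)
    expected_no = group_size - expected_yes
    stat = 0.0
    for k in improved:
        stat += (k - expected_yes) ** 2 / expected_yes
        stat += ((group_size - k) - expected_no) ** 2 / expected_no
    return stat

def p_value_df2(stat):
    """Upper-tail probability for chi-square with 2 degrees of freedom.

    Valid only for three groups and a yes/no outcome, where df = 2 and the
    survival function has the closed form exp(-x / 2).
    """
    return math.exp(-stat / 2)

GROUP_SIZE = 10  # assumed, not stated in the comment

noisy = [4, 5, 6]
spread = [2, 4, 8]

noisy_rates = improvement_rates(noisy, GROUP_SIZE)
spread_rates = improvement_rates(spread, GROUP_SIZE)
noisy_stat = chi_square(noisy, GROUP_SIZE)
spread_stat = chi_square(spread, GROUP_SIZE)
```

Reporting the rates alongside any test statistic keeps the size of the difference visible instead of hiding it behind a single significance figure.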

[-]pjeby15y00

Second of all, we still don't really know what "best" means, and it's entirely possible different methods are best for different people in complex ways.

And there's a worse confounding factor, which is that people tend to interpret instructions in terms of whatever prior model they have. (That's actually why I object so strenuously to a couple of aspects of your "oath" model -- they're not so much intrinsically harmful, as harmful to people with certain prior models.)

Testing the distinction between your method and mine would require pretty stringent behavioral control of subjects in a large experiment, because you'd need to validate that the subject actually considered each situation and consequence. (Writing those things out is a good way to verify it, which is why I think your success was actually a side-effect of the thinking you had to do in order to design and write your oaths.)

However, if you just grab a bunch of volunteers and tell them to do either your version or mine of that process, I predict that a substantial number will not actually follow the directions, and will simply tell themselves they've already thought it through enough after considering maybe 1 or 2 situations, and then proceed to do whatever it is they already do to initiate change effects, sprinkled with a bit of flavor from whatever method they're supposed to be testing.

This is a major confounding factor in testing any cognitive behavior model, be it a self-help technique, time management system, or anything else. People tend to process virtually all new inputs through whatever mental strategies they already have, and lop off the parts that don't fit.

[-]Cyan15y00

All we can feasibly get is the intent-to-treat effect. Estimating actual treatment effects is possible but not practical.

Is that a fair summary of the parent?

[-]pjeby15y00

Estimating actual treatment effects is possible but not practical. Is that a fair summary of the parent?

I'm just saying it's hard, and that informal means won't work very well. Well-designed experiments in psychology tend to be designed to trick people into doing or thinking the thing that's being tested, in order to avoid some of these effects.

[-]Cyan15y00

I mean not practical for the LW community in the way MichaelVassar would like to see happen.

[-]Emile15y40

(For some reason, your post is appearing in a smaller font size than other posts. Maybe you could fix that? It's slightly harder to read...)

[-]CannibalSmith15y20

It's a lot harder to read.

[-]billswift15y00

It's somewhat different, but Alan Lakein's "How to Get Control of Your Time and Your Life" was rather similar to all three. Start with contemplation of your goals and priorities, including writing them out. And whenever you are unsure what to do next, ask yourself Lakein's question, "What is the best use of my time right now?", using the previously established goals and priorities.

(I'm doing this from memory from some time ago, so details may be off a bit, but it's how I remember the book.)

[-]Mike Bishop15y00

I think the term "challenge" would be more appropriate than "crisis."