I find the qualitative reflections most enlightening, especially where you said: "But never in the course of this experiment did I count something that turned out to be unimportant."
Your under-confidence on that point may be very common, leading to thoughts like: "Yeah, noticing confusion is all nice, but I usually do that already. I'm fairly certain that I'm only missing some irrelevant confusion." Your experience suggests that there is no such thing as irrelevant confusion. The art is to notice as many confusions as humanly possible instead of just some.
I have never read a better motivation to go and actively try to notice confusion than this sentence. Thanks.
This is a well-written post! Upvoted.
A nitpick:
If I naïvely say that Week 1 establishes a true distribution for averaged weekly counts, then being more than 1σ above the mean for three weeks would have a probability of about p = (0.16)^3 = 0.0041 if that true count distribution remained constant.
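For readers who want to check the arithmetic in the quoted estimate, here is a minimal sketch. It assumes the weekly mean is normally distributed and that the three weeks are independent, which is what the naive calculation implicitly does; the function name `upper_tail` is just an illustrative helper.

```python
import math


def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Chance that a single weekly mean lands more than 1 sigma above the true mean.
one_week = upper_tail(1.0)

# Naive step: treat the three weeks as independent and multiply.
three_weeks = one_week ** 3

print(f"one week:            {one_week:.4f}")
print(f"three weeks:         {three_weeks:.4f}")
print(f"with 0.16 rounded:   {0.16 ** 3:.4f}")
```

The unrounded tail probability is about 0.1587, so the product comes out near 0.0040 rather than 0.0041; the difference is only rounding.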
Unfortunately, this p-value is poorly calibrated because the sampling errors in the estimates of the weekly means and σ are non-negligible compared to the value of σ.* We can obtain an accurate p-value by simulation. Under the null hypothesis of no change in counting frequency, the count for each day follows a Poisson distribution with mean = 150 counts / 35 days (I got 150 from adding up all the counts in the plot; there is some sampling error in this estimate, but its effect on the estimated p-value is negligible). From simulating 10^5 samples, I found 8296 samples with week 3–5 means that are all greater than the sum of the week 1 mean and week 1 SD. This gives p = 0.083.
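A rough sketch of how such a simulation could be set up, using only the Python standard library. This is not the code that produced the number above. Two details are assumptions on my part: "week 1 SD" is read as the standard error of the weekly mean (sample SD of the seven daily counts divided by √7, matching the 1σ bounds in the post), and the sample SD uses the n − 1 denominator. The helper names are illustrative, and the estimate will vary with the seed and with those choices.

```python
import math
import random

DAYS_PER_WEEK = 7
N_WEEKS = 5
DAILY_MEAN = 150 / 35  # total counts / number of days, as stated in the comment


def poisson_draw(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def simulate_p_value(n_sims: int = 100_000, seed: int = 0) -> float:
    """Fraction of null simulations in which the week 3, 4 and 5 means all
    exceed (week 1 mean + week 1 standard error of the mean)."""
    rng = random.Random(seed)
    n_days = DAYS_PER_WEEK * N_WEEKS
    hits = 0
    for _ in range(n_sims):
        days = [poisson_draw(DAILY_MEAN, rng) for _ in range(n_days)]
        weeks = [days[i:i + DAYS_PER_WEEK] for i in range(0, n_days, DAYS_PER_WEEK)]

        week1 = weeks[0]
        mean1 = sum(week1) / DAYS_PER_WEEK
        # Assumption: sample SD with the n - 1 denominator.
        var1 = sum((d - mean1) ** 2 for d in week1) / (DAYS_PER_WEEK - 1)
        # Assumption: "week 1 SD" means the SD of the weekly mean, SD / sqrt(7).
        threshold = mean1 + math.sqrt(var1) / math.sqrt(DAYS_PER_WEEK)

        # weeks[2:5] are weeks 3, 4 and 5.
        if all(sum(w) / DAYS_PER_WEEK > threshold for w in weeks[2:5]):
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Fewer samples than the 10^5 in the comment, to keep the run short.
    print(simulate_p_value(n_sims=20_000))
```

Because the week 1 mean and SD are re-estimated inside every simulated sample, their sampling error is built into the resulting p-value, which is the point of doing it this way.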
An alternative (and standard) way to get a p-value here is to use Kendall tau as a test statistic, which gives a non-parametric rank-based test for monotone association. The single-tailed Kendall tau gives p = 0.076.
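For anyone curious what such a test involves, here is a hedged sketch in plain Python. It computes the simple tau-a statistic and gets a one-sided p-value from a Monte Carlo permutation test, which is a swapped-in technique and not necessarily how the figure above was computed (statistics packages typically use tau-b with an exact or normal-approximation p-value). The daily counts from the post are not reproduced here, so the series in the usage example is a made-up placeholder, and the function names are hypothetical.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs.
    Tied pairs count as neither concordant nor discordant."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


def one_sided_p_value(x, y, n_perm=2_000, seed=0):
    """Monte Carlo permutation p-value for the alternative that y tends to
    increase with x. The +1 terms keep the estimate away from exactly zero."""
    rng = random.Random(seed)
    observed = kendall_tau(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if kendall_tau(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Placeholder series for illustration only; NOT the counts from the post.
    days = list(range(1, 11))
    counts = [3, 1, 4, 2, 5, 4, 6, 3, 7, 6]
    print(kendall_tau(days, counts), one_sided_p_value(days, counts))
```

The test only uses the ordering of the counts over time, so it makes no assumption about their distribution, at the cost of ignoring how large the changes are.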
* ETA: Let me add more explanation for any reader who is not sure what's going on. The p-value in the post is (exactly) correct if the weekly mean and SD under the null hypothesis can be determined without any error. Unfortunately, we cannot do that—the best we can do is to estimate the weekly mean and SD using the week 1 mean and SD, so our estimates contain sampling errors. Often, we do not care about sampling errors when we are working with large samples because these errors are negligibly small compared to the SD. However, in this case, our sample has only n = 7, so sampling errors are non-negligible compared to the SD. This becomes a problem when we work with p-values because the null hypothesis is dependent on our estimates, but the errors in these estimates are not taken into consideration when we calculate the p-value. A common way to work around this is to use simulations, as I did. Alternatively, because our null hypothesis is rather simple, it might be feasible to use analytic methods to calculate a correct p-value.
Thanks! Hardly a nitpick, I should really know better. It looks especially bad that my laziness/carelessness led to overstated results. 150 is the correct number of counts, and I agree with your calculation. Embarrassingly, I also screwed up the p-value for the sleep correlation, [EDIT] which I retracted briefly but now have fixed.
I have a (kind of) meta question: What's up with the "zir" and "zirself" in the text? I've never heard/read that word before and from context I'd infer that it should be "their" and "themselves". Would you clear that up?
Thanks for clearing that up. That was my guess, I was just confused that it suddenly popped up without me ever having heard about it. Is it popular/well-known? When I googled it, there were no hits for an explanation.
I don't know, I've only ever seen it here and on Yvain's blog. Ironically, it doesn't really work for me as far as disrupting intuitive connections and depictions is concerned because it sounds pretty much like the German third person singular feminine pronoun. ;-)
While there have been many attempts at a set of such pronouns and none ever became standard, this is the set I see by far most commonly. Several non-gender-binary-identifying people I know use ze/zir/zirs as their preferred pronouns. They definitely crop up in many more places than just here and SlateStarCodex, as someone else replied, but it tends to be mostly in communities that have a particular focus on gender identity.
When I see a word I've never seen before, I google it. Here you go.
At first, I didn't seem to exercise this skill on days where I wasn't doing cognitively demanding work, or when most of my work was not in an academic context (typically weekends). Over time, I began doing so more, although still less than on demanding academic days.
I know quite a bit of time has passed since you posted this, but do you recall any specific instances of non-cognitively-demanding weekend-type confusions you could share?
I wrote down a handful as I was doing this, but not all of them. There were a couple about navigation (where rather than say "well, I don't know where I am, I'll just trust the group" I figured out how I was confused about different positions of landmarks). I avoided overbaking my cookies when the recipe had the wrong time written down. Analytics for a site I run pointed to a recent change causing problems for some people, and I saw the (slight) pattern right away but ignored it until it got caught on my confusion hook. It's also a nice hook for asking questions in casual conversations. People are happy to explain why they like author X but not the superficially similar author Y I've heard them complain about before, for example.
Sorry for being nit-picky, but one thing here really bugs me.
I would recommend extreme caution when recording data you remember from the experience of a lucid dream. Despite the fact that you may have been conscious that you were unconscious, the fact that you were in a dream-like state could mess with what you remember. While I personally have little (okay, no) experience with lucid dreaming, it seems safe to assume that you might forget details of the dream after waking up and trying to recall it, especially if you wait days before trying to remember the dream. Obviously this is often the case in regular dreams, and while you could make the case that lucid dreams are more vivid and thus easier to remember, it's still too sketchy for me to take that as evidence without being heavily skeptical.
Otherwise, well done.
a) In my experience, lucid dreams are more memorable than normal dreams
b) You seem to assume that Whales completely forgot about the dream until they wrote this blog post, which is unlikely, because obviously they'd be thinking about it as soon as they woke up, and probably taking notes.
c) Whales already said that it hardly even constitutes evidence
I previously discussed the potential relevance of therapeutic and instructional models of metacognitive training to LW-style rationality skills. As an attempted concrete realization of what this connection could look like, I ran a self-experiment in which I counted instances of noticing confusion. Below I elaborate on the motivation and design of the experiment, then discuss some quantitative results and qualitative reflections.
Self-monitoring as a treatment vehicle in cognitive-behavioral and related therapies can take many forms. In one (to my secondhand understanding), the patient is coached in noticing a physical or mental behavior by identifying examples of the behavior and heuristics for when to watch for it, and by examining the feeling of the behavior itself. This is accompanied by practice of coping strategies. The patient is instructed to count the occurrences of that behavior on zir own. This is ideally done with a "wrist counter," which is always available, can be incremented with the press of a single button, and gives both tactile and visual feedback on being pressed.
The patient might, for example, count instances of acting on zir own initiative, or of having positive thoughts about zirself. In this case, tying the thought to the specific physical action of pressing the button, as well as watching a "score" go up, helps with the reward circuit for both noticing the thought and the content of the thought.
The patient could also count negative thoughts, engaging in bad habits, inappropriate "should" statements or other "cognitive distortions." At first, so I'm told, the count will go up, as you get better at noticing; then (optimistically) back down over a few weeks, as your symptoms diminish. In this case, it's important not to focus on the fact you're doing something "bad." Instead, try to reward noticing and dispelling the bad thing, or at least to reward noticing that you're focusing on the bad thing rather than rewarding the noticing. (If all else fails, reward noticing that you're focusing on failing to reward noticing that you're focusing on the bad thing rather than rewarding the noticing. That should definitely do it, right?)
This seems doubly useful: not only are you practicing and rewarding the noticing skill, but in tying it to a physical action, you necessarily bring the noticed behavior to your conscious attention, so that you can deal with it deliberately. If you noticed yourself dismissing a compliment, you'd take that opportunity to point out to yourself that the dismissal is mostly evidence of your mental state, and only weakly of the compliment's validity; you'd try to take the compliment at face value.
I chose to implement a version of this for a personal version of noticing confusion. (I also considered noticing a mental flinch, noticing motivated reasoning, flagging beliefs for review, activating curiosity, welcoming bad news, being specific, and noticing others' nonspecificity/asking for examples. I decided to go for now with what would be most personally useful and the most frequent.) I'm using this counter widget on my phone's home screen. It's two button presses away at any time, and it shows me a nice big number. I also see it whenever I use my phone, which is good for scaffolding but bad for transfer—on one hand, I get reminders to pay attention to my mental processes, so I'm more likely to be able to practice the noticing skill; on the other, I might be inhibiting my learning to apply the skill without reminders. Since I could just keep using the counter if it helped, I didn't worry too much about this.
The details of the fuzzy introspective rules for whether I get to count something as noticing confusion probably don't matter so much, but the basic idea is this: If I notice an unresolved tension or conflict between things I believe, then I count it. I don't count the related and also-crucial noticing that I simply don't understand something—I have to identify a conflict. (I see "notice when I don't understand something" as Level 0 of this skill. It's also particularly easy to practice: just read something on an unfamiliar subject, and draw a question mark next to any specific thing you don't understand. Ideally, revisit those marks later. Get in the habit of doing this for everything you read.) I don't count confusions in retrospect—if I've already resolved a confusion by the time I bring it to conscious awareness and can press the button, then I don't count it. That was a personally controversial call, but there's another sense in which "noticing and resolving confusion" is simply a mode of thought that operates semi- or sub-consciously. I didn't want to get bogged down in counting those, and this seemed like a simple rule to split the cases.
Thus, some non-examples (still worth noticing in their own right, and often leading to pinpointing a confusion):
And some examples of thoughts that I would count (by the way, these mental processes, like most, are mostly nonverbal for me, so don't take this literally; I'm noticing a feeling like tension in the connections between concepts):
Before I began, I guessed that I encountered this kind of confusion several times a day, mostly in seminars, papers, textbooks, debugging, simulated data, and experimental data. I suspected that I already consciously notice many of them, but not all, and that increasing the catch rate would markedly improve how much understanding I got out of the above activities and perhaps prevent some expensive mistakes.
I attempted to keep my confusion-inducing workload constant by working the same number of hours every day. I also distributed my reading of textbooks/papers and my talk attendance to give roughly constant combined time each day, although I'm not sure that those activities had a particularly different density of confusion from my ordinary work. I typically took a couple days a week off of cognitively demanding work, and this pattern is visible in the data, at least at first.
The night before starting the experiment, I ran myself through a couple-hour training exercise on a meaty-looking paper, expressly to pay attention to conflicts in my growing understanding of the result as well as tensions between the content of the paper and my background knowledge, following recommendations of instructional research on metacognition. This was already pretty satisfying and left me feeling good about my self-experiment. The challenge would be to see whether I could improve at spotting and pinning down my nagging doubts, and whether I could take this watchfulness beyond the more-studied domain of self-monitoring while reading. Both of these things seemed to happen.
The quantitative results are promising, but not especially informative. There's only so much I can say with a month's worth of data points in such a non-rigorous self-experiment. As it turned out, my guess of "several times a day" was pretty good—for a good day, full of demanding work, which was what came to mind when guessing. In truth, there's a lot more variation between days, which didn't disappear as I got better at pressing the button: there's a standard deviation of 2.85 counts for week 1, 2.81 counts for week 5, and 2.81 counts for all days.
Here's what the data looks like, with a moving weekly average (thus accounting for the weekend effect) and moving weekly 1σ bounds (e.g. ± 2.85/√7 for the first week):

By week 3, the weekly count has gone up by a standard deviation, and it stays there or higher for weeks 4 and 5. Again, I don't want to lean too hard on these numbers—I wasn't rigorously consistent about the amount and nature of my daily work or the rules for counting. Weeks 1 and 2 might have been bad weeks, so that the increase doesn't represent a real improvement; there's also room for my desire to have a better-looking LW post to have increased the counts. And there's a little ambiguity about what I'm measuring: perhaps the increase in counting comes only from remembering to press the button, and there are plenty of other times when I notice confusions and consciously address them without identifying them as button-pressing candidates. My guess is that this isn't the case—the increase seemed to come in the form of things I barely didn't miss.
If I naïvely say that Week 1 establishes a true distribution for averaged weekly counts, then being more than 1σ above the mean for three weeks would have a probability of about p = (0.16)^3 = 0.0041 if that true count distribution remained constant. I'm not going to do any more sophisticated analysis than that, since I don't think the data really supports it. See this detailed comment by VincentYu. There's also a barely-significant relationship with the previous night's sleep duration (p = 0.043, +1 count per hour of sleep). If I adjust for this, the appearance of improvement still holds:

So sleep perhaps accounts for a small amount of random variation, and not the overall shift.
Finally, some qualitative reflections:
The quantitative results are promising, but for me, the qualitative lessons are more important—particularly my underconfidence and the possibility of using contradiction to fuel curiosity. I'll keep counting confusions like this for a while, but I'm not going to worry much about experimental validity. Similarly, it doesn't matter a whole lot to me whether the apparent gains rely on using the counter, since it costs me basically nothing to continue using it. I suppose that one could look into that by taking a break from counting and resuming it after a few months, but that's honestly not my priority.
This is a really easy thing to try, and I'd like to encourage others to build on the simple attempt I've presented here.