Why Everyone (Else) Is a Hypocrite: Evolution and the Modular Mind
Concept Safety
Multiagent Models of Mind
Keith Stanovich: What Intelligence Tests Miss


Similarly, IFS errs on the side of having strong priors about the form/structure of the parts (exiles, firefighters, and managers), and on the side of doing "family therapy" with them.

For what it's worth, my experience is that while the written materials for IFS make a somewhat big deal out of the exile/firefighter/manager thing, people trained in IFS don't give it that much attention when actually doing it. In practice, the categories aren't very rigid and a part can take on properties of several different categories. I'd assume that most experienced facilitators would recognize this and just focus on understanding each part "as an individual", without worrying about which category it might happen to fit in.

I guess this is what you already say in the last paragraph, but I'd characterize it more as "the formal protocols are the simplified training wheels version of the real thing" than as "skilled facilitators stop doing the real thing described in the protocols". (Also not all IFS protocols even make all those distinctions, e.g. "the 6 Fs" only talks about a protector that has some fear.)

Great post!

This also reminds me of Tal Yarkoni's paper on what he calls the generalizability crisis in psychology: psychological experiments measure a very specific thing, which is then treated as corresponding to a much more general thing. Psychologists assume that the specific thing is a good proxy for the general thing, but Yarkoni argues that they're not measuring what they think they're measuring.

One of his examples is about the study of verbal overshadowing. This is a claimed phenomenon where, if you have to verbally describe what a face looks like, you will be worse at actually recognizing that face later on. The hypothesis is that producing the verbal description causes you to remember the description rather than the face itself, and the verbal description inevitably contains less detail. This has been generalized to the broader claim that producing verbal descriptions of experiences impairs our later recollection of them.

Yarkoni discusses an effort to replicate one of the original experiments:

Alogna and colleagues (2014) conducted a large-scale “registered replication report” (RRR; Simons, Holcombe, & Spellman, 2014) involving 31 sites and over 2,000 participants. The study sought to replicate an influential experiment by Schooler and Engstler-Schooler (1990) in which the original authors showed that participants who were asked to verbally describe the appearance of a perpetrator caught committing a crime on video showed poorer recognition of the perpetrator following a delay than did participants assigned to a control task (naming as many countries and capitals as they could). Schooler & Engstler-Schooler (1990) dubbed this the verbal overshadowing effect. In both the original and replication experiments, only a single video, containing a single perpetrator, was presented at encoding, and only a single set of foil items was used at test. Alogna et al. successfully replicated the original result in one of two tested conditions, and concluded that their findings revealed “a robust verbal overshadowing effect” in that condition.

Let us assume for the sake of argument that there is a genuine and robust causal relationship between the manipulation and outcome employed in the Alogna et al study. I submit that there would still be essentially no support for the authors’ assertion that they found a “robust” verbal overshadowing effect, because the experimental design and statistical model used in the study simply cannot support such a generalization. The strict conclusion we are entitled to draw, given the limitations of the experimental design inherited from Schooler and Engstler-Schooler (1990), is that there is at least one particular video containing one particular face that, when followed by one particular lineup of faces, is more difficult for participants to identify if they previously verbally described the appearance of the target face than if they were asked to name countries and capitals. [...]

On any reasonable interpretation of the construct of verbal overshadowing, the corresponding universe of intended generalization should clearly also include most of the operationalizations that would result from randomly sampling various combinations of these factors (e.g., one would expect it to still count as verbal overshadowing if Alogna et al. had used live actors to enact the crime scene, instead of showing a video). Once we accept this assumption, however, the critical question researchers should immediately ask themselves is: are there other psychological processes besides verbal overshadowing that could plausibly be influenced by random variation in any of these uninteresting factors, independently of the hypothesized psychological processes of interest? A moment or two of consideration should suffice to convince one that the answer is a resounding yes. It is not hard to think of dozens of explanations unrelated to verbal overshadowing that could explain the causal effect of a given manipulation on a given outcome in any single operationalization.

This verbal overshadowing example is by no means unusual. The same concerns apply equally to the broader psychology literature containing tens or hundreds of thousands of studies that routinely adopt similar practices. In most of psychology, it is standard operating procedure for researchers employing just one experimental task, between-subject manipulation, experimenters, testing room, research site, etc., to behave as though an extremely narrow operationalization is an acceptable proxy for a much broader universe of admissible observations. It is instructive—and somewhat fascinating from a sociological perspective—to observe that while no psychometrician worth their salt would ever recommend a default strategy of measuring complex psychological constructs using a single unvalidated item, the majority of psychology studies do precisely that with respect to multiple key design factors. The modal approach is to stop at a perfunctory demonstration of face validity—that is, to conclude that if a particular operationalization seems like it has something to do with the construct of interest, then it is an acceptable stand-in for that construct. Any measurement-level findings are then uncritically generalized to the construct level, leading researchers to conclude that they’ve learned something useful about broader phenomena like verbal overshadowing, working memory, ego depletion, etc., when in fact such sweeping generalizations typically obtain little support from the reported empirical studies.


It's not just hard in practice; it's a question that the studies are theoretically incapable of answering.

Which "unhelpful tendencies and problems" occur in both twins with radically different upbringing and which do not?

That doesn't distinguish between e.g. unhelpful tendencies that occur due to genes that all humans share vs. unhelpful tendencies that occur due to living in an industrialized society.

In general, twin studies only tell you what proportion of the variance in a trait is genetic in a given society. But you can't use that information to determine whether the trait is evolved vs. cultural; that's not the question that the studies are asking. E.g. in a society that had a custom of lobotomizing all red-haired people, "being lobotomized" would be an entirely hereditary trait (since hair color is genetically determined) that turned up in twins with radically different upbringing, even though it was an entirely cultural practice. (More examples here, here and here.)
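To make the lobotomy example concrete, here's a toy simulation of a classical twin study in that hypothetical society (my own sketch, not part of the original argument; the single-gene model of hair color and the 50% gene-sharing approximation for fraternal twins are simplifying assumptions). Falconer's formula estimates heritability from the gap between identical (MZ) and fraternal (DZ) twin correlations, and here it attributes "being lobotomized" almost entirely to genes:

```python
import random

random.seed(0)

def red_hair(genome):
    # Toy model: "red hair" iff a single (hypothetical) gene variant is present.
    return genome == 1

def simulate_twin_pairs(n, monozygotic):
    """Return n pairs of (lobotomized?, lobotomized?) for twin pairs."""
    pairs = []
    for _ in range(n):
        g1 = random.randint(0, 1)  # 1 = red-hair variant
        if monozygotic:
            g2 = g1  # MZ twins share the whole genome
        else:
            # DZ twins share roughly 50% of segregating genes.
            g2 = g1 if random.random() < 0.5 else random.randint(0, 1)
        # The cultural practice: everyone with red hair gets lobotomized,
        # no matter where or how they were raised.
        pairs.append((red_hair(g1), red_hair(g2)))
    return pairs

def correlation(pairs):
    xs = [float(a) for a, b in pairs]
    ys = [float(b) for a, b in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / len(ys)) ** 0.5
    return cov / (sx * sy)

r_mz = correlation(simulate_twin_pairs(100_000, monozygotic=True))
r_dz = correlation(simulate_twin_pairs(100_000, monozygotic=False))

# Falconer's formula: heritability h^2 is approximately 2 * (r_MZ - r_DZ).
h2 = 2 * (r_mz - r_dz)
print(f"r_MZ = {r_mz:.2f}, r_DZ = {r_dz:.2f}, estimated h^2 = {h2:.2f}")
# The estimate comes out near 1.0: the twin design attributes "being
# lobotomized" almost entirely to genes, even though the lobotomies
# are a purely cultural practice.
```

The point is just that the design measures how variance is distributed within this society; it can't distinguish a genetically transmitted trait from a cultural practice that keys off a genetic marker.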

Is this assertion borne out by twin studies?

How would you test it with twin studies?

If we start going into the exact specifics of what makes them different, then yes, there are reasonable grounds for expecting GPT-3 to genuinely be more of an advance than SHRDLU was. But at least as described in the post, the heuristic under discussion wasn't "if we look at the details of GPT-3, we have good reasons to expect it to be a major milestone"; the heuristic was "the audience of a horror movie would start screaming when GPT-3 is introduced".

If the audience of a 1970s horror movie would have started screaming when SHRDLU was introduced, then what we now know about why it was a dead end doesn't seem to matter, nor does it seem to matter that GPT-3 is different. Especially since, why would a horror movie introduce something like that only for it to turn out to be a red herring?

I realize that I may be taking the "horror movie" heuristic too literally, but I don't know how else to interpret it than as "evaluate AI timelines based on what would make people watching a horror movie assume that something bad is about to happen".

But if we are the movie audience seeing just the publication of the paper in the 70s, we don't yet know that it will turn out to be a dead end with no meaningful follow-up after 40-50 years. We just see what looks to us like an impressive result at the time.

And we also don't yet know if GPT-3 and Dall-E will turn out to be dead ends with no significant progress for the next 40-50 years. (I will grant that it seems unlikely, but when the SHRDLU paper was published, it being a dead end must have seemed unlikely too.)

Nice interview, liked it overall! One small question -

  • Heuristic: Imagine you were in a horror movie. At what point would the audience be like “why aren’t you screaming yet?” And how can you see GPT-3 and Dall-E (especially Dall-E) and not imagine the audience screaming at you?

I feel like I'm missing something; to me, this heuristic obviously seems like it'd track "what might freak people out" rather than "how close are we actually to AI". E.g. it feels like I could also imagine an audience at a horror movie starting to scream in the 1970s if they were shown the sample dialogue with SHRDLU starting from page 155 here. Is there something I'm not getting?
