Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Meta: This is a short summary & discussion post of a talk on the same topic by Javier Gomez-Lavin, which he gave as part of the PIBBSS speaker series. The speaker series features researchers from both AI Alignment and adjacent fields studying intelligent behavior in some shape or form. The goal is to create a space where we can explore the connections between the work of these scholars and questions in AI Alignment. 


This post doesn't provide a comprehensive summary of the ideas discussed in the talk, but instead focuses on exploring some possible connections to AI Alignment. For a longer version of Gomez-Lavin's ideas, you can check out a talk here.

"Dirty concepts" in the Cognitive Sciences

Gomez-Lavin argues that cognitive scientists engage in a form of “philosophical laundering,” wherein they smuggle philosophically loaded concepts (such as volition, agency, etc.), often implicitly, into their concept of “working memory.”

He refers to such philosophically laundered concepts as “dirty concepts” insofar as they conceal potentially problematic assumptions. For instance, if we implicitly assume that working memory requires volition, we have stretched our conception of working memory to include all of cognition. But then the concept of working memory loses much of its explanatory power as one mechanism among others underlying cognition as a whole.

Often, he claims, cognitive science papers will employ such dirty concepts in the abstract and introduction but will identify a much more specific phenomenon being measured in the methods and results sections.

What to do about it? Gomez-Lavin’s suggestion in the case of CogSci

The pessimistic response (and some have suggested this) would be to quit using any of these dirty concepts (e.g., agency) altogether. However, this would amount to throwing the baby out with the bathwater.

To help remedy the problem of dirty concepts in working memory literature, Gomez-Lavin proposes creating an ontology of the various operational definitions of working memory employed in cognitive science by mining a wide range of research articles. The idea is that, instead of insisting that working memory be operationally defined in a single way, we ought to embrace the multiplicity of meanings associated with the term by keeping track of them more explicitly. 
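To make the proposal concrete, here is a minimal sketch, in Python, of what a machine-readable ontology of operational definitions might look like. The field names, example definitions, and "hypothetical-paper" sources are our own illustration, not Gomez-Lavin's actual scheme:

```python
from dataclasses import dataclass, field

@dataclass
class Operationalization:
    """One operational definition of a concept, as used in a specific paper."""
    concept: str     # the umbrella term, e.g. "working memory"
    definition: str  # how the paper actually pins the term down
    source: str      # citation for the paper
    tasks: list = field(default_factory=list)  # tasks that measure it

def group_by_concept(entries):
    """Index operationalizations under their umbrella concept,
    making the multiplicity of meanings explicit and queryable."""
    ontology = {}
    for entry in entries:
        ontology.setdefault(entry.concept, []).append(entry)
    return ontology

# Illustrative entries only; real ones would be mined from the literature.
entries = [
    Operationalization(
        concept="working memory",
        definition="capacity to retain items over a short delay",
        source="hypothetical-paper-A",
        tasks=["digit span"],
    ),
    Operationalization(
        concept="working memory",
        definition="capacity to manipulate items currently held in mind",
        source="hypothetical-paper-B",
        tasks=["n-back"],
    ),
]

ontology = group_by_concept(entries)
print(len(ontology["working memory"]))  # 2 distinct operationalizations
```

The point of keeping operationalizations as first-class records, rather than flattening them into a single definition, is that the multiplicity of meanings stays visible whenever the umbrella term is invoked.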

He refers to this general approach as “productive pessimism.” It is pessimistic insofar as it starts from the assumption that dirty concepts are being problematically employed, but it is productive insofar as it attempts to work with this trend rather than fight against it. 

While it is tricky to reason with such fuzzy concepts, if we are rigorous about proposing working definitions/operationalizations of these terms as we use them, we can avoid some of the main pitfalls and improve our definitions over time.

Relevance to AI alignment?

It seems fairly straightforward that AI alignment discourse, too, suffers from dirty concepts. 

If this is the case (and we think it is), a similar problem diagnosis (e.g. how dirty concepts can hamper research/intellectual progress) and treatment (e.g. ontology mapping) may apply. 

A central example here is the notion of "agency". Alignment researchers often speak of AI systems as agents, yet there are often multiple, entangled meanings intended when doing so. High-level descriptions of AI x-risk often exploit this ambiguity in order to speak about the problem in general terms, while ultimately employing imprecise concepts. This is analogous to how cognitive scientists will often describe working memory in general terms in the abstract of their papers and operationalize the term only in the methods and results sections. As such, general descriptions of AI x-risk that refer to AI systems as agents are often an instance of dirty concepts and philosophical laundering.

A different but related problem arises when the invocation of AI systems as agents (implicitly) refers to different interpretations of the concept. Sometimes, the intended use of the concept of agency is simply the one operationally defined in Reinforcement Learning; other times, we might intend the concept of agency as it is used in biology and evolutionary theory (see e.g. this overview of notions of agency used in biology); yet other times, we might intend the concept of agency found in the philosophy of mind, cognitive science, and/or psychology. (The latter two cases are additionally problematic because the intended concepts, i.e., the biological or cognitive-scientific conceptions of agency, might themselves be dirty concepts.)

Consequently, if Gomez-Lavin's suggestion for dealing with dirty concepts is promising, AI x-risk and alignment research could benefit from mapping an ontology of the various operational definitions of agency employed in the AI x-risk and alignment literature.

Below, we have started compiling an incomplete list of "dirty concepts" often used in AI alignment discourse (and partially leave its completion as an exercise to the reader). At the very least, it is helpful to be aware when one is dealing with a dirty concept. At best, some folks will pick up the idea of creating an ontology mapping for (some of) these concepts.

  • Values, as well as related notions such as: goals, intentions, preferences, desires, ...
  • Optimization
  • Awareness, self-awareness, situational awareness [we don't mean to imply those concepts are the same]
  • Planning
  • Deception 
  • Alignment 
  • Autonomy
  • “The AI system” / “the model” / “the simulation” / “the (LLM) simulacra” (/ etc.)
  • Knowledge / Knowing 
  • Attention
  • Memory
Comments

Dagon writes:

I'd like to plug my common recommendation for such problems (which occur with almost ALL words and short phrases that have both common and technical uses): use more words. Papers and posts that are based on these words should specify which implications they depend on, either inline in a paragraph or two, or as a reference to another document which defines and uses the term(s).

For relatively coherent bodies of work (some websites or topics on them, some clusters of papers, possibly some subsets of industries), these definitions may become common enough that they can be relegated to a footnote, with more explanation only where usage deviates from the standard.

The IETF's RFC process is a pretty reasonable example of such a system.

Another commenter writes:

This social problem sounds like it has a technical solution! There exist browser addons that let readers publicly annotate text. There could easily exist one that uses an LLM to detect ambiguous phrasings and publish one or more annotated interpretations.
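A toy stand-in for such a tool might simply flag occurrences of known "dirty concepts" in a passage and attach candidate interpretations, leaving the harder disambiguation work to an LLM or a human annotator. The term list and readings below are our own illustration, not the behavior of any actual addon:

```python
import re

# Illustrative term list; a real tool would draw on a curated ontology.
DIRTY_CONCEPTS = {
    "agency": [
        "RL sense: a policy acting in an environment",
        "biological sense: goal-directed self-maintenance",
        "philosophy-of-mind sense: intentional action",
    ],
    "optimization": [
        "search over a loss landscape",
        "a system that steers outcomes toward a target",
    ],
}

def annotate(text):
    """Return (term, position, candidate readings) for each occurrence
    of a known dirty concept in the text, in order of appearance."""
    hits = []
    for term, readings in DIRTY_CONCEPTS.items():
        for match in re.finditer(rf"\b{term}\b", text, flags=re.IGNORECASE):
            hits.append((term, match.start(), readings))
    return sorted(hits, key=lambda hit: hit[1])

hits = annotate("The agency of the model emerges from optimization.")
print([hit[0] for hit in hits])  # ['agency', 'optimization']
```

Keyword matching is of course far cruder than the LLM-based detection the comment envisions, but it shows the shape of the annotation data a reader-facing tool would publish.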

Another commenter adds:

Some more terms that could be added to the list of "dirty concepts":

  • Capabilities / capabilities research
  • Embeddedness
  • Interpretability
  • Artificial general intelligence
  • Subagent
  • (Recursive) self-improvement

Another commenter notes:

Apropos the recurring misunderstanding over the meaning of agency, I think it would be helpful to situate what one means by "ontology," as the term is equally polyvalent, especially when deployed in interdisciplinary contexts.