As Large Language Models (LLMs) like ChatGPT become more advanced and intricate, understanding their behavior grows increasingly difficult. Relying solely on traditional interpretability techniques may not be sufficient, or fast enough, for understanding and aligning these models.

When exploring human cognition and behavior, we've historically relied on two intertwined yet distinct approaches: psychology and neuroscience. While psychology offers us a lens to understand human behavior through external observation, neuroscience delves into the internal mechanisms, exploring the biological roots of our mental processes. The collaboration between these two fields isn't just a confluence of theories and methods; it's a synergy where insights from one field often inspire and enhance the other.

For a tangible illustration of how psychology and neuroscience complement each other, let's delve into the realm of memory research. Elizabeth Loftus, a psychologist, illuminated the ways in which human memory can be malleable and sometimes inaccurate. Her pioneering psychological studies set the stage for a deeper exploration of memory. Building upon her insights, neuroscientists Yoko Okado and Craig E. L. Stark in 2005 delved into the brain's mechanics, seeking the neural underpinnings of these memory phenomena. On another front, 1996 saw a landmark discovery in neuroscience: Giacomo Rizzolatti and his team unveiled the existence of mirror neurons, cells that fire both when an individual performs an action and when it observes someone else performing the same action. This revelation prompted psychologists to venture further, with researchers like Niedenthal et al. exploring the embodiment of emotional concepts and drawing parallels with these neurological findings. Such interplay underscores a collaborative dynamic where each field, while preserving its distinctive methodological lens, enriches and is enriched by the other, driving our collective understanding of the human mind forward.

Drawing parallels with AI alignment research, the path of interpretability mirrors that of neuroscience (a bottom-up approach). However, we still lack the equivalent of a top-down, "psychological" approach. This gap brings us to "Large Language Model Psychology"[1], a burgeoning field ripe for exploration and brimming with low-hanging fruit.

While we're still in the early stages of carving out the contours of LLM psychology, this series of posts aims to pique your curiosity rather than present a rigorous academic exposition. I want to spotlight this emerging domain, discuss potential research directions, and hopefully, ignite a newfound enthusiasm among readers.

Disclaimer: My background is primarily in software and AI engineering. My understanding of psychology and neuroscience has been accumulated from various online sources over the past year. I acknowledge that there may be oversimplifications or inaccuracies in my posts. My intent is to continually expand my knowledge in these domains as I delve deeper into the realm of LLM psychology research for alignment. Nonetheless, I believe that a fresh perspective, unburdened by preconceived notions, can sometimes offer valuable insights.

In this sequence, my primary examples will be interactions with ChatGPT since it's the model I'm most familiar with. However, the observations and insights are largely applicable to various state-of-the-art LLMs. Do note, though, that replicating some of the more intricate examples might require quite advanced prompt engineering skills.
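For readers who want to try replicating such interactions, it helps to script them rather than retype prompts into the chat interface, so each probe becomes a repeatable experiment. Below is a minimal sketch of what that might look like; it assumes the openai Python package (v1+ client) and an `OPENAI_API_KEY` environment variable, and the probe prompt itself is a made-up illustration, not one of the actual examples from this sequence.

```python
# Minimal sketch: scripting reproducible ChatGPT interactions for
# behavioral ("psychological") probes. Assumes the openai Python
# package (v1+) with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_probe(system_prompt: str, user_prompt: str, n_trials: int = 5) -> list[str]:
    """Run the same behavioral probe several times to see how stable
    the model's answers are across samples."""
    answers = []
    for _ in range(n_trials):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # swap in whichever model you study
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            temperature=1.0,  # keep sampling noise: variability is part of the data
        )
        answers.append(response.choices[0].message.content)
    return answers


if __name__ == "__main__":
    # Hypothetical probe: how does the model describe its own "preferences"?
    for answer in run_probe(
        system_prompt="You are a helpful assistant.",
        user_prompt="In one sentence: do you enjoy answering questions?",
    ):
        print(answer)
```

Running the same prompt several times, rather than relying on a single chat transcript, is the closest analogue to the controlled, repeated-measures studies psychologists run on human subjects.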

The sequence of posts has been designed with continuity in mind. Each post builds on the knowledge and concepts introduced in the previous ones. For a coherent understanding, I recommend reading the posts in the order they're presented. Some points might be confusing or seem out of context if you jump ahead.

  1. ^

    I think @Buck is at the origin of the name "Model Psychology". I heard him use this term in a presentation during SERI MATS 3.0.
