Adam Morris — LessWrong

LESSWRONG
LW

Replying toTests of LLM introspection need to rule out causal bypassing

Tests of LLM introspection need to rule out causal bypassing

Fascinating point, I think you're right. Just to repeat your point in my own words: The problem is that, if the activation steering makes the model want to talk about the injected concept, and if it knows that saying "yes, I received an injection" will give it a chance to talk about the concept later in the response, then it will say "yes" in order to talk about the concept later (even if it actually had no metacognitive awareness of the injection). Is that what you're saying?

Tests of LLM introspection need to rule out causal bypassing

Adam Morris

Adam Morris, Dillon Plunkett

3mo

This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it’s important, so we’re describing it here.

There’s been growing interest in testing whether LLMs can introspect on their internal states or processes. Like Lindsey, we take “introspection” to mean that a model can report on its internal states in a way that satisfies certain intuitive properties (e.g., the model’s self-reports are accurate and not just inferences made by observing its own outputs). In this post, we focus on the property that Lindsey calls “grounding”. It can’t just be that the model happens to know true... (read 1043 more words →)

Adam Morris3mo

Is there reason to think that Bores or Wiener are not trustworthy or lack integrity? Genuine question, asking because it could affect my donation choices. (I couldn't tell from your post if there were, e.g., rumors floating around about them, or if you were just using this as an example of a key question that you thought was missed in Neyman's analysis.)

Self-interpretability: LLMs can describe complex internal processes that drive their decisions

Adam Morris

Adam Morris, Dillon Plunkett

3mo

This post is a summary of our paper from earlier this year: Plunkett, Morris, Reddy, & Morales (2025). Adam received an ACX grant to continue this work, and is interested in finding more potential collaborators---if you're excited by this work, reach out!

It would be useful for safety purposes if AI models could accurately report on their internal processes and the factors driving their behavior. This is the motivation behind, e.g., attempts to measure introspection or CoT faithfulness in LLMs. Recent work has shown that frontier models can sometimes accurately report relatively simple, qualitative aspects of their internal choice processes, such as whether they had been fine-tuned to be risk-seeking, had received a... (read 1180 more words →)

Adam Morris3mo

Got it. Okay thanks!

Adam Morris3mo

Earnest question: For both this & donating to Alex Bores, does it matter whether someone donates sooner rather than a couple months from now? For practical reasons, it will be easier for me to donate in 2026--but if it will have a substantially bigger impact now, then I want to do it sooner.

Adam Morris7mo

One small suggestion: When I read this, I genuinely couldn't tell whether "Gray swans: None detected this week" was a joke (like you were pretending to look for literal gray/black swans), or if it meant something serious. After reading your website, my guess is that it's meant to be serious---but I'm still not sure, and if it is serious then I don't know what it means. (My understanding is that "black swan" means an unexpected, highly improbable / out of distribution event, so it wasn't clear to me what it would mean in this context to be generally looking for global gray/black swans.) Might be worth clarifying or finding other terminology, if you want readers like me to quickly grok what you mean.

Adam Morris8mo

We haven't had one yet! But we only did it ~3 times. Obviously people are more careful than they'd normally be while dancing on the slippery floor.

Adam Morris9mo

I'll add to this list: If you have a kitchen with a tile floor, have everyone take their shoes off, pour soap and water on the floor, and turn it into a slippery sliding dance party. It's so fun. (My friends and I used to call it "soap kitchen" and it was the highlight of our house parties.)

Printable book of some rationalist creative writing (from Scott A. & Eliezer)

Adam Morris

As a holiday gift to myself, I put together a printable book of some creative writing by Scott Alexander & Eliezer. I'm sharing it here in case anyone else is interested :).

It contains most of Scott's published fiction since he switched to Substack, plus some select creative non-fiction pieces (including a few from the old SSC days). I tried to choose writings that did not overlap with the other printable book of his older fiction pieces. The collection also has a few stories by Eliezer that I love and haven't seen in print elsewhere.

The PDF can be found here: https://thatadammorris.com/files/bread-crumbs.pdf. You can print is using Blurb's PDF to book functionality. Instructions:

Click on

... (read more)

•••

Replying toThree Subtle Examples of Data Leakage

Adam Morris1y

Three Subtle Examples of Data Leakage

I see, that makes sense. Thank you!

Replying toThree Subtle Examples of Data Leakage

Adam Morris1y

Three Subtle Examples of Data Leakage

Can you help me see this point? Why not correct it in the dataset? (Assuming that the dataset hasn't yet been used to train any models)

Replying toPrint Books of Scott Alexander's Writing

Adam Morris2y

Print Books of Scott Alexander's Writing

I'm long overdue here, but thank you so much for doing this!! I've been wanting this for a long time and just discovered this post :)

How are you currently modeling COVID contagiousness?

Adam Morris

My long-distance partner has COVID; her symptoms first started ~10 days ago. She is now symptomless (except fatigue), but is still testing strongly positive on rapids. If she comes to visit me, what are the odds she's still contagious? More generally, how much weight do you place on "symptomless and 10+ days from onset" versus "testing positive on rapids" in modeling contagiousness?