with a bunch of reflexes to eg stop and say “that doesn’t sound right” or “I think I’ve gone wrong, let’s backtrack and try another path”
Shannon Sands says he's found a backtracking vector in R1:
https://x.com/chrisbarber/status/1885047105741611507
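(In case it helps to make that concrete: a 'vector' here presumably means an activation-steering direction. Below is a minimal sketch of the usual difference-of-means recipe for finding and applying one. It's a generic illustration, not Sands' actual methodology; the checkpoint, layer index, contrast prompts, and scaling factor are all placeholder assumptions.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # stand-in, not R1 itself
LAYER = 12  # assumed mid-depth layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto")
model.eval()

def mean_residual(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER, averaged over the prompt's tokens."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0].mean(dim=0)  # hidden[k + 1] is the output of layer k

# Difference of means between text where the model backtracks and text where it doesn't.
backtrack = "Wait, that doesn't look right. Let me go back and try a different approach."
no_backtrack = "That looks right. I'll keep going with the same approach for the next step."
steering_vector = mean_residual(backtrack) - mean_residual(no_backtrack)

# Add the vector into the residual stream at LAYER on every forward pass while generating.
def steer(module, inputs, output, alpha=4.0):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
ids = tok("Question: what is 17 * 24? Work it out step by step.\n", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=120)[0]))
handle.remove()
```

If the direction is real, adding it with a positive coefficient should make 'wait, let me go back' moves more frequent, and subtracting it should suppress them.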
I'd have to look back at the methodology to be sure, but on the assumption that they have the model answer immediately, without any chain of thought, my default guess is that this comes down to the limits of what can be done in one or a small number of forward passes. If the model is doing some sort of internal simulation of its own behavior on the task, that seems like it might require more steps of computation than a couple of forward passes allow. Intuitively, at least, this sort of internal simulation is what I imagine happening when humans introspect on hypothetical situations.
If, on the other hand, the model is using some other approach, maybe circuitry developed specifically for this sort of purpose, then I would expect that approach to handle only pretty simple problems, since that circuitry has to be much smaller than the circuitry developed for actually handling a very wide range of tasks, ie the rest of the model.
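(To make the contrast I have in mind concrete, here's a toy version of the two elicitation conditions; the model and prompts are placeholders rather than whatever the actual experiment used:)

```python
from transformers import pipeline

generate = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

question = ("If you were asked to multiply two 3-digit numbers in your head, "
            "would you get it right?")

# "Immediate" condition: the only forward passes available are the handful
# needed to emit the answer tokens themselves.
immediate = [{"role": "user", "content": question + " Answer only 'yes' or 'no'."}]

# CoT condition: hundreds of extra tokens (and hence extra forward passes) of
# serial computation before the model commits to an answer.
with_cot = [{"role": "user", "content": question + " Think it through step by step, then answer."}]

print(generate(immediate, max_new_tokens=5)[0]["generated_text"][-1]["content"])
print(generate(with_cot, max_new_tokens=300)[0]["generated_text"][-1]["content"])
```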
I agree that from a functional perspective, we can interact with an LLM in the same way as we would another human. At the same time I’m pretty sure we used to have good reasons for maintaining a conceptual distinction.
I think of this through the lens of Daniel Dennett's intentional stance; it's a frame that we can adopt without making any claims about the fundamental nature of the LLM, one which has both upsides and downsides. I do think it's important to stay aware that that's what we're doing, in order to avoid sloppy thinking.
Nate Soares' related framing, wanting 'in the behaviorist sense', is also useful to me:
If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I'll say it "wants" that outcome "in the behaviorist sense".
therefore longer contexts can elicit much richer classes of behaviour.
Up to and including Turing-completeness ('Ask, and it shall be given').
I think there's starting to be evidence that models are capable of something like actual introspection, notably 'Tell me about yourself: LLMs are aware of their learned behaviors' and (to a more debatable extent) 'Looking Inward: Language Models Can Learn About Themselves by Introspection'. That doesn't necessarily mean that it's what's happening here, but I think it means we should at least consider it possible.
Well, we (humans) categorize our epistemic state largely in propositional terms, e.g. in beliefs and suppositions.
I'm not too confident of this. It seems to me that a lot of human cognition isn't particularly propositional, even if nearly all of it could in principle be translated into that language. For example, I think a lot of cognition is sensory awareness, or imagery, or internal dialogue. We could contort most of that into propositions and propositional attitudes (eg 'I am experiencing a sensation of pain in my big toe', 'I am imagining a picnic table'), but that doesn't particularly seem like the natural lens to view those through.
That said, I do agree that propositions and propositional attitudes would be a more useful language to interpret LLMs through than eg vectors of floating-point activation values.
Sure, I agree that would be useful.
Your first paragraph makes sense as an interpretation; I discounted that reading because the idea of something like AlphaGo doing CoT (or of applying a CoT to it) seems so nonsensical, since it's not at all a linguistic model.
I'm having more trouble seeing how to read what Chalmers says in the way your second paragraph suggests -- eg 'unmoored from the original system' doesn't seem like it's talking about the same system generating an ad hoc explanation. It's more like he's talking about somehow taking a CoT generated by one model and applying it to another, although that also seems nonsensical.
Tentative pre-coffee thought: it's often been considered really valuable to be 'T-shaped': to have at least shallow knowledge of a broad range of areas (either areas in general, or sub-areas of some particular domain), while simultaneously having very deep knowledge in one area or sub-area. One plausible near-term consequence of LLM-ish AI is that the 'broad' part of that becomes less important, because you can count on AI to fill you in on the fly wherever you need it.
Possible counterargument: maybe broad knowledge is just as valuable, although it can be even shallower; if you don't even know that there's something relevant to know, that there's a there there, then you don't know that it would be useful to get the AI to fill you in on it.