Now this is truly interesting. As other commentators, and the article itself, have pointed out, it's very difficult to distinguish true introspection (the model tracking and reasoning effectively about its internal activations) from more mundane explanations (e.g. activating bread-related neurons makes a bread-related output more likely, and leaves the model less certain later on that the original text wasn't somehow related to bread until it tries to reason through the connection manually).
I remember a paper on world-modeling where the authors demonstrated fairly concretely that their LLM had a working world model: they trained a probe to predict the future world state of a gridworld robot-programming environment from the model's last-layer activations after each command it entered, and then showed that a counterfactual future world state, in which rotation commands had reversed meanings, could not be similarly predicted. I'm not entirely sure how I would apply the same approach here, but it seems broadly pertinent in that it's likewise a very difficult-to-nail-down capability. It's unfortunate that "train a probe to predict the model's internal state given the model's internal state" is a nonstarter.
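For concreteness, here's roughly the shape of that probe setup as I remember it; the file names, array shapes, and discretized state labels below are my own stand-ins, not anything from the paper:

```python
# Minimal sketch of the linear-probe test, not the paper's actual code.
# Assumes last-layer activations were logged after each command, along with
# the true next world state and the counterfactual one (rotation commands flipped),
# with states discretized into class labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

activations = np.load("last_layer_activations.npy")                # (n_steps, d_model)
true_states = np.load("true_next_states.npy")                      # (n_steps,) class labels
counterfactual_states = np.load("counterfactual_next_states.npy")  # (n_steps,) class labels

def probe_accuracy(X, y):
    """Fit a linear probe on activations and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# If the model genuinely tracks the world, the true future state should be
# linearly decodable from its activations while the counterfactual one shouldn't.
print("true-state probe accuracy:          ", probe_accuracy(activations, true_states))
print("counterfactual-state probe accuracy:", probe_accuracy(activations, counterfactual_states))
```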
In this experiment, the most interesting phenomenon to explain is not how the model correctly identifies the injected concept, but rather how it correctly notices that there is an injected concept in the first place.
This was my takeaway too. The explanation that an LLM learns to model its own future activations and uses the accuracy of that model to gauge the entropy of a body of text seems compelling. I wonder what the latent "anomaly" activations would look like if you compared the activations for the true positives in this experiment against those for the control group, and whether they'd be related to the activation patterns associated with any other known concepts.
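The crude version of the comparison I'm imagining is just a difference-of-means direction between the two sets of runs; everything named below is a hypothetical stand-in for whatever the experiment actually logs:

```python
# Sketch of a difference-of-means "anomaly direction": mean activation on runs
# where the model correctly flagged an injection, minus mean activation on
# matched control runs. The .npy files are hypothetical.
import numpy as np

true_positive_acts = np.load("acts_injection_detected.npy")  # (n_runs, d_model)
control_acts = np.load("acts_control.npy")                   # (n_runs, d_model)

anomaly_direction = true_positive_acts.mean(axis=0) - control_acts.mean(axis=0)
anomaly_direction /= np.linalg.norm(anomaly_direction)

def anomaly_score(activation_vector):
    """Project one activation vector onto the candidate anomaly direction."""
    return float(activation_vector @ anomaly_direction)

def cosine_to_concept(concept_vector):
    """Compare the anomaly direction against a known concept vector, to check
    whether "something is off" is its own feature or a blend of existing ones."""
    concept_vector = concept_vector / np.linalg.norm(concept_vector)
    return float(anomaly_direction @ concept_vector)
```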
I think I might've been unclear. My understanding is:
A VLA of the kind I was thinking of would be both the Orchestrator and the Executor: it would encapsulate the high-level 'common sense' knowledge it uses to solve problems, and also directly specify the low-level motor actions that carry out its tasks.
In practice, this contrasts with approaches that give the LLM a Python console with a specialized API designed to facilitate robotics tasks (Voxposer, for instance, gives a 'blind' LLM an API that lets it use a vision-language model to guide its movements automatically), or, like I think you're describing, that pair a pure LLM or VLM tasked with high-level control with a VLA model tasked with implementing the strategies it proposes. The advantage here would be that there's no information bottleneck between the planning and implementation modules, which I've noticed is the source of a decent share of failure cases in practical settings.
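To illustrate the bottleneck I mean, here's a toy sketch; every class and method below is hypothetical rather than any real framework's API:

```python
# Toy contrast between a planner/executor split and a unified VLA.
from dataclasses import dataclass

@dataclass
class MotorCommand:
    joint_angles: list[float]

class PlannerLLM:
    def plan(self, instruction: str, scene_description: str) -> str:
        # The plan has to be squeezed into text (or a narrow API call), so anything
        # the planner inferred but didn't verbalize is lost at this hand-off.
        return "move the gripper to the red mug, then close the fingers"

class ExecutorVLA:
    def act(self, plan_text: str, camera_frame: bytes) -> MotorCommand:
        # The executor only sees the text plan plus its own perception;
        # it can't query the planner's full internal state.
        return MotorCommand(joint_angles=[0.0] * 7)

class UnifiedVLA:
    def act(self, instruction: str, camera_frame: bytes) -> MotorCommand:
        # Planning and control share one set of activations,
        # so there's no lossy hand-off in between.
        return MotorCommand(joint_angles=[0.0] * 7)
```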
LLMs are not trained to be robots, and they will most likely never be tasked with low-level controls in robotics (i.e. generating long sequences of numbers for gripper positions and joint angles).
Perhaps I'm off-base; robotics isn't my area, but I have read some papers that indicate this is viable. This one in particular is pretty well-cited, and I've always suspected that a scale-up of training on video data of humans performing tasks could, with some creative pre-processing to properly label the movements therein, get us something that could output a series of limb/finger positions that would resolve a given natural-language request. From there, we'd need some output postprocessing of the kind we've seen inklings of in numerous other LLM+robotics papers to get a roughly humanoid robot to match those movements.
The above is certainly simplified, and I suspect that much more intensive image preprocessing involving object detection and labelling would be necessary, but I do think that you could get a proof of concept along these lines with 2025 tech if you picked a relatively simple class of task.
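For what it's worth, the pipeline I'm picturing decomposes into roughly the stages below; every function is a hypothetical placeholder, not an existing library call:

```python
# Rough outline of the proposed pipeline; the hard parts (pose estimation,
# motion modeling, retargeting) are only stubbed out here.
from typing import List

def label_human_movements(video_frames: List[bytes]) -> List[List[float]]:
    """Pre-process video of humans performing tasks: run pose estimation
    (and, realistically, object detection and labelling) per frame to get
    limb/finger keypoints."""
    ...

def train_language_to_motion(labeled_clips, captions) -> None:
    """Train a sequence model so a natural-language request maps to a series
    of limb/finger positions, analogous to next-token prediction over poses."""
    ...

def retarget_to_humanoid(keypoint_frames: List[List[float]]) -> List[List[float]]:
    """Post-process predicted human poses into joint targets a roughly humanoid
    robot can match: inverse kinematics, smoothing, joint limits."""
    ...
```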
Nope - that's not the one. Did a bit of searching, and it's here.