LLM robots can't pass butter (and they are having an existential crisis about it)

[-]Mitchell_Porter2mo70

The "Doom Spiral Trace" of Claude Sonnet 3.5's thoughts (see appendix D of the paper) is the most remarkable artefact here. Having an AI spontaneously produce its own version of "Waiting for Godot", as it repeatedly tries and fails to perform a mechanical task, really is like something out of absurdist SF.

We need names for this phenomenon, in which the excess cognitive capacity of an AI, not needed for its task, suddenly manifests itself - perhaps "cognitive overflow"?

[-]khafra1mo50

We need names for this phenomenon, in which the excess cognitive capacity of an AI, not needed for its task, suddenly manifests itself

It is so much like absurdist SF, that's the perfect source for the name--The Marvin Problem: "Here I am, brain the size of a planet and they ask me to take you down to the bridge. Call that job satisfaction? 'Cos I don't."

[-]dr_s2mo40

I mean, we do this too! Like if you were doing a very boring, simple task you would probably seek outlets for your mental energy (e.g. little additional self imposed challenges, humming, fiddling, etc).

[-]ollie_2mo20

What sort of latency does it experience? How large are the prompts?

[-]lilkim20252mo10

LLMs are not trained to be robots, and they will most likely never be tasked with low-level controls in robotics (i.e. generating long sequences of numbers for gripper positions and joint angles).

Perhaps I'm off-base; robotics isn't my area, but I have read some papers that indicate that this is viable. This one in particular is pretty well-cited, and I've always suspected that a scale-up of training on video data of humans performing tasks could, with some creative pre-processing efforts to properly label the movements therein, get us something that could output series of limb/finger positions that would resolve a given natural language request. From there, we'd need some output postprocessing of the kind we've seen inklings of in numerous other LLM+robotics papers to get a roughly humanoid robot to match those movements.

The above is certainly simplified, and I suspect that much more intensive image preprocessing involving object detection and labelling would be necessary, but I do think that you could get a proof of concept along these lines with 2025 tech if you picked a relatively simple class of task.

[-]Lukas Petersson2mo20

RT-2 (the paper you cited) is a VLA, not LLM. VLAs are what the "executor" in our diagram uses.

[-]lilkim20252mo20

I think I might've been unclear. My understanding is:

An LLM can be fine-tuned to perform low-level robotic actions, like adjusting the position of a gripper, using sets of action tokens (as distinct from language or vision tokens).
- OpenVLA might be a better example than RT-2, given that it's explicitly fine-tuned from LLaMA in roughly the manner I described.
The executor, in this experiment, is a hard-coded script that responds to e.g. "rotate" or "go forward" with a fixed action.
- I.E. there isn't a neural network taking in "go forward" and outputting a series of motor actions, there's something akin to an if-then statement that identifies "go forward" in the LLM output and engages the left and right motors at equal power for some predefined or LLM-specified amount of time.

A VLA of the kind I was thinking of would be both the Orchestrator and the Executor, both encapsulating high-level 'common sense' knowledge that it uses to solve problems, and also directly specifying the low-level motor actions that carry out its tasks.

In practice, this would contrast approaches that give the LLM a Python console with a specialized API designed to facilitate robotics tasks (Voxposer, for instance, gives a 'blind' LLM an API that lets it use a vision-language model guide its movements automatically), or, like I think you're describing, pair a pure LLM or a VLM tasked with high-level control with a VLA model tasked with implementing the strategies it proposes. The advantage, here, would be that there's no bottleneck of information between the planning and implementation modules, which I've noticed is the source of a decent share of failure-cases in practical settings.

LESSWRONG
LW

LESSWRONG
LW

105

LLM robots can't pass butter (and they are having an existential crisis about it)

105

105