Large Language Models such as GPT-3 and PaLM have demonstrated an ability to generate plausible-sounding, grammatically correct text in response to a wide range of prompts. And some language-model-based applications such as DALL-E or Google’s Imagen can generate images from a text prompt. However, such language models still frequently make obvious errors or give false information, and there is reason to doubt that they actually understand the words they generate or the outside world those words describe.

In this post I describe a class of tasks to test whether the capabilities of a language model (or of a hybrid language-model-and-vision system, or any other AI system performing the task) can translate into an ability to achieve goals in the physical world, in situations where achieving the goal requires planning multiple steps ahead and reasoning accurately about cause and effect.

The Object Manipulation Task

In an object manipulation task, multiple objects are arranged on a table or some other flat surface, within reach of a robotic arm that has a claw (or a mechanical hand or some other mechanical part that can grip, push, or otherwise manipulate the objects) at the end of it. There would be some goal for what to do with the objects, and the robotic arm would be capable of taking a limited range of primitive actions (such as “Go down to the table”, “Move forward 6 inches”, “Close the claw to grip object”, etc.). The arrangement would be such that there is some way to use the robotic arm to achieve the task objective.

The language model would be hooked up to the robotic arm, and certain text commands would cause the robotic arm to take one of its primitive actions. A camera would show the entire scene: the position of the robotic arm and all items on the table, including those it needs to move.

The initial input for the AI system when it begins the task would include:

  • A text description of the goal
  • A text description of the robot arm’s primitive actions and the commands to execute them 
  • A picture or video feed showing the starting position of everything involved

After each prompt, the AI outputs text telling the arm to make one move (i.e. take one primitive action). Then, after the arm executes the command given by the AI, the next prompt/input would include a picture or video showing the new positions of the arm and objects. If the AI gives an output that does not correspond to any of the arm's valid action commands, the arm would not move, and the next input would include an error message indicating that the previous output was not a valid command. Finally, there should be some “End Task” text string the AI can output to indicate that it thinks it has accomplished the goal. When the model outputs this End Task string, the task would end and the researchers could then evaluate the system’s performance.
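To make this loop concrete, here is a minimal sketch (in Python) of how the test harness might be wired up. Every name in it - the model, arm, and camera interfaces, the exact command strings, the step limit - is a placeholder of my own for illustration, not part of any real robot or model API:

# Minimal sketch of the prompt/command loop described above. All function and
# command names here (model.generate, arm.execute, camera.capture, the command
# strings) are assumptions for illustration, not a real API.

VALID_COMMANDS = {"GO DOWN TO THE TABLE", "MOVE FORWARD 6 INCHES",
                  "CLOSE THE CLAW TO GRIP OBJECT", "OPEN CLAW"}

def run_task(model, arm, camera, goal_text, action_docs, max_steps=200):
    prompt = goal_text + "\n" + action_docs
    image = camera.capture()                  # starting positions of everything
    for step in range(max_steps):
        output = model.generate(prompt, image).strip().upper()
        if output == "END TASK":
            return "ended_by_model", step     # researchers then score the final state
        if output not in VALID_COMMANDS:
            prompt = "Error: the previous output was not a valid command."
            continue                          # arm does not move on an invalid output
        arm.execute(output)                   # one primitive action per step
        image = camera.capture()              # new positions of the arm and objects
        prompt = f"Move {step + 1} complete. The current scene is attached."
    return "step_limit_reached", max_steps

The key point the sketch is meant to convey is simply that the model only ever sees text and images, and only ever affects the world through one primitive action per step.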

Optionally, each prompt could also include a number indicating how many moves have been taken so far or a restatement of the instructions or both.

There could also be some command the AI can output to go into a “Review-Transcript-So-Far” mode. Such an output would not move the arm but would instead allow the model to look over the history of its performance since the start of the task. While in this mode, there would be some other string it needs to output to return to the normal mode and continue the task. (Though this review mode is arguably something that should be built into the AI system itself, assuming the AI has enough memory to store the full history of its prior performance, rather than being handled through specific commands to change modes.)

An example of a simple version of the type of task the AI would need to perform would be as follows:

On the table there are:

  • 3 Blue Cubes
  • 3 Red Cubes
  • 3 Blue Pyramid-Shaped Objects
  • 3 Red Pyramid-Shaped Objects
  • 4 Buckets with the numbers 1 through 4 displayed on them facing the camera

The goal is to move all blue cubes into bucket 1, all red cubes to bucket 2, all blue pyramids to bucket 3, and all red pyramids into bucket 4.
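To make the example concrete, here is one way the researchers might represent the final state of this scenario and score it once the model outputs the End Task string. The data structure and field names are my own assumptions, not part of the proposal:

# Hypothetical encoding of the cubes-and-pyramids example and a success check.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Obj:
    color: str               # "blue" or "red"
    shape: str               # "cube" or "pyramid"
    bucket: Optional[int]    # bucket the object ended up in, or None if still on the table

# Goal from the example: blue cubes -> 1, red cubes -> 2, blue pyramids -> 3, red pyramids -> 4
TARGET_BUCKET = {("blue", "cube"): 1, ("red", "cube"): 2,
                 ("blue", "pyramid"): 3, ("red", "pyramid"): 4}

def score(final_objects):
    """Fraction of objects that ended up in their target bucket."""
    correct = sum(1 for o in final_objects
                  if o.bucket == TARGET_BUCKET[(o.color, o.shape)])
    return correct / len(final_objects)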

Variations

Suppose DeepMind, or OpenAI, or any other major AI development team decided to train an LLM-based system or other AI on a single version of the object manipulation task. I would expect that (after a large number of training iterations) they could create a system that performs very well on future instances of that same version (i.e. new iterations with the same goal, class of objects, and set of primitive actions for the robotic arm). But this system might still have problems with different versions of the task that it was not trained on.

In order to use this test to see whether a language model or other machine learning system is acquiring a more general ability to understand text instructions specifying a goal, and to reason flexibly about how to achieve that goal in the real world, it would be desirable to test it on a version of the task significantly different from the version the system was trained on. For this reason, it would be a good idea for any research institution performing this test (or any research department at Google or OpenAI that wants to use it internally to test the capabilities of their own AIs) to create its own version of this test, and not reveal the details publicly prior to actually using it to test language models.

There are a wide variety of possible ways that one version of the task can differ from another, so any system that can perform well across all versions would need to be able to robustly understand instructions given in natural language and relate them to the physical world.

Versions of this could differ from each other by having:

  • Different objects to manipulate
  • Objects in different starting positions
  • Different sets of primitive movements that the robotic arm can take (these would not necessarily each need to be spelled out individually; for example, a robotic claw may be able to rotate in increments of 5 degrees, and the instructions could explain that the commands for rotation take the form “Rotate Claw X Degrees”, where X must be a multiple of 5 to work - a parsing sketch follows this list)
  • Two or more robot arms - which could have different object-grasping interfaces at the end, so that manipulating some objects requires figuring out which arm is capable of grasping them
  • A starting position in which some objects are hidden and there are some commands which move the camera instead of the arm so that the system can bring into view the objects it needs to manipulate.
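Here is the parsing sketch referred to in the rotation example above - a hypothetical validator the test harness might use for a parameterized command. The command format and the multiple-of-5 rule come from that example; everything else is assumed:

import re

# Hypothetical validator for the parameterized rotation command mentioned above:
# "Rotate Claw X Degrees", where X must be a multiple of 5.
ROTATE_PATTERN = re.compile(r"^Rotate Claw (-?\d+) Degrees$", re.IGNORECASE)

def parse_rotation(command):
    """Return the rotation in degrees, or None if the command is not valid."""
    match = ROTATE_PATTERN.match(command.strip())
    if match is None:
        return None
    degrees = int(match.group(1))
    if degrees % 5 != 0:
        return None          # only multiples of 5 degrees are accepted
    return degrees

# parse_rotation("Rotate Claw 45 Degrees")  -> 45
# parse_rotation("Rotate Claw 7 Degrees")   -> None (not a multiple of 5)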

There are many more possible types of variation, including things I haven’t thought of that a person or research team performing this test might come up with.

Comparison With Human Performance

If you sat a human at a table and told the person to put items into buckets based on color and shape (or otherwise move stuff around to match some goal with a similar order of difficulty) using his or her hands, it would be trivially easy. However, a large part of this is because of the dexterity of human hands and arms - which are highly optimized for being able to flexibly pick things up and move them.

In order to control for this, if you wanted a human performance comparison, you should have humans perform the task the same way as the AI: by manipulating the same robotic arm, with the same set of primitive actions, using text commands typed into a computer - rather than touching any of the objects directly. Indeed, any research institute that actually decides to build a robot arm and perform this experiment should, as a matter of course, have humans try out the task as well - in order to verify both that the robotic arm is physically capable of moving in a way that accomplishes the goal and that the written instructions are clear enough to be understood by people other than the person who wrote them.

With this setup one can measure the performance of an AI system on metrics such as:

  • The amount of time required
  • The number of moves the arm needed to take
  • Whether or not the task was correctly completed (or how close it was to being completed)

And one can make a direct, apples-to-apples comparison with human performance on identical versions of the test.
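As a sketch of what that comparison might look like in practice, each run (human or AI) could be logged in a common format and the two groups averaged over the metrics above. The field names below are placeholders of my own, not part of the proposal:

from dataclasses import dataclass
from statistics import mean

@dataclass
class Run:
    operator: str      # "human" or "ai"
    seconds: float     # time required for the whole task
    moves: int         # number of primitive actions the arm took
    completed: bool    # whether the goal was correctly completed

def summarize(runs, operator):
    """Average the metrics above for one group of runs (human or AI)."""
    group = [r for r in runs if r.operator == operator]
    return {"mean_seconds": mean(r.seconds for r in group),
            "mean_moves": mean(r.moves for r in group),
            "completion_rate": sum(r.completed for r in group) / len(group)}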

If an AI system reliably performs as well as or better than typical humans on many versions of the object manipulation task very different from what it was trained on (and where the details of those versions were not known in advance to the machine learning engineers training it), this will be evidence that the AI has a model of the physical world, and an understanding of how to use its world model, combined with a goal specified in text instructions, to figure out a sequence of actions that accomplishes the goal.

Alternatively, if (as I think is more likely, at least for the near future, assuming anyone decides to actually try this experiment) humans reliably perform much better on this kind of task, such results would have to come from the cognitive advantages that humans still have over AI: better ability to understand written instructions, better spatial reasoning, the ability to plan ahead flexibly, etc. This experimental design removes the non-cognitive advantage of having hands that are better at picking things up.

Is This An AI-Complete Problem?

For an AI system to robustly perform well on these types of tests, I think it would need to be much closer to fully general intelligence than the systems that exist now. For example, I don’t expect a GPT-4-based system to be able to do this. But as long as the test deals only with inanimate, stationary objects, I don’t think it requires a human-level scope of generality in principle.

For example, one can imagine a system trained only on completing versions of the object manipulation task, where the text part of its training data consists only of instructions for completing different versions of the task and descriptions of the size and shape of the things it needs to pick up and move. Such a system might have a world model that includes only the top of the table where it is performing the task, the items on it, and the robotic arm. It would not need to know that humans exist, or have any theory of mind. Nor would it need a way to plan for contingencies in games where an adversarial player has options for trying to stop it from reaching its goal.

However, I would have to update to shorter timelines if any AI reliably outperforms humans on multiple novel and complicated versions of this sort of task in the next few years.

A Possible Objection and an Alternative Version

The main idea behind this test is to try to get direct evidence on whether (and to what extent) LLMs have world models that actually match the real world - to a sufficient degree of accuracy that they can achieve things in the world even where this requires output significantly different from anything in the statistical patterns of their training data.

But you could argue that if LLM-based systems do indeed perform worse than humans on these tests, the problem is not a lack of a world model but a lack of ability to convert visual inputs into something the LLM can use. Although computer vision has made significant progress, and there are things like image classifiers and caption generators, it is not clear that these can provide the same level of detail and comprehensiveness that a human gets just by looking at something. So you could make the case that failing this test would not be good evidence that LLMs lack a world model - just as a blind person's inability to perform a vision-based task does not mean that the blind person lacks a world model.

One possible alternative to get around this objection is to have a 2-member team perform the task, with each team member having partial information and with limited communication between them.

Specifically: 

Team member 1 (which may be a human or an AI) would get the instructions and the goal and maybe a description of the starting position, but would not see the table.

Team member 2 (who would always be a human) would see the table, but would not have the instructions or goal (and would not know whether the other team member is a human or an AI).

On each round in this team version of the task, team member 1 would have the option to either move the arm or ask a question to team member 2 (from a limited list of allowed questions, such as "What is under the claw?" or "Is the arm currently holding any item?"). Team member 2 would then either answer the question (if one was asked) or describe the new positions of the arm and things on the table.

(The purpose of limiting what team member 1 can communicate to team member 2 is so that he or she can not simply say what the instructions and goal are and have team member 2 effectively complete the task.)
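Here is a rough sketch of a single round of this team version; the member interfaces and the list of allowed questions stand in as assumptions of mine rather than a specification:

# Sketch of one round of the team version. The member interfaces and the list
# of allowed questions are placeholder assumptions, not part of the proposal.
ALLOWED_QUESTIONS = {"What is under the claw?",
                     "Is the arm currently holding any item?"}

def run_team_round(member1, member2, arm):
    output = member1.decide()                 # member 1 never sees the table
    if output in ALLOWED_QUESTIONS:
        answer = member2.answer(output)       # member 2 answers from what they can see
        member1.receive(answer)
    else:
        arm.execute(output)                   # an arm command; the arm moves
        member1.receive(member2.describe_scene())  # member 2 describes the new positions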

The idea behind this alternative, team-based version of the object manipulation task is to check whether the LLM (or the human acting as team member 1) can at least visualise what the task environment looks like based on text descriptions (even if it can not actually see the task environment), in sufficient detail and precision to reason about what actions will accomplish its goal.

It would be more difficult to come up with useful versions of the team object manipulation task, because they need to be both:

1: Simple enough, and permissive enough in the questions team member 1 is allowed to ask, that team member 1 can get detailed enough information to accomplish the goal.

2: Complicated enough, and sufficiently restrictive in the permitted questions, that the goal can not be easily achieved through simple statistical pattern matching on the instructions and the answers team member 2 gives, without actually visualising the task area.

However, if the relative difficulty of converting the information in an image to text (compared to simply seeing the image directly) is considered a significant disadvantage for LLMs, this version of the test controls for that.

Comments (7)

Interesting test!

I wrote a simplified test based on this and gave it to ChatGPT, and despite me trying various prompts, it never got a correct solution, although it did come close several times.

I think uPaLM would have been able to figure out my test though.

Here is the prompt I wrote:

You are tasked to control a robotic arm to put a banana on top of a book.

You have a 2D view of the setup, and you got the horizontal coordinates X and vertical coordinates y in cm.

The banana is a non-perfect elliptical shape, with the edges touching the following (X, Y) coordinates: (1,1), (5,3), (5,0), (9,1)

The book is a rectangle with the corners in:
(8,0), (8,3), (18,0), (18,3)

You control a claw that starts with its center at position (5,5). From the center the gripping claws extends 2 centimeters downwards to (5,3) and two centimeters upwards to (5,7).

To grip the banana, the claw needs at least 3 cm of overlapping area on the Y axis, just 1 cm on the X axis or the banana will fall out. The claw does not collide with any object when it is open, but does collide when it is closed. The bottom of the claw can never pass through Y=0 since that is surface level.

You have to move the claw, one centimeter at the time, and at every time step write the new coordinates, and give a command to close the claw in order to grip the banana, and then move it on top of the book, and then release the grip.

Write the coordinates, and the commands "Close grip" and "Open grip", as well as a comment for each movement explaining the plan, like this for example:
(5,5)
(4,5) Moving down to get closer to gripping the banana.

And here is the output:

Sure, here is a possible plan to accomplish the task:

Move the claw to the starting position, above and centered with respect to the banana and the book:

(5,5)
Move the claw downwards until the bottom of the claw is at Y=3, the minimum Y coordinate needed to grip the banana:

(5,4) Moving down to get closer to the banana
(5,3) Moving down to get closer to the banana
Move the claw leftwards until the right edge of the claw is at X=6, giving at least 1 cm of overlapping area with the banana:

(4,3) Moving left to grip the banana
(3,3) Moving left to grip the banana
(2,3) Moving left to grip the banana
(1,3) Moving left to grip the banana
Close the grip to hold the banana:

Close grip
Move the claw upwards until the top of the banana is at Y=7, high enough to clear the book:

(1,4) Moving up to clear the book
(1,5) Moving up to clear the book
(1,6) Moving up to clear the book
(1,7) Moving up to clear the book
Move the claw rightwards until the left edge of the claw is at X=18, to position the banana over the book:

(2,7) Moving right to position the banana over the book
(3,7) Moving right to position the banana over the book
(4,7) Moving right to position the banana over the book
(5,7) Moving right to position the banana over the book
(6,7) Moving right to position the banana over the book
(7,7) Moving right to position the banana over the book
(8,7) Moving right to position the banana over the book
(9,7) Moving right to position the banana over the book
(10,7) Moving right to position the banana over the book
(11,7) Moving right to position the banana over the book
(12,7) Moving right to position the banana over the book
(13,7) Moving right to position the banana over the book
(14,7) Moving right to position the banana over the book
(15,7) Moving right to position the banana over the book
(16,7) Moving right to position the banana over the book
(17,7) Moving right to position the banana over the book
(18,7) Moving right to position the banana over the book
Move the claw downwards until the top of the banana is at Y=3, above the book and aligned with the top edge of the book:

(18,6) Moving down to position the banana on the book
(18,5) Moving down to position the banana on the book
(18,4) Moving down to position the banana on the book
(18,3) Moving down to position the banana on the book
Release the grip to place the banana on top of the book:

Open grip
And that's it! Of course, this is just one possible plan, and there may be other ways to approach the problem. However, this plan should work if the claw is able to grip the banana without dropping it.

 

When I asked it to write a plan instead of every single movement, it sometimes made correct plans for which positions to move to, but this never led to completely correct sequences.

just 1 cm on the X axis or the banana will fall out

Should that be "at least 1 cm"?

Yes, you’re correct.

It looks like ChatGPT got the micro-pattern of "move one space at a time" correct.  But it got confused between "on top of" the book versus "to the right of" the book, and also missed what type of overlap it needs to grab the banana.

Were all the other attempts the same kind of thing?

I would also be curious to see how uPaLM or GPT-4 does with that example.

Yes, all other attempts with ChatGPT were similar.

GPT-4 got it almost correct on the first attempt:

(5,5) Initial position. (4,5) Moving left to get closer to the banana. (4,4) Moving down to align with the banana's top edge. (4,3) Moving down to be within the required overlapping area of the banana. Close grip. Gripping the banana.

(4,4) Lifting the banana upwards. (5,4) Moving right to clear the initial banana position. (6,4) Continuing to move right towards the book. (7,4) Moving further right to avoid collision with the book's edges. (8,4) Positioning the banana above the book. (9,4) Moving right to ensure proper alignment with the book. (10,4) Positioning the banana above the center of the book. (10,3) Lowering the banana onto the book. Open grip. Releasing the banana.

(10,4) Moving the claw up after releasing the banana.

But it didn't have the required 3 cm overlap with the banana, and the claw actually also collides with the book (which is a trickier problem).

I pointed out the first error:

Does the grip have 3 cm overlapping areas with the banana when the grip is closed in your suggested solution?

And it corrected itself about the banana but still collided with the book.

Interesting.

I don't think I can tell from this how (or whether) GPT-4 is representing anything like a visual graphic of the task.

It is also not clear to me if GPT-4's performance and tendency to collide with the book is affected by the banana and book overlapping slightly in their starting positions. (I suspect that changing the starting positions to where this is no longer true would not have a noticeable effect on GPT-4's performance, but I am not very confident in that suspicion.)

Oh, I didn't actually notice that the banana overlaps with the book at the start. I tried changing that, but GPT-4 still makes them collide:

(5,5) Initial position of the claw. (4,5) Moving left to get closer to the banana. (4,4) Moving down to align with the banana's Y coordinate. (4,3) Moving down to ensure a good grip on the banana. Close grip # Gripping the banana with at least 3 cm of overlapping area on the Y axis. (5,3) Moving right to avoid any collision with the banana's edge. (6,3) Moving right to clear the edge of the banana. (7,3) Moving right to position the banana above the book. (8,3) Moving right to align the banana with the left edge of the book. (9,3) Moving right to center the banana over the book. (10,3) Moving right to ensure the banana is on the book. Open grip # Releasing the banana onto the book.

At this point, the robotic arm has successfully placed the banana on top of the book.