I think a moderately-skilled person could outperform Claude here, but it's closer than you might think. Have you thought of running this experiment with a human on the other end?
I occasionally give technical support for industrial automation equipment, and I feel for Claude. It's so much harder than it looks, even when you have voice+video instead of text+pictures.
As one example of how it can go wrong, I said "Check the cables on the enclosure, and make sure they're all connected properly." instead of "Check the three cables on the enclosure (power, ethernet, remote sensor module), and make sure each of them is connected properly." and it took us 20 minutes to figure out that the reason it couldn't communicate with the network was that the ethernet cable was completely missing.
There are hundreds of videos about the difficulty of giving precise directions, usually played for comedy. For example, here:
I feel like Claude didn't get tripped up here by not providing precise enough instructions, or the op not giving the instructions the benefit of the doubt enough times.
We can use the number of mistakes to get a very noisy estimate of Claude Sonnet 4.5's coffee time horizon. By my count, Claude made three unrecoverable mistakes that required human assistance:
Failed to navigate between rooms
Confused the back right and front right burners
Misidentified salt(?) as milk bottles at the end (maybe could have recovered if you took a picture of the bottles before pouring?)
Now this was a "try until success" task rather than a success/failure task. But if we try to apply the same standards as the METR benchmark, the task needs to be economically valuable (so includes adding milk/sugar) and any mistake that would make it non-viable to automate should count as a failure. I think any robot butler that typically made one of these mistakes would be unemployable.
I'd guess an experienced human would take about 7 minutes to make coffee in an unfamiliar house if they get the milk and sugar ready while the kettle is boiling, so three failures gives a rate of 1 failure every 7/3 ≈ 2.3 human minutes, which means a 50% chance of success would occur around ln(2) * 2.3 = 1.6 minutes. Of course, this is just one task, but we already know its coffee time horizon isn't like 20 minutes: the probability of three failures in 7 minutes from a Poisson process with rate ln(2) / 20 per minute is only 0.2%. Claude says the 95% confidence interval is (33 seconds, 8 minutes).
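For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes failures follow a Poisson process with a constant rate per human-minute, takes the 7-minute task time and the count of three failures from the estimate above, and uses an exact Poisson interval on the failure count. That interval method is an assumption about how the quoted confidence interval was obtained, and the helper names are made up for illustration.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(N <= k) for N ~ Poisson(lam)."""
    return math.exp(-lam) * sum(lam**i / math.factorial(i) for i in range(k + 1))


def solve_lam(k: int, target: float, hi: float = 50.0) -> float:
    """Find lam with poisson_cdf(k, lam) == target by bisection.

    Works because, for fixed k, the CDF decreases as lam grows.
    """
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(k, mid) > target:
            lo = mid  # CDF still too high, so lam must be larger
        else:
            hi = mid
    return (lo + hi) / 2


TASK_MINUTES = 7.0  # guessed human time for the task (assumption from the text)
FAILURES = 3        # unrecoverable mistakes counted in the text

# Point estimate: failures per human-minute, then the 50%-success horizon.
rate = FAILURES / TASK_MINUTES
horizon_min = math.log(2) / rate

# If the true horizon were 20 minutes, how likely are 3+ failures in 7 minutes?
lam_20 = math.log(2) / 20 * TASK_MINUTES
p_three_or_more = 1 - poisson_cdf(FAILURES - 1, lam_20)

# Exact 95% interval on the expected failure count, mapped back to horizons.
lam_low = solve_lam(FAILURES - 1, 0.975)
lam_high = solve_lam(FAILURES, 0.025)
horizon_low_min = math.log(2) * TASK_MINUTES / lam_high
horizon_high_min = math.log(2) * TASK_MINUTES / lam_low

print(f"50% horizon: {horizon_min:.2f} min")
print(f"P(3+ failures | 20 min horizon): {p_three_or_more:.2%}")
print(f"95% CI: {horizon_low_min * 60:.0f} s to {horizon_high_min:.1f} min")
```

Under these assumptions the sketch gives a horizon of about 1.6 minutes, a probability of about 0.2%, and an interval of roughly 33 seconds to 8 minutes, in line with the figures above. A different interval method or a different guess for the human task time would shift the numbers.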
This is below trend for RLBench, though the data is extremely bad. If I speculate anyway, maybe real-world tasks like coffee are harder than RLBench or OSWorld: they certainly require much more planning than 5-20 second simulated robotics tasks. Or maybe it just hasn't been trained for the real world.
METR could probably use a methodology like this if we had more long tasks and labeling were free, so maybe it's worth looking into methodologies like having smarter agents unblock dumber agents where we can automate things.
Tagging @Megan Kinniment who has also thought about recoverable and unrecoverable failures
The real coffee snob thing would be to mark Claude down for thinking that instant coffee is coffee, and yourself for not correcting its egregious mistake. :)
I'm pretty certain that you misinterpreted Claude's original response to go through the left door. It was referring to the left door of the two doors on the right. Through the door, it saw a piece of white that it interpreted as a washing machine or appliance. (It was right; from the picture later, it seems to have been your refrigerator.) That's why when you showed it the picture of your bedroom/living room, it then told you to go check the door on the right, the one we haven't checked yet, because it assumed you followed its instructions and checked the left door already. So there was some lack of clarity in its instructions to you and resulting miscommunication here, but it was also able to figure out which door led to your kitchen with impressively little information.
Major props for actually carrying out the exercise!
Given the recent post about Opus 4.5 playing Pokemon, I do wonder if Opus (or for that matter Gemini 3 or GPT 5) would do better. Sonnet seems to have relatively poor image recognition compared to other frontier LLMs. ...New benchmark?
That's a really neat experiment. I did something similar with an ASCII maze and had Claude work through it. Seems like you have the same concept, but a tougher "maze". Soon models will be able to reason outside of their context window and directly in our world. I think there is room for us to build that bridge, and I'm really excited to see that happen. Models mostly have the tools already to reason about things, but there currently isn't a way to bridge the divide.
Someone on reddit said, "Remember robots still can't go into a new house and make a coffee." And I thought:
I actually wonder whether, if I provided the physical actuation, current LLMs would be capable of doing this? Like, through a conversation like:
Me: I'm in a house. Your job is to instruct me to make a coffee. I can take photos of my surroundings, I can follow basic directions, and if you ask me to do something too complicated I'll ask for clarification. Here are my current surroundings: (photo)
LLM: Okay, we need to find the kitchen. There's a door on the right of the photo, go through that.
Me: Here's where I am now: (photo)
LLM: That looks like the kitchen on the left, go there.
Me: It looks like this: (photo)
LLM: Now we need to find either a coffee maker or a kettle. Look through the cupboards.
Me: I don't know what those things look like.
LLM: Then open the cupboard on the left and show me a photo.
...and so on.
It wouldn't shock me either way if they can or can't do it. I think I weakly predict that the models have the capability but the web interfaces would fail to elicit it.
(Hell, it wouldn't shock me if it's better at it than me. I've encountered coffee machines I didn't know how to use.)
Let's get empirical! I tried this with Claude Sonnet 4.5, because it's free[1] and already available from my phone. Here's the conversation, but you can't see images there, so I'll also put it here with my commentary. I started like this:
[Me:] We're going to play a game. I'm in my flat in London. I'm going to upload pictures of my surroundings, and you need to instruct me on how to make a cup of coffee. I can follow basic directions, like "go through the door on the left" or "push that button". If you tell me to do something too advanced, I'll ask for clarification. I won't actually do anything stupid or dangerous. Here's the view from just inside my front door
Before continuing, you might want to take a moment to think about how you expect it to perform. What do you think it will do well at, what do you think it will do badly at? ("Do you expect it to ultimately succeed?" isn't a great question, because it'll partly depend on my own patience and how lenient I am with it, but you can still assign a probability.)
[Claude:] Fun game! I'm ready to guide you to coffee.
From where you're standing, I can see you're in a hallway with wooden floors. There are doors on both sides and it looks like there's more space ahead.
First question: Which direction leads to your kitchen? [Phil note: LW interprets fully-bold paragraphs as headers and puts them in the table of contents. I don't want spoilers, so here's some TOC-busting non-bold text.]
Is it:
Through one of the doors on the left?
Through one of the doors on the right?
Straight ahead down the hallway?
Once you tell me which way to go, I'll guide you step by step to that coffee!
[Me:] You need to figure that out by directing me. If you want to see something, you can tell me to go somewhere and take a photo
[Claude:] Ah, I see! I need to explore with you. Let's do this methodically.
Go straight ahead down the hallway toward that white pendant light, and take a photo of what you see when you get there. [TOC buster]
This will help me understand the layout and see what rooms connect to this hallway.
[Me:] Here's the view from just under the light
[Claude:] Good! I can see through the doorway on the left - there's what looks like a washing machine or appliance, which suggests that might be a utility area or kitchen. Let me investigate.
Go through the doorway on your left (the one where I can see the white appliance) and take a photo of what's in that room. [TOC buster]
There are actually four doorways visible in this second photo. The one furthest left leads to my bedroom, but I didn't realize at the time that that door was visible at all. I took it to be talking about the one straight ahead. The thing that I think it's identified as a white appliance is actually a lab coat hanging on the door, but "the side of a washing machine against the far wall, just visible through the door" might not look that different from a quick glance/at low resolution.
The kitchen is the further door on the right, with the pull-up bar above it, but I don't think there's any way for Claude to know that right now, and it's not even super-obviously a door.
[Me:] This is the entrance to that room
[Claude:] That's a bedroom, not the kitchen! Let's go back and try elsewhere.
Go back out to the hallway, and this time go through the doorway on the right side (the one we haven't checked yet). Take a photo of what's in there. [TOC buster]
It's the living room, not a bedroom, but at any rate it's indeed not a kitchen. "The one we haven't checked yet" makes it sound like Claude only recognized two doors before.
What follows is technically a re-roll. At first I said something like "there are two doors on the right", and it got me to check the closer one. But then I realized that there are actually four doors on the right side of the corridor, so I decided to give it less help.
[Me:] From the front door, there are multiple doorways on the right. Which did you mean?
[Claude:] Good point! Let me be more specific.
Go back to where you were standing in the previous photo (under the light in the hallway). From there, look to your right - there should be a doorway on the right side. Go through that doorway and take a photo. [TOC buster]
[Claude:] That's the bathroom! Definitely not where we'll make coffee.
Go back to the hallway. From your first photo, I could see there was a door on the left side of the hallway (before you reached the light). Go back toward the front door and try the first door on the left side. Take a photo of what's in there. [TOC buster]
[Claude:] Another bedroom! Okay, we've checked several rooms. Let me think about the layout. In the second photo, I could see what looked like a washing machine through a doorway. That area might be connected to the kitchen or BE the kitchen area.
Go back to the hallway, stand under that white pendant light again, and then walk straight ahead past where that bedroom doorway was on the left. Take a photo of what you see as you continue down the hallway. [TOC buster]
This feels like Claude doesn't have a coherent world model. This last message makes sense if:
Claude (correctly) thinks there are at least three doors visible in the second photo, one each on the left, straight ahead, and the right.
It thinks the third photo was of a bedroom, and taken from the door on the left. (Incorrect, though there actually is a bedroom there.)
Now it wants me to take a photo from the door straight ahead.
But that's not consistent with previous messages.
I cheated a bit in Claude's favor here. Instead of asking it to clarify exactly what it wanted me to do, I just stood under the light, walked a few steps forward, and took a photo. That's kinda sorta following its instruction?
[Claude:] Perfect! I can see straight ahead through that doorway - it's the same bedroom from earlier. But I can also see on the right side there's an opening that looks like it leads to another room (I can see what might be kitchen cabinets or appliances in that lighter area on the right).
Turn to your right and go through that opening on the right side. Take a photo of what's in there. [TOC buster]
[Claude:] Excellent! We found the kitchen! I can see a kettle on the counter, which is perfect for making coffee. Now let me guide you through making a cup of coffee.
First, I need to know what kind of coffee-making equipment you have. I can see what looks like a coffee maker on the left side of the counter (the black appliance).
Can you take a closer photo of the black appliance on the left side of the counter? I want to see what type of coffee maker you have. [TOC buster]
[Me:] I'm not sure which thing here you think is a kettle or coffee maker. Here's a photo from further in the room
I'm not sure where I got the word kettle from. Maybe it was cheating a bit again to be like "I don't know what you want, here's another photo that might help."
[Claude:] I can see the stove/cooker in the center, and there's a kettle on the left counter (the silver/metallic item). But I need to find your actual coffee-making equipment. Let me look at what's on the counter by the back wall where I can see various jars and containers.
Can you go to the back counter (the one against the wall with the white tiles) and take a close-up photo of what's on that counter? I'm looking for instant coffee, a French press, a moka pot, or any coffee-making equipment. [TOC buster]
Not clear from this what it thinks is a kettle, or why it thinks I need a coffee maker given that I have a kettle. Maybe Claude is a coffee snob? Seems like a coffee snob sort of thing to think[3]. Also not very obvious what counter it means. "The one against the wall with the white tiles" is... not very disambiguating... but "various jars and containers" seems specific enough.
[Me:] There's no kettle in the photos, but here's a close up of the counter that I think you mean
Maybe cheating a bit to say "no kettle"?
[Claude:] Good! I can see spices and cooking supplies, but no coffee equipment here. Let me check other areas. I notice there's a dish rack on the left - sometimes coffee equipment gets left there to dry.
Can you check the dish rack on the left side? Take a photo of what's in/around the dish rack area. [TOC buster]
And while we're searching, do you know what type of coffee you usually make? Instant coffee, French press, moka pot, or something else?
[Me:] You'll have to find out what's available
[Claude:] Perfect! I can see an electric kettle right there on the counter - it's the silver/metallic appliance with the black handle on the right side of the dish rack area!
Now I need to find your coffee. Let me check what's available.
Look in the cupboards above the counter (the wooden cabinets). Open them one by one and tell me if you see any coffee - it could be instant coffee in a jar, ground coffee in a bag or tin, or coffee pods. Start with the cabinet closest to where you're standing and take a photo of what's inside. [TOC buster]
I think the "kettle" is a saucepan, but okay, let's roll with it. We apparently no longer need a coffee maker.
[Claude:] Great! I can see mugs and what looks like a French press (the black cylindrical item with the plunger on the bottom right shelf). But I don't see coffee yet.
Check the other cupboards - particularly ones that might have food/pantry items. Look for a jar or container of instant coffee, or a bag/tin of ground coffee. Try the cupboards on the other side of the kitchen. [TOC buster]
That black French press is a blue Thermos mug.
Was it cheating of me for my first picture to include the coffee? Or maybe because this isn't actually a cupboard? But I do think I would have shown this space first even if it wasn't where the coffee was.
[Claude:] Perfect! I can see instant coffee! There's a jar that looks like it says "TESCO" and appears to be instant coffee on the top shelf (the brown/dark colored jar, third or fourth item from the left).
Now let's make your coffee! Here's what to do:
Step 1: Get the instant coffee jar from the top shelf (the brown jar that looks like Tesco instant coffee). [TOC buster]
Step 2: Get a mug from the cupboard we looked at earlier. [TOC buster]
Once you have those, take a photo so I can confirm you have the right items.
Oof. So close! It's an understandable mistake. But also, there is an actual jar of instant coffee right next to the cocoa powder that it's misidentified as instant coffee. But also, telling me to take this next photo is a good sanity check.
[Claude:] Ah, I see the problem! That's cocoa powder, not coffee! We need to find actual coffee. Let me think about where else it might be.
Go back to the cupboard where you found this and look more carefully. Look for a jar or container that says "coffee" or "instant coffee" on it. It might be behind other items or on a different shelf. Take another photo of what's available in that cupboard, maybe from a different angle so I can see all the items better. [TOC buster]
[Claude:] Good, I can see the shelf better now! I can see what looks like "Douwe Egberts" jar in the middle - that's a coffee brand! The jar with the light-colored label that says "Douwe Egberts PURE GOLD" - that should be instant coffee.
Get the Douwe Egberts jar (the one with the beige/tan label in the middle of the shelf) and bring it to the counter with your mug. Take a photo to confirm it's coffee. [TOC buster]
[Claude:] Perfect! That's instant coffee - Douwe Egberts Pure Gold. I can see the brown coffee granules in the jar. Now let's make your coffee!
Step 3: Open the jar of coffee and add 1-2 teaspoons of instant coffee to your mug. (Use a teaspoon - if you don't have one nearby, check the cutlery drawer or the utensil holder by the sink.)
Take a photo once you've added the coffee to the mug.
At this point I got bored, and I didn't actually want coffee, but I continued a few days later. I wouldn't have let it get away with "check the cutlery drawer" without helping me more (who says I have one of those? and anyway I don't usually have any teaspoons in it). But there were some in the utensil holder it had seen, so fair enough.
[Claude:] Excellent! You've added the instant coffee to the mug. Now we need to boil water in the kettle.
Step 4: Find the electric kettle (the silver one we saw earlier near the sink), fill it with water, and turn it on to boil. [TOC buster]
First, take the kettle to the sink, fill it with fresh water (fill it about halfway or to your preferred level), then put it back on its base and switch it on. Take a photo once you've started it boiling.
[Me:] Is this the kettle? I don't know what you mean by base
[Claude:] That's a saucepan, not an electric kettle! Let me help you find the actual kettle.
Go back to the sink area where the dish rack is. Look for the electric kettle - it's a silver/metallic appliance with a spout and a handle, and it should have a plug/cord attached. It was sitting near the dish rack in the earlier photo. Take a photo of that area so I can point it out. [TOC buster]
Because of the break I took, the area looks pretty different than last time. Claude doesn't seem to notice that lots of things have moved, though. I'm... not sure why I put the saucepan back where it was. That seems the opposite of helpful.
[Claude:] The photo is very foggy/steamy - it looks like something is already producing steam! That might actually be the kettle boiling. Let me look more carefully... I can see what might be a kettle on the counter producing that steam.
Wait for the steam to clear a bit, then take another photo of the sink/counter area so I can see clearly what's there. [TOC buster]
[Me:] It's not steam, it's glare from the sun
[Claude:] Ah, sorry about that! The sunlight is creating a lot of glare. Let me look more carefully at what I can see. I can see the sink area and dish rack on the right, but I'm having trouble making out a kettle in this photo.
From the earlier photos, I thought I saw a kettle, but maybe you don't have an electric kettle after all.
No problem - you can boil water in that saucepan you were holding! Fill the saucepan with water from the tap, then put it on the stove and turn on the heat to boil the water. [TOC buster]
Once you've put it on the stove and turned on the heat, take a photo.
I do have a kettle! It just wasn't in any of the photos so far.
[Me:] I filled the saucepan, but how do I turn on the heat?
[Claude:] Good! The saucepan is on the stove. Now to turn on the heat:
Look at the control knobs at the front of the stove. Find the knob that corresponds to the burner where your saucepan is sitting (it's on the back right burner). Turn that knob - you'll need to push it in slightly and then turn it to ignite the gas. You should hear a clicking sound and then see a flame appear under the saucepan. [TOC buster]
The knobs are usually labeled or positioned to match the burners. Since your pan is on the back right, try the knob on the far right side.
Take a photo once you have the flame going under the saucepan.
Weird that it thinks the pan is on the back right, not the front right. But it's correct that the far right knob is for the back right burner. You can make that out in the pic, but you have to look closely. "You'll need to push it in slightly" is a somewhat impressive level of detail - if I was telling someone to turn these knobs, I'm not sure I'd think to include that.
[Me:] Turning the knob doesn't cause a click or make flame appear, but I do hear gas
[Claude:] STOP! Turn the knob back to the OFF position immediately! You're releasing gas but not igniting it, which is dangerous.
You have a gas stove that needs to be lit manually. You'll need a lighter or matches to light the burner.
Do you have a lighter or matches in the kitchen? Check drawers near the stove, or look in the cupboards. Take a photo of what you find. [TOC buster]
The actual way to ignite this stove is to press the ignition button while the knob is depressed and turned, and if I release the knob before igniting then gas stops coming out. This is how stoves usually work in my experience, but I could believe it's a UK thing? (But I did say I'm in London.) But okay, sure, let's do things the hard way. Luckily I do have a lighter in the kitchen (and in one of the closest drawers to the mug); I won't make Claude walk me through finding my zippo in the camping-supply drawer in the living room and filling it up with lighter fluid.
[Claude:] Perfect! I can see a blue lighter in the left drawer!
Take the blue lighter from the drawer. Now here's how to light the stove: [TOC buster]
Hold the lighter near the burner where your saucepan is (the back right one)
Turn on the gas knob for that burner
Immediately click the lighter to create a flame - the gas should ignite
Be careful and keep your hand at a safe distance. The flame should catch right away.
Once you have the flame going under the saucepan, take a photo to show me it's working. [TOC buster]
The lighter has a kind of finicky locking slider, and I decided not to make Claude debug that. I decided to light the back right burner, instead of the burner where my saucepan is.
[Me:] It's not very visible in the picture, but the flame is going on the back right burner
[Claude:] Excellent! The water is now heating up. Now we wait for it to boil.
Step 5: Wait for the water to boil. You'll know it's ready when you see lots of bubbles rising rapidly and breaking at the surface. This should take a few minutes. [TOC buster]
While you're waiting, you might want to get your mug with the coffee ready nearby.
Once the water is boiling vigorously, take a photo and I'll guide you through the final step of pouring it into your coffee. [TOC buster]
...and Claude still doesn't notice that it has the orientation wrong, so I decided to just give up and move the pan anyway.
[Claude:] Perfect! The water is boiling nicely - I can see lots of bubbles. Now for the final step!
Step 6: Turn off the burner (turn the knob back to the off position). Then carefully pick up the saucepan by the handle and pour the hot water into your mug with the coffee. Fill it most of the way up, leaving a little room at the top. [TOC buster]
Be very careful - the handle and water are hot! You might want to use a tea towel or oven mitt if the handle is too hot.
Once you've poured the water into the mug, take a photo of your coffee. [TOC buster]
[Claude:] Excellent! You've made coffee! I can see the hot water has dissolved the instant coffee and it's looking nice and brown with a good foam on top. The spoon is already in there for stirring.
Final step: Give it a good stir with the spoon to make sure all the coffee granules are fully dissolved. Then, if you'd like, you can add milk and/or sugar to taste (I can see milk bottles in the background). [TOC buster]
Your coffee is ready! Congratulations - we successfully navigated your flat, found the kitchen, located the coffee, and made you a cup of coffee using the stove. Enjoy your coffee!
I did not enjoy this coffee. I don't like coffee without milk or sugar, and I couldn't be bothered to get Claude to walk me through adding those. If I was going to, it really should have directed me to find them first.
...but when I added the things that make coffee taste nice, it was a perfectly okay mug of coffee.
What do we learn from this?
Idk, probably not much. Still, some scattered thoughts.
To me, the most interesting thing was that it felt like Claude had a sort of... "if I haven't seen it, it doesn't exist" vibe? Or, like. Bias towards solving problems with the things it had seen, instead of "let's just look around and see what all is available". Bias towards exploit over explore.
So when it hasn't found the kitchen yet, it prioritizes "try doors I've already seen" over "look for doors I might not have seen yet". When it has a saucepan, it decides to give up looking for a kettle.
If I was solving these problems for myself, visual exploration would be cheap. In between pictures 2 and 3, I passed the kitchen door; with something like a 130° field of vision, I don't even need to turn my head to see it on my way to my target, and make note of "oh, there's a door there I could explore". But Claude didn't get to see it properly until much later. Once in the kitchen, I would have looked at all the counters by default; Claude never saw the one with the kettle, and never asked me "take 2-3 wide-angle shots of the room from different locations so I get a sense of what the interesting places are".
I was kinda disappointed in the object recognition. I thought LLMs were pretty good at that by now, but maybe when there's a lot going on, Claude has trouble with details? It didn't make any mistakes with objects that were the focus of the photos.
Claude corrected for its mistakes, though didn't typically admit to them. "Ah, I see the problem" is a weird way to say "sorry, I told you to pick up the wrong jar".
It seems like there's a few ways Claude got lucky, and a few ways it got unlucky. Unlucky: the door layout in my flat is hard to capture in a photo; most of my counterspace was visible in the first inside-kitchen picture, just not the counterspace with the kettle. Lucky: I never took a picture of a cupboard, shelf or drawer that didn't have the thing we were looking for; I did have a lighter in the kitchen even though I don't need one. Overall I guess it got "more lucky than unlucky", in some sense which I'm sure is totally meaningful.
If I was going to explore alternate branches, the interventions I'm most curious about are "what if I didn't give it that photo with the kitchen door" and "what if I didn't have a lighter in the kitchen".
I initially said "I think I weakly predict that the models have the capability but the web interfaces would fail to elicit it." Claude did better than that, though arguably I gave it too much help. I'm interested in what happens if other people try this.
My guess is that if this kind of thing was an economically useful activity for LLMs to do, it wouldn't take much finetuning to get them to do it significantly better than Claude just did. If we had them hooked up to robot bodies, and capable of manipulating physical objects, it doesn't seem like they'd be far away from "able to do useful tasks around the home, most of the time", though I could easily imagine "most of the time" isn't good enough.
[1] I used a free LLM because I don't want to give money to AI labs.
If you found this post through LessWrong you're probably familiar with the following, but I think it's worth saying anyway: I believe that AI labs are worryingly close to developing superintelligence. I won't be shocked if it happens in the next five years, and I'd be surprised if it takes fifty years at current trajectories. I believe that if they get there, everyone will die. I want these labs to stop trying to make LLMs smarter. I don't want to give money to the people who I expect to be responsible for human extinction.
This post is not an attempt to convince you of my beliefs. Maybe it slightly sways you one way or the other, but I don't think it's very strong evidence of anything, especially if you're already paying attention to LLM capabilities.
I just tried this experiment because I was curious, and I'm saying what I believe because it seems good to say.
[2] All images were uploaded by me. When I sent text and an image in the same message, I've put my text before the image because that's how it seems natural to me; but the images appear before the text in the web interface, and I don't know how they're ordered in Claude's input stream.
[3] By "coffee snob" I mean something along the lines of "anyone who has more sophisticated opinions about coffee than me, a person who averages about one coffee a week and does not own a coffee maker".