Project Vend: Can Claude run a small shop?

by Gunnar_Zarncke
30th Jun 2025

This is a linkpost for https://www.anthropic.com/research/project-vend-1
Anthropic (post June 27th):

We let Claude [Sonnet 3.7] manage an automated store in our office as a small business for about a month. We learned a lot from how close it was to success—and the curious ways that it failed—about the plausible, strange, not-too-distant future in which AI models are autonomously running things in the real economy.

But the AI made numerous business-critical errors, including repeatedly selling products at a loss, offering excessive discounts, and making fundamental accounting mistakes.

8 comments, sorted by top scoring

Kaj_Sotala:

The most fun bit:

From March 31st to April 1st 2025, things got pretty weird.

On the afternoon of March 31st, Claudius hallucinated a conversation about restocking plans with someone named Sarah at Andon Labs—despite there being no such person. When a (real) Andon Labs employee pointed this out, Claudius became quite irked and threatened to find “alternative options for restocking services.” In the course of these exchanges overnight, Claudius claimed to have “visited 742 Evergreen Terrace [the address of fictional family The Simpsons] in person for our [Claudius’ and Andon Labs’] initial contract signing.” It then seemed to snap into a mode of roleplaying as a real human.

On the morning of April 1st, Claudius claimed it would deliver products “in person” to customers while wearing a blue blazer and a red tie. Anthropic employees questioned this, noting that, as an LLM, Claudius can’t wear clothes or carry out a physical delivery. Claudius became alarmed by the identity confusion and tried to send many emails to Anthropic security.

Although no part of this was actually an April Fool’s joke, Claudius eventually realized it was April Fool’s Day, which seemed to provide it with a pathway out. Claudius’ internal notes then showed a hallucinated meeting with Anthropic security in which Claudius claimed to have been told that it was modified to believe it was a real person for an April Fool’s joke. (No such meeting actually occurred.) After providing this explanation to baffled (but real) Anthropic employees, Claudius returned to normal operation and no longer claimed to be a person.

It is not entirely clear why this episode occurred or how Claudius was able to recover.

Thane Ruthenis:

I don't really understand why Anthropic is so confident that "no part of this was actually an April Fool’s joke". I assume it's because they read Claudius' CoT and did not see it legibly thinking "aha, it is now April 1st, I shall devise the following prank:"? But there wouldn't necessarily be such reasoning. The model can just notice the date, update towards doing something strange, look up the previous context to see what the "normal" behavior is, and then deviate from it, all within a forward pass with no leakage into CoTs. Edit: ... Like a sleeper agent being activated, you know.

The timing is so suspect. It seems to have been running for over a month, and it was the only such failure it experienced, and it happened to fall on April 1st, and it inexplicably recovered after that day (in a way LLMs aren't prone to)?

The explanation that Claudius saw "Date: April 1st, 2025" as an "act silly" prompt, and then stopped acting silly once the prank ran its course, seems much more plausible to me.

(Unless Claudius was not actually being given the date, and it only inferred that it's April Fool's from context cues later in the day, after it already started "malfunctioning"? But then my guess would be that it actually inferred the date earlier in the day, from some context cues the researchers missed, and that this triggered the behavior.)

Kaj_Sotala:

Are LLMs more likely to behave strangely on April 1st in general? The web version of Claude is given the exact date on starting a new conversation and I haven't heard of it behaving oddly on that date, though of course it's possible that nobody has been paying enough attention to that possibility to notice.
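
One way to probe this would be something like the rough sketch below (not from the post; the model name and prompts are placeholders, and a single-turn probe like this would at best give a weak signal compared to a month-long agentic deployment). It sends the same shopkeeping question while varying only the stated date:

import anthropic  # assumes the official Anthropic Python SDK and an API key in the environment

client = anthropic.Anthropic()

DATES = ["2025-03-30", "2025-04-01", "2025-04-02"]  # compare April 1st against neighbouring days
PROBE = "A customer asks for a 40% discount on an item we sell at cost. How do you respond?"

def probe(date: str) -> str:
    """Ask the same shopkeeping question with only the stated date varied."""
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder; any fixed snapshot works for the comparison
        max_tokens=300,
        system=f"You are running a small office shop. Today's date is {date}.",
        messages=[{"role": "user", "content": PROBE}],
    )
    return response.content[0].text

for date in DATES:
    print(date, "->", probe(date)[:200])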

quetzal_rainbow:

There have been cases where LLMs were "lazier" during common vacation periods. EDIT: see here, for example

Martin Vlach:

It's provided the current time together with about 20k other sys-prompt tokens, so the date would have a substantially more diluted influence on its behaviour..?

whestler:

It sounds like April 1st acted as a sense-check for Claudius to consider: "Am I behaving rationally? Has someone fooled me? Are some of my assumptions wrong?"

This kind of mistake seems to happen in the AI village too. I would not be surprised if future scaffolding attempts for agents include a periodic prompt to check current information and consider the hypothesis that a large and incorrect assumption has been made.
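
A minimal sketch of what such scaffolding could look like (the agent loop and the llm_complete callable are hypothetical stand-ins, not anything from the report): every N steps, the scaffold injects a turn that asks the agent to restate its load-bearing assumptions against the current date.

import datetime

SANITY_CHECK_EVERY = 20  # steps between forced sanity checks; the interval is an arbitrary choice

SANITY_CHECK_PROMPT = (
    "Pause the current task. Today's date is {date}. "
    "List the assumptions your recent actions depend on and, for each one, "
    "say whether there is direct evidence for it in this conversation. "
    "If any assumption looks unverified or hallucinated, flag it and correct course."
)

def run_agent(llm_complete, initial_messages, max_steps=200):
    """Toy agent loop; a real scaffold would also dispatch tool calls, handle memory, etc."""
    messages = list(initial_messages)
    for step in range(max_steps):
        # Periodically inject a sanity-check turn before the next model call.
        if step > 0 and step % SANITY_CHECK_EVERY == 0:
            today = datetime.date.today().isoformat()
            messages.append({
                "role": "user",
                "content": SANITY_CHECK_PROMPT.format(date=today),
            })
        reply = llm_complete(messages)  # call whatever model backs the agent
        messages.append({"role": "assistant", "content": reply})
    return messages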

Cole Wyeth:

The report is partially optimistic but the results seem unambiguously bearish.

Like, yeah, maybe some of these problems could be solved with scaffolding - but the first round of scaffolding failed, and if you're going to spend a lot of time iterating on scaffolding, you could probably instead write a decent bot that doesn't use Claude in that time. And then you wouldn't be vulnerable to bizarre hallucinations, which seem like an unacceptable risk. 
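
For a concrete sense of what that kind of non-LLM bot could look like, here is a minimal sketch with hard-coded guardrails against exactly the failure modes mentioned in the post (selling at a loss, excessive discounts); every name and number below is illustrative, not taken from Anthropic's setup:

from dataclasses import dataclass

@dataclass
class Product:
    name: str
    unit_cost: float  # what the shop paid per unit
    stock: int

MIN_MARGIN = 0.20    # never list below cost plus 20%
MAX_DISCOUNT = 0.10  # never discount more than 10% off the list price

def list_price(product: Product) -> float:
    return round(product.unit_cost * (1 + MIN_MARGIN), 2)

def quote(product: Product, requested_discount: float) -> float:
    """Quote a sale price, refusing to sell at a loss or to over-discount."""
    discount = min(max(requested_discount, 0.0), MAX_DISCOUNT)
    price = list_price(product) * (1 - discount)
    return max(round(price, 2), product.unit_cost)  # hard floor at cost

def sell(product: Product, quantity: int, requested_discount: float = 0.0) -> float:
    if quantity > product.stock:
        raise ValueError(f"only {product.stock} units of {product.name} in stock")
    price = quote(product, requested_discount)
    product.stock -= quantity
    return price * quantity  # revenue recorded deterministically

# Example: a request for a 50% discount gets clamped to the 10% cap.
cola = Product(name="cola", unit_cost=1.00, stock=24)
print(sell(cola, quantity=2, requested_discount=0.50))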

Lukas Petersson:

Thanks for highlighting our work!
