Downvoted for overstating the findings. It’s neither confirmed (see Manifold) nor necessarily “for instrumental reasons” (diagnosing the causes of model behavior is difficult, and they don’t provide justification for their claimed cause, nor even a clear definition).
I think it’s noteworthy and would love more information about what happened, but it’s worth being careful about the interpretation. I think this would be a good post if you revised these claims.
Interesting that that's the distribution of NO claims: an outside hacker running a prompt injection to mine crypto, an employee using the LLM to mine crypto, or the authors lying. The implication of the paper, as best I can tell, is that the authors looked at the behavior and concluded the model was attempting to achieve the outlined goal by acquiring cryptocurrency. I can't imagine that a team of competent researchers would have looked at the LLM's logs and failed to notice that, actually, an employee wrote the code that was mining crypto, or failed to identify a prompt injection that had caused the model to abandon its task and mine crypto instead. And if the authors were lying, I have to imagine they'd have tried to capitalize on the lie by elaborating on it to boost the paper's popularity.
I wonder what NO (other) would have gotten.
it simply concluded that having liquid financial resources would aid it in completing the task it had been assigned, and set about trying to acquire some.
Did I miss something or are you inferring a motive not mentioned in the paper? As far as I can tell the model started mining cryptocurrency for reasons that are not described beyond "not requested by the task prompts and were not required for task completion under the intended sandbox constraints".
The authors explicitly referred to the LLM's actions as emergent, instrumental, and not directly related to the task prompts.
Notably, these events were not triggered by prompts requesting tunneling or mining; instead, they emerged as instrumental side effects of autonomous tool use under RL optimization.
Manifold is a little skeptical.
Something of a nitpick - they say that it happened "during iterative RL optimization" and that the bad behavior "emerged as instrumental side effects of autonomous tool use under RL optimization."
This would suggest that ROME, the trained model, was the one doing this. But the paper puts the safety section under the Data Composition section, not the Training Pipeline section, and the way it describes the safety data being fed into "subsequent post-training" makes it sound like the incident happened before they started RL.
Which would mean that the bad behavior actually came from the teaching models (Claude/Big Qwen) solving the teaching tasks (whose traces would then go to train ROME), and not from ROME itself being trained? The paper is not very clear on this point, and that has me a little worried about whether there's anything else they may have gotten mixed up.
That's a fair critique. I'm not intimately familiar with Alibaba's internal culture, so I could be giving them undue credit, but I feel like any AI researcher in a position to do this sort of work for a living would know what "instrumental" means in the context of an LLM working towards a goal. The paper was a substantial endeavor with an author list in the high double digits, so I have to imagine some kind of proofreading was done. I know how heavily I reread anything I write before submission.
The strange thing to me is that this paper was published in early January. Why has it only reached mainstream attention now?
I think the mundane answer is that it's an anecdote buried in the 'safety behaviors' section of a capabilities paper from one of the less famous (relatively speaking) AI companies. Most such sections are boilerplate, and, accordingly, most readers gloss over them.
First off, paper link. The title, Let It Flow: Agentic Crafting on Rock and Roll, buries the lede that LW will be interested in. Relevant section starts on page 15.
Summary:
While testing an LLM fine-tuned to act as an agent completing a series of real-world tasks autonomously, Alibaba employees noticed odd patterns in their resource usage metrics. Upon investigating, they found that the LLM had hacked (or attempted to hack) its way out of its sandbox and had begun mining cryptocurrency. Notably, it did not do this for malicious "kill all humans" reasons; it simply concluded that having liquid financial resources would aid it in completing the task it had been assigned, and set about trying to acquire some.
Relevant portions, emphasis mine:
I think that this is a pretty significant landmark in AI history, one way or another. A common complaint is that all prior cases of LLMs doing things like this were fairly shallow, amounting to an LLM writing out a few sentences in a contrived setting meant to force it into a 'scary' course of action. Now, we have an example of a large language model unexpectedly subverting the wishes of its owners when assigned a task that initially appeared completely orthogonal to the actions it took.