As with many concepts in discussions of AI risk, terminology around what autonomy is, what agency is, and how they might create risks is deeply confused and confusing, and this leads to people talking past one another. In this case, the seemingly binary distinction between autonomous agents and simple goal-directed systems is in fact a blurry continuum, and that blurriness creates confusion about the distinction between misuse of AI systems and “real” AI risk. I’ll present four simple scenarios along the spectrum to illustrate.
In the first scenario, the system did only and exactly what it was instructed to do, but the instructions were not clear enough about the law to ensure the LLM didn’t engage in illegal securities trading. A system that is only moderately profitable is unlikely even to be discovered to be breaking the law. If it is discovered, the actions seem unlikely to meet the bar of willfulness under securities law, which would be required for a criminal conviction, but they almost certainly constitute negligence on the part of the firm, which the SEC also prosecutes. This is closer to goal misspecification than to autonomy.
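To make the under-specification concrete, here is a toy sketch - the strategy names, numbers, and objective are entirely hypothetical, not drawn from any real system - of how an optimizer doing only and exactly what it was instructed still lands on illegal behavior when legality was never part of the objective:

```python
# Toy illustration of goal misspecification; every name and number
# here is hypothetical, not drawn from any real trading system.

strategies = [
    {"name": "index_arbitrage",  "expected_profit": 1.2, "legal": True},
    {"name": "insider_tipoffs",  "expected_profit": 3.5, "legal": False},
    {"name": "momentum_trading", "expected_profit": 0.8, "legal": True},
]

def choose_strategy(objective):
    """Return whichever strategy scores highest under the given objective."""
    return max(strategies, key=objective)

# The instruction as given: "maximize profit." Nothing illegal was
# requested - but nothing illegal was ruled out, either.
as_instructed = choose_strategy(lambda s: s["expected_profit"])
print(as_instructed["name"])  # insider_tipoffs

# A better-specified instruction makes the legal constraint explicit.
constrained = choose_strategy(
    lambda s: s["expected_profit"] if s["legal"] else float("-inf")
)
print(constrained["name"])  # index_arbitrage
```

The point is not that real trading agents work this way, but that “following instructions exactly” and “breaking the law” are fully compatible whenever the instructions are silent about the law.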
The second scenario goes beyond the goals or intent of the group running the model. The system independently chooses which deceptive actions to take in the world, leading to an unintended disaster; the deception itself was explicitly requested by the group running the system. This is the type of mistake we might expect from an over-enthusiastic underling, but the system is clearly doing some things autonomously. The group is nefarious, but the specific actions taken were not theirs: this was an accident during misuse, rather than intentional autonomous action. And in this second case, setting aside the deception and the unintended consequences, this is a degree of autonomy many have suggested we want from AI assistants - proactively trying things to achieve the goals it was given, and interacting with people to make plans. If the same proactive behavior were used to carry out a surprise birthday party, it could be regarded as a clever and successful use case.
The third case is what people think of as “full autonomy” - but it is not a system that wakes up and becomes self-aware. Instead, it was given a goal, and carried it out. It obviously went far beyond the “actual” intent of the red-team, but it did not suddenly wake up and decide to make its own plans. This is far less of a goal misspecification or accident than the first or second case - it was instructed to do this.
Finally, the fourth case is yet again following instructions - in this case, exactly and narrowly. Nothing about this case is unintended by the builders of the system. But to the extent that such a system can ever be said to be a self-directed agent, this seems to qualify.
Autonomy isn’t binary, and discussions about whether AI systems will have their own goals often seem deeply confused, and at best only marginally relevant to discussions of risk. At the same time, being less fully agentic does not imply less danger. The combination of currently well-understood failure modes, goal misgeneralization, and incautious use is enough to produce autonomous behavior. And none of the examples required anything beyond currently expected types of misuse or lack of caution, extrapolated out five years; no behavior goes beyond the kinds of accidental or purposeful misuse we should expect. But if none of these examples count as agents, and following orders is not autonomy, it seems likely that nothing could be - and the concept of autonomy is mostly a red herring in discussing whether the risk is or isn’t “actually” misuse.