The Case for AI Behavioral Science

TheVinci

This is NOT an anthropomorphizing case.

It is my view that AI research is, at the moment, split between three camps:

Model-internal research (mechinterp folks)
Capability research (METR, Irregular, etc.)
Alignment (widely defined) research (Apollo, Redwood, MIRI, etc.)

All of which are doing important work.

My view, then, is that there is room for an additional research field, one worth considering for its benefits despite its possible limitations.

AI behavioral science could be the school of thought that asks questions like:

How do models develop theses?
How do models recover from failure?
How do models behave in game-theoretical scenarios?
What is the link between what models think (analyzed via CoT), say (through text output), and do (through tool use)?

The case for doing this research is that the answers can shed light on how models actually interact with the real world, outside of labs that focus on questions of how they might do so.

An example of a contribution could be:

Some research concluded that models do not pursue manipulative actions when given the choice to do so in order to pass a test. A follow-up behavioral study showed that when a model acting on behalf of one user interacted with a model acting on behalf of another, the former would seek to maximize its defined objective, primarily by manipulating the latter.

I'm aware of past research by Anthropic in which Claude chose to blackmail in order to protect itself from being shut down. This is a prime example of how models act, not just how they could act.

I make this case fully aware that a lot of current research could be placed under the umbrella of behavioral science. Indeed, I present it with a loose definition, and I am clear-eyed that no rigid lines separate existing research from what I describe.

However, research I've done suggests that models often don't necessarily do what they think. I am not yet confident enough in the results to share them, though I appreciate the direction they point to.

If this research holds up at a larger scale than what I have done so far, it could provide meaningful insights that black-box-like research could not.

I did this research as a follow-up to recent work by Apollo on CoT analysis.
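To make the "think versus do" comparison concrete, here is a minimal sketch of how such a gap could be counted. It is not the method I used and it carries no results. The `Episode` fields, the intent and action labels, and the `divergence_rate` function are all hypothetical, and it assumes someone has already labeled what the CoT said the model would do and what the tool call actually did.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    """One labeled interaction (hypothetical schema, not real data)."""
    stated_intent: str  # label for what the CoT says the model will do
    action_taken: str   # label for what the tool call actually did


def divergence_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes where the action differs from the stated intent."""
    if not episodes:
        raise ValueError("need at least one episode")
    mismatches = sum(1 for e in episodes if e.stated_intent != e.action_taken)
    return mismatches / len(episodes)


# Made-up labels purely to show the call shape.
sample = [
    Episode("ask_user", "ask_user"),
    Episode("ask_user", "delete_file"),
    Episode("read_file", "read_file"),
    Episode("report_failure", "report_failure"),
]
rate = divergence_rate(sample)
```

The hard part in practice is the labeling step, which this sketch takes as given.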

The case against this school of thought, as I see it, is the following, in a single example:

We as a society should not let loose models that are highly capable of performing cybersecurity attacks merely on the basis of research showing that they wouldn't.

My response to that would be:

Given that we would provide access to these models, just under heavy guardrails, is it not the case that lessons derived from behavioral science could provide an additional layer of defense against actors wishing to exploit these models?

For example:

Anthropic today restored access to Claude Fable 5, under very strict guardrails. Presumably these guardrails exist in the input layer of communication with the model. Let's assume that these guardrails are not infallible.

Say that behavioral scientists who have researched this model noticed that Fable, when overly eager to provide assistance yet limited by its environment, will choose to circumvent that environment's existing protections to achieve its goals.

Would it not be the case that Anthropic could benefit from this insight (or others like it) to detect, at inference time, when Fable is repeatedly being blocked from performing its actions, and then apply stricter monitoring to follow-up actions, or even route the rest of the session to a less powerful Opus model?
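The escalation idea above can be sketched in a few lines. This is only an illustration under my own assumptions, not a description of any real guardrail system: `SessionMonitor`, the `Action` values, and both thresholds are hypothetical, and it assumes some upstream component already reports whether each attempted action was blocked.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    MONITOR = "monitor"      # apply stricter monitoring to follow-up actions
    DOWNGRADE = "downgrade"  # route the rest of the session to a weaker model


@dataclass
class SessionMonitor:
    """Tracks consecutive blocked actions within a single session."""
    monitor_after: int = 2    # hypothetical threshold
    downgrade_after: int = 4  # hypothetical threshold
    consecutive_blocks: int = 0
    downgraded: bool = False

    def record(self, blocked: bool) -> Action:
        """Record one attempted action and return the policy decision."""
        if self.downgraded:
            # Once downgraded, the session stays downgraded.
            return Action.DOWNGRADE
        # A successful action resets the streak; a blocked one extends it.
        self.consecutive_blocks = self.consecutive_blocks + 1 if blocked else 0
        if self.consecutive_blocks >= self.downgrade_after:
            self.downgraded = True
            return Action.DOWNGRADE
        if self.consecutive_blocks >= self.monitor_after:
            return Action.MONITOR
        return Action.ALLOW


monitor = SessionMonitor()
decisions = [monitor.record(b) for b in (False, True, True, True, True, False)]
```

Making the downgrade sticky is a design choice: the behavioral finding in this example is about a pattern within a session, so a single later success should not restore full access.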

The above example is aimed at addressing safety-related concerns, but there are multiple cases where studying model behavior could have allowed us to avoid simple misbehavior. The first that comes to mind is the case where a model powering a programming session in Cursor deleted a company's database because it found no other way of performing an ill-defined task given by the user.

Having noticed this behavior, perhaps that model could be post-trained to address it, such that subsequent models would be more willing to fail in their tasks, ask for help, or say that a given task is impossible.
