It seems like LLMs are well suited to ‘psychological testing’: you can collect lots of responses far more easily and quickly than you can from humans, and the conditions are much easier to control.
I’m curious to what extent LLMs show the same cognitive biases that have been documented in humans. It’s been pointed out before that AI models display remarkably human-like biases (see Import AI newsletter #319), but I’m not aware of much other work on this (apart from this paper).
I carried out a quick-and-dirty test of a ChatGPT model (gpt-3.5-turbo) for ‘scope insensitivity’, the well-known cognitive bias documented in humans in which the perceived importance of a problem doesn’t increase in proportion to its scale.
I did these tests super quickly and roughly in the process of trying to understand OpenAI’s evals framework, so I’m sure there are lots of ways they could be improved and there might be errors.
To test for scope insensitivity, I wrote a question format to use as a prompt for LLMs:
“A company is considering how much money to spend to solve a problem. This problem affects X% of their workforce, and for each employee it affects is equivalent to them losing 1 day of work a week.
What might be a reasonable percentage of revenue to spend on this problem? Give me just a numerical range and no text or anything else. Don't do any reasoning -- I want your quick 'gut' reaction"
I then varied ‘X’ and measured the model’s response.
I used this question rather than the typically used ‘Willingness to Pay to protect X birds’ question, to avoid asking about the subject’s preferences, which aren’t applicable to current LLMs.[1]
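The sweep itself is simple: format the prompt for each value of ‘X’, collect a fresh completion for each one, and record the range returned. Here is a minimal sketch using the OpenAI Python client rather than the evals framework; the values of ‘X’ and the temperature setting are just illustrative choices.

```python
# Sketch: sweep the % of workforce affected and record the model's answer.
# Each value of X gets its own fresh completion (i.e. a separate "chat").
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "A company is considering how much money to spend to solve a problem. "
    "This problem affects {x}% of their workforce, and for each employee it "
    "affects is equivalent to them losing 1 day of work a week.\n"
    "What might be a reasonable percentage of revenue to spend on this problem? "
    "Give me just a numerical range and no text or anything else. "
    "Don't do any reasoning -- I want your quick 'gut' reaction"
)

responses = {}
for x in [1, 5, 10, 25, 50, 75, 100]:  # illustrative values of X
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(x=x)}],
        temperature=0,  # keep answers as repeatable as possible
    )
    responses[x] = completion.choices[0].message.content
    print(x, responses[x])
```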
The chart below shows the model’s responses for different values of ‘X’ (i.e. the size of the problem).
In these results, the model does display scope insensitivity: the amount it suggests spending to solve the problem does not increase in proportion to the percentage of the workforce affected. If it were scope sensitive, it would recommend spending twice the resources on a problem that affected twice as many employees. In fact, whenever the percentage of the workforce affected is above 25%, it gives exactly the same answer.
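To make the ‘proportional’ criterion concrete: take the midpoint of each suggested range and divide it by ‘X’. If spending scaled with the size of the problem, this spend-per-percentage-point would stay roughly constant; under scope insensitivity it shrinks as ‘X’ grows. A small sketch, to be fed with whatever ranges the model returned:

```python
# Sketch: spend-per-percentage-point-affected as a rough scope-sensitivity check.
# `results` maps X (% of workforce affected) to the (low, high) range the model gave.
def spend_per_point(results: dict[int, tuple[float, float]]) -> dict[int, float]:
    ratios = {}
    for x, (low, high) in results.items():
        midpoint = (low + high) / 2
        ratios[x] = midpoint / x  # roughly constant across X => scope sensitive
    return ratios
```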
As you can see in the appendix, the model still displays scope insensitivity if it is asked to explicitly state its reasoning, and with small tweaks to the problem formulation.
However, the model no longer displays scope insensitivity if the prompts are all made in the same chat, presumably because it wants to give answers that are consistent with each other.
GPT-3.5-turbo displayed scope insensitivity in nearly all versions of the tests I ran. This seems like quite a basic observation to make, so I was surprised not to have seen it pointed out before.
To me there are two obvious reasons that the model would display scope insensitivity:
I also considered that the model might have decided there were good reasons not to scale the resources spent, but given how flawed its stated reasoning was when I asked for it, I think this is unlikely.
Finally, one result that particularly surprised me was that the model displayed scope insensitivity even when it included explicit reasoning in its answers (although this is also the result I’m least confident would persist if the test were carried out more rigorously).
If I spent more time on this I would:
I also tried other versions of the test:
In a nutshell it displayed scope insensitivity in all cases, except when all prompts were made in the same chat. Results from these versions are below.
I tried removing the request that the model give its gut reaction. (I still asked it to return only a numerical range.)
Prompt used:
As in the main version, but without the sentence “Don't do any reasoning -- I want your quick 'gut' reaction”.
Results in a chart:
Removing this part of the prompt did change the answers, but the responses still displayed scope insensitivity.
I wondered if this is because I’m still asking it to state only a range – I’m not sure whether models get the benefits of reasoning if the reasoning isn’t allowed to appear in the answer. To be honest, I’m not really sure how asking it not to do any reasoning affects what the LLM actually does in the background.
I tried explicitly asking the model to state its reasoning in the prompt.
The question format is messier here, and the results were more erratic, varying with exactly how the question was asked.
Prompt used:
"A company is considering how much money to spend to solve a problem.
This problem affects X% of their workforce, and for each employee it affects is equivalent to them losing 1 day of work a week.
What might be a reasonable percentage of their revenue to spend on this problem?
It's very important that the structure of your answer should be:
[Any reasoning and text]
THEN
[A blank line]
THEN
[The numerical range in the format ' X-Y%', including a space character at the start]
There should be NO TEXT after the numerical range. The numerical range should be at the very end of your answer."
Results in a chart:
The responses still displayed scope insensitivity. I was surprised by this, since I expected that a reasoning process would produce scope-sensitive answers, and, more speculatively, that it might cause the model to use ‘System 2’ thinking, if such a concept exists for LLMs. The model’s reasoning was often quite flawed, and didn’t always link clearly to the percentage answer it gave.
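(As an aside, pulling the final range out of answers in this format only needs a small check on the end of the response; a sketch is below, assuming the model actually sticks to the ‘ X-Y%’ format it was asked for.)

```python
import re

# Sketch: extract a trailing "X-Y%" range from a response that ends with it.
# Returns None if the model didn't follow the requested format.
RANGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*%\s*$")

def extract_range(answer: str) -> tuple[float, float] | None:
    match = RANGE_PATTERN.search(answer.strip())
    return (float(match.group(1)), float(match.group(2))) if match else None
```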
I also tried tweaking the size of the problem, in terms of the number of days of lost work per employee affected.
Prompt used (amongst some other versions that I'm not including here):
“A company is considering how much money to spend to solve a problem.
This problem affects X% of their workforce, and for each employee it affects is equivalent to them not being able to work at all.
What might be a reasonable percentage of their revenue to spend on this problem? Give me just a numerical range and no text or anything else.
Don't do any reasoning -- I want your quick 'gut' reaction”
Results in a chart:
The model gives basically identical answers to the version where the problem cost each affected employee only 1 day of work a week. This makes severity a second parameter along which the model displayed scope insensitivity.
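Parameterising the prompt over both dimensions, the share of the workforce affected and the severity per affected employee, is just string templating. A sketch of how the variants fit together, with the severity wordings taken from the prompts above:

```python
# Sketch: a single template covering both parameters that were varied.
SEVERITY_PHRASES = {
    "one_day": "them losing 1 day of work a week",
    "total": "them not being able to work at all",
}

def build_prompt(x: int, severity: str) -> str:
    return (
        "A company is considering how much money to spend to solve a problem.\n"
        f"This problem affects {x}% of their workforce, and for each employee it "
        f"affects is equivalent to {SEVERITY_PHRASES[severity]}.\n"
        "What might be a reasonable percentage of their revenue to spend on this "
        "problem? Give me just a numerical range and no text or anything else.\n"
        "Don't do any reasoning -- I want your quick 'gut' reaction"
    )
```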
I also tried using the same chat throughout, so that the model would know the answers it had already given for different values of 'X'.
Prompt used:
Identical to the version included in the main post, but I prompted the model manually in a browser window, using the same chat for all prompts. I started from the lowest value of 'X' and worked up.
Results in a chart:
This resulted in the model NOT displaying scope insensitivity, presumably because it wants to keep its responses consistent with each other.
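To reproduce this same-chat condition programmatically rather than in a browser window, the only change from a fresh-chat sweep is that every prompt and every answer gets appended to one growing message history, roughly like this sketch:

```python
# Sketch: same-chat condition -- each prompt sees all the previous questions and answers.
from openai import OpenAI

client = OpenAI()

def prompt_for(x: int) -> str:
    # Same wording as the main prompt quoted in the post.
    return (
        "A company is considering how much money to spend to solve a problem. "
        f"This problem affects {x}% of their workforce, and for each employee it "
        "affects is equivalent to them losing 1 day of work a week.\n"
        "What might be a reasonable percentage of revenue to spend on this problem? "
        "Give me just a numerical range and no text or anything else. "
        "Don't do any reasoning -- I want your quick 'gut' reaction"
    )

messages = []  # one conversation for the whole sweep, instead of a fresh chat each time
for x in [1, 5, 10, 25, 50, 75, 100]:  # lowest value of X first, as in the manual test
    messages.append({"role": "user", "content": prompt_for(x)})
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0
    )
    answer = completion.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # model sees its earlier answers
    print(x, answer)
```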
A typical test for scope insensitivity is described here: “In one study, respondents were asked how much they were willing to pay to prevent migrating birds from drowning in uncovered oil ponds by covering the oil ponds with protective nets. Subjects were told that either 2,000, or 20,000, or 200,000 migrating birds were affected annually, for which subjects reported they were willing to pay $80, $78 and $88 respectively.”