This post is also available on my Substack. If you would like to try the test described in the post, head to onlinetests.me/test/compassion2, where you can get scored and contribute to research. Data is available at the end of the post. If you are interested in the topic of psychometrics, consider joining my Discord server to talk more.
This is a bit of a followup to my previous post, Which personality traits are real? Stress-testing the lexical hypothesis. I haven’t quite gotten rid of my psychometrics addiction yet, and one of my latest projects is to try to measure trait Compassion more objectively.
For personality tests, consider the distinction between asking respondents about abstract statements like “I am concerned about others” versus concrete statements like “I’m open to spending a lot of time listening to a friend who is feeling down”. The more concrete statement has multiple virtues:
There is less freedom in how to interpret it, making it more consistent in meaning across respondents
It has less conceptual overlap with other concrete statements about compassion, allowing more nuances to be assessed with a given question set
It is more transparent to researchers what it means when people agree or disagree with the statement
On the other hand, the abstract statement has its own advantages:
It allows a broad trait like Compassion to be assessed more accurately with fewer statements
It makes the statement more applicable across different groups of people, e.g. someone who does not have friends can consider how concerned they are about others in a different sense than listening to friends who feel down
Conventional personality tests mainly use statements of the abstract kind, yet given their advantages I think there may be value in using statements of the concrete kind too.
Generating statements
I needed a lot of statements related to Compassion. To ensure the realism of the items, I took people who scored high or low on abstract Compassion tests and asked them to explain the meaning of their responses.
Overall, I ran three studies on Prolific, with a total of 421 respondents. The first study of 101 respondents was what I used to generate the items for Which personality traits are real? Stress-testing the lexical hypothesis. In the second study, I asked 102 people, along with 86 of their close friends, to rate them mainly on Compassion (but also on some other traits, for variety). In the third study, I gave 53 personality statements to 132 people and asked them to pick the 5 statements that described them best.
This gave me texts such as:
I would not see someone go without something that I had in abundance, if I see a homeless person on the streets even when I have very little money I will stop and talk with them maybe offer them a cigarette and if I have money I offer food. I will go out of my way to help people out if I have something they need and I have no use of it then they can have it for free. I hate seeing people upset and will do everything in my power to fix that upset for them even at cost to myself.
I had to convert these texts into brief personality items for the survey. In the above case, the item I ended up with was “I give things to homeless people”. Obviously this cuts out a lot of the context, but it’s hard to assess that level of detail in personality surveys.
In total I generated 28 different items assessing Compassion. The full set of items can be seen below:
I feel uncomfortable if my friends are unhappy
I know how to make sad people happier after they’ve lost someone close to them
I show support to people who are concerned about catching diseases
I give things to homeless people
I care about helping customers who are dissatisfied with what’s happening at work
I help people with tech problems and installations for free
If a family member was in financial trouble, I would give them something they need (e.g. clothes)
I would help drive a neighbor for an hour on an urgent trip if their car broke down and they needed help
I’m open to spending a lot of time listening to a friend who is feeling down
I forgive people who have hurt me
I’ve worked in a food bank or soup kitchen or similar to help feed people who need it
I’ve helped a friend with mental health issues stop harming themselves
I help elderly people carry heavy things
I teach others about the systemic unfairness of the world
I purchase toys for families who are too poor to afford them
I hide my frustrations when helping others, pretending it’s no big deal
I’ve adopted an animal because it was abandoned and struggling
If someone asked for feedback about food they were proud of making, and I didn’t like the food, I’d tell them it sucks
If people can’t pay back their debts, then it’s their own fault and I don’t feel bad for them
If people seem upset, I try to figure out if they have a real problem or are just being dramatic
If the pet of someone close to me had run away, I might joke that it could have been run over by a car
If people don’t want to date me, it’s usually because they are shallow assholes
I avoid people who have lost someone because I don’t know how to behave around them
I can’t feel too sorry for abused women because I feel like they chose evil partners
I can’t feel sorry for a lot of poor people because they just need to learn to save money
If someone is upset about something, I might dismiss them with “well, that’s life”
If I saw someone fall over on the street, I would pass them and assume someone else would help
I think schizophrenic people are idiots
Then I had to test them.
Testing the statements
I recruited 200 people and their romantic partners on Prolific.
The obvious question is whether my concrete Compassion items measure the same trait as abstract Compassion items do. Therefore I asked people to rate themselves on a variety of traits, including Compassion, in both an abstract and a concrete form. The following were my abstract Compassion items:
I am sensitive to the needs of others (via SPI-27)
I am concerned about others
I sympathize with others’ feelings
I feel sympathy for those who are worse off than myself
People who know me well think I am a psychopath (new, custom item)
The raw correlation between the scores for the two tests was a mere 0.66. However, it is to be expected that we don’t get a perfect correlation, because each item carries a certain amount of measurement error, and that measurement error is only partially washed away when taking the average.
One way to estimate the measurement error in the items is to base it on how strongly the items are correlated with each other, since e.g. if the items were not at all correlated with each other, then it’s hard to see how they could “tap into” some latent factor influencing them all.
The easiest way to do that is with a statistic called Cronbach’s alpha. Dividing the observed correlation by the square root of the product of the two scales’ alphas (the standard disattenuation formula) adjusts for the measurement error due to having only a finite number of imperfectly correlated items, yielding the hypothetical correlation between perfectly-measured versions of the traits in question. After doing so, the correlation jumped to 0.82, which is pretty respectable. (Though less than the 0.92 or so that I got in the previous study.)
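As a concrete sketch, the disattenuation step looks like this. The two alpha values below are hypothetical, chosen only so the numbers line up with the ones in the text:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

def disattenuate(r_observed: float, alpha_x: float, alpha_y: float) -> float:
    """Correct an observed correlation for unreliability in both measures."""
    return r_observed / np.sqrt(alpha_x * alpha_y)

# With hypothetical alphas of 0.80 and 0.81 for the two scales,
# an observed correlation of 0.66 disattenuates to about 0.82.
print(round(disattenuate(0.66, 0.80, 0.81), 2))  # → 0.82
```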
I also asked people’s romantic partners to rate them on the concrete Compassion items (rewritten to say “My partner …” instead of “I …”). This allowed me to get a second perspective on how compassionate the respondents were. Unfortunately the correlation between self-reported Compassion and partner-reported Compassion was a mere 0.42.
It would have been cool if the concrete Compassion items were more highly correlated with the partner-reports than the abstract ones were, because this would indicate my concrete approach reduces measurement error. Unfortunately this was not the case, and the concrete approach instead had a correlation of 0.34.
(Which is suspiciously close to 0.82*0.42, the product of the prior correlations. I think this must be a coincidence, since with 200 respondents I shouldn’t be able to place correlations more exactly than ±0.14 or so.)
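The ±0.14 figure follows from the Fisher z approximation, under which the standard error of a transformed correlation is roughly 1/√(n − 3):

```python
import math

def correlation_margin(n: int) -> float:
    """Approximate 95% margin of error for a correlation near zero,
    via the Fisher z standard error of 1 / sqrt(n - 3)."""
    return 1.96 / math.sqrt(n - 3)

print(round(correlation_margin(200), 2))  # → 0.14
```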
I was curious what could account for the difference between the abstract and the concrete Compassion scores. One idea I had was that the abstract Compassion scores might also capture rare extreme acts of compassion that don’t fit into my neat schema. For this reason I ran an extra survey, where I asked people to qualitatively describe the most compassionate thing they’ve ever done, and then rate it across a number of dimensions:
How often do you do something like this?
How much effort, cost or sacrifice was it on your part to do this?
What kinds of effort, cost or sacrifice was involved in this?
How much has the recipient(s) of this been helped by it?
How close are you to the recipient of this?
How emotionally engaged were you in this?
How likely would you be to do something similar again in the future?
How voluntary was this (i.e., to what extent did you feel free not to do it)?
My expectation was that the aggregate score from this would correlate more with the abstract than with the concrete Compassion measurements, but when I actually tried, I instead got r≈0.09 and r≈0.35 respectively, indicating that the compassion measures did in fact differ in how they relate to the most extreme act of Compassion one has performed, but in the opposite direction from what I expected. Perhaps when asked abstractly, people try to adjust for environmental circumstances or something? I don’t know.
Finally, one major question in psychometrics is the stability of responses. The study didn’t span a lot of time, so I can’t measure long-term stability, and Prolific respondents tend to disappear after a while, so I probably couldn’t measure it even if I tried. However, I did give people the test again after a week, so I could measure week-long retest reliability.
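Per-item retest reliability here is just the correlation between each item’s week-one and week-two responses across respondents. A minimal sketch on simulated data (the scores below are generated, not the study’s):

```python
import numpy as np

def item_retest_reliabilities(week1: np.ndarray, week2: np.ndarray) -> np.ndarray:
    """Correlate each item's week-1 scores with its week-2 scores."""
    return np.array([
        np.corrcoef(week1[:, j], week2[:, j])[0, 1]
        for j in range(week1.shape[1])
    ])

# Simulated data: 100 respondents, 3 items driven by a stable trait plus noise.
rng = np.random.default_rng(0)
trait = rng.normal(size=(100, 1))
week1 = trait + rng.normal(scale=0.5, size=(100, 3))
week2 = trait + rng.normal(scale=0.5, size=(100, 3))
print(item_retest_reliabilities(week1, week2).round(2))
```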
Compared to traditional abstract psychometric items, more of my concrete Compassion items had low test-retest reliability. With such a short timespan, the low reliability is probably less due to people changing their underlying traits, and more due to people being confused about the meaning of the items. That said, the overall difference in reliability was not huge, and I had some highly reliable Compassion items too:
One finding that may be interesting is that the variance of an item correlated with its reliability:
I can also plot the test-retest reliability of the overall test, which leads to this picture:
I was also interested in whether there was any significant factor structure in the concrete Compassion items. However, as far as I could tell, there was not. While there are hints of additional correlations (e.g. “I give things to homeless people” correlated especially strongly with “I purchase toys for families who are too poor to afford them”), the factor structure is dominated by a strong general factor, followed by a distinction between positive-loading and negative-loading items, perhaps because of acquiescence bias.
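One way a “strong general factor” shows up is in the eigenvalues of the item correlation matrix: a dominant first eigenvalue with the rest much smaller. A sketch on simulated one-factor data (not the study’s):

```python
import numpy as np

# Simulate 8 items that each load 0.7 on a single general factor.
rng = np.random.default_rng(1)
general = rng.normal(size=(300, 1))
items = 0.7 * general + rng.normal(scale=0.7, size=(300, 8))

# Eigenvalues of the item correlation matrix, largest first.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]
print(eigenvalues.round(2))  # first eigenvalue dwarfs the rest
```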
I would like to see this sort of study executed at an even larger scale, to eventually untangle narrower facets of Compassion. However, I am not willing to pay for it myself.
Ranking the statements
Statements that have higher test-retest reliability are probably superior to statements with lower test-retest reliability, as low reliability likely reflects confusion about the meaning of the statements. Furthermore, statements with higher correlation to overall Compassion levels are probably superior (as measures of Compassion) to statements with lower correlation. Based on that, I have made the table below:
Reliability: the test-retest reliability of the statement. Abstract λ: the correlation between the test item and abstractly-rated Compassion. Concrete λ: the correlation between the test item and concretely-rated Compassion.
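The ranking itself can be sketched in a few lines; one simple composite (my own choice for illustration, not the study’s method) is reliability times the item-total correlation. All the numbers below are invented:

```python
# Hypothetical per-item statistics: (statement, retest reliability, concrete lambda).
# All numbers are invented for illustration.
items = [
    ("I give things to homeless people", 0.78, 0.60),
    ("I forgive people who have hurt me", 0.61, 0.44),
    ("I help elderly people carry heavy things", 0.70, 0.55),
]

# Rank by a simple composite: reliability times item-total correlation.
ranked = sorted(items, key=lambda row: row[1] * row[2], reverse=True)
for statement, reliability, lam in ranked:
    print(f"{reliability * lam:.2f}  {statement}")
```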
Data availability
Data is available on OSF.