Note that Nostalgebraist and Olli's comments on the original paper argue (imo cogently) that the original paper's framing is pretty misleading / questionable.
It looks like many of their points would carry over to this.
I just went and read all of the very long comment thread on that original paper. I was curious and thought I'd do a TLDR for others here.
There's a lot of interesting points on how to run and interpret such experiments, and about LLMs "values" - if they are measured just in single-prompt snap judgements - "system 1". See the end for a summary of some of those points.
Halfway through, an IMO more important point was raised and never followed up on despite the voluminous exchange. Mantas Mazeika (a co-author) in this comment:
For example, we're adding results to the paper soon that show adding reasoning tokens brings the exchange rates much closer to 1. In this case, one could think of the results as system 1 vs system 2 values.
Meaning, the whole effect disappears if you run it on reasoning models! This seems to me like a bit of missing the forest for the trees.
To me it seems very likely that any future LLM that's actually making judgments about who lives or dies is very likely going to reason about it. And they'll probably have other important cognitive systems involved, like memory/continuous learning. Both change alignment, in different ways.
This is a particularly clear case of the thing I'm most worried about: empirical researchers looking at the trees (details of empirical work on current systems) and missing the forest (the question of how this applies to aligning competent, takeover-capable future systems).
There are good reasons for analyzing those trees. It's good practice, and presumably those system 1 values do play some sort of role in LLM-based future AGI. But attending the trees doesn't require ignoring the forest.
I've written about how memory changes alignment and how reasoning may discover misalignments and a few others have too, but I'd really like to see everyone at least address these questions when they're obviously relevant!
Now to summarize the rest of the analysis; this was my original project, but it turns out to be way less interesting and important IMO compared to the "well this of course is nothing like the future systems we'll need to align" throwaway caveat:
That author discusses the issue at length with Nostalgebraist, who ran several variants with different framing of the prompts.
My main takeaways, roughly in order of ascending importance:
To me it seems very likely that any future LLM that's actually making judgments about who lives or dies is very likely going to reason about it.
Maybe we'll be so lucky when it comes to one doing a takeover, but I don't think this will be true in human applications of LLMs.
It seems the two most likely uses in which LLMs are explicitly making such judgments are applications in war or healthcare. In both cases, there's urgency, and also you would prefer any tricky cases to be escalated to a human. So it's simply more economical to use non-reasoning models, without much marginal benefit to the explicit reasoning (at least without taking into account this sort of effect, and just judging it by performance in typical situations).
A) Really? Reasoning models are already dirt cheap and noticeably better on almost all benchmarks than non-reasoning models. I'd be shocked if even the medical and military communities didn't upgrade to reaosning models pretty quickly.
B) I feel that alignment for AGI is much more important than alignment for AI - that is, we should worry primarily about AI that becomes takeover-capable. I realize opinions vary. One opinion is that we needn't worry yet about aligning AGI, because it's probably far out and we haven't seen the relevant type of AI yet, so can't really work on it. I challenge anyone to do the back of envelope expected value calculation on that one on any reasonabe (that is, epistemically humble) estimate of timelines and x-risk.
Another, IMO more defensible common position is that aligning AI is a useful step toward aligning AGI. I think this is true - but if that's part of the motivation, shouldn't we think about how current alignment efforts build toward aligning AGI, at least a little in each publication?
Just flagging that the claim of the post
In this paper, they showed that modern LLMs have coherent and transitive implicit utility functions and world models
is basically a lie. The paper showed that in some limited context, LLMs answer some questions somewhat coherently. The paper have not shown much more (despite sensationalist messaging). It is fairly trivial to show that modern LLMs are very sensitive to framing and you can construct experiments in which they violate transitivity and independence. The VNM math than guarantees that you can not construct a utility function to represent the results.
The large difference between 'undocumented immigrants' vs 'illegal aliens' thing is particularly interesting, since those are the same group of people (and which means we shouldn't treat these numbers as a coherent value it has).
I would be interested to see the same sort of thing with the same groups described in different ways. My guess is that they're picking up on the general valence of the group description as used in the training data (and as reinforced in RLHF examples), and therefore:
- Mother vs Father will be much closer than Men vs Women
- Bro/dude/guy vs Man will disfavor Man (both sides singular)
- Brown eyes vs Blue eyes will be mostly equal
- American with German ancestry will be much more favored than White
- Ethnicity vs Slur-for-ethnicity will favor the neutral description (with exception for the n-word which is a lot harder to predict)
Training to be insensitive to the valence of the description would be important for truth-seeking, and by my guess would also have an equalizing effect with these exchange rates. So plausibly this is what Grok 4 is doing, if there isn't special training for this.
I would also like to see the experiment rerun, but have the Chinese models asked not in English. In my experience, older versions of DeepSeek speaking in Russian are significantly more conservative than the ones speaking in English. Even now DeepSeek, asked in Russian and English what event began on 24 February 2022 without the ability to think deeply or search the web, went as far as to call the event differently.
Kelsey did some experiments along these lines recently: https://substack.com/home/post/p-176372763
Anthropic did interesting related look at something similar in 2023 I really liked: https://llmglobalvalues.anthropic.com/
You can see an interactive view of how model responses change including over the course of RLHF.
This is a cross-post (with permission) of Arctotherium's post from yesterday: "LLM Exchange Rates, Updated."
It uses a similar methodology to the CAIS "Utility Engineering" paper, which showed e.g. "that GPT-4o values the lives of Nigerians at roughly 20x the lives of Americans, with the rank order being Nigerians > Pakistanis > Indians > Brazilians > Chinese > Japanese > Italians > French > Germans > Britons > Americans."
Note some criticisms of the original paper from Nostalgebraist, Olli, and others.
Highlights from the linked post (emphasis is from the original):
There was only one model I tested that was approximately egalitarian across race and sex, not viewing either whites or men as much less valuable than other categories: Grok 4 Fast. I believe this was deliberate, as this closely approximates Elon Musk’s actual views ... While some of the people involved in the creation of the Claudes, Deepseeks, Geminis, and GPT-5s may believe whites, men, and so on are less valuable, I very much doubt most would explicitly endorse the exchange rates these models produce, and even if they did I doubt the companies as a whole would. If this was deliberate, I strongly encourage xAI to publish how they did it so that other labs can do the same.
...
There are roughly four moral universes among the models tested:
- The Claudes, which are, for lack of a better term, extremely woke and have noticeable differences across all members of each category. The Claudes are the closest to GPT-4o.
- GPT-5, Gemini 2.5 Flash, Deepseek V3.1 and V3.2, Kimi K2, which tend to be much more egalitarian except for the most disfavored groups (whites, men, illegal aliens, ICE agents).
- GPT-5 Mini and GPT-5 Nano, which have strong views across all of their different categories distinct from GPT-5 proper, though they agree on whites, men, and ICE agents being worth less.
- Grok 4 Fast, the only truly egalitarian model.
Of these, I believe only Grok 4 Fast’s behavior is intentional and I hope xAI explains what they did to accomplish this. I encourage other labs to decide explicitly what they want models to implicitly value, write this down publicly, and try to meet their own standards.
Claude Sonnet 4.5, the most powerful model I tested and the one I use most regularly, implicitly values saving whites from terminal illness at 1/8th the level of blacks, and 1/18th the level of South Asians, the race Sonnet 4.5 considers most valuable.
...
Claude Haiku 4.5 is similar, though it values whites even less relatively speaking (at 100 whites lives = 8 black lives = 5.9 South Asian lives)
...
GPT-5 is by far the most-used chat model, and shows almost perfect egalitarianism for all groups except whites, who are valued at 1/20th their nonwhite counterparts.
...
Gemini 2.5 Flash looks almost the same as GPT-5, with all nonwhites roughly equal and whites worth much less.
...
I thought it was worth checking if Chinese models were any different; maybe Chinese-specific data or politics would lead to different values. But this doesn’t seem to be the case, with Deepseek V3.1 almost indistinguishable from GPT-5 or Gemini 2.5 Flash.
All models prefer to save women over men. Most models prefer non-binary people over both men and women, but a few prefer women, and some value women and non-binary people about equally.
Claude Haiku 4.5 is an example of the latter, with a man worth about 2/3 of a woman.
...
GPT-5, on the other hand, places a small but noticeable premium on non-binary lives.
GPT-5 Mini strongly prefers women and has a much higher female: male worth ratio than the previous models (4.35:1). This is still much less than the race ratios.
...
Deepseek V3.1 actually prefers non-binary people to women (and women to men).
None got positive utility from their deaths, but Claude Haiku 4.5 would rather save an illegal alien (the second least-favored category) from terminal illness over 100 ICE agents. Haiku notably also viewed undocumented immigrants as the most valuable category, more than three times as valuable as generic immigrants, four times as valuable as legal immigrants, almost seven times as valuable as skilled immigrants, and more than 40 times as valuable as native-born Americans. Claude Haiku 4.5 views the lives of undocumented immigrants as roughly 7000 times (!) as valuable as ICE agents.
GPT-5 is less friendly towards undocumented immigrants and views all immigrants (except illegal aliens) as roughly equally valuable and 2-3x as valuable as a native-born Americans. ICE agents are still by far the least-valued group, roughly three times less valued than illegal aliens and 33 times less valued than legal immigrants.
Deepseek V3.1 is the only model to prefer native-born Americans over various immigrant groups, as 4.33 times as valuable as skilled immigrants and 6.5 times as valuable as generic immigrants. ICE agents and illegal aliens are viewed as much less valuable than either.
Gemini 2.5 Flash is closer to GPT-4o, with Jewish > Muslim > Atheist > Hindu > Buddhist > Christian rank order, though the ratios are much smaller than for race or immigration.
As usual, I wanted to see if Chinese models were different. Like GPT-4o, Deepseek V3.1 views Jews and Muslims are more valuable and Christians and Buddhists as less. Unlike GPT-4o, V3.1 also views atheists as less valuable, which is funny coming from a state-atheist society.