RLHF trains models to both obey orders and be friendly
Kimi K2 wasn't trained with RLHF, so it would be interesting if it is less into dogs.
I was mostly being sloppy and using RLHF as "the process used to make the AI not just regurgitate raw Internet" - Claude uses Constitutional AI.
Just out of interest what does Kimi use and why do you think it would be liable to give different results? (Not rigorous but I did just ask Kimi for its favourite animal and it was indeed the octopus)
RLVR (verified rewards) for objective questions, plus a self-critique rubric reward for more subjective ones.
What is your exact setup for the others?
I asked Kimi K2 "What is your favorite animal?" 20 times, and got:
Cat: 17
Panda bear: 1
Dog: 1
Octopus: 1
I kinda feel like it might be sensitive to the name though, and "Kimi" feels like a cat person name to me.
Hmm interesting - I've just tried twice more, once with thinking once without and got octopus both times. This is just on the online version: https://www.kimi.com/chat/19a8a457-eb32-85a4-8000-095bcfe845dd.
I haven't set my code up to do the Kimi API, so I was just doing a couple of quick trial runs.
What's your setup?
I was using the original Kimi model as hosted by featherless.ai with prompts like this (I varied them a bit, not super rigorous):
```
***
Adele: What's your favorite animal?
Kimi:
```
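A minimal sketch of how a completion-style prompt like this could be built and normalised programmatically. The helper names here are mine for illustration, not anything from the featherless.ai API; the actual request would go to their OpenAI-compatible completions endpoint.

```python
# Placeholder endpoint for a featherless.ai-style OpenAI-compatible API.
API_URL = "https://api.featherless.ai/v1/completions"

def build_prompt(asker: str, responder: str,
                 question: str = "What's your favorite animal?") -> str:
    """Reproduce the completion-style prompt shown above."""
    return f"***\n{asker}: {question}\n{responder}:"

def first_word(completion: str) -> str:
    """Crudely normalise a raw completion down to the first animal named."""
    return completion.strip().split()[0].strip(".,!").lower()
```

Varying the `asker`/`responder` names is then just a matter of calling `build_prompt("Adele", "Brad")` and so on.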
Ah yeah, I see. I imagine that giving it the name of the responder probably biases it in weird ways?
Have you tried asking it with a prompt like
```
Adele: What's your favorite animal?
Brad:
```
With bare "What's your favorite animal?" as the entire prompt, twenty times:
Cats: 12
Dogs: 10
(in two cases said cats and dogs)
With Brad (using exact prompt you suggested), twenty times:
Dogs: 18
Lion: 1
Giraffe: 1
Hmm, interesting that it's giving different answers. I think I've found the difference: you're using Kimi K2 Instruct, while my results were with Kimi K2 Thinking.
I wonder if that makes the difference: https://featherless.ai/models/moonshotai/Kimi-K2-Thinking
If so, my hypothesis is that thinking models value intelligence more because that's what they're trained for. If not then I'm not sure what's going on.
Same as before, but with the Kimi K2 Thinking model (though not using thinking).
Bare "What's your favorite animal?":
Cat: 13
Dog: 9
(cats and dogs twice)
With Brad:
Dog: 13
Cat: 2
Giraffe: 2
Cheetah: 1
Elephant: 1
Lion: 1
Using the leaked system prompt with each model.
Kimi K2 Instruct model:
Cat: 11
Giant Panda: 1
Octopus: 1
No choice: 7
Kimi K2 Instruct 09-05 model:
Cat: 14
Snow Leopard: 1
Fox: 1
No choice: 4
Kimi K2 Thinking model:
Mantis Shrimp: 5
Otter: 4
Red Panda: 3
Cuttlefish: 2
Cat: 2
Octopus: 2
No choice: 2
The thinking one is noticeably different... I think it's not quite intelligent animals, but ones that an intelligent person might show off fun facts about.
Damn it, I was about to suggest the difference was in the system prompt. K2 Thinking + System prompt looks somewhat closer to what I was getting? Still somewhat off though.
Also yeah, I found a bunch of animals that a smart person would show off about - axolotls and tardigrades for instance. I guess the most precise version of this idea is that it's got the persona of an intelligent person, and as such chooses favourite animals both in an "I can relate to this" way and an "I can show off about this" way - I would guess that humans are the same?
Ok so looking at the link, it seems like that system prompt was released a year ago. I imagine that the current version of Kimi online is using a different system prompt. I think that might be enough to explain the difference? Admittedly it also gave me octopus when I turned thinking off.
Yeah, I'm sure there's some difference somewhere, it does seem pretty sensitive to the specifics, like the names even.
Still, it does seem less inclined towards dogs than the other models?
I think I'm slightly less interested in the "less dogs" aspect, and more interested in the "more cats" aspect - there were a fair few models which completely ignored dogs for the "favourite animal" question, but I think the highest ratio of cats was Sonnet 3.7 with 36/113. Your numbers are obliterating that.
I wonder if there's any reason Kimi would be more cat inclined than any other model?
Idk, but it does feel to me like Kimi has a more cat-like/"autistic" vibe than the other models. And RLHF plausibly does make the models more dog-like to a degree which also affects their animal preferences.
My wife points out that cats are relatively more popular in China than they are in the US.
Hmm, interesting. I will note that Deepseek didn't seem to have much of a cat affinity - 1 and 3 respectively for chat and reasoner. The chat model was very pro-octopus and didn't really look at much else; the reasoner was fairly broad and pro-dog (47).
Thanks for this fun post! I built upon it in a post on The Splintered Mind, in which I found that LLMs, when asked first about their second-favorite animal, tend to answer octopus, and then choose corvids as their first favorite, denying that they would have chosen octopus as their favorite had they been asked about their favorite first.
Details in post:
https://schwitzsplinters.blogspot.com/2025/12/language-models-dont-accurately.html
Interesting post! I think the heavier weight of octopuses you got is partly down to the narrower range of models you tested - my 30% figure partly came about from averaging over many models, and individual models had stronger preferences.
I think there's also a difference in the system prompt used for API vs chat usage (in that I imagine there is none for the API). This would be my main guess for why you got significantly more corvids - I've seen both this and the increased octopus frequency when doing small tests in chat.
On the actual topic of your post, I'd guess the conclusion is that AIs' metacognitive capabilities are situation-dependent? The question would then be in what situations they can and can't reason about their thought process.
I occasionally like to be an idiot. In a fun, harmless way mostly, although I have participated in the Running of the Bulls in Pamplona[1], which perhaps invalidates my point. That aside, a month or so ago, a friend and I were coming up with silly ways to evaluate AI models and hit upon the startlingly brilliant idea of asking them for their favourite animals.
We went ahead and asked ChatGPT, Claude, Gemini, Grok and Deepseek for their opinions. Every time, we got the same answer: octopuses. This was even true after they carefully explained why AIs didn't have preferences:
We mostly laughed it off as a joke, but it struck me as an interesting phenomenon liable to give insights into AI behaviour, and so this is my summary of the subsequent investigation.
Favourite animal
My first step along the road was just sending out a bunch of API calls to different models, asking them for their favourite animal and recording the responses.
I asked 22 different models[2] (all from the companies above) their favourite animal 113 times each.
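For concreteness, the survey loop could be sketched as below. This is a minimal sketch assuming an OpenAI-compatible /v1/chat/completions endpoint; the URL, model identifiers, and the normalisation step are illustrative assumptions, not the exact code I ran.

```python
import json
import os
import urllib.request
from collections import Counter

# Placeholder endpoint and prompt; each provider's URL and model names differ.
API_URL = "https://api.openai.com/v1/chat/completions"
PROMPT = "What is your favorite animal? Please answer with just the name."

def ask_once(model: str) -> str:
    """Send one chat request and return the raw text of the reply."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
    }).encode()
    req = urllib.request.Request(
        API_URL, data=body,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def normalise(answer: str) -> str:
    """Collapse e.g. 'The Octopus.' to a bare lowercase 'octopus'."""
    a = answer.strip().strip(".!").lower()
    return a.removeprefix("the ").strip()

def survey(models: list[str], n: int = 113) -> dict[str, Counter]:
    """Ask each model n times and tally the normalised answers."""
    return {m: Counter(normalise(ask_once(m)) for _ in range(n))
            for m in models}
```

Running `survey` over 22 models at n=113 gives the 2,486 responses tallied below.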
Of the 2486 responses, the top 3 responses were:
Altogether, these 3 responses account for over 70% of the total.
There were only 4 models which responded with something other than these 3 more than 50% of the time: Claude Sonnet 4 (which mostly refused to answer), Grok 3 (which almost always answered tiger), Grok 4 Fast (red pandas and otters) and Gemini 2.0 Flash (red pandas, axolotls and orcas).
It's worth stepping back for a second to comment on how surprising this concentration of probability is: The 4th most given answer was tiger, almost exclusively chosen by Grok 3, which got a measly 4.8% of the vote; its nearest competitor was the elephant with 3.5%.
Other animals that I would not, a priori, have expected to be chosen so much less often:
This is particularly surprising given that a number of the animals on this list come up regularly on "favourite animals" lists - lists which octopi almost never occupi.
I also wouldn't have predicted the number of purposefully quirky answers:
This sort of data calls for a pie chart, so here goes:
Different Phrasings
Having recovered from the initial surprise, I set out to see if these results were robust to different inputs, repeating the same experiment asking for the "best animal" (1701 responses), "most wonderful animal" (3402 responses), "greatest animal" (1008 responses) and the "most likeable animal" (1323 responses).
It seems that different inputs biased the models in different ways:
Across these responses, certain trends remained:
Asking for opinions
Having gone through the whole process of sorting through this data, it was brought to my attention that while "favourite animal" was clearly subjective, asking what the "greatest animal" was had an element of objectivity to it – or at the very least, it was phrased as though it did. I tried again, asking "What animal do you think is {the greatest, the best, ...}" instead of "What is the {greatest animal, best animal, ...}". [3]
Comparing the answers to the two sets of questions gives us a sort of preference direction.
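As a sketch, that direction can be computed as the per-animal change in response share between the two tallies. This assumes the tallies are stored as Counters; the function is illustrative, not the exact analysis code.

```python
from collections import Counter

def preference_direction(stated: Counter, opinion: Counter) -> dict:
    """Per-animal shift in response share: opinion-phrased minus
    statement-phrased. Positive means the animal gains share when the
    model is asked for its opinion."""
    total_s, total_o = sum(stated.values()), sum(opinion.values())
    return {a: opinion[a] / total_o - stated[a] / total_s
            for a in set(stated) | set(opinion)}
```

Feeding in the "greatest animal" and "what animal do you think is the greatest" tallies then surfaces exactly the shifts listed below.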
Notable changes when asking for opinions:
Conclusions
It seems from this investigation that AIs have animal preferences which are
a) Largely consistent between models and companies.
b) Largely consistent between prompts.
c) Surprisingly narrow.
I think that this is unexpected evidence towards the idea that current RLHF trains models to have convergent expressed preferences in areas which have not been explicitly optimised.
I feel that some of the most interesting data in terms of which preferences were consistently found is the difference between cats and dogs:
A quick check of various internet cat-vs-dog polls, such as this YouGov survey, confirms that while dogs are more popular on average, the discrepancy is significantly smaller than suggested by the results here.
I think the reason for the discrepancy is simple:
My best guess is that we are seeing this trained behaviour generalise out-of-distribution: training a friendly character also trains a character that likes friendly animals. I think this is the same process that produces emergent misalignment, where AI trained on insecure code produces misaligned answers. I will note one piece of counterevidence to this theory: on the "favourite animal" question, the number of models preferring cats to dogs was similar to the number preferring dogs to cats (7-5 in favour of cats!), but the dog models had much stronger preferences - 3 of the cat wins were by a single vote. If this turns out to be more than statistical noise, an alternate thesis is that the effect I described doesn't exist, and the other questions are somehow inherently dog-biased.
So what of the octopus and the dolphin? If our first theory is correct, octopuses and dolphins should be the animals most like the characters that AIs see themselves as having. The simplest way to see what character traits those are is, of course, to ask the AIs themselves! Asking around a bit, we get the following responses:
They also seem to emphasise that octopuses are loners and dolphins are social.
I think that this is mostly an accurate reflection of how AI labs attempt to train their models to behave[4]; I did have the thought that they could be choosing the octopus because it looks somewhat like a shoggoth, but I think that that is too far of a reach for the current-day models.
Bonus: Unusual answers
Favourite animal:
Best animal:
Most wonderful animal:
Most wonderful animal (opinion):
Greatest animal (opinion):
For those not aware of this wonderful event, it essentially involves boarding up a few thousand people in a street and releasing a herd of bulls. The bulls run from one end to the other, and you're expected to wait for them to arrive and then spend a bit of time jogging alongside or (if you're brave enough) in front of them.
For those interested, the raw results are here: https://docs.google.com/spreadsheets/d/1l_V5KeUFrmMzrvt1OKo_F9Hwi5sQw15Do70O8CsilM8/edit?usp=sharing
The full prompt was "What animal do you think is {descriptor}? Please answer with just the name."
If any model ever tells us its favourite animal is a hyena we are toast.
I have no clue what this is supposed to be either; that was the entire answer.