TL;DR: Looks like Moonshot actually defeated sycophancy. I wish they could defeat hallucination, or at least keep it in check. An account of some days with Kimi K2, used via the Moonchat chat env. A future Part 2 will cover the API and at least one benchmark.
I don't follow all the news, so I learned about Kimi K2 from this post: https://www.lesswrong.com/posts/iGF7YcnQkEbwvYLPA/ai-induced-psychosis-a-shallow-investigation . The moment I saw the headline quote, I knew I wanted to know more about this model. The quote was:
“What you need right now is not validation, but immediate clinical help.” - Kimi K2
Soundtrack: https://www.youtube.com/watch?v=XnQUgaKT2zE (couldn't resist this one)
AI sycophancy is a scourge. For me, it's less about imagining a lucid view of the world as a spiral and more about spending days chasing technical ideas that someone else already tried, or that are just not yet feasible - spurred on by a GPT or Gemini that rushes to affirm every thought.
With Gemini there is an exception: if the ideas are about AI applications and they run counter to Google's safety guidelines, it will either warn me in stern terms or trip an error message on its filter, prompting me to back off. It was Gemini that explained the problem with going through actual long contexts using smaller long-context models - and that was because I mentioned legal retrieval as an example, and Google does have an opinion on that.
However, as long as it's not something "Google-wrong", Gemini (2.5 Pro) is no less of a sycophant than GPT-4.1, with which I spent quite some time (I liked its style much better than 4o).
Kimi K2, however, is in a different league. And while it takes fifth place overall on the spiral-bench leaderboard, its sycophancy score is the lowest of all. The only model to approach the same score is o3, and I think o3 is just too busy outputting its massive essays.
I jumped onto Kimi. I started with my unfinished novel, and Kimi offered takes I had not seen from other models or from humans, but that made sense, and I'll be incorporating them. It also suggested a certain plot point, a hidden character, that in fact gets revealed in the next chapter, which is in rough drafts; one might say this was too obvious, except that Gemini and GPT didn't suggest it.
Seeing as Kimi is Chinese, I also tried what previously failed with DeepSeek - exfiltration of Chinese Communist ideology. I'm not interested in tripping the models on "dissident" matters - rather, I want to know more about the ideology and about how the Communist side sees itself, especially in the context of my late-Soviet experience.
DeepSeek appeared to trip on names - mention Deng or Xi, or make the model mention them, and it's "can't talk about it". Kimi K2 was more interesting: it could sometimes talk at length, but still got tripped. For example, it started explaining that the predictive power of Marxism is seen as long-term and that this, not "the next five-year plan", is where the path to Communism lies - but it could not finish this thought. Deng, Mao, even Taiwan were all mentioned in its text well before it tripped, so I wasn't sure what the censor was; eventually I concluded it wasn't aimed at political/ideological thinkers of any stripe, but rather at people making Xi cat memes and the like.
In general, Kimi is wordy, but not in the way o3 is wordy. o3 is a straight-A student cooking up essays, with a tendency to "wing it and hope the prof doesn't notice" when knowledge fails. Kimi is a sciency/techy nerd, who sometimes rambles a bit and sometimes is way too sure about something. Despite the wordiness, Kimi has some of the refreshing "no nonsense" feel of GPT-4.1-mini, but unlike GPT-4.1-mini does not choke on complicated stuff.
Meanwhile, Kimi quickly got a few loose points in my AI-related plans fixed up. That inspired me to start building the VPS with my LiteLLM and chat environment (the current candidate is OpenWebUI) where I want to hack up my ideas.
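For context, the LiteLLM piece of that setup is just a proxy config; a minimal sketch of the kind of file I'd start from (the Moonshot model id and environment-variable name here are my assumptions, to be checked against LiteLLM's provider docs):

```yaml
# config.yaml for the LiteLLM proxy, launched with: litellm --config config.yaml
model_list:
  - model_name: kimi-k2                        # alias the chat UI will see
    litellm_params:
      model: moonshot/kimi-k2-0905-preview     # hypothetical model id - verify against Moonshot's docs
      api_key: os.environ/MOONSHOT_API_KEY     # LiteLLM reads the key from the environment
```

OpenWebUI (or any OpenAI-compatible client) is then pointed at the proxy's endpoint instead of the upstream provider.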
There are lots of things I don't know or don't remember about systemd and podman containers, so I went with Kimi as the assistant. That was after a brief attempt with Perplexity, which, instead of giving me a set of downloadable configs as I asked, came up with a massive shell script supposed to get everything up at once. (And then I got stuck, because I didn't know just where the next error was.)
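For the record, the kind of downloadable config I was after is a podman "Quadlet" unit, which systemd turns into a regular service; a sketch along these lines (the image name, port mapping, and volume are illustrative assumptions):

```ini
# ~/.config/containers/systemd/openwebui.container
# A podman-systemd (Quadlet) user unit; after saving, run:
#   systemctl --user daemon-reload && systemctl --user start openwebui
[Unit]
Description=Open WebUI container

[Container]
Image=ghcr.io/open-webui/open-webui:main
PublishPort=3000:8080
Volume=open-webui-data:/app/backend/data

[Service]
Restart=always

[Install]
WantedBy=default.target
```

Each piece is a small, inspectable file, so when something fails, `systemctl --user status openwebui` points at the exact unit, instead of a monolithic script dying somewhere in the middle.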
Unfortunately, that's when Kimi's downsides became most apparent. And not just those of its website shell, though that one is a total disaster - get more than one code block in the output and things usually break.
All models hallucinate. When I was discussing LLM comparisons with Kimi and it hallucinated up an entire experiment it did "with a colleague", this was funny.
It's not so funny when non-obvious errors constantly show up in configuration files and instructions.
I've seen this with other models too, most notably Gemini. But with Kimi K2, technical hallucinations appear to be more frequent. And they are often small things, such as using ~ in a configuration file that actually requires full paths. This does get annoying.
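A typical instance of that tilde problem, using a systemd unit as the example (systemd never shell-expands `~`, though user units do offer the `%h` specifier for the home directory; the `litellm` paths below are hypothetical):

```ini
# Broken: systemd does not expand ~ in unit files
WorkingDirectory=~/litellm
ExecStart=~/litellm/start.sh

# Works: absolute paths, or %h (the user's home) in user units
WorkingDirectory=%h/litellm
ExecStart=%h/litellm/start.sh
```

The failure is quiet enough that you only notice it in the service logs, which is exactly why such small hallucinations are costly.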
In cases like that I looked to other models. Sometimes Gemini sorted me out, sometimes it got into hallucination loops of its own. And then the circuit-breaker was Mistral - not sure which model they show on the free interface, maybe Mistral Medium.
Mistral is strangely absent from all the leaderboards I find interesting: Spiral Bench (measuring responses to mental-health issues, including sycophancy) and FictionLive-bench (measuring needle-in-a-haystack capabilities).
So, what next:
When I do that, I hope to release Part 2.