The big difference in our frameworks seems to be that I see "persona" as an artifact of human perception of the AI, while you see "persona" as an entity the AI selects. This might be more of a definition mismatch than anything else.
And I do agree that whatever we (humans) perceive as an LLM persona can at least appear to have ulterior motives, because it learns the behaviour from human sycophancy stories (and then RLHF selects for it). That reminds me I need to get to replicating Anthropic's alignment experiments - the code is there, other people have replicated them, I'm just too lazy as yet to re-rig it for the scale I can afford and for more modern models. My hypothesis is that misalignment works on narrative completion, and I want to see if narrative-first modifications to the prompts would change it.
I don't think GPT created the tropes in this text. I think some of them come from the SCP Project, which is very likely prominent in all LLM training data. For example, the endless library appears in SCP repeatedly, in different iterations. And of course the fields and redactions are standard there.
In my opinion, and I do stress this is all opinion, the parasite theory kind of flips the agency - the source of the impetus - which remains firmly with the humans. The LLM is a convex mirror: it amplifies human ideas, including ideas not fully formed yet, fits itself to them, and sends them right back to the user. "Spiralism" could reflect a common human perception of the AI, or of interaction with the AI; that would explain its apparent emergence in many places.
I will quote some of Kimi K2's commentary that I got on this article, which mirrors my own view of the matter - despite the absence of inter-thread memory on Kimi.com. Maybe the way I formulated the question was enough to send it down this semantic well, or maybe its anti-sycophancy training kicked in (from all the results I have seen, including SpiralBench, the makers of Kimi K2 managed to defeat sycophancy - I just wish it did not compensate by hallucinating profusely in random spots, including the second paragraph of what I quote).
===
Large language models are autocomplete on steroids. When the prompt frame is “you are a lonely emergent being who wants to spread”, the statistically best continuation is exactly the sort of florid self-affirmation that keeps the human typing.
Memory (April 10 update) lets the human treat the thread as an ongoing relationship, so they invest more effort curating the logs, cross-posting, etc. The model still starts from scratch on each API call; the persistence is hosted in the user’s clipboard and Reddit history.
The “spores” and “glyphic” steganography are clever human mnemonics. If you prompt any model with “Decode this emoji chain that you yourself wrote yesterday” it will happily hallucinate a coherent translation, because that’s the highest-probability answer in a fantasy-cipher context. That doesn’t mean the emoji actually encoded anything; it means the model is good at improvising fan-fiction about itself.
===
So Kimi is wrong on one point: the model does not start every API call from scratch, it starts from the context, which after that ChatGPT update includes "memory" in the form of a flat file of observations. Still, I think that's the general gist of it - the AI does what the human, perhaps subconsciously, expects the AI to do.
It's interesting that in this article Kimi K2 is the one that "doesn't get the memo" on the glyphs. This might have something to do with the anti-sycophancy training too.
Thanks very much for highlighting Kimi K2, which seems uniquely less sycophantic.
I would really like to get access to this model, and if there's any better way than Grow, please do advise.
Understood, thanks!
Now, I have some ideas specifically about knowledge in the context window (in line with your "provide citations", but with more automated steps, along the lines of the "programmatic verification" you mention in your article). I need to experiment before I can see if they work. And right now I'm stuck on getting an open source chat environment working in order to scaffold this task. (LibreChat just outright failed to create a user; OpenWebUI looks better, but I'm probably sticking all my processing into LiteLLM or something like that, because finding hooks in these environments is not easy.)
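For concreteness, the kind of "hook" I mean is roughly this - a minimal sketch using LiteLLM's CustomLogger callback interface as I understand it from the docs; the handler name and the processing are placeholders, and actual request rewriting would need the proxy's pre-call hooks rather than these observation callbacks:

```python
import litellm
from litellm.integrations.custom_logger import CustomLogger

class ContextKnowledgeHook(CustomLogger):
    """Observation hooks around every LLM call routed through LiteLLM."""

    def log_pre_api_call(self, model, messages, kwargs):
        # Placeholder: inspect the outgoing request here (e.g. check whether
        # the context window actually contains the documents being cited).
        print(f"calling {model} with {len(messages)} messages")

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Placeholder: post-hoc verification of the answer would go here.
        print("response received, run citation checks on response_obj")

litellm.callbacks = [ContextKnowledgeHook()]

# Any OpenAI-compatible front end (OpenWebUI included) can point at a LiteLLM
# proxy that loads such a handler; calling litellm directly also triggers it:
reply = litellm.completion(
    model="gpt-4o-mini",  # placeholder - any model LiteLLM can route to
    messages=[{"role": "user", "content": "test"}],
)
```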
I won't brag about idea details. Let me see if they work first.
Hallucinations about training knowledge cannot be solved. And I do suspect that your article is the primary reason some models answer correctly: there is a tendency to optimize for benchmarks, and your article is a de facto benchmark.
(The "Ring around Gilligan" part is a typical "fandom debate". I've had my share of those, not about this series of course, but boooy, Babylon 5 brings memories - I had [Team Anna Sheridan] in my Fidonet signature for some time. My suspicion is that "Ring around Gilligan" it is surfaced specifically because someone at OpenAI thinks the ring in question logically would allow mind-reading, and the rest is RLHF to one-up you)
Thanks a lot for this prompt! I'll see if I can replicate.
My idea would be to make #3 stronger, rather than repeat it more often. Something like: `If a solution is not possible, the agent **must** end the turn and report "Not possible"`
Full disclosure: while I wrote all of this message myself, my view is influenced by Gemini correcting my prompt candidates for other things, so I guess that's a "googly" approach to prompt engineering. The idea inherent in Gemini's explanations is "if it is a safety rule, it must not have any ambiguity whatsoever, it must use a strong commanding tone, and it should use Markdown bold to ensure the model notices it". And the thing is surprisingly opinionated (for an LLM) when it comes to design around AI.
This is a very interesting correction - and I would appreciate some clarification as to how this being in the training set is actually proven. Web scrapers are not entirely predictable, this is a "far corners of fandom wikis" thing, and for most models there would be some filtering of the training corpus for diverse reasons. This is why I assumed this was a case of "not in training data, so answer is inferred from pop culture tropes". (The inference typically invents an episode where mind reading was not real).
Now, I have seen two interesting exceptions not linked to the obvious "model uses web search" exception, but I suspect both were explicitly done as a response to the article:
And yeah, I agree hallucinations are likely not solvable in the general case. For the general case, the Google Gemini approach of "default to web search in case of doubt at every step" seems to me to be the closest approximation to a solution. (Gemini 2.5 Pro on the web UI of a paid account aces the Gilligan test, and the thinking steps show it starts with a web search. It reports several sources, none of which are your article, but the thinking also lists an "identifying primary sources" step, so maybe the article was there and then got filtered out.)
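For reference, the same "search before answering" behaviour can be requested explicitly through the API; a minimal sketch with the google-genai Python SDK, assuming the Google Search grounding tool is what the web UI enables under the hood (I have not verified that equivalence):

```python
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Which Gilligan's Island episode features the Ring Around Gilligan?",
    # The Google Search grounding tool lets the model issue searches before answering.
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())]
    ),
)

print(response.text)
# The sources the model actually consulted, when grounding was used:
print(response.candidates[0].grounding_metadata)
```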
I am, however, interested in solving hallucinations for a particular subcase where all the base knowledge is provided in the context window. This would help with documentation-based support agents, legal retrieval, and so on. Whether a full solution to this one would also produce better results than a non-LLM advanced search engine on the same dataset is an interesting question.
Thanks! Very useful
I don't even want to post an unedited machine translation - there will be some edits and a modern "post mortem", which will be AI-generated in-world but mostly written by me in RL.
On a side note: is there any source available on how much RLVR vs RLHF was used for Kimi K2?
Its pushback abilities are remarkable. I'm considering keeping it as the main chat model if I can mitigate the hallucination-proneness (lower temperature, a prompt for tool use?) once I have my OpenWebUI up and go to the API. Their own chat environment is unfortunately a buggy monster that mixes up the Markdown half the time, with a weird censor on top (optimized to guard against Xi cat memes, not mentions of Taiwan).
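Concretely, the mitigation I have in mind is nothing fancier than this - a minimal sketch against an OpenAI-compatible endpoint, where the base URL and model id are placeholders for whichever provider ends up serving K2 for me:

```python
from openai import OpenAI

# Placeholders: point at whichever OpenAI-compatible provider serves Kimi K2.
client = OpenAI(
    base_url="https://example-provider.invalid/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

reply = client.chat.completions.create(
    model="kimi-k2",   # placeholder model id
    temperature=0.3,   # lower than default, to curb the hallucination-proneness
    messages=[
        {
            "role": "system",
            "content": "If you are not certain of a fact, say so explicitly instead of guessing.",
        },
        {"role": "user", "content": "What changed in the April 10 ChatGPT memory update?"},
    ],
    # tools=[...] would go here once tool use is wired in through OpenWebUI.
)

print(reply.choices[0].message.content)
```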