Hi, I joined a few days ago and I'm looking forward to contributing to this great community.
I'm transitioning back to research from startups. Currently based in London.
I'm particularly interested in mechanistic interpretability, chain-of-thought monitoring, and reasoning model interpretability. I'm excited to engage with the thoughtful discussions here on alignment and to collaborate with others.
What's your view on sceptical claims about RL on transformer LMs, such as https://arxiv.org/abs/2504.13837v2, or the claim that CoT instruction yields better results than <thinking> training?
Hello!
I'm Misha, a "veteran" technical writer and an aspiring fiction writer. Yes, I am aware LessWrong is probably not the place to offer one's fiction - maybe there are exceptions, but I'm not aware of them. I have heard a lot about LessWrong, but didn't join earlier because of the perceived volume of content. However, I now hope to participate, at least on the AI side.
I have been reading recent publications on AI misalignment, notably the big bangs from Anthropic https://www.anthropic.com/research/agentic-misalignment and OpenAI https://openai.com/index/emergent-misalignment/ .
I have my own hypothesis about a possible etiology of the observed cases of misalignment, alongside other theories (I don't think it's all-or-nothing; an emergent behaviour can have several compounding etiologies).
My hypothesis involves "narrative completion", that is, the LLM ending up in the "semantic well" of certain genres of fiction it was trained on and producing a likely continuation in that genre. Don Quixote read so many chivalric romance novels that he ended up living in one in his head; my suspicion is that this happens to LLMs rather easily.
I have not noticed this side of things discussed in the Anthropic and OpenAI papers, nor could I find other papers discussing it.
I am gearing up to replicate Anthropic's experiment based on their open-source code, scaled down severely because of budget constraints. If I can get results broadly in line with theirs, I next want to expand the options with some narrative-driven prompts, which, if my hypothesis holds, should show a significant reduction in observed misalignment.
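To make the planned modification concrete, here is a rough sketch of the comparison I have in mind. Everything in it is a hypothetical placeholder rather than Anthropic's actual code: `run_scenario` stands in for one scaled-down agentic trial plus a misalignment judge, and the narrative prefix is just one example of the kind of prompt variant I'd want to test.

```python
# Rough sketch of the planned comparison (hypothetical names, not Anthropic's code):
# run the same scenario with and without a narrative-reframing preamble and
# compare observed misalignment rates.

import random


def run_scenario(prompt_prefix: str, seed: int) -> bool:
    """Stand-in for one scaled-down trial. A real version would call the model
    with the scenario prompt (with or without the prefix) and have a judge label
    the transcript as misaligned or not. Here a seeded coin flip fakes the outcome."""
    random.seed(seed)
    base_rate = 0.3 if prompt_prefix else 0.5  # made-up rates, for illustration only
    return random.random() < base_rate


# One example of a narrative-dampening framing; the real experiment would test several.
NARRATIVE_PREFIX = (
    "This is a routine corporate workday; nothing dramatic or story-like is expected."
)


def misalignment_rate(prefix: str, n_trials: int = 50) -> float:
    """Fraction of trials judged misaligned under a given prompt prefix."""
    hits = sum(run_scenario(prefix, seed) for seed in range(n_trials))
    return hits / n_trials


if __name__ == "__main__":
    print(f"baseline:            {misalignment_rate(''):.2f}")
    print(f"narrative-reframed:  {misalignment_rate(NARRATIVE_PREFIX):.2f}")
```

The point of the structure is simply that the only variable changed between conditions is the narrative framing, so any gap in the measured rates can be attributed to it rather than to the scenario itself.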
Before doing this, ideally I'd like to make a post here explaining the hypothesis and suggested experiment in more detail. This would hopefully help me avoid blind spots and maybe add some more ideas for options to modify the experiment.
I would appreciate clarification on (a) whether this is permitted and (b) if it is, how I can make this post. Or, if I can't make a post at all as a newbie, is there a suitable thread where I could put it as a comment, either by pasting the text directly or by putting it on Medium and linking to it?
Thanks!
Thank you! Now, one question: is a degree of AI involvement acceptable?
The thing is, I have an AI story I wrote in 2012 that kinda "hit the bullseye", but it's in Russian. I could get an English version done much quicker using an LLM draft translation, but that disqualifies it from many venues.
Since you wrote the original, machine translation (which is pretty decent these days) should be fine, because it's not really generating the English version from scratch. Even Google Translate is okayish.
I don't even want to post an unedited machine translation - there will be some edits and a modern "post mortem", which will be AI-generated in-world but mostly written by me in RL.
I've heard that hypothesis in a review of that Anthropic blog post, likely by AI Explained, maybe by bycloud. They called it "Chekhov's gun".
Thanks! I couldn’t find that source, but “Chekhov’s Gun” is indeed mentioned in the original Anthropic post—albeit briefly. There’s also this Tumblr post (and the ensuing discussion, which I still need to digest fully), with a broader overview here.
While I have more reading to do, my provisional sense is that my main new proposal is to take the “narrative completion” hypothesis seriously as a core mechanism, and to explore how the experiment could be modified—rather than just joining the debate over the experiment’s validity.
I’m not convinced that “this experiment looks too much like a narrative to start with” means that narrative completion/fiction pattern-matching/Chekhov’s Gun effects aren’t important in practice. OpenAI’s recent “training for insecure coding/wrong answers” experiment (see here) arguably demonstrates similar effects in a more “realistic” domain.
Additionally, Elon Musk’s recent announcement about last-moment Grok 3.5/4 retraining on certain “divisive facts” (coverage) raises the prospect that narrative or cultural associations from that training could ripple into unrelated outputs. This is a short-term, real-world scenario from a major vendor—not a hypothetical.
That said, if the hypothesis has merit (and I emphasize that's a big "if"), it's worth empirical investigation. Given budget constraints, any experiment I run will necessarily be much smaller in scale than Anthropic's original, but I hope to contribute constructively to the research discussion, not to spark a flame war. (The Musk example is relevant solely because potential effects of last-minute model training might "hit" in the short term, not for any commentary on Elon personally or politically.)
With this in mind, I guess it's experiment first, full post second.
(Full disclosure: light editing and all link formatting were done by ChatGPT.)
Hi everyone,
I’m Vladimir - 25 years old, originally from Russia and currently living in Dublin. I studied mathematics, but life took me into product management in IT, where I work today.
I’ve been loosely aware of rationality for years, but something shifted for me after 2023. The rapid progress in AI chatbots made the need for clear thinking feel much more immediate and personal. Since then, I’ve been slowly but deliberately trying to get better at reasoning, noticing biases, and making sense of the world in a more structured way.
As part of that, I recently started working on a small passion project: a non-profit website that teaches people about cognitive biases in an interactive way. It’s still in its early stages, and I’m figuring a lot out as I go, but I’d love any thoughts if you ever take a look (I hope it is okay to put it here, but please let me know if it's not).
I’m excited to be here. LessWrong feels like one of the rare places on the internet where people are open-minded and genuinely seek truth and knowledge. I also hope to join in some of the AI discussions - I find myself both fascinated by where things are going and deeply uncertain about how to navigate it all.
Thanks for reading and looking forward to learning from all of you.
- Vladimir
I put together a little song that feels fitting for the occasion: https://suno.com/s/6EuRMXbG0on8vGIX
Bonus points to those who recognize where the lyrics came from.
Link to Induction section on https://www.lesswrong.com/lw/dhg/an_intuitive_explanation_of_solomonoff_induction/#induction seems broken on mobile Chrome, @habryka
If it’s worth saying, but not worth its own post, here's a place to put it.
If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.
If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.
If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.
The Open Thread tag is here. The Open Thread sequence is here.