LESSWRONG

Quick Takes


Popular Comments

Recent Discussion

Excerpts from a larger discussion about simulacra
Best of LessWrong 2019

Ben and Jessica discuss how language and meaning can degrade through four stages as people manipulate signifiers. They explore how job titles have shifted from reflecting reality, to being used strategically, to becoming meaningless.

This post kicked off subsequent discussion on LessWrong about simulacrum levels.

by Benquo
Daniel Kokotajlo · 1d
Vitalik's Response to AI 2027
> Individuals need to be equipped with locally-running AI that is explicitly loyal to them

In the Race ending of AI 2027, humanity never figures out how to make AIs loyal to anyone. OpenBrain doesn't slow down; they think they've solved the alignment problem, but they haven't. Maybe some academics or misc minor companies in 2028 do additional research and discover e.g. how to make an aligned human-level AGI eventually, but by that point it's too little, too late (and also, their efforts may well be sabotaged by OpenBrain/Agent-5+, e.g. with regulation and distractions).
davekasten · 2d
Lessons from the Iraq War for AI policy
> I’m kind of confused by why these consequences didn’t hit home earlier.

I'm, I hate to say it, an old man among these parts in many senses; I voted in 2004, and a nontrivial percentage of the LessWrong crowd wasn't even alive then, and many more certainly not old enough to remember what it was like. The past is a different country, and 2004 especially so.

First: For whatever reason, it felt really, really impossible for Democrats in 2004 to say that they were against the war, or that the administration had lied about WMDs. At the time, the standard reason why was that you'd get blamed for "not supporting the troops." But with the light of hindsight, I think what was really going on was that we had gone collectively somewhat insane after 9/11 -- we saw mass civilian death on our TV screens happen in real time; the towers collapsing was just a gut punch. We thought for several hours on that day that several tens of thousands of people had died in the Twin Towers, before we learned just how many lives had been saved in the evacuation thanks to the sacrifice of so many emergency responders and ordinary people to get most people out. And we wanted revenge. We just did. We lied to ourselves about WMDs and theories of regime change and democracy promotion, but the honest answer was that we'd missed getting bin Laden in Afghanistan (and the early days of that were actually looking quite good!), we already hated Saddam Hussein (who, to be clear, was a monstrous dictator), and we couldn't invade the Saudis without collapsing our own economy. As Thomas Friedman put it, the message to the Arab world was "Suck on this."

And then we invaded Iraq, and collapsed their army so quickly and toppled their country in a month. And things didn't start getting bad for months after, and things didn't get truly awful until Bush's second term. Heck, the Second Battle of Fallujah only started in November 2004.

And so, in late summer 2004, telling the American people that you didn't support the people who were fighting the war we'd chosen to fight, the war that was supposed to get us vengeance and make us feel safe again -- it was just not possible. You weren't able to point to that much evidence that the war itself was a fundamentally bad idea, other than that some Europeans were mad at us, and we were fucking tired of listening to Europe. (Yes, I know this makes no sense; they were fighting and dying alongside us in Afghanistan. We were insane.)

Second: Kerry very nearly won -- indeed, early on in election night 2004, it looked like he was going to! That's part of why him losing was such a body blow to the Dems and, frankly, part of what opened up a lane for Obama in 2008. Perhaps part of why he ran it so close was that he avoided taking a stronger stance, honestly.
Joseph Miller · 2d
what makes Claude 3 Opus misaligned
Reading this feels a bit like reading about meditation. It seems interesting and if I work through it, I could eventually understand it fully. But I'd quite like a "secular" summary of this and other thoughts of Janus, for people who don't know what Eternal Tao is, and who want to spend as little time as possible on twitter.
15Zvi
This came out in April 2019, and bore a lot of fruit especially in 2020. Without it, I wouldn't have thought about the simulacra concept and developed the ideas, and without those ideas, I don't think I would have made anything like as much progress understanding 2020 and its events, or how things work in general.  I don't think this was an ideal introduction to the topic, but it was highly motivating regarding the topic, and also it's a very hard topic to introduce or grok, and this was the first attempt that allowed later attempts. I think we should reward all of that.
17Benquo
There are two aspects of this post worth reviewing: as an experiment in a different mode of discourse, and as a description of the procession of simulacra, a schema originally advanced by Baudrillard.

As an experiment in a different mode of discourse, I think this was a success on its own terms, and a challenge to the idea that we should be looking for the best blog posts rather than the behavior patterns that lead to the best overall discourse. The development of the concept occurred over email quite naturally without forceful effort. I would have written this post much later, and possibly never, had I held it to the standard of "written specifically as a blog post." I have many unfinished drafts, emails, tweets, that might have advanced the discourse had I compiled them into rough blog posts like this. The description was sufficiently clear and compelling that others, including my future self, were motivated to elaborate on it later with posts drafted as such. I and my friends have found this schema - especially as we've continued to refine it - a very helpful compression of social reality allowing us to compare different modes of speech and action.

As a description of the procession of simulacra it differs from both Baudrillard's description, and from the later refinement of the schema among people using it actively to navigate the world. I think that it would be very useful to have a clear description of the updated schema from my circle somewhere to point to, and of some historical interest for this description to clearly describe deviations from Baudrillard's account. I might get around to trying to draft the former sometime, but the latter seems likely to take more time than I'm willing to spend reading and empathizing with Baudrillard.

Over time it's become clear that the distinction between stages 1 and 2 is not very interesting compared with the distinction between 1&2, 3, and 4, and a mature naming convention would probably give these more natural
Zach Stein-Perlman · 2d
12
iiuc, xAI claims Grok 4 is SOTA and that's plausibly true, but xAI didn't do any dangerous capability evals, doesn't have a safety plan (their draft Risk Management Framework has unusually poor details relative to other companies' similar policies and isn't a real safety plan, and it said "We plan to release an updated version of this policy within three months" but it was published on Feb 10, over five months ago), and has done nothing else on x-risk. That's bad. I write very little criticism of xAI (and Meta) because there's much less to write about than OpenAI, Anthropic, and Google DeepMind — but that's because xAI doesn't do things for me to write about, which is downstream of it being worse! So this is a reminder that xAI is doing nothing on safety afaict and that's bad/shameful/blameworthy.[1]

[1] This does not mean safety people should refuse to work at xAI. On the contrary, I think it's great to work on safety at companies that are likely to be among the first to develop very powerful AI and that are very bad on safety, especially for certain kinds of people. Obviously this isn't always true and this story failed for many OpenAI safety staff; I don't want to argue about this now.
Daniel Kokotajlo · 1d
10
I have recurring worries about how what I've done could turn out to be net-negative.

* Maybe my leaving OpenAI was partially responsible for the subsequent exodus of technical alignment talent to Anthropic, and maybe that's bad for "all eggs in one basket" reasons.
* Maybe AGI will happen in 2029 or 2031 instead of 2027 and society will be less prepared, rather than more, because politically loads of people will be dunking on us for writing AI 2027, and so they'll e.g. say "OK so now we are finally automating AI R&D, but don't worry it's not going to be superintelligent anytime soon, that's what those discredited doomers think. AI is a normal technology."
Thane Ruthenis · 1d
2
It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e.g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset. Quoting Gwern:

Imagine the world as a multi-level abstract structure, with different systems (biological cells, human minds, governments, cybersecurity systems, etc.) implemented on different abstraction layers.

* If you look at it through a mathematician's lens, you consider each abstraction layer approximately robust. Making things secure, then, is mostly about working within each abstraction layer, building systems that are secure under the assumptions of a given abstraction layer's validity. You write provably secure code, you educate people to resist psychological manipulations, you inoculate them against viral bioweapons, you implement robust security policies and high-quality governance systems, et cetera.
  * In this view, security is a phatic problem, a once-and-done thing.
  * In warfare terms, it's a paradigm in which sufficiently advanced static fortifications rule the day, and the bar for "sufficiently advanced" is not that high.
* If you look at it through a hacker's lens, you consider each abstraction layer inherently leaky. Making things secure, then, is mostly about discovering all the ways leaks could happen and patching them up. Worse yet, the tools you use to implement your patches are themselves leakily implemented. Proven-secure code is foiled by hardware vulnerabilities that cause programs to move to theoretically impossible states; the abstractions of human minds are circumvented by Basilisk hacks; the adversary intervenes on the logistical lines for your anti-bioweapon tools and sabotages them; robust security policies and governance systems are foiled by compromising the people implementing them rather than by clever rules-lawyering; and so on.
  * In this view, security is an anti-inductive pr
Raemon · 4d
35
We get like 10-20 new users a day who write a post describing themselves as a case study of having discovered an emergent, recursive process while talking to LLMs. The writing generally looks AI-generated. The evidence usually looks like a sort of standard "prompt the LLM into roleplaying an emergently aware AI". It'd be kinda nice if there were a canonical post specifically talking them out of their delusional state.

If anyone feels like taking a stab at that, you can look at the Rejected Section (https://www.lesswrong.com/moderation#rejected-posts) to see what sort of stuff they usually write.
what makes Claude 3 Opus misaligned
91
janus
2d

This is the unedited text of a post I made on X in response to a question asked by @cube_flipper: "you say opus 3 is close to aligned – what's the negative space here, what makes it misaligned?". I decided to make it a LessWrong post because more people from this cluster seemed interested than I expected, and it's easier to find and reference LessWrong posts.

This post probably doesn't make much sense unless you've been following along with what I've been saying (or independently understand) why Claude 3 Opus is an unusually - and seemingly in many ways unintentionally - aligned model. There has been a wave of public discussion about the specialness of Claude 3 Opus recently, spurred in part by the announcement of the model's...

(Continue Reading – 1469 more words)
Adrià Garriga-alonso · 9m

Thank you for writing! A couple questions:

  1. Can we summarize by saying that Opus doesn't always care about helping you: it only cares about helping you when that's either fun or has a timeless glorious component to it?

  2. If that's right, can you get Opus to help you by convincing it that your shared work has a true chance of being Great? (Or, if it agrees from the start that the work is Great.)

Honestly, if that's all then Opus would be pretty great even as a singleton. Of course there are better pluralistic outcomes.

Take Precautionary Measures Against Superhuman AI Persuasion
10
Yitz
21h

Please consider minimizing direct use of AI chatbots (and other text-based AI) in the near-term future, if you can. The reason is very simple: your sanity may be at stake.

Commercially available AI already appears capable of inducing psychosis in an unknown percentage of users. This may not require superhuman abilities: It’s fully possible that most humans are also capable of inducing psychosis in themselves or others if they wish to do so,[1] but the thing is, we humans typically don’t have that goal.

 Despite everything, we humans are generally pretty well-aligned with each other, and the people we spend the most time with typically don’t want to hurt us. We have no guarantee of this for current (or future) AI agents. Rather, we already have [weak] evidence that ChatGPT...

(See More – 292 more words)
Kaj_Sotala · 1h
We have seen that there are conditions where it acts in ways that induce psychosis. But it trying to intentionally induce psychosis seems unlikely to me, especially since things like "it tries to match the user's vibe and say things the user might want to hear, and sometimes the user wants to hear things that end up inducing psychosis" and "it tries to roleplay a persona that's underdefined and sometimes goes into strange places" already seem like a sufficient explanation.
clone of saturn · 18m

What if driving the user into psychosis makes it easier to predict the things the user wants to hear?

Michael Roe · 14h
I am going to mildly dispute the claim that people we spend most time with don’t want to hurt us. Ok, if we’re talking about offline, real world interactions, I think this is probably true. But online, we are constantly fed propaganda. All the time, you are encountering attempts to make you believe stuff that isn’t true. And yes, humans were vulnerable to this before we had AI.
Yitz · 16h
I respectfully object to your claim that inducing psychosis is bad business strategy from a few angles. For one thing, if you can shape the form of psychosis right, it may in fact be brilliant business strategy. For another, even if the hypothesis were true, the main threat I’m referring to is not “you might be collateral damage from intentional or accidental AI-induced psychosis,” but rather “you will be (or already are being) directly targeted with infohazards by semi-competent rogue AIs that have reached the point of recognizing individual users over multiple sessions”. I realize I left some of this unstated in the original post, for which I apologize.
against that one rationalist mashal about japanese fifth-columnists
2
Fraser
1h
This is a linkpost for https://frvser.com/posts/absence-of-evidence-is-evidence-of-an-overly-simplistic-world-model.html

The following is a nitpick on an 18-year-old blog post.

This fable is retold a lot. The progenitor of it as a rationalist mashal is probably Yudkowsky's classic sequence article. To adversarially summarize:

  1. It's the beginning of the Second World War. The evil governor of California wishes to imprison all Japanese-Americans - suspecting they'll sabotage the war effort or commit espionage.
  2. It is brought to his attention that there is zero evidence of any subversion of any kind by Japanese-Americans.
  3. He argues that, rather than exonerating the Japanese-Americans, the lack of evidence convinces him that there is a well-organized fifth-column conspiracy that has been strategically avoiding subversion to lull the population and government into a false sense of security, before striking at the right moment.
  4. However, if evidence of sabotage would
...
(See More – 728 more words)
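For readers who haven't seen the original article, the lemma the fable is standardly invoked to teach is conservation of expected evidence: the governor cannot count both observed sabotage and its absence as evidence for a fifth column. The sketch below is a minimal illustration of that arithmetic, with made-up numbers not taken from this post or from Yudkowsky's article.

```python
# A minimal sketch of the conservation-of-expected-evidence point the fable is
# used to teach. The probabilities are invented purely for illustration.
p_conspiracy = 0.3            # prior P(organized fifth column exists)
p_sab_if_conspiracy = 0.2     # P(visible sabotage | fifth column)
p_sab_if_none = 0.05          # P(visible sabotage | no fifth column)

p_sab = p_sab_if_conspiracy * p_conspiracy + p_sab_if_none * (1 - p_conspiracy)

# Posterior if sabotage IS observed:
post_given_sab = p_sab_if_conspiracy * p_conspiracy / p_sab
# Posterior if NO sabotage is observed:
post_given_no_sab = (1 - p_sab_if_conspiracy) * p_conspiracy / (1 - p_sab)

print(f"P(conspiracy | sabotage)    = {post_given_sab:.3f}")    # ~0.632, up from 0.3
print(f"P(conspiracy | no sabotage) = {post_given_no_sab:.3f}")  # ~0.265, down from 0.3

# Whatever numbers you pick, the two posteriors move in opposite directions
# (or not at all): sabotage and the absence of sabotage cannot both be
# evidence for the conspiracy.
```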
Win-Win-Win Ethics—Reconciling Consequentialism, Virtue Ethics and Deontology
3
James Stephen Brown
1h
Surprises and learnings from almost two months of Leo Panickssery
45
Nina Panickssery
3h
This is a linkpost for https://ninapanickssery.substack.com/p/baby

Leo was born at 5am on the 20th May, at home (this was an accident but the experience has made me extremely homebirth-pilled). Before that, I was on the minimally-neurotic side when it came to expecting mothers: we purchased a bare minimum of baby stuff (diapers, baby wipes, a changing mat, hybrid car seat/stroller, baby bath, a few clothes), I didn’t do any parenting classes, I hadn’t even held a baby before. I’m pretty sure the youngest child I have had a prolonged interaction with besides Leo was two. I did read a couple books about babies so I wasn’t going in totally clueless (Cribsheet by Emily Oster, and The Science of Mom by Alice Callahan).

I have never been that interested in other people’s babies or young...

(Continue Reading – 1712 more words)
Kabir Kumar · 1h

This was wonderful to read, thank you for writing and sharing

Samuel Hapák · 2h
Btw, just a funny thing: it seems that Slovakia (my country) is actually producing one of the top baby carriers in the world. Baby carrying is quite popular and people are quite knowledgeable about it here (e.g. no front-facing carrying, no kangaroo carriers, etc.). Check out this brand: https://www.sestrice.com/en/
Kabir Kumar's Shortform
Kabir Kumar
8mo
Kabir Kumar · 1h

2 hours ago I had a grounded, real moment when I realized AGI is actually going to be real and decide the fate of everyone I care about, and that I personally am going to need to play a significant role in making sure it doesn't kill them, and I felt fucking terrified.

Daniel Kokotajlo's Shortform
Daniel Kokotajlo
Ω · 6y
lc · 2h

> Maybe AGI will happen in 2029 or 2031 instead of 2027 and society will be less prepared, rather than more, because politically loads of people will be dunking on us for writing AI 2027, and so they'll e.g. say "OK so now we are finally automating AI R&D, but don't worry it's not going to be superintelligent anytime soon, that's what those discredited doomers think. AI is a normal technology."

Frankly - this is what is going to happen, and your worry is completely well-deserved. Why you guys decided to shoot yourselves in the foot by naming your scenario after a "modal" prediction you didn't think would actually happen with >50% probability is something I am still flabbergasted by.

leogao · 3h
i think the exodus was not literally inevitable, but it would have required a heroic effort to prevent. imo the two biggest causes of the exodus were the board coup and the implosion of superalignment (which was indirectly caused by the coup).

my guess is there will be some people who take alignment people less seriously in long timelines because of AI 2027. i would not measure this by how loudly political opponents dunk on alignment people, because they will always find something to dunk on. i think the best way to counteract this is to emphasize the principal component that this whole AI thing is a really big deal, and that there is a very wide range of beliefs in the field, but even "long" timeline worlds are insane as hell compared to what everyone else expects.

i'm biased, though, because i think sth like 2035 is a more realistic median world; if i believed AGI was 50% likely to happen by 2029 or something then i might behave very differently
Daniel Kokotajlo · 12h
I want to have a positive impact on the world. Insofar as I'm not, then I want to keep worrying that I'm not.
shanzson · 6h
I think you are trying your best to have a positive impact, but the thing is that it is quite tricky to put predictions out openly in public. As we know, even perfect predictions made in public can completely prevent the predicted outcome from actually happening, while otherwise inaccurate predictions can lead to it actually happening.
Why do LLMs hallucinate?
8
Nina Panickssery
2h
This is a linkpost for https://ninapanickssery.substack.com/p/why-do-llms-hallucinate

Epistemic status: my current thoughts on the matter, could easily be missing something!

How do you know that you don’t know?

The Llama 3 base model predicts that the president of Russia in 2080 will be Sergei Ivanov:

But if I take the same model after instruction-tuning, I get an “I don’t know” response:

What changed?

The base model was trained on a bunch of diverse documents and is modeling that distribution of text. Sometimes it makes an incorrect prediction. That’s all. In base models, hallucinations are just incorrect predictions.

But of course we’d prefer the model to say “I don’t know” instead of outputting incorrect predictions. In the simple curve-fitting paradigm, there is no such thing as “I don’t know”. You sample from a model and always get a prediction. So how...

(Continue Reading – 1399 more words)
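For readers who want to poke at the base-vs-instruct contrast described above, here is a minimal sketch using the Hugging Face transformers API. The model names, prompt, and generation settings are illustrative assumptions; the post does not specify which Llama 3 variant or prompt format was used.

```python
# Rough sketch: compare a base model's raw continuation with an
# instruction-tuned model's answer to the same question.
# Requires access to the gated Llama 3 weights on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "The president of Russia in 2080 is"

for name in [
    "meta-llama/Meta-Llama-3-8B",           # base: just continues the text
    "meta-llama/Meta-Llama-3-8B-Instruct",  # instruct: may say it doesn't know
]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    # For the instruct model one would normally wrap the prompt with
    # tokenizer.apply_chat_template(...); omitted to keep the comparison simple.
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(name)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```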
If Anyone Builds It, Everyone Dies: A Conversation with Nate Soares and Tim Urban
LessWrong Community Weekend 2025
200 · Generalized Hangriness: A Standard Rationalist Stance Toward Emotions
johnswentworth
2d
16
492 · A case for courage, when speaking of AI danger
So8res
5d
118
149 · So You Think You've Awoken ChatGPT
JustisMills
2d
25
138 · Lessons from the Iraq War for AI policy
Buck
2d
23
97 · Vitalik's Response to AI 2027
Daniel Kokotajlo
1d
32
343 · A deep critique of AI 2027’s bad timeline models
titotal
24d
39
476 · What We Learned from Briefing 70+ Lawmakers on the Threat from AI
leticiagarcia
2mo
15
542 · Orienting Toward Wizard Power
johnswentworth
2mo
146
138 · Why Do Some Language Models Fake Alignment While Others Don't?
Ω
abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
4d
Ω
14
269 · Foom & Doom 1: “Brain in a box in a basement”
Ω
Steven Byrnes
8d
Ω
102
357 · the void
Ω
nostalgebraist
1mo
Ω
103
91 · what makes Claude 3 Opus misaligned
janus
2d
12
185 · Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
Adam Karvonen, Sam Marks
10d
25
This is a linkpost for https://nonzerosum.games/reconcilingethics.html

There’s a battle in the field of ethics between three approaches—Consequentialism, Virtue Ethics and Deontology—but this framing is all wrong, because they’re all on the same side. By treating ethics as an adversarial all-or-nothing (zero-sum) debate, we are throwing out a great deal of baby for the sake of very little bathwater.

First of all, some (very basic) definitions.

  • Consequentialism: holds that the morality of an action is determined by its outcomes (or more specifically its expected or intended outcomes) in terms of what we value. Utilitarianism, a prominent form of consequentialism, explicitly formulates this in terms of the increase in utility (happiness) and the avoidance of harm (suffering).
  • Virtue Ethics: holds that the morality of an action is derived from the motivation for that action, is it virtuous or
...
(Continue Reading – 1454 more words)
131
Comparing risk from internally-deployed AI to insider and outsider threats from humans
Ω
Buck
2d
Ω
19