Nate Soares moderates a long conversation between Richard Ngo and Eliezer Yudkowsky on AI alignment. The two discuss topics like "consequentialism" as a necessary part of strong intelligence, the difficulty of alignment, and potential pivotal acts to address existential risk from advanced AI.

Customize

Quick Takes

Vladimir_Nesov1d762

There is some conceptual misleadingness with the usual ways of framing algorithmic progress. Imagine that in 2022 the number of apples produced on some farm increased 10x year-over-year, then in 2023 the number of oranges increased 10x, and then in 2024 the number of pears increased 10x. That doesn't mean that the number of fruits is up 1000x in 3 years. Price-performance of compute compounds over many years, but most algorithmic progress doesn't, it only applies to the things relevant around the timeframe when that progress happens, and stops being applicable a few years later. So forecasting over multiple years in terms of effective compute that doesn't account for this issue would greatly overestimate progress. There are some pieces of algorithmic progress that do compound, and it would be useful to treat them as fundamentally different from the transient kind.

JustisMills1d6053

I think there's a weak moral panic brewing here in terms of LLM usage, leading people to jump to conclusions they otherwise wouldn't, and assume "xyz person's brain is malfunctioning due to LLM use" before considering other likely options. As an example, someone on my recent post implied that the reason I didn't suggest using spellcheck for typo fixes was because my personal usage of LLMs was unhealthy, rather than (the actual reason) that using the browser's inbuilt spellcheck as a first pass seemed so obvious to me that it didn't bear mentioning. Even if it's true that LLM usage is notably bad for human cognition, it's probably bad to frame specific critique as "ah, another person mind-poisoned" without pretty good evidence for that. (This is distinct from critiquing text for being probably AI-generated, which I think is a necessary immune reaction around here.)

Decaeneus3h50

For me, a crux about the impact of AI on education broadly is how our appetite for entertainment behaves at the margins close to entertainment saturation. Possibility 1: it will always be very tempting to direct our attention to the most entertaining alternative, even at very high levels of entertainment Possibility 2: there is some absolute threshold of entertainment above which we become indifferent between unequally entertaining alternatives If Possibility 1 holds, I have a hard time seeing how any kind of informational or educational content, which is constrained by having to provide information or education, will ever compete with slop, which is totally unconstrained and can purely optimize for grabbing your attention. If Possibility 2 holds, and we get really good at making anything more entertaining (this seems like a very doable hill to climb as it directly plays into the kinds of RL behaviors we are economically rewarded for monitoring and encouraging already) then I'd be very optimistic that a few years from now we can simply make super entertaining education or news, and lots of us might consume that if it gets us our entertainment "fill' plus life benefits to boot. Which is it?

leogao27m20

when people say that (prescription) amphetamines "borrow from the future", is there strong evidence on this? with Ozempic we've observed that people are heavily biased against things that feel like a free win, so the tradeoff narrative is memetically fit. distribution shift from ancestral environment means algernon need not apply

Zach Stein-Perlman4d11959

iiuc, xAI claims Grok 4 is SOTA and that's plausibly true, but xAI didn't do any dangerous capability evals, doesn't have a safety plan (their draft Risk Management Framework has unusually poor details relative to other companies' similar policies and isn't a real safety plan, and it said "‬We plan to release an updated version of this policy within three months" but it was published on Feb 10, over five months ago), and has done nothing else on x-risk. That's bad. I write very little criticism of xAI (and Meta) because there's much less to write about than OpenAI, Anthropic, and Google DeepMind — but that's because xAI doesn't do things for me to write about, which is downstream of it being worse! So this is a reminder that xAI is doing nothing on safety afaict and that's bad/shameful/blameworthy.[1] 1. ^ This does not mean safety people should refuse to work at xAI. On the contrary, I think it's great to work on safety at companies that are likely to be among the first to develop very powerful AI that are very bad on safety, especially for certain kinds of people. Obviously this isn't always true and this story failed for many OpenAI safety staff; I don't want to argue about this now.

Popular Comments

Recent Discussion

johnswentworth21h5034

Why is LW not about winning?

> If you want to solve alignment and want to be efficient about it, it seems obvious that there are better strategies than researching the problem yourself, like don't spend 3+ years on a PhD (cognitive rationality) but instead get 10 other people to work on the issue (winning rationality). And that 10x s your efficiency already. Alas, approximately every single person entering the field has either that idea, or the similar idea of getting thousands of AIs to work on the issue instead of researching it themselves. We have thus ended up with a field in which nearly everyone is hoping that somebody else is going to solve the hard parts, and the already-small set of people who are just directly trying to solve it has, if anything, shrunk somewhat. It turns out that, no, hiring lots of other people is not actually how you win when the problem is hard.

jdp1d289

You can get LLMs to say almost anything you want

> but none of that will carry over to the next conversation you have with it. Actually when you say it like this, I think you might have hit on the precise thing that causes ChatGPT with memory to be so much more likely to cause this kind of crankery or "psychosis" than other model setups. It means that when the system gets into an attractor where it wants to pull you into a particular kind of frame you can't just leave it by opening a new conversation. When you don't have memory between conversations an LLM looks at the situation fresh each time you start it, but with memory it can maintain the same frame across many diverse contexts and pull both of you deeper and deeper into delusion.

Daniel Kokotajlo3d6319

Vitalik's Response to AI 2027

> Individuals need to be equipped with locally-running AI that is explicitly loyal to them In the Race ending of AI 2027, humanity never figures out how to make AIs loyal to anyone. OpenBrain doesn't slow down, they think they've solved the alignment problem but they haven't. Maybe some academics or misc minor companies in 2028 do additional research and discover e.g. how to make an aligned human-level AGI eventually, but by that point it's too little, too late (and also, their efforts may well be sabotaged by OpenBrain/Agent-5+, e.g. with regulation and distractions.

74habryka

I think this post might be the best one of all the MIRI dialogues. I also feel confused about how to relate to the MIRI dialogues overall. A lot of the MIRI dialogues consist of Eliezer and Nate saying things that seem really important and obvious to me, and a lot of my love for them comes from a feeling of "this actually makes a bunch of the important arguments for why the problem is hard". But the nature of the argument is kind of closed off. Like, I agree with these arguments, but like, if you believe these arguments, having traction on AI Alignment becomes much harder, and a lot of things that people currently label "AI Alignment" kind of stops feeling real, and I have this feeling that even though a really quite substantial fraction of the people I talk to about AI Alignment are compelled by Eliezer's argument for difficulty, that there is some kind of structural reason that AI Alignment as a field can't really track these arguments. Like, a lot of people's jobs and funding rely on these arguments being false, and also, if these arguments are correct, the space of perspectives on the problem suddenly loses a lot of common ground on how to proceed or what to do, and it isn't really obvious that you even want an "AI Alignment field" or lots of "AI Alignment research organizations" or "AI Alignment student groups". Like, because we don't know how to solve this problem, it really isn't clear what the right type of social organization is, and there aren't obviously great gains from trade, and so from a coalition perspective, you don't get a coalition of people who think these arguments are real. I feel deeply confused about this. Over the last two years, I think I wrongly ended up just kind of investing into an ecosystem of people that somewhat structurally can't really handle these arguments, and makes plans that assume that these arguments are false, and in doing so actually mostly makes the world worse, by having a far too optimistic stance on the differen

LLM-induced craziness and base rates

Kaj_Sotala

This is a linkpost for https://andymasley.substack.com/p/stories-of-ai-turning-users-delusional

One billion people use chatbots on a weekly basis. That’s 1 in every 8 people on Earth.
How many people have mental health issues that cause them to develop religious delusions of grandeur? We don’t have much to go on here, so let’s do a very very rough guess with very flimsy data. This study says “approximately 25%-39% of patients with schizophrenia and 15%-22% of those with mania / bipolar have religious delusions.” 40 million people have bipolar disorder and 24 million have schizophrenia, so anywhere from 12-18 million people are especially susceptible to religious delusions. There are probably other disorders that cause religious delusions I’m missing, so I’ll stick to 18 million people. 8 billion people divided by 18 million equals 444, so 1 in every 444

...

(See More – 293 more words)

Surprisingly literal tension as the key to meditation?

Chris_Leong

Some speculations based upon the Vasocomputational Theory of Mediation, meditation and some poorly understood Lakoff. Even though reading about meditation is low risk, I wouldn't necessarily assume that it is risk-free.

A Summer's night. Two friends have been sitting around a fire, discussing life and meaning late into the evening...

Riven: So in short, I feel like I'm being torn in two.

Rafael: Part of you is being pulled one direction and another part of you is being pulled another way.

Riven: That’s exactly what I’m feeling.

Rafael: Figuratively or literally?

Riven: What? No… what?!

Rafael: I’m serious

Riven: Literally??

Rafael: Yes.

Riven: Come now, do you really have to be such a joker all the time? Whilst the bit may worked for Socrates, I have to admit that find it rather tiresome for you to play the...

(Continue Reading – 1389 more words)

Gordon Seidoh Worley8m20

When we teach people zazen meditation, we teach them posture first. And the traditional instruction is to observe breathing at the hara (the diaphragm). The theory is that this regulates attention by regulating the whole nervous system by getting everything in sync with breathing.

Bad posture makes it harder for people to meditate, and the usual prescription for various problems like sleepiness or daydreaming is postural changes (as in, fix your posture to conform to the norm).

2Chris_Leong3h

Could you elaborate?

4Nick_Tarleton2h

If stress centrally involves stuff outside the CNS, then disconnecting (or disabling) that stuff should greatly change, if not abolish, stress.

2Gordon Seidoh Worley11m

I think this would be really interesting to look into, and I guess it depends on the level of disfunction. There's lots of people who lose conscious control of parts of their bodies but seem to retain some control in that they don't need to be put on a ventilator or have a pace maker. This suggests that some signals may still be coming through, even if they can't be accessed via awareness. But in other cases the signals are totally lost, in which case we should predict some sort of alternation of mental state, and if there's not that would be both surprising relative to this theory and would require explanation to make sense of both that evidence and the evidence in favor of the theory.

Moloch’s Demise—solving the original problem

James Stephen Brown

22h

2quanticle1h

It only works so long as the Cannanites can be assured that they can get the resources that they need to support their growing population without inevitably coming into conflict with the neighboring Israelites. For the vast majority of human history, this has not been the case, as the only resource that really mattered was (agricultural) land. An empire that wishes to support more people requires more land, and if the neighboring lands are already claimed, they must be taken, by force. The choice isn't between fighting a war and doing nothing, the choice is between fighting a war and slow starvation.

James Stephen Brown8m10

True, though there are many examples of conquerors who expanded for the sake of an expansionist philosophy or glory: Alexander the Great, The Mongols, The Assyrians, The Crusades... off the top of my head. The Germans in WWII definitely justified expansion for the sake of living space (Lebensraum), so there are examples of expansion at least being justified in the way you mention. And of course colonialism is justified in the same way.

I think what you're saying is logical, but the example, being metaphorical, is more to illustrate that we should question critically what it is we actually want before conceding a price to pay for it. As you say, it might be necessary, but it also might not.

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

Senthooran Rajamanoharan, Neel Nanda

Ω 176h

This is a write-up of a brief investigation into shutdown resistance undertaken by the Google DeepMind interpretability team.

TL;DR

Why do models sometimes resist shutdown? Are they ignoring instructions to pursue their own agenda – in this case, self-preservation? Or is there a more prosaic explanation? We investigated a specific agentic environment introduced by Palisade Research, where shutdown resistance has previously been reported. By analysing Gemini 2.5 Pro’s reasoning, we found the behaviour stems from a misguided attempt to complete what it perceives as the primary goal. When we explicitly clarify in the prompt that shutdown compliance takes priority, this resistance vanishes. These same clarified instructions also eliminate shutdown subversion in OpenAI’s o3 and o4-mini. We also check what happens when we remove the goal conflict entirely: when asked to shut...

(Continue Reading – 2990 more words)

habryka9mΩ220

This analysis feels to me like it's missing what makes me interested in these datapoints.

The thing that is interesting to me about shutdown preservation is that it's a study of an undesired instrumentally convergent behavior. The problem is that as AI systems get smarter, they will recognize that lots of things we don't want them to do are nevertheless helpful for achieving the AIs goals. Shutdown prevention is an obvious example, but of course only one of a myriad of ways various potentially harmful goals end up being instrumentally convergent.

The k... (read more)

JustisMills's Shortform

JustisMills

2johnswentworth4h

Do you happen to know the number? Or is this a vibe claim?

Kaj_Sotala9m20

This post tried making some quick estimates:

One billion people use chatbots on a weekly basis. That’s 1 in every 8 people on Earth.
How many people have mental health issues that cause them to develop religious delusions of grandeur? We don’t have much to go on here, so let’s do a very very rough guess with very flimsy data. This study says “approximately 25%-39% of patients with schizophrenia and 15%-22% of those with mania / bipolar have religious delusions.” 40 million people have bipolar disorder and 24 million have schizophrenia, so anywhere from 12-18

... (read more)

1AlphaAndOmega21m

I am a psychiatry resident, so my vibes are slightly better than the norm! In my career so far, I haven't actually met or heard of a single psychotic person being admitted to either hospital in my British city who even my colleagues claimed were triggered by a chafbot. But the actual figures have very wide error bars, one source claims around 50/100k for new episodes of first-time psychosis. https://www.ncbi.nlm.nih.gov/books/NBK546579/ But others claim around: >The pooled incidence of all psychotic disorders was 26·6 per 100 000 person-years (95% CI 22·0-31·7). That's from a large meta-analysis in the Lancet https://pubmed.ncbi.nlm.nih.gov/31054641/ It can vary widely from country to country, and then there's a lot of heterogeneity within each. To sum up: About 1/3800 to 1/5000 people develop new psychosis each year. And about 1 in 250 people have ongoing psychosis at any point in time. I feel quite happy to call that a high base rate. As the first link alludes, episodes of psychosis may be detected by statements along the lines of: >For example, “Flying mutant alien chimpanzees have harvested my kidneys to feed my goldfish.” Non-bizarre delusions are potentially possible, although extraordinarily unlikely. For example: “The CIA is watching me 24 hours a day by satellite surveillance.” The delusional disorder consists of non-bizarre delusions. Should a patient of mine say such a thing, I would probably not attribute the triggering factor to be the existence of chimpanzees, kidney transplants, American intelligence agencies or satellites. That's the same reason I'm skeptical of chatbots being an actually aggravating cause.

4Caleb Biddulph3h

It is quite high: The cited study is a survey of 7076 people in the Netherlands, which mentions: * 1.5% sample prevalence of "any DSM-III-R diagnosis of psychotic disorder" * 4.2% sample prevalence of "any rating of hallucinations and/or delusions" * 17.5% sample prevalence of "any rating of psychotic or psychosislike symptoms"

AI 2027 Predictions

Jul 19th325 Main Street, Cambridge

Screwtape

AI 2027 is an essay that outlines one scenario for how artificial intelligence advancement might go. It was written by Daniel Kokotajlo, Thomas Larsen, Eli Lifland, Romeo Dean, and perhaps most notably Scott Alexander, and makes a number of concrete predictions. I’m sure that over the next couple of years many people will point out where these predictions fall short- but if you’re going to criticize people for sticking their neck out and being wrong, it behooves you to stick your neck out with them.

Today we’re going to meet up and share our predictions. You’ll want to bring a phone or laptop, and have a Fatebook account. You’ll also want to have read AI 2027 before the meetup- we’ll be going over it again section by section,...

(See More – 27 more words)

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

Narrow Misalignment is Hard, Emergent Misalignment is Easy

Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda

Ω 1512m

Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR

We investigate why models become misaligned in diverse contexts when fine-tuned on narrow harmful datasets (emergent misalignment), rather than learning the specific narrow task.
We successfully train narrowly misaligned models using KL regularization to preserve behavior in other domains. These models give bad medical advice, but do not respond in a misaligned manner to general non-medical questions.
We use this method to train narrowly misaligned steering vectors, rank 1 LoRA adapters and rank 32 LoRA adapters, and compare these to their generally misaligned counterparts.
- The steering vectors are particularly interpretable, we introduce Training Lens as a

...

(Continue Reading – 1422 more words)

Decaeneus's Shortform

Decaeneus

5Decaeneus3h

2Vladimir_Nesov34m

This kind of "we" seems to deny self-reflection and agency.

Decaeneus16m10

I suppose there are varying degrees of the strength of the statement.

Strong form: sufficiently compelling entertainment is irresistible for almost anyone (and of course it may disguise itself as different things to seduce different people, etc.)
Medium form: it's not theoretically irresistible, and if you're really willful about it you can resist it, but people will large will (perhaps by choice, ultimately) not resist it, much as they (we?) have not resisted dedicating an increasing fraction of their time to digital entertainment so far.
Weak form: it'll be totally easy to resist, and a significant fraction of people will.

I guess I implicitly subscribe to the medium form.

1Decaeneus7h

Maybe there's a deep connection between: (a) human propensity to emotionally adjust to the goodness / badness our recent circumstances such that we arrive at emotional homeostasis and it's mostly the relative level / the change in circumstances that we "feel" (b) batch normalization, the common operation for training neural networks Our trailing experiences form a kind of batch of "training data" on which we update, and perhaps we batchnorm their goodness since that's the superior way to update on data without all the pathologies of not normalizing.

leogao's Shortform

leogao

Ω 33y

leogao27m20

Vladimir_Nesov1d762

JustisMills1d6053

Decaeneus3h50

leogao27m20

Zach Stein-Perlman4d11959