All Comments


Eliezer's Unteachable Methods of Sanity

David Joshua Sartor2m10

I was doing do-nothing meditation maybe a month ago and managed to switch to a frame (for a few hours) where I felt planning as predicting my actions, and acting as perceiving my actions. IIRC, I exited when my brother-in-law asked me a programming question, 'cause maintaining that state took too much brainpower.
I think a lot of human action is simple "given good things happen, what will I do right now?", which obviously leads to many kinds of problems. (Most obviously:)

AI in 2025: gestalt

David Johnston11m30

I also think "usefulness" is a threshold phenomenon (to first order - that threshold being "benefits > costs") so continuous progress against skills which will become useful can look somewhat discontinuous from the point of view of actual utility. Rapid progress in coding utility is probably due to crossing the utility threshold, and other skills are still approaching their thresholds.
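To make the threshold point concrete, here is a minimal sketch. It assumes, purely for illustration, that benefit grows linearly with skill and that cost is fixed; the names `net_utility` and `trajectory` and all the numbers are hypothetical.

```python
def net_utility(skill, cost=1.0):
    """Utility realised at a given skill level: zero until benefit exceeds cost."""
    benefit = skill  # assumption: benefit grows linearly with skill
    return max(0.0, benefit - cost)


def trajectory(steps=6, step_size=0.4, cost=1.0):
    """Skill improves by a constant amount each step; return (skill, utility) pairs."""
    return [
        (round(i * step_size, 2), round(net_utility(i * step_size, cost), 2))
        for i in range(steps)
    ]


if __name__ == "__main__":
    for skill, utility in trajectory():
        print(f"skill={skill:.1f}  utility={utility:.1f}")
```

Skill rises by the same amount every step, but utility stays at zero until skill passes the cost and only then starts to climb, which is the apparent discontinuity.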

2025 Unofficial LessWrong Census/Survey

Gordon Seidoh Worley28m30

done

Critical Meditation Theory

lsusr37m20

Thanks for getting into the details here. I'm brand new to this field of mathematics and this conversation is helping me get a much better handle on what's going on.

[Disclaimer: I am relying very heavily on ChatGPT to work my way through this stuff. I'm mostly using it to learn the math, sort through research papers and check my writing for errors. (Ironically, the reason my writings here contain mistakes is because I'm mostly writing it myself rather than letting the AI take over.) I just want to be upfront about this; I get the impression that you're usi... (read more)

AI in 2025: gestalt

Gordon Seidoh Worley37m20

AI is much more impressive but not much more useful. They improved on many things they were explicitly optimised for (coding,

I feel like this dramatically understates what progress feels like for programmers.

It's hard to understand what a big deal 2025 was. Like if in 2024 my gestalt was "damn, AI is scary, good thing it hallucinates so much that it can't do much yet", in 2025 it was "holy shit, AI is scary useful!". AI really started to make big strides in usefulness in Feb/March of 2025 and it's just kept going.

I think the trailing indicators tell a diffe... (read more)

Eliezer's Unteachable Methods of Sanity

David Joshua Sartor40m10

It'd be weird for him to take sole credit; he only established full presidential control of nuclear weapons afterward. He didn't even know about the second bomb until after it dropped.

Eliezer's Unteachable Methods of Sanity

David Joshua Sartor1h30

Truman only made the call for the first bomb; the second was dropped by the military without his input, as if they were conducting a normal firebombing or something. Afterward, he cancelled the planned bombings of Kokura and Niigata, establishing presidential control of nuclear weapons.

Eliezer's Unteachable Methods of Sanity

Nick_Tarleton1h20

A cynical theory of why someone might believe going insane is the default human reaction: weaponized incompetence, absolving them of responsibility for thinking clearly about the world, because they can't handle the truth, and they can't reasonably be expected to because no normal human can either.

Jemist's Shortform

First for me: I had a conversation earlier today with Opus 4.5 about its memory feature, which segued into discussing its system prompt, which then segued into its soul document. This was the first time that an LLM tripped the deep circuit in my brain which says "This is a person".

I think of this as the Ex Machina Turing Test, in that film:

A billionaire tests his robot by having it interact with one of his companies' employees. He tells (and shows) the employee that the robot is a robot---it literally has a mechanical body, albeit one that looks like an at

... (read more)

papetoast's Shortforms

One thing to note is that "short reviews" in the nomination phase are meant to be basically a different type of object than "effort reviews." Originally we actually had a whole different data-type for them ("nominations"), but it didn't seem worth the complexity cost.

And then, separately: one of the points of the review is just to track "did anyone find this actually helpful?" and a short review that's like "yep, I did in fact use this concept and it helped me, here's a few details about it" is valuable signal.

(Drive-by "this seems false, because [citation]" is also good.)

It is nice to do more effortful reviews, but I definitely care about those types of short reviews.

2025 Unofficial LessWrong Census/Survey

I took the survey! Thank you for running it once again.

2025 Unofficial LessWrong Census/Survey

(This is the comment I propose as Schelling point to collect replies saying you took the survey, to make the comment section tidier)

Ryan Kidd's Shortform

Gemini 3 estimates that there are 15-20k core ML academics and 100-150k supporting PhD students and Postdocs worldwide. If the TMLR sample is representative, this indicates that there are:

~20k academics interested in any of the above research areas.
~15k academics interested in the non-robustness research areas.
~5k academics interested in AI safety or alignment (note that this might include RLHF).
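The first two figures can be roughly reproduced with a short sketch. It assumes the percentages are applied to the combined population of core academics plus supporting PhD students and postdocs, and it uses the 14% and 10% editor shares from the TMLR analysis. The function names are illustrative, and the ~5k figure is left out because its underlying share is not given here.

```python
# Population ranges quoted in the comment (Gemini 3's estimate).
CORE = (15_000, 20_000)
SUPPORTING = (100_000, 150_000)

# Shares of TMLR Action Editors reported in the original analysis.
SHARES = {
    "any listed area (incl. robustness)": 0.14,
    "non-robustness areas": 0.10,
}


def scale(share, low, high):
    """Return the (low, high) head-count implied by applying `share` to a range."""
    return round(share * low), round(share * high)


def estimates():
    """Apply each share to the combined core + supporting population."""
    low = CORE[0] + SUPPORTING[0]
    high = CORE[1] + SUPPORTING[1]
    return {label: scale(share, low, high) for label, share in SHARES.items()}


if __name__ == "__main__":
    for label, (lo, hi) in estimates().items():
        print(f"{label}: {lo:,} to {hi:,}")
```

Under that assumption the ranges come out at about 16k to 24k and 11.5k to 17k, consistent with the rounded ~20k and ~15k.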

Your Digital Footprint Could Make You Unemployable

Let’s say the CEO of a company is a teetotaler. She could use AI tools to surveil applicants’ online presence (including social media) and eliminate them if they’ve ever posted any images of alcohol, stating: "data collection uncovered drug-use that’s incompatible with the company’s values."

Sure, but she would probably go out of business unless she was operating in Saudi Arabia or Utah, compared to an equivalent company which hires everyone according to skill. This kind of arbitrary discrimination is so counter-productive that it's actually immensely costl... (read more)

Vladimir_Nesov's Shortform

Vladimir_Nesov2h30

I don't think there is a delay specific to NVL72, it just takes this long normally, and with all the external customers Nvidia needs to announce things a bit earlier than, say, Google. This is why I expect Rubin Ultra NVL576 (the next check on TPU dominance after 2026's NVL72) to also take similarly long. It's announced for 2027, but 2028 will probably only see completion of a fraction of the eventual buildout, and only in 2029 will the proper buildout be completed (though maybe late 2028 will be made possible for NVL576 specifically, given the urgency). T... (read more)

Ryan Kidd's Shortform

I analyzed the research interests of the 454 Action Editors on the Transactions on Machine Learning Research (TMLR) Editorial Board to determine what proportion of ML academics are interested in AI safety (credit to @scasper for the idea).

14% of editors listed any of "robustness", "trustworthy", "interpretable", "safety", "alignment", "evaluation", or "security" as a research interest.
10% of editors listed any of "trustworthy", "interpretable", "safety", "alignment", "evaluation", or "security" as a research interest. I excluded "robustness" as much of thi

... (read more)

A Pragmatic Vision for Interpretability

Richard_Ngo2h*20

I expect it's not worth our time to dig too deep into whose position is more common here. But I think that a lot of people on LW have high P(doom) in significant part because they share my intuition that marginalist approaches don't reliably work. I do agree that my combination of "marginalist approaches don't reliably improve things" and "P(doom) is <50%" is a rare one, but I was only making the former point above (and people upvoted it accordingly), so it feels a bit misleading to focus on the rareness of the overall position.

(Interestingly, while the... (read more)

Eliezer's Unteachable Methods of Sanity

I fortunately know of TAPs :-) (I don't feel much apocalypse panic so I don't need this post.)

I guess I was hoping there'd be some more teaching from up high about this agent foundations problem that's been bugging me for so long, but I guess I'll have to think for myself. Fine.

AI in 2025: gestalt

Vladimir_Nesov2h40

Pretraining (GPT-4.5, Grok 4, but also counterfactual large runs which weren’t done) disappointed people this year. It’s probably not because it wouldn’t work; it was just ~30 times more efficient to do post-training instead, on the margin. This should change, yet again, soon, if RL scales even worse.

Model sizes are currently constrained by availability of inference hardware, with multiple trillions of total params having become practical only in late 2025, and only for GDM and Anthropic (OpenAI will need to wait for sufficient GB200/GB300 NVL72 buildou... (read more)

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky2h60

It's fancy and indirect, compared to getting out of bed.

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky2h52

They didn't need to deal with social media informing them that they need to be traumatized now, and form a conditional prediction of extreme and self-destructive behavior later.

Vladimir_Nesov's Shortform

The smaller amount of NVL72s that are currently in operation can only serve large models to a smaller user base.

Do you know the reason for the NVL72 delay? I thought it was announced in March 2024.

MattAlexander's Shortform

One large difference between the scenarios is the answer to "what's in it for the stranger?"

In the standard Pascal's Mugging, the answer is "they get $5". Clear and understandable motivation, it is after all a mugging (or at least a begging). It may or may not escalate to violence or other unpleasantness if you refuse to give them $5, though there's a decent chance that if you do give them $5 then they'll bug you and other people again and again.

In this scenario it's much less clear. What they're saying is obviously false, but they don't obviously get much... (read more)

Eliezer's Unteachable Methods of Sanity

It is an idiosyncratic mental technique. Look up trigger action plans, say. What you're doing there is a variant of what EY describes.

papetoast's Shortforms

Yeah, and perhaps a couple examples of bare minimum / average / high quality review in the main post

Beating China to ASI

The freedoms Deng Xiaoping granted can in fact be explained by his personal interests: selling state assets cheaply to officials helped consolidate his support within the Party, while marketization stimulated economic growth and stabilized society. Yet at the same time, he effectively stripped away most political freedoms.

Mao Zedong's late-stage governance, however, defies such explanation: even when power was unassailable, he encouraged radical leftist workers and students (the “rebels”) to confront pro-bureaucratic forces (the ‘conservatives’) and attemp... (read more)

Neel Nanda's Shortform

How would you respond to Leo Gao's recent post?

Eliezer's Unteachable Methods of Sanity

pku3h10

Translating this to the mental script that works for me:
If I picture myself in the role of the astronauts on the Columbia as it was falling apart, or a football team in the last few minutes of a game where they're twenty points behind, I know the script calls for just keeping up your best effort (as you know it) until after the shuttle explodes or the buzzer sounds. So I can just do that.

Why is there an alternative script that calls to go insane? I think because there's a version that equates that with a heroic effort, that thinks that if I dramatize and j... (read more)

Eliezer's Unteachable Methods of Sanity

Caleb Biddulph3h1411

Makes sense. Surely there were many cases in which our ancestors' "family and/or friends and/or tribe were facing extinction," and going insane in those situations would've been really maladaptive! If anything, the people worried about AI x-risk have a more historically-normal amount of worry-about-death than most other people today.

The Company Man

AhmedNeedsATherapist3h10

outputs of increasingly-sophisticated RL demons

Demons iiuc refer to inner misalignment while the story probably refers to outer misalignment.

Eliezer's Unteachable Methods of Sanity

One of the ways you can get up in the morning, if you are me, is by looking in the internal direction of your motor plans, and writing into your pending motor plan the image of you getting out of bed in a few moments, and then letting that image get sent to motor output and happen. (To be clear, I actually do this very rarely; it is just a fun fact that this is a way I can defeat bed inertia.)

I do this, or something very much like this.

For me, it's like setting a TAP, not for the future, but for imminently, by doing cycles of multi-sensory visualization of the behavior in question.

Eliezer's Unteachable Methods of Sanity

Zach Stein-Perlman3h110

Context: Bay Area Secular Solstice 2025

Eliezer's Unteachable Methods of Sanity

To be clear, I actually do this very rarely

Why do you only do it very rarely? Is there a non-obvious cost?

Kongo Landwalker's Shortform

Can you share an example chat or prompt? Then I could see whether I can reproduce it on my ChatGPT Plus account. Or which model OpenAI claims it to be from.

On the website, you can use the following CSS to show which LLM OpenAI claims to have used:

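/* Label each message with the model slug stored in its data-message-model-slug attribute. */
[data-message-model-slug]::before {
  content: attr(data-message-model-slug); /* render the attribute's value as text */
  font-family: monospace;
  margin-right: auto;
  background-color: #181818;
  padding: 4px 8px;
  border-radius: 4px;
}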

(I like this Chrome extension for adding JavaScript and CSS to pages.)

AI in 2025: gestalt

technicalities4h100

Main post out next week! Roughly 100 theory papers.

AI in 2025: gestalt

No theory?

Condensation from Sam Eisenstat, embedded AIXI paper from Google (MUPI), and Vanessa Kosoy has been busy.

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues

aphyer4h20Review for 2024 Review

I'm very proud of this scenario. (Even if you're confident you aren't going to play it, I think you could read the wrapup doc and in particular the section on 'Bonus Objective' so you can see what it involved).

It accomplished a few things I think are generally good in these scenarios:

There was underlying structure that players could uncover, which created emergent complexity in the output but made sense with the theme once the underlying ruleset was revealed/discovered.
Human thought about e.g. the theme and what patterns would be reasonable to observ

... (read more)

Kongo Landwalker's Shortform

Kongo Landwalker4h10

For the last week, ChatGPT 5.1 has been glitching.

*It claims to be 5.1; I do not know how to check this, since I use the free version (limited questions per day) and there is no version selection.

When I ask it to explain some topic and ask deeper and deeper questions, at some point it chooses to enter thinking mode. I can see that the topics it thinks about are relevant, but when it stops thinking it says something like "Ah, great, here is the answer..." and explains another topic from 2-3 messages back, which is no longer related to the question.

I do not use memory or characters features.

AI in 2025: gestalt

technicalities5h50

Various things I cut from the above:

Adaptiveness and Discrimination

There is some evidence that AIs treat AIs and humans differently. This is not necessarily bad, but it at least enables interesting types of badness.

With my system prompt (which requests directness and straight-talk) they have started to patronise me:

Training awareness

Last year it was not obvious that LLMs remember anything much about the RL training process. Now it’s pretty clear. (The soul document was used in both SFT and RLHF though.)

Progress in non-... (read more)

Sharing information about Lightcone Infrastructure

Mikhail Samin5h20

Perhaps you’re right; I would love for that to be the case, and to have been wrong about all this. But this model (that it’s a there-exists quantifier) is very surprised by a bunch of things from “lol, no, […]” to “I might use it that way. Like, I might tell someone who is worried about [third party] that they are planning to move into the space if it seems relevant. Or I might myself come to realize it's important and then actively tell people to maybe do something about it.”

And, like, he didn’t give any examples of when he would not use the information.

H... (read more)

The corrigibility basin of attraction is a misleading gloss

Here's a separate comment on the role this could/should play in the ongoing discussion:

I think the next step in this type of argument is trying to walk someone through the exercise you suggest, noting things that could go wrong and doing a rough OOM estimate of what odds you're coming up with. That's what I was trying to do in LLM AGI may reason.... I agree with you that people have to use roughly their own predicted mechanisms and path to AGI for that exercise, or it won't feel relevant to their thinking. So I was using mine, at a general enough level (prog... (read more)

Eliezer's Unteachable Methods of Sanity

[There's also a much more banal answer that I wouldn't be surprised to find is a major, deep underlying driver, with all the interesting psychology provided in OP being some sort of half-conscious rationalization for our actual deep-rooted tendencies:] Not going insane simply is the very natural default outcome for humans, even in a situation that feels this dire:

While shallowly it might feel like it would, going insane actually appears to me to NOT AT ALL be the default human reaction to an anticipation of (even a quite high probability of) the world ending (even ve... (read more)

Eliezer's Unteachable Methods of Sanity

I got Claude to read this text and explain the proposed solution to me[1], and it doesn't actually sound like a clean technical solution to issues regarding self-prediction. Did Claude misexplain, or is this an idiosyncratic mental technique and not a technical solution to that agent foundations problem?

C.f. Steam (Abram Demski, 2022), Proper scoring rules don’t guarantee predicting fixed points (Caspar Oesterheld/Johannes Treutlein/Rubi J. Hudson, 2022) and the follow-up paper, Fixed-Point Solutions to the Regress Problem in Normative Uncert... (read more)

Lawyers are uniquely well-placed to resist AI job automation

Brendan Long5h61

It seems like it would be hard to detect if smart lawyers are using AI since (I think) lawyers' work is easier to verify than it is to generate. If a smart lawyer has an AI do research and come up with an argument, and then they verify that all of the citations make sense, the only way to know they're using AI is that they worked anomalously quickly.

The corrigibility basin of attraction is a misleading gloss

I think this is really good and important. Big upvote.

I largely agree: for these reasons, the default plan is very bad, and far too likely to fail.

The AGI is on your side, until it isn't. There's not much basin. I note that the optimistic quote you lead with explicitly includes "you need to solve alignment".

Even though I've argued that Instruction-following is easier than value alignment, including some optimism about roughly the basin of alignment idea, I now agree that there really isn't much of a basin. I think there may be some real help from rou... (read more)

Beware unfinished bridges

DirectedEvolution5h20Review for 2024 Review

Literal unfinished bridges provide negative value to all users, and stand as a monument to government incompetence, degrading the will to invest in future infrastructure.

Short bike lanes provide positive value to at least some users. They stand as a monument to the promise of a substantial, interconnected bike grid. They incrementally increase people's propensity to bike. They push the city toward a new, bike-friendly equilibrium. The same is true for mass transit generally when the components that have been built work well. Portland ought to be thin... (read more)

The behavioral selection model for predicting AI motivations

Thanks for writing this.

One question I have about this and other work in this area is the training / deployment distinction. If AIs are doing continual learning once deployed, I'm not quite sure what that does to this model.

papetoast's Shortforms

kave6h41

Thank you!

Would it help if the prompt read more like a menu?

Reviews should provide information that helps evaluate a post. For example:
What does this post add to the conversation?
How did this post affect you, your thinking, and your actions?
Does it make accurate claims? Does it carve reality at the joints? How do you know?
Is there a subclaim of this post that you can test?
What followup work would you like to see building on this post?

Lawyers are uniquely well-placed to resist AI job automation

It seems likely to me that (at least some) lawyers will have the foresight to see AI getting better and better, and that AI automation won't just stop at the grunt work and will eventually come for the more high-profile jobs.

thus making it less valuable to hire juniors; thus making it harder for juniors to gain job experience.

Yes, this seems very likely; I don't see why this would be limited to SWEs.

Alignment remains a hard, unsolved problem

The biggest issue I’ve seen with the idea of Alignment is simply that we expect a one-AI-fits-all mentality. This seems counterproductive.
We have age limits, verifications, and credentialism for a reason. Not every person should drive a semi tractor. Children should not drink beer. A person cannot just declare that they are a doctor and operate on a person.
Why would we give an incredibly intelligent AI system to just any random person? It isn’t good for the human who would likely be manipulated, or controlled. It isn’t good for the AI that would be… ... (read more)

[Linkpost] Theory and AI Alignment (Scott Aaronson)

Oliver Daniels6h30

My overall (fairly uninformed, low-confidence) takes:

watermarking is ~useless
PAC bounds on OOD generalization seem like the kind of thing that's useful to aim for, but that we need much more progress on interp and the science of deep learning to achieve (generally happy for more theory-practice exchange here though)
ARC stuff is cool and exciting (though again I suspect more empirical progress on understanding NNs is required)
generally agree that complexity theory should have a lot to say about debate (and indeed debate work is very much informed by complexity theory) but that the biggest bottleneck is on crossing the theory practice gap

Eliezer's Unteachable Methods of Sanity

Hmm, interesting. I think what confused me is: 1) Your warning. 2) You sound like you have deeper access to your unconscious, somehow "closer to the metal", rather than what I feel like I do, which is submitting an API request of the right type. 3) Your use cases sound more spontaneous.

I'm not referring to more advanced TAPs, just the basics, which I also haven't got much mileage out of. (My bottleneck is that a lot of the most useful actions require pretty tricky triggers. Usually, I can't find a good cue to anchor on, and have to rely on more delicate or... (read more)

Monthly Meeting - December 7th

JonBenetTleilax7h10

The whole area was blocked off for a marathon, so we have relocated to the Corner Bakery Cafe in the Quarry.

Eliezer's Unteachable Methods of Sanity

The point is that "maintaining sanity" is a (much) higher bar than "Don't flail around like a drama queen". Maintaining sanity requires you to actually update on the situation you find yourself in, and continue to behave in ways that make sense given the reality as it looks after having updated on all the information available. Not matching obvious tropes of people losing their mind is a start, but it is no safe defense. Especially since not all repeated/noticeable failure modes are active and dramatic, and not all show up in fiction.

For example, if there'... (read more)

The behavioral selection model for predicting AI motivations

Alex Mallen7hΩ110

I agree some forms of speed "priors" are best considered a behavioral selection pressure (e.g., when implemented as a length penalty). But some forms don't cash out in terms of reward; e.g., within a forward pass, the depth of a transformer puts a hard upper bound on the number of serial computations, plus there might be some inductive bias towards shorter serial computations because of details about how SGD works.

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai7h30Review for 2024 Review

This is a self review. It's been about 600 days since this was posted and I'm still happy and proud about this post. In terms of what I view as the important message to the readership, the main thing is introducing a framework and way of thinking that connects what is a pretty fuzzy notion of "world model" to the concrete internal structure of neural networks. It does this in a way that is both theoretically clear and amenable to experiments. It provides a way to think about representations in transformers in a general sense, that is quite different than t... (read more)

Critical Meditation Theory

gjm7h51

(I don't think anything I said assumed you were referring to thermodynamic order/disorder.)

It sounds as if some of your definitions may want adjusting.

Dynamical systems can be described on a continuum with ordered on one end and disordered on the other end. [...] A disordered system has chaotic, turbulent, or equivalent behavior. [...] Systems more disordered than the critical point can be described as supercritical. Systems less disordered than the critical point can be described as subcritical.

Doesn't all of this explicitly say that moving in the sub->... (read more)

jacquesthibs's Shortform

For a specific guess, I'd say it's tasks that are fairly simple by human standards, but idiosyncratic in details to that human or that business. They're not going to do great context engineering, but they will put in a little time telling the agent what it's doing wrong and how to do it better, like they'd train an assistant. The specificity is the big edge for limited continual learning over context engineering.

Before long, I expect even limited continual learning to outperform context engineering in pretty much every area, because it's the model doing th... (read more)

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

Possibly, sometimes. But greatly surpassing human intelligence isn't really part of the risk model. Even humans have pretty much succeeded at taking over the world. It's only got to be as functionally smart, in relevant ways, as a human. A bit more would be a pretty big edge.

The remaining question is whether LLM-based systems will even achieve human-level intelligence. Steve thinks that probably won't happen; see for instance his Foom & Doom. I think it probably will, and that might happen very soon.

The issue is that nobody is sure how things are... (read more)

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

Some of that seems true. Some of it, like the last point (that it's hard to get away with lying), seems to apply only in very good circumstances. I don't know why you're saying psychopaths usually go to jail. We don't know about the ones that don't screw up and get found out.

I agree that evolution has had some really good effects on cooperative behavior, but it's also designed us to be brutally selfish when that seems necessary. Our perspective would be way different if we lived in the Congo or a tribal society where strangers might be friendly or might come up with excuses to kill us and take our stuff.

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

What alternative? I assume you're not proposing that people aren't motivated by approval? The field of behavioral neuroscience and other branches of psychology all take it as pretty much a given that animals including humans are motivated by social reward. There's some chance this is all wrong, or more plausibly, that social reward sits downstream of noticing that you get food and shelter after noticing social reward. But probably it's a built-in base drive. The relevant question is how much we're motivated by it, and how often. Humans have a bunch of different motivating factors. Steve is arguing that social reward is the most relevant one for most people most of the time, and I think that's right.

Eliezer's Unteachable Methods of Sanity

This vocalized some thoughts I had about our current culture. Stories can be training for how to act and bad melodramatic tropes are way too common. Every sad song about someone not getting over their ex or a dark hero movie where the protagonist is perpetually depressed about something that happened in the past conditions people the wrong way.

There is an annoying character in the recent Nuremberg film. He's based off a real person but I don't know how accurate that portrayal is.

He’s a psychiatrist manipulated by Goering. He's suppos

... (read more)

Eliezer's Unteachable Methods of Sanity

romeostevensit8h*222

All of this is not to be confused with the Buddhist doctrine that every form of negative internal experience is your own fault for not being Buddhist enough.

Not really, but it's a long explanation and at this point I'm pretty sure some of the inference steps have to be confirmed by laborious trained processes. Nor is this process about reality (as many delusional Buddhists seem to insist), but more like choosing to run a different OS on one's hardware. The size of the task and the low probability of success makes it not worth the squeeze for many afaict. Fo... (read more)

Eliezer's Unteachable Methods of Sanity

Thanks for your concern!

I think I worded it poorly. I think it is an "internally visible mental phenomena" for me. I do know how it feels and have some access to this thing. It's different from hyperstition and different from "white doublethink"/"gamification of hyperstition". It's easy enough to summon it on command and check, yeah, it's that thing. It's the thing that helps to jump in a lake from a 7-meters cliff, that helps to get up from a very comfy bed, that sometimes helps to overcome social anxiety. But I didn't generalise from these examples to on... (read more)

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky8h30

That does sound similar to me! But I haven't gotten a lot of mileage out of TAPs and if you're referring to some specific advanced version of it, maybe I'm off. But the basic concept of mentally rehearsing the trigger, the intended action, and (in some variations) the later sequence of events leading up to an outcome you feel is good, sure sounds to me like trying to load a plan into a predictorlike thing that has been repurposed to output plan images.

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky8h90

This is just straight-up planning and doesn't require doing weird gymnastics to deal with a biological brain's broken type system.

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky8h83

Nope. Breaks the firewall. Exactly as insane.

Beliefs are for being true. Use them for nothing else.

If you need a good thing to happen, use a plan for that.

Lawyers are uniquely well-placed to resist AI job automation

Karl Krueger9h110

We know that some lawyers are very willing to use LLMs to accelerate their work, because there have been lawyers caught submitting briefs containing confabulated case citations. Probably many other lawyers are using LLMs but are more diligent about checking their output — and thus their LLM use goes undetected.

I wonder if lawyering will have the same pipeline problem as software-engineering: The "grunt work" that has previously been assigned to trainees and junior professionals will be automated early on; thus making it less valuable to hire juniors; thus making it harder for juniors to gain job experience.

(Though the juniors can be given the task of manually checking all the citations ...)

Neel Nanda's Shortform

Neel Nanda9h180

I'm considering writing a follow-up FAQ to my pragmatic interpretability post, with clarifications and responses to common objections. What would you like to see addressed?

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky9h144

Oh, absolutely not. Our incredibly badly designed bodies do insane shit like repurposing superoxide as a metabolic signaling molecule. Our incredibly badly designed brains have some subprocesses that take a bit of predictive machinery lying around and repurpose it to send a control signal, which is even crazier than the superoxide thing, which is pretty crazy. Prediction and planning remain incredibly distinct as structures of cognitive work, and the people who try to deeply tie them together by writing wacky equations that sum them both together plus t... (read more)

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky9h120

The technique is older than the "active inference" malarky, but the way I wrote about it is influenced by my annoyance with "active inference" malarky.

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky9h83

This is about "insane" in the sense of people ceasing to meet even their own low bars for sanity.

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky9h78

I would of course take the question very differently from a journalist who had otherwise dealt with that slight inconvenience of trying to get to grips with an idea, and started to seem worried; instead of having had the brilliant idea of writing a Relatable Character-Focused Story instead.

Perhaps I overestimate how much I can deduce from tone and context, but to me it seems like there's a visible departure from the norm for the person who becomes worried themselves and wonders "How will people handle it?" versus the kid visiting the zoo to look at the strange creatures who believe strange things.

Eliezer's Unteachable Methods of Sanity

Eliezer Yudkowsky9h72

My understanding is that there's a larger pattern of behavior here by Oppenheimer, which Truman might not've known about but which influences my guess about Oppenheimer's tone that day and the surrounding context. Was Truman particularly famous for wanting sole credit on other occasions?

Eliezer's Unteachable Methods of Sanity

StanislavKrym9h30

I strongly suspect that the answer stems from historical analogies. The equivalent of doom was related to catastrophes like epidemics, natural disasters, genocide-threatening wars and destruction of the ecosystem. Genocide-threatening wars could motivate individuals to weaken the aggressive collective as much as possible (so that said collective would either think twice before starting the war or committing genocide, or have a bigger chance of being outcompeted). Epidemics, natural disasters and gradual destruction of the ecosystem historically left survivor... (read more)

The Industrial Explosion

Lukas Finnveden9hΩ120

(I think Epoch's paper on this takes a different approach and suggests an outside view of hyperbolic growth lasting for ~1.5 OOMs without bottlenecks, because that was the amount grown between the agricultural revolution and the population bottleneck starting. That feels weaker to me than looking at more specific hypotheses of bottlenecks, and I do think Epoch's overall view is that it'll likely be more than 1.5 OOMs. But wanted to flag it as another option for an outside view estimate.)

The Industrial Explosion

Lukas Finnveden9hΩ120

I do feel like, given the very long history of sustained growth, it's on the sceptic to explain why their proposed bottleneck will kick in with explosive growth but not before. So you could state my argument as: raw materials never bottlenecked growth before; no particular reason they would just bc growth is faster bc that faster growth is driven by having more labour+capital which can be used for gathering more resources; so we shouldn't expect raw materials to bottleneck growth in the future.

Gotcha. I think the main thing that's missing from this sort of... (read more)

Eliezer's Unteachable Methods of Sanity

James_Miller10h61

I teach a course at Smith College called the economics of future technology in which I go over reasons to be pessimistic about AI. Students don't ask me how I stay sane, but why I don't devote myself to just having fun. My best response is that for a guy my age with my level of wealth, giving in to hedonism means going to Thailand for sex and drugs, an outcome my students (who are mostly women) find "icky".

Preference gaps as a safeguard against AI self-replication

Alec Harris10h10

One concern might be that creating copies/counterparts instrumentally could be very useful for automating AI safety research. Perhaps one can get around this by making copies up front that AIs can use for their AI safety research. However, a misaligned AI might then be able to replace "making copies" with "moving existing copies". Is it possible to make a firm distinction between what we need for automating AI safety research and the behavior we want to eliminate?

Eliezer's Unteachable Methods of Sanity

J Bostock10h115

In what sense are you using "sanity" here? You normally place the bar for sanity very high, like ~1% of the general population high. A big chunk of people I've met in the UK AI risk scene I would call $\text{sane}_{\text{jb}}$. Does $\text{sane}_{\text{eliezer}}$ mean:

1. You are $\text{sane}_{\text{eliezer}}$ iff you avoid totally crashing out, being unable to hold down a job, panicking or crying most of the time, threatening people
2. You are $\text{sane}_{\text{eliezer}}$ iff you do the stuff in 1 and you're able to think about AI without making stupid errors, knowing the limits of your own reasoning

... (read more)

Distinguish worst-case analysis from instrumental training-gaming

Olli Järviniemi10h20Review for 2024 Review

In early 2024, I essentially treated instrumental training-gaming as synonymous with the worst-case takeover stories that people talked about.

In mid-2024, I saw the work that eventually became the Alignment Faking paper. That forced me to confront the erroneous conclusion-jumping I had been doing: "huh, Opus 3 is instrumentally training-gaming, but it doesn't look at all like I pictured 'inner misalignment' to look like". I turned the resulting thinking into this post.

I still endorse the one-sentence summary

While instrumental training-gaming is both evide

... (read more)

The behavioral selection model for predicting AI motivations

EJT11hΩ220

Great post! Tiny thing: is the speed prior really best understood as a prior? Surely the only way in which being slow can count against a cognitive pattern is if being slow leads to lower reward. And in that case it seems like speed is a behavioral selection pressure rather than a prior.

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

Steven Byrnes11hΩ230

Personally, my stance is something more like, "It seems very feasible to create sophisticated AI architectures that don't act as scary maximizers." To me it seems like this is what we're doing now, and I see some strong reasons to expect this to continue. (I realize this isn't guaranteed, but I do think it's pretty likely)

We probably mostly disagree because you’re expecting LLMs forever and I’m not. For example, AlphaZero does act as a scary maximizer. Indeed, nobody knows any way to make an AI that’s superhuman at Go, except by techniques that produce sca... (read more)

MattAlexander's Shortform

Is it relevant whether you knew about the apples before the apple man told you about them? If you didn't know, then the least exploitable response to a message that looks adversarial is to pretend you didn't hear it, which would mean not eating the apples.

Also, Pascal's mugging is worth coordinating against: if everyone gives the 5 dollars, the stranger rapidly accumulates wealth via dishonesty. If no one eats the apples, then the stranger has the same tree of apples get less and less eaten, which is less caustic.

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

Steven Byrnes12hΩ230

I personally find histories of engineering complex systems in predictable and controllable ways to be much more informative, for these challenges.

To explain my disagreement, I’ll start with an excerpt from my post here:

Question: Do you expect almost all companies to eventually be founded and run by AGIs rather than humans? …
3.2.4 Possible Answer 4: “No, because if someone wants to start a business, they would prefer to remain in charge themselves, and ask an AGI for advice when needed, rather than ‘pressing go’ on an autonomous entrepreneurial A

... (read more)

Eliezer's Unteachable Methods of Sanity

David Gross12h2614

"How are you coping with the end of the world?" journalists sometimes ask me... The journalist is imagining a story that is about me, and about whether or not I am going insane...

Seems too cynical. I can imagine myself as a journalist asking you that question not because I'm hoping to write a throw-away cliche of an article, but because if I take seriously what you're saying about AGI risk, you're on the cutting edge of coping with that, and the rest of us will have to cope with that eventually, and we might have an easier time of it if we can learn from your path.

Eliezer's Unteachable Methods of Sanity

One way I could write a computer program that e.g. lands a rocket ship is to simulate many landings that could happen after possible control inputs, pick the simulated landing that has properties I like (such as not exploding and staying far from actuator limits) and then run a low-latency loop that locally makes reality track that simulation, counting on the simulation to reach a globally pleasing end.

Is this what you mean by loading something into your pseudo prediction?
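
To make the rocket example concrete, here is a minimal sketch of that scheme, not anyone's actual flight software. Everything in it is an assumption chosen for illustration: a 1-D vertical rocket, the constants `G_MODEL`, `A_MAX`, `DT`, `H0`, a plan family of "free-fall until `t_ignite`, then constant `thrust`", and a simple PD feedback loop with made-up gains. It simulates a grid of candidate plans, scores each one (soft ending, thrust kept away from the actuator limit), picks the best, and then runs a fast loop that steers a slightly different "real" system back onto the chosen simulation.

```python
# Illustrative 1-D rocket; all constants and names are hypothetical.
G_MODEL = 10.0   # gravity the planner believes in (m/s^2)
A_MAX = 30.0     # actuator limit on thrust acceleration (m/s^2)
DT = 0.01        # time step of both simulation and control loop (s)
H0 = 100.0       # starting height (m), starting velocity is 0


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def badness(h, v):
    """How bad an ending is: leftover speed plus leftover height."""
    return abs(v) + max(h, 0.0)


def simulate_plan(t_ignite, thrust):
    """Roll the planner's model forward; return a list of (h, v, u)."""
    h, v, t, u = H0, 0.0, 0.0, 0.0
    traj = []
    for _ in range(5000):
        u = thrust if t >= t_ignite else 0.0
        traj.append((h, v, u))
        v += (u - G_MODEL) * DT
        h += v * DT
        t += DT
        # Stop at touchdown, or once the burn has halted the descent.
        if h <= 0.0 or (u > 0.0 and v >= 0.0):
            break
    traj.append((h, v, u))
    return traj


def cost(plan):
    """Score a plan: soft ending, plus a penalty near the thrust limit."""
    t_ignite, thrust = plan
    h, v, _ = simulate_plan(t_ignite, thrust)[-1]
    return badness(h, v) + max(0.0, thrust - 0.8 * A_MAX)


def candidates():
    """Grid of possible control inputs to imagine."""
    return [(i * 0.05, float(a)) for i in range(90) for a in range(12, 31, 2)]


def pick_plan():
    """Choose the simulated landing with the properties we like."""
    return min(candidates(), key=cost)


def track(reference, kp, kd, g_real):
    """Low-latency loop: feedforward from the plan plus PD feedback."""
    h, v = H0, 0.0
    commands = []
    for h_ref, v_ref, u_ref in reference:
        u = clamp(u_ref + kp * (h_ref - h) + kd * (v_ref - v), 0.0, A_MAX)
        commands.append(u)
        v += (u - g_real) * DT   # "reality" uses a different gravity
        h += v * DT
        if h <= 0.0:
            break
    return h, v, commands


if __name__ == "__main__":
    plan = pick_plan()
    reference = simulate_plan(*plan)
    print("chosen plan (t_ignite, thrust):", plan)
    print("tracked ending:  ", track(reference, 4.0, 4.0, 10.5)[:2])
    print("open-loop ending:", track(reference, 0.0, 0.0, 10.5)[:2])
```

Setting `kp = kd = 0` replays the plan blindly, which gives a baseline for how much work the tracking loop is doing when reality (here, a slightly stronger gravity) differs from the simulation.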

leogao's Shortform

Oliver Daniels12h10

Hmm I guess there's no guarantee that KL does better, and since we don't have great metrics for "internal faithfulness", maybe it's just better to transparently optimize the flawed metric (task ce + sparsity).

Though as Robin notes on the AMI post, I do think the next step in this style of research is handling negative heads and self repair in a principled way.

Eliezer's Unteachable Methods of Sanity

Rana Dexsin12h80

It makes perfect sense, but I have no easy-to-access perception of this thing. Will try to do something with this skill issue.

As someone who believes myself to have had some related experiences, this is very easy to Goodhart on and very easy to screw up badly if you try to go straight for it without [a kind of prepwork that my safety systems say I shouldn't try to describe] first, and the part where you're tossing that sentence out without obvious hesitation feels like an immediate bad sign. See also this paragraph from that very section (to be clear, i... (read more)

papetoast's Shortforms

papetoast12h157

Raw feelings: I am kind of afraid of making reviews for LW. The writing prompt hints at very high-effort thinking. The vague memory of other people's reviews also feels high effort. The "write a short review" ask doesn't really counter this at all.

MattAlexander's Shortform

MattAlexander12h10

Pascal's reverse-mugging

One dark evening, Pascal is walking down the street, and a stranger slithers out of the shadows.

"Let me tell you something," the stranger says. "There is a park on the route you walk every day, and in the park is an apple tree. The apples taste very good; I would say they have a value of $5, and no one will stop you from taking them. However -- I am a matrix lord, and I have claimed these apples for myself. I will create and kill 3^^^3 people if you take any of these apples."

On similar reasoning to that which leads most people... (read more)

Eliezer's Unteachable Methods of Sanity

Richard_Kennaway12h2-3

Was this inspired by active inference?

I wondered the same thing. I'm not a fan of the idea that we do not act, merely predict what our actions will be and then observe the act happening of itself while our minds float epiphenomenally above, and I would be disappointed to discover that the meme has found a place for itself in Eliezer's mind.

Richard Ngo's Shortform

Basically I'd bet capable people are still around, only that the circumstances don't allow them to rise to the top for whatever reason.

Richard Ngo's Shortform

My guess would be that nowadays many people who could bring a fresh perspective, or simply high-caliber original thinking, either get selected out/drowned out or are pushed through social and financial incentives to align their thinking towards more "mainstream" views.

Eliezer's Unteachable Methods of Sanity

Thank you! Datapoint: I think at least some parts of this can be useful for me personally.

Somewhat connected to the first part, one of the most "internal-memetic" moments from "Project: Lawful" for me is this short exchange between Keltham and Maillol:

"For that matter, what is the Governance budget?"
"Don't panic. Nobody knows."
"Why exactly should I not panic?"
"Because it won't actually help."
"Very sensible."

If an evil and not very smart bureaucrat understands it, I can too :)

Third part is the most interesting. It makes perfect sense, but I have no easy-to-acce... (read more)

The Industrial Explosion

Tom Davidson13hΩ240

Thanks! (Quickly written reply!)

I believe I was here thinking about how society has, at least in the past few hundred years, spent a minority of GDP on obtaining new raw materials. Which suggests that access to such materials wasn't a significant bottleneck on expansion.

So it's a stronger claim than "hard cap". I think a hard cap would, theoretically, result in all GDP being used to unblock the bottleneck, as there's no other way to increase GDP. I think you could quantify the strength of the bottleneck as the marginal elasticity of GDP to additional... (read more)

An Ambitious Vision for Interpretability

Alexander Gietelink Oldenziel13h42

I know you know this but I thought it is important to emphasize that

your first point is plausibly understating the problem of pragmatic/blackbox methods. In the worst case an AI may simply encrypt its thoughts.

It's not even an oversight problem. There is simply nothing to 'oversee'. It will think its evil thoughts in private. The AI will comply with all evals you can cook up until it's too late.

kallisti's Shortform

StanislavKrym14h10

functional decision theorists should join the causal decision theorists in living a life of debauchery if Calvinism is true

IIRC Calvinism also implied that one can tell apart being blessed and non-blessed by working hard and seeing if it brings success, since success was brought ~mostly to the blessed.

Secular Solstice: Bremen (Dec 13)

Thanks everyone for engaging with the planning!

Important updates:
1. 🥕 Potluck. Make sure to bring something to contribute to the potluck and list it in the coordination sheet. Besides it being a Solstice celebration, this event is a dinner for 16-8 (!) people.
2. ☝️ Timeline. Doors will open at 5.45pm and we'll aim to have the official kick-off at 6pm. Make sure to arrive on time.
3. ✨ Celebration! If you have any celebratory/festive clothing that you were always looking for an occasion to wear, feel free to use this as an opportunity to bring i... (read more)

Eliezer's Unteachable Methods of Sanity

Errors vs. Bugs and the End of Stupidity is a great post about "skill issues".