LESSWRONG
Superintelligence FAQ
By Scott Alexander

A basic primer on why AI might lead to human extinction, and why solving the problem is difficult. Scott Alexander walks readers through a number of questions, with evidence drawn from progress in machine learning.

[Today] Agentic property-based testing: finding bugs across the Python ecosystem
[Tomorrow] Rationalist Shabbat
Berkeley Solstice Weekend
2025 NYC Secular Solstice & East Coast Rationalist Megameetup
494 · Welcome to LessWrong! · Ruby, Raemon, RobertM, habryka · 6y · 76 comments
55 · Human Values ≠ Goodness · johnswentworth · 22h · 50 comments
329 · Legible vs. Illegible AI Safety Problems · Ω · Wei Dai · 4d · 92 comments
Popular Comments

Raemon · 18h · on "Please, Don't Roll Your Own Metaethics":
What are you supposed to do other than roll your own metaethics?
Wei Dai · 1d · on "The problem of graceful deference":
> Yudkowsky, being the best strategic thinker on the topic of existential risk from AGI

This seems strange to say, given that he:

1. decided to aim for "technological victory", without acknowledging or being sufficiently concerned that it would inspire others to do the same
2. decided it was feasible to win the AI race with a small team while burdened by Friendliness/alignment/x-safety concerns
3. overestimated the likely pace of progress relative to the difficulty of the problems, even on narrow problems that he personally focused on, like decision theory (still far from solved today, ~16 years later. Edit: see UDT shows that decision theory is more puzzling than ever)
4. bears a large share of responsibility for others being overly deferential to him, by writing/talking in a highly confident style and not explicitly pushing back on the over-deference
5. is still overly focused on one particular AI x-risk (takeover due to misalignment) while underemphasizing or ignoring many other disjunctive risks

These seemed like obvious mistakes even at the time (I wrote posts/comments arguing against them), so I feel like the over-deference to Eliezer is a completely different phenomenon from "But you can’t become a simultaneous expert on most of the questions that you care about.", or has very different causes. In other words, if you were going to spend your career on AI x-safety, of course you could have become an expert on these questions first.
habryka · 1d · on "Do not hand off what you cannot pick up":
> But still cheaper than learning to renovate a kitchen and doing it.

It's really not hard to learn how to renovate a kitchen! I have done it. Of course, you won't be able to learn how to do it all quickly or to a workman's standard, but I had my contractor show me how to cut drywall, how installing cabinets works, how installing stoves works, how to run basic electrical lines, and how to evaluate the load on an electrical panel. The reports my general contractor was delegating to were also mostly working from less than 30 hours of instruction for the specific tasks involved here (though they had more experience and were much faster at things like cutting precisely).

My guess is learning how to do this took like 20 hours? A small fraction of what a kitchen renovation took, and a huge boost to my ability to find good contractors.

This is the kind of mentality I don't understand and want to avoid at Lightcone. Renovating a kitchen is not some magically complicated task. If you really had to figure out how to do it fully on your own, you could probably just learn it using Youtube tutorials and first-principles reasoning in a month or two. Indeed, you can directly watch the journeys of people who have done exactly that on Youtube, so you can even see what typically goes wrong and avoid making the same mistakes. Of course, you then don't have to do it all on your own, but it's really not that hard to get to a point where you could do it on your own, if slowly.
42 · Solstice Season 2025: Ritual Roundup & Megameetups · Raemon · 7d · 6 comments
116 · Please, Don't Roll Your Own Metaethics · Ω · Wei Dai · 19h · 22 comments
329 · Legible vs. Illegible AI Safety Problems · Ω · Wei Dai · 4d · 92 comments
745 · The Company Man · Tomás B. · 2mo · 70 comments
304 · I ate bear fat with honey and salt flakes, to prove a point · aggliu · 10d · 50 comments
64 · Paranoia: A Beginner's Guide · habryka · 10h · 5 comments
304 · Why I Transitioned: A Case Study · Fiora Sunshine · 12d · 53 comments
694 · The Rise of Parasitic AI · Adele Lopez · 2mo · 178 comments
190 · Unexpected Things that are People · Ben Goldhaber · 5d · 11 comments
102 · Do not hand off what you cannot pick up · habryka · 1d · 15 comments
94 · How I Learned That I Don't Feel Companionate Love · johnswentworth · 2d · 17 comments
106 · The problem of graceful deference · TsviBT · 2d · 32 comments
132 · Condensation · Ω · abramdemski · 4d · 13 comments
Quick Takes

Drake Thomas · 2d · 2 comments (breaker25, Zach Stein-Perlman)
A few months ago I spent $60 ordering the March 2025 version of Anthropic's certificate of incorporation from the state of Delaware, and last week I finally got around to scanning and uploading it. Here's a PDF! After writing most of this shortform, I discovered while googling related keywords that someone had already uploaded the 2023-09-21 version online here, which is slightly different.

I don't particularly bid that people spend their time reading it; it's very long and dense, and I predict that most people trying to draw important conclusions from it who aren't already familiar with corporate law (including me) will end up being somewhat confused by default. But I'd like more transparency about the corporate governance of frontier AI companies, and this is an easy step.

Anthropic uses a bunch of different phrasings of its mission across various official documents; of these, I believe the COI's is the most legally binding one, which says that "the specific public benefit that the Corporation will promote is to responsibly develop and maintain advanced AI for the long term benefit of humanity." I like this wording less than others that Anthropic has used, like "Ensure the world safely makes the transition through transformative AI", though I don't expect it to matter terribly much.

I think the main thing this sheds light on is stuff like Maybe Anthropic's Long-Term Benefit Trust Is Powerless: as of late 2025, overriding the LTBT takes 85% of voting stock, or all of (a) 75% of founder shares, (b) 50% of Series A preferred, and (c) 75% of non-Series-A voting preferred stock. (And, unrelated to the COI but relevant to that post, it is now public that neither Google nor Amazon hold voting shares.)

The only thing I'm aware of in the COI that seems concerning to me re: the Trust is a clause added to the COI sometime between the 2023 and 2025 editions, namely the italicized portion of the following:

I think this means that the 3 LTBT-appointed directors do not have the abi
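As a reading aid for the LTBT override condition summarized in the quick take above, here is a minimal sketch of how that either/or threshold structure parses, assuming votes are tallied separately per share class. The function and parameter names are hypothetical, and this encodes the shortform's summary rather than the COI's actual legal language.

```python
# Hypothetical sketch of the late-2025 LTBT override condition as summarized above.
# Encodes the shortform's summary, not the COI's legal text; names are illustrative.

def ltbt_override_passes(
    total_voting_frac,      # fraction of all voting stock supporting the override
    founder_frac,           # fraction of founder shares supporting it
    series_a_frac,          # fraction of Series A preferred supporting it
    other_preferred_frac,   # fraction of non-Series-A voting preferred supporting it
):
    blanket_route = total_voting_frac >= 0.85
    per_class_route = (
        founder_frac >= 0.75
        and series_a_frac >= 0.50
        and other_preferred_frac >= 0.75
    )
    return blanket_route or per_class_route

# Example: 80% of overall voting stock is not enough on its own...
print(ltbt_override_passes(0.80, 0.70, 0.40, 0.60))  # False
# ...but clearing all three per-class thresholds works even below 85% overall.
print(ltbt_override_passes(0.80, 0.80, 0.55, 0.80))  # True
```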
Wei Dai · 8h · 3 comments (MondSemmel, Yair Halberstadt)
Today I was author-banned for the first time, without warning and as a total surprise to me, ~8 years after banning power was given to authors, but less than 3 months after @Said Achmiz was removed from LW. It seems to vindicate my fear that LW would slide towards a more censorious culture if the mods went through with their decision. Has anyone noticed any positive effects, BTW? Has anyone who stayed away from LW because of Said rejoined?

Edit: In addition to the timing, I do not recall previously seeing a ban based on just one interaction/thread, rather than some long-term pattern of behavior. Also, I'm not linking the thread because IIUC the mods do not wish to see authors criticized for exercising their mod powers, and I also don't want to criticize the specific author. I'm worried about the overall cultural trend caused by admin policies/preferences, not trying to apply pressure to the author who banned me.
Daniel Tan · 2h · 0 comments
"Functional interpretability"  A while ago I wrote a blogpost attempting to articulate the limitations of mechanistic interpretability, and define a broader / more holistic philosophy of how we try to understand LLM behaviours. At the time I called this 'prosaic interpretability', but didn't like this very much in hindsight.  Since then I've updated on the name, and I now think 'functional' or 'black-box' interpretability is a good term for this. Copying from a comment by @L Rudolf L (emphasis mine)  I think this accurately describes several types of ongoing work:  * The model organisms research agenda that Anthropic's alignment science team is pursuing * Owain Evans - style research on cognitive abilities and emergent properties of LLMs * Generally, identifying and studying upstream causes of LLM behaviour that extend beyond looking at the static artifact (pretraining data, midtraining data, optimization objectives, general inductive biases, learning theory, ... )  --- I don't think any of this is particularly novel to those in the know, but I'm writing this so I can point at it in the future 
Cleo Nardo · 21h · 5 comments (Sheikh Abdur Raheem Ali, gwern, and 2 more)
Remember Bing Sydney?

I don't have anything insightful to say here. But it's surprising how little people mention Bing Sydney. If you ask people for examples of misaligned behaviour from AIs, they might mention:

* Sycophancy from 4o
* Goodharting unit tests from o3
* Alignment-faking from Opus 3
* Blackmail from Opus 4

But, like, three years ago: Bing Sydney. The most powerful chatbot was connected to the internet and — unexpectedly, without provocation, apparently contrary to its training objective and prompting — threatening to murder people! Are we memory-holing Bing Sydney, or are there good reasons for not mentioning this more?

Here are some extracts from Bing Chat is blatantly, aggressively misaligned (Evan Hubinger, 15th Feb 2023).
GradientDissenter · 1d · 7 comments (lesswronguser123, Eli Tyre, and 1 more)
LessWrong feature request: make it easy for authors to opt out of having their posts in the training data.

If most smart people were put in the position of a misaligned AI and tried to take over the world, I think they’d be caught and fail.[1] If I were a misaligned AI, I think I’d have a much better shot at succeeding, largely because I’ve read lots of text about how people evaluate and monitor models, strategies schemers can use to undermine evals and take malicious actions without being detected, and creative paths to taking over the world as an AI. A lot of that information is from LessWrong.[2]

It's unfortunate that this information will probably wind up in the pre-training corpus of new models (though it is often still worth it overall to share most of this information[3]). LessWrong could easily change this for specific posts! They could add something to their robots.txt to ask crawlers looking to scrape training data to ignore the pages. They could add canary strings to the page invisibly. (They could even go a step further and add something like copyrighted song lyrics to the page invisibly.) If they really wanted, they could put the content of a post behind a captcha for users who aren’t logged in. This system wouldn't be perfect (edit: please don't rely on these methods. They're harm reduction for information you otherwise would have posted without any protections), but I think even reducing the odds or the quantity of this data in the pre-training corpus could help.

I would love to have this as a feature at the bottom of drafts. I imagine a box I could tick in the editor that would enable this feature (and maybe let me decide if I want the captcha part or not). Ideally the LessWrong team could prompt an LLM to read users’ posts before they hit publish. If it seems like the post might be something the user wouldn't want models trained on, the site could proactively ask the user if they want to have their post be remove
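For concreteness, here is a minimal sketch of the robots.txt and invisible-canary ideas mentioned in the quick take above, in Python. The crawler user-agent tokens listed are the publicly documented ones I'm aware of (GPTBot, ClaudeBot, CCBot, Google-Extended) and should be checked against current documentation; the canary text and paths are illustrative placeholders, not a standard canary GUID or real LessWrong routes.

```python
# Hypothetical sketch of per-post training-data opt-out signals, per the ideas above.
# Crawler tokens are the publicly documented ones I'm aware of; verify before relying on them.

AI_TRAINING_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]

def robots_txt_rules(opted_out_paths):
    """Build robots.txt stanzas asking AI-training crawlers to skip opted-out posts."""
    lines = []
    for agent in AI_TRAINING_CRAWLERS:
        lines.append(f"User-agent: {agent}")
        lines.extend(f"Disallow: {path}" for path in opted_out_paths)
        lines.append("")  # blank line between stanzas
    return "\n".join(lines)

def invisible_canary_html(canary_text):
    """An invisible marker in the post's HTML, so trained models can later be probed for it."""
    return f'<div style="display:none" aria-hidden="true">{canary_text}</div>'

print(robots_txt_rules(["/posts/abc123/example-opted-out-post"]))
print(invisible_canary_html("TRAINING DATA CANARY example-guid-not-a-real-canary"))
```

Note that robots.txt is purely advisory and the canary only helps with detection after the fact, which matches the author's caveat that these are harm-reduction measures rather than guarantees.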
Simon Lermen · 2d · 10 comments (Joey KL, avturchin, and 4 more)
The Term Recursive Self-Improvement Is Often Used Incorrectly

Also on my substack.

The term Recursive Self-Improvement (RSI) now sometimes gets used for any case where AI automates AI R&D. I believe this is importantly different from its original meaning and changes some of the key consequences. OpenAI has stated that their goal is recursive self-improvement, with projections of hundreds of thousands of automated AI R&D researchers by next year and full AI researchers by 2028. This appears to be AI-automated AI research rather than RSI in the narrow sense.

When Eliezer Yudkowsky discussed RSI in 2008, he was referring specifically to an AI instance improving itself by rewriting the cognitive algorithm it is running on—what he described as "rewriting your own source code in RAM." According to the LessWrong wiki, RSI refers to "making improvements on one's own ability of making self-improvements." However, current AI systems have no special insight into their own opaque functioning. Automated R&D might mostly consist of curating data, tuning parameters, and improving RL environments to hill-climb evaluations, much like human researchers do. Eliezer concluded that RSI (in the narrow sense) would almost certainly lead to fast takeoff. The situation is more complex for AI-automated R&D, where the AI does not understand what it is doing. I still expect AI-automated R&D to substantially speed up AI development.

Why This Distinction Matters

Eliezer described the critical transition as the point when "the AI's metacognitive level has now collapsed to identity with the AI's object level." I believe he was basically imagining something like the human mind and evolution merging toward the same goal—the process that designs the cognitive algorithm and the cognitive algorithm itself becoming one. As an example, imagine the model realizes that its working memory is too small for it to be very effective at R&D and it directly edits its working memory. This appears less likely if the AI r
Mo Putera · 1d · 1 comment (Thomas Kwa)
Interesting anecdotes from an ex-SpaceX engineer who started out thinking "Elon's algorithm" was obviously correct and gradually grew cynical as SpaceX scaled: This makes me wonder if SpaceX could actually be substantially faster if it took systems engineering as seriously as the author hoped (like say the Apollo program did), overwhelmingly dominant as they currently are in terms of mass launch fraction etc. To quote the author: