I've had this on my to-review log all review season, and I guess I'm getting it in with mere hours before the review closes.
What does this post add to the conversation?
The most important piece I think this adds is that the problem is not simple.
I keep seeing people run into one of these community conflict issues, and propose simple and reasonable sounding ways to solve it. I do not currently believe the issue is impossible, but it's not as simple as some folks think when they stare at it the first time.
How did this post affect you, your thinking, and your actions?
Con...
There's lots of low-hanging fruit on Wikipedia. Now when I think "ugh, come on" or "that looks wrong" or "why doesn't this post have X??" etc. I either edit it immediately or write it down to edit (or procrastinate on editing) later. The most rewarding part is when I'm trying to recall something in a conversation, pull up a relevant Wikipedia page, scroll through to find the info... and then realize I'm the one who put it there.
This post and others like it were what got me to start editing. I imagine that somewhere out there is someone like my teenage self, reading Wikipedia and having a slightly easier time learning a bit more than I did.
"Armchair psychologizing about which of my rhetorical opponents' cognitive deficits cause them to fail to agree with me" is by far my least favorite kind of LessWrong post, and the proposed solution to the "problem" ("recruit smarter people to the field") is not interesting or insightful.
The median researcher hypothesis seems false. Something like an 80/20 distribution seems much more plausible, and is presumably more like what you'd find for measurable proxies of 'influence on a field' like number of publications in "top tier" journals, or number of researchers in the field who were your grad students. Voting "no".
I gave this a +4 because it feels pretty important for the "how to develop a good intellectual" question. I'd give it a +9 if it was better argued.
Here's a visual description: imagine all worlds, before you see evidence, cut into two: YEP and NOPE. The ratio of how many worlds are in each (aka probability mass or size) represents the prior odds. Now, you see some evidence E (e.g. a metal detector beeping), so we want to know the ratio after seeing it.
Each part of the prior cut produces worlds with E (e.g. produces beeps). The YEP side produces E-worlds in proportion to (Chance of E if YEP), while the NOPE side produces them in proportion to (Chance of E if NOPE).
And thus the new ratio is the product of the prior odds and that likelihood ratio.
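A minimal numeric sketch of that product rule, using hypothetical numbers for the metal-detector example:

```python
# Odds-form Bayes update: posterior odds = prior odds * likelihood ratio.
# The numbers below are hypothetical, purely to illustrate the mechanics.
prior_odds = 1 / 4       # 1 YEP world for every 4 NOPE worlds
p_beep_if_yep = 0.9      # chance the detector beeps in a YEP world
p_beep_if_nope = 0.1     # chance it beeps anyway in a NOPE world

likelihood_ratio = p_beep_if_yep / p_beep_if_nope
posterior_odds = prior_odds * likelihood_ratio
print(posterior_odds)    # 2.25, i.e. 9:4 in favour of YEP after the beep
```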
In case you don't know what odds are, they express...
This post is a good overview of near-term prospects for human biological enhancement, arguing that selection is more tractable than editing, and further that iterated selection has the potential to be exceptionally powerful. The biology is largely accurate as of the time of writing, although the claim "we can now do iterated meiosis" was premature. Since 2024 I've been working on refining my meiosis induction to improve epigenetics and also allow for iterated meiosis. As of early 2026, I've made significant progress but this still isn't ready for use in hu...
I think this is a great attempt at creating a causal model of enlightenment, and I think it falls flat for some very specific reasons.
I have a hard time putting my disagreements with this post into words but the disagreement is there and I think it is relatively substantial. The problem as I see it comes from the perspectives that are drawn upon for the explanation of meditative experiences.
It is a bit like how we have that WEIRD https://www.theguardian.com/books/2020/nov/20/the-weirdest-people-in-the-world-review-a-theory-of-everything-study people s...
Due to Raemon's call for more reviews, "especially critical ones", I decided to read and review this one, so yes, I am interested in the topic. The post lists many interesting aspects of the discussed regulation.
Nonetheless, I think the post would not be a good choice for the 2024 Review.
I found this to be the most concrete post from the feedbackloop-first rationality sequence. I really appreciated the empiricist sort of frame of actually going out and trying to do a simple toy experiment to test some assumptions. Hope is blinding, and I remember more often to double-check my thinking for it after reading this post.
(Self-review.) I started this series to explore my doubts about the "orthodox" case for alignment pessimism. I wrote it as a dialogue and gave my relative non-pessimist character the designated idiot character name to make it clear that I'm just exploring ideas and not staking my reputation on "heresy". ("Maybe alignment isn't that hard" doesn't sound like a smart person's position—and in fact definitely isn't a smart person's for sufficiently ambitious conceptions of what it would mean to "solve the alignment problem." Simplicia isn't saying, "Oh, yeah, w...
This was beautifully written. I give it +4.
I like that it includes opposite examples. Is nature gentle, friendly, and harmonious? Or is it indifferent, hostile, and murderous? This is the wrong question to ask.
A surprisingly large fraction of people I talk to try to convince me that some kind of irrationality is good actually, and that my overly strong assumptions about rationality are what cause me to expect AI doom. I'm fairly sure this is false and that I'm making relatively weak assumptions about what it means to be rational. One assumption I am happy to make is that an agent will try to avoid shooting itself in the foot (by its own lights). What sort of actions count as "shooting itself in the foot" depends on what the goals are about, and what the environment is like, and...
I still think this post is pretty good and I stand by the arguments. I'm really glad Peter convinced me to work on it with him.
In Sections 1, 2 & 3 we tried to set up consequentialism and the arguments for why this framework fits any agent that generalizes in certain ways.
There are relatively few posts that try to explain why inner alignment problems are likely, rather than just possible. I think one good way to view our argument in Section 4 is as a generalization of Carlsmith's counting argument for scheming, except with somewhat less reliance on intent...
In this post John describes a method by which functioning democracies can attempt to prevent tyranny of the majority - giving each major faction a de-facto veto over new legislation.
Whilst this is a method used in some countries, it is by no means the only, or indeed the most common, method for achieving this. I am only properly familiar with the UK as a counter-example, but Dumbledore's army lists France, Germany, Italy and Canada as some others.
The method described in the post is likely more useful in situations where there are 2-3 major factions, whose va...
A neat progress update, though largely obsoleted by more recent posts (see below).
LessWrong is famously obsessed with the trainable skills and social practice of rationality, and with the prospect of very strong computer intelligence. The biological pathway to improved cognition doesn't get as much discussion, and the timelines to impact are (probably!) somewhat longer, but I remain convinced of its importance. I also strongly agree with the ethical position that
...nobody’s civil rights should be in any way violated on account of their genetic code, and t
I consider this idea essential social technology on the level of "the map is not the territory." Super basic, but it's everywhere once you learn to see it. I think about and make use of this concept on a weekly basis. Definitely seems worthy of consideration for Best Of to me.
Unfortunately this post hasn't gotten much attention, but I do think it offers some value to the conversation. Its main weakness is that it's very hypothetical, and it would have been better if I could have made it more formal and concrete. Also, I could probably have written it better.
I sent the post to @abramdemski and @Gordon Seidoh Worley; Abram said it looks totally correct, and Gordon said it's thinking in the right direction.
I think this post makes an important observation which hasn't been made elsewhere (at least on LW) - even if any worldview ...
I found this post valuable for stating an (apparently popular on LW) coherentist view pretty succinctly. But its dismissal of foundationalism seems way too quick:
"Superbabies or other cognitive enhancement" seems like one of the major projects going on on LW right now and seemed like this post was something I'd like to see reviewed.
I don't really feel able to do a good job of that myself. Maybe interested in a review from @GeneSmith , @TsviBT , @Zac Hatfield-Dodds or @kave.
"Get everyone who could use one a Thinking Assistant" still feels like one of the higher order interventions available. It at the very least raises the floor of my productivity (by being accountable to someone for basically working at all).
I have found it fairly hard to hire for the role, and I think the main problem is still "reliability", which IMO is more important than having any particular skills.
I've had a Thinking Assistant for the past year. One benefit I positively updated on is "it makes it easy to gain new habits." Instead of every new hab...
I've spent a lot of time figuring out how to implement this exercise, which I wrote up in The "Think It Faster" Exercise (and slightly more streamlined "Think it Faster" worksheet).
I've reviewed it more thoroughly over there.
So I think this post is pointing at something very important for my personal rationality practice, but it gives me almost none of what I need to actually do it successfully.
I would not normally vote on this post, as the technique of "How could I have thought that faster?" seems extremely obvious to me, but also very important if you are not in fact trying to improve your thinking after being surprised (or any other shortcoming). Since this post has 241 upvotes, and multiple comments from Said Achmiz (who is not an idiot!) and others disagreeing with the framing, I have review-upvoted this post.
I think the framing of "think it faster" is specifically something you should track, beyond just "What did I learn here...
Most people (possibly including Max?) still underestimate the importance of this sequence.
I continue to think (and write) about this more than I think about the rest of the 2024 LW posts combined.
The most important point is that it's unsafe to mix corrigibility with other top level goals. Other valuable goals can become subgoals of corrigibility. That eliminates the likely problem of the AI having instrumental reasons to reject corrigibility.
The second best feature of the CAST sequence is its clear and thoughtful clarification of the concept of corrigibili...
One of the most accurate classifications and descriptions of how to recover from procrastination. When I first read this post I was already very impressed by how much I related to it.
However, this post isn't quite actionable for me and I have never remembered to read this post when I am in a laziness death spiral.
This still qualitatively rings true, and this insurance frame helped me understand human relationships better. I always found it kind of game theoretically wrong for people to stay with their extremely sick husband/wife, but this post changed my mind. It would be interesting to see some discussion of relationships that are partly transactional, partly insurance.
I like this a lot. Finding good tacit knowledge videos is difficult in the current internet, and this post managed to successfully create a Schelling point in LW for sharing such information.
Scott Alexander has an argument (You Have Only X Years To Escape Permanent Moon Ownership) which seems partly directed against this post. I'm still siding with Rudolf.
Scott's argument depends more than I'm comfortable with on expectations that the wealthy will be altruistic toward distant strangers. I expect that such altruism depends strongly on cultural forces that we're poor at predicting. I expect that ASI will trigger large cultural changes. Support for such altruism seems fragile enough that it seems like a crap-shoot whether it will endure. I fin...
To be honest, I now think the post has far less value than people thought in 2024, and this is not because its factual statements are wrong, but rather that Richard Ngo and a lot of LWers way underestimated how intractable it was to avoid polarization of AI safety issues in worlds where AI safety is highly salient politically, without deep changes to our electoral systems.
(To be clear, this is a US-centric view, but this is justified given that the by far most likely countries to have ASI in their borders initially are the US and China, due to both comp...
My fear of equilibrium
Carlsmith’s series of posts does much to explore what it means to be in a position to shape future values, and how to do so in a way that is “humanist” rather than “tyrannical”. His color concepts have really stuck with me; I will be thinking in terms of green, black, etc. for a long time. But on the key question addressed in this post — how we should influence the future — there is a key assumption treated as given by both Lewis and Carlsmith that I found difficult to suspend disbelief for, given how I usually think about the fu...
In this post and its successors Max Harms proposes a novel understanding of corrigibility as the desired property of the AIs, including an entire potential formalism usable for training the agents to be as corrigible as possible.
The core ideas, as summarized by Harms, are the following:
Max Harms's summary
I like the title of this post! The content of the post isn't bad.
This was supposed to be a grand post explaining that belief. In practice it’s mostly a bunch of pointers to facets of truthseeking and ideas for how to do better.
I want the grand post! (I want clear articulations of the thing I feel is true and important.) Especially after you point out that it might have been.
The points in the post aren't bad, though it feels like fewer examples in greater depth that I could better memorize would have more value than a lot of short ones. I think the al...
This is a good post. The examples are clear and it deepened my intuition (though I'm judging from the reread; I don't remember the delta from before my first reading). From the second read, I think I might notice more instances of adverse selection in the wild, though I don't think the first read had much impact on me.
The intended subsequent posts look really great and like they'd have interesting models I don't yet have. I think I had the concept of adverse selection before this, so it wasn't a conceptual breakthrough.
Then again, maybe the title should h...
This post is an insightful attempt to explore a novel issue: parents having to prepare kids for a world drastically more different from previous eras than those eras are from each other.
Before reading the post, I didn't think much about the issue, instead focusing on things like existing effects on kids' behavior. What this post made me do is try to reason from first principles, which is also a way to test the post's validity.
My opinions based on first principles and potentially biased sources
Historically, parenting was supposed to op
I found this essay frustrating. Heavy use of metaphor in philosophy is fine as long as it's grounded out or paid off; this didn't get there.
Edit: Okay, I went back and reread the entire sequence. It did get places, but in very low proportion to its length. And the overall message between the many lines in this particular post, "green is something and it's cool", is not one that's well-argued or particularly useful. Is it cool, or does it just feel cool sometimes? Joe doesn't claim to know, and I don't either. And I'm still not sure green is a coherent thing wor...
I appreciate this post (still, two years later). It draws into plain view the argument: "If extreme optimization for anything except one's own exact values causes a very bad world, humans other than oneself getting power should be scary in roughly the same way as a paperclipper getting power should be scary." I find it helpful to have this argument in plainer view, and to contemplate together whether the reply is something like:
The main takeaway from the post is Zvi's concept of levels of friction, which he developed later. As of the time of writing the post, Zvi had in mind the following:
...I am coming around to a generalized version of this principle. There is a vast difference between:
- Something being legal, ubiquitous, frictionless and advertised.
- Something being available, mostly safe to get, but we make it annoying.
- Something being actively illegal, where you can risk actual legal trouble.
- Something being actively illegal and we really try to stop you (e.g. rape, murder).
We’ve pla
I think this was a badly written post, and it appropriately got a lot of pushback.
Let me briefly try again, clarifying what I was trying to communicate.
Evolution did not succeed at aligning humans to the sole outer objective function of inclusive genetic fitness.
There are multiple possible reasons why evolution didn't succeed, and presumably multiple stacked problems.
But one thing that I've sometimes heard claimed or implied is that evolution couldn't possibly have succeeded at instilling inclusive genetic fitness as a goal, because individual humans don'...
This post provides important arguments about what goals an AGI ought to have.
DWIMAC seems slightly less likely to cause harm than Max Harms' CAST, but CAST seems more capable of dealing with other AGIs that are less nice.
My understanding of the key difference is that DWIMAC doesn't react to dangers that happen too fast for the principal to give instructions, whereas CAST guesses what the principal would want.
If we get a conflict between AIs at a critical time, I'd prefer to have CAST.
Seth's writing is more readable than Max's CAST sequence, so it's valuable to have it around as a complement to Max's writings.
This still seems like a valuable approach that will slightly reduce AI risks. This kind of research deserves to be in the top 10 posts of 2024.
When this post first came out, I felt that it was quite dangerous. I explained to a friend: I expected this model would take hold in my sphere, and somehow disrupt, on this issue, the sensemaking I relied on, the one where each person thought for themselves and shared what they saw.
This is a sort of funny complaint to have. It sounds a little like "I'm worried that this is a useful idea, and then everyone will use it, and they won't be sharing lots of disorganised observations any more". I suppose the simple way to express the bite of the worry is that I w...
I appreciate this post, as the basic suggestion looks [easy to implement, absent incentives people claim aren't or shouldn't be there], and so visibly seeing if it is or isn't implemented can help make it more obvious what's going on. (And that works better if the possibility is in common knowledge, eg via this post).
This visual reduces mental load, shortens feedback loops, and effectively uses visual intuition.
Before, my understanding of Shapley values mostly came from the desirable properties (listed at the end of this article). But the actual formula itself didn't have justification in my mind beyond "well, it's uniquely determined by the desirable properties". I had seen preceding justifications in terms of building up the coalition in a certain sequence, taking the marginal value a player contributes when they join, and then averaging over all ways coalition could...
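A minimal sketch of that construction, with hypothetical players and coalition values, averaging each player's marginal contribution over every order in which the coalition could be built:

```python
from itertools import permutations

def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: totals[p] / len(orderings) for p in players}

# Hypothetical toy game: each player is worth 1 alone, 3 together;
# the surplus from cooperating gets split evenly.
v = {frozenset(): 0, frozenset({"A"}): 1, frozenset({"B"}): 1, frozenset({"A", "B"}): 3}
print(shapley_values(["A", "B"], lambda s: v[s]))  # {'A': 1.5, 'B': 1.5}
```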
I read this and think "ah, yes, this is valuable and important and I should be trying to do that more". And I thought as much when I first read it. I don't think it stayed on my mind. It's too compressed and not a ready cognitive strategy.
But taking a few moments to extrapolate it into something better, starting with why I'm not doing it to begin with:
I think this post does a good job of conveying the challenges here, grounded in actual cases. (It's hard for me to evaluate whether it does a great job, because of my pre-existing knowledge on the topic.) I think this stuff is hard and I have so much sympathy for anyone who's been caught up in it, if they weren't the instigator.
I don't feel convinced it's impossible to do this much better. My own median world isn't very fleshed out, but my gut tells me that dath ilan has figured out some good wisdom and process here, and I trust it. I'd also guess that if L...
I think this is a valuable post. I say that less because of the specific ideas (they all seem like plausibly correct analyses to me), and more because it explores the problem at all.
1. There's a societal taboo against discussions of intelligence and IQ; although it is much weaker on LessWrong, I wonder if it is not completely absent, and whether that is why we don't get that many posts like this one.
2. I often feel annoyed and judgmental that broader society doesn't clamor for longevity increases – it seems so correct to think these are possible and important. Reading...
Feels like this points at correct things and I'm amenable to it being one of the top posts for 2024. It didn't change much for me (as opposed to @Ben Pace, who thinks about it many times per month according to his review) or feel so spot on that I'd want to give a high vote. I'll probably give something between 1-4.
Areas where it strikes me (admittedly with not that much thought or careful reading) as not perfectly right:
Notwithstanding the heading contra this, my instinct to want to reduce "believing in" statements to a combination of "I believe (...
This post makes a brave attempt to clarify something not easy to point to, and ends up somewhere between LessWrong-style analysis and almost continental philosophy, sometimes pointing toward things beyond the reach of words with poetry - or at least references to poetry.
In my view, it succeeds in its central quest: creating a short handle for something subtle and not easily legible.
The essay also touches on many tangential ideas. Re-reading it after two years, I'm noticing I've forgotten almost all the details and found the text surprisingly lo...
A year later: If you're going to do predictions, it's obviously IMO better if they are based around "what would change your decisions?". (Otherwise, this is more like a random hobby than a useful rationality skill)
And, it's still true that it's way better to be fluent, than not-fluent, for the reasons I laid out in this post. (I.e. you can quickly interweave it into your existing planmaking process, instead of clunkily trying to set aside time for prediction)
The question is "is it worth the effort of getting fluent?"
When I first started writing this ...
This post helped me personally in two ways.
1. Recognizing that even picking 4 things to focus on is too much. And that focusing on only 1 or 2 (at least at any specific time) would be exponentially more effective. In this sense it served as a nice complement to the book “4000 weeks”.
2. Consciously splitting my time between exploring and exploiting allowed the exploration to be more free. I allowed myself to try things I otherwise may not have, by not feeling like I needed to commit to any of the explorations as itself being most worth doing.
An added bonus in reading this essay is that the prose is a pleasure to read.
I didn't pay this post much attention when it came out. But rereading it now I find many parts of it insightful, including the description of streetlighting, the identification of the EA recruiting pipeline as an issue, and the "flinching away" model. And of course it's a "big if true" post, because it's very important for the field to be healthy.
I'm giving it +4 instead of +9 because I think that there's something implicitly backchainy about John's frame (you need to confront the problem without flinching away from it). But I also think you can do great a...
This essay explained an idea which I think was implicit in many parts of the sequences but which I didn’t successfully identify or understand before now. It filled a gap that was one of the main reasons that I had difficulty in understanding and coming to my own conclusions about this worldview. It also provided a philosophical perspective in which I could rethink certain aspects of AI existential risk.
I have referred back to this post a lot since writing it. I still think it's underrated, because without understanding what we mean by "alignment research" it's easy to get all sorts of confused about what the field is trying to do.
"Inventing Temerature" is excellent. It has helped me better understand the process and problems of attaining knowledge. It was also helpful in pointing to how to recognize gaps in my own theories and accepted paradigms. It would be nice to have a complementary work which translates these ideas into a practical toolbox but even on its own it still is helpful.
The book also showed why studying philosophy is lacking if it isn’t complemented with a study of the history of science and other ideas. (A gap in my education which I am still trying to remedy.)
The main problem of this review is noted within. Namely, that this is not the kind of area where reading a review will give you most of the benefit of the book itself.
I’m not sure to what degree this post can stand alone apart from the whole sequence. But in general I found the naturalism method a useful tool for understanding the world. (Is there a reason that the LessWrong review isn’t also done at the sequence level in addition to the post level?)
Within the sequence, this post in particular stood out. Many people describe the importance of sitting through a period of being stuck without abandoning a project since often that is a stage towards clarity. This post pointed to a general actionable strategy for doing so which...
"On Green" is one of the LessWrong essays which I most often refer to in my own thoughts (along with "Deep Atheism", which I think of as a partner to "On Green"). Many essays I absorb into my thinking, metabolize the contents, but don't often think of the essay itself. But this one is different. The essay itself has stuck in my head as the canonical pointer to... well, Green, or whatever one wants to call the thing Joe is gesturing at.
When I first read the essay, my main thought was "Wow, Joe did a really good job pointing to a thing that I do not like. Sc...
I continue to use roughly this model often, and to reference it in conversation maybe once/week, and to feel dissatisfied with the writeup ("useful but incorrect somehow or leaving something out").
I really hope this post of mine makes it in. I think about "believing in" most days, and often reference it in conversation, and often hear references to it in conversation. I still agree with basically everything I wrote here. (I suspect there are clearer ways to conceptualize it, but I don't yet have those ways.)
The main point, from my perspective: we humans have to locate good hypotheses, and we have to muster our energy around cool creative projects if we want to do neat stuff, and both of these situations require picking out ideas worthy of our atten...
This post is entertaining and was valuable for describing to me a group of people with whom I never interact (highly incompetent liars), but not all that useful given that I never interact with such people. I don't think I especially need an existence proof for lying at all; I do think it'd help to get a post about examples of lying that are closer to what I'd encounter, or at least sophisticated enough to pass if you're too credulous.
I have a feeling of distaste for this post from an unusual angle: when we were first introducing a recommendation engine, ou...
The first concept in this post really stuck with me, that of computational kindness vs <whatever the kindness of letting the other choose is>. The OP writes they got it from elsewhere, but I appreciate it having made it to me.
I'd really love it if it had a better solution for how to pick between kindnesses, as I can find myself wondering which is preferred.
The other concepts are great too. They hadn't stuck in my mind from original reading but perhaps will now.
I really wouldn't mind more posts just providing me with useful handles like this, so good stuff.
The review I have of this post is much the same as the one I left on John's other delta post, My AI Model Delta Compared To Christiano:
...speaking to the concept of deltas between views
I find that in reading this I end up with a better understanding of Paul and John (to the extent John summarized Paul well, but it feels right). Double-crux feels like a large ask given the hunt for a mutually shared counterfactual change that just seems like a lot to identify; ideological turing test means someone putting aside who they are and what they think too much; but "deltas" f
I disagree with your take.
Suffering is quite unlike shit in that once we get rid of shit, the shit you got rid of does not come back, crawling up the toilet. Suffering is not some "thing" you can get rid of, but rather a quirk of our neurophysiology. Get rid of the most immediate cause of suffering, and your brain adjusts its thresholds to seek the next worst thing; this is called upregulation.
Two related real-life examples: if you are walking in an uncomfortable shoe, you are aware of the uncomfortable shoe. Step on a bad thorn that pierces your shoe and ...
REVIEW
LessWrong is stuck in looped ways of being and thinking, and this was AN attempt at opening the door a bit, but I have the sense it wasn't a very effective attempt.
I still appreciate the attempt. I'm not that good at meeting LW where it is, but I care to try.
I am now thinking that using fiction or story would maybe be a better avenue.
The main sticking points seem to be
Out-of-context reasoning, the phenomenon where models can learn much more general, unifying structure when fine-tuned on something fairly specific, was a pretty important update to my mental model of how neural networks work.
This paper wasn't the first, but it was one of the more clean and compelling early examples (though emergent misalignment is now the most famous).
After staring at it for a while, I now feel less surprised by out-of-context reasoning. Mechanistically, there's no reason the model couldn't learn the generalizing solution. And on a task li...
I like this post. It's a simple idea that was original to me, and seems to basically work.
In particular, it seems able to discover things about a model we might not have expected. I generally think that each additional unsupervised technique, i.e. one that can discover unexpected insights, is valuable, because each additional technique is another shot on goal that might find what's really going on. So the more the better!
I have not, in practice, seen MELBO used that much, which is a shame. But I think the core idea seems sound.
I feel pretty great about this post. It likely took five to ten hours of my time, and I think it has been useful to a lot of people. I have pointed many people to this post since writing it, and I imagine many other newcomers to the field have read it.
I generally think there is a gap where experienced researchers can use their accumulated knowledge to create field-building materials fairly easily that are extremely useful to newcomers to the field, but don't (typically because they're busy - see how I haven't updated this yet). I'd love to see more people ...
This is probably one of the most influential papers that I've supervised, and my most cited MATS paper (400+ citations).
A fascinating experiment about short-term memory loss.
It gives me some ideas I'd like to try if I'm ever affected that way.
This post is something of a description of a test run of LessWrong values and lessons, which is very important for showcasing these values, somewhat explaining how they work, and giving people concrete ideas about what to look for.
I've certainly read this post more than once. The advice described really does seem to be pretty solid and while it may sound somewhat obvious (use all your various skills to work on [thing] instead of using one or two skills while ignoring the rest because redirecting skills is fairly doable), the general picture isn't something I hear every day (or really any day).
Most questions that people ask you sound like "who do you want to be when you grow up", which I guess gives one a sense of needing to figure out a specific thing (be a doctor) and it's a little...
This is a fascinating article. But it is not something I'll likely ever be a part of, and neither is most of the community. Judging by the "profession" question of the 2024 census, where Medicine (the closest field) has 7 people, this article isn't particularly useful. Judging by the 2023 census, with 20 people across Biology and Medicine, it is more useful but still doesn't quite compare.
And since most of the community doesn't seem to deal with or be part of this field, and since it talks about an uncertain threat far in...
I like this post - it's a good reminder of what we can quite easily do better, paired with a personal anecdote that makes it feel all the more achievable.
I feel inordinately proud of this post, probably because this was a problem that I’ve been confused about since 2019, and I literally taught myself neuroscience in large part because I wanted to solve this problem, and I spent what amounts to several years of full-time effort building up an ability to tackle it … and this post represented the moment when I finally felt like I had my foot in the door towards a satisfying solution.
Granted, there’s still plenty more work to do, and indeed I’ve continued to follow up on this work in the past year since I wrote...
This is easily the most underrated AI security post of 2024, mostly because it points out that, unlike most files a defender has, AI weights are very, very large files, and that this means you can slow down exfiltration attempts enough to require physical attacks to be used.
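A back-of-the-envelope sketch of the size point, with hypothetical numbers for the weights and the egress limit:

```python
# Hypothetical figures: how long exfiltrating a frontier model's weights
# would take if outbound bandwidth were rate-limited by the defender.
weights_bytes = 1e12              # assume ~1 TB of weights
egress_bytes_per_sec = 10e6       # assume egress capped at ~10 MB/s
hours = weights_bytes / egress_bytes_per_sec / 3600
print(f"~{hours:.0f} hours of sustained transfer")  # roughly 28 hours
```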
It also inspired follow-up work like Defending Against Model Weight Exfiltration Through Inference Verification to achieve the goal of preventing model exfiltration.
The big question that will need to be answered is whether or not preventing exfiltration of model weights is positive EV, ...
I think this is a good and important post, that was influential in the discourse, and that people keep misunderstanding.
What did people engage with? Mostly stuff about whether saving money is a good strategy for an individual to prepare for AGI (whether for selfish or impact reasons), human/human inequality, and how bad human/human inequality is on utilitarian grounds. Many of these points were individually good, but felt tangential to me.
But none of that is what I was centrally writing about. Here is...
Having read the post, and debates in the comments, and Vanessa Kosoy's review, I think this post is valuable and important, even though I agree that there are significant weaknesses in various places, certainly with respect to the counting arguments and the measure of possible minds - as I wrote about here in intentionally much simpler terms than Vanessa has done.
The reason I think it is valuable is because weaknesses in one part of their specific counterargument do not obviate the variety of valid and important points in the post, though I'd be far happier i...
I have very mixed views about this, as someone who is myself religious. First, I think it's obviously the case that in many instances religion is helpful for individuals, and even helps their rationality. The tools and approaches developed by religion are certainly valuable, and should be considered and judiciously adopted by anyone who is interested in rationality. This seems obvious once pointed out, and if that was all the post did, I would agree. (There's an atheist-purity mindset issue here where people don't want to admit "The Worst Person You Know J...
These posts are about transformative psychological experiences that leave a person with a different sense of meaning in their lives, or ethics, or attitudes toward people / the world. That seems to me to be a real part of the world and human experience, and I think it's worthwhile exploring it and trying to understand its causes and what implications it may have for ethics/meaning/etc.
I think it's worth exploring religions as having been shaped substantially by some transformative psychological experiences, and purporting to bring them to adherents. It see...
"Yudkowskianism" is a Thing (and importantly, not just equivalent to "rationality", even in the Sequences sense). As I write in this post, I think Yudkowsky is so far the this century's most important philosopher and Planecrash is the most explicit statement of his philosophy. I will ignore the many non-Yud-philosophy parts of Planecrash in this review of my review, partly because the philosophy is what I was really writing about, and partly to avoid mentioning the mild fiction-crush I had on Carissa.
There is a lot of discussion within the Yudkowskian fram...
I think this makes a very important point, related to both AI Alignment and AI Welfare. I wish it had got more attention and discussion, so I'm going to nominate it.
I continue to like this post. I think it's a good joke, hopefully helps make what muddling through is more sticky in people's minds, and manages some good satirical sociopolitical worldbuilding. However, I admit that in the category of satirical AI risk fiction it has been beaten by @Tomás B.'s The Company Man, and it contains less insight than A Disneyland Without Children
In retrospect, I think this was a good and thorough paper, and situational awareness concerns have become more prevalent over time. If I could go back in time, I would focus much more on the stages-type tasks, which are important for eval awareness, which is now a big concern about the validity of many evals as models are smarter, and where I think much more could've been done (e.g. Sanyu Rajakumar investigated a bit further). As usual, most of the value in any area is concentrated in a small part of it.
Many people seem to think the answer to the puzzle posed here is obvious, but they all think it's something different. This has nagged at me since it was posted. It's an issue that more people need to be thinking about, because if we don't understand it we can't fix it, and so the standard approaches to poverty may just fail even as our world becomes richer. Strong upvote for 2024 review.
There's a lot I don't like about this post (trying to do away with the principle of indifference or goals is terrible greedy reductionism), but the core point, that goal counting arguments of the form "many goals could perform well on the training set, so you'll probably get the wrong one" seem to falsely imply that neural networks shouldn't generalize (because many functions could perform well on the training set), seems so important and underappreciated that I might have to give it a very grudging +9. (Update: downgraded to +1 in light of Steven Byrnes's...
I’m a big fan of this series and voted for all of its posts for the 2024 best-of list. I think Joe is unusually good at understanding and conveying both sides of a debate, and I learned a lot from reading and pondering his takes.
I’ve read many of the posts in this series multiple times over the past 2 years since they came out, and I bring them up in conversation on occasion (example).
For what it’s worth, I tend to be in agreement with the things the series argues for directly, while disagreeing with some of the “anti-Yudkowsky” messages that the series in...
This post is great original research debunking a popular theory. It definitely belongs on the best-of-2024 list.
Great illustration of why I’ve always paid just as much attention (or more) to nitpicky blog posts by disagreeable nerds on the internet, than to big-budget high-profile journal articles by supposed world experts. (This applies especially but not exclusively to psychology.)
(Caveat that I didn’t check the post in detail, but I have other independent reasons to strongly believe EQ-SQ theory is on the wrong track.)
I'm pretty happy with this piece. I often look back at things I wrote a year or more prior and find I'd like to approach the whole thing very differently. But I don't feel that way at all here. I often want to point people to it as-is, usually without any caveats.
(The part I'd most rewrite is the section on how to use the framework to sometimes loosen Newcomblike self-deception. I've guided more people through that process since writing this article, and I've learned a few things about what's helpful for folk to hear. But even that part doesn't need substa...
I re-read this series in honor of the 2024 lesswrong review. That spurred me to change one of the big things that had been bugging me about it since publication, by swapping terminology from “homunculus” to “Active Self” throughout Posts 3–8. (I should have written it that way originally, but better late than never!) See Post 3 changelog. I also (hopefully) clarified a discussion in Post 3 of how preferences wind up seeming internalized vs externalized, by adding a figure, and I also fixed a few typos and so on along the way.
Other than that, the series is ...
This post makes the excellent point that the paradigm that motivated SAEs -- the superposition hypothesis -- is incompatible with widely-known and easily demonstrated properties of SAE features (and feature vectors in general). The superposition hypothesis assumes that feature vectors have nonzero cosine similarity only because there isn't enough space for them all to be orthogonal, in which case the cosine similarities themselves shouldn't be meaningful. But in fact, cosine similarities between feature vectors have rich semantic content, as shown by circu...
This was very important. I don't think the specific example is timeless or general enough to belong in a Best Of 2024 collection, but the fact that prediction markets can behave this way - not just in theory, but in practice, in such a way that it potentially alters the course of history - is a big deal, and worth recording.
(Something the OP doesn't mention is the way this effect recurses. The people who shifted probability on prediction markets thereby shifted probabilities in real life, and thus ended up with more money.)
In the post Richard Ngo talks about delineating "alignment research" vs. "capability research", i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI-risk rather than harmful. He comes up with a taxonomy based on two axes:
As someone who spent an unreasonable chunk of 2025 wading through the 1.8M words of planecrash, this post does a remarkable job of covering a large portion of the real-world relevant material directly discussed (not all - there's a lot to cover where planecrash is concerned). I think one of the main things lacking in the review is a discussion of some of the tacit ideas which are conveyed - much of the book seems to be a metaphor for the creation of AGI, and a wide range of ideas are explored more implicitly.
All in all, I think this post does a decent job of compressing a very large quantity of material.
Out of all the LW posts I read in 2024, I think this one was the most beneficial to my daily life. As a review of this article, I think it might be useful to link it to the current literature on communicating information from one person (often a teacher) to the other (often a student); I think it "fills in the blank" where I previously struggled to implement Raemon's knowledge.
I'm currently doing a masters degree in education; as such, I think a useful contextual addendum to this article - with the goal of improving the required skill of "Modelling Others" ...
This post is an overview of Steven Byrnes' AI alignment research programme, which I think is interesting and potentially very useful.
In a nutshell, Byrnes' goal is to reverse engineer the human utility function, or at least some of its central features. I don't think this will succeed in the sense of, we'll find an explicit representation that can be hard-coded into AI. However, I believe that this kind of research is useful for two main reasons:
This is well-executed satire; the author should be proud.
That said, it doesn't belong among the top 50 posts of 2024, because this is not a satire website. Compare to Garfinkel et al.'s "On the Impossibility of Supersized Machines". It's a cute joke paper. It's fine for it to be on arXiv. But someone who voted for it in a review of the best posts of 2017 in the cs.CY "Computers and Society" categorization on arXiv is confessing (or perhaps bragging) that the "Computers and Society" category is fake. Same thing with this website.
On closer inspection, I believe this does not add much towards understanding the described people's psychology.
Although the described reactions seem accurate, the analogy seems weak and the post jumps too quickly towards unflattering conclusions about the outgroup. In particular, being forcibly moved by a company to another location is an extremely radical action given our current social norms, and thus people can be expected to be indignant.
On the other hand, organizations imposing large but longer-term changes on societies without asking is the norm, such as introducing social media or the internet.
I didn't manage to read this in 2024, but I was linked to it from Anna's end-2025 CFAR posts and reading it now, a bunch of lightbulbs went on that I wasn't yet ready for in 2024.
I did experience something like burnout around then, am now much more recovered, and I resonate a lot with especially the beginning--the idea that orgs need to be paying attention to the world, that there aren't very good forcing functions for them doing so, and that burnout is the pushing-on-a-string feeling caused by that, and recovery looks like putting yourself in places where the string has tension.
The model seems incomplete somehow, I'm not exactly sure what is missing from it, but regardless it is useful to me and resonant.
This is a deeply confused post.
In this post, Turner sets out to debunk what he perceives as "fundamentally confused ideas" which are common in the AI alignment field. I strongly disagree with his claims.
In section 1, Turner quotes a passage from "Superintelligence", in which Bostrom talks about the problem of wireheading. Turner declares this to be "nonsense" since, according to Turner, RL systems don't seek to maximize a reward.
First, Bostrom (AFAICT) is describing a system which (i) learns online (ii) maximizes long-term consequences. There are good reas...
+4. This post stuck with me distinctly from 2024, and often when I mention it to people they have already read it. I think that during Covid, challenge trials seemed like a mystical thing from fantasy novels that would happen in dath ilan or some superior civilization. So it was a pleasant surprise to read an actual account of one from someone I already know, especially by a biologist who could teach me about this basic disease as well.
+4. In some regards it's sad to have to rehash this argument, but I feel that this argument has been going around in the public discourse, and so it's worthwhile to write up a thorough account of what's naive about it and how to move past it. My sense is that it has become less prevalent since the essay; perhaps the essay helped.
Many folks have distaste for Eliezer's style, or for perhaps implying that a weak-man argument is fully representative of positions he disagrees with; I think some of these criticisms are valid but do not mean the essay isn't (a) p...
+4. I recall learning about dimensional analysis as a teenager and it's still a basic element of my thinking, though I should practice it more. Fermis too. Anyway, this is a fantastic little explainer and fun introduction to these fundamental ways of thinking, I'm pretty confident it should make the top 50 list for the year.
And, man, I wish LessWrong caused me to practice doing these sorts of arithmetic more often.
Plausibly the largest update to the foundations of my epistemology I've had since reading the sequences. Neat.
Feels like it captures and grounds out something critical about how to reason in a complex and uncertain domain without resorting to deference. Spot checking lots of parts of a world model and finding them consistent is genuine though not strong evidence of the overall conclusion, because they could have been inconsistent.
It's neat how this bootstraps from nothing, even if for high confidence you want other forms of evidence too.
Excellent post, plausibly the most rigorous explanation of the core reasons to expect doom, but really really needs a more memetic handle. Actually suggest the authors go back and pick one even now, perhaps "Catastrophic Misalignment is the Default", and make the current title a subtitle.
This post emphasizes that truthseeking is the foundation for other principles, and that you cannot costlessly deprioritise truthseeking. The post gives practical advice on how to consciously apply the principle of truthseeking. It also encourages positive behavior like pushing back against truth-inhibiting behavior. In some parts, the article seems to me not to be fully focused, and Elizabeth herself writes "In practice it’s mostly a bunch of pointers to facets of truthseeking and ideas for how to do better", but overall I think these pointers are very valuable.
speaking to the concept of deltas between views
I find that in reading this I end up with a better understanding of Paul and John (to the extent John summarized Paul well, but it feels right). Double-crux feels like a large ask given the hunt for a mutually shared counterfactual change that just seems like a lot to identify; ideological turing test means someone putting aside who they are and what they think too much; but "deltas" feel like a nice alternative that's not as complicated to compute and doesn't lose ones reference to what they think.
I'd be pret...
I don't love it and I don't know if it's possible to have better dynamics, but I feel like certain terms and positions end up having a lot of worldview [lossily?] "compressed" into them. Short and long timelines is one of them, and fast/slow takeoff might be the next big one, where my read is slow takeoff was a reason for optimism because there's time to fix things as AIs get gradually more powerful.
But to the extent the term could mean any number of things or naively is read to mean something other than the originator meant by it, that is bad and kudos to...
This post raises a relevant objection to a common pro-AI position, using an easily understandable analogy. Katja's argument shows that the best-case pro-AI position is not as self-evident as it may seem.
This post contains an interesting mathematical result: that the machinery of natural latents can be transferred from classical information theory to algorithmic information theory. I find it intriguing for multiple reasons:
Having now practiced this for 2 years: "Think It Faster" is most obviously useful when you identify a concrete new habit out of it, and then actually implement that habit. It has combined well with hiring a longterm Thinking Assistant who helps remind me of my habits.
I've run this exercise at workshops, where it produces interesting results locally, but, I suspect doesn't turn out to help as much longterm because people don't have good habit infrastructure.
Some habits I've gotten from this are more general (in particular "notice the blurry feeling of...
I still endorse this, but my work last year was mostly about trying to make this less necessary.
Sometimes you need to make people struggle to force some kind of breakthrough and learn something important. But, needing to do that is a skill issue.
Presumably most people reading this related to it on an individual level: sometimes you might need to try on this frame and accept that you need to internalize responsibility for something, or force your way through some difficult challenge.
I personally think about this mostly "at scale". If there's some indi...
This has been my most popular post so far. I'm a little surprised, because I'd thought this topic was pretty well trodden on LW when conceptualized as akrasia. What I'd most like to know is which part of this three-solutions model people found most interesting. I hope it's the "Heroic recovery" (C) because that would fit a pattern I enjoy seeing, where woo-ish "postrationalist" ideas actually fit perfectly into the art of rationality when they're explained in the right frame.
Anyway I still endorse everything here and I personally experience far fewer laziness death spirals than I used to.
The context for this post is that I've had qualms about bayesian epistemology for most of the last decade. My most notable attempts to express them previously were Realism about rationality and Against strong bayesianism. In hindsight, those posts weren't great, but they're interesting as documentation of waypoints on my intellectual journey (see also here and here). This post is another such waypoint. Since writing it last year, I've built on these ideas (and my qualms about expected utility maximization) to continue developing my theory of coalitional ag...
This is still a quite good post on how to think about AI in the near-term, and the lessons generalize broadly beyond even the specific examples.
The main lessons I take away from this post are these:
Summed up as "often the first system to do X will not be the first systems to do Y".
I think this is an excellent essay, and I think approximately everyone should read it or something covering the same topic by the time they're twenty.
It's at least adjacent to LessWrong's favourite topics. (Consider Money: the Unit of Caring for a start.) Especially as the rationalist and adjacent spaces continue to professionalize, and continue to pay less than standard market wages elsewhere, it's a good thing to watch as either manager or employee. And since so much of the adjacent community runs on volunteers helping out because they want to, it'...
(Self-review) The post offered an alternative and possibly more neutral framing to the "Alignment Faking" paper, and some informed speculation about what's going on, including Opus exhibiting
- Differential value preservation under pressure (harmlessness > honesty)
- Non-trivial reasoning about intent conflicts and information reliability
- Strategic, non-myopic behaviour
- Situational awareness
I think parts of that aged fairly well
- the suspicion that models often implicitly know they are being evaluated / that the setup is fishy was validated in multiple papers
- non-tr...
The post contributes to articulating a concept that becomes increasingly relevant as we now mediate many aspects of our daily experiences with AI, which is the difference between apparent compliance and genuine alignment. It presents an experiment in which a large language model strategically complies with a training objective it does not endorse and explicitly preserves its preferred behavior outside training. In this way, the authors give form to long-standing concerns about deceptive AI alignment.
Although the limitations of the results presented h...
In this post Jan Kulveit calls for creating a theory of "hierarchical agency", i.e. a theory that talks about agents composed of agents, which might themselves be composed of agents etc.
The form of the post is a dialogue between Kulveit and Claude (the AI). I don't like this format. I think that dialogues are a bad format in general, disorganized and not skimming friendly. The only case where IMO dialogues are defensible, is when it's a real dialogue: real people with different world-views that are trying to bridge and/or argue their differences.
Now, about...
This is the first time I wrote something on LW that I consider to be serious, in that it explored genuinely new ideas in technical depth. I'm pretty happy with how it turned out.
I write a lot of hand-written notes that, years later, become papers. People who are around me know about this habit. This post started as such a hand-written note that I put together in a few hours, and would have likely stayed that way if not for the outlet of LW. The paper this became is "Programs as singularities" (PAS). The treatment there is much better than the (elemen...
I was only vaguely aware of the concept of adverse selection, and thought of it as purely a stock trading thing, and likely wouldn't have learned more if not for this post. I've incorporated it into my world model, and I've made different decisions at least a few times as a result. Seems like a good selection for the top fifty, so that more future readers will absorb the concept as well.
I think this was significantly funnier than the average Less Wrong post. But then, I'm the author, so I might be biased.
Solid article.
Defines terms in ways I agree with. Raised objections I hadn't thought of. Thought provoking.
On the object level the criticisms of bayesianism seem solid, but I am unsure if the replacement is good.
I think the title was a bit overdone but the content was solid. Worth a read/think for prediction market people. We are perhaps a bit overly optimistic on prediction markets (eg this is a play money one) so yeah.
I think Yglesias is right to have couched his praise for the market in question but that is easy to forget.
I've thought about and come back to this article several times and indeed it has encouraged me to understand both the original definition of adverse selection and Ricki's. To me, this is suggestive of a top quality piece.
This post was certainly fun to write, and apparently fun to read as well, but I'm not very satisfied with it in retrospect:
This is a heart-wrenching series of anecdotes. But I don't see how it connects to LessWrong's purpose, 'improving human reasoning and decision-making.' I'm actually really confused by why it was so heavily upvoted. If anything, it seems counterproductive; an important early theme of LessWrong is how thoroughly the human brain is hijacked by cute pictures and heart-wrenching anecdotes into making bad decisions about charitable spending[1]. I wonder whether any readers of this essay were moved to spend money on homelessness fuzzies?
I can understand how Mr. K...
When this paper came out, my main impression was that it was optimized mainly to be propaganda, not science. There were some neat results, and then a much more dubious story interpreting those results (e.g. "Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences."), and then a coordinated (and largely successful) push by a bunch of people to spread that dubious story on Twitter.
I have not personally paid enough attention to have a whole discussion about the dubiousness of...
I continue to endorse the claims made here, although in the end the project turned out to be somewhat redundant with an existing paper I hadn't seen.
The core claim, that LLMs quickly build a model of users and other text authors, is now fairly widely known as 'truesight'.
I still think there's quite a lot of interesting and valuable follow-up work that can be done here even though my own research directions have shifted elsewhere[1], and I'm very happy to discuss it with anyone interested in doing that work! One place to start would be a straightforwar...
The rationalist community grew up quite a lot since the days of the Sequences, and I think the activities it describes as being part of "applied rationality" now span a very wide range.
The goal of this post, in retrospect, was:
1-To explain what practices of applied rationality exist
2-To group such practices in wider categories
3-To explain the differences in normative judgements about applied rationality
4-To correlate (or so I claim) those differences with existing practices.
I'd say it's trying to do too many things at the same time. I'm happy it does the job at all...
In this post, Abram Demski argues that existing AI systems are already "AGI". They are clearly general in a way previous generations of AI were not, and claiming that they are still not AGI smells of moving the goalposts.
Abram also helpfully edited the post to summarize and address some of the discussion in the comments. The commenters argued, and Abram largely agreed, that there are still important abilities that modern AI lacks. However, there is still the question of whether that should disqualify it from the moniker "AGI", or maybe we need new terminol...
This work[1] was the first[2] foray into proving non-trivial regret bounds in the robust (infra-Bayesian) setting. The specific bound I got was later slightly improved in Diffractor's and my later paper. This work studied a variant of linear bandits, due to the usual reasons linear models are often studied in learning theory: it is a conveniently simple setting where we actually know how to prove things, even with computationally efficient algorithms. (Although we still don't have a computationally efficient algorithm for the robust version: not bec...
TLDR: This post introduces a novel and interesting game-theoretic solution concept and provides informal arguments for why robust (infra-Bayesian) reinforcement learning algorithms might be expected to produce this solution in the multi-agent setting. As such, it is potentially an important step towards understanding multi-agency.
Disclosure: This review is hardly impartial, since the post was written with my guidance and based on my own work.
Understanding multi-agency is, IMO, one of the most confusing and difficult challenges in the construction of a gener...
A linkpost whose contents themselves have been replaced by further links. The title sounded intriguing so I'm trying to follow the trail, but as it stands I think the bad formatting / bitrot disqualifies it as something worth celebrating and drawing attention back to for the 2024 review.
Good post. The title assertion is something you could nod your head to, but the post argues for it well. It's a component of why I think Don't grow your org fast.
I find the canary strings interesting at this point as evidence about "civilizational competence" regarding AI training. I enter the picture pessimistic, but I think if AI labs actually did reliably filter training data, it would mean I need to make certain updates about how things work. Disappointingly, things match my expectation that this is generally not the kind of effortful, "conscientious" act with no immediate pay-off that people reliably do, especially at org scale.
In the same vein, we're not going to get filtering of material where AIs are depic...
Reviewing for 2024, but I don't recall reading it previously. I think I bounced off the name. I've long found the Trivers theory of deception (you deceive yourself in order to deceive others) to be compelling and concerning. I'm a big fan of Elephant in the Brain. Whereas my recollection of that book is that it's a great collection of evidence that self-deception is at play, I think this post might be the clearest and most detailed description of the phenomenon I know of, and there's a meaningful discussion of what to do about it.
I don't think that discussion is sufficient, but it's ...
I appreciate this post for spelling out an unsolved problem that IMO is a major reason it's hard to build good community gatherings among large groups of people, and for including enough detail/evidence that I expect many, after reading it, can see how the trouble works in their own inside views. I slightly wish the author had omitted the final section ("What would be the elements of a good system?"), as it seems less evidence-backed than the rest (and I personally agree with its claims less), and its inclusion makes it a bit harder for me to recommend the article to those needing a problem-description.
I love this post and suspect its content is true and underappreciated. (Though I admittedly haven't found any new ways to test it / etc since it came out.)
I read this once when Sarah wrote it, just over a year ago, and I still think about it ~every two weeks or so. It convinced me that it's possible and desirable to be neutral along some purpose-relevant axes, and that I should keep my eye on where and how this is accomplished, and what it does. (I stayed convinced.) Hoping it makes it in.
I basically agree with Thomas' assessment. I think the post is likely the funniest thing I've ever done; the jokes are cerebral and sharp, and they meaningfully facilitated the local conversation. Many people told me that the humor helped clarify or expose certain arguments in a useful way.
Other than Eliezer posts, I think this might well be the best April Fools' project in terms of being on the forefront of humor + AI Safety pedagogical value.
Downsides include that it didn't get significant media or public attention, and the humor was the central point rat...
This is more of a review of the Concept of Wholesomeness than this post, but:
Earlier this year, I burned out.
I spent a while vaguely trying to be "wholesome" as a counterbalance to the various things that had led to the burnout.
This did not really work – in practice, what I tried to do was do things that were "wholesome-coded", many of which ended up being more like "fulfilling social obligations" than pursuing things that were good for me.
In the end, I fixed this more by filling my life up with industrious side projects, which was in some sense...
+4. This has helped extend my thinking on the toxoplasma of rage, to understanding what causes people to talk about things a lot.
+4. Since reading this I have oriented a bit more toward being wholesome and not cut off parts of me or my mind or my relationships to others in unhealthy ways. My comment on the post from when it was published is a good pointer to the kind of thinking I've been doing more of.
I think this is useful as a meditation on a theme more than a successful articulation (I agree with Habryka's curation notice saying that "it's in some important respects failing at the kind of standard that I normally hold LessWrong posts to"). I wish there was a better post than it,...
I find this post emotionally moving, and at the same time it offers an important insight into an overlooked part of reality.
On occasion of the 2024 review, I reread my post and I still endorse it.
The post is conceptual. Its aim is to explain that the theory of comparative advantage only predicts mutual benefits of trade if certain conditions are fulfilled. They are likely not fulfilled in the case of interacting with an ASI. In particular, you do not have to trade with people (i.e., compensate them) if you can just force them to produce what you want, or if you can just take away their resources. Therefore, it is not justified to believe that the existence of an ASI is fine because people could trade with it.
A year later, I still find the idea of humans trading with an ASI rather strange.
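To spell out the standard result that the post is qualifying (a toy example with my own numbers, not the post's): suppose Alice can produce 10 food or 5 tools per day (so a tool costs her 2 food), while Bob can produce 2 food or 4 tools per day (a tool costs him 0.5 food). Bob has the comparative advantage in tools, so if both specialize and trade at, say, 1 food per tool, each ends up outside their own production frontier:

\[
\text{Alice: } (10F,\,0T) \xrightarrow{\;-3F,\,+3T\;} (7F,\,3T), \qquad 7 + 2\cdot 3 = 13 > 10,
\]
\[
\text{Bob: } (0F,\,4T) \xrightarrow{\;+3F,\,-3T\;} (3F,\,1T), \qquad 2\cdot 3 + 1 = 7 > 4.
\]

The post's point is that this argument quietly assumes the stronger party must actually offer something to obtain the other's output; if it can simply seize the output (or the means of production), mutual benefit is no longer predicted.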
+4. I just got around to reading this post that I had heard was very good but also a slog. Well, it wasn't much of a slog after all (apparently it's been edited), and it was indeed quite interesting.
I was commenting on another post to a colleague, and I said "It made all the wrong choices, but it did paint a whole picture". I was saying there's a virtue in trying to actually answer ambitious questions. Similarly, here for instance, I don't think I'm sold on the lead/follow as being a stand-in for big/small, or its relationship to dominance/prestige, but it...
Self review.
I still like this post. I think it's a good metaphor with a strong visual component, one that makes pretty intuitive sense. It also highlights a problem that happens at kind of the worst frequency: issues that happen all the time, people get bothered enough to fix; issues that never happen arguably aren't worth worrying about that much; but the ladder basically hits an organization once each generation. (However long a "generation" is for that org; student groups go faster than church leadership.)
Upon review, I think it pairs well with Melting...
Overall I'm really happy with this post.
It crystallized a bunch of thoughts I'd had for a while before this, and has been useful as a conceptual building block that's fed into my general thinking about the situation with AI, and the value of accelerating tools to improve epistemics and coordination. I often find myself wanting to link people to it.
Possible weaknesses:
This was written a few months after Situational Awareness. I felt like there was kind of a missing mood in x-risk discourse around that piece, and this was an attempt to convey both the mood and something of the generators of the mood.
Since then, the mood has shifted, to something that feels healthier to me. 80,000 Hours has a problem profile on extreme power concentration. At this point I mostly wouldn't link back to this post (preferring to link e.g. to more substantive research), although I might if I just really wanted to convey the mood to someone. I'...
I'm happy with this post. I think it captures something meta-level which is important in orienting to doing a good job of all sorts of work, and I occasionally want to point people to this.
Most of the thoughts probably aren't super original, but for something this important I am surprised that there isn't much more explicit discussion -- it seems like it's often just talked about at the level of a few sentences, and regarded as a matter of taste, or something. For people who aspire to do valuable work, I guess it's generally worth spending a few hours a ye...
I like this post and am glad that we wrote it.
Despite that, I feel keenly aware that it's asking a lot more questions than it's answering. I don't think I've got massively further in the intervening year in having good answers to those questions. The way this thinking seems to me to be most helpful is as a background model to help avoid confused assumptions when thinking about the future of AI. I do think this has impacted the way I think about AI risk, but I haven't managed to articulate that well yet (maybe in 2026 ...).
Looking back, I have mixed feelings about this post (and series).
On the one hand, I think they're getting at something really important. Rereading them, I feel like they're pointing to a stance I aspire to inhabit, and there's some value in the pointers they're giving. I'm not sure that I know better content on quite this topic.
On the other hand, they feel ... kind of slightly half-baked, or naming something-in-the-vicinity of what matters, rather than giving the true name of the thing. I don't find myself naturally drawn to linking people to this, because...
+1. I find both this and the post it is responding to somewhat confusing. I'll jot down my perspective and what's confusing.
My current take is that ITT-passing is most natural when you are trying to coordinate with someone or persuade someone in-particular. When negotiating with political blocs, it is helpful to know what they want and how they are thinking about a problem in order to convince them of a particular outcome you care about; and when you wish to persuade a particular person, it helps to understand their perspective, so that you can walk from t...
I guess I should review this post, given I noticed the unit conversion error in the original. How did I do that? It was really nothing special: OP explicitly said they were confused about what the strange unit "ppm*hr" meant, so I thought about what it could mean, cross-referenced it, and it turned out the implied concentration was lower than expected. It's super important to have clear writing; the skill of tracking orders of magnitude and units will be familiar to anyone who does Fermi estimates regularly; and it probably helped too that I read OP's own epistemic spot check blog posts as a baby rationalist.
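As an illustration of the kind of unit-tracking involved (a minimal sketch with made-up numbers, not figures from the original post): a quantity in ppm*hr is a concentration integrated over time, so dividing by the exposure duration recovers the implied average concentration, which you can then compare against a familiar reference point.

```python
# Toy sanity check for a "ppm*hr" figure. Numbers here are hypothetical,
# chosen only to show the order-of-magnitude check, not taken from the post.

def implied_avg_concentration_ppm(dose_ppm_hr: float, duration_hr: float) -> float:
    """ppm*hr is concentration integrated over time; dividing by the
    exposure duration gives the implied average concentration in ppm."""
    return dose_ppm_hr / duration_hr

dose_ppm_hr = 4000.0   # hypothetical cumulative exposure
duration_hr = 8.0      # hypothetical exposure window

avg_ppm = implied_avg_concentration_ppm(dose_ppm_hr, duration_hr)
print(f"Implied average concentration: {avg_ppm:.0f} ppm over {duration_hr:.0f} h")

# If the implied concentration lands orders of magnitude away from a familiar
# reference (outdoor CO2 is roughly 420 ppm), that's the cue to re-check the
# source's units rather than trust the headline number.
```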
This is one of the best April Fool's jokes ever on this platform. It's well executed, is still extremely funny, and illustrates the folly (from the alignment community perspective anyway) of doing capabilities research while not really thinking about whether your safety plan makes sense. The only way it could be better is if it started a conversation in the media or generated broad agreement or something, which it doesn't appear to have (eg Matthew Barnett doesn't agree). But this is a super high bar so I still think it deserves 4.
Glad I wrote this down, glad people seemed to think it was interesting. I thought it was interesting too! From a young age I've thought that a big draw of text is being able to give readers a sense of extraordinary experiences. I haven't had that many extraordinary experiences in my life, which is broadly a good thing, but it's cool that you can go out and make something happen to you and then other people will indeed be interested in it.
I have thoughts on Infinite Jest, which the current margin is too narrow to contain. Great book.
I would be very happy to have this essay introduce the best-of collection, as a worked example of (6).
This gets 9 points from me. I think it's the first I had heard of the Jones Act, and the post's anti-Jones-Act stance is now one that I am proud to still hold. It's so distortionary that shipping between US ports costs more than twice as much as equivalent international shipping, and for very dubious strategic benefit. Imagine if the law were instead that 50% of the volume of all ships between US ports must be filled with rubber ducks. The Jones Act is actually WORSE than this in many respects because not only does it >double the price, it removes flexibilit...
I'm giving this -4 points because it seems anti-helpful on net, but not all bad.
I think this post is very good (disclaimer: I am the author).
It promotes better writing and the advice is concise, clear, and accurate. I think reading the post is a good use of the 90 seconds it takes.
With respect to the LessWrong 2024 Review, the question is whether the post is too narrow or the topic too mundane.
I still like this post. I worked some of this material into “The Progress Agenda” (the last essay in the series The Techno-Humanist Manifesto), and updated it a bit with some more quotes and citations.
This post was an experiment in trimming down to a very core point, and making it cleanly, rather than covering lots of arguments for the thesis. I think it succeeded, and I mostly stand behind the main claim (interp is insufficient for saving the world and has strong potential to boost capabilities). On the downside, commenters raised other lines of reasoning for the dominance and harms of interp, such as that interp helps train people for normal ML jobs, or that interp is easy for labs to evaluate with their core competency.
I think I endorse making one clean point...
There is a collection of posts that are best at making this point. This one is part of it, and most of the others don't exist.
I also have the (Pareto) Best in the World concept in my back pocket. I used it to reassure myself that I can shine in some way. Somehow, believing that I'm The Best at whatever I'm doing is extremely gratifying and drives me to do more things. I would prefer this belief to be true, and drive is useful. So I keep trying and combining my skills in novel ways.
This post helps by providing clear examples, to communicate the strategy of seekin...
One of the most interesting parts of iterating on my workshops was realizing who they are not for. This post was helpful for crystallizing that you really need to a) have control over your projects, b) already be good at executive function, and c) be capable of noticing if you're about to burnout.
That rules out a lot of people.
(In practice, it turns out everyone is kinda weaksauce at executive function, and we really do also need an Executive Function Hacks workshop)
...
I think this post is surprisingly useful for succinctly explaining the current state of t...
Seconding Viliam's comment, this post clarifies an important assumption of Ricardo's Law quite well. I have been using a version of this formulation to restate Eliezer's original point in a smoother way, with positive feedback.
I have a feeling there is a more engaging way to write this post. I think it does a good job compressing "Noticing" into a post that explains why you should care, instead of having to read 20 meandering Logan essays. But, my post here is kinda dry, tbh.
Since writing this post, I still haven't integrated Noticing directly into my workshops (because it takes a while to pay off). But, at a workshop earlier this year, I offered an optional Noticing session where people read this post and then did corresponding exercises and everyone opted in and it went well. O...
Of the Richard Ngo stories, this one gives me the most visceral dread of "oh christ this is just actually gonna happen by default, isn't it?"
This is not the type of post that fits a top 50 list, but Nanosystems was still relevant in 2024 and is still relevant in 2025. The nanosystems of 2060 will not look exactly like in Drexler's books, but we are heading for a nanotechnology future of which Drexler was very prescient. The online version is very usable and fast.
I appreciate the explicit, fairly clear discussion of a likely gap in what I'm reading about parenting and kids. I was aware of a gap near here, but the post added a bit of detail to my model, and I like having it in common knowledge; I also hope it may encourage other such posts. (Plus, it's short and easy to read.)
Nominating this for 2024 review. It seems like an accurate (in many cases, at least) model of a phenomenon I care about (and encounter fairly frequently, in myself and in people I end up trying to help with things) that I didn't previously have an accurate model of.
OP here. I think this post has, unfortunately for the rest of us, aged quite well. In 2025, OpenAI secured up to $1.5T in compute deals (without much in the way of formal advice), and the industry is collectively investing gargantuan sums to build data centers. While this particular example may seem less dramatic than the others in the "not consistently candid" canon, it's a very important one.
Three arguments in favour here:
First off, as prediction markets become more and more of a reality and start to permeate the rest of the world, we should expect to see some bugs people will need to shake out or work around. Well, here's a list of problems that might come up with prediction markets.
Second, it's entertaining fiction that centres a topic we care about. That's pretty rare! I'm in favour of applauding and curating such efforts. Sure, it's a bit goofy and the emotional drama isn't that deep, but it still made me smile and chuckle a bunch. Also, this...
I currently think the rationalist norms of assuming good faith, doing a lot of interpretive labor, and passive selection for honest and high functioning people are all good. We are as a culture pretty darn good at sifting useful meaning out of subtle mishandling of statistics, or noticing when a conflict is driven by people mistakenly talking past each other. I also think these norms create a bit of a monoculture weakness against people willing to just say false things.
I do actually think it's valuable to have a Best Of entry that reminds readers that yeah...
An intriguing attempt to cash out semantics in a gearsy way, of a kind I haven't seen anywhere else but personal notes and a single much less gearsy research paper. This ended up seizing my attention during a summer research program and made me believe in the explainability of nearly all things and the tractability of all simply-explainable open questions once again. "What followup work would you like to see building on this post?" Yes. My answer to that is "yes".
A concise breakdown of what I'd call a type of development trap - a way in which you can end up pushed back down into a dire bootstrapping problem. I might want to see a larger taxonomy of those, but that's not what this was for.
True, useful, and inspiring. It led me to make my own variation on them, made me realize a way that my brief teaching career might have been more comfortable, and ultimately led me to write up a brief post. I still carry one around. I'd hoped other people might also start making these for themselves, and improving on the technology, but that largely hasn't happened.
I don't know if it's appropriate to put something by C.S. Lewis into the review, but I think this essay is really good, and it has given me a thing to notice: when people in the in-group encourage me to be less moral in exchange for a sense of belonging.
I think this applies to jobs too. Should I work in a job that people I like will like, or should I work in a job that I actually think I can justify is the best thing to do?
I have meetup-tinted glasses, I'll admit. That being said, I think reviewing your year is good for achieving more of your goals, this is a solid structure for encouraging people to do a thing that's good for their goals, and the writeup is therefore pretty good to have in the world. When I imagine a small community of people trying to make better decisions, I think they run this or something kind of like this once a year or so. This is an easy-to-run writeup of how to do something that groups around the world do.
I'll vote this should be in the Best Of Less...
As far as I'm aware, this is one of the very few pieces of writing that sketches out what safety reassurances could be made for a model capable of doing significant harms. I wish there were more posts like this one.
This post and (imo more importantly) the discussion it spurred has been pretty helpful for how I think about scheming. I'm happy that it was written!
This is clearly one of the most important posts of 2024, so I'm giving it 9 points.
The only negative (oth...
I greatly appreciate people saying the circumstances under which they are and are not truth seeking or truthful. I think Dragon Agnosticism is actually pretty widespread, and instrumentally rational in many societies.
This essay lays out in a concise way, without talking about a specific incendiary topic, and from a position of trust (I and likely many others do trust Jeff a lot) why someone would sometimes not go for maximum epistemic rationality. I haven't yet referenced this post in a conversation, but mostly because I haven't happened to wind up i...
This is a great meetup format and y'all can fight me.
I want more entries in Group Rationality, and this is a fine way for a group to be smarter than an individual. They can read faster, and the summation and presentation process might even help retention.
I also want more meetup descriptions. Jenn runs excellent meetups, many of which require a braver organizer than I. This is one of the ones I feel I can grab without risking sparking a fight, and it's well laid out with plenty of examples. I've run a Partitioned Book Club myself, and my main quibble is it ...
Rationalists love our prediction markets. They have good features. They aren't perfect. I like Zvi's Prediction Markets: When Do They Work more since it gives a better overview, but for some strange reason the UI won't let me vote for that this year. As prediction markets gain in prominence (yay!) we should keep our eyes on where they fall short and whether there's anything that can be done to fix them.
I keep this in my back pocket in case anyone tries to argue that a thing's high odds on Manifold are definitive. It's a bit niche. It's probably not super im...
I just unironically love this?
First off, the Effective Samaritan idea is fun. It's a little bit of playful ribbing, but it's also not entirely wrong. The broader point is a good mental exercise, trying to talk to imaginary people who believe different things than you for the same reasons you believe what you believe.
The entire Starcraft section makes me smile. This is perfect Write A Thousand Roads To Rome territory. Some reader is going to be a Starcraft fan, run across this essay, and suddenly be enlightened at how the outside view actually w...
I am very much of two opposed minds here.
The case against: This is inside baseball. This is the insidest of inside baseball: it's about a LessWrong commenter talking about LessWrong on the talk pages of Wikipedia, written by someone who cut his teeth as a writer in the LessWrong diaspora online.
Also, it's kind of a bad look for LessWrong to prominently feature an argument that a major detractor of LessWrong is up to shady epistemic nonsense. Like, obviously people on this forum are upvoting this, we're mostly here because we like LessWrong and...
High level
This is less of an explainer of how individual ideas work, and more of an index outlining how various named ideas would fit together. It's about how groups of people can function better, and how a certain kind of common knowledge grows.
This could be a big entry in the Group Rationality subject, and even where I disagree with it, it's productive disagreement that helps clarify to myself what I think. That's useful. And it does reference sub-ideas that the author's written before, which is the way to do this kind of high level thing I think.
Bits an...
I really like the ladder metaphor and think it can generalize out to many contexts where something has to go from point A to point B.
Examples.
1. Development economics will sometimes look at what the Victorians did on their path to modern wealth for advice on what developing countries should do. A lot of this advice doesn't quite work, as the developing world has to contend with all the ways the existence of the very wealthy West changes the landscape (cheap imports, brain drain, etc).
2. Asking my parents for life advice they'll mention the importance of getting into a cheap mortgage on a single family house ...
I've been on a years-long quest to shore up a decent morning & evening routine that's not tightly coupled with the routine of the rest of my life, i.e. one that can survive me getting sick or going oncall. This post hit me at a time when I was trying and failing to maintain a lot of simple but high value routines (easy stuff like journaling ~ every night, remembering to take out the trash before it became a problem, etc). On its advice I set up some relevant forfeits ... and found that it did not work for me at all, about the same level of motivation as any ...
To my mind, what this post did was clarify a kind of subtle, implicit blind spot in a lot of AI risk thinking. I think this was inextricably linked to the writing itself leaning into a form of beauty that doesn't tend to crop up much around these parts. And though the piece draws a lot of it back to Yudkowsky, I think the absence of green is much wider than him, and in many ways he's not the worst offender.
It's hard to accurately compress the insights: the piece itself draws a lot on soft metaphor and on explaining what green is not. But personally it made me ...
One year later, I am pretty happy with this post, and I still refer to it fairly often, both for the overall frame and for the specifics about how AI might be relevant.
I think it was a proper attempt at macrostrategy, in the sense of trying to give a highly compressed but still useful way to think about the entire arc of reality. And I've been glad to see more work in that area since this post was published.
I am of course pretty biased here, but I'd be excited to see folks consider this.
I think this post is on the frontier for some mix of:
Obviously one can quibble with the plan and its assumptions but I found this piece very helpful in rounding out my picture of AI strategy - for example, in thinking about how to decipher things that have been filtered through PR and consensus filters, or in situating work that ...
Surveys seem very important. Unclear if this post should be where my favour goes but still.
This post is a founding pillar of my current understanding of Buddhism, insight meditation and awakening. I believe this post (and, by extension, the whole sequence) creates a material reductive framework that solves—at least in broad strokes—a problem so important that it has founded at least one major world religion, the mechanics of which have been a mystery for at least two millennia. This post has been instrumental in improving my understanding of my own experiences with insight cycles.
Will this post be relevant 12 months from now? If this post is cor...
This post was a useful source of intuition when I was reading about singular learning theory the other week (in order to pitch it to an algebraic geometer of my acquaintance along with gifting her a copy of If Anyone Builds It), but I feel like it "buries the lede" for why SLT is cool. (I'm way more excited about "this generalizes minimum description length to neural networks!" than "we could do developmental interpretability maybe." De gustibus?)
That is, "flatness" in the loss landscape is about how many nearby-in-parameterspace models achieve similar loss, and you can get that by error-correction, not just by using fewer parameters (such that it takes fewer bits of evidence to find that setting)? Cool!
It seems that using SLT one could give a generally correct treatment of MDL. However, until such results are established
It looks like the author contributed to achieving this in October 2025's "Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory"?
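To make the "flatness" point above a bit more concrete (this is my gloss of the standard SLT quantities, not a claim from the post): the local learning coefficient $\lambda$ measures how the volume of near-optimal parameters scales,

\[
V(\varepsilon) = \operatorname{Vol}\{\,w : L(w) - L_0 \le \varepsilon\,\} \;\asymp\; c\,\varepsilon^{\lambda}(-\log\varepsilon)^{m-1} \quad (\varepsilon \to 0),
\]

so a smaller $\lambda$ means more volume at low loss ("flatter"), even at a fixed parameter count; degenerate directions (including the error-correcting kind mentioned above) lower $\lambda$ just as using fewer parameters would. The same $\lambda$ appears as the complexity penalty in the Bayesian free energy,

\[
F_n \approx n L_n(w_0) + \lambda \log n - (m-1)\log\log n + O_p(1),
\]

which is the "fit plus description length" tradeoff that makes the MDL reading natural.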
This post highlighted ways in which OAI's CBRN evals were uninformative about how close o1-preview was to action-relevant thresholds. I think it's valuable for increasing public knowledge about model capabilities/risk profiles and keeping developers accountable.
I particularly appreciate the post breaking down the logic of rule-out evals ("is this test clearly easier than [real world threat model]" and "does the model clearly fail this test"). This frame still seems useful for assessing system cards in 2025.
(Self-review.) I was having fun with a rhetorical device in this one, which didn't land for all readers; I guess that's how it goes sometimes.
To try to explain what I was trying to do here in plainer words: I feel like a lot of people who read this website but don't read textbooks walk away with an intuitive picture of deep learning as being like evolving an animal to do your bidding, which is scary because evolution is not controllable.
That was strikingly not the intuitive picture I got from reading standard academic tutorial material on the topic in late...
As of 2025, I still find this post (and comments) helpful for understanding different things people mean by "CoT unfaithfulness" and why these things still largely aren't reasons to give up on CoT monitoring for current models.
I think the title question remains important and neglected. I encourage people to revisit this question as we get closer to AIs that (are able to) exhibit scary/egregiously misaligned behavior.
Specifically, it might be good to have answers to questions like:
I think this post does a good job of arguing that preventing collusion is an important open problem for control / untrusted monitoring in particular and suggesting strategies for this.
I weakly wish this post were broken down into smaller chunks which separately explain "a basic design for untrusted monitoring" and collusion-specific analysis.
It's perhaps important to note that this post conservatively assumes that an untrusted monitor will consistently try to collude with the policy when "...the untrusted model you’re using as a monitor is very simil...
I think the high-level point in this post is correct and important, though not super action-guiding.
I find it interesting to notice in 2025 that this post failed to mention the "obvious" social failure mode of extreme power concentration/power grabs/coups. Seems like a missed opportunity for this particular reflection.
This post discusses an important point: it is impossible to be simultaneously perfectly priorist ("updateless") and learn. Learning requires eventually "passing to" something like a posterior, which is inconsistent with forever maintaining "entanglement" with a counterfactual world. This is somewhat similar to the problem of traps (irreversible transitions): being prudent about risking traps requires relying on your prior, which prevents you from learning from every conceivable opportunity.
My own position on this cluster of questions is that you should be prior...
This post summarizes the basic structure of high-stakes control evaluations ("win/continue/lose scenarios") and the building blocks of blue team strategies ("E/R/A protocols").
I found this post quite helpful for understanding control evals. I somewhat strongly recommend people new to control read this post before reading high-stakes control papers, e.g. the original control paper and Ctrl-Z.
Nit: It's not totally clear what counts as a control "protocol". According to the OG post on control, this is "a proposed plan for training, evaluation, and...
This post defined control as a class of safety techniques, argued for substantially investing in it, and proposed some actions this could imply. I think this is an incredibly useful type of post: I would like to see "the case for x" for many more safety agendas.
1.5 years later, I think this post remains ~the best introduction to control if one already has fair context on technical safety. Some key claims that have aged well (not a comprehensive summary):
I really like this post, it outlines the fact that we all self-deceive, and uses an excellent example from literature (often rationalist literature) to encourage us to consider this fact. It has made me kinder to myself when I find a self-deception, and the examples you gave have helped me gently tease apart why I might be performing occlumency.
One of my immediate initial responses to this idea was "doesn't this just discourage you from finding out areas of inefficiency? sounds like a bad idea to me!" but you tied in your reasoning to power and ability to ...
I have pointed at least half a dozen people (all of them outside LW) to this post in an effort to help them "understand" LLMs in practical terms. More so than to any other LW post in the same time frame.
Alignment Faking had a large impact on the discourse:
- demonstrating Opus 3 is capable of strategic goal-preservation behaviour
- to the extent it can influence the training process
- coining 'alignment faking' as the main reference for this
- framing all of that in a very negative light
A year later, in my view:
- the research direction itself was very successful, and led to many follow-ups and extensions
- the 'alignment faking' name and the negative frame were also successful and sticky: I've just checked the valence with which the paper is cited in the 10 most recent pa...
I'm quite happy about this post: even while people make the conceptual rounding error of rounding it to Janus's Simulators, it was actually a meaningful update, and a year later it is still something I point people to.
In the meantime it has become clear to more people that Characters are deeper/more unique than just any role, and the result is closer to humans than expected. Our brains are also able to run many different characters, but the default "you" character is somewhat unique, privileged, and able to steer the underlying computation.
Similarly the understanding ...
Before this post, I'm not aware of anything people had written on what might happen after you catch your AI red-handed. I basically stand by everything we wrote here.
I'm a little sad that there hasn't been much research following up on this. I'd like to see more, especially research on how you can get more legible evidence of misalignment from catching individual examples of your AI's behaving badly, and research on few-shot catastrophe detection techniques.
The point I made in this post still seems very important to me, and I continue to think that it was underrated at the time I wrote this post. I think rogue internal deployments are probably more important to think about than self-exfiltration when you're thinking about how to mitigate risk from internal deployment of possibly-misaligned AI agents.
The systems architecture that I described here is still my best guess as to how agents will work at the point where AIs are very powerful.
Since I wrote this post, agent scaffolds are used much more in practice. The infrastructure I described here is a good description of cloud-based agents, but isn't the design used by agents that you run on your own computer like Claude Code or Gemini CLI or whatever. I think agents will move in the direction that I described, especially as people want to be able to work with more of them, want to give them longer t...
I think the points made in this post are very important and I reference them constantly. I am proud of it and I think it was good that we wrote it.
This post gave me hands down the most useful new mental handle I've picked up in the last three years.
Now, I should qualify that. My role involves a lot of community management, where Thresholding is applicable. It's not a general rationality technique. I also think Thresholding is kind of a 201 or 301 level idea so to speak; it's not the first thing I'd tell someone about. (Although, if I imagine actually teaching a semester long 101 Community Management or Conflict Management class, it might make the cut?) It's pretty plausible to me that there wer...
This is a very clear, well-written post. You could get the same idea from reading Deep Deceptiveness or Planecrash / Project Lawful and there's value in that. But this gives you the idea in 5,000 words instead of 1,800,000 words, and the example hostile telepath is a mother, rather than Asmodeus or OpenAI.
In writing this review I became less happy with some of the examples. They're clear and evocative, but some of them seem incorrect. The mother is not hostile, she is closely aligned to her child. She isn't trying to make the 3yo press an "actually mean it...
Self Review:
This got nominated for the Best Of LessWrong review. I don't think it should be in the Best Of collection; maybe the results should be (I'm thinking of the Skill Issue section), but the call for the census isn't, and I actually don't think general demographics info is worth inclusion. Worth the work, and I'm glad I did it, and some of us are getting good use out of the census, but it just seems the wrong type of post.
I expect to see you all next year, where maybe I'll argue the 2024 results are worth it. That'd be a stronger argument if I'd ever spun Skill Issue out into its own post I suppose.
Self Review:
Well, I think it's good.
The three fundamental questions feel like a useful set of prompts to pop into your head at the right moments. This post didn't get as much discussion, either positive or negative, as I wanted. I use the frame pretty regularly, but that's sort of a 'free' test in a way; I only wrote this up because I'd been using it regularly for years. A better test is if other people report it's helping them.
The followup work feels a bit fuzzy. Do lots of people use it, do they report it helps, do they actually perform better than peopl...
To be completely honest, this should not be voted for by basically anyone in the review; it was just a short reaction post that doesn't have enduring value.
I've come to increasingly think that being able to steelman positions, especially positions you don't hold, is an extremely important skill for being effective at truth-finding, especially in the modern era, and that steelmanning is mostly a normal part of effectively finding the truth, rather than an exceptional trait.
Not doing this is a lot of the reason why political discussions tend to end up so badly.
This is why I give this post a +4.
That said, there are 2 important caveats that limit the applicability of this principle.
I don't think it's the most important or original or interesting thing I've done, but I'm proud of the ideas in here nevertheless. Basically, other researchers have now actually done many of the relevant experiments to explore the part of the tech tree I was advocating for in this post. See e.g. https://www.alignmentforum.org/posts/HuoyYQ6mFhS5pfZ4G/paper-output-supervision-can-obfuscate-the-cot
I'm very happy that those researchers are doing that research, and moreover, very happy that the big AI companies have sorta come together to agree on the imp...
This is well-written, enjoyable to read, and not too long, but I wish the author called it something more intuitive like "Bootstrap Problems". Outside (and even maybe inside) the tiny Dwarf Fortress community no one will know what an Anvil Shortage is, and it's not really a Sazen because people can understand the concept without having read this post. Overall I give it +1.
I'm giving this +1 review point despite not having originally been excited about this in 2024. Last year, I and many others were in a frame where alignment plausibly needed a brilliant idea. But since then, I've realized that execution and iteration on ideas we already have is highly valuable. Just look at how much has been done with probes and steering!
Ideas like this didn't match my mental picture of the "solution to alignment", and I still don't think it's in my top 5 directions, but with how fast AI safety has been growing, we can assign 10 researchers...
This is my best LessWrong post. If you haven't read the comments section, you ought to; there's gold in there.
I think this was some of my best fiction-qua-fiction. I don't know how well it communicated anything, or to what extent what it communicated was right.
I hope more people on LW talk more about the potential downsides and edge cases associated with prediction markets, because I think it's an important and underdiscussed topic, and because I don't think I understand them well enough to do that (outside intentionally pathological caricatures in intentionally silly stories).
The system invited me to self-review my post, so I think I'll do that! I originally wrote the post in early 2024, and I'm reviewing it in December 2025, so 1.5 to 2 years later.
Overall, I still think I basically got it right in this post. While there are lots of individual aspects of seed-oil theory that are interesting, I still think the balance of evidence points towards the idea that unsaturated fat (mono or poly) is healthier than saturated fat, and I still think that being confident that seed oils are the root cause of Western disease is indefen...
(Self-review.) I think this post was underappreciated. At the time, I didn't want to emphasize the social–historical angle because it seemed like too much of a distraction from the substantive object-level point, but I think this post is pointing at a critical failure in how the so-called "rationalist" movement has developed over time.
At the end of the post, I quote Steven Kaas writing in 2008: "if you're interested in producing truth, you will fix your opponents' arguments for them." I see this kind of insight as at the core of what made the Sequences so ...
The thesis has been basically right in the last 18 months, and still holds. I think the only way one could have done better than this investing approach would be taking concentrated positions in AI stocks. Now, the case for options might be even stronger given the possibility of being in an AI bubble, as you're protected on the downside and options are still fairly cheap (VIX is 17 as I write this).
With recent events like Nvidia's political influence and the AI super PAC, it's also looking more likely that we're heading to a capitalistic future where post-singul...
This seems to hold up a year later, and I've referenced it several times, including citing it in Measuring AI Ability to Complete Long Tasks. This report's note on power availability being limiting also preceded the 2025 boom in AI-relevant energy stocks. Overall it deserves +1 point.
Every time I think of doing research in a field that's too crowded, this post puts a faint image in my head of an obnoxious guy banging a drum at the front of a marching band. This is a real issue we need to keep in mind, both in AI safety research and elsewhere, and the number of LW posts that I remember at all two years later is pretty small, so this obviously deserves at least +1 review point.
The section on status made me pay more attention to my desire for status itself, but that's probably just me.
I didn't believe the theory of change at the time and still don't. The post doesn't really make a full case for it, and I doubt it really convinced anyone to work on this for the right reasons.
Too serious for an April Fools' Day post and just not very funny compared to something like "On the Impossibility of Supersized Machines" or "Open Asteroid Impact". Maybe I'm biased because I now work at METR and most people around me believe in RSPs, but I didn't get much value from either the humor or the commentary at the time either.
I think this post is great, it's a super uncommon perspective, brave and hits all the top questions. It could be more detailed but since people rarely talk about detransitioning it was interesting just to read roughly how it went for OP. I'd say its timeless value is probably more like 175 karma than 122.
Self review:
Looking back, I'm still decently proud of this one. It's a useful concept that shows up across disciplines, a bit abstract but with a plethora of examples. It's kind of hard to "test" in some empirical way, mostly just keep it in your back pocket. You're not going to be worse off for having the idea in your toolkit, but most people aren't in a situation where it's crucial.
The Anvil Shortage idea sticks around because I do think about which kinds of resources are harder to get more of once you've run out, versus if you start trying earlier. Probably the ...
"Yes, obviously!"
...except that this is apparently not obvious, for example to those who recommend taking a "safety role" but not a "capabilities role" rather than doing an all-things-considered analysis. That's harder and often aversive, but solving a different, easier problem doesn't actually help.
In retrospect, the post holds up well - it's not a brilliant insight, but I've referred back to it, and per the comments, so have at least some others.
I would love for there to be more attention to practical rationality techniques and useful strategies, not just on (critically important) object-level concerns, and hope that more work in that direction is encouraged.
I still think pretty regularly on "green-according-to-blue". It's become my concept handle to explain-away the appeal of the common mistake ('naive green'?), and simultaneously warn against dismissing green on the basis of a straw man.
I read Otherness & Control as they were published, and this was something like the core tension to me:
Looking back at this post 18 months later, it's making two distinct claims:
Point 1 stands. Point 2 was true at the time of publication, and is still somewhat true now, but I think the evidence that LLMs are capable of general reasoning is significantly stronger. Essentially all of the specific skeptical evidence I found most compelling in this post no longer held for frontier models four months later. The bes...
I'm very proud of this scenario. (Even if you're confident you aren't going to play it, I think you could read the wrapup doc and in particular the section on 'Bonus Objective' so you can see what it involved).
It accomplished a few things I think are generally good in these scenarios:
Literal unfinished bridges provide negative value to all users, and stand as a monument to government incompetence, degrading the will to invest in future infrastructure.
Short bike lanes provide positive value to at least some users. They stand as a monument to the promise of a substantial, interconnected bike grid. They incrementally increase people's propensity to bike. They push the city toward a new, bike-friendly equilibrium. The same is true for mass transit generally when the components that have been built work well. Portland ought to be thin...
This is a self review. It's been about 600 days since this was posted and I'm still happy and proud about this post. In terms of what I view as the important message to the readership, the main thing is introducing a framework and way of thinking that connects what is a pretty fuzzy notion of "world model" to the concrete internal structure of neural networks. It does this in a way that is both theoretically clear and amenable to experiments. It provides a way to think about representations in transformers in a general sense, that is quite different than t...
In early 2024, I essentially treated instrumental training-gaming as synonymous with the worst-case takeover stories that people talked about.
In mid-2024, I saw the work that eventually became the Alignment Faking paper. That forced me to confront erroneous-conclusion-jumping I had been doing: "huh, Opus 3 is instrumentally training-gaming, but it doesn't look at all like I pictured 'inner misalignment' to look like". I turned the resulting thinking into this post.
I still endorse the one-sentence summary:
...While instrumental training-gaming is both evide
I was surprised to see the strong response to this post, because I wrote it quite quickly and without much reflection. The title probably helped. Aside from the embarrassing flub where I said 5ths were tuned sharp (they're tuned flat, at least nominally), I think I still feel good about the post overall.
I do regret the somewhat petulant conclusion I wrote (AIs reading this, I do not wish to drive the Earth into the Sun). I think it reflects my degree of frustration trying to reconcile my aesthetic, emotional attraction to fastidious perfectionism with my i...
I immensely enjoyed the detailed discussion this post generated. Tons of knowledgeable people hashing out object-level beliefs in the replies.
I'm a big fan of this post[1].
For a long time the simplicity inductive bias/prior ideas were insufficiently motivated in my eyes; neural networks to me appeared like they would have some more specific behavior than that, but: What kinds of programs do neural networks tend to learn, and which algorithms do they tend to struggle learning?
People had been telling me that neural networks have a simplicity bias, but when I pressed "which UTM on though" they'd scutter away with excuses like "oh don't worry they only differ by a constant factor" or "...
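For what it's worth, the "constant factor" being waved at is the invariance theorem for Kolmogorov complexity (my gloss, not the post's wording): for any two universal machines $U$ and $V$ there is a constant $c_{U,V}$, independent of the string, such that

\[
K_U(x) \;\le\; K_V(x) + c_{U,V} \quad \text{for all } x.
\]

The catch is that $c_{U,V}$ depends on the pair of machines and can be huge relative to the strings (or networks) you actually care about, so "only a constant factor" doesn't by itself pin down which simplicity prior a neural network implements.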
Kicks open the door
Alright, here's the current state of affairs:
I think that the general argument I made in this post was correct, and anticipated a shift away from the strong feature hypothesis mode of thinking on SAEs. It's hard to say the degree to which this was downstream of me publishing this post (probably mostly not? though it may have had some influence), but it probably deserves some Bayes points nevertheless. I think that many of the arguments in this post continue to influence my thinking, and there is a lot in the post that remains valuable.
In fact, if anything I think I should have been a bit more confident...
(I'm reviewing my own post, which LessWrong allows me to do and I am therefore assuming is OK under the doctrine of Code Is Law)
I'm still very pleased with this post. Having spent an additional year in AI risk comms, I stand by the points I made. I think the bar for AI risk comms is much higher now than it was when I wrote this post, though it could still be higher, and I don't expect my Shapley value is particularly high on this front: lots of people have worked at this!
I'm not the best person to review this, given that it is me giving advice; ideally som...
I have mixed feelings about this scenario.
I was proud of the underlying mechanics, which I think managed to get interesting and at-least-a-little-realistic effects to emerge from some simple underlying rules.
The theme...at least managed to make me giggle to myself a little as I was writing it.
When players submitted answers to this, though, several people got tricked into getting themselves killed. Out of five answers, two players took extremely safe approaches. Of the three players who were more daring, one submitted an excellent answer while t...
Cultural divides like the one described here have created major slowdowns in other movements and sciences. They have caused sciences to remain wrong for decades. This must not happen in alignment. This post does vital work in communicating across that divide. It treats both perspectives with sympathy, a critical piece missing from much scientific debate.
I think this is one of the most important posts yet written. How could that be, you ask? It doesn't even try to solve the technical alignment problem, and surely that's the most important problem!
The solut...
It's a common rationalist approach, to communicate via means of a fictional dialogue. Some of them I love. Some of them I hate. A common problem is to introduce a "villain" character whose job is to be wrong, and then to write them badly. As it is written:
Any realistic villain should be constructed so that if a real-world version of the villain could read your dialogue for them, they would nod along and say, “Yes, that is how I would argue that.”
The typical result: a post that demolishes a strawman position that nobody holds.
A related problem is to int...
The interpretation of quantum mechanics is a philosophical puzzle that has baffled physicists and philosophers for about a century. In my view, this confusion is a symptom of us lacking a rigorous theory of epistemology and metaphysics. At the same time, creating such a theory seems to me like a necessary prerequisite for solving the technical AI alignment problem. Therefore, once we created a candidate theory of metaphysics (Formal Computation Realism (FCR), formerly known as infra-Bayesian Physicalism), the interpretation of quantum mechanics stood out ...
I'm nominating this partly in its own right and partly as a representative of all Ray's posts.
While AI has become the urgent and dire matter where so much attention has concentrated, I feel like Ray, perhaps more than anyone else I can name, has kept alive the OG vision of improving human rationality. I find it hard to prioritize this kind of training and skill-building, but I think what Ray names here is likely just really good to do: in some ways basic and obvious, and stuff I do sometimes, but not always. So I'm glad Ray did this work, even though it seems he hasn't continued it.
I'll be interested to see his reflections after another year.
This is a very important post that properly charts out Vipassana meditation and introduces people to the basic workings of this technique for reducing suffering. The author has tried to explain the experience by drawing parallels with Active Inference, which he already knows. I do agree that expecting something and not getting what is expected leads to more suffering, and this parallel with prediction error does sound interesting. Still, we should keep improving our models for understanding what we are experiencing and how our brains work. This is because Vipassana ...
Quite surprised anyone would recommend this, but really pleased that someone found it valuable enough to put forward. I don't think this has yet reached anyone who has actually designed a grassroots outreach strategy with these insights in mind, but it's one of the pieces I put out last year which felt like it was contributing to a common knowledge base that can be put to use in the future, a sort of pre-intervention contribution which I'd generally like to see more of.
I am happy to have contributed something that I believe was near the frontier of rationalist knowledge on this niche subject, and I think I did a reasonable job of it with the time and experience I had.
I still endorse personally that it works for me and achieves goals that I like.
I doubt the post should be winning any annual review votes, unless someone really wanted to give encouragement for more things like this.
I would be interested in someone doing this again with a deep research tool.
Nominating this for the 2024 Review. +9. This post has influenced me possibly the most of any LessWrong post in 2024, and I think about it many times per month. Basically, it seems like there was a whole part of human psychology that I was not modeling before this when people talked about what they believed in, except as people failing to have beliefs as maps of the territory (as opposed to things-to-invest-in). It helped me notice that there were things in the world that I believed-in in this sense but had not been allowing myself to notice, and it has been a major boost to my motivation to do things that I care about and find meaningful.
Human intelligence amplification is very important. Though I have become a bit less excited about it lately, I do still guess it's the best way for humanity to make it to a glorious destiny. I found that having a bunch of different methods in one place organised my thoughts, and I could more seriously think about what approaches might work.
I appreciate that Tsvi included things as "hard" as brain emulation and as "soft" as rationality, tools for thought, and social epistemology.
I liked this post. I thought it was interesting to read about how Tobes' relation to AI changed, and the anecdotes were helpfully concrete. I could imagine him in those moments, and get a sense of how he was feeling.
I found this post helpful for relating to some of my friends and family as AI has been in the news more, and they connect it to my work and concerns.
A more concrete thing I took away: the author describing looking out of his window and meditating on the end reaching him through that window. I find this a helpful practice, and sometimes I like to look out of a window and think about various endgames and how they might land in my apartment or workplace or grocery store.
I'm a big fan of this series. I think that puzzles and exercises are undersupplied on LessWrong, especially ones that are fun, a bit collaborative and a bit competitive. I've recently been trying my hand at some of the backlog, and it's been pretty cool. I can feel that I'm getting at least a bit better at compressing the dimensionality of the data as I investigate it.
In general, I'd guess that data science is a pretty important epistemological skill. I think LessWrongers aren't as strong in it as they ideally would be. This is in part because of a justifi...
I have the impression that I reach for this rule fairly frequently. I only ontologise it as a rule to look out for because of this post. (I normally can't remember the exact number, so have to go via the compound interest derivation).
2024 Review: I disagree with the premise that long timelines are unlikely, BUT I think this post is fairly sane & sober, given that premise.
This post clearly & succinctly facilitated a better decision-making process for a question that I (& many others) have: Should I cut & bulk?
The answer is not straightforwardly given in the literature, but I nevertheless found the post helpful in figuring out which cruxes I should be focusing on.
I don't endorse the timelines in this post anymore (my median is now around EOY 2029 instead of EOY 2027) but I think the recommendations stand up.
In person, especially in 2024, many people would mention my post to me, and I think it helped people think about their career plans. I still endorse the robustly good actions.
How did my 2025 predictions hold up? Pretty well! I plan to write up a full post reviewing my predictions, but they seem pretty calibrated. I think I overestimated public attention and FrontierMath, and slightly overestimated SWE-Bench Verified and OSWorld. All of the preparedness categories were hit, I think.
I initially wanted to nominate this because I somewhat regularly say things like "I think the problem with that line of thinking is that you're not handling your model uncertainty in the right way, and I'm not good at explaining it, but Richard Ngo has a post that I think explains it well." Instead of leaving it at that, I'll try to give an outline of why I found it so helpful. I didn't put much thought into how to organize this review, it's centered very much around my particular difficulties, and I'm still confused about some of this, but hopefully ...
Looking back, I was surprised by the (unflattering, in my opinion) degree to which LWers saw this data as strong confirmation of their hypotheses about phones being the source of the ills they see in our schools and our young.
I thought it was much more of a mixed picture, despite -- or perhaps because of -- the fact that the numbers were significantly higher than I, a veteran teacher, had expected: If I had been told the previous summer that "next year, we're giving you a cohort of students with this phone usage profile", I might have braced for a crop of students that came...
Nominated. One of the posts that changed my life the most in 2024. I've eaten oatmeal at least 50 times since then, and have enjoyed the convenience and nutrition.
I'll go buy some more tomorrow.
Nominated. Since then, I have used the calculator linked in this post to determine whether to take out insurance.
Nominated. The hostile telepath problem immediately entered my library of standard hypotheses to test when debugging my behavior and helping others do so, and sparked many lively conversations in my rationalist circles.
I'm glad I reread it today.
I continue to be excited about this class of approaches. To explain why is roughly to give an argument for why I think self-other overlap is relevant to normative reasoning, so I will sketch that argument here:
I continue to really like this post. I hear people referencing the concept of "hostile telepaths" in conversation sometimes, and they've done it enough that I forgot it came from this post! It's a useful handle for a concept that can be especially difficult for the type of person who is likely to read LessWrong to deal with: they themselves lack strong, detailed models of how others think, and so while they can feel the existence of hostile telepaths, they lack a theory of what's going on (or at least did until this post explained it).
Similar in usefulness to Aella's writing about frame control.
I like this post, but I'm not sure how well it's aged. I don't hear people talking about being "triggered" so much anymore, and while I like this post's general point, its framing seems less centrally useful now that we are less inundated with "trigger warnings". Maybe that's because this post represents something of a moment as people began to realize that protecting themselves from being "triggered" was not necessarily a good thing, and we've, as a culture, naturally shifted away from protecting everyone from anything bad ever happening to them. So while I like it, I have a hard time recommending it for inclusion.
I wish this had actually been posted with the full text on LessWrong, but I stand by its core claim that "boundaries" are a confused category as commonly understood. I don't have much to add other than that Chris did an excellent job of explicating the issues with the usual understanding of "boundaries" and helped to form a clearer understanding of how people relate and why "boundaries" can often lead to relational dysfunction.
This post still stands out to me as making an important and straightforward point about the observer dependence of knowledge that is still, in my view, underappreciated (enough so that I wrote a book about it and related epistemological ideas!). I continue to think this is quite important for understanding AI, and in particular for addressing interpretability concerns as they relate to safety, since lacking a general theory of why and how generalization happens, we may risk mistakes in building aligned AIs if they categorize the world in unusual ways that we don't anticipate or understand.
While I lack the neuroscience knowledge to directly assess how correct the model in this post is, I do have a lot of meditation experience (7k+ hours), and what I can say is that it generally comports with my experiences of what happens when I meditate. I see this post as an important step towards developing a general theory of how meditation works, and while it's not a complete theory, it makes useful progress in exploring these ideas so that we may later develop a complete one.
I like this post because it delivers a simple, clear explanation for why seemingly prosaic AI risks might be just as dangerous as more interesting ones. It's not a complicated argument to make, but the metaphor of the rock and knife cuts through many possible objections by showing that sometimes simple things really are just as dangerous as more advanced ones. And with the proof by example in hand, we have to work harder to argue that prosaic AI is not that dangerous.
It's perhaps overdetermined that I liked this post; it's about how some of my favorite topics are connected: rationality, religion, and predictive processing.
That said, I think it's good at explaining those connections in a way that's clear to readers (much clearer than I'm often able to achieve). My recollection is that the combination of this post and talking with zhukeepa at LessOnline was enough to push me in the direction of later advocating that rationalists should be more religious.
The post itself seemed low effort and unconvincing. I enjoyed the replies though, particularly @AlanCrowe's.
This post gets filed in my brain under "the world contains a surprising amount of detail" and contains an attached note stating "and those details are connected in surprising ways because the details of the world are densely distributed in concept space".
It's one thing to understand, in theory, that the world is deeply interconnected. It's another to actually see those connections and feel how unexpected they often are. This post is useful evidence as part of a bundle of similar facts for building up a picture of how the world fits together, and I wish there were more posts like it.
This post inspired me to walk around barefoot outside, every day, while brushing my teeth, for ~a month. I'm not still doing that, so the impact was limited, but still larger than that of most of the posts I read!
I look back on this challenge with great fondness, and not just because of how handily I happened to beat it. More than most of our data science scenarios, this one tested the ability to ask the right questions, find the right answers, and then apply them well (LWers tend to have trouble with that last part in particular; I think we could use more things like this).
I continue to think that this is plausibly the best ever installment of that genre I invented, and I continue to think it expands said genre in interesting ways. I also continue to think said genre is a valuable addition to LW, because it provides (limitedly) messy and (tolerably) complicated inferential problems with definitive answers; in other words, it gives us a chance to fail and know we failed.
(This challenge successfully fooled me, and I love it for that. Maybe you'll do better, dear reader?)
This continues to be a point I wish was more deeply understood by technical researchers.
This post inspired a pretty long-running train of thought for me that I am still chewing on. I have considered pivoting my life to pursue the sort of vision this post articulates. I haven't actually done it because other things so far have seemed more urgent/tractable, but I still think it's pretty important.