The Best of LessWrong

Here you can find the best posts of LessWrong. Once posts are more than a year old, the LessWrong community reviews them and votes on how well they have stood the test of time. These are the posts that have ranked highest across all years since 2018 (when our annual tradition of choosing the least wrong of LessWrong began).

For the years 2018, 2019 and 2020 we also published physical books with the results of our annual vote, which you can buy and learn more about here.

Rationality

Eliezer Yudkowsky
Local Validity as a Key to Sanity and Civilization
Buck
"Other people are wrong" vs "I am right"
Mark Xu
Strong Evidence is Common
johnswentworth
You Are Not Measuring What You Think You Are Measuring
johnswentworth
Gears-Level Models are Capital Investments
Hazard
How to Ignore Your Emotions (while also thinking you're awesome at emotions)
Scott Garrabrant
Yes Requires the Possibility of No
Scott Alexander
Trapped Priors As A Basic Problem Of Rationality
Duncan Sabien (Deactivated)
Split and Commit
Ben Pace
A Sketch of Good Communication
Eliezer Yudkowsky
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases
Duncan Sabien (Deactivated)
Lies, Damn Lies, and Fabricated Options
Duncan Sabien (Deactivated)
CFAR Participant Handbook now available to all
johnswentworth
What Are You Tracking In Your Head?
Mark Xu
The First Sample Gives the Most Information
Duncan Sabien (Deactivated)
Shoulder Advisors 101
Zack_M_Davis
Feature Selection
abramdemski
Mistakes with Conservation of Expected Evidence
Scott Alexander
Varieties Of Argumentative Experience
Eliezer Yudkowsky
Toolbox-thinking and Law-thinking
alkjash
Babble
Kaj_Sotala
The Felt Sense: What, Why and How
Duncan Sabien (Deactivated)
Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Ben Pace
The Costly Coordination Mechanism of Common Knowledge
Jacob Falkovich
Seeing the Smoke
Elizabeth
Epistemic Legibility
Daniel Kokotajlo
Taboo "Outside View"
alkjash
Prune
johnswentworth
Gears vs Behavior
Raemon
Noticing Frame Differences
Duncan Sabien (Deactivated)
Sazen
AnnaSalamon
Reality-Revealing and Reality-Masking Puzzles
Eliezer Yudkowsky
ProjectLawful.com: Eliezer's latest story, past 1M words
Eliezer Yudkowsky
Self-Integrity and the Drowning Child
Jacob Falkovich
The Treacherous Path to Rationality
Scott Garrabrant
Tyranny of the Epistemic Majority
alkjash
More Babble
abramdemski
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems
Raemon
Being a Robust Agent
Zack_M_Davis
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists
Benquo
Reason isn't magic
habryka
Integrity and accountability are core parts of rationality
Raemon
The Schelling Choice is "Rabbit", not "Stag"
Diffractor
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Raemon
Propagating Facts into Aesthetics
johnswentworth
Simulacrum 3 As Stag-Hunt Strategy
LoganStrohl
Catching the Spark
Jacob Falkovich
Is Rationalist Self-Improvement Real?
Benquo
Excerpts from a larger discussion about simulacra
Zvi
Simulacra Levels and their Interactions
abramdemski
Radical Probabilism
sarahconstantin
Naming the Nameless
AnnaSalamon
Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality"
Eric Raymond
Rationalism before the Sequences
Owain_Evans
The Rationalists of the 1950s (and before) also called themselves “Rationalists”

Optimization

sarahconstantin
The Pavlov Strategy
johnswentworth
Coordination as a Scarce Resource
AnnaSalamon
What should you change in response to an "emergency"? And AI risk
Zvi
Prediction Markets: When Do They Work?
johnswentworth
Being the (Pareto) Best in the World
alkjash
Is Success the Enemy of Freedom? (Full)
jasoncrawford
How factories were made safe
HoldenKarnofsky
All Possible Views About Humanity's Future Are Wild
jasoncrawford
Why has nuclear power been a flop?
Zvi
Simple Rules of Law
Elizabeth
Power Buys You Distance From The Crime
Eliezer Yudkowsky
Is Clickbait Destroying Our General Intelligence?
Scott Alexander
The Tails Coming Apart As Metaphor For Life
Zvi
Asymmetric Justice
Jeffrey Ladish
Nuclear war is unlikely to cause human extinction
Spiracular
Bioinfohazards
Zvi
Moloch Hasn’t Won
Zvi
Motive Ambiguity
Benquo
Can crimes be discussed literally?
Said Achmiz
The Real Rules Have No Exceptions
Lars Doucet
Lars Doucet's Georgism series on Astral Codex Ten
johnswentworth
When Money Is Abundant, Knowledge Is The Real Wealth
HoldenKarnofsky
This Can't Go On
Scott Alexander
Studies On Slack
johnswentworth
Working With Monsters
jasoncrawford
Why haven't we celebrated any major achievements lately?
abramdemski
The Credit Assignment Problem
Martin Sustrik
Inadequate Equilibria vs. Governance of the Commons
Raemon
The Amish, and Strategic Norms around Technology
Zvi
Blackmail
KatjaGrace
Discontinuous progress in history: an update
Scott Alexander
Rule Thinkers In, Not Out
Jameson Quinn
A voting theory primer for rationalists
HoldenKarnofsky
Nonprofit Boards are Weird
Wei Dai
Beyond Astronomical Waste
johnswentworth
Making Vaccine
jefftk
Make more land

World

Ben
The Redaction Machine
Samo Burja
On the Loss and Preservation of Knowledge
Alex_Altair
Introduction to abstract entropy
Martin Sustrik
Swiss Political System: More than You ever Wanted to Know (I.)
johnswentworth
Interfaces as a Scarce Resource
johnswentworth
Transportation as a Constraint
eukaryote
There’s no such thing as a tree (phylogenetically)
Scott Alexander
Is Science Slowing Down?
Martin Sustrik
Anti-social Punishment
Martin Sustrik
Research: Rescuers during the Holocaust
GeneSmith
Toni Kurz and the Insanity of Climbing Mountains
johnswentworth
Book Review: Design Principles of Biological Circuits
Elizabeth
Literature Review: Distributed Teams
Valentine
The Intelligent Social Web
jacobjacob
Unconscious Economics
eukaryote
Spaghetti Towers
Eli Tyre
Historical mathematicians exhibit a birth order effect too
johnswentworth
What Money Cannot Buy
Scott Alexander
Book Review: The Secret Of Our Success
johnswentworth
Specializing in Problems We Don't Understand
KatjaGrace
Why did everything take so long?
Ruby
[Answer] Why wasn't science invented in China?
Scott Alexander
Mental Mountains
Kaj_Sotala
My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms
johnswentworth
Evolution of Modularity
johnswentworth
Science in a High-Dimensional World
zhukeepa
How uniform is the neocortex?
Kaj_Sotala
Building up to an Internal Family Systems model
Steven Byrnes
My computational framework for the brain
Natália
Counter-theses on Sleep
abramdemski
What makes people intellectually active?
Bucky
Birth order effect found in Nobel Laureates in Physics
KatjaGrace
Elephant seal 2
JackH
Anti-Aging: State of the Art
Vaniver
Steelmanning Divination
Kaj_Sotala
Book summary: Unlocking the Emotional Brain

AI Strategy

Ajeya Cotra
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Daniel Kokotajlo
Cortés, Pizarro, and Afonso as Precedents for Takeover
Daniel Kokotajlo
The date of AI Takeover is not the day the AI takes over
paulfchristiano
What failure looks like
Daniel Kokotajlo
What 2026 looks like
gwern
It Looks Like You're Trying To Take Over The World
Andrew_Critch
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
paulfchristiano
Another (outer) alignment failure story
Ajeya Cotra
Draft report on AI timelines
Eliezer Yudkowsky
Biology-Inspired AGI Timelines: The Trick That Never Works
HoldenKarnofsky
Reply to Eliezer on Biological Anchors
Richard_Ngo
AGI safety from first principles: Introduction
Daniel Kokotajlo
Fun with +12 OOMs of Compute
Wei Dai
AI Safety "Success Stories"
KatjaGrace
Counterarguments to the basic AI x-risk case
johnswentworth
The Plan
Rohin Shah
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
lc
What an actually pessimistic containment strategy looks like
Eliezer Yudkowsky
MIRI announces new "Death With Dignity" strategy
evhub
Chris Olah’s views on AGI safety
So8res
Comments on Carlsmith's “Is power-seeking AI an existential risk?”
Adam Scholl
Safetywashing
abramdemski
The Parable of Predict-O-Matic
KatjaGrace
Let’s think about slowing down AI
nostalgebraist
human psycholinguists: a critical appraisal
nostalgebraist
larger language models may disappoint you [or, an eternally unfinished draft]
Daniel Kokotajlo
Against GDP as a metric for timelines and takeoff speeds
paulfchristiano
Arguments about fast takeoff
Eliezer Yudkowsky
Six Dimensions of Operational Adequacy in AGI Projects

Technical AI Safety

Andrew_Critch
Some AI research areas and their relevance to existential safety
1a3orn
EfficientZero: How It Works
elspood
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
So8res
Decision theory does not imply that we get to have nice things
TurnTrout
Reward is not the optimization target
johnswentworth
Worlds Where Iterative Design Fails
Vika
Specification gaming examples in AI
Rafael Harth
Inner Alignment: Explain like I'm 12 Edition
evhub
An overview of 11 proposals for building safe advanced AI
johnswentworth
Alignment By Default
johnswentworth
How To Go From Interpretability To Alignment: Just Retarget The Search
Alex Flint
Search versus design
abramdemski
Selection vs Control
Mark Xu
The Solomonoff Prior is Malign
paulfchristiano
My research methodology
Eliezer Yudkowsky
The Rocket Alignment Problem
Eliezer Yudkowsky
AGI Ruin: A List of Lethalities
So8res
A central AI alignment problem: capabilities generalization, and the sharp left turn
TurnTrout
Reframing Impact
Scott Garrabrant
Robustness to Scale
paulfchristiano
Inaccessible information
TurnTrout
Seeking Power is Often Convergently Instrumental in MDPs
So8res
On how various plans miss the hard bits of the alignment challenge
abramdemski
Alignment Research Field Guide
paulfchristiano
The strategy-stealing assumption
Veedrac
Optimality is the tiger, and agents are its teeth
Sam Ringer
Models Don't "Get Reward"
johnswentworth
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
Buck
Language models seem to be much better than humans at next-token prediction
abramdemski
An Untrollable Mathematician Illustrated
abramdemski
An Orthodox Case Against Utility Functions
johnswentworth
Selection Theorems: A Program For Understanding Agents
Rohin Shah
Coherence arguments do not entail goal-directed behavior
Alex Flint
The ground of optimization
paulfchristiano
Where I agree and disagree with Eliezer
Eliezer Yudkowsky
Ngo and Yudkowsky on alignment difficulty
abramdemski
Embedded Agents
evhub
Risks from Learned Optimization: Introduction
nostalgebraist
chinchilla's wild implications
johnswentworth
Why Agent Foundations? An Overly Abstract Explanation
zhukeepa
Paul's research agenda FAQ
Eliezer Yudkowsky
Coherent decisions imply consistent utilities
paulfchristiano
Open question: are minimal circuits daemon-free?
evhub
Gradient hacking
janus
Simulators
LawrenceC
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
TurnTrout
Humans provide an untapped wealth of evidence about alignment
Neel Nanda
A Mechanistic Interpretability Analysis of Grokking
Collin
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
evhub
Understanding “Deep Double Descent”
Quintin Pope
The shard theory of human values
TurnTrout
Inner and outer alignment decompose one hard problem into two extremely hard problems
Eliezer Yudkowsky
Challenges to Christiano’s capability amplification proposal
Scott Garrabrant
Finite Factored Sets
paulfchristiano
ARC's first technical report: Eliciting Latent Knowledge
Diffractor
Introduction To The Infra-Bayesianism Sequence
#1

Paul Christiano paints a vivid and disturbing picture of how AI could go wrong, not with sudden violent takeover, but through a gradual loss of human control as AI systems optimize for the wrong things and develop influence-seeking behaviors. 

#2

AI researchers warn that advanced machine learning systems may develop their own internal goals that don't match what we intended. This "mesa-optimization" could lead AI systems to pursue unintended and potentially dangerous objectives, even if we tried to design them to be safe and aligned with human values.

32 · adamShimi
In “Why Read The Classics?”, Italo Calvino proposes many different definitions of a classic work of literature, including this one: For me, this captures what makes this sequence and corresponding paper a classic in the AI Alignment literature: it keeps on giving, readthrough after readthrough. That doesn’t mean I agree with everything in it, or that I don’t think it could have been improved in terms of structure. But when pushed to reread it, I found again and again that I had missed or forgotten some nice argument, some interesting takeaway.

With that, a caveat: I’m collaborating with Evan Hubinger (one of the authors) on projects related to ideas introduced in this sequence, especially to Deceptive Alignment. I am thus probably positively biased about this work. That being said, I have no problem saying I disagree with collaborators, so I don’t think I’m too biased to write this review. (Small point: I, among other people, tend to describe this sequence/paper as mainly Evan’s work, but he repeatedly told me that everyone participated equally, and that the names are in alphabetical order, not contribution order. So let’s keep that in mind.)

Summary

Let’s start the review proper with a post-by-post summary (except for the conclusion):

* (Introduction) This first post introduces the idea of mesa-optimizers, the learned optimizers from the title. A mesa-optimizer is an optimizer which is the result of a learning process, and it comes with the issue of inner alignment: how aligned is the objective of the mesa-optimizer (over which we don’t have direct control) with the objective of the base-optimizer that produced it? The post then splits the safety questions related to mesa-optimizers into two categories: understanding which conditions make mesa-optimizers appear, and understanding how aligned the mesa-objective is with the base-objective.
* (Conditions for Mesa-Optimization) This post tackles the first category outlined in the introduction: how can m
#3

A story in nine parts about someone creating an AI that predicts the future, and multiple people who wonder about the implications. What happens when the predictions influence which future actually happens?

19 · fiddler
I think this post is incredibly useful as a concrete example of the challenges of seemingly benign powerful AI, and makes a compelling case for serious AI safety research being a prerequisite to any safe further AI development. I strongly dislike part 9, as painting the Predict-o-matic as consciously influencing others personality at the expense of short-term prediction error seems contradictory to the point of the rest of the story. I suspect I would dislike part 9 significantly less if it was framed in terms of a strategy to maximize predictive accuracy. More specifically, I really enjoy the focus on the complexity of “optimization” on a gears-level: I think that it’s a useful departure from high abstraction levels, as the question of what predictive accuracy means, and the strategy AI would use to pursue it, is highly influenced by the approach taken. I think a more rigorous approach to analyzing whether different AI approaches are susceptible to “undercutting” as a safety feature would be an extremely valuable piece. My suspicion is that even the engineer’s perspective here is significantly under-specified with the details necessary to determine whether this vulnerability exists. I also think that Part 9 detracts from the piece in two main ways: by painting the predict-o-matic as conscious, it implies a significantly more advanced AI than necessary to exhibit this effect. Additionally, because the AI admits to sacrificing predictIve accuracy in favor of some abstract value-add, it seems like pretty much any naive strategy would outcompete the current one, according to the engineer, meaning that the type of threat is also distorted: the main worry should be AI OPTIMIZING for predictive accuracy, not pursuing its own goals. That’s bad sci-fi or very advanced GAI, not a prediction-optimizer. I would support the deletion or aggressive editing of part 9 in this and future similar pieces: I’m not sure what it adds. ETA-I think whether or not this post should be upd
#4

John Wentworth argues that becoming one of the best in the world at *one* specific skill is hard, but it's not as hard to become the best in the world at the *combination* of two (or more) different skills. He calls this being "Pareto best" and argues it can circumvent the generalized efficient markets principle. 
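
A quick Monte Carlo sketch of my own (not from the post, and the uniform-random skill model is an assumption) makes the arithmetic vivid: among 10,000 people with two independent skills, exactly one person is the best at any single skill, but roughly ln(10,000) ≈ 9 people sit on the Pareto frontier of the pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
skills = rng.random((n, 2))           # two independent skill levels per person

# Sweep in decreasing order of skill 0: a person is on the Pareto frontier
# iff their skill-1 value beats everyone who is better at skill 0.
order = np.argsort(-skills[:, 0])
best_skill1_so_far = -np.inf
frontier_size = 0
for i in order:
    if skills[i, 1] > best_skill1_so_far:
        frontier_size += 1
        best_skill1_so_far = skills[i, 1]

print(frontier_size)  # roughly ln(10_000), i.e. 9-10, vs exactly 1 "best in the world" per skill
```

Each additional skill dimension grows the frontier further, which is the sense in which "Pareto best" is a far easier target than "best".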

16 · Eigil Rischel
This post introduces a potentially very useful model, both for selecting problems to work on and for prioritizing personal development. This model could be called "The Pareto Frontier of Capability". Simply put:

1. By an efficient markets-type argument, you shouldn't expect to have any particularly good ways of achieving money/status/whatever - if there was an unusually good way of doing that, somebody else would already be exploiting it.
2. The exception to this is that if only a small number of people can exploit an opportunity, you may have a shot. So you should try to acquire skills that only a small number of people have.
3. Since there are a lot of people in the world, it's incredibly hard to become among the best in the world at any particular skill.
4. This means you should position yourself on the Pareto Frontier - you should seek out a combination of skills where nobody else is better than you at everything. Then you will have the advantage in problems where all these skills matter.

It might be important to contrast this with the economic term comparative advantage, which is often used informally in a similar context. But its meaning is different. If we are both excellent programmers, but you are also a great writer, while I suck at writing, I have a comparative advantage in programming. If we're working on a project together where both writing and programming are relevant, it's best if I do as much programming as possible while you handle as much of the writing as possible - even though you're as good as me at programming, if someone has to take time off from programming to write, it should be you. This collaboration can make you more effective even though you're better at everything than me (in the economics literature this is usually conceptualized in terms of nations trading with each other). This is distinct from the Pareto optimality idea explored in this post. Pareto optimality matters when it's important that the same person does both the
#5

The Secret of Our Success argues that cultural traditions have had a lot of time to evolve. So seemingly arbitrary cultural practices may actually encode important information, even if the practitioners can't tell you why. 

32 · fiddler
I strongly oppose collation of this post, despite thinking that it is an extremely well-written summary of an interesting argument on an interesting topic. The reason that I do so is because I believe it represents a substantial epistemic hazard because of the way it was written, and the source material it comes from. I think this is particularly harmful because both justifications for nominations amount to "this post was key in allowing percolation of a new thesis unaligned with the goals of the community into community knowledge," which is a justification that necessitates extremely rigorous thresholds for epistemic virtue: a poor-quality argument both risks spreading false or over-proven ideas into a healthy community, if the nominators are correct, and also creates conditions for an over-correction caused by the tearing down of a strongman. When assimilating new ideas and improving models, extreme care must be taken to avoid inclusion of non-steelmanned parts of the model, and this post does not represent that. In this case, isolated demands for rigor are called for! The first major issue is the structure of the post. A more typical book review includes critique, discussion, and critical analysis of the points made in the book. This book review forgoes these, instead choosing to situate the thesis of the book in the fabric of anthropology and discuss the meta-level implications of the contributions at the beginning and end of the review. The rest of the review is dedicated to extremely long, explicitly cherry-picked block quotes of anecdotal evidence and accessible explanations of Heinrich's thesis. Already, this poses an issue: it's not possible to evaluate the truth of the thesis, or even the merit of the arguments made for it, with evidence that's explicitly chosen to be the most persuasive and favorable summaries of parts glossed over. Upon closer examination, even without considering that this is filtered evidence, this is an attempt to prove a thesis usin
12 · jacobjacob
For the Review, I'm experimenting with using the predictions feature to poll users for their opinions about claims made in posts.

Elicit Prediction (elicit.org/binary/questions/itSayrbzc)
Elicit Prediction (elicit.org/binary/questions/5SRTLX3p_)
Elicit Prediction (elicit.org/binary/questions/VMv-KjR87)

The first two cite Scott almost verbatim, but for the third I tried to specify further. Feel free to add your predictions above, and let me know if you have any questions about the experience.
#6

"Some of the people who have most inspired me have been inexcusably wrong on basic issues. But you only need one world-changing revelation to be worth reading."

Scott argues that our interest in thinkers should not be determined by their worst idea, or even their average idea, but by their best ideas. Some of the best thinkers in history believed ludicrous things, like Newton believing in Bible codes.

31 · philh
I think I agree with the thrust of this, but I think the comment section raises caveats that seem important. Scott's acknowledged that there's danger in this, and I hope an updated version would put that in the post. But also... This seems like a strange model to use. We don't know, a priori, what % are false. If 50% are obviously false, probably most of the remainder are subtly false. Giving me subtly false arguments is no favor. Scott doesn't tell, us, in this essay, what Steven Pinker has given him / why Steven Pinker is ruled in. Has Steven Pinker given him valuable insights? How does Scott know they're valuable? (There may have been some implicit context when this was posted. Possibly Scott had recently reviewed a Pinker book.) Given Anna's example, I find myself wondering, has Scott checked Pinker's straightforwardly checkable facts? I wouldn't be surprised if he has. The point of these questions isn't to say that Pinker shouldn't be ruled in, but that the questions need to be asked and answered. And the essay doesn't really acknowledge that that's actually kind of hard. It's even somewhat dismissive; "all you have to do is *test* some stuff to *see if it’s true*?" Well, the Large Hadron Collider cost €7.5 billion. On a less extreme scale, I recently wanted to check some of Robert Ellickson's work; that cost me, I believe, tens of hours. And that was only checking things close to my own specialty. I've done work that could have ruled him out and didn't, but is that enough to say he's ruled in? So this advice only seems good if you're willing and able to put in the time to find and refute the bad arguments. Not only that, if you actually will put in that time. Not everyone can, not everyone wants to, not everyone will do. (This includes: "if you fact-check something and discover that it's false, the thing doesn't nevertheless propagate through your models influencing your downstream beliefs in ways it shouldn't".) If you're not going to do that... I don
11 · Zvi
The central point here seems strong and important. One can, as Scott notes, take it too far, but mostly yes one should look where there are very interesting things even if the hit rate is not high, and it's important to note that. Given the karma numbers involved and some comments sometimes being included I'd want assurance that we wouldn't include any of that with regard to particular individuals.  That comment section, though, I believe has done major harm and could keep doing more even in its current state, so I still worry about bringing more focus on this copy of the post (as opposed to the SSC copy). Also, I worry about this giving too much of a free pass to what it calls "outrage culture" - there's an implicit "yeah, it's ok to go all essentialist and destroy someone for one statement that breaks your outrage mob's rules, I can live with that and please don't do it to me here, but let's not extend that to things that are merely stupid or wrong." I don't think you can do that, it doesn't work that way. Could be fixed with an edit if Scott wanted it fixed. 
#7

If the thesis in Unlocking the Emotional Brain is even half-right, it may be one of the most important books that I have read. It claims to offer a neuroscience-grounded, comprehensive model of how effective therapy works. In so doing, it also happens to formulate its theory in terms of belief updating, helping explain how the brain models the world and what kinds of techniques allow us to actually change our minds.

13 · orthonormal
As mentioned in my comment, this book review overcame some skepticism from me and explained a new mental model about how inner conflict works. Plus, it was written with Kaj's usual clarity and humility. Recommended.
13 · MalcolmOcean
This was a profoundly impactful post and definitely belongs in the review. It prompted me and many others to dive deep into understanding how emotional learnings have coherence and to actually engage in dialogue with them rather than insisting they don't make sense. I've linked this post to people more than probably any other LessWrong post (50-100 times) as it is an excellent summary and introduction to the topic. It works well as a teaser for the full book as well as a standalone resource. The post makes both conceptual and pragmatic claims. I haven't exactly crosschecked the models although they do seem compatible with other models I've read. I did read the whole book and it seemed pretty sound and based in part on relevant neuroscience. There's a kind of meeting-in-the-middle thing there where the neuroscience is quite low-level and therapy is quite high-level. I think it'll be cool to see the middle layers fleshed out a bit. Just because your brain uses Bayes' theorem at the neural level and at higher levels of abstraction, doesn't mean that you consciously know what all of its priors & models are! And it seems the brain's basic organization is set up to prevent people from calmly arguing against emotionally intense evidence without understanding it—which makes a lot of sense if you think about it. And it also makes sense that your brain would be able to update under the right circumstances. I've tested the pragmatic claims personally, by doing the therapeutic reconsolidation process using both Coherence Therapy methods & other methods, both on myself & working with others. I've found that these methods indeed find coherent underlying structures (eg the same basic structures using different introspective methods, that relate and are consistent) and that accessing those emotional truths and bringing them in contact with contradictory evidence indeed causes them to update, and once updated there's no longer a sense of needing to argue with yourself. It doesn'
#8

According to Zvi, people have a warped sense of justice. For any harm you cause, regardless of intention or motive, you earn "negative points" that merit punishment. Implicitly, however, people only want to award "positive points" for the good outcomes a person causes if their sole motive was altruism. Curing illness to make a profit? No "positive points" for you!

33 · abramdemski
I really like this post. I think it points out an important problem with intuitive credit-assignment algorithms which people often use. The incentive toward inaction is a real problem which is often encountered in practice. While I was somewhat aware of the problem before, this post explains it well.

I also think this post is wrong, in a significant way: asymmetric justice is not always a problem and is sometimes exactly what you want. In particular, it's how you want a justice system (in the sense of police, judges, etc.) to work.

The book Law's Order explains it like this: you don't want theft to be punished in keeping with its cost. Rather, in order for the free market to function, you want theft to be punished harshly enough that theft basically doesn't happen. Zvi speaks as if the purpose of the justice system is to reward positive externalities and punish negative externalities, to align everyone's incentives. While this is a noble goal, Law's Order sees it as a goal to be taken care of by other parts of society, in particular the free market. (Law's Order is a fairly libertarian book, so it puts a lot of faith in the free market.) The purpose of the justice system is to enforce the structure such that those other institutions can do their jobs. The free market can't optimize people's lives properly if theft and murder are a constant and contracts cannot be enforced.

So, it makes perfect sense for a justice system to be asymmetric. Its role is to strongly disincentivize specific things, not to broadly provide compensatory incentives. (For this reason, scales are a pretty terrible symbol for justice.) In general, we might conclude that credit assignment systems need two parts:

1. A "symmetric" part, which attempts to allocate credit in as calibrated a way as it can, rewarding good work and punishing bad.
2. An "asymmetric" part, which harshly enforces the rules which ensure that the symmetric part can function, ensuring that those rules are followed fr
#9

Suppose you had a society of multiple factions, each of which says only true sentences, but is selectively more likely to repeat truths that favor its preferred tribe's policies. Zack explores the math behind what sort of beliefs people would be able to form, and what consequences might befall people who aren't aware of the selective reporting.
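
A toy simulation (my own construction, not Zack's model; the 0.9/0.3 reporting filter is an arbitrary assumption) shows the basic effect: every reported fact is true, yet a listener who tallies reports at face value ends up badly skewed, while a listener who explicitly models the filter recovers the true base rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_facts = 100_000
favors_blue = rng.random(n_facts) < 0.5        # ground truth: half the facts favor Blue
p_report = np.where(favors_blue, 0.9, 0.3)     # a Blue-aligned reporter's filter
reported = rng.random(n_facts) < p_report      # which (true!) facts actually get repeated

# Naive listener: tally the reports at face value.
naive = favors_blue[reported].mean()           # ~0.75: "most of the evidence favors Blue!"

# Savvy listener: reweight each report by 1 / P(report | fact) to undo the known filter.
weights = 1.0 / p_report[reported]
corrected = np.average(favors_blue[reported], weights=weights)  # ~0.5

print(round(naive, 2), round(corrected, 2))
```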

16 · Zack_M_Davis
(Self-review.) I've edited the post to include the 67log27+17log221 calculation as footnote 10. The post doesn't emphasize this angle, but this is also more-or-less my abstract story for the classic puzzle of why disagreement is so prevalent, which, from a Bayesian-wannabe rather than a human perspective, should be shocking: there's only one reality, so honest people should get the same answers. How can it simultaneously be the case that disagreement is ubiquitous, but people usually aren't outright lying? Explanation: the "dishonesty" is mostly in the form of motivatedly asking different questions. Possible future work: varying the model assumptions might yield some more detailed morals. I never got around to trying the diminishing-marginal-relevance variation suggested in footnote 8. Another variation I didn't get around to trying would be for the importance of a fact to each coalition's narrative to vary: maybe there are a few "sacred cows" for which the social cost of challenging is huge (as opposed to just having to keep one's ratio of off-narrative reports in line). Prior work: So, I happened to learn about the filtered-evidence problem from the Sequences, but of course, there's a big statistics literature about learning from missing data that I learned a little bit about in 2020 while perusing Ch. 19 of Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and the other guy.
#10

Examining the concept of optimization, Abram Demski distinguishes between "selection" (like search algorithms that evaluate many options) and "control" (like thermostats or guided missiles). He explores how this distinction relates to ideas of agency and mesa-optimization, and considers various ways to define the difference. 
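
A minimal sketch (hypothetical toy functions of my own, not code from the post) of the two type signatures: a selector enumerates explicitly represented candidates and keeps the one its internal model scores highest, while a controller never enumerates anything and simply maps its current observation to an action.

```python
def select(candidates, model):
    """Selection: search over explicitly represented options, keep the best-scoring one."""
    return max(candidates, key=model)

def control_step(measured_temp, setpoint):
    """Control: map the current observation directly to an action (a thermostat)."""
    return "heat_on" if measured_temp < setpoint else "heat_off"

# A selector needs a pool of candidates and an internal scoring model...
best = select(candidates=[x / 10 for x in range(-30, 31)],
              model=lambda x: -(x - 1.7) ** 2)

# ...while a controller only ever sees the current state of its environment.
action = control_step(measured_temp=17.5, setpoint=20.0)

print(best, action)  # 1.7 heat_on
```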

30 · adamShimi
Selection vs Control is a distinction I always point to when discussing optimization. Yet these are not the two takes on optimization I generally use. My favored ones are internal optimization (which is basically search/selection), and external optimization (optimizing systems from Alex Flint’s The ground of optimization). So I do without control, or at least without Abram’s exact definition of control. Why? Simply because the internal structure vs behavior distinction mentioned in this post seems more important than the actual definitions (which seem constrained by going back to Yudkowsky’s optimization power). The big distinction is between doing internal search (like in optimization algorithms or mesa-optimizers) and acting so as to optimize something. It is intuitive that you can do the second without the first, but before Alex Flint’s definition, I couldn’t put words on my intuition that the first implies the second. So my current picture of optimization is Internal Optimization (Internal Search/Selection) ⊂ External Optimization (Optimizing systems). This means that I think of this post as one of the first instances of grappling with this distinction, without agreeing completely with the way it ends up making that distinction.
13 · johnswentworth
In a field like alignment or embedded agency, it's useful to keep a list of one or two dozen ideas which seem like they should fit neatly into a full theory, although it's not yet clear how. When working on a theoretical framework, you regularly revisit each of those ideas, and think about how it fits in. Every once in a while, a piece will click, and another large chunk of the puzzle will come together.

Selection vs control is one of those ideas. It seems like it should fit neatly into a full theory, but it's not yet clear what that will look like. I revisit the idea pretty regularly (maybe once every 3-4 months) to see how it fits with my current thinking. It has not yet had its time, but I expect it will (that's why it's on the list, after all).

Bearing in mind that the puzzle piece has not yet properly clicked, here are some current thoughts on how it might connect to other pieces:

* Selection and control have different type signatures.
* A selection process optimizes for the values of variables in some model, which may or may not correspond to anything in the real world. Human values seem to be like this - see Human Values Are A Function Of Humans' Latent Variables.
* A control process, on the other hand, directly optimizes things in its environment. A thermostat, for instance, does not necessarily contain any model of the temperature a few minutes in the future; it just directly optimizes the value of the temperature a few minutes in the future.
* The post basically says it, but it's worth emphasizing: reinforcement learning is a control process, expected utility maximization is a selection process. The difference in type signatures between RL and EU maximization is the same as the difference in type signatures between selection and control.
* Inner and outer optimizers can have different type signatures: an outer controller (e.g. RL) can learn an inner selector (e.g. utility maximizer), or an outer selector (e.g. a human) can build an inner controller (e
#11

When it comes to coordinating people around a goal, you don't get limitless communication bandwidth for conveying arbitrarily nuanced messages. Instead, the "amount of words" you get to communicate depends on how many people you're trying to coordinate. Once you have enough people... you don't get many words.

33 · DanielFilan
I think this post, as promised in the epistemic status, errs on the side of simplistic poetry. I see its core contribution as saying that the more people you want to communicate to, the less you can communicate to them, because the marginal people aren't willing to put in work to understand you, and because it's harder to talk to marginal people who are far away and can't ask clarifying questions or see your facial expressions or hear your tone of voice. The numbers attached (e.g. 'five' and 'thousands of people') seem to not be super precise. That being said: the numbers are the easiest thing to take away from this post. The title includes the words 'about five' but not the words 'simplifed poetry'. And I'm just not sure about the numbers. The best part of the post is the initial part, which does a calculation and links to a paper to support an order-of-magnitude calculation on how many words you can communicate to people. But as the paragraphs go on, the justifications get less airtight, until it's basically an assertion. I think I understand stylistically why this was done, but at the end of the day that's the trade-off that was made. So a reader of this post has to ask themselves: Why is the number about five? Is this 'about' meaning that you have a factor of 2 wiggle-room? 10? 100? How do I know that this kicks in once I hit thousands of people, rather than hundreds or millions? If I want to communicate to billions of people, does that go down much? These questions are left unanswered in the post. That would be fine if they were answered somewhere else that was linked, but they aren't. As such, the discerning reader should only believe the conclusion (to the extent they can make it out) if they trust Ray Arnold, the author. I think plausibly people should trust Ray on this, at least people who know him. But much of the readership of this post doesn't know him and has no reason to trust him on this one. Overall: this post has a true and important core that c
10 · Raemon
Partial Self Review: There's an obvious set of followup work to be done here, which is to ask "Okay, this post was vague poetry meant to roughly illustrate a point. But, how many words do you actually precisely have?" What are the in-depth models that let you predict precisely how much nuance you have to work with?

Less obvious to me is whether this post should become a longer, more rigorous post, or whether it should stay its short, poetic self, and have those questions get explored in a different post with different goals. Also less obvious to me is how the LessWrong Review should relate to short, poetic posts. I think it's quite important that this post be clearly labeled as poetry, and also, that we consider the work "unfinished" until there is some kind of post that delves more deeply into these questions. But, for example, I think Babble last year was more like poetry than like a clear model, and it was nonetheless valuable and good to be part of the Best Of book.

So, I'm thinking about this post from two lenses.

1. What are simple net-improvements I can make to this post, without sacrificing its overall aim of being short/accessible/poetic?
2. Sketch out the research/theory agenda I'd want to see for the more detailed version.

I did just look over the post, and notice that the poetry... wasn't especially good. There is mild cleverness in having the sections get shorter as they discuss larger coordination-groups. But I think I probably could write a post that was differently poetic. Or, find ways of making a short/accessible version that doesn't bother being poetic but is nonetheless clear. I'm worried about every post having to be a rigorous, fully caveated explanation. That might be the right standard for the review, but not obviously.

Some points that should be made somewhere, whether editing the OP or in a followup:

1. Yes, you can invest in processes that help you reach more people, more reliably. But those tools are effortful to build
#12

When trying to coordinate with others, we often assume the default should be full cooperation ("stag hunting"). Raemon argues this isn't realistic - the default is usually for people to pursue their own interests ("rabbit hunting"). If you want people to cooperate on a big project, you need to put in special effort to get buy-in.

18 · Raemon
Self Review. I still endorse the broad thrusts of this post. But I think it should change at least somewhat. I'm not sure how extensively, but here are some considerations.

Clearer distinctions between Prisoner's Dilemma and Stag Hunts

I should be clearer about the game-theoretic distinctions I'm actually making between Prisoner's Dilemma and Stag Hunt. I think Rob Bensinger rightly criticized the current wording, which equivocates between "stag hunting is meaningfully different" and "'hunting rabbit' has nicer aesthetic properties than 'defect'". I think Turntrout spelled out in the comments why it's meaningful to think in terms of stag hunts. I'm not sure it's the post's job to lay it out in the exhaustive detail that his comment does, but it should at least gesture at the idea.

Future Work: Explore a lot of coordination failures and figure out what the actual most common rules / payoff structures are. Stag Hunting is relevant sometimes, but not always. I think it's probably more relevant than Prisoner's Dilemma, which is a step up, but I think it's worth actually checking which game theory archetypes are most relevant most of the time.

Reworked Example

Some people comment that my proposed stag hunt... wasn't a stag hunt. I think that's actually kind of the point (i.e. most things that look like stag hunts are more complicated than you think, and people may not agree on the utility payoff). Coming up with good examples is hard, but I think at the very least the post should make it more clear that no, my original intended Stag Hunt did not have the appropriate payoff matrix after all.

What's the correct title?

While I endorse most of the models and gears in this post, I... have mixed feelings about the title. I'm not actually sure what the key takeaway of the post is meant to be. Abram's comment gets at some of the issues here. Benquo also notes that we do have plenty of stag hunts where the Schelling choice is Stag (i.e. don't murder) I think
#13

When disagreements persist despite lengthy good-faith communication, it may not just be about factual disagreements; it could be that people are operating in entirely different frames: different ways of seeing, thinking, and communicating.

#14

Nine parables, in which people find it hard to trust that they've actually gotten a "yes" answer.

24 · Zvi
The only way to get information from a query is to be willing to (actually) accept different answers. Otherwise, conservation of expected evidence kicks in. This is the best encapsulation of this point, by far, that I know about, in terms of helping me/others quickly/deeply grok it. Seems essential. Reading this again, the thing I notice most is that I generally think of this point as being mostly about situations like the third one, but most of the post's examples are instead about internal epistemic situations, where someone can't confidently conclude or believe some X because they realize something is blocking a potential belief in (not X), which means they can't gather meaningful evidence. Which is the same point at core - Bob can't know Charlie consents because he doesn't let Charlie refuse. Yet it feels like a distinct takeaway in the Five Words sense - evidence must run both ways vs. consent requires easy refusal, or something. And the first lesson is the one emphasized here, because 1->2 but not 2->1. And I do think I got the intended point for real. Yet I can see exactly why the attention/emphasis got hijacked in hindsight when remembering the post.  Also wondering about the relationship between this and Choices are Bad. Not sure what is there but I do sense something is there. 
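
The underlying identity is short enough to state directly (this is the standard conservation-of-expected-evidence argument; the notation is mine, not the post's):

```latex
% The prior must equal the expectation of the posterior over the possible answers:
P(H) = P(H \mid \text{yes})\,P(\text{yes}) + P(H \mid \text{no})\,P(\text{no})
% If a "no" is impossible, then P(\text{no}) = 0 and P(\text{yes}) = 1, hence
P(H \mid \text{yes}) = P(H)
% A "yes" that could not have been a "no" therefore carries zero information.
```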
#15

Concerningly, it can be much easier to spot holes in the arguments of others than it is in your own arguments. The author of this post reflects that historically, he's been too hasty to go from "other people seem very wrong on this topic" to "I am right on this topic". 

15 · Richard_Ngo
This has been one of the most useful posts on LessWrong in recent years for me personally. I find myself often referring to it, and I think almost everyone underestimates the difficulty gap between critiquing others and proposing their own, correct, ideas.
#16

A Recovery Day (or "slug day") is one where you're so tired you can only binge Netflix or stay in bed. A Rest Day is one where you have enough energy to "follow your gut" with no obligations or pressure. Unreal argues that true rest days are important for avoiding burnout, and gives suggestions on how to implement them.

#17

Alex Turner lays out a framework for understanding how and why artificial intelligences pursuing goals often end up seeking power as an instrumental strategy, even if power itself isn't their goal. This tendency emerges from basic principles of optimal decision-making.

But he cautions that if you haven't internalized that Reward is not the optimization target, the concepts here, while technically accurate, may lead you astray in alignment research.
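
To make "emerges from basic principles of optimal decision-making" concrete, here is a rough Monte Carlo sketch of my own (a toy deterministic MDP I made up, not Turner's formalism): sample many reward functions at random, compute the optimal policy for each, and count how often the optimal first move is the one that keeps more terminal states reachable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deterministic MDP (hypothetical, for illustration):
#   s0 --A--> a (absorbing)               : action A reaches 1 terminal state
#   s0 --B--> hub --1/2/3--> b1, b2, b3   : action B keeps 3 terminal states reachable
states = ["s0", "a", "hub", "b1", "b2", "b3"]
idx = {s: i for i, s in enumerate(states)}
transitions = {
    "s0": {"A": "a", "B": "hub"},
    "a": {"stay": "a"},
    "hub": {"1": "b1", "2": "b2", "3": "b3"},
    "b1": {"stay": "b1"}, "b2": {"stay": "b2"}, "b3": {"stay": "b3"},
}
gamma = 0.9  # discount factor

def optimal_first_action(reward):
    """Run value iteration (reward on the current state), return the greedy action at s0."""
    V = np.zeros(len(states))
    for _ in range(400):
        V_new = np.array([
            max(reward[idx[s]] + gamma * V[idx[nxt]] for nxt in transitions[s].values())
            for s in states
        ])
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    return max(transitions["s0"], key=lambda a: V[idx[transitions["s0"][a]]])

counts = {"A": 0, "B": 0}
for _ in range(2000):
    r = rng.uniform(0, 1, size=len(states))  # one randomly drawn reward function
    counts[optimal_first_action(r)] += 1
print(counts)  # action B (more options downstream) is optimal for most sampled rewards
```

Under these assumptions the option-preserving action comes out optimal for a clear majority of the sampled reward functions, which is the power-seeking pattern in miniature.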

58 · johnswentworth
This review is mostly going to talk about what I think the post does wrong and how to fix it, because the post itself does a good job explaining what it does right. But before we get to that, it's worth saying up-front what the post does well: the post proposes a basically-correct notion of "power" for purposes of instrumental convergence, and then uses it to prove that instrumental convergence is in fact highly probable under a wide range of conditions. On that basis alone, it is an excellent post.

I see two (related) central problems, from which various other symptoms follow:

1. POWER offers a black-box notion of instrumental convergence. This is the right starting point, but it needs to be complemented with a gears-level understanding of what features of the environment give rise to convergence.
2. Unstructured MDPs are a bad model in which to formulate instrumental convergence. In particular, they are bad for building a gears-level understanding of what features of the environment give rise to convergence.

Some things I've thought a lot about over the past year seem particularly well-suited to address these problems, so I have a fair bit to say about them.

Why Unstructured MDPs Are A Bad Model For Instrumental Convergence

The basic problem with unstructured MDPs is that the entire world-state is a single, monolithic object. Some symptoms of this problem:

* it's hard to talk about "resources", which seem fairly central to instrumental convergence
* it's hard to talk about multiple agents competing for the same resources
* it's hard to talk about which parts of the world an agent controls/doesn't control
* it's hard to talk about which parts of the world agents do/don't care about
* ... indeed, it's hard to talk about the world having "parts" at all
* it's hard to talk about agents not competing, since there's only one monolithic world-state to control
* any action which changes the world at all changes the entire world-state; there's no built-in w
12 · TurnTrout
One year later, I remain excited about this post, from its ideas, to its formalisms, to its implications. I think it helps us formally understand part of the difficulty of the alignment problem. This formalization of power and the Attainable Utility Landscape have together given me a novel frame for understanding alignment and corrigibility. Since last December, I’ve spent several hundred hours expanding the formal results and rewriting the paper; I’ve generalized the theorems, added rigor, and taken great pains to spell out what the theorems do and do not imply. For example, the main paper is 9 pages long; in Appendix B, I further dedicated 3.5 pages to exploring the nuances of the formal definition of ‘power-seeking’ (Definition 6.1).  However, there are a few things I wish I’d gotten right the first time around. Therefore, I’ve restructured and rewritten much of the post. Let’s walk through some of the changes. ‘Instrumentally convergent’ replaced by ‘robustly instrumental’ Like many good things, this terminological shift was prompted by a critique from Andrew Critch.  Roughly speaking, this work considered an action to be ‘instrumentally convergent’ if it’s very probably optimal, with respect to a probability distribution on a set of reward functions. For the formal definition, see Definition 5.8 in the paper. This definition is natural. You can even find it echoed by Tony Zador in the Debate on Instrumental Convergence: (Zador uses “set of scenarios” instead of “set of reward functions”, but he is implicitly reasoning: “with respect to my beliefs about what kind of objective functions we will implement and what the agent will confront in deployment, I predict that deadly actions have a negligible probability of being optimal.”) While discussing this definition of ‘instrumental convergence’, Andrew asked me: “what, exactly, is doing the converging? There is no limiting process. Optimal policies just are.”  It would be more appropriate to say that an ac
#18

In thinking about AGI safety, I’ve found it useful to build a collection of different viewpoints from people that I respect, such that I can think from their perspective. I will often try to compare what an idea feels like when I put on my Paul Christiano hat, to when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a "Chris Olah" hat, which often looks at AI through the lens of interpretability. 

The goal of this post is to try to give that hat to more people.

19 · DanielFilan
* Olah’s comment indicates that this is indeed a good summary of his views.
* I think the first three listed benefits are indeed good reasons to work on transparency/interpretability. I am intrigued but less convinced by the prospect of ‘microscope AI’.
* The ‘catching problems with auditing’ section describes an ‘auditing game’, and says that progress in this game might illustrate progress in using interpretability for alignment. It would be good to learn how much success the auditors have had in this game since the post was published.
* One test of ‘microscope AI’: the go community has had a couple of years of the computer era, in which time open-source go programs stronger than AlphaGo have been released. This has indeed changed the way that humans think about go: seeing the corner variations that AIs tend to play has changed our views on which variations are good for which player, and seeing AI win probabilities conditioned on various moves, as well as the AI-recommended continuations, has made it easier to review games. Yet sadly, there has been to my knowledge no new go knowledge generated from looking at the internals of these systems, despite some visualization research being done (https://arxiv.org/pdf/1901.02184.pdf, https://link.springer.com/chapter/10.1007/978-3-319-97304-3_20). As far as I’m aware, we do not even know if these systems understand the combinatorial game theory of the late endgame, the one part of go that has been satisfactorily mathematized (and therefore unusually amenable to checking whether some program implements it). It’s not clear to me whether this is for a lack of trying, but this does seem like a setting where microscope AI would be useful if it were promising.
* The paper mostly focuses on the benefits of transparency/interpretability for AI alignment. However, as far as I’m aware, since before this post was published, the strongest argument against work in this direction has been the problem of tractability - can we ac
#19

Eric Drexler's CAIS model suggests that before we get to a world with monolithic AGI agents, we will already have seen an intelligence explosion due to automated R&D. This reframes the problems of AI safety and has implications for what technical safety researchers should be doing. Rohin reviews and summarizes the model.

#20

The strategy-stealing assumption posits that for any strategy an unaligned AI could use to influence the long-term future, there is an analogous strategy that humans could use to capture similar influence. Paul Christiano explores why this assumption might be true, and eleven ways it could potentially fail. 

21 · Zvi
This post is even-handed and well-reasoned, and explains the issues involved well. The strategy-stealing assumption seems important, as a lot of predictions are inherently relying on it either being essentially true, or effectively false, and I think the assumption will often effectively be a crux in those disagreements, for reasons the post illustrates well. The weird thing is that Paul ends the post saying he thinks the assumption is mostly true, whereas I thought the post was persuasive that the assumption is mostly false. The post illustrates that the unaligned force is likely to have many strategic and tactical advantages over aligned forces, that should allow the unaligned force to, at a minimum, 'punch above its weight' in various ways even under close-to-ideal conditions. And after the events of 2020, and my resulting updates to my model of humans, I'm highly skeptical that we'll get close to ideal. Either way, I'm happy to include this.
#21

Impact measures may be a powerful safeguard for AI systems - one that doesn't require solving the full alignment problem. But what exactly is "impact", and how can we measure it?

#22

Double descent is a puzzling phenomenon in machine learning where increasing model size/training time/data can initially hurt performance, but then improve it. Evan Hubinger explains the concept, reviews prior work, and discusses implications for AI alignment and understanding inductive biases.
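
A rough numerical sketch of the phenomenon (my own construction, not from the post; the sine target, noise level, and random ReLU features are arbitrary choices): minimum-norm least squares on random features typically shows test error falling, then spiking as the number of features approaches the number of training points, then falling again as the model keeps growing.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_test_error(n_features, n_train=40, n_test=500, noise=0.5, trials=30):
    """Mean test MSE of a minimum-norm least-squares fit on random ReLU features."""
    target = lambda x: np.sin(3 * x)
    errs = []
    for _ in range(trials):
        x_tr = rng.uniform(-1, 1, n_train)
        x_te = rng.uniform(-1, 1, n_test)
        y_tr = target(x_tr) + noise * rng.standard_normal(n_train)
        W = rng.standard_normal(n_features)          # random feature directions
        b = rng.uniform(-1, 1, n_features)           # random feature offsets
        phi = lambda x: np.maximum(0.0, np.outer(x, W) + b)  # ReLU feature map
        # np.linalg.lstsq returns the minimum-norm solution when underdetermined
        w, *_ = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)
        errs.append(np.mean((phi(x_te) @ w - target(x_te)) ** 2))
    return float(np.mean(errs))

for k in [5, 10, 20, 35, 40, 45, 60, 100, 400]:
    print(k, round(avg_test_error(k), 3))
# Expected shape: error drops, spikes near k ≈ n_train = 40 (the interpolation
# threshold), then drops again for much larger k: the double-descent curve.
```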

16 · Zvi
I've stepped back from thinking about ML and alignment the last few years, so I don't know how this fits into the discourse about it, but I felt like I got important insight here and I'd be excited to include this. The key concept that bigger models can be simpler seems very important.  In my words, I'd say that when you don't have enough knobs, you're forced to find ways for each knob to serve multiple purposes slash combine multiple things, which is messy and complex and can be highly arbitrary, whereas with lots of knobs you can do 'the thing you naturally actually want to do.' And once you get sufficiently powerful, the overfitting danger isn't getting any worse with the extra knobs, so sure, why not? I also strongly agree with orthonormal that including the follow-up as an addendum adds a lot to this post. If it's worth including this, it's worth including both, even if the follow-up wasn't also nominated. 
13 · orthonormal
If this post is selected, I'd like to see the followup made into an addendum—I think it adds a very important piece, and it should have been nominated itself.
#23

Scott Alexander's "Meditations on Moloch" paints a gloomy picture of the world being inevitably consumed by destructive forces of competition and optimization. But Zvi argues this isn't actually how the world works - we've managed to resist and overcome these forces throughout history. 

13 · fiddler
This review is more broadly of the first several posts of the sequence, and discusses the entire sequence.  Epistemic Status: The thesis of this review feels highly unoriginal, but I can't find where anyone else discusses it. I'm also very worried about proving too much. At minimum, I think this is an interesting exploration of some abstract ideas. Considering posting as a top-level post. I DO NOT ENDORSE THE POSITION IMPLIED BY THIS REVIEW (that leaving immoral mazes is bad), AND AM FAIRLY SURE I'M INCORRECT. The rough thesis of "Meditations on Moloch" is that unregulated perfect competition will inevitably maximize for success-survival, eventually destroying all value in service of this greater goal. Zvi (correctly) points out that this does not happen in the real world, suggesting that something is at least partially incorrect about the above mode, and/or the applicability thereof. Zvi then suggests that a two-pronged reason can explain this: 1. most competition is imperfect, and 2. most of the actual cases in which we see an excess of Moloch occur when there are strong social or signaling pressures to give up slack.  In this essay, I posit an alternative explanation as to how an environment with high levels of perfect competition can prevent the destruction of all value, and further, why the immoral mazes discussed later on in this sequence are an example of highly imperfect competition that causes the Molochian nature thereof.  First, a brief digression on perfect competition: perfect competition assumes perfectly rational agents. Because all strategies discussed are continuous-time, the decisions made in any individual moment are relatively unimportant assuming that strategies do not change wildly from moment to moment, meaning that the majority of these situations can be modeled as perfect-information situations.  Second, the majority of value-destroying optimization issues in a perfect-competition environment can be presented as prisoners dilemmas: both
#24

Integrity isn't just about honesty - it's about aligning your actions with your stated beliefs. But who should you be accountable to? Too broad an audience, and you're limited to simplistic principles. Too narrow, and you miss out on opportunities for growth and collaboration. 

19fiddler
This post seems excellent overall, and makes several arguments that I think represent the best of LessWrong self-reflection about rationality. It also spurred an interesting ongoing conversation about what integrity means, and how it interacts with updating. The first part of the post is dedicated to discussions of misaligned incentives, and makes the claim that poorly aligned incentives are primarily to blame for irrational or incorrect decisions. I’m a little bit confused about this, specifically that nobody has pointed out the obvious corollary: that people in a vacuum, and especially people with well-aligned incentive structures, are broadly capable of making correct decisions. This seems to me like a highly controversial statement that makes the first part of the post suspicious, because it treads on the edge of proving (hypothesizing?) too much: it seems like a very ambitious statement worthy of further interrogation that people’s success at rationality is primarily about incentive structures, because that assumes a model in which humans are capable of and regularly perform high levels of rationality. However, I can’t think of an obvious counterexample (a situation in which humans are predictably irrational despite having well-aligned incentives for rationality), and the formulation of this post has a ring of truth for me, which suggests to me that there’s at least something here. Conditional on this being correct, and there not being obvious counterexamples, this seems like a huge reframing that makes a nontrivial amount of the rationality community’s recent work inefficient: if humans are truly capable of behaving predictably rationally through good incentive structures, then CFAR, etc. should be working on imposing external incentive structures that reward accurate modeling, not rationality as a skill. The post obliquely mentions this through discussion of philosopher-kings, but I think this is a case in which an apparently weaker version of a thesis actually i
#25

Building gears-level models is expensive - often prohibitively expensive. Black-box approaches are usually cheaper and faster. But black-box approaches rarely generalize - they need to be rebuilt when conditions change, don’t identify unknown unknowns, and are hard to build on top of. Gears-level models, on the other hand, offer permanent, generalizable knowledge which can be applied to many problems in the future, even if conditions shift.

19jimrandomh
There is a joke about programmers, that I picked up long ago, I don't remember where, that says: A good programmer will do hours of work to automate away minutes of drudgery. Some time last month, that joke came into my head, and I thought: yes of course, a programmer should do that, since most of the hours spent automating are building capital, not necessarily in direct drudgery-prevention but in learning how to automate in this domain. I did not think of this post, when I had that thought. But I also don't think I would've noticed, if that joke had crossed my mind two years ago. This, I think, is what a good concept-crystallization feels like: an application arises, and it simply feels like common sense, as you have forgotten that there was ever a version of you which would not have noticed that.
#26

If you want to bring up a norm or expectation that's important to you, but not something you'd necessarily argue should be universal, an option is to preface it with the phrase "in my culture." In Duncan's experience, this helps navigate tricky situations by taking your own personal culture as object, and discussing how it is important to you without making demands of others.

#27

Jeff argues that people should fill in some of the San Francisco Bay, south of the Dumbarton Bridge, to create new land for housing. This would allow millions of people to live closer to jobs, reducing sprawl and traffic. While there are environmental concerns, the benefits of dense urban housing outweigh the localized impacts. 

14David Hornbein
This sort of thing is exactly what Less Wrong is supposed to produce. It's a simple, straightforward and generally correct argument, with important consequences for the world, which other people mostly aren't making. That LW can produce posts like this—especially with positive reception and useful discussion—is a vindication of this community's style of thought.
13jacobjacob
I'm trying out making some polls about posts for the Review (using the predictions feature). You can answer by hovering over the scale and clicking a number to indicate your agreement with the claim.  Making more land out of the about 50mi^2 shallow water in the San Francisco Bay, South of the Dumbarton Bridge, would...  Elicit Prediction (elicit.org/binary/questions/KkqpSr5rW) Elicit Prediction (elicit.org/binary/questions/qzzNzEfa9) Elicit Prediction (elicit.org/binary/questions/csYlcNdhZ) Elicit Prediction (elicit.org/binary/questions/RwtAoMlnP) Elicit Prediction (elicit.org/binary/questions/xGIZipvb-) Elicit Prediction (elicit.org/binary/questions/zAtqSgbnS) For some of these questions, I tried to operationalise them to be less ambiguous than Jeff's original formulation. 
#28

In this post, I proclaim/endorse forum participation (aka commenting) as a productive research strategy that I've managed to stumble upon, and recommend it to others (at least to try). Note that this is different from saying that forum/blog posts are a good way for a research community to communicate. It's about individually doing better as researchers.

#29

There are at least three ways in which incentives affect behavior: consciously motivating agents, unconsciously reinforcing certain behaviors, and selection effects.

Jacob argues that #2 (unconscious reinforcement) and probably #3 (selection effects) are more important, but much less talked about.
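As a toy illustration of the selection-effects channel (my own sketch, not from the post): none of the simulated firms below optimizes anything, yet the surviving population ends up looking as if it were profit-seeking.

```python
# Toy model of "selection effects as optimizer": each firm is born with a random,
# fixed markup and never changes it. Firms with negative profit exit and are
# replaced by entrants copying a random survivor. No agent optimizes anything,
# yet the population of survivors drifts toward profitable markups.
import random

random.seed(0)
N = 1000
firms = [random.uniform(-0.5, 0.5) for _ in range(N)]   # markup over cost, fixed per firm

def profit(markup):
    demand = max(0.0, 1.0 - markup)                     # higher prices sell less
    return markup * demand + random.gauss(0, 0.05)      # noisy realized profit

for year in range(50):
    survivors = [m for m in firms if profit(m) > 0]
    entrants = [random.choice(survivors) for _ in range(N - len(survivors))]
    firms = survivors + entrants

print("average markup after selection:", round(sum(firms) / N, 2))
```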

31johnswentworth
Connection to Alignment One of the main arguments in AI risk goes something like: * AI is likely to be a utility maximizer (or goal-directed in some other sense) * Goodhart, instrumental convergence, etc make powerful goal-directed agents dangerous by default One common answer to this is "ok, how about we make AI which isn't goal-directed"? Unconscious Economics says: selection effects will often create the same effect as goal-directedness, even if we're trying to build a non-goal-directed AI. Discussions around CAIS are one obvious application. Paul's "you get what you measure" failure-mode is another. A less-obvious application which I've personally run into recently: one strategy to deal with inner optimizers is to design learning algorithms which specifically avoid regions of parameter space in which the trained system will perform optimization. The Unconscious Economics argument says that this won't actually avoid the risk: selection effects from the outer optimizer will push the trained system to misbehave in exactly the same ways, even without an inner optimizer. Connection to the Economics Literature During the past year I've found and read a bit more of the formal economics literature related to selection-effect-driven economics. The most notable work seems to be Nelson and Winter's "An Evolutionary Theory of Economic Change", from 1982. It was a book-length attempt to provide a mathematical foundation for microeconomics grounded in selection effects, rather than assuming utility-maximizing agents from the get-go. Reading through that book, it's pretty clear why the perspective hasn't taken over economics: Nelson and Winter's models are not very good. Some of the larger shortcomings: * They limit themselves to competition between firms, and their models contain details which limit their generalization to other kinds of agents * They use a "static" notion of equilibrium (i.e. all agents are individually unchanging), rather than a "dynamic" noti
#30

I've wrestled with applying ideas like "conservation of expected evidence," and want to warn others about some common mistakes. Some of the "obvious inferences" that seem to follow from these ideas are actually mistaken, or stop short of the optimal conclusion.

#31

A thoughtful exploration of the risks and benefits of sharing information about biosecurity and biological risks. The authors argue that while there are real risks to sharing sensitive information, there are also important benefits that need to be weighed carefully. They provide frameworks for thinking through these tradeoffs. 

12hamnox
Biorisk - well wouldn't it be nice if we'd all been familiar with the main principles of biorisk before 2020? I certainly regretted sticking my head in the sand. > If concerned, intelligent people cannot articulate their reasons for censorship, cannot coordinate around principles of information management, then that itself is a cause for concern. Discussions may simply move to unregulated forums, and dangerous ideas will propagate through well intentioned ignorance. Well. It certainly sounds prescient in hindsight, doesn't it? Infohazards in particular cross my mind: so many people operate on extremely bad information right now. Conspiracy theories abound, and I imagine the legitimate coordination for secrecy surrounding the topic does not help in the least. What would help? Exactly this essay. A clear model of *what* we should expect well-intentioned secrecy to cover, so we can reason sanely over when it's obviously not. Y'all done good. This taxonomy clarifies risk profiles better than Gregory Lewis' article, though I think his includes a few vivid-er examples. I opened a document to experiment tweaking away a little dryness from the academic tone. I hope you don't take offense. Your writing represents massive improvements in readability in its examples and taxonomy, and you make solid, straightforward choices in phrasing. No hopelessly convoluted sentence trees. I don't want to discount that. Seriously! Good job. As I read I had a few ideas spark on things that could likely get done at a layman level, in line with spiracular's comment. That comment could use some expansion, especially in the direction of "Prefer to discuss this over that, or discuss in *this way* over *that way*" for bad topics. Very relevantly, I think basic facts should get added to some of the good discussion topics, since they represent information it's better to disseminate!
#32

Ben and Jessica discuss how language and meaning can degrade through four stages as people manipulate signifiers. They explore how job titles have shifted from reflecting reality, to being used strategically, to becoming meaningless.

This post kicked off subsequent discussion on LessWrong about simulacrum levels.

17Benquo
There are two aspects of this post worth reviewing: as an experiment in a different mode of discourse, and as a description of the procession of simulacra, a schema originally advanced by Baudrillard. As an experiment in a different mode of discourse, I think this was a success on its own terms, and a challenge to the idea that we should be looking for the best blog posts rather than the behavior patterns that lead to the best overall discourse. The development of the concept occurred over email quite naturally without forceful effort. I would have written this post much later, and possibly never, had I held it to the standard of "written specifically as a blog post." I have many unfinished drafts, emails, tweets, that might have advanced the discourse had I compiled them into rough blog posts like this. The description was sufficiently clear and compelling that others, including my future self, were motivated to elaborate on it later with posts drafted as such. I and my friends have found this schema - especially as we've continued to refine it - a very helpful compression of social reality allowing us to compare different modes of speech and action. As a description of the procession of simulacra it differs from both Baudrillard's description, and from the later refinement of the schema among people using it actively to navigate the world.  I think that it would be very useful to have a clear description of the updated schema from my circle somewhere to point to, and of some historical interest for this description to clearly describe deviations from Baudrillard's account. I might get around to trying to draft the former sometime, but the latter seems likely to take more time than I'm willing to spend reading and empathizing with Baudrillard. Over time it's become clear that the distinction between stages 1 and 2 is not very interesting compared with the distinction between 1&2, 3, and 4, and a mature naming convention would probably give these more natural
15Zvi
This came out in April 2019, and bore a lot of fruit especially in 2020. Without it, I wouldn't have thought about the simulacra concept and developed the ideas, and without those ideas, I don't think I would have made anything like as much progress understanding 2020 and its events, or how things work in general.  I don't think this was an ideal introduction to the topic, but it was highly motivating regarding the topic, and also it's a very hard topic to introduce or grok, and this was the first attempt that allowed later attempts. I think we should reward all of that.
#33

nostalgebraist argues that GPT-2 is a fascinating and important development for our understanding of language and the mind, despite its flaws. They're frustrated that many psycholinguists who previously studied language in detail now seem uninterested in looking at what GPT-2 tells us about language, instead focusing on whether it's "real AI".

20nostalgebraist
I wrote this post about a year ago.  It now strikes me as an interesting mixture of 1. Ideas I still believe are true and important, and which are (still) not talked about enough 2. Ideas that were plausible at the time, but are much less so now 3. Claims I made for their aesthetic/emotional appeal, even though I did not fully believe them at the time In category 1 (true, important, not talked about enough): * GPT-2 is a source of valuable evidence about linguistics, because it demonstrates various forms of linguistic competence that previously were only demonstrated by humans. * Much scholarly ink has been spilled over questions of the form "what would it take, computationally, to do X?" -- where X is something GPT-2 can actually do.  Since we now have a positive example, we should revisit these debates and determine which claims GPT-2 disproves, and which it supports. * Some of the key participants in those debates are not revisiting them in this way, and appear to think GPT-2 is entirely irrelevant to their work. In category 2 (plausible then but not now): * "The structure of the transformer is somehow specially apt for language, relative to other architectures that were tried." * I now think this is much less likely thanks to the 2 OpenAI scaling papers in 2020. * The first paper made it seem more plausible that LSTMs would behave like GPT-2 if given a much larger quantity of compute/data * The second paper showed that the things we know about transformers from the text domain generalize very well to image/video/math * I now think transformers are just a "good default architecture" for our current compute regime and may not have special linguistic properties * I'm finding this difficult to phrase, but in 2019 I think I believed Gary Marcus had similar preconceptions to me but was misreading the current evidence. * I now think he's more committed to the idea that GPT-2-like approaches are fundamentally barking up the wrong tree, and wi
#34

AI safety researchers have different ideas of what success would look like. This post explores five different AI safety "success stories" that researchers might be aiming for and compares them along several dimensions. 

#35

Heated, tense arguments can often be unproductive and unpleasant. Neither side feels heard, and they are often working desperately to defend something they feel is very important. Ruby explores this problem and some solutions.

10alkjash
This feels like an extremely important point. A huge number of arguments devolve into exactly this dynamic because each side only feels one of (the Rock|the Hard Place) as a viscerally real threat, while agreeing that the other is intellectually possible.  Figuring out that many, if not most, life decisions are "damned if you do, damned if you don't" was an extremely important tool for me to let go of big, arbitrary psychological attachments which I initially developed out of fear of one nasty outcome.
#36

It might be the case that what people find beautiful and ugly is subjective, but that's not an explanation of *why* people find some things beautiful or ugly. Things, including aesthetics, have causal reasons for being the way they are. You can even ask "what would change my mind about whether this is beautiful or ugly?". Raemon explores this topic in depth.

18johnswentworth
I revisited this post a few months ago, after Vaniver's review of Atlas Shrugged. I've felt for a while that Atlas Shrugged has some really obvious easy-to-articulate problems, but also offers a lot of value in a much-harder-to-articulate way. After chewing on it for a while, I think the value of Atlas Shrugged is that it takes some facts about how incentives and economics and certain worldviews have historically played out, and propagates those facts into an aesthetic. (Specifically, the facts which drove Rand's aesthetics presumably came from growing up in the early days of Soviet Russia.) It's mainly the aesthetic that's valuable. Generalizing: this post has provided me with a new model of how art can offer value. Better yet, the framing of "propagate facts into aesthetics" suggests a concrete approach to creating or recognizing art with this kind of value. As in the case of Atlas Shrugged, we can look at the aesthetic of some artwork, and ask "what are the facts which fed into this aesthetic?". This also gives us a way to think about when the aesthetic will or will not be useful/valuable. Overall, this is one of the gearsiest models I've seen for instrumental thinking about art, especially at a personal (as opposed to group/societal) level.
#37

Gradient hacking is when a deceptively aligned AI deliberately acts to influence how the training process updates it. For example, it might try to become more brittle in ways that prevent its objective from being changed. This poses challenges for AI safety, as the AI might try to remove evidence of its deception during training.

25adamShimi
This post states the problem of gradient hacking. It is valuable in that this problem is far from obvious, and if plausible, very dangerous. On the other hand, the presentation doesn’t go into enough detail, and so leaves gradient hacking open to attacks and confusion. Thus instead of just reviewing this post, I would like to clarify certain points, while interweaving my criticisms about the way gradient hacking was initially stated, and explaining why I consider this problem so important. (Caveat: I’m not pretending that any of my objections are unknown to Evan. We’re working together on searching for experimental evidence of gradient hacking, and this review is heavily influenced by this collaboration.) What is gradient hacking? Evan defines it as: So we have a trained mesa-optimizer that is deceptively aligned (its mesa-objective differs from the base-objective it is trained on). This deceptive system finds a way to influence the gradient generated during training, and thus can basically ensure it never becomes inner aligned (among other things). Before checking on how exactly this could be possible, we should think a bit more about what this implies. If the deceptive model has arbitrary access to the gradient, then from that point on, the base-objective has only minimal influence on the training. Some influence remains because we probably want to test the end result, and thus it should work well with the base-objective. But that's pretty much the only constraint left. It could also pretty much deal with deception detectors because it can make itself not detectable: To put it pithily: if gradient hacking happens, we’re fucked. How could it happen, though? Well, that’s actually two distinct questions: how could a model gradient hack, and how could training create a model which gradient hacks. The post mostly focuses on the first one. How could a model gradient hack? The first example comes from a quoted footnote of Risks from Learned Optimization: This im
#38

The Amish relationship to technology is not "stick to technology from the 1800s", but rather "carefully think about how technology will affect your culture, and only include technology that does what you want." Raemon explores how these ideas could potentially be applied in other contexts.

#39

Power allows people to benefit from immoral acts without having to take responsibility or even be aware of them. The most powerful person in a situation may not be the most morally culpable, as they can remain distant from the actual "crime". If you're not actively looking into how your wants are being met, you may be unknowingly benefiting from something unethical.

26johnswentworth
ETA 1/12: This review is critical and at times harsh, not because I want to harshly criticize the post or the author, but because I did not consider harshness of criticism when writing. I still think the post is positive-net-value, and might even vote it up in the review. I especially want to emphasize that I do not think it is in any way useful to blame or punish the author for the things I complain about below; this is intended as a "pointing out a problematic habit which a lot of people have and society often encourages" criticism, not a "bad thing must be punished" criticism. When this post first came out, I said something felt off about it. The same thing still feels off about it, but I no longer endorse my original explanation of what-felt-off. So here's another attempt. First, what this post does well. There's a core model which says something like "people with the power to structure incentives tend to get the appearance of what they ask for, which often means bad behavior is hidden". It's a useful and insightful model, and the post presents it with lots of examples, producing a well-written and engaging explanation. The things which the post does well more than outweigh the problems below; it's a great post. On to the problem. Let's use the slave labor example, because that's the first spot where the problem comes up: ... so far, so good. This is generally solid analysis of an interesting phenomenon. But then we get to the next sentence: ... and this is where I want to say NO. My instinct says DO NOT EVER ASK THAT QUESTION, it is a WRONG QUESTION, you will be instantly mindkilled every time you ask "who should be blamed for X?". ... on reflection, I do not want to endorse this as an all-the-time heuristic, but I do want to endorse it whenever good epistemic discussion is an objective. Asking "who should we blame?" is always engaging in a status fight. Status fights are generally mindkillers, and should be kept strictly separate from modelling and epistemics
#40

Most advice on reading scientific papers focuses on evaluating individual claims. But what if you want to build a deeper "gears-level" understanding of a system? John Wentworth offers advice on how to read papers to build such models, including focusing on boring details, reading broadly, and looking for mediating variables. 

11adamShimi
This post proposes 4 ideas to help build gears-level models from papers that already passed the standard epistemic check (statistics, incentives): * Look for papers which are very specific and technical, to limit the incentives to overemphasize results and present them in a “saving the world” light. * Focus on data instead of on interpretations. * Read papers on different aspects of the same question/gear. * Look for mediating variables/gears to explain multiple results at once. (The second section, “Zombie Theories”, sounds more like epistemic check than gears-level modeling to me) I didn’t read this post before today, so it’s hard to judge the influence it will have on me. Still, I can already say that the first idea (move away from the goal) is one I had never encountered, and by itself it probably helps a lot in literature search and paper reading. The other three ideas are more obvious to me, but I’m glad that they’re stated somewhere in detail. The examples drawn from biology also definitely help.
#41

Since middle school I've thought I was pretty good at dealing with my emotions, and a handful of close friends and family have made similar comments. Now I can see that though I was particularly good at never flipping out, I was decidedly not good at "healthy emotional processing".

22[anonymous]
The parent-child model is my cornerstone of healthy emotional processing. I'd like to add that a child often doesn't need much more than your attention. This is one analogy of why meditation works: you just sit down for a while and you just listen.  The monks in my local monastery often quip about "sitting in a cave for 30 years", which is their suggested treatment for someone who is particularly deluded. This implies a model of emotional processing which I cannot stress enough: you can only get in the way. Take all distractions away from someone and they will asymptotically move towards healing. When they temporarily don't, it's only because they're trying to do something, thereby moving away from just listening. They'll get better if they give up. Another supporting quote from my local Roshi: "we try to make this place as boring as possible". When you get bored, the only interesting stuff left to do is to move your attention inward. As long as there is no external stimulus, you cannot keep your thoughts going forever. By sheer ennui you'll finally start listening to those kids, which is all you need to do.
15johnswentworth
This is an excellent post, with a valuable and well-presented message. This review is going to push back a bit, talk about some ways that the post falls short, with the understanding that it's still a great post. There's this video of a toddler throwing a tantrum. Whenever the mother (holding the camera) is visible, the child rolls on the floor and loudly cries. But when the mother walks out of sight, the toddler soon stops crying, gets up, and goes in search of the mother. Once the toddler sees the mother again, it's back to rolling on the floor crying. A key piece of my model here is that the child's emotions aren't faked. I think this child really does feel overcome, when he's rolling on the floor crying. (My evidence for this is mostly based on discussing analogous experiences with adults - I know at least one person who has noticed some tantrum-like emotions just go away when there's nobody around to see them, and then come back once someone else is present.) More generally, a lot of human emotions are performative. They're emotions which some subconscious process puts on for an audience. When the audience goes away, or even just expresses sufficient disinterest, the subconscious stops expressing that emotion. In other words: ignoring these emotions is actually a pretty good way to deal with them. "Ignore the emotion" is decent first-pass advice for grown-up analogues of that toddler. In many such cases, the negative emotion will actually just go away if ignored. Now, obviously a lot of emotions don't fall into this category. The post is talking about over-applying the "ignore your emotions" heuristic, and the hazards of applying in places where it doesn't work. But what we really want is not an argument that applying the heuristic more/less often is better, but rather a useful criterion for when the "ignore your emotions" heuristic is useful. I suggest something like: will this emotion actually go away if ignored? The post is mainly talking about dealing
#42

Said argues that there's no such thing as a real exception to a rule. If you find an exception, this means you need to update the rule itself. The "real" rule is always the one that already takes into account all possible exceptions.

11Unnamed
It seems like the core thing that this post is doing is treating the concept of "rule" as fundamental.  If you have a general rule plus some exceptions, then obviously that "general rule" isn't the real process that is determining the results. And noticing that (obvious once you look at it) fact can be a useful insight/reframing. The core claim that this post is putting forward, IMO, is that you should think of that "real process" as being a rule, and aim to give it the virtues of good rules such as being simple, explicit, stable, and legitimate (having legible justifications). An alternative approach is to step outside of the "rules" framework and get in touch with what the rule is for - what preferences/values/strategy/patterns/structures/relationships/etc. it serves. Once you're in touch with that purpose, then you can think about both the current case, and what will become of the "general rule", in that light. This could end up with an explicitly reformulated rule, or not. It seems like treating the "real process" as a rule is more fitting in some cases than others, a better fit for some people's style of thinking than for other people's, and also something that a person could choose to aim for more or less. I think I'd find it easier to think through this topic if there was a long, diverse list of brief examples.
#43

So we're talking about how to make good decisions, or the idea of 'bounded rationality', or what sufficiently advanced Artificial Intelligences might be like; and somebody starts dragging up the concepts of 'expected utility' or 'utility functions'.

And before we even ask what those are, we might first ask, Why?

123johnswentworth
Things To Take Away From The Essay First and foremost: Yudkowsky makes absolutely no mention whatsoever of the VNM utility theorem. This is neither an oversight nor a simplification. The VNM utility theorem is not the primary coherence theorem. It's debatable whether it should be considered a coherence theorem at all. Far and away the most common mistake when arguing about coherence (at least among a technically-educated audience) is for people who've only heard of VNM to think they know what the debate is about. Looking at the top-voted comments on this essay: * the first links to a post which argues against VNM on the basis that it assumes probabilities and preferences are already in the model * the second argues that two of the VNM axioms are unrealistic I expect that if these two commenters read the full essay, and think carefully about how the theorems Yudkowsky is discussing differ from VNM, then their objections will look very different. So what are the primary coherence theorems, and how do they differ from VNM? Yudkowsky mentions the complete class theorem in the post, Savage's theorem comes up in the comments, and there are variations on these two and probably others as well. Roughly, the general claim these theorems make is that any system either (a) acts like an expected utility maximizer under some probabilistic model, or (b) throws away resources in a pareto-suboptimal manner. One thing to emphasize: these theorems generally do not assume any pre-existing probabilities (as VNM does); an agent's implied probabilities are instead derived. Yudkowsky's essay does a good job communicating these concepts, but doesn't emphasize that this is different from VNM. One more common misconception which this essay quietly addresses: the idea that every system can be interpreted as an expected utility maximizer. This is technically true, in the sense that we can always pick a utility function which is maximized under whatever outcome actually occurred. And yet
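To make the "throws away resources" branch concrete, here is a small money-pump sketch (an illustration of the standard argument, not an example from the essay or the review): an agent with cyclic preferences keeps paying for "upgrades" and ends up strictly poorer while holding exactly what it started with.

```python
# Money-pump sketch: an agent with cyclic preferences (B over A, C over B, A over C)
# pays a small fee for every "preferred" swap. A trader cycling the three offers
# extracts money forever while the agent ends up holding the item it started with.
EPSILON = 0.01                                   # fee paid per preferred swap
prefers = {("B", "A"), ("C", "B"), ("A", "C")}   # (wanted, held): cyclic preferences

def respond_to_offer(holding, offered, money):
    if (offered, holding) in prefers:            # agent "prefers" the offered item...
        return offered, money - EPSILON          # ...so it pays the fee and swaps
    return holding, money

holding, money = "A", 1.00
for offered in ["B", "C", "A"] * 10:             # ten full cycles of offers
    holding, money = respond_to_offer(holding, offered, money)

print(holding, round(money, 2))                  # "A" again, but 0.30 poorer
```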
#44

A general guide for pursuing independent research, from conceptual questions like "how to figure out how to prioritize, learn, and think", to practical questions like "what sort of snacks should you buy to maximize productivity?"

10magfrump
I want to have this post in a physical book so that I can easily reference it. It might actually work better as a standalone pamphlet, though. 
10adamShimi
How do you review a post that was not written for you? I’m already doing research in AI Alignment, and I don’t plan on creating a group of collaborators for the moment. Still, I found some parts of this useful. Maybe that’s how you do it: by taking different profiles, and running through the most useful advice for each profile from the post. Let’s do that. Full time researcher (no team or MIRIx chapter) For this profile (which is mine, by the way), the most useful piece of advice from this post comes from the model of transmitters and receivers. I’m convinced that I’ve been using it intuitively for years, but having an explicit model is definitely a plus when trying to debug a specific situation, or to explain how it works to someone less used to thinking like that. Full time researcher who wants to build a team/MIRIx chapter Obviously, this profile benefits from the great advice on building a research group. I would expect someone with this profile to understand relatively well the social dynamics part, so the most useful advice is probably the detailed logistics of getting such a group off the ground. I also believe that the escalating asks and rewards is a less obvious social dynamic to take into account. Aspiring researcher (no team or MIRIx chapter) The section You and your research was probably written with this profile in mind. It tries to push towards exploration instead of exploitation, babble instead of prune. And for so many people that I know who feel obligated to understand everything before toying with a question, this is the prescribed medicine. I want to push back just a little about the “follow your curiosity” vibe, as I believe that there are ways to check how promising the current ideas are for AI Alignment. But I definitely understand that the audience is more “wannabe researchers stifled by their internal editor”, so pushing for curiosity and exploration makes sense. Aspiring researcher who wants to build a team/MIRIx chapter In additio
#45

Smart people are failing to provide strong arguments for why blackmail should be illegal. Robin Hanson is explicitly arguing it should be legal. Zvi Mowshowitz argues this is wrong, and gives his perspective on why blackmail is bad.

#46

Many of us are held back by mental patterns that compare reality to imaginary "shoulds". PJ Eby explains how to recognize this pattern and start to get free of it.

11pjeby
I got an email from Jacob L. suggesting I review my own post, to add anything that might offer a more current perspective, so here goes... One thing I've learned since writing this is that counterfactualizing, while it doesn't always cause akrasia, is definitely an important part of how we maintain akrasia: what some people have dubbed "meta-akrasia". When we counterfactualize that we "should have done" something, we create moral license for our past behavior. But also, when we encounter a problem and think, "I should [future action]", we are often licensing ourselves to not do something now. In both cases, the real purpose of the "should" in our thoughts is to avoid thinking about something unpleasant in the current moment. Whether we punish our past self or promote our future self, both moves will feel better than thinking about the actual problem... if the problem conflicts with our desired self-image. But neither one actually results in any positive change, because our subconscious intent is to virtue-signal away the cognitive dissonance arising from an ego threat... not to actually do anything about the problem from which the ego threat arose. In the year since I wrote this article, I've stopped viewing the odd things people have to be talked out of (in order to change) as weird, individual, one-off phenomena, and begun viewing them in terms of "flinch defenses"... which is to say, "how people keep themselves stuck by rationalizing away ego threats instead of addressing them directly." There are other rationalizations besides counterfactual ones, of course, but the concepts in this article (and the subsequent discussion in comments) helped to point me in the right direction to refine the flinch-defense pattern as a specific pattern and category, rather than as an ad hoc collection of similar-but-different behavior patterns.
#47

The credit assignment problem – the challenge of figuring out which parts of a complex system deserve credit for good or bad outcomes – shows up just about everywhere. Abram Demski describes how credit assignment appears in areas as diverse as AI, politics, economics, law, sociology, biology, ethics, and epistemology. 

#48

Some people use the story of manioc as a cautionary tale against innovating through reason. But is this really a fair comparison? Is it reasonable to expect a day of untrained thinking to outperform hundreds of years of accumulated tradition? The author argues that this sets an unreasonably high bar for reason, and that even if reason sometimes makes mistakes, it's still our best tool for progress.

45DirectedEvolution
1. Manioc poisoning in Africa vs. indigenous Amazonian cultures: a biological explanation? Note that while Josef Henrich, the author of TSOOS, correctly points out that cassava poisoning remains a serious public health concern in Africa, he doesn't supply any evidence that it wasn't also a public health issue in Amazonia. One author notes that "none of the disorders which have been associated with high cassava diets in Africa have been found in Tukanoans or other indigenous groups on cassava-based diets in Amazonia." Is this because Tukanoans have superior processing methods, or is it perhaps because Tukanoan metabolism has co-evolved through conventional natural selection to eliminate cyanide from the body? I don't know, but it doesn't seem impossible. 2. It's not that hard to tell that manioc causes health issues. Last year, the CDC published a report about an outbreak of cassava (manioc) poisoning including symptoms of "dizziness, vomiting, tachypnea, syncope, and tachycardia." These symptoms began to develop 4-6 hours after the meal. They reference another such outbreak from 2017. It certainly doesn't take "20 years," as Scott claims, to notice the effects. There's a difference between sweet and bitter cassava. Peeling and thorough cooking is enough for sweet cassava, while extensive treatments are needed for bitter cassava. The latter gives better protection against insects, animals, and thieves, so farmers sometimes like it better. Another analysis says that "A short soak (4 h) has no effect, but if prolonged (18 to 24 h), the amounts of cyanide can be halved or even reduced by more than six times when soaked for several days." Even if the level is cut by 1/6, is this merely slowing, or actually preventing the damage? Wikipedia says that "Spaniards in their early occupation of Caribbean islands did not want to eat cassava or maize, which they considered insubstantial, dangerous, and not nutritious." If you didn't know the difference between sweet and b
14Benquo
This post makes a straightforward analytic argument clarifying the relationship between reason and experience. The popularity of this post suggests that the ideas of cultural accumulation of knowledge, and the power of reason, have been politicized into a specious Hegelian opposition to each other. But for the most part neither Baconian science nor mathematics (except for the occasional Ramanujan) works as a human institution except by the accumulation of knowledge over time. A good follow-up post would connect this to the ways in which modernist ideology poses as the legitimate successor to the European Enlightenment, claiming credit for the output of Enlightenment institutions, and then characterizing its own political success as part of the Enlightenment. Steven Pinker's "Enlightenment Now" might be a good foil.
#49

A tour de force, this post combines a review of Unlocking The Emotional Brain, Kaj Sotala's review of the book, and connections to predictive coding theory.

It's a deep dive into models of how human cognition is driven by emotional learning, and this learning is what drives many beliefs and behaviors. If that's the case, one big question is how people emotionally learn and unlearn things.

#50

Robin Hanson asked "Why do people like complex rules instead of simple rules?" and gave 12 examples.

Zvi responds with a detailed analysis of each example, suggesting that the desire for complex rules often stems from issues like Goodhart's Law, the Copenhagen Interpretation of Ethics, power dynamics, and the need to consider factors that can't be explicitly stated.

#51

Many people in the rationalist community are skeptical that rationalist techniques can really be trained and improved at a personal level. Jacob argues that rationality can be a skill that people can improve with practice, but that improvement is difficult to see in aggregate and requires consistent effort over long periods.

38johnswentworth
Looking back, I have quite different thoughts on this essay (and the comments) than I did when it was published. Or at least much more legible explanations; the seeds of these thoughts have been around for a while. On The Essay The basketballism analogy remains excellent. Yet searching the comments, I'm surprised that nobody ever mentioned the Fosbury Flop or the Three-Year Swim Club. In sports, from time to time somebody comes along with some crazy new technique and shatters all the records. Comparing rationality practice to sports practice, rationality has not yet had its Fosbury Flop. I think it's coming. I'd give ~60% chance that rationality will have had its first Fosbury Flop in another five years, and ~40% chance that the first Fosbury Flop of rationality is specifically a refined and better-understood version of gears-level modelling. It's the sort of thing that people already sometimes approximate by intuition or accident, but has the potential to yield much larger returns once the technique is explicitly identified and intentionally developed. Once that sort of technique is refined, the returns to studying technique become much larger. On The Comments - What Does Rationalist Self-Improvement Look Like? Scott's prototypical picture of rationalist self-improvement "starts looking a lot like therapy". A concrete image: ... and I find it striking that people mostly didn't argue with that picture, so much as argue that it's actually pretty helpful to just avoid a lot of socially-respectable stupid mistakes.  I very strongly doubt that the Fosbury Flop of rationality is going to look like therapy. It's going to look like engineering. There will very likely be math. Today's "rationalist self-help" does look a lot like therapy, but it's not the thing which is going to have impressive yields from studying the techniques. On The Comments - What Benefits Should Rationalist Self-Improvement Yield? This is one question where I didn't have a clear answer
16Jacob Falkovich
This is a self-review, looking back at the post after 13 months. I have made a few edits to the post, including three major changes: 1. Sharpening my definition of what counts as "Rationalist self-improvement" to reduce confusion. This post is about improved epistemics leading to improved life outcomes, which I don't want to conflate with some CFAR techniques that are basically therapy packaged for skeptical nerds. 2. Addressing Scott's "counterargument from market efficiency" that we shouldn't expect to invent easy self-improvement techniques that haven't been tried. 3. Talking about selection bias, which was the major part missing from the original discussion. My 2020 post The Treacherous Path to Rationality is somewhat of a response to this one, concluding that we should expect Rationality to work mostly for those who self-select into it and that we'll see limited returns to trying to teach it more broadly. The past 13 months also provided more evidence in favor of epistemic Rationality being ever more instrumentally useful. In 2020 I saw a few Rationalist friends fund successful startups and several friends cross the $100k mark for cryptocurrency earnings. And of course, LessWrong led the way on early and accurate analysis of most COVID-related things. One result of this has been increased visibility and legitimacy, and of course another is that Rationalists have a much lower number of COVID cases than all other communities I know. In general, this post is aimed at someone who discovered Rationality recently but is lacking the push to dive deep and start applying it to their actual life decisions. I think the main point still stands: if you're Rationalist enough to think seriously about it, you should do it.
#52

Elizabeth summarizes the literature on distributed teams. She provides recommendations for when remote teams are preferable, and gives tips to mitigate the costs of distribution, such as site visits, over-communication, and hiring people suited to remote work.

#53

Divination seems obviously worthless to most modern educated people. But Xunzi, an ancient Chinese philosopher, argued there was value in practices like divination beyond just predicting the future. This post explores how randomized access to different perspectives or principles could be useful for decision-making and self-reflection, even if you don't believe in supernatural forces.

19Vaniver
Rereading this post, I'm a bit struck by how much effort I put into explaining my history with the underlying ideas, and motivating that this specifically is cool. I think this made sense as a rhetorical move--I'm hoping that a skeptical audience will follow me into territory labeled 'woo' so that they can see the parts of it that are real--and also as a pedagogical move (proofs may be easy to verify, but all of the interesting content of how they actually discovered that line of thought in concept space has been cleaned away; in this post, rather than hiding the sprues, they were part of the content, and perhaps even the main content). [Some part of me wants to signpost that a bit more clearly, tho perhaps it is obvious?] There's something that itches about this post, where it feels like I never turn 'the idea' into a sentence. "If one regards it as proper form, one will have good fortune." Sure, but that leaves much of the work to the reader; this post is more like a log of me as a reader doing some more of the work, and leaving yet more work to my reader. It's not a clear condensation of the point, it doesn't address previous scholarship, it doesn't even clearly identify the relevant points that I had identified, and it doesn't transmit many of the tips and tricks I picked up. A sentence that feels like it would have fit (at least some of what I wanted to convey?) is this description of Tarot readings: "they are not about foretelling your inevitable future, but taking control of it through self knowledge and awareness." [But in reading that, there's something pleasing about the holistic vagueness of "proper form"; the point of having proper form is not just 'taking control'!] For example, an important point that came up when reading AllAmericanBreakfast's exploration of using divination was the 'skill of discernment', and that looking at random perspectives and lenses helps train this as well. Once I got a Tarot reading that I'll paraphrase as "this person you're
#54

Evolution doesn't optimize for biological systems to be understandable. But, because only a small subset of possible biological designs can robustly achieve certain common goals (e.g. robust recognition of molecules, robust signal-passing, robust fold-change detection), the requirement to work robustly limits evolution to a handful of understandable structures.

13habryka
This post surprised me a lot. It still surprises me a lot, actually. I've also linked it a lot of times in the past year.  The concrete context where this post has come up is in things like ML transparency research, as well as lots of theories about what promising approaches to AGI capabilities research are. In particular, there is a frequently recurring question of the type "to what degree do optimization processes like evolution and stochastic gradient descent give rise to understandable modular algorithms?". 
#55

Kaj Sotala gives a step-by-step rationalist argument for why Internal Family Systems therapy might work. He begins by talking about how you might build an AI, only to stumble into the same failure modes that IFS purports to treat. He then explores how IFS might actually be solving these problems.

#56

Fun fact: biological systems are highly modular, at multiple different scales. This can be quantified and verified statistically. On the other hand, systems designed by genetic algorithms (aka simulated evolution) are decidedly not modular. They're a mess. This can also be verified statistically (as well as just by qualitatively eyeballing them).

What's up with that?
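One simple way the "quantified and verified statistically" claim can be cashed out (a generic sketch, not the specific analysis from the post or the underlying book) is to compute Newman's modularity score for a network and compare it against a size-matched random graph; the planted-partition graph below stands in for a modular system.

```python
# Quantifying modularity: compute Newman's modularity score Q for a graph with
# planted community structure and for a random graph with the same number of
# nodes and edges. The modular graph scores well above the random baseline.
import networkx as nx
from networkx.algorithms import community

modular = nx.planted_partition_graph(4, 25, p_in=0.3, p_out=0.02, seed=0)  # 4 dense blocks
random_g = nx.gnm_random_graph(modular.number_of_nodes(),
                               modular.number_of_edges(), seed=0)

for name, g in [("planted-partition", modular), ("size-matched random", random_g)]:
    parts = community.greedy_modularity_communities(g)   # detected community structure
    q = community.modularity(g, parts)                   # Newman's Q
    print(f"{name:20s}  Q = {q:.2f}")
```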

18johnswentworth
The material here is one seed of a worldview which I've updated toward a lot more over the past year. Some other posts which involve the theme include Science in a High Dimensional World, What is Abstraction?, Alignment by Default, and the companion post to this one Book Review: Design Principles of Biological Circuits. Two ideas unify all of these: 1. Our universe has a simplifying structure: it abstracts well, implying a particular kind of modularity. 2. Goal-oriented systems in our universe tend to evolve a modular structure which reflects the structure of the universe. One major corollary of these two ideas is that goal-oriented systems will tend to evolve similar modular structures, reflecting the relevant parts of their environment. Systems to which this applies include organisms, machine learning algorithms, and the learning performed by the human brain. In particular, this suggests that biological systems and trained deep learning systems are likely to have modular, human-interpretable internal structure. (At least, interpretable by humans familiar with the environment in which the organism/ML system evolved.) This post talks about some of the evidence behind this model: biological systems are indeed quite modular, and simulated evolution experiments find that circuits do indeed evolve modular structure reflecting the modular structure of environmental variations. The companion post reviews the rest of the book, which makes the case that the internals of biological systems are indeed quite interpretable. On the deep learning side, researchers also find considerable modularity in trained neural nets, and direct examination of internal structures reveals plenty of human-recognizable features. Going forward, this view is in need of a more formal and general model, ideally one which would let us empirically test key predictions - e.g. check the extent to which different systems learn similar features, or whether learned features in neural nets satisfy th
#57

While the scientific method developed in pieces over many centuries and places, Joseph Ben-David argues that in 17th century Europe there was a rapid accumulation of knowledge, restricted to a small area for about 200 years. Ruby explores whether this is true and why it might be, aiming to understand "what causes intellectual progress, generally?"

#58

Collect enough data about the input/output pairs for a system, and you might be able to predict future input-output pretty well. However, says John, such models are vulnerable. In particular, they can fail on novel inputs in a way that models that describe what actually is happening inside the system won't; and people can make pretty bad inferences from them, e.g. economists in the 70s about inflation/unemployment. See the post for more detail.
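A tiny illustration of the "fail on novel inputs" point (my own sketch, not John's example): a black-box fit that matches the observed input/output range well can break badly once inputs leave that range, while a model of the actual mechanism keeps working.

```python
# Black-box fit vs. gears-level model: fit a line to input/output data from a
# saturating process observed only at small inputs. The fit looks fine in-sample,
# then extrapolates badly once inputs move outside the observed range.
import numpy as np

def true_system(x):
    return 10 * x / (1 + x)                      # saturating mechanism (caps near 10)

x_obs = np.linspace(0.0, 0.5, 50)                # only small inputs ever observed
slope, intercept = np.polyfit(x_obs, true_system(x_obs), 1)   # black-box linear fit

for x in [0.3, 5.0, 50.0]:                       # in-range input, then novel inputs
    fit = slope * x + intercept
    print(f"x={x:5.1f}   true output={true_system(x):6.2f}   black-box fit={fit:7.2f}")
```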

11Zvi
After reading this, I went back and also re-read Gears in Understanding (https://www.lesswrong.com/posts/B7P97C27rvHPz3s9B/gears-in-understanding) which this is clearly working from. The key question to me was, is this a better explanation for some class of people? If so, it's quite valuable, since gears are a vital concept. If not, then it has to introduce something new in a way that I don't see here, or it's not worth including. It's not easy to put myself in the mind of someone who doesn't know about gears.  I think the original Gears in Understanding gives a better understanding of the central points, if you grok both posts fully, and gives better ways to get a sense of a given model's gear-ness level. What this post does better is Be Simpler, which can be important, and to provide a simpler motivation for What Happens Without Gears. In particular, this simplified version seems like it would be easier to get someone up to speed using, to the point where they can go 'wait a minute that doesn't have any gears' usefully. My other worry this brought up is that this reflects a general trend, of moving towards things that stand better alone and are simpler to grok and easier to appreciate, at the cost of richness of detail and grounding in related concepts and such - that years ago we'd do more of the thing Gears in Understanding did, and now we do Gears vs. Behavior thing more, and gears are important enough that I don't mind doing both (even if only to have a backup) but that there's a slippery slope where the second thing drives out the first thing and you're left pretty sad after a while.