Paul Christiano paints a vivid and disturbing picture of how AI could go wrong, not through a sudden violent takeover, but through a gradual loss of human control as AI systems optimize for the wrong things and develop influence-seeking behaviors.
AI researchers warn that advanced machine learning systems may develop their own internal goals that don't match what we intended. This "mesa-optimization" could lead AI systems to pursue unintended and potentially dangerous objectives, even if we tried to design them to be safe and aligned with human values.
A story in nine parts about someone creating an AI that predicts the future, and multiple people who wonder about the implications. What happens when the predictions influence what future happens?
John Wentworth argues that becoming one of the best in the world at *one* specific skill is hard, but it's not as hard to become the best in the world at the *combination* of two (or more) different skills. He calls this being "Pareto best" and argues it can circumvent the generalized efficient markets principle.
The Secret of Our Success argues that cultural traditions have had a lot of time to evolve. So seemingly arbitrary cultural practices may actually encode important information, even if the practitioners can't tell you why.
"Some of the people who have most inspired me have been inexcusably wrong on basic issues. But you only need one world-changing revelation to be worth reading."
Scott argues that our interest in thinkers should not be determined by their worst idea, or even their average idea, but by their best ideas. Some of the best thinkers in history believed ludicrous things, like Newton believing in Bible codes.
If the thesis in Unlocking the Emotional Brain is even half-right, it may be one of the most important books that I have read. It claims to offer a neuroscience-grounded, comprehensive model of how effective therapy works. In so doing, it also happens to formulate its theory in terms of belief updating, helping explain how the brain models the world and what kinds of techniques allow us to actually change our minds.
According to Zvi, people have a warped sense of justice. For any harm you cause, regardless of intention or motive, you earn "negative points" that merit punishment. At least implicitly, however, people want to reward the good outcomes a person causes only if their sole goal was being altruistic. Curing illness to make a profit? No "positive points" for you!
Suppose you had a society of multiple factions, each of whom only say true sentences, but are selectively more likely to repeat truths that favor their preferred tribe's policies. Zack explores the math behind what sort of beliefs people would be able to form, and what consequences might befall people who aren't aware of the selective reporting.
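As a toy illustration of one dynamic in this vicinity (my own sketch, not the post's actual math): a listener who treats selectively reported true observations as if they were a random sample will drift toward the reporting faction's favored conclusion, even though every individual report is true.

```python
import random

# Toy model: a fair coin is flipped repeatedly. A partisan reporter relays only
# the flips that favor their side ("heads"). A naive listener updates on each
# reported flip as if it were a randomly sampled observation.

random.seed(0)

def naive_posterior(num_flips, report_filter):
    """Posterior P(coin is heads-biased) for a listener who ignores the filter.

    Prior: 50% that the coin is heads-biased (p=0.75) vs fair (p=0.5).
    """
    p_biased, p_fair = 0.5, 0.5
    for _ in range(num_flips):
        flip = "H" if random.random() < 0.5 else "T"  # the coin really is fair
        if not report_filter(flip):
            continue  # the listener simply never hears about this flip
        lik_biased = 0.75 if flip == "H" else 0.25
        lik_fair = 0.5
        p_biased, p_fair = p_biased * lik_biased, p_fair * lik_fair
        total = p_biased + p_fair
        p_biased, p_fair = p_biased / total, p_fair / total
    return p_biased

print(naive_posterior(100, lambda f: True))      # unfiltered: typically near 0 (correctly infers a fair coin)
print(naive_posterior(100, lambda f: f == "H"))  # heads-only reporting: typically near 1 (wrongly infers bias)
```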
Examining the concept of optimization, Abram Demski distinguishes between "selection" (like search algorithms that evaluate many options) and "control" (like thermostats or guided missiles). He explores how this distinction relates to ideas of agency and mesa-optimization, and considers various ways to define the difference.
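A rough way to see the distinction in code (a toy sketch of my own, not Demski's formalism): a "selection" process generates and evaluates many candidate options before committing, while a "control" process never enumerates options at all, it only sees the current state and nudges it toward a target.

```python
import random

def selection_optimize(candidates, score):
    """Selection: evaluate many options, then commit to the best one (like search)."""
    return max(candidates, key=score)

def control_optimize(state, target, steps=100, gain=0.1):
    """Control: never enumerate options; repeatedly apply feedback that pushes the
    current state toward a setpoint (like a thermostat or guided missile)."""
    for _ in range(steps):
        error = target - state
        state = state + gain * error  # feedback correction
    return state

# Selection: search over 1000 candidate x values for the one closest to 3.
xs = [random.uniform(-10, 10) for _ in range(1000)]
best = selection_optimize(xs, score=lambda x: -(x - 3) ** 2)

# Control: start anywhere and settle near 3 via feedback, without ever comparing alternatives.
settled = control_optimize(state=-10.0, target=3.0)

print(best, settled)
```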
When it comes to coordinating people around a goal, you don't get limitless communication bandwidth for conveying arbitrarily nuanced messages. Instead, the "amount of words" you get to communicate depends on how many people you're trying to coordinate. Once you have enough people... you don't get many words.
When trying to coordinate with others, we often assume the default should be full cooperation ("stag hunting"). Raemon argues this isn't realistic - the default is usually for people to pursue their own interests ("rabbit hunting"). If you want people to cooperate on a big project, you need to put in special effort to get buy-in.
When disagreements persist despite lengthy good-faith communication, the problem may not just be factual disagreement: it could be that people are operating in entirely different frames, i.e. different ways of seeing, thinking, and/or communicating.
Nine parables, in which people find it hard to trust that they've actually gotten a "yes" answer.
Concerningly, it can be much easier to spot holes in the arguments of others than it is in your own arguments. The author of this post reflects that historically, he's been too hasty to go from "other people seem very wrong on this topic" to "I am right on this topic".
A Recovery Day (or "slug day") is where you're so tired you can only binge Netflix or stay in bed. A Rest Day is where you have enough energy to "follow your gut" with no obligations or pressure. Unreal argues that true rest days are important for avoiding burnout, and gives suggestions on how to implement them.
Alex Turner lays out a framework for understanding how and why artificial intelligences pursuing goals often end up seeking power as an instrumental strategy, even if power itself isn't their goal. This tendency emerges from basic principles of optimal decision-making.
But he cautions that if you haven't internalized that reward is not the optimization target, the concepts here, while technically accurate, may lead you astray in alignment research.
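A toy numerical illustration of the flavor of this result (my own simplified sketch, not Turner's actual formalism): if you score each state by its average optimal value over many randomly drawn reward functions, states that keep more options open score higher, so most reward functions incentivize moving toward them.

```python
import random

# Tiny deterministic environment: from each state the agent can move to any of its
# listed successor states and then collect that successor's (randomly drawn) reward.
graph = {
    "hub":      ["a", "b", "c", "d"],  # keeps many options open
    "dead_end": ["a"],                 # commits the agent to one outcome
    "a": [], "b": [], "c": [], "d": [],
}

def avg_optimal_value(state, samples=10_000):
    """Average, over random reward functions, of the best reward reachable in one step."""
    total = 0.0
    for _ in range(samples):
        reward = {s: random.random() for s in graph}  # a fresh random reward function
        choices = graph[state] or [state]
        total += max(reward[s] for s in choices)
    return total / samples

print(avg_optimal_value("hub"))       # ~0.8: best of four uniform draws
print(avg_optimal_value("dead_end"))  # ~0.5: stuck with a single draw
```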
In thinking about AGI safety, I’ve found it useful to build a collection of different viewpoints from people that I respect, such that I can think from their perspective. I will often try to compare what an idea feels like when I put on my Paul Christiano hat, to when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a "Chris Olah" hat, which often looks at AI through the lens of interpretability.
The goal of this post is to try to give that hat to more people.
Eric Drexler's CAIS model suggests that before we get to a world with monolithic AGI agents, we will already have seen an intelligence explosion due to automated R&D. This reframes the problems of AI safety and has implications for what technical safety researchers should be doing. Rohin reviews and summarizes the model.
The strategy-stealing assumption posits that for any strategy an unaligned AI could use to influence the long-term future, there is an analogous strategy that humans could use to capture similar influence. Paul Christiano explores why this assumption might be true, and eleven ways it could potentially fail.
Impact measures may be a powerful safeguard for AI systems - one that doesn't require solving the full alignment problem. But what exactly is "impact", and how can we measure it?
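One concrete proposal in this space, attainable utility preservation, roughly penalizes an action by how much it changes the agent's ability to pursue a set of auxiliary goals, relative to doing nothing. Below is a minimal sketch of that idea (the names, the auxiliary Q-functions, and the toy usage are my own illustrative assumptions, not the post's implementation).

```python
def aup_penalty(q_aux, state, action, noop_action):
    """Penalize an action by how much it shifts the agent's attainable utility.

    q_aux: dict mapping each auxiliary goal's name to a Q-function
           q(state, action) -> float (assumed to be computed elsewhere).
    """
    diffs = [
        abs(q(state, action) - q(state, noop_action))
        for q in q_aux.values()
    ]
    return sum(diffs) / len(diffs)

def shaped_reward(task_reward, penalty, weight=1.0):
    """The agent optimizes its task reward minus the scaled impact penalty."""
    return task_reward - weight * penalty

# Toy usage with two hypothetical auxiliary Q-functions:
q_aux = {
    "keep_vase_intact": lambda s, a: 0.0 if a == "smash_vase" else 1.0,
    "reach_door":       lambda s, a: 0.5,
}
print(aup_penalty(q_aux, state="room", action="smash_vase", noop_action="wait"))  # 0.5: big impact
print(aup_penalty(q_aux, state="room", action="walk",       noop_action="wait"))  # 0.0: negligible impact
```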
Double descent is a puzzling phenomenon in machine learning where increasing model size/training time/data can initially hurt performance, but then improve it. Evan Hubinger explains the concept, reviews prior work, and discusses implications for AI alignment and understanding inductive biases.
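A minimal way to see the phenomenon for yourself (a generic sketch, not Hubinger's setup): fit minimum-norm least squares on random ReLU features and watch test error as the number of features passes the number of training points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy observations of a smooth 1-D function.
def target(x):
    return np.sin(2 * np.pi * x)

n_train = 40
x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + 0.1 * rng.standard_normal(n_train)
x_test = np.linspace(-1, 1, 500)
y_test = target(x_test)

def random_relu_features(x, n_features, seed):
    r = np.random.default_rng(seed)
    w = r.standard_normal(n_features)
    b = r.uniform(-1, 1, n_features)
    return np.maximum(0.0, np.outer(x, w) + b)

# Sweep model size through the interpolation threshold (n_features == n_train).
for n_features in [5, 10, 20, 40, 80, 160, 640]:
    Phi_train = random_relu_features(x_train, n_features, seed=1)
    Phi_test = random_relu_features(x_test, n_features, seed=1)
    # lstsq returns the minimum-norm solution in the overparameterized regime.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{n_features:4d} features: test MSE {test_mse:.3f}")

# Typically the test error spikes around 40 features (the interpolation threshold)
# and then falls again as the model keeps growing: double descent.
```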
Scott Alexander's "Meditations on Moloch" paints a gloomy picture of the world being inevitably consumed by destructive forces of competition and optimization. But Zvi argues this isn't actually how the world works - we've managed to resist and overcome these forces throughout history.
Integrity isn't just about honesty - it's about aligning your actions with your stated beliefs. But who should you be accountable to? Too broad an audience, and you're limited to simplistic principles. Too narrow, and you miss out on opportunities for growth and collaboration.
Building gears-level models is expensive - often prohibitively expensive. Black-box approaches are usually cheaper and faster. But black-box approaches rarely generalize - they need to be rebuilt when conditions change, don’t identify unknown unknowns, and are hard to build on top of. Gears-level models, on the other hand, offer permanent, generalizable knowledge which can be applied to many problems in the future, even if conditions shift.
If you want to bring up a norm or expectation that's important to you, but not something you'd necessarily argue should be universal, an option is to preface it with the phrase "in my culture." In Duncan's experience, this helps navigate tricky situations by taking your own personal culture as object, and discussing how it is important to you without making demands of others.
Jeff argues that people should fill in some of the San Francisco Bay, south of the Dumbarton Bridge, to create new land for housing. This would allow millions of people to live closer to jobs, reducing sprawl and traffic. While there are environmental concerns, the benefits of dense urban housing outweigh the localized impacts.
In this post, I proclaim/endorse forum participation (aka commenting) as a productive research strategy that I've managed to stumble upon, and recommend it to others (at least to try). Note that this is different from saying that forum/blog posts are a good way for a research community to communicate. It's about individually doing better as researchers.
There are at least three ways in which incentives affect behavior: (1) consciously motivating agents, (2) unconsciously reinforcing certain behaviors, and (3) selection effects.
Jacob argues that #2 and probably #3 are more important, but much less talked about.
I've wrestled with applying ideas like "conservation of expected evidence," and want to warn others about some common mistakes. Some of the "obvious inferences" that seem to follow from these ideas are actually mistaken, or stop short of the optimal conclusion.
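For reference, the identity in question is just a rearrangement of the law of total probability (stated here in standard form, not quoting the post):

$$P(H) \;=\; P(E)\,P(H \mid E) \;+\; P(\lnot E)\,P(H \mid \lnot E)$$

That is, your current credence already equals the expectation of your future credence, so any anticipated update in one direction must be balanced by a possible update in the other.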
A thoughtful exploration of the risks and benefits of sharing information about biosecurity and biological risks. The authors argue that while there are real risks to sharing sensitive information, there are also important benefits that need to be weighed carefully. They provide frameworks for thinking through these tradeoffs.
Ben and Jessica discuss how language and meaning can degrade through four stages as people manipulate signifiers. They explore how job titles have shifted from reflecting reality, to being used strategically, to becoming meaningless.
This post kicked off subsequent discussion on LessWrong about simulacrum levels.
nostalgebraist argues that GPT-2 is a fascinating and important development for our understanding of language and the mind, despite its flaws. They're frustrated that many psycholinguists who previously studied language in detail now seem uninterested in looking at what GPT-2 tells us about language, instead focusing on whether it's "real AI".
AI safety researchers have different ideas of what success would look like. This post explores five different AI safety "success stories" that researchers might be aiming for and compares them along several dimensions.
Heated, tense arguments can often be unproductive and unpleasant. Neither side feels heard, and they are often working desperately to defend something they feel is very important. Ruby explores this problem and some solutions.
It might be the case that what people find beautiful and ugly is subjective, but that's not an explanation of *why* people find some things beautiful or ugly. Things, including aesthetics, have causal reasons for being the way they are. You can even ask "what would change my mind about whether this is beautiful or ugly?". Raemon explores this topic in depth.
Gradient hacking is when a deceptively aligned AI deliberately acts to influence how the training process updates it. For example, it might try to become more brittle in ways that prevent its objective from being changed. This poses challenges for AI safety, as the AI might try to remove evidence of its deception during training.
The Amish relationship to technology is not "stick to technology from the 1800s", but rather "carefully think about how technology will affect your culture, and only include technology that does what you want." Raemon explores how these ideas could potentially be applied in other contexts.
Power allows people to benefit from immoral acts without having to take responsibility or even be aware of them. The most powerful person in a situation may not be the most morally culpable, as they can remain distant from the actual "crime". If you're not actively looking into how your wants are being met, you may be unknowingly benefiting from something unethical.
Most advice on reading scientific papers focuses on evaluating individual claims. But what if you want to build a deeper "gears-level" understanding of a system? John Wentworth offers advice on how to read papers to build such models, including focusing on boring details, reading broadly, and looking for mediating variables.
Since middle school I've thought I was pretty good at dealing with my emotions, and a handful of close friends and family have made similar comments. Now I can see that though I was particularly good at never flipping out, I was decidedly not good at "healthy emotional processing".
Said argues that there's no such thing as a real exception to a rule. If you find an exception, this means you need to update the rule itself. The "real" rule is always the one that already takes into account all possible exceptions.
So we're talking about how to make good decisions, or the idea of 'bounded rationality', or what sufficiently advanced Artificial Intelligences might be like; and somebody starts dragging up the concepts of 'expected utility' or 'utility functions'.
And before we even ask what those are, we might first ask, Why?
A general guide for pursuing independent research, from conceptual questions like "how to figure out how to prioritize, learn, and think", to practical questions like "what sort of snacks should you buy to maximize productivity?"
Smart people are failing to provide strong arguments for why blackmail should be illegal. Robin Hanson is explicitly arguing it should be legal. Zvi Mowshowitz argues this is wrong, and gives his perspective on why blackmail is bad.
Many of us are held back by mental patterns that compare reality to imaginary "shoulds". PJ Eby explains how to recognize this pattern and start to get free of it.
The credit assignment problem – the challenge of figuring out which parts of a complex system deserve credit for good or bad outcomes – shows up just about everywhere. Abram Demski describes how credit assignment appears in areas as diverse as AI, politics, economics, law, sociology, biology, ethics, and epistemology.
Some people use the story of manioc as a cautionary tale against innovating through reason. But is this really a fair comparison? Is it reasonable to expect a day of untrained thinking to outperform hundreds of years of accumulated tradition? The author argues that this sets an unreasonably high bar for reason, and that even if reason sometimes makes mistakes, it's still our best tool for progress.
A tour de force, this post combines a review of Unlocking The Emotional Brain, Kaj Sotala's review of the book, and connections to predictive coding theory.
It's a deep dive into models of how human cognition is driven by emotional learning, and how this learning drives many beliefs and behaviors. If that's the case, one big question is how people emotionally learn and unlearn things.
Robin Hanson asked "Why do people like complex rules instead of simple rules?" and gave 12 examples.
Zvi responds with a detailed analysis of each example, suggesting that the desire for complex rules often stems from issues like Goodhart's Law, the Copenhagen Interpretation of Ethics, power dynamics, and the need to consider factors that can't be explicitly stated.
Many people in the rationalist community are skeptical that rationalist techniques can really be trained and improved at a personal level. Jacob argues that rationality can be a skill that people can improve with practice, but that improvement is difficult to see in aggregate and requires consistent effort over long periods.
Elizabeth summarizes the literature on distributed teams. She provides recommendations for when remote teams are preferable, and gives tips to mitigate the costs of distribution, such as site visits, over-communication, and hiring people suited to remote work.
Divination seems obviously worthless to most modern educated people. But Xunzi, an ancient Chinese philosopher, argued there was value in practices like divination beyond just predicting the future. This post explores how randomized access to different perspectives or principles could be useful for decision-making and self-reflection, even if you don't believe in supernatural forces.
Evolution doesn't optimize for biological systems to be understandable. But because only a small subset of possible biological designs can robustly achieve certain common goals (e.g. robust recognition of molecules, robust signal-passing, robust fold-change detection), the requirement to work robustly limits evolution to a handful of understandable structures.
Kaj Sotala gives a step-by-step rationalist argument for why Internal Family Systems therapy might work. He begins by talking about how you might build an AI, only to stumble into the same failure modes that IFS purports to treat. Then he explores how IFS might actually be solving these problems.
Fun fact: biological systems are highly modular, at multiple different scales. This can be quantified and verified statistically. On the other hand, systems designed by genetic algorithms (aka simulated evolution) are decidedly not modular. They're a mess. This can also be verified statistically (as well as just by qualitatively eyeballing them).
What's up with that?
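One standard way this kind of claim gets quantified (a generic sketch using networkx, not the specific analysis behind the post): compute a network's modularity score and compare it against degree-preserving random rewirings of the same network.

```python
import networkx as nx
from networkx.algorithms import community

def modularity_score(graph):
    """Partition the graph into communities greedily and return Newman modularity."""
    parts = community.greedy_modularity_communities(graph)
    return community.modularity(graph, parts)

# Stand-in for a real interaction network (e.g. a protein-protein or gene-regulatory
# graph loaded from data); here we just use a toy graph with planted communities.
G = nx.planted_partition_graph(l=4, k=10, p_in=0.6, p_out=0.02, seed=0)
observed = modularity_score(G)

# Null model: degree-preserving rewiring destroys community structure while keeping
# each node's number of connections fixed.
null_scores = []
for seed in range(20):
    H = G.copy()
    nx.double_edge_swap(H, nswap=10 * H.number_of_edges(), max_tries=10**6, seed=seed)
    null_scores.append(modularity_score(H))

print(f"observed modularity: {observed:.2f}")
print(f"rewired networks:    {min(null_scores):.2f} to {max(null_scores):.2f}")
# A network whose observed score sits far above the rewired range is "modular"
# in the statistical sense gestured at above.
```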
While the scientific method developed in pieces over many centuries and places, Joseph Ben-David argues that in 17th century Europe there was a rapid accumulation of knowledge, restricted to a small area for about 200 years. Ruby explores whether this is true and why it might be, aiming to understand "what causes intellectual progress, generally?"
Collect enough data about the input/output pairs of a system, and you might be able to predict its future input-output behavior pretty well. However, says John, such models are vulnerable. In particular, they can fail on novel inputs in a way that models describing what is actually happening inside the system won't, and people can draw pretty bad inferences from them, e.g. economists in the 1970s on inflation/unemployment. See the post for more detail.