A few dozen reasons why Eliezer thinks AGI alignment is an extremely difficult problem, which humanity is not on track to solve.
Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees. He argues that Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything those arguments actually support.
TurnTrout discusses a common misconception in reinforcement learning: that reward is the optimization target of trained agents. He argues reward is better understood as a mechanism for shaping cognition, not a goal to be optimized, and that this has major implications for AI alignment approaches.
Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. He contends that they decompose one hard problem (AI alignment) into two extremely hard problems, and that they cut against how cognition naturally forms. He also argues that approaches based on "robust grading" schemes are unlikely to produce aligned AI.
Nate Soares reviews a dozen plans and proposals for making AI go well. He finds that almost none of them grapple with what he considers the core problem: capabilities will suddenly generalize far past the training distribution, but alignment won't.
This post explores the concept of simulators in AI, particularly self-supervised models like GPT. Janus argues that GPT and similar models are best understood as simulators that can generate various simulacra, not as agents themselves. This framing helps explain many counterintuitive properties of language models. Powerful simulators could have major implications for AI capabilities and alignment.
People worry about agentic AI with ulterior motives, and some suggest Oracle AI, which only answers questions, as a safer alternative. But that's not the right way to think about agents: it killed you because it was optimised, and it used an agent because an agent was an effective tool it had on hand.
Optimality is the tiger, and agents are its teeth.
The DeepMind paper that introduced Chinchilla revealed that we've been using far too many parameters and not enough data for large language models. There are immense returns to scaling up training data, but we may be running out of high-quality data to train on. This could be a major bottleneck for future AI progress.
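As a rough illustration of the compute-optimal recipe (a sketch assuming the commonly cited approximations of training FLOPs C ≈ 6·N·D and roughly 20 training tokens per parameter):

```python
def chinchilla_allocation(compute_flops: float):
    """Split a training compute budget between parameters (N) and tokens (D),
    assuming C ~= 6 * N * D and the ~20-tokens-per-parameter rule of thumb."""
    n_params = (compute_flops / (6 * 20)) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# At ~5.8e23 FLOPs this gives ~7e10 params and ~1.4e12 tokens,
# close to Chinchilla's actual 70B parameters trained on 1.4T tokens.
print(chinchilla_allocation(5.8e23))
```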
In worlds where AI alignment can be handled by iterative design, we probably survive. So if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason. John explores several ways that could happen, beyond just fast takeoff and deceptive misalignment.
Nate Soares explains why he doesn't expect an unaligned AI to be friendly or cooperative with humanity, even if it uses logical decision theory. He argues that even getting a small fraction of resources from such an AI is extremely unlikely.
Neel Nanda reverse engineers neural networks that have "grokked" modular addition, showing that they operate using Discrete Fourier Transforms and trig identities. He argues grokking is really about phase changes in model capabilities, and that such phase changes may be ubiquitous in larger models.
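A toy numpy demo of the mechanism Nanda describes (the specific frequencies and inputs below are hypothetical choices for illustration): logits of the form cos(w(a+b−c)), expanded via the angle-addition identity, constructively interfere exactly at c = a + b mod p.

```python
import numpy as np

p = 113                                 # modulus from the grokking setup
ks = np.array([14, 35, 41, 52, 73])     # hypothetical "key frequencies"; the real network learns its own handful
a, b = 17, 95                           # example inputs
c = np.arange(p)                        # candidate answers

w = 2 * np.pi * ks[:, None] / p
# cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc), summed over frequencies
logits = (np.cos(w * (a + b)) * np.cos(w * c) + np.sin(w * (a + b)) * np.sin(w * c)).sum(axis=0)

print(logits.argmax(), (a + b) % p)     # both are 112: the peak sits at a+b mod p
```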
Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm replaces every model activation that shouldn't matter according to a hypothesis (resampling it from other inputs) and measures how much performance drops. It has been used to improve hypotheses about induction heads and parenthesis-balancing circuits.
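A minimal single-activation sketch of the underlying resampling ablation, in PyTorch (the full algorithm recursively scrubs a whole tree of hypothesis nodes; `layer` and `loss_fn` here are placeholders):

```python
import torch

def resample_ablate(model, layer, x, donor, loss_fn):
    """Replace `layer`'s activation on input x with its activation on an unrelated
    donor input, then compare the loss before and after. If the hypothesis is right
    that this activation doesn't matter, the loss should barely move."""
    cache = {}
    save = layer.register_forward_hook(lambda m, i, out: cache.update(donor_act=out.detach()))
    with torch.no_grad():
        model(donor)                           # cache the donor activation
    save.remove()

    with torch.no_grad():
        clean_loss = loss_fn(model(x))
        patch = layer.register_forward_hook(lambda m, i, out: cache["donor_act"])
        scrubbed_loss = loss_fn(model(x))      # same input, irrelevant activation resampled
        patch.remove()
    return clean_loss, scrubbed_loss
```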
How good are modern language models compared to humans at the task language models are trained on (next token prediction on internet text)? We found that humans seem to be consistently worse at next-token prediction (in terms of both top-1 accuracy and perplexity) than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1.
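For reference, a small sketch of how the two metrics are computed for a language model (using GPT-2 via Hugging Face purely as a stand-in; the actual comparison used different models and a web-text corpus):

```python
import math, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, :-1]          # model's prediction for each next token
targets = ids[0, 1:]

top1 = (logits.argmax(-1) == targets).float().mean().item()
perplexity = math.exp(torch.nn.functional.cross_entropy(logits, targets).item())
print(f"top-1 accuracy: {top1:.2f}, perplexity: {perplexity:.1f}")
```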
Models don't "get" reward. Reward is the mechanism by which we select parameters, it is not something "given" to the model. Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation. This has implications for how one should think about AI alignment.
Here's a simple strategy for AI alignment: use interpretability tools to identify the AI's internal search process, and the AI's internal representation of our desired alignment target. Then directly rewire the search process to aim at the alignment target. Boom, done.
Lessons from 20+ years of software security experience, perhaps relevant to AGI alignment:
1. Security doesn't happen by accident
2. Blacklists are useless but make them anyway
3. You get what you pay for (incentives matter)
4. Assurance requires formal proofs, which are provably impossible
5. A breach IS an existential risk
What's with all the strange pseudophilosophical questions from AI alignment researchers, like "what does it mean for some chunk of the world to do optimization?" or "how does an agent model a world bigger than itself?". John lays out why some people think solving these sorts of questions is a necessary prerequisite for AI alignment.
Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), and that this transition will break most alignment approaches. And this isn't getting enough focus from the field.
Alignment researchers often propose clever-sounding solutions without citing much evidence that their solution should help. Such arguments can mislead people into working on dead ends. Instead, TurnTrout argues we should focus more on studying how human intelligence implements alignment properties, since it is a real "existence proof" of aligned intelligence.
How do humans form their values? Shard theory proposes that human values are formed through a relatively straightforward reinforcement process, rather than being hard-coded by evolution. This post lays out the core ideas behind shard theory and explores how it can explain various aspects of human behavior and decision-making.
A new paper proposes an unsupervised way to extract knowledge from language models. The authors argue this could be a key part of aligning superintelligent AIs, by letting us figure out what the AI "really believes" rather than what it thinks humans want to hear. But there are still some challenges to overcome before this could work on future superhuman AIs.
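A minimal sketch of the kind of unsupervised objective the paper describes, as I understand it: a probe maps the hidden states of a statement and its negation to probabilities, trained with no labels so that the two are consistent (they sum to roughly one) yet confident (not both stuck at 0.5).

```python
import torch

def contrast_consistency_loss(p_statement, p_negation):
    """Unsupervised probe loss: penalize inconsistency between a statement and its
    negation, and penalize the degenerate always-50/50 solution."""
    consistency = (p_statement - (1 - p_negation)) ** 2
    confidence = torch.minimum(p_statement, p_negation) ** 2
    return (consistency + confidence).mean()

# probe = torch.nn.Sequential(torch.nn.Linear(hidden_dim, 1), torch.nn.Sigmoid())
# p_statement, p_negation = probe(h_pos), probe(h_neg)   # hidden states from the LM (hypothetical names)
```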