Alignment Newsletter

Aug 01, 2018 by Rohin Shah

I publish the Alignment Newsletter, a weekly publication with recent content relevant to AI alignment. See here for more details. Quick links: email signup form, RSS feed, spreadsheet of all summaries.

The Alignment Newsletter #1: 04/09/18
The Alignment Newsletter #2: 04/16/18
The Alignment Newsletter #3: 04/23/18
The Alignment Newsletter #4: 04/30/18
The Alignment Newsletter #5: 05/07/18
The Alignment Newsletter #6: 05/14/18
The Alignment Newsletter #7: 05/21/18
The Alignment Newsletter #8: 05/28/18
The Alignment Newsletter #9: 06/04/18
The Alignment Newsletter #10: 06/11/18
The Alignment Newsletter #11: 06/18/18
The Alignment Newsletter #12: 06/25/18
Alignment Newsletter #13: 07/02/18
Alignment Newsletter #14
Alignment Newsletter #15: 07/16/18
Alignment Newsletter #16: 07/23/18
Alignment Newsletter #17
Alignment Newsletter #18
Alignment Newsletter #19
Alignment Newsletter #20
Alignment Newsletter #21
Alignment Newsletter #22
Alignment Newsletter #23
Alignment Newsletter #24
Alignment Newsletter #25
Alignment Newsletter #26
Alignment Newsletter #27
Alignment Newsletter #28
Alignment Newsletter #29
Alignment Newsletter #30
Alignment Newsletter #31
Alignment Newsletter #32
Alignment Newsletter #33
Alignment Newsletter #34
Alignment Newsletter #35
Alignment Newsletter #36
Alignment Newsletter #37
Alignment Newsletter #38
Alignment Newsletter #39
Alignment Newsletter #40
Alignment Newsletter #41
Alignment Newsletter #42
Alignment Newsletter #43
Alignment Newsletter #44
Alignment Newsletter #45
Alignment Newsletter #46
Alignment Newsletter #47
Alignment Newsletter #48
Alignment Newsletter #49
Alignment Newsletter #50
Alignment Newsletter #51
Alignment Newsletter #52
Alignment Newsletter One Year Retrospective
Alignment Newsletter #53
[AN #54] Boxing a finite-horizon AI system to keep it unambitious
[AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI
[AN #56] Should ML researchers stop running experiments before making hypotheses?
[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming
[AN #58] Mesa optimization: what it is, and why we should care
[AN #59] How arguments for AI risk have changed over time
[AN #60] A new AI challenge: Minecraft agents that assist human players in creative mode
[AN #61] AI policy and governance, from two people in the field
[AN #62] Are adversarial examples caused by real but imperceptible features?
[AN #63] How architecture search, meta learning, and environment design could lead to general intelligence
[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning
[AN #65]: Learning useful skills by watching humans “play”
[AN #66]: Decomposing robustness into capability robustness and alignment robustness
[AN #67]: Creating environments in which to study inner alignment failures
[AN #68]: The attainable utility theory of impact
[AN #69] Stuart Russell's new book on why we need to replace the standard model of AI
[AN #70]: Agents that help humans who are still learning about their own preferences
[AN #71]: Avoiding reward tampering through current-RF optimization
[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety
[AN #73]: Detecting catastrophic failures by learning how agents tend to break
[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts
[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee
[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations
[AN #77]: Double descent: a unification of statistical theory and modern ML practice
[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison
[AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL
[AN #80]: Why AI risk might be solved without additional intervention from longtermists
[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment
[AN #82]: How OpenAI Five distributed their training computation
[AN #83]: Sample-efficient deep learning with ReMixMatch
[AN #84] Reviewing AI alignment work in 2018-19
[AN #85]: The normative questions we should be asking for AI alignment, and a surprisingly good chatbot
[AN #86]: Improving debate and factored cognition through human experiments
[AN #87]: What might happen as deep learning scales even further?
[AN #88]: How the principal-agent literature relates to AI risk
[AN #89]: A unifying formalism for preference learning algorithms
[AN #90]: How search landscapes can contain self-reinforcing feedback loops
[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement
[AN #92]: Learning good representations with contrastive predictive coding
[AN #93]: The Precipice we’re standing at, and how we can back away from it
[AN #94]: AI alignment as translation between humans and machines
[AN #95]: A framework for thinking about how to make AI go well
[AN #96]: Buck and I discuss/argue about AI Alignment
[AN #97]: Are there historical examples of large, robust discontinuities?
[AN #98]: Understanding neural net training by seeing which gradients were helpful
[AN #99]: Doubling times for the efficiency of AI algorithms
[AN #100]: What might go wrong if you learn a reward function while acting
[AN #101]: Why we should rigorously measure and forecast AI progress
[AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment
[AN #103]: ARCHES: an agenda for existential safety, and combining natural language with deep RL
[AN #104]: The perils of inaccessible information, and what we can learn about AI alignment from COVID
[AN #105]: The economic trajectory of humanity, and what we might mean by optimization
[AN #106]: Evaluating generalization ability of learned reward models
[AN #107]: The convergent instrumental subgoals of goal-directed agents
[AN #108]: Why we should scrutinize arguments for AI risk
[AN #109]: Teaching neural nets to generalize the way humans would
[AN #110]: Learning features from human feedback to enable reward learning
[AN #111]: The Circuits hypotheses for deep learning
[AN #112]: Engineering a Safer World
[AN #113]: Checking the ethical intuitions of large language models
[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents
[AN #115]: AI safety research problems in the AI-GA framework
[AN #116]: How to make explanations of neurons compositional
[AN #117]: How neural nets would fare under the TEVV framework
[AN #118]: Risks, solutions, and prioritization in a world with many AI systems
[AN #119]: AI safety when agents are shaped by environments, not rewards
[AN #120]: Tracing the intellectual roots of AI and AI alignment
[AN #121]: Forecasting transformative AI timelines using biological anchors
[AN #122]: Arguing for AGI-driven existential risk from first principles
[AN #123]: Inferring what is valuable in order to align recommender systems
[AN #124]: Provably safe exploration through shielding
[AN #125]: Neural network scaling laws across multiple modalities
[AN #126]: Avoiding wireheading by decoupling action feedback from action effects
[AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment
[AN #128]: Prioritizing research on AI existential safety based on its application to governance demands
[AN #129]: Explaining double descent by measuring bias and variance
[AN #130]: A new AI x-risk podcast, and reviews of the field
[AN #131]: Formalizing the argument of ignored attributes in a utility function
[AN #132]: Complex and subtly incorrect arguments as an obstacle to debate
[AN #133]: Building machines that can cooperate (with humans, institutions, or other machines)
[AN #134]: Underspecification as a cause of fragility to distribution shift
[AN #135]: Five properties of goal-directed systems
[AN #136]: How well will GPT-N perform on downstream tasks?
[AN #137]: Quantifying the benefits of pretraining on downstream task performance
[AN #138]: Why AI governance should find problems rather than just solving them
[AN #139]: How the simplicity of reality explains the success of neural nets
[AN #140]: Theoretical models that predict scaling laws
[AN #141]: The case for practicing alignment work on GPT-3 and other large models
[AN #142]: The quest to understand a network well enough to reimplement it by hand
[AN #143]: How to make embedded agents that reason probabilistically about their environments
[AN #144]: How language models can also be finetuned for non-language tasks
Alignment Newsletter Three Year Retrospective
[AN #145]: Our three year anniversary!
[AN #146]: Plausible stories of how we might fail to avert an existential catastrophe
[AN #147]: An overview of the interpretability landscape
[AN #148]: Analyzing generalization across more axes than just accuracy or loss
[AN #149]: The newsletter's editorial policy
[AN #150]: The subtypes of Cooperative AI research
[AN #151]: How sparsity in the final layer makes a neural net debuggable
[AN #152]: How we’ve overestimated few-shot learning capabilities
[AN #153]: Experiments that demonstrate failures of objective robustness
[AN #154]: What economic growth theory has to say about transformative AI
[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions
[AN #156]: The scaling hypothesis: a plan for building AGI
[AN #157]: Measuring misalignment in the technology underlying Copilot
[AN #158]: Should we be optimistic about generalization?
[AN #159]: Building agents that know how to experiment, by training on procedurally generated games
[AN #160]: Building AIs that learn and think like people
[AN #161]: Creating generalizable reward functions for multiple tasks by learning a model of functional similarity
[AN #162]: Foundation models: a paradigm shift within AI
[AN #163]: Using finite factored sets for causal and temporal inference
[AN #164]: How well can language models write code?
[AN #165]: When large models are more likely to lie
[AN #166]: Is it crazy to claim we're in the most important century?
[AN #167]: Concrete ML safety problems and their relevance to x-risk
[AN #168]: Four technical topics for which Open Phil is soliciting grant proposals
[AN #169]: Collaborating with humans without human data
[AN #170]: Analyzing the argument for risk from power-seeking AI
[AN #171]: Disagreements between alignment "optimists" and "pessimists"
[AN #172] Sorry for the long hiatus!
[AN #173] Recent language model results from DeepMind