A list of core AI safety problems and how I hope to solve them
Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by Safeguarded AI (formerly known as an Open Agency Architecture, or OAA), if it turns out to be feasible.

1. Value is fragile and hard to specify.

See: Specification gaming examples, Defining and Characterizing Reward Hacking[1]

OAA Solution:

1.1. First, instead of trying to specify "value", "de-pessimize" and specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem, which is harder than merely mitigating extinction risk. This doesn't mean limiting the power of underlying AI systems so that they can only do bounded tasks, but rather containing that power and limiting its use.

Note: The absence of a catastrophe is also still hard to specify and will take a lot of effort, but the hardness is concentrated on bridging between high-level human concepts and the causal mechanisms in the world by which an AI system can intervene. For that...

1.2. Leverage human-level AI systems to automate much of the cognitive labor of formalizing scientific models (from quantum chemistry to atmospheric dynamics) and of formalizing the bridging relations between levels of abstraction, so that we can write specifications in a high-level language with a fully explainable grounding in low-level physical phenomena. Physical phenomena themselves are likely to be robust, even if the world changes dramatically due to increasingly powerful AI interventions, and scientific explanations thereof happen to be both robust and compact enough for people to understand. (A toy sketch of this spec-grounding pattern appears at the end of this section.)

2. Corrigibility is anti-natural.

See: The Off-Switch Game, Corrigibility (2014)

OAA Solution:

2.1. Instead
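As promised in 1.2, here is a minimal toy sketch of the spec-grounding pattern from 1.1 and 1.2: a de-pessimized specification is a predicate over high-level human concepts, those concepts are grounded in a low-level world model via an explicit abstraction map, and a candidate plan is checked against the specification before execution. This is only an illustration of the shape of the idea, not OAA itself; the clean-water domain comes from 1.1, but the dynamics, thresholds, and every name here (`LowState`, `step`, `abstract`, `no_catastrophe`, `plan_is_safe`) are hypothetical.

```python
# Toy sketch: a "no catastrophe" spec over high-level concepts, grounded in a
# low-level model via an explicit abstraction map. All details are invented.
from dataclasses import dataclass


@dataclass(frozen=True)
class LowState:
    """Toy low-level state: reservoir volume (liters) and contaminant (ppm)."""
    volume: float
    contaminant_ppm: float


def step(s: LowState, action: str) -> LowState:
    """Toy low-level dynamics for two actions: 'purify' and 'refill'."""
    if action == "purify":
        return LowState(s.volume - 10.0, max(0.0, s.contaminant_ppm - 5.0))
    return LowState(s.volume + 50.0, s.contaminant_ppm + 2.0)  # "refill"


def abstract(s: LowState) -> dict:
    """Bridging relation: map low-level physics to high-level human concepts."""
    return {
        "water_available": s.volume > 0.0,
        "water_safe": s.contaminant_ppm <= 10.0,
    }


def no_catastrophe(high: dict) -> bool:
    """De-pessimized spec: forbid catastrophe rather than try to define 'value'."""
    return high["water_available"] and high["water_safe"]


def plan_is_safe(init: LowState, plan: list[str]) -> bool:
    """Check a proposed plan against the spec in the world model before acting."""
    s = init
    for a in plan:
        s = step(s, a)
        if not no_catastrophe(abstract(s)):
            return False
    return True


init = LowState(volume=100.0, contaminant_ppm=8.0)
print(plan_is_safe(init, ["purify", "refill", "purify"]))   # True: stays within spec
print(plan_is_safe(init, ["refill", "refill", "refill"]))   # False: exceeds 10 ppm
```

The real proposal envisions formally verified guarantees against a rich world model, not Python-level trajectory simulation; the sketch is only meant to show where the hardness in the Note above lives, namely in the `abstract` map that bridges human concepts to causal mechanisms.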