Safety First: safety before full alignment. The deontic sufficiency hypothesis.

Chipmonk

3 min read | 3rd Jan 2024 | 3 comments

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

It could be the case that these two goals are separable and independent:

“AI safety”: avoiding existential risk, s-risk, actively negative outcomes
“AI getting-everything-we-want” (CEV)

Getting our actively positive desires fulfilled ≠? Getting safety

This is what Davidad calls the Deontic Sufficiency Hypothesis.

If the hypothesis is true, it should be possible to de-pessimize and mitigate the urgent risk from AI without necessarily ensuring that AI creates actively positive outcomes, because safety only requires ensuring that actively harmful outcomes do not occur. Hopefully this is easier than achieving “full alignment”.

Safety first! We can figure out the rest later.

Quotes from Davidad's Open Agency Architecture plans

This is Davidad’s plan with the Open Agency Architecture (OAA).

A list of core AI safety problems and how I hope to solve them (2023 August)

1.1. First, instead of trying to specify "value", instead "de-pessimize" and specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is harder than merely mitigating extinction risk.

Davidad's Bold Plan for Alignment: An In-Depth Explanation — LessWrong (2023 April)

Deontic Sufficiency Hypothesis: This hypothesis posits that it is possible to identify desiderata that are adequate to ensure the model doesn't engage in undesirable behavior. Davidad is optimistic that it's feasible to find desiderata ensuring safety for a few weeks before a better solution is discovered, making this a weaker approach than solving outer alignment. For instance, Davidad suggests that even without a deep understanding of music, you can be confident your hearing is safe by ensuring the sound pressure level remains below 80 decibels. However, since the model would still be executing a pivotal process with significant influence, relying on a partial solution for decades could be risky.

Getting traction on the deontic feasibility [sic] hypothesis
Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don't die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don't die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.

An Open Agency Architecture for Safe Transformative AI (2022 December)

Deontic Sufficiency Hypothesis: There exists a human-understandable set of features of finite trajectories in such a world-model, taking values in $(-\infty, 0]$, such that we can be reasonably confident that all these features being near 0 implies high probability of existential safety, and such that saturating them at 0 is feasible^[2] with high probability, using scientifically-accessible technologies.
I am optimistic about this largely because of recent progress toward formalizing a natural abstraction of boundaries by Critch and Garrabrant. I find it quite plausible that there is some natural abstraction property $Q$ of world-model trajectories that lies somewhere strictly within the vast moral gulf of
$\text{All Principles That Human CEV Would Endorse} \Rightarrow Q \Rightarrow \text{Don't Kill Everyone}$
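
A minimal sketch of the shape of this claim, under loud assumptions: a "trajectory" is reduced to a list of sound-pressure readings, and the single feature is the 80-decibel check from the hearing example quoted earlier. The names `hearing_feature`, `all_near_zero`, and `tolerance` are hypothetical and do not come from Davidad's posts. The sketch only shows what "features taking non-positive values, all near 0" could look like in code; it says nothing about how a real world-model or real safety desiderata would be specified.

```python
from typing import Callable, Sequence

# Toy stand-in for a world-model trajectory: a sequence of sound-pressure
# readings in decibels. This is an illustrative assumption, not the OAA model.
Trajectory = Sequence[float]
Feature = Callable[[Trajectory], float]

HEARING_LIMIT_DB = 80.0  # threshold taken from the hearing example in the quote


def hearing_feature(trajectory: Trajectory) -> float:
    """Return 0.0 if the loudest reading is within the limit,
    otherwise the negative overshoot (so values are never positive)."""
    loudest = max(trajectory, default=0.0)
    return min(0.0, HEARING_LIMIT_DB - loudest)


def all_near_zero(
    features: Sequence[Feature],
    trajectory: Trajectory,
    tolerance: float = 0.0,
) -> bool:
    """Check that every feature is within `tolerance` of 0 on this trajectory."""
    return all(feature(trajectory) >= -tolerance for feature in features)


quiet_evening = [42.0, 55.5, 61.0]
loud_concert = [70.0, 95.0, 110.0]

print(all_near_zero([hearing_feature], quiet_evening))  # True
print(all_near_zero([hearing_feature], loud_concert))   # False
```

Note that the check never asks whether the music is any good. That is the point of the hypothesis: the safety features are meant to be specifiable without first specifying what is valuable.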

AI Neorealism: a threat model & success criterion for existential safety (2022 December)

For me the core question of existential safety is this:
Under these conditions, what would be the best strategy for building an AI system that helps us ethically end the acute risk period without creating its own catastrophic risks that would be worse than the status quo?
It is not, for example, "how can we build an AI that is aligned with human values, including all that is good and beautiful?" or "how can we build an AI that optimises the world for whatever the operators actually specified?" Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).

How to formalize safety?

If the deontic sufficiency hypothesis is true, there should be an independent/separable way to formalize what “safety” is. This is why I think boundaries/membranes could be helpful for AI safety: See Agent membranes and formalizing “safety”.

Thanks to Jonathan Ng for reviewing a draft of this post and to Alexander Gietelink Oldenziel for encouraging me to post it.

Note that Davidad has ~~not~~ reviewed or verified this post.

3 comments

davidad (4mo)

For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.

RogerDearnaley (4mo)

If we get TAI in the next decade or so, it will almost certainly contain an LLM, at least as a component. Human values are complex and fragile, and we spend a huge amount of our time writing about them: roughly half the Dewey Decimal system consists of many different subfields of "How to Make Humans Happy 101", including virtually all of the soft sciences (Anthropology, Medicine, Ergonomics, Economics…), arts, and crafts. Current LLMs have read tens of trillions of tokens of our content, including terabytes of this material, and as a result even GPT-4 (definitely less than TAI) can do a pretty good job of answering moral questions and commenting on possible undesirable side effects and downsides of plans. So if we have sufficient control of our TAI to ensure that it is extremely unlikely to kill us all, then presumably we can also tell it "also don't do anything that your LLM says is a bad idea or we wouldn't like, at least not without checking carefully with us first", and get a passable take on human values and impact regularization as well. So if we have enough control to block your red arrow, we can also take at least a passable first cut at the green arrow as well. Which by itself probably isn't enough to stand up to many bits of optimization pressure without Goodharting, but is a lot better than ignoring the green arrow entirely. Also, any TAI that can do STEM can understand and avoid Goodharting.

I agree that just not killing everyone is a much easier problem. Consider zoos: the manual for "How Not to Kill Everything in Your Care: The Orangutan Edition" is probably only a few hundred pages or less, and has a significant overlap with the corresponding editions for all of the other primates, including Homo sapiens. However, LLMs can handle datasets vastly larger than that, so this compactness is only relevant if you're trying to add some sort of mathematical or software framework on top of it that can handle small amounts of data, but not terabytes.

agentofuser (2mo)

The more recent Safeguarded AI document has some parts that seem to me to go against the interpretation I had, which seems to go along the lines of this post.

Namely, that davidad's proposal was not "CEV full alignment on AI that can be safely scaled without limit" but rather "sufficient control of AI that is as little more powerful as possible than sufficiently powerful for ethical global non-proliferation".

In other words:

A) "this doesn't guarantee a positive future but buys us time to solve alignment"
B) "a sufficiently powerful superintelligence would blow right through these constraints but they hold at the power level we think is enough for A", thus implying "we also need boundedness somehow".

The Safeguarded AI document says this though:

and that this milestone could be achieved, thereby making it safe to unleash the full potential of superhuman AI agents, within a time frame that is short enough (<15 years) [bold mine]

and

~~and with enough economic dividends along the way (>5% of unconstrained AI’s potential value) [bold mine]~~^[1]

I'm probably missing something, but that seems to imply a claim that the control approach would be resilient against arbitrarily powerful misaligned AI?

A related thing I'm confused about is the part that says:

one eventual application of these safety-critical assemblages is defending humanity against potential future rogue AIs [bold mine]

Whereas I previously thought that the point of the proposal was to create AI powerful-enough and controlled-enough to ethically establish global non-proliferation (so that "potential future rogue AIs" wouldn't exist in the first place), it now seems to go in the direction of Good(-enough) AI defending against potential Bad AI?

^[1]
The "unconstrained AI" in this sentence seems to be about how much value would be achieved from adoption of the safe/constrained design versus the counterfactual value of mainstream/unconstrained AI. My mistake.
The "constrained" still seems to refer to whether there's a "box" around the AI, with all output funneled through formal verification checks on their predicted consequences. It does not seem to refer to a constraint on the "power level" ("boundedness") of the AI within the box.