Crucial context for this post is in «Boundaries» and AI safety compilation.

I’d like to make it common knowledge how, specifically, «boundaries» could apply to alignment.

In this post, I will focus on Davidad's conception, namely its role in his Open Agency Architecture alignment paradigm.

My read is that the hope is for «boundaries» to be used to formalize a sort of bare-bones morality[1] for the (~)first AGI.

Update: Davidad left a comment:

I think [this post] paints a fine picture of my current best hope for deontic sufficiency.

Explanation

In an ideal future, we would get CEV alignment in the first AGI. However, this seems really hard, and it might be easier to get AI x-risk off the table first (thus ending the "acute risk period"), and then figure out how to do the rest of alignment later.

In which case, we don't actually need the first AGI to understand all of human values/ethics— we only need it to understand a bare-bones subset.

But which subset? And how could it be formalized in a consistent manner?

This is where the concept of «boundaries» comes in, for it has two nice properties:

  1. «boundaries» explain what's bad about a bunch of actions whose badness is otherwise difficult to explain.
  2. «boundaries» seem unusually tractable to formalize algorithmically.

The hope, then, is that the «boundaries» concept could be formalized into a sort of bare-bones morality that could be used in the first AGI. 

Put another way: the most ideal outcome is that formalizing «boundaries» results in a framework that looks like deontology at face value, but is also logically consistent and not based on arbitrary rules.

Also, one way «boundaries» could be implemented concretely (though not necessarily the only way) is to task the first AGI with minimizing the occurrence of «boundary» violations for its citizens (a.k.a. a "night watchman"; see below).

Quotes from Davidad that support this view

Davidad tweeted in 2022 Aug: 

Post-acute-risk-period, I think there ought to be a “night watchman Singleton”: an AGI which technically satisfies Bostrom’s definition of a Singleton, but which does no more and no less than ensuring a baseline level of security for its citizens (which may include humans & AIs).

(Note: All bolding in the quotes in this post is mine.)

Next tweet:

If and only if a night-watchman singleton is in place, then everyone can have their own AI if they want. The night-watchman will ensure they can’t go to war. The price of this is that if the night-watchman ever suffers a robustness failure it’s game 

And later in the thread:

The utility function of a night-watchman singleton is the minimum over all citizens of the extent to which their «boundaries» are violated (with violations being negative and no violations being zero) and the extent to which they fall short of baseline access to natural resources
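
To make the shape of this utility function concrete, here is a minimal Python sketch. It is one possible reading of the tweet, not Davidad's specification. The names `CitizenState`, `citizen_score`, and `night_watchman_utility` are hypothetical. Both quantities are assumed to be already measured as numbers that are 0 when all is well and negative otherwise, and the two terms are assumed to be combined per citizen by taking the worse of the two (the tweet leaves the exact combination open).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CitizenState:
    """Hypothetical per-citizen measurements; both fields are <= 0."""
    name: str
    boundary_violation: float  # 0.0 = no violation, negative = violated
    resource_shortfall: float  # 0.0 = baseline met, negative = below baseline


def citizen_score(citizen: CitizenState) -> float:
    """Worse of the two terms for one citizen (an assumed combination rule)."""
    if citizen.boundary_violation > 0 or citizen.resource_shortfall > 0:
        raise ValueError("both terms must be <= 0 (0 means nothing is wrong)")
    return min(citizen.boundary_violation, citizen.resource_shortfall)


def night_watchman_utility(citizens: Iterable[CitizenState]) -> float:
    """Minimum score over all citizens; 0.0 is the best achievable value."""
    # With no citizens there is nothing to violate, so return the maximum, 0.0.
    return min((citizen_score(c) for c in citizens), default=0.0)


citizens = [
    CitizenState("alice", boundary_violation=0.0, resource_shortfall=0.0),
    CitizenState("bob", boundary_violation=-2.0, resource_shortfall=-0.5),
]
print(night_watchman_utility(citizens))  # -2.0
```

Because the utility is a minimum rather than a sum, it is determined entirely by the worst-off citizen: improvements for one citizen cannot offset a violation against another.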

Davidad in AI Neorealism: a threat model & success criterion for existential safety (2022 Dec):

For me the core question of existential safety is this:

It is not, for example, "how can we build an AI that is aligned with human values, including all that is good and beautiful?" or "how can we build an AI that optimises the world for whatever the operators actually specified?" Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).

Davidad in An Open Agency Architecture for Safe Transformative AI (2022 Dec):

  • Deontic Sufficiency Hypothesis: There exists a human-understandable set of features of finite trajectories in such a world-model, taking values in (−∞, 0], such that we can be reasonably confident that all these features being near 0 implies high probability of existential safety, and such that saturating them at 0 is feasible[2] with high probability, using scientifically-accessible technologies.
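
As a rough illustration of what "all these features being near 0" could mean operationally, the sketch below checks a set of feature values against a tolerance. Everything here is an assumption made for illustration: the feature names, the `tolerance` parameter, and the function `all_features_near_zero` are hypothetical, and features are assumed to be non-positive numbers with 0 as the best achievable value. Computing such features from a world-model is the hard part and is not attempted here.

```python
from typing import Mapping


def all_features_near_zero(features: Mapping[str, float], tolerance: float) -> bool:
    """Return True if every feature lies within `tolerance` of 0.

    Assumes each feature is <= 0, where 0 means "no problem detected".
    """
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    for name, value in features.items():
        if value > 0:
            raise ValueError(f"feature {name!r} must be <= 0, got {value}")
    return all(value >= -tolerance for value in features.values())


# Hypothetical feature values for two trajectories.
safe_trajectory = {"boundary_violations": 0.0, "resource_shortfall": -0.001}
unsafe_trajectory = {"boundary_violations": -0.5, "resource_shortfall": 0.0}

print(all_features_near_zero(safe_trajectory, tolerance=0.01))    # True
print(all_features_near_zero(unsafe_trajectory, tolerance=0.01))  # False
```

The hypothesis itself is the stronger claim that such a feature set exists, is human-understandable, and that passing a check of this kind implies a high probability of existential safety.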

Also see this tweet from Davidad in 2023 Feb:

In the situation where new powerful AIs with alien minds may arise (if not just between humans), I believe that a “night watchman” which can credibly threaten force is necessary, although perhaps all it should do is to defend such boundaries (including those of aggressors).

Further explanation of the OAA's Deontic Sufficiency Hypothesis in Davidad's Bold Plan for Alignment: An In-Depth Explanation (2023 Apr) by Charbel-Raphaël and Gabin:

Getting traction on the deontic feasibility hypothesis

Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don't die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don't die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.
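
For readers unfamiliar with the term, the standard statistical notion of a Markov blanket can be stated as a conditional independence. The symbols below ($I$ for internal states, $E$ for external states, $B$ for the blanket) are notation chosen here for illustration and are not taken from Davidad's writing. How this notion should map onto «boundaries» at various levels of a world-model is exactly what remains to be formalized.

```latex
I \perp E \mid B
\quad\Longleftrightarrow\quad
p(i, e \mid b) = p(i \mid b)\, p(e \mid b)
\quad \text{for all } i, e, b \text{ with } p(b) > 0 .
```

Informally, once the blanket states are known, the inside and the outside carry no further information about each other, so any influence between them must pass through $B$. On this reading, the cancer example in the quote concerns a process among the internal states, whereas violence is an influence that crosses the blanket.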

Also: 

  • (*) Elicitors: Language models assist humans in expressing their desires using the formal language of the world model. […] Davidad proposes to represent most of these desiderata as violations of Markov blankets. Most of those desiderata are formulated as negative constraints because we just want to avoid a catastrophe, not solve the full value problem. But some of the desiderata will represent the pivotal process that we want the model to accomplish.

(The post also explains that the "(*)" prefix means "Important", as distinct from "not essential".)

This comment by Davidad (2023 Jan):

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility. 

From Reframing inner alignment by Davidad (2022 Dec):

I'm also excited about Boundaries as a tool for specifying a core safety property to model-check policies against—one which would imply (at least) nonfatality—relative to alien and shifting predictive ontologies.


Other notes

  • While I'm extremely excited about «boundaries» myself, I disagree with the "night watchman" implementation and I think there might be a better way. I will address this in one of my next posts.
  • Context on this post can be found in «Boundaries» and AI safety compilation
[1] Note: "bare-bones morality" is a term that I coined in this post. (It is not from Davidad.)

Davidad's comment on this post:

Thanks for bringing all of this together - I think this paints a fine picture of my current best hope for deontic sufficiency. If we can do better than that, great!
