Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

for now, let's put aside the fact that we can't decide whether we're trying to achieve sponge coordination or FAS, and merely consider what it takes to build an aligned AI — regardless of whether it has the capability of saving the world as a singleton, or merely to be a useful but safe tool.

the question this post is about is: what requirements do we want such a solution to satisfy?

let's say three groups have each built an AI which they think is aligned, and before they press the start button on it, they're trying to convince the other two that their design is safe and leads to good worlds. however, their designs are actually very different from one another.

maybe one is an advanced but still overall conventional text-predicting simulator, another is a clever agentic neural net with reinforcement learning and access to a database and calculator, and the third is a novel kind of AI whose core doesn't really relate to current machine learning technology.

so, they start talking about why they think their AI is aligned. however, they run into an issue: they don't even agree on what it takes to be sure an AI is safe, let alone aligned!

  • maybe one of them has a proof that their AI is resistant to a reasonable class of acausal attacks, another has reasons to think their approach probly avoids the issue altogether somehow, and the third has a model of the world that fails to understand acausal attacks and rejects their possibility altogether.
  • maybe one of them has developed a world-modeling system that is general enough to support embedded agency, another has patched theirs to support it as a special case, and the third thinks their AI will simply modify itself that way because it's instrumentally convergent.
  • maybe one of them has gone out and built a decision system which implements FDT, another counts on CDT turning itself into FDT, and the third has no idea how to determine how their system fits into decision theories.
  • maybe one of them has built something that is confidently eventually aligned and hopefully enough continuously aligned, another has built something that is acting safely now and has a bunch of ad-hoc corrigibility devices which hopefully prevent it from taking a sharp left turn, and the third expects their AI to robustly keep being safe in the long run for reasons that seem hard to understand.
  • maybe one of them has built their AI to respect the values of its creator, another has made the AI care about the part of its model that they believe to be pointing to an abstraction of human values, and the third has an AI that simply takes orders from what it interacts with and can hopefully be ordered to self-modify in a way that makes it resistant to alien superintelligences by the time it meets them.

and those are optimistic cases! many alignment approaches would simply:

  • fail to consider that their design might fail in ways they haven't thought of
  • not think to ask the alignment community at large whether their design is safe
  • ask the community, but then only select criticisms they take into account based on their ability to understand those criticisms, rather than based on their importance
  • assume away the possible failure modes that would be brought up
  • accidentally kill everyone way before any of this happens

i've noticed this pattern of confusion in myself after trying to explain alignment ideas i've found promising to some people, and the nature of their criticism — "wait, where's the part that makes this lead to good worlds? why do you think it would work?" — seems to be of a similar nature to my criticism of people who think "alignment is easy, just do X": the proposal is failing to answer some fundamental concerns that the person proposing has a hard time even conceiving of.

and so, i've come to wonder: given that those people seem to be missing requirements for an alignment proposal, requirements which seem fundamental to me but unknown unknown to them, what requirements are unknown unknown to me? what could i be missing? how do i know which actual requirements i'm failing to satisfy because i haven't even considered them? how do we collectively know which actual requirements we're all collectively missing? what set of requirements is necessary for an alignment proposal to satisfy, and what set is sufficient?
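as a toy illustration of the question above, here is a minimal sketch in python. everything in it is hypothetical: the group names and requirement labels are placeholders made up for the example, not a real taxonomy. it only shows that pooling what several groups have considered reveals each group's individual blind spots, while saying nothing about requirements that nobody has considered.

```python
# toy model: every name below is hypothetical and purely illustrative.

# requirements each group has thought to check (not necessarily satisfied).
considered = {
    "group_a": {"acausal attacks", "embedded agency", "decision theory"},
    "group_b": {"embedded agency", "continuous alignment"},
    "group_c": {"decision theory", "value pointing"},
}


def blind_spots(considered):
    """for each group, the requirements some other group considers but it does not."""
    pooled = set().union(*considered.values())
    return {name: pooled - reqs for name, reqs in considered.items()}


spots = blind_spots(considered)
for name in sorted(spots):
    print(name, sorted(spots[name]))

# the true set of requirements is unknown: anything outside the pooled set
# stays invisible to this comparison, no matter how many groups take part.
```

the comparison can only ever surface requirements that at least one participant already knows about, which is why the collective version of the question is the harder one.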

it feels like there ought to be a general principle that covers all of this. the same way that the logical induction paper demonstrates that the computability desideratum and the "no dutchbook" desideratum together suffice to satisfy ten other desiderata about logical inductors, it seems like a simple set of desiderata ought to capture the true name of what it means for an AI to lead to good worlds. but this isn't guaranteed, and i don't know that we'll find such a thing in time, or that we'll have any idea how to build something that satisfies those requirements.


let's say three groups have each built an AI which they think is aligned, and before they press the start button on it, they're trying to convince the other two that their design is safe and leads to good worlds.

So this is something I've thought and written about - and I think it actually has a fairly obvious answer: the only reliable way to convince others that a system works is to show extensive actual evidence of the . . . system working. Simple, right?

Nobody believed the Wright brothers when they first claimed they had solved powered flight (if I recall correctly they had an initial failed press conference where no reporters showed up) - it required successful test demonstrations.

How do you test alignment? How do you demonstrate successful alignment? In virtual simulation sandboxes.

PS: Is there a reason you don't use regular syntax (ie capitalization of first letters of sentences)? (just curious)

In virtual simulation sandboxes

forgive me for not reading that whole post right now, but i suspect that an AI may:

  • act as we'd want but then sharp left turn once it reaches capabilities that the simulation doesn't have enough compute for but reality does
  • need more compute than it has access to in the simulation before it starts modifying the world significantly, so we can only observe it being conservative and waiting to get more compute
  • act nice because it detects it's in a simulation, eg by noticing that the world it's in has nowhere near the computational capacity for agents that would design such an AI
  • commit to acting nice for a long time and then turn evil way later regardless of whether it's a computation or not, just in case it's in a simulation

i believe that "this computation is inside another computation!" is very plausibly an easily guessable idea for an advanced AI. if you address these concerns in your post, let me know and i'll read it in full (or read the relevant sections) and comment on there.

as for my writing style, see here.

I guess I need to write a better concise summary.

i believe that "this computation is inside another computation!" is very plausibly an easily guessable idea for an advanced AI.

As stated very early in the article, the AIs in simboxes do not have the words for even precursor concepts of 'computation' and 'simulation', as their knowledge is intentionally constrained and shaped to early or pre-civ level equivalents. The foundational assumption is brain-like AGI in historical sims with carefully crafted/controlled ontologies.

What you are looking for sounds very much like Vanessa Kosoy's agenda (formal guarantees, regret bounds). Best post I know explaining her agenda. If you liked logical induction, definitely look into Infrabayesianism! It's very dense, so I would recommend starting with a short introduction, or just looking for good stuff under the infrabayesianism tag. The current state of affairs is that we don't have these guarantees yet, or at least only with unsatisfactory assumptions.

i am somewhat familiar with vanessa's work, and it contributed to inspiring this post of mine. i'd like to understand infrabayesianism better, and maybe that intro will help with that, thanks!

What you are looking for sounds very much like Vanessa Kosoy's agenda

As it so happens, the author of the post also wrote this overview post on Vanessa Kosoy's PreDCA protocol.

Oops! Well, I did not carefully read the whole post to the end and that's what you get! Ok second try after reading the post carefully:

it seems like a simple set of desiderata ought to capture the true name of what it means for an AI to lead to good worlds.

I think I have been thinking something similar, and my best description of this desideratum is pragmatism. Something like "use a prior that works" in the worlds where we haven't already lost. It's easy to make toy models where alignment will be impossible. -> regret bounds for some prior where I don't know what it looks like yet.

Lab #3 either has some wild alien intelligence on hand which is so far advanced that it blows everything out of the water, or they are just faking it. XD

i didn't mean to imply that they were each 1/2/3; i was intending each statement to apply to all three labs but in unspecified order (so lab 1 might be the second in one statement and the third in another statement)

I know, I was just teasing.