Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for

Abstract: The AGI alignment problem has a bimodal distribution of outcomes, with most outcomes clustering around the poles of total success and existential, catastrophic failure. Consequently, attempts to solve AGI alignment should, all else equal, prefer false negatives (ignoring research programs that would have been successful) to false positives (pursuing research programs that will unexpectedly fail). Thus, we propose adopting a policy of responding to points of metaphysical and practical uncertainty associated with the alignment problem by limiting and choosing necessary assumptions to reduce the risk of false positives. Herein we explore in detail some of the relevant points of uncertainty that AGI alignment research hinges on and consider how to reduce false positives in response to them.

If you've been following along, I've been working toward a particular end for the past couple of months, and that end is this paper. It's currently under review for journal publication, but you can read the preprint now! This marks the first in what I expect to be several papers exploring and explaining my belief that we can better figure out how to solve alignment via phenomenology and philosophical investigation, because there are key questions at the heart of alignment that are poorly examined and not well grounded. This paper is intentionally conservative in its methods since it's the first (you'll notice that, aside from a few citations, I stay within the analytic philosophical tradition), and I believe this is more compelling to my target audience of AI researchers, but later papers may make more direct use of phenomenological methods.

It's also the soft launch of the Phenomenological AI Safety Research Institute so that there's a place to work on these ideas. We have no money, but if you're interested in this line of research I'd be happy to talk to you about potential collaborations or research projects we need help with.

1 comment:

The most obvious way to escape "false positives" is not to create AI at all, and in any other reasonable field this would be the correct solution. For example, if some organisation wanted to build a reactor with an unmeasurable probability of catastrophic failure, no commission would ever allow it to be built.

But, as we all here understand, if aligned AI is not created and not used to stop the AI race, another, unsafe AI will eventually appear. Thus the only thing for which we really need aligned AI is to stop other AI projects. In other words, we need aligned AI only as a weapon (or, to put it in more socially acceptable terms, to perform "pivotal acts").

If we need aligned AI only as a weapon to stop other AI projects, the good question is: maybe we have other, safer instruments for stopping those projects. One such hypothetical instrument is the use of narrow AI for global domination and policing, which I explored in my previous post. There could be other possible instruments; none of them is perfect, but they may have a high probability of working.