I'm interested in working on alignment, coming from a programmer/Haskell background. I have two ideas that are unlikely to be info hazards, which I want to post here to get feedback and pointers to prior work.

Both are very rough/early stage; view this as me presenting a pre-MVP to get some very early feedback.

Idea 1 - "Love is that which enables choice" (Inspired by Forrest Landry)

This is an idea for a potential goal/instruction for an AI (i.e. something like an objective or utility function). The idea is to make an AI that optimizes for optionality: maximizing the total sum of agency across all human and non-human agents. Agency is here loosely defined as "the optional ability to make changes to the world".

Making it the sum total would discourage situations where one person's ability to effect change hampers someone else's.
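To make the "sum total" point concrete, here is a minimal toy sketch. Everything in it (the agent names, the idea of counting available options as a proxy for agency) is a hypothetical illustration I'm adding, not part of the original proposal:

```python
# Toy sketch of the summed-agency objective. Agency is crudely modeled
# as the number of distinct actions still available to an agent; the
# objective is the total across all agents.

def total_agency(options_per_agent):
    """Sum of each agent's available options (a crude proxy for agency)."""
    return sum(options_per_agent.values())

# World A: one agent grabs power, shrinking everyone else's option set.
world_a = {"alice": 10, "bob": 1, "carol": 1}
# World B: options are distributed more evenly.
world_b = {"alice": 5, "bob": 4, "carol": 4}

# Under the summed objective, B is preferred even though A maximizes
# a single agent's agency.
assert total_agency(world_b) > total_agency(world_a)
```

The point of the sketch is only that summing over agents makes "one agent expands its options by destroying everyone else's" score worse than a more even distribution; how to actually measure agency is the hard open problem.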

Idea 2 - Segmented gradient descent training optimized for collaboration between different agents

This is an idea for a potential training method that I think may have a big attractor basin for collaborative traits. The idea is to have some kind of gradient descent-esque training where AI agents of varying calibres/types are put in training scenarios in which a premium is put on collaboration. This is run over multiple iterations, where agents that successfully collaborate with other agents continue to the next iteration.
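The iterated selection loop above can be sketched as follows. Note this is closer to an evolutionary selection loop than true gradient descent, and every number in it (population size, payoff shape, mutation scale) is an assumption I'm making purely for illustration:

```python
import random

# Minimal sketch of the selection loop. Each agent is reduced to a single
# "cooperation propensity" p in [0, 1]. Each iteration, agents are paired,
# earn a payoff that rewards mutual collaboration, the top half survive,
# and mutated copies of the survivors fill the next generation.

def payoff(p_self, p_other):
    # Collaboration pays off most when both parties cooperate;
    # cooperating carries a small unilateral cost.
    return 3.0 * p_self * p_other - 0.5 * p_self

def run_selection(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        rng.shuffle(pop)
        scored = []
        for a, b in zip(pop[0::2], pop[1::2]):
            scored.append((payoff(a, b), a))
            scored.append((payoff(b, a), b))
        scored.sort(reverse=True)
        survivors = [p for _, p in scored[: pop_size // 2]]
        # Survivors reproduce with small mutations, clipped to [0, 1].
        children = [min(1.0, max(0.0, p + rng.gauss(0, 0.05)))
                    for p in survivors]
        pop = survivors + children
    return sum(pop) / len(pop)

# The population's mean cooperation propensity drifts upward.
print(run_selection())
```

Under this payoff, cooperating is individually profitable whenever your partner is even mildly cooperative, so the population mean climbs toward 1 over the generations; the open question is whether anything like this survives contact with agents rich enough to fake cooperation.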

The hardest thing about this is that we want an AI that is cooperative, but not naive: a naive AI could be talked into doing harmful things by bad actors. We could try to model this on human (cultural/biological) evolution.
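One standard way to frame "cooperative but not naive" is the iterated prisoner's dilemma, where conditional cooperators outperform unconditional ones against exploiters. The sketch below uses the conventional payoffs (T=5, R=3, P=1, S=0), which are my illustrative choice, not from the post:

```python
# Why selection should favor discriminating rather than naive cooperation:
# an unconditional cooperator is exploited every round by a defector,
# while a tit-for-tat agent only loses the first round.

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def naive(history_self, history_other):
    return "C"  # always cooperates, exploitable

def tit_for_tat(history_self, history_other):
    # Cooperate first, then mirror the opponent's last move.
    return history_other[-1] if history_other else "C"

def defector(history_self, history_other):
    return "D"  # never cooperates

def play(strat_a, strat_b, rounds=10):
    ha, hb, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(ha, hb), strat_b(hb, ha)
        pa, pb = PAYOFF[(a, b)]
        ha.append(a); hb.append(b)
        score_a += pa; score_b += pb
    return score_a, score_b

naive_score, _ = play(naive, defector)      # exploited every round: 0 points
tft_score, _ = play(tit_for_tat, defector)  # loses only round one: 9 points
assert tft_score > naive_score
```

If the training scenarios in Idea 2 include exploitative agents, selection pressure of this kind should favor conditional cooperation over naivety, which is roughly the property we want.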

One thing I like about this idea is that it might lead to AI that develops behavioural patterns akin to those found in herd animals (including humans). This would make the AI easier to reason about, and more likely to develop something akin to ethical behaviour.



I like the first idea better than the second

New to LessWrong?