Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Edit February 2024: Now I think maybe we can't do better.

Lately, "alignment" means "follow admin rules and do what users mean". Admins can put in rules like "don't give instructions for bombs, hacking, or bioweapons" and "don't take sides in politics". As AI gets more powerful, we can use it to write better rules and test for loopholes. And "do what users mean" implies some low-impactfulness: it won't factory reset your computer when you ask it to fix the font size.

In other words, the plan has two steps:

  1. Make a machine that can do literally anything
  2. Configure it to do good things

I don't like this plan very much because it puts power in the hands of the few and has big security liabilities.

Temptation

Imagine you're on the team of three people that writes the constitution for the training run for the gigaAI1. Or you're the one actually typing it in. Or you do infra and write all the code that strings pass through on the way to and from the training cluster.

Might you want to take advantage of the opportunity? I can think of a few subtle things I might add.

If I think about the people in my life I know well, I can only name one that I would nominate for the constitution writing position.

One solution I've heard for this is to use the powerful DWIM (do-what-I-mean) machine to construct a second, unalterable machine, then delete the first machine. You might call the time between creating the first and second machine the temptation mountain (in a plot where X is time and Y is the maximum power of any individual person). Then you want to minimize the area of this mountain. It is difficult to shrink the mountain very much, though.
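As a rough sketch of what "minimizing the mountain" could mean, here is one way to write it down. All the symbols are assumptions introduced for illustration: $P(t)$ is the maximum power of any individual person at time $t$, $P_0$ is the baseline before the first machine exists, $t_1$ is when the first machine is created, $t_2$ is when the second machine is finished and the first is deleted, and $P_{\max}$ is the peak of $P(t)$ in between.

$$
M \;=\; \int_{t_1}^{t_2} \bigl(P(t) - P_0\bigr)\,dt \;\le\; (t_2 - t_1)\,\bigl(P_{\max} - P_0\bigr)
$$

Under this reading there are only two levers: shorten the window $t_2 - t_1$, or cap the peak $P_{\max}$. Neither looks easy to move by much.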

Security risks

If you're 90% through your DWIM AI development and it seems very promising, then you must be very careful with your code and data. If someone else gets your secrets, then they can make it do what they mean, and they might mean for it to do something quite different. You should be careful with your team members too, since threats and blackmail and bribes are all quite cheap and effective. (It's hard to be careful with them, though, since they put "on constitution AI team" on their LinkedIn profiles, their phone numbers are practically public information, and they live in a group house with employees from your competitor.)

If you're doing open source AI development and start having big successes, then you might start looking at the GitHub profiles of your stargazers and wonder what they mean for it to do.

If altLab is also getting close to AGI then you have to figure out whether your stop-clause means you help them or their stop-clause means they help you.

Doing better

Could we come up with a plan so rock solid that we'd be comfortable giving the code, the data, or the team to anyone? Like um how you could give Gandhi to Hitler without stressing.

For lack of a better term[1], let's call such a system inextricably kind AI. The key innovation required might be a means to strongly couple capabilities and benevolence. This is in contrast to the current regime where we build Do Anything Now then tell it to do good.

Such an innovation would unblock coordination and ease security burdens, and not require the board of directors & the engineers to be saints. It would also give the open source community something to work on that doesn't give AI safety folks gray hairs.

To be clear, I don't have a good idea for how to achieve this. I'm not certain it is even possible.

However, it's not obviously impossible to me. And I would be much more at ease if all the AI labs were building inextricably kind AI instead of something that can be stolen and directed to arbitrary goals.[2]

Let's at least give it a shot.


  1. please let me know if there's a better or pre-existing term ↩︎

  2. Apple has a pretty serious security team but they can't seem to help getting JS and kernel exploits on 3 billion phones and computers every few weeks ↩︎

8 comments

I agree that it would be nice to get to a place where it is known (technically) how to make a kind AGI, but nobody knows how to make an unkind AGI. That's what you're saying, right? If so, yes that would be nice, but I see it as extraordinarily unlikely. I’m optimistic that there are technical ways to make a kind AGI, but I’m unaware of any remotely plausible approach to doing so that would not be straightforwardly modifiable to turn it into a way to make an unkind AGI.

It is just as ambitious/implausible as you say. I am hoping to get out some rough ideas in my next post anyways.

I don't see any way that it is possible to entirely avoid the 'mountain of temptation' to get to an inherently kind AI that can't be altered even by its creators. Maybe there is such a path, if you set up siloed teams of developers each working on separate parts of the system, and set up lots of oversight. I can't imagine this being anywhere near competitive with a strategy that doesn't do this.

So, I think that this is a lovely dream which is simply infeasible. I made a prediction market which tries to point out this infeasibility: 

I like the way you've operationalized the question

In other words, the plan has two steps:

  1. Make a machine that can do literally anything
  2. Configure it to do good things

I don't think this is the plan? The hope is that, as capabilities grow, so does alignment, whatever this "alignment" thing is. The reality is different, of course.

Edited post to rename "intrinsically aligned AI" to "intrinsically kind AI" for clarity. As I understand it, the hope is to develop capability techniques and control techniques in parallel. But there's no major plan I know of to have a process for developing capabilities that are hard-linked to control/kindness/whatever in a way you can't easily remove. (I have heard an idea or two though and am planning on writing a post about it soon.)

In some sense it already happens. When we train AI on more and more human-generated text, it gets both more capabilities and more alignment.

Yes, it does become easier to control and communicate with, but it does not become harder to make it malicious. I'm not sure that an AI scheme that can't be trivially turned evil reverso is possible, but I would like to try to find one.