Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This question is about whether you have clever ideas about how to use AI imitations of humans for AI safety. The two main ideas I'm familiar with only seem to interface with these imitations as if they're humans.

  • The most obvious thing one might do with a good predictor of a human is just to write software that queries the imitation human about what the right thing to do is, and then does it.
  • The less obvious thing to do is to try and amplify it - e.g. use teams of them working together to try to choose good actions. Or maybe even an IDA loop - use your learner that learned to imitate a human, and train it to imitate the teams working together. Then make teams of teams, etc.
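
To make the second bullet concrete, here is a toy sketch of an amplify/distill loop. It rests on loud assumptions: the "human imitation" is a stand-in function that can only sum lists of at most two numbers and returns None when asked anything bigger (a crude proxy for going off-distribution), and `distill` is a memoizing placeholder rather than real training. All names (`make_base_imitation`, `amplify`, `distill`, `ida`) are hypothetical and only illustrate the shape of the loop.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A policy maps a question (here: a list of ints to sum) to an answer,
# or None if the question is beyond what it can handle.
Policy = Callable[[List[int]], Optional[int]]


def make_base_imitation(max_len: int = 2) -> Policy:
    """Hypothetical stand-in for a learned human imitation.

    It only answers short questions and gives up (None) otherwise.
    """
    def policy(xs: List[int]) -> Optional[int]:
        if len(xs) > max_len:
            return None
        return sum(xs)
    return policy


def amplify(policy: Policy) -> Policy:
    """Team of copies: split the question, delegate halves, combine."""
    def team(xs: List[int]) -> Optional[int]:
        direct = policy(xs)
        if direct is not None:
            return direct
        mid = len(xs) // 2
        left, right = policy(xs[:mid]), policy(xs[mid:])
        if left is None or right is None:
            return None
        # Combining two partial answers is itself a short question.
        return policy([left, right])
    return team


def distill(team: Policy) -> Policy:
    """Placeholder for training a fast learner to imitate the team.

    Assumption: a real system would fit a model on (question, team answer)
    pairs; here we only cache the team's answers.
    """
    cache: Dict[Tuple[int, ...], Optional[int]] = {}

    def student(xs: List[int]) -> Optional[int]:
        key = tuple(xs)
        if key not in cache:
            cache[key] = team(xs)
        return cache[key]
    return student


def ida(base: Policy, rounds: int) -> Policy:
    """Repeat amplify-then-distill: teams, then teams of teams, etc."""
    policy = base
    for _ in range(rounds):
        policy = distill(amplify(policy))
    return policy
```

In this toy, each round doubles the question size the policy can handle, so `ida(make_base_imitation(), 2)` can sum eight numbers while the base imitation still refuses three. Nothing here addresses whether a real imitation stays on-distribution when placed in such a team.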

But can we use human imitations to increase the effectiveness of value learning in a way other than amplification/distillation? For example, is there some way of leveraging queries to human imitations to train a non-human AI that has a human-understandable way of thinking about the world?

Keep in mind the challenge that these are only imitation humans, not oracles for the best thing to do, and not even actual humans. So we can't give them problems that are too weird, or heavily optimized by interaction with the imitation humans, because they'll go off-distribution.

Another possible avenue is ways to "look inside" the imitation humans. One analogy would be how, if you have an image-generating GAN, you can increase the number of trees in your image by finding the parameters associated with trees and then turning them up. Can you do the same thing with a human-imitating GAN, but turning up "act morally" or "be smart"?
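
As a rough illustration of the tree analogy, the sketch below uses a toy "generator" whose hidden units are just rectified latent coordinates, finds the unit whose activation correlates most with a tree count, and then scales that unit up. Everything here is an assumption made for illustration: the generator, the `tree_count` readout, and the function names are hypothetical, and the sketch presumes we already have a scorer for the attribute, which is exactly what is missing for something like "act morally".

```python
import random
from typing import List, Optional


def hidden_units(z: List[float]) -> List[float]:
    """Toy generator layer: one rectified unit per latent coordinate."""
    return [max(0.0, x) for x in z]


def tree_count(h: List[float]) -> float:
    """Toy readout: unit 1 mostly controls trees (hypothetical weights)."""
    return 0.2 * h[0] + 3.0 * h[1]


def generate(z: List[float], unit: Optional[int] = None,
             gain: float = 1.0) -> float:
    """Run the toy generator, optionally scaling one hidden unit."""
    h = hidden_units(z)
    if unit is not None:
        h[unit] *= gain
    return tree_count(h)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation, returning 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def find_attribute_unit(n_samples: int = 500, seed: int = 0) -> int:
    """Sample latents and pick the unit most correlated with the attribute."""
    rng = random.Random(seed)
    zs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n_samples)]
    hs = [hidden_units(z) for z in zs]
    scores = [tree_count(h) for h in hs]
    corrs = [pearson([h[i] for h in hs], scores) for i in range(3)]
    return max(range(3), key=lambda i: corrs[i])
```

For example, `generate(z, unit=find_attribute_unit(), gain=3.0)` yields more "trees" than `generate(z)` whenever the selected unit is active for that `z`. Whether a human-imitating model has units that cleanly correspond to an attribute like "be smart" is an open question that this toy does not answer.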

1 Answer

Gurkenglas

Sep 28, 2020

It sounds like you want to use it as a component for alignment of a larger AI, which would somehow turn its natural-language directives into action. I say use it as the capability core: Ask it to do armchair alignment research. If we give it subjective time, a command line interface and internet access, I see no reason it would do worse than the rest of us.

In retrospect, I was totally unclear that I wasn't necessarily talking about something that has a complicated internal state, such that it can behave like one human over long time scales. I was thinking more about the "minimum human-imitating unit" necessary to get things like IDA off the ground.

In fact this post was originally titled "What to do with a GAN of a human?"

Gurkenglas (4y)
I don't think you need a complicated internal state to do research. You just need to have read enough research and math to have a good intuition for what definitions, theorems and lemmas will be useful. When I try to come up with insights, my short-term memory context would easily fit into GPT-3's window.
Charlie Steiner (4y)
I feel like my state is significantly more complicated than that. I smoothly accumulate short-term memory and package some of it away into long-term memory, which even more slowly gets packaged away into longer-term memory. GPT-3's window size would run out the first time I tried to do a literature search and read a few papers, because it doesn't form memories so easily. The way actual GPT-3 (or really anything with limited state but lots of training data, I think) gets around this sort of thing is by already having read those papers during training, plus lots of examples of people reacting to papers, and then using context to infer that it should output words that come from someone at a later stage of paper-reading. Do you foresee a different, more human-like model of humans becoming practical to train?
Gurkenglas (4y)
Misunderstanding: You are talking about literature research, which I do see as part of training. I am talking about original research, which at its best consists of prompts like "This oneliner construction from these four concepts can be elegantly modeled using the concept of ". The results would of course be integrated into long-term memory using fine-tuning.
1 comment

You could try to infer human values from the "sideload" using my "Conjecture 5" about the AIT definition of goal-directed intelligence. However, since it's not an upload and, like you said, it can go off-distribution, that doesn't seem very safe. More generally, alignment protocols should never be open-loop.

I'm also skeptical about IDA, for reasons not specific to your question (in particular, this), but making it open-loop is worse.

Gurkenglas' answer seems to me like something that can work, if we can somehow be sure the sideload doesn't become superintelligent, for example, given an imitation plateau.