
Tune AGI intelligence by easy goals

If an AGI is given an easily solvable utility function ("fetch a coffee"), it lacks the incentive to self-improve indefinitely. The fetch-a-coffee AGI only needs to become as smart as a hypothetical simple-minded waiter. By choosing how easy a utility function is to satisfy, we could therefore tune the intelligence level an AGI reaches through self-improvement. The only way to get an indefinite intelligence explosion (up to, e.g., material limits) would be to program a utility function that maximizes something. That type of utility function would therefore be the most dangerous.
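The distinction being drawn here is essentially satisficing versus maximizing. A minimal toy sketch (all names and numbers hypothetical, capability modeled as a single scalar) of why a bounded goal has a natural stopping point and a maximizing goal does not:

```python
# Toy sketch (hypothetical): a bounded "easy" goal vs. an open-ended maximizer.
# The satisficer stops investing in capability once its goal is met; the
# maximizer has no stopping point short of external limits.

def satisficer(goal_difficulty: float) -> float:
    """Self-improve only until the goal is achievable, then stop."""
    capability = 0.0
    while capability < goal_difficulty:
        capability += 1.0  # one round of self-improvement
    return capability

def maximizer(rounds: int) -> float:
    """A 'maximize X' objective: every extra unit of capability adds utility."""
    capability = 0.0
    for _ in range(rounds):
        capability += 1.0  # never satisfied; bounded only by resources
    return capability

print(satisficer(3.0))   # stops as soon as it is "smart enough" for the task
print(maximizer(1000))   # grows for as long as resources allow
```

Of course, the thread below argues that real-world goals are never this cleanly bounded, which is exactly where the sketch breaks down.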

Could we create AI safety by prohibiting maximizing-type utility functions? Could we safely experiment with AGIs just a little smarter than us, by using moderately hard goals?

The hard part is that the real world is complicated and setting goals that truly have no incentive for self-improvement or gaining power is an unsolved problem.

Relevant Rob Miles video.

One could use artificial environments that are less complicated, and of course we do, but it seems like this leaves some important problems unsolved.

Thanks for your insights. I don't really understand 'setting [easy] goals is an unsolved problem'. If you set the goal "tell me what 1+1 is", isn't that possible? And once it's completed ("2!"), the AI would stop self-improving, right?

I think this may contribute only a tiny piece of the puzzle, though, because there will always be someone setting a complex or, worse, unachievable goal ("make the world a happy place!"), and boom, there's your existential risk again. But in a hypothetical situation where you have your AGI in the lab, no one else has one, and you want to experiment safely, I guess easy goals might help?

Curious about your thoughts, and also, I can't imagine this is an original idea. Any literature already on the topic?

Suppose I get hit by a meteor before I can hear your "2" - will you then have failed to tell me what 1+1 is? If so, suddenly this simple goal implies being able to save the audience from meteors. Or suppose your screen has a difficult-to-detect short circuit - your expected utility would be higher if you could check your screen and repair it if necessary.

Because a utility maximizer treats a 0.09% improvement over a 99.9% baseline just as seriously as it treats a 90% improvement over a 0% baseline, it doesn't see these small improvements as trivial, or in any way not worth its best effort. If your goal actually has some chance of failure, and there are capabilities that might help mitigate that failure, it will incentivize capability gain. And because the real world is complicated, this seems like it's true for basically all goals that care about the state of the world.
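The linearity point above can be checked with one line of arithmetic: for an agent maximizing expected utility with payoff U on success, the value of raising the success probability from p to p' is (p' − p)·U, independent of the baseline. A minimal check (the function name and the utility value are illustrative, not from the original comment):

```python
# Expected utility is linear in success probability, so a maximizer values a
# probability gain from 99.9% to 99.99% exactly as much as an equal-sized
# gain from a near-zero baseline: delta_EU = (p_new - p_old) * U.

U = 1.0  # utility of success (arbitrary units)

def delta_eu(p_old: float, p_new: float, u: float = U) -> float:
    """Change in expected utility from raising the success probability."""
    return (p_new - p_old) * u

# Same-sized probability gains are worth the same, whatever the baseline:
print(delta_eu(0.999, 0.9999))  # near-certain baseline
print(delta_eu(0.0, 0.0009))    # hopeless baseline: same value
```

This is why "just a 0.09% improvement" is never trivial from the maximizer's point of view: its objective contains no notion of "good enough".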

If we have a reinforcement learner rather than a utility maximizer with a pre-specified model of the world, this story is a bit different, because of course there will be no meteors in the training data. You might think this means the RL agent cannot care about meteors, but this is actually somewhat undefined behavior, because the AI still gets to see observations of the world. Vanilla RL with no "curiosity" won't start to care about the world until the world actually affects its reward (for meteors, that will take much too long to matter, but it does become important when the reward is more informative about the real world). But if it's more along the lines of DeepMind's game-playing agents, it will try to find out about the world, which will increase its rate of approaching optimal play.

There are definitely ideas in the literature that relate to this problem, particularly trying to formalize the notion that the AI shouldn't "try too hard" on easy goals. I think these attempts mostly fall under two umbrellas - other-izers (that is, not maximizers) and impact regularization (penalizing the building of meteor-defense lasers).
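The impact-regularization idea mentioned above can be sketched in a few lines: subtract a penalty proportional to how much the agent changes the world. This is an assumed toy form, not the formulation of any specific paper, and the numbers are purely illustrative:

```python
# Toy sketch of impact regularization (assumed form): the agent's score is
# task reward minus lambda times a measure of how much it changed the world.
# Building a meteor-defense laser raises task success slightly but carries a
# huge impact penalty, so the regularized agent prefers the low-impact plan.

LAMBDA = 10.0  # impact penalty weight (hypothetical value)

def regularized_score(task_reward: float, impact: float, lam: float = LAMBDA) -> float:
    return task_reward - lam * impact

just_answer = regularized_score(task_reward=0.999, impact=0.001)  # say "2", do nothing else
with_laser = regularized_score(task_reward=1.0, impact=5.0)       # also build the laser

print(just_answer > with_laser)  # True: the low-impact plan wins
```

The hard open problem, of course, is defining the `impact` measure itself so that it penalizes meteor-defense lasers without penalizing the intended task.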

Thanks again for your reply. I see your point that the world is complicated and a utility maximizer would be dangerous, even if the maximization is supposedly trivial. However, I don't see how an achievable goal has the same problem. If my AI finds the answer "2" before a meteor hits, I would say it has solidly landed at 100% and stops doing anything. Your argument would hold if it decided to rule out all possible risks before actually starting to look for the answer, which it would otherwise quickly find. But since ruling out those risks is much harder than finding the answer, I can't see my little agent doing that.

I think my easy goals come closest to what you call other-izers. Any more pointers for me to find that literature?

Thanks for your help, it helps me to calibrate my thoughts for sure!

I think "1+1 = ?" is actually not an easy enough goal, since it's not 100% certain that the answer is 2. Getting to 100% certainty (including about what I actually meant by the question) could still be nontrivial. But suppose the goal is 'delete filename.txt'? Maybe the trick is in the language...

Minimum hardware leads to maximum security. As a lab or a regulatory body, one can increase the safety of AI prototypes by limiting the hardware or the amount of data researchers have access to.

AGI is unnecessary for an intelligence explosion

Many arguments assume that an intelligence explosion requires AGI. However, it seems to me that the critical requirement is simply that an AI can self-improve. Which skills are needed for that? Given a hardware overhang, it probably comes down to the skills an AI researcher uses: reading papers, combining insights, running computer experiments until new insights emerge, and writing up the results. Perhaps an AI PhD can weigh in on the actual skills needed. My point, though, is that far from all human mental skills are needed for AI research. Appreciating art? Not needed. Intelligent conversation about non-AI topics? Not needed. Motor skills? Not needed.

I think the skills most needed for AI research (and therefore self-improvement) are ones at which a computer may be relatively strong: methodical thinking, language processing, coding. I would therefore expect an intelligence explosion significantly before we develop an actual AGI with the full range of human skills. This should matter for the timeline discussion.

Technically, tiling the entire universe with paperclips or tiny smiling faces would probably count as modern art...