I'm thinking a bit about AI safety lately as I'm considering writing about it for one of my college classes.

I'm hardly heavy into AI safety research and so expect flaws and mistaken assumptions in these ideas. I would be grateful for corrections.

  1. An AI told to make people smile tiles the world with smiley faces, but an AI told to do what humans would want it to do might still get it wrong, e.g. Failed Utopia #4-2. However, wouldn't it research further and correct itself (and, before that, take care not to do anything uncorrectable)? Reasoning as follows: say a non-failed utopia is worth 100/100 utility points per year. A failed utopia is one that at first seems to the AI to be a true utopia (100 points) but is actually worth less (say, 90 points). Even if the cost of research were heavy, if the AI wants/expects billions or trillions of years of human existence, it should do a lot of research early on and be very careful not to be fooled. Therefore, we don't need to do anything more than tell an AI to do what humans would want it to do, and let it do the work of figuring out exactly what that is itself.
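The expected-value argument above can be put into numbers. A minimal sketch, with illustrative assumptions (a billion-year horizon, 1,000 years of research at zero utility, and the 90-vs-100 point values from the example; none of these figures come from the original post):

```python
# Toy model of "research first vs. act now" over a long horizon.
# All numbers are assumptions chosen for illustration.
HORIZON_YEARS = 1_000_000_000  # assumed length of the human future

def total_utility(research_years, utility_per_year):
    """Utility forfeited during research, then a steady rate afterward."""
    return (HORIZON_YEARS - research_years) * utility_per_year

# Act immediately and lock in a failed utopia (90 points/year):
act_now = total_utility(0, 90)

# Spend 1,000 years researching first, then hit the true utopia (100/year):
research_first = total_utility(1_000, 100)

print(research_first > act_now)  # research dominates over long horizons
```

Under these assumptions, even a millennium of lost utility is dwarfed by the 10-point-per-year difference compounded over the remaining horizon, which is the intuition the paragraph is appealing to.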

  2. Partially un-consequentialist AI/careful AI: weighs harms caused by its own decisions (somewhat, but not absolutely) more heavily than other harms. (Therefore: a tendency toward protecting what humanity already has against large harms, and pursuing smaller, surer gains like curing cancer rather than instituting a new world order (haha).)

Thanks in advance. :)

1: Imagine a utility function as a function that takes as input a description of the world in some standard format, and outputs a "goodness rating" between 0 and 100. The AI can then take actions that it predicts will make the world have a higher goodness rating.

Lots of utility functions are possible. Suppose there's one possible future where I get cake, and one where I get pie. I have a very strong opinion on these futures' goodness, and I will take actions that I predict will make the world more likely to turn out to be the pie future. But this is not a priori ...
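The "utility function as a rating over world descriptions" idea above can be sketched in a few lines. This is only a toy illustration under an assumed encoding (a world as a dict of features, a hypothetical `goodness` function); real proposals involve vastly richer world descriptions:

```python
# Toy utility function: rates a world description between 0 and 100.
# The dict encoding and the specific point values are assumptions
# invented for this sketch, not anything from the thread.
def goodness(world: dict) -> float:
    score = 50.0
    if world.get("dessert") == "pie":
        score += 40.0   # this agent strongly prefers pie futures
    elif world.get("dessert") == "cake":
        score += 10.0   # cake is fine, but clearly worse here
    return min(score, 100.0)

cake_world = {"dessert": "cake"}
pie_world = {"dessert": "pie"}

# The agent then acts to steer toward whichever world rates higher:
best = max([cake_world, pie_world], key=goodness)
print(best)  # the pie future
```

The point of the sketch is that the preference ordering lives entirely in the function, and a different agent with a different `goodness` would steer toward different worlds given the same options.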

Vaniver: Check out the Cake or Death [http://lesswrong.com/lw/f3v/cake_or_death/] value loading problem, as Stuart Armstrong puts it. There's a rough similarity to the 'resist blackmail' problem, which is that you need to be able to tell the difference between someone delivering bad news and someone doing bad things. If the AI is mistaken about what is right, we want to be able to correct it without being interpreted as villains out to destroy potential utility. (Also, "correctable" is not really a low-level separation in reality, since the passage of time means nothing is truly correctable.)

Open Thread, Feb 8 - Feb 15, 2016

by Elo · 1 min read · 8th Feb 2016 · 224 comments


If it's worth saying, but not worth its own post (even in Discussion), then it goes here.

Notes for future OT posters:

1. Please add the 'open_thread' tag.

2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)

3. Open Threads should be posted in Discussion, and not Main.

4. Open Threads should start on Monday, and end on Sunday.