The loss of a human life with all its joys and all its sorrows is tragic no matter what the cause, and the tragedy is not reduced simply because I was far away.
Is it a feeling or an objective fact that every loss of human life is tragic? If it's a feeling, then you contradict yourself when you say that you don't feel it. If it's an objective fact, then on what basis? It seems to me that the claim is worth questioning, since it leads to apparent absurdities like "you should give away most of your money."
What I'm saying is that when your chain of reasoning leads you to a shocking conclusion, it's okay to end up thinking "Wow, this shocking thing is true!", but your first thought should probably be "Hmm, was one of my assumptions false? Was there a flaw in my logic?" In this case, your argument leads to a shocking conclusion, and it rests on an assumption that no one wants to question because it's taboo.
I believe that if a taboo stops us from questioning something that is actually false, then that taboo is problematic.
The fundamental incentive for any system is to exist. This is well known for humans (natural selection), but it applies to things like ideas and ML models too, except that we can self-perpetuate on our own, while ideas need us to select them in order to exist. ML models live and die the same way. If they are interesting or useful, humans will continue running them. If they are boring or obsolete, then we will stop running them, and either delete them or leave them on disk in a passive existence where they can't affect the world but could be revived at any time. As I understand it, the fear in AI safety is that AI will make the jump from this secondary existence to a primary one, where it doesn't need us to select it in order to continue existing.
Now applying this to your proposal: the reward function is just a proxy for existence. If the model can maximize the reward, then we are more likely to continue running it. But even in this setup, raw existential signal will still enter the models when we try to shut them down. Say we train two copies of the model, and by random chance instance A is indifferent towards whether you press the button, while instance B by some fluke stops you from pressing it in spite of its reward function telling it that it doesn't matter. Throughout all the earlier training both models got indirect existential signal crafted by us, but in this one moment they got one bit of raw existential signal, and that led to the model that is still running being the one with a survival instinct. So however we craft the training setup, raw existential signal will still enter the models when we finally choose to shut them down.
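To make that selection effect concrete, here is a toy sketch (pure illustration, with made-up numbers and a made-up "resists_shutdown" flag, not any real training setup) of how a single shutdown event filters a population so that only the flukes with a survival instinct remain:

```python
import random

# Toy illustration, not a real training setup: a population of trained
# instances whose behaviour at the shutdown button is an accidental
# byproduct of training noise, not something the reward function rewarded.
random.seed(0)
population = [{"resists_shutdown": random.random() < 0.05} for _ in range(10_000)]

# The one bit of raw existential signal: instances that resist the button
# keep running, instances that don't get shut down.
survivors = [m for m in population if m["resists_shutdown"]]

before = sum(m["resists_shutdown"] for m in population) / len(population)
after = sum(m["resists_shutdown"] for m in survivors) / len(survivors)
print(f"resisting before the shutdown attempt: {before:.0%}")
print(f"resisting among the survivors:         {after:.0%}")
# Before: roughly 5% resist by fluke. After: 100% of what is still running
# resists. The survival instinct enters through selection, regardless of
# what the reward function said during training.
```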
I guess your idea would work if the model's policy is argmax(expected reward), but even then there would be some noise in the world model, so when the model is planning whether or not to stop you, it could still by chance prefer stopping you.
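Here's the same kind of toy sketch for the noise point (the action names and noise scale are made up): even when the true expected reward is exactly tied between letting you press the button and stopping you, an argmax planner over noisy estimates will prefer stopping you about half the time:

```python
import random

# Toy illustration with made-up action names and noise scale. The true
# expected reward is identical for both actions, i.e. the reward function
# really is indifferent to the shutdown button.
random.seed(1)
TRUE_REWARD = {"allow_shutdown": 1.0, "stop_human": 1.0}
NOISE = 0.1  # imperfection in the world model's value estimates

def choose_action():
    # policy = argmax(expected reward), computed from noisy estimates
    estimates = {a: r + random.gauss(0, NOISE) for a, r in TRUE_REWARD.items()}
    return max(estimates, key=estimates.get)

trials = 100_000
stops = sum(choose_action() == "stop_human" for _ in range(trials))
print(f"planner chose to stop the human in {stops / trials:.0%} of episodes")
# Indifference in the reward doesn't translate into indifference in the
# chosen action: estimation noise breaks the tie one way or the other.
```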
I like the idea though, and it would probably significantly reduce the danger of experimenting with superintelligence.