Thanks heaps for the post man, I really enjoyed it! While I was reading it felt like you were taking a bunch of half-baked vague ideas out of my own head, cleaning them up, and giving some much clearer more-developed versions of those ideas back to me :)
Okay, I agree. Thanks :)
Thanks for response!
Input/output: I agree that the unnatural input/output channel is just as much a problem for the 'intended' model as for the models harbouring consequentialists, but I understood your original argument as relying on there being a strong asymmetry where the models containing consequentialists aren't substantially penalised by the unnaturalness of their input/output channels. An asymmetry like this seems necessary because specifying the input channel accounts for pretty much all of the complexity in the intended model.
Computational constraints: I'm not convinced that the necessary calculations the consequentialists would have to make aren't very expensive (from the their point of view). They don't merely need to predict the continuation of our bit sequence - they have to run simulations of all kinds of possible universes to work out which ones they care about and where in the multiverse Solomonoff inductors are being used to make momentous decisions, and then they perhaps need to simulate their own universe to work out which plausible input/output channels they want to target-- if they do this then all they get in return is a pretty measly influence over our beliefs, (since they're competing with many other daemons in approximately equally similar universes who have opposing values). I think there's a good chance these consequentialists might instead elect devote their computational resources to realising other things they desire (like simulating happy copies of themselves or something).
Thanks for your comment, I think I'm a little confused about what it would mean to actually satisfy this assumption.
It seems to me that many current algorithms, for example, a rainbowDQN agent, would satisfy assumption 3? But like I said I'm super confused about anything resembling questions about self-awareness/naturalisation.
Sorry for the late response! I didn't realise I had comments :)
In this proposal we go with (2): The AI does whatever it thinks the handlers will reward it for.
I agree this isn't as good as giving the agents an actually safe reward function, but if our assumptions are satisfied then this approval-maximising behaviour might still result in the human designers getting what they actually want.
What I think you're saying (please correct me if I misunderstood) is that an agent aiming to do whatever its designers reward it for will be incentivised to do undesirable things to us (like wiring up our brains to machines which make us want to press the reward button all the time).
It's true that the agents will try to take these kind nefarious actions if they think they can get away with it. But in this setup the agent knows that it can't get away with tricking the humans like this, since it's ancestors already warned the humans that a future agent might try this, and the humans prepared appropriately.
My entry: https://www.lesserwrong.com/posts/DTv3jpro99KwdkHRE/ai-alignment-prize-super-boxing
So much of your writing sounds like an eloquent clarification of my own underdeveloped thoughts. I'd bet good money your lesswrong contributions have delivered me far more help than harm :) Thanks <3
Heartbreaking :'( still, that "taken time off from their cryptographic shenanigans" line made me laugh so hard I woke my girlfriend up