Sorry for my very late reply!
Thanks for taking the time to answer. I no longer endorse most of what I wrote.
I think that if the AGI has a perfect motivation system then we win, there's no safety problem left to solve. (Well, assuming it also remains perfect over time, as it learns new things and thinks new thoughts.) (See here for the difference between motivation and reward.)
and from the post:
And if we get to a point where we can design reward signals that sculpt an AGI's motivation with surgical precision, that's fine!
This is mostly where I went wrong. I.e. I assumed a perfect reward signal coming from some external oracle in examples where your entire point was that we didn't have a perfect reward signal (e.g. wireheading).
So basically, I think we agree: a perfect reward signal may be enough in principle, but in practice it will not be perfect and may not be enough. At least not a single unified reward.
Place d'Armes ("Plëss"), on the side opposite the stage (the "Old City Kiosque" on Google Maps).
Turns out there's a COVID demo here, but I have nothing to do with those guys.
In case of any doubts, don't hesitate to email me!
Also, if you plan to arrive late, let me know please.
(I can't edit the original, for some reason.)
Disclaimer: I have only read the abstract of the "Reward is enough" paper. Also, I don't have much experience in AI safety, but I am considering changing that.
Here are a couple of my thoughts.
Your examples haven't entirely convinced me that reward isn't enough. Take the bird. As I see it, something like the following is going on:
Evolution chose to take a shortcut: maybe a bird with a very large brain and a lot of time would eventually figure out that singing is a smart thing to do if it received reward for singing well. But evolution, being a ruthless optimizer with many previous generations of experience, shaped two separate rewards in the way you described. Silver et al.'s point might be that when building an AGI, we wouldn't have to take that shortcut, at least not by handcoding it.
Assume we have an agent that is released into the world and is trying to optimize reward. It starts out from scratch, knowing nothing, but with a lot of time and the computational capacity to learn a lot.
Such an agent has an incentive to explore. So it tries out singing for two minutes. It notes that in the first minute it got 1 unit of reward and in the second 2 (it got better!). All in all, however, 3 units is very little in this world, so maybe it moves on.
But as it gains more experience in the world, it notices that patterns like these can often be extrapolated. Maybe, with its two minutes of experience, if it sang for a third minute, it would get 3 units of reward? It tries and yes, indeed. Now it has an incentive to see how far it can take this. It knows the 4 units it expects from the next try will not be worth its time on their own, but the information of whether it could eventually get a million units per minute this way is very much worth the cost!
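To make the value-of-information step above concrete, here is a minimal sketch in Python. Only the rewards 1, 2, 3 and the "million units per minute" figure come from the story; the function names, the 1% chance that the trend keeps scaling, and the opportunity cost of 100 units are made-up assumptions for illustration, not anything from the paper or the post.

```python
def extrapolate_linear(rewards):
    """Predict the next per-minute reward by extending the average observed step."""
    if len(rewards) < 2:
        raise ValueError("need at least two observations to extrapolate")
    steps = [later - earlier for earlier, later in zip(rewards, rewards[1:])]
    return rewards[-1] + sum(steps) / len(steps)


def worth_one_more_try(rewards, p_trend_scales, jackpot_per_minute, opportunity_cost):
    """Compare the expected value of one more minute of singing with its cost.

    The expected value is the immediate extrapolated reward plus the
    probability-weighted payoff of learning that the trend keeps scaling.
    All three numeric parameters are hypothetical inputs.
    """
    immediate = extrapolate_linear(rewards)
    expected_value = immediate + p_trend_scales * jackpot_per_minute
    return expected_value > opportunity_cost


observed = [1, 2, 3]  # reward in minutes 1, 2 and 3, as in the story
print(extrapolate_linear(observed))  # 4.0
print(worth_one_more_try(observed, p_trend_scales=0.01,
                         jackpot_per_minute=1_000_000,
                         opportunity_cost=100))  # True
```

With these made-up numbers, the expected 4 units alone would not justify the minute (4 < 100), but even a 1% chance of the million-unit outcome does.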
Something kind of analogous should be true for the spider story.
Reward very much provides an incentive for the agent to eventually figure out that after encountering a threat, it should change its behavior, not its interpretations of the world. At the beginning it might get this wrong, but it's unfair to compare it to a human who has had this info "handcoded in" by evolution.
If our algorithms don't allow an agent to learn to update differently in the future because past updates were unhelpful (I don't know, pointers welcome!), then that's not a problem with reward, it's a problem with our algorithms!
Maybe this is what you alluded to in your very last paragraph, where you speculated that they might just mean a more sophisticated RL algorithm?
Concerning the deceptive AGI etc., I agree problems emerge when we don't get the reward signal exactly right and that it's probably not a safe assumption that we will. But it might still be an interesting question how things would go assuming a perfect reward signal?
My impression is that their answer is "it would basically work", while yours is something like "but we really shouldn't assume that and if we don't, then it's probably better to have separate reward signals etc.". Given the bird example, I assume you also don't agree that things would work out fine even if we did have the best possible reward signal?
Also, I just want to mention that I agree the distinction between within-lifetime RL and intergenerational RL is useful, certainly in the case of biology and probably in machine learning too.
Alright, that's helpful! Thanks!
Right at the start under "How to use this book", there is this paragraph:
If you have never been to a CFAR workshop, and don’t have any near-term plans to come to one, then you may be the kind of person who would love to set their eyes on a guide to improving one’s rationality, full of straightforward instructions and exercises on how to think more clearly, act more effectively, learn more from your experiences, make better decisions, and do more with your life. This book is not that guide (nor is the workshop itself, for that matter). It was never intended to be that guide.
I'm partway through reading the sequences and have read plenty of other LW posts, but I have no way to attend a CFAR workshop anytime soon. Still, I want to get more experience with applied rationality. Would you say reading (parts of) this handbook and trying exercises that seem relevant to my situation is a reasonable thing to try (while minding all your precautions)? Or do you know of a better way to train one's skills in applying rationality?