Oh you're right! Thanks for catching that. I think I was led astray because I wanted there to be a big payoff for averting the bad event, but I guess the benefit is just not having to pay D. I'll have a look and see how much this changes things.
Edit: Fixed it up now, none of the conclusions seem to change (which is good because they seemed like common sense!). Thanks for reading this and pointing that out!
Thanks! Yeah, I definitely think that "it's okay to slack today if I pull up the average later on" is a pretty common way people lose productivity. I think one framing could be that if you do have an off day, that doesn't have to put you off track forever, and you can make up for it in the future.
I make the graphs using the [matplotlib xkcd mode](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xkcd.html). It's super easy to use: you just put your plotting inside a `with plt.xkcd():` block.
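For anyone curious, a minimal sketch of what that looks like (the data and labels here are just placeholders):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Everything plotted inside the context manager gets the hand-drawn xkcd styling;
# styling reverts to normal once the block exits.
with plt.xkcd():
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
    ax.set_xlabel("effort")
    ax.set_ylabel("payoff")
    fig.savefig("xkcd_demo.png")
```

The context manager restores your previous rcParams on exit, so the rest of your figures are unaffected.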
My read of Russell's position is that if we can successfully make the agent uncertain about its model of human preferences, then it will defer to the human when it might do something bad, which hopefully solves (or at least helps with) making it corrigible.
I do agree that this doesn't seem to help with inner alignment, though; I'm still trying to wrap my head around that area.