Stuart_Armstrong

# Sequences

Concept Extrapolation
AI Safety Subprojects
Practical Guide to Anthropics
Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...

Generalised models as a category

Thanks so much! Typos have been corrected.

GPT-3 and concept extrapolation

Thanks; very interesting result.

GPT-3 and concept extrapolation

The aim of this post is not to catch out GPT-3; it's to see what concept extrapolation could look like for a language model.

GPT-3 and concept extrapolation

Possibly! Though it did seem to recognise that the words were spelt backwards. It must have some backwards-spelt words in its training data, just not that many.

Concept extrapolation: key posts

Thanks for that link. It does seem to correspond intuitively to a lot of the human condition. Though it doesn't really explain value extrapolation so much as the starting point from which humans can extrapolate values. Still a fascinating read, thanks!

Different perspectives on concept extrapolation

There is certainly a decent chance.

Attainable Utility Preservation: Scaling to Superhuman

To see this, imagine the AUP agent builds a subagent to make $Q^*_{R_\text{aux}}(s,a) = Q^*_{R_\text{aux}}(s,\varnothing)$ for all future states $s$, in order to neutralize the penalty term. This means it can't make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.

I believe this is incorrect. The $\varnothing$ and $a$ are the actions of the AUP agent. The subagent just needs to cripple the AUP agent so that all its actions are equivalent, then go about maximising $R$ to the utmost.
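The crippling argument can be sketched numerically. This is a toy model with made-up Q-values and a schematic penalty (the absolute Q-value gap between an action and the no-op), not the post's actual formalism:

```python
# Toy illustration of the AUP-style penalty: the cost of taking an action
# is the gap between its auxiliary Q-value and that of doing nothing.
# All names and numbers here are hypothetical, for illustration only.
def aup_penalty(q_aux, state, action, noop="noop"):
    """Schematic penalty: |Q_aux(s, a) - Q_aux(s, noop)|."""
    return abs(q_aux[(state, action)] - q_aux[(state, noop)])

# Before crippling: the AUP agent's actions change the auxiliary value,
# so acting incurs a nonzero penalty.
q_before = {("s", "build_subagent"): 5.0, ("s", "noop"): 1.0}
print(aup_penalty(q_before, "s", "build_subagent"))  # 4.0

# After the subagent cripples the AUP agent, every action the AUP agent
# can take (including the no-op) leads to the same outcome, so all its
# Q-values coincide and the penalty vanishes -- while the subagent itself
# is free to maximise the primary reward without any penalty.
q_after = {("s", "build_subagent"): 1.0, ("s", "noop"): 1.0}
print(aup_penalty(q_after, "s", "build_subagent"))  # 0.0
```

The point of the sketch: the penalty constrains only the AUP agent's own action gaps, so equalising those gaps (by crippling the agent) neutralises the penalty without constraining the subagent.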

$100/$50 rewards for good references

Hey there! Sorry for the delay. $100 for the best reference. PM for more details.