Tom Price

Comments

Models Don't "Get Reward"
Tom Price · 3mo · 10

Could you give some examples?

Reward is not the optimization target
Tom Price · 3mo* · 10

"reward chisels cognitive grooves into an agent"

This makes sense, but if the agent is smart enough to know how it *could* wirehead, perhaps wireheading would eventually result from the chiseling of some highly abstract grooves.

To give an example, suppose you go to Domino's pizza on Saturday at 6pm and eat some Hawaiian pizza. You enjoy the pizza. This reinforces the behaviour of "Go to Domino's pizza on Saturday at 6pm and eat some Hawaiian pizza".

Surely this will also reinforce other, more generic behaviours that include this behaviour as a special case, such as:

"Go to a pizza place in the evening and eat pizza."

"Go to a restaurant and eat yummy food."

Well then, why not "do a thing that I know will make me feel good"? That includes the original behaviour as a special case. It also includes wireheading.
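
For concreteness, here is a minimal toy sketch of that credit-flow intuition (my own illustration, not something from the post; the behaviour names and the reinforce helper are made up). The idea it shows: if reward deepens the groove for every description that was active during a rewarded episode, then the most abstract description gets deepened by every rewarding episode, and wireheading is one of its special cases.

```python
from collections import defaultdict

# Hypothetical hierarchy: each concrete behaviour is a special case of
# progressively more abstract behaviours.
ABSTRACTIONS = {
    "dominos_hawaiian_sat_6pm": [
        "dominos_hawaiian_sat_6pm",
        "eat_pizza_at_a_pizza_place_in_the_evening",
        "eat_yummy_food_at_a_restaurant",
        "do_a_thing_i_know_feels_good",
    ],
    "wirehead": [
        "wirehead",
        "do_a_thing_i_know_feels_good",  # wireheading also falls under the most abstract groove
    ],
}

grooves = defaultdict(float)  # "groove depth" per behavioural description


def reinforce(behaviour: str, reward: float) -> None:
    # Deepen the groove for every abstraction the rewarded behaviour instantiates.
    for description in ABSTRACTIONS[behaviour]:
        grooves[description] += reward


# Ten ordinary pizza episodes...
for _ in range(10):
    reinforce("dominos_hawaiian_sat_6pm", reward=1.0)

print(grooves["dominos_hawaiian_sat_6pm"])      # 10.0
print(grooves["do_a_thing_i_know_feels_good"])  # 10.0 -- deepened by every rewarding episode
```

The abstract groove ends up just as deep as the concrete one, and unlike the concrete one it already covers wireheading.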

(This is a different explanation of a similar point made in this comment from hillz: https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target?commentId=oZ6aX3bzNF5bwvL4S, but it seemed different enough to be worth a separate comment.)
