tindwcel — LessWrong

My dog often takes various things lying around within their reach (socks, napkins, once a passport) and runs away, usually destroying the object to some degree. I think it started with food items that were left around and somehow evolved into a broader habit. Ideally, this behavior would be disincentivized somehow until they stopped completely, but over time I have found the best way to get them to give up the item is by trading a treat for it. This post made me realize that I'm basically training them to start this keep-away game.

Another thing I hadn't realized until writing this comment: because they sometimes take stuff that I find practically worthless (in which case they don't get a treat) and sometimes take stuff that is really important (in which case I run to get a high-value treat), I am implementing an intermittent reinforcement scheme.

I wish I could effectively communicate to my dog that they would get a treat at a predetermined time if and only if they refrained from doing this, but I don't suppose they'd be able to make the connection unless I implemented a really attention-demanding, high-frequency scheme. Curse our lack of shared language.

How AI Is Learning to Think in Secret

tindwcel10d10

Have weaker models check whether they can actually follow each step - if Claude Jr. can’t understand what Claude Sr. is saying, maybe Claude Sr. is hiding something: “MOOOOOM, Claude’s being WEIRD again!”

It seems this section addresses your idea.

Forfeiting Ill-Gotten Gains

tindwcel19d40
