This is a special post for quick takes by Ricardo Meneghin. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.


Has there been any discussion around aligning a powerful AI by minimizing the amount of disruption it causes to the world?

A common example of alignment failure is that of a coffee-serving robot killing its owner because that's the best way to ensure that the coffee will be served. Sure, it is, but it's also a course of action far more transformative to the world than just serving coffee. A common response is "just add safeguards so it doesn't kill humans", which is followed by "sure, but you can't add safeguards for every possible failure mode". But can't you?

Couldn't you just add a term to the agent's utility function penalizing the difference between the current world and its prediction of the future world, disincentivizing any action that makes a lot of changes (like taking over the world)?
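
As a minimal sketch of what such a term might look like (the penalty weight $\lambda$ and the state-difference measure $d$ here are illustrative assumptions, not a concrete proposal):

$$U'(a) = U(a) - \lambda \, d\big(s_{\text{now}},\ \hat{s}(a)\big)$$

where $U(a)$ is the original utility of action $a$, $s_{\text{now}}$ is the current world state, $\hat{s}(a)$ is the agent's predicted world state after taking $a$, and $d$ measures how different the two states are. In the coffee example, killing the owner moves the world much further from $s_{\text{now}}$ than just serving coffee, so it would be heavily penalized.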