
Ricardo Meneghin's Shortform

by Ricardo Meneghin
14th Aug 2020

This is a special post for quick takes by Ricardo Meneghin. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
3 comments, sorted by top scoring
Ricardo Meneghin · 5y

Has there been any discussion around aligning a powerful AI by minimizing the amount of disruption it causes to the world?

A common example of alignment failure is that of a coffee-serving robot killing its owner because that's the best way to ensure that the coffee will be served. Sure, it is, but it's also a course of action far more transformative to the world than just serving coffee. A common response is "just add safeguards so it doesn't kill humans", which is followed by "sure, but you can't add safeguards for every possible failure mode". But can't you?

Couldn't you just add a term to the agent's utility function penalizing the difference between the current world and its prediction of the future world, disincentivizing any action that makes a lot of changes (like taking over the world)?
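
A minimal sketch of what such a penalty term could look like (illustrative only; the state representation, the distance function `change`, and the trade-off `weight` below are placeholder assumptions, not a worked-out impact measure):

```python
import numpy as np

def impact_penalized_utility(base_utility, state_distance, weight=1.0):
    """Wrap a base utility with a penalty on predicted change to the world.

    base_utility:   maps a predicted future world state to a utility value.
    state_distance: maps (current_state, predicted_state) to a non-negative
                    "how much did the world change" number (placeholder metric).
    weight:         trade-off between task performance and low impact.
    """
    def utility(current_state, predicted_state):
        return (base_utility(predicted_state)
                - weight * state_distance(current_state, predicted_state))
    return utility

# Toy usage: world states as feature vectors, "change" as Euclidean distance.
coffee_served = lambda s: s[0]                  # hypothetical task feature
change = lambda s0, s1: float(np.linalg.norm(np.asarray(s1) - np.asarray(s0)))

u = impact_penalized_utility(coffee_served, change, weight=0.5)
print(u([0.0, 0.0], [1.0, 0.1]))   # coffee served, little else changes
print(u([0.0, 0.0], [1.0, 9.0]))   # coffee served by transforming the world
```

The hard part this sketch glosses over is choosing the state representation and distance so that "big change" tracks what we actually care about.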

TurnTrout · 5y

Impact measures.

Ricardo Meneghin · 5y

Thanks!
