This is a special post for short-form writing by Ricardo Meneghin. Only they can create top-level comments.


Has there been any discussion around aligning a powerful AI by minimizing the amount of disruption it causes to the world?

A common example of alignment failure is a coffee-serving robot killing its owner because that's the best way to ensure the coffee gets served. Sure, it is, but it's also a course of action vastly more transformative to the world than just serving coffee. A common response is "just add safeguards so it doesn't kill humans", which is followed by "sure, but you can't add safeguards for every possible failure mode". But can't you?

Couldn't you just add a term to the agent's utility function penalizing the difference between the current world and its prediction of the future world, disincentivizing any action that makes a lot of changes (like taking over the world)?
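A toy sketch of what that penalized objective might look like, assuming we have a world model, a task utility, and some distance measure over world states (all names here are hypothetical illustrations, not an established method):

```python
# Hypothetical sketch of an impact-penalized objective. States are dicts of
# world features; the distance counts how many features an action changes.

def penalized_utility(action, current_state, task_utility, world_model,
                      state_distance, impact_weight=1.0):
    """Task utility minus a penalty on predicted change to the world."""
    predicted_state = world_model(current_state, action)
    impact = state_distance(current_state, predicted_state)
    return task_utility(action) - impact_weight * impact

# Toy world: both actions get the coffee served, but one changes far more.
action_effects = {
    "serve_coffee": {"coffee_served": True},
    "take_over_world": {"coffee_served": True, "world_order": "AI-controlled"},
}

def world_model(state, action):
    new_state = dict(state)
    new_state.update(action_effects[action])
    return new_state

def task_utility(action):
    return 1.0  # coffee ends up served either way

def state_distance(a, b):
    return sum(1 for key in a if a[key] != b[key])

current = {"coffee_served": False, "owner_alive": True, "world_order": "normal"}

score_coffee = penalized_utility("serve_coffee", current, task_utility,
                                 world_model, state_distance)
score_takeover = penalized_utility("take_over_world", current, task_utility,
                                   world_model, state_distance)
# serve_coffee changes one feature, take_over_world changes two,
# so the penalty makes the low-impact action win.
```

Of course, the hard parts this sketch hides are exactly the open questions: how to define `state_distance` over the real world, and how to set `impact_weight` so the agent still does anything useful.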