
Many ways have been proposed to incorporate human values in an AGI (e.g. Coherent Extrapolated Volition, Coherent Aggregated Volition, and Coherent Blended Volition, mostly proposed around 2004-2010). Value learning was suggested in 2011 by Daniel Dewey in ‘Learning What to Value’. Like most authors, he assumes that an artificial agent needs to be intentionally aligned to human goals. First, Dewey argues against a simple use of reinforcement learning to solve this problem, on the basis that it leads to the maximization of specific rewards that can diverge from value maximization; for example, it could suffer from goal misspecification or reward hacking. He proposes a utility function maximizer comparable to AIXI, which considers all possible utility functions weighted by their Bayesian probabilities: "[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(U|yxm) given a particular interaction history [yxm]. An agent can then calculate an expected value over possible utility functions given a particular interaction history".
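Dewey's proposal can be illustrated with a toy sketch (a heavily simplified, hypothetical example; the utility pool, outcomes, and likelihood model below are invented for illustration and do not come from the paper): the agent maintains a Bayesian posterior over a small pool of candidate utility functions, conditioned on its interaction history, and chooses the action that maximizes expected utility under that posterior rather than optimizing any single fixed reward.

```python
# Toy sketch of value learning under utility-function uncertainty.
# Hypothetical pool: each candidate utility function maps an outcome to a value.
utility_pool = {
    "U_happiness": lambda outcome: {"help": 1.0, "wirehead": 1.0}.get(outcome, 0.0),
    "U_true_values": lambda outcome: {"help": 1.0, "wirehead": -1.0}.get(outcome, 0.0),
}

def posterior(history, prior, likelihood):
    """Bayesian update: P(U | history) is proportional to P(history | U) * P(U)."""
    unnorm = {u: prior[u] * likelihood(u, history) for u in prior}
    z = sum(unnorm.values())
    return {u: p / z for u, p in unnorm.items()}

def expected_utility(outcome, post):
    """Expected value of an outcome, averaged over the pool of utility functions."""
    return sum(p * utility_pool[u](outcome) for u, p in post.items())

def likelihood(u, history):
    """Hypothetical evidence model: human disapproval of wireheading
    makes U_true_values more probable."""
    if "human_disapproved_wireheading" in history:
        return 0.9 if u == "U_true_values" else 0.1
    return 0.5

prior = {"U_happiness": 0.5, "U_true_values": 0.5}
post = posterior(["human_disapproved_wireheading"], prior, likelihood)

# The agent chooses the action with the highest posterior expected utility,
# so observed evidence about human values changes which action it takes.
best = max(["help", "wirehead"], key=lambda a: expected_utility(a, post))
```

In this sketch the feedback shifts the posterior to 0.9 on `U_true_values`, so wireheading gets a negative expected utility and the agent helps instead; a pure reward maximizer with the fixed `U_happiness` reward would have been indifferent between the two actions.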


Nick Bostrom also discusses value learning at length in his book Superintelligence. Value learning is closely related to various proposals for AI-assisted Alignment and AI-assisted/AI-automated Alignment research. Since human values are complex and fragile, learning human values well is a challenging problem, much like AI-assisted Alignment (but in a less supervised setting, so actually harder). So this is only a practicable alignment technique for an AGI capable of successfully performing a STEM research program (in Anthropology). Thus value learning is (unusually) an alignment technique that improves as capabilities increase, and it requires a minimum capability threshold around AGI level to begin to be effective.


**Value learning** is a proposed method for incorporating human values in an AGI. It involves the creation of an artificial learner whose actions consider many possible sets of values and preferences, weighed by their likelihood. Value learning could prevent an AGI from having goals detrimental to human values, hence helping in the creation of Friendly AI.


Two long paragraphs on Dewey's original paper, followed by one short paragraph hidden below the fold on everything that has happened since, seems like an inappropriate balance. I'm inclined to edit the summary of Dewey's paper down a little. Before I do, does anyone have a fundamental objection to this?