Gradient Hacking

Applied to Gradient Filtering by Jozdien 1y ago

Gradient Hacking describes a scenario where a mesa-optimizer in an AI system acts in a way that intentionally manipulates the way that gradient descent updates it, likely to preserve its own mesa-objective in future iterations of the AI.

See also: Inner Alignment