Nicolas Macé

Making AIs less likely to be spiteful

Which forms of misalignment might result in particularly bad outcomes? And to what extent can we prevent them even if we fail at intent alignment? We define spite as a terminal preference for frustrating others’ preferences, at least under some conditions. Reducing the chances that an AI system is spiteful is a candidate class of interventions for reducing risks of AGI conflict, as well as risks from malevolence. This post summarizes some of our thinking on the topic. We give an overview of why spite might lead to catastrophic conflict; how we might intervene to reduce it; ways in which the intervention could fail to be impactful, or have negative impact; and things we could learn that would update us on the value of this intervention. Key takeaways 1. Spiteful preferences include a generalized preference for harming others, as well as other preferences like vengefulness and spite towards certain groups. The basic reason to focus on reducing spite is that such interventions may stably make AIs less likely to take risks of mutually costly conflict (or deliberately create suffering because they intrinsically value it), even if alignment fails. (more) 2. Spite might be selected for in ML systems because (a) it serves as a strategically valuable commitment device, (b) it is a direct proxy for high-scoring behavior in environments where the optimal behavior involves harming other agents (e.g., environments with competition between agents), (c) it is (correctly or incorrectly) inferred from human preferences, or (d) it results from miscellaneous generalization failures. (more) 3. Thus potentially low-cost interventions to reduce the chances of spite include modifications to the training data or loss function to reduce selection pressure towards spite (e.g., avoiding selecting agents based on relative performance in multi-agent tasks, or filtering human feedback that could select for spite). (more) 1. Reducing spite carries some prima facie backfire risk, via pot

118Sep 26, 2023

Nicolas Macé

Message

290

Individually incentivized safe Pareto improvements in open-source bargaining

Summary Agents might fail to peacefully trade in high-stakes negotiations. Such bargaining failures can have catastrophic consequences, including great power conflicts, and AI flash wars. This post is a distillation of DiGiovanni et al. (2024) (DCM), whose central result is that agents that are sufficiently transparent to each other have...

Jul 17, 202441

Degeneracies are sticky for SGD

Introduction Singular learning theory (SLT) is a theory of learning dynamics in Bayesian statistical models. It has been argued that SLT could provide insights into the training dynamics of deep neural networks. However, a theory of deep learning inspired by SLT is still lacking. In particular it seems important to...

Jun 16, 202456

Making AIs less likely to be spiteful

Sep 26, 2023118

Open-minded updatelessness

Summary Bounded agents might be unaware of possibilities relevant to their decision-making. That is, they may not just be uncertain, but fail to conceive of some relevant hypotheses entirely. What's more, commitment races might pressure early AGIs into adopting an updateless policy from a position of limited awareness. What happens...

Jul 10, 202366

Nicolas Macé's Shortform

Jun 21, 20231

LESSWRONG
LW

LESSWRONG
LW

Nicolas Macé

Nicolas Macé

Nicolas Macé

Making AIs less likely to be spiteful

Open-minded updatelessness

Degeneracies are sticky for SGD

Individually incentivized safe Pareto improvements in open-source bargaining

Nicolas Macé

Individually incentivized safe Pareto improvements in open-source bargaining

Degeneracies are sticky for SGD

Making AIs less likely to be spiteful

Open-minded updatelessness

Nicolas Macé's Shortform

Making AIs less likely to be spiteful

Open-minded updatelessness

Degeneracies are sticky for SGD

Individually incentivized safe Pareto improvements in open-source bargaining

Individually incentivized safe Pareto improvements in open-source bargaining

Degeneracies are sticky for SGD

Making AIs less likely to be spiteful

Open-minded updatelessness

Nicolas Macé's Shortform