On The Risks of Emergent Behavior in Foundation Models

jsteinhardt

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post first appeared as a commentary for the paper "On The Opportunities and Risks of Foundation Models".

Bommasani et al. (2021) discuss a trend in machine learning, whereby increasingly large-scale models are trained once and then adapted to many different tasks; they call such models "foundation models". I quite enjoyed their paper and the associated workshop, and felt they correctly identified the two most important themes in foundation models: emergence and homogenization. My main criticism is that despite identifying these themes, they did not carry them to their logical conclusions, so I hope to (partially) remedy that here.

In short, emergence implies that ML systems can quickly change to look different and "weird" compared to ML today, thus creating new risks that aren't currently apparent. Meanwhile, homogenization contributes to inertia, which could make us slow to adapt. This calls for thinking about these risks now, to provide the requisite lead time.

Emergent Behavior Creates Emergent Risks

Bommasani et al. (2021) use the following definition of emergence:

"Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unintended consequences."

This actually better matches the definition of a self-organizing system, which tends to produce emergent behavior. I will take emergence to be the idea that qualitative changes in behavior arise from varying a quantitative parameter ("More Is Different"). This is most common in self-organizing systems such as biology and economics (and machine learning), but can occur even for simple physical systems such as ice melting when temperature increases. In machine learning, Bommasani et al. highlight the emergence of "in-context" or "few-shot" learning; other examples include arithmetic and broad multitask capabilities.

The companion to emergence is the phase transition, exemplified by the melting ice above. While not always the case, emergent behavior often manifests quickly once some threshold is crossed. Radford et al. (2018) provided the first hint of the emergent few-shot capabilities that, three years later, are now ubiquitous. More strikingly, arithmetic capabilities in GPT-3 emerge from only a 30x increase in model size (Brown et al., 2020; page 22), and Power et al. (2021) show that similar capabilities can emerge simply by training for longer.
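To make the threshold picture concrete, here is a minimal toy sketch. It is not a model of GPT-3 or of any result in the papers cited above: the function `toy_task_accuracy`, its logistic shape, and the `threshold` and `sharpness` values are all made-up assumptions, chosen only to show how the same 30x increase in a quantitative parameter can change almost nothing far below a threshold and almost everything near it.

```python
import math


def toy_task_accuracy(params: float,
                      threshold: float = 1e10,
                      sharpness: float = 4.0) -> float:
    """Hypothetical accuracy curve: logistic in log10(parameter count).

    `threshold` (where accuracy is 0.5) and `sharpness` (how steep the
    transition is) are illustrative values, not measurements.
    """
    x = sharpness * (math.log10(params) - math.log10(threshold))
    return 1.0 / (1.0 + math.exp(-x))


def gain_from_scaling(params: float, factor: float = 30.0) -> float:
    """Accuracy gained by multiplying the parameter count by `factor`."""
    return toy_task_accuracy(params * factor) - toy_task_accuracy(params)


if __name__ == "__main__":
    # The same 30x scale-up, started from different model sizes.
    for start in [1e6, 1e7, 1e8, 1e9, 3e9]:
        print(f"{start:.0e} -> {start * 30:.0e} params: "
              f"accuracy gain {gain_from_scaling(start):.3f}")
```

Under these assumed numbers, a 30x scale-up that starts far below the threshold barely moves the accuracy, while the same factor applied near the threshold takes the toy task from mostly failing to mostly succeeding. That is the qualitative shape of a phase transition; real capability curves need not be logistic.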

Moving forward, we should expect new behaviors to emerge routinely, and some emergent properties to appear quite suddenly. For instance, risky capabilities such as hacking could enable new forms of misuse without leaving sufficient time to respond. New autonomous weapons could upset the current balance of power or enable new bad actors, sparking a global crisis.

Beyond misuse, I worry about internal risks from misaligned objectives. I expect to see the emergence of deceptive behavior as ML systems get better at strategic planning and become more aware of their broader environmental context. Recommender systems and newsfeeds already have some incentive to deceive users to produce profit. As ML models are increasingly trained based on human ratings, deception will become more attractive to trained ML systems, and better capabilities will make deception more feasible.

Emergence therefore predicts a weird and, unfortunately, risk-laden future. Current applications of machine learning seem far removed from ML-automated cyberattacks or deceptive machines, but these are logical conclusions of current trends; it behooves us to mitigate them early.

Homogenization Increases Inertia

Bommasani et al.'s other trend is homogenization:

"Homogenization indicates the consolidation of methodologies for building machine learning systems across a wide range of applications; it provides strong leverage towards many tasks but also creates single points of failure."

Homogenization contributes to inertia, which slows our reaction to new phenomena. Current foundation models are derived from enormous corpora of images, text, and more recently code. Changing this backend is not easy, and even known biases such as harmful stereotypes remain unfixed. Meanwhile, new data problems such as imitative deception could pose even greater challenges.

Change that may seem slow can still be fast compared to the pace of large institutions. Based on the previous examples of emergence, it appears that new capabilities take anywhere from 6 months to 5 years to progress from nascent to ubiquitous. In contrast, institutions often take years or decades to respond to new technology. If a new capability creates harms that outweigh the benefits of machine learning, neither internal engineers nor external regulators will reliably respond quickly.

Inertia can come from other sources as well: by the time some problems are apparent, machine learning may already be deeply woven into our societal infrastructure and built upon years of subtly flawed training data. It will not be feasible to start over, and we may face a task akin to fixing a rocket ship as it takes off. It would be much better to fix it on the launchpad.

Fixing the Rocket

Our most recent global crises are the coronavirus pandemic and global warming. The former took over a year to reach a full policy response, while the latter is still struggling after decades of effort. The pace of machine learning is too fast for this; we need to think a decade ahead, starting now.

We can start by building a better picture of future ML systems. While the future is uncertain, it is not unknowable, and I and others have started to do this by forecasting progress in AI.

On a more technical level, we can unearth, investigate, and characterize potentially dangerous behaviors in ML systems. We can also work on mitigation strategies such as anomaly detection and value alignment, and guard against external risks such as cyberattacks or autonomous weapons. In a recent white paper, we outline approaches to these and other directions, and we hope others will join us in addressing them.
