[AN #90]: How search landscapes can contain self-reinforcing feedback loops

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Demons in Imperfect Search (John S Wentworth) (summarized by Asya): This post gives an analogy to explain optimization demons: a type of undesirable behavior that arises in imperfect search processes. In the analogy, a ball rolls down a hill trying to go as far down as possible, mimicking a gradient descent algorithm. The ball is benefited by random noise, but still basically only experiences local changes in slope-- it cannot see steep drop-offs that are a little off to the side. Small bumps in the hill can temporarily alter the ball's trajectory, and the bumps that are selected for are the ones that most effectively control its trajectory. In this way, over time the ball's trajectory selects for demons, twisty paths with high walls that keep the ball contained and avoid competing walls. Demons cause the ball to go down the hill as slowly as possible so that potential energy is conserved for avoiding competitor walls.

The general pattern this analogy is meant to elucidate is the following: In any imperfect search mechanism with a rich enough search space, a feedback loop can appear that creates a more-and-more perfect exploitation of the imperfect search mechanism, resulting in a whole new optimization process. The post gives several real world examples as proofs that this is a failure mode that happens in real systems. One example is metabolic reactions-- a chemical system searches by making random small changes to the system state while trying to minimize free energy. Biological systems exploit the search by manipulating the height of the barriers between low-free-energy states, raising or lowering the activation energies required to cross them. After enough time, some chemicals changed the barriers enough such that more copies of the chemicals were made, kicking off an unstable feedback loop that led to life on earth.

The post ends by posing an open question asking what about a system makes this kind of failure mode likely to happen.

Asya's opinion: I think it's worth spelling out how this is different from the failure modes described in Risks from Learned Optimization (AN #58). In Risks from Learned Optimization, we are concerned that the outer optimizer will produce an unaligned inner optimizer because we're training it in diverse environments, and an inner optimizer may be the best solution for performing well in diverse environments. In this post, we are concerned that the outer optimizer will produce an unaligned demon (which may or may not be an optimizer) because the search process may have some self-reinforcing imperfections that allow it to be pushed strongly in a direction orthogonal to its objective. This direction could be bad unless the original outer objective is a perfect specification of what we want. This means that even if the conditions for mesa-optimization don't hold-- even if we're training on a fairly narrow task where search doesn't give an advantage-- there may be demon-related failure modes that are worth thinking about.

I really like this post, I think it crystallizes an important failure mode that I haven't seen described before. I'm excited to see more work on this class of problems.

Tessellating Hills: a toy model for demons in imperfect search (DaemonicSigil) (summarized by Asya): This post is trying to generate an example of the problem outlined in 'Demons in Imperfect Search' (summarized above): the problem where certain imperfect search processes allow for self-reinforcing behavior, 'demons', that push in a direction orthogonal to the original objective.

The post runs a simple gradient descent algorithm in an artifically constructed search space. The loss function that defines the search space has two major parts. One part straightforwardly tries to get the algorithm to move as far as it can in a particular direction x0 -- this represents our original objective function. The other part can be thought of as a series of periodic 'valleys' along every other axis, (x1 ... xn) that get steeper the farther you go along that axis.

When running the gradient descent, at first x0 increases steadily, and the other coordinates wander around more or less randomly. In the second phase, a self-reinforcing combination of valleys (a "demon") takes hold and amplifies itself drastically, feeding off the large x0 gradient. Finally, this demon becomes so strong that the search gets stuck in a local valley and further progress stops.

Asya's opinion: I think this is a good illustration of the problem specified in Demons in Imperfect Search. Clearly the space has to have a fairly specific shape, so the natural follow-up question, as is posed in the original post, is to think about what cases cause these kinds of self-reinforcing search spaces to arise.

Technical AI alignment

Agent foundations

A critical agential account of free will, causation, and physics (Jessica Taylor)

Subjective implication decision theory in critical agentialism (Jessica Taylor)

Forecasting

Historic trends in technological progress (AI Impacts) (summarized by Nicholas): One key question in thinking about AGI deployment and which safety problems to focus on is whether technological progress will be continuous or discontinuous. AI Impacts has researched the frequency of discontinuities in a number of case studies, that were selected on the possibility of having discontinuities. An example of a discontinuity in flight speed records would be the Fairey Delta 2 flight in 1956 which represented 19 years of progress at the previous trend. On the other hand, penicillin did not create a discontinuity of more than ten years in the number of deaths from syphilis in the US. This post summarizes a number of those case studies. As it is already a summary, I will just refer you to the post for more information.

Nicholas's opinion: I’m looking forward to reading AI Impacts’ conclusions after completing these case studies. My impression from reading through these is that discontinuities happen, but rarely, and small discontinuities are more common than larger ones. However, I remain uncertain of a) how relevant each of these examples is to AI progress, and b) if I missed any key ways in which the examples differ from each other.

Miscellaneous (Alignment)

Cortés, Pizarro, and Afonso as Precedents for Takeover (Daniel Kokotajlo) (summarized by Matthew): This post lists three historical examples of how small human groups conquered large parts of the world, and shows how they are arguably precedents for AI takeover scenarios. The first two historical examples are the conquests of American civilizations by Hernán Cortés and Francisco Pizarro in the early 16th century. The third example is the Portugese capture of key Indian Ocean trading ports, which happened at roughly the same time as the other conquests. Daniel argues that technological and strategic advantages were the likely causes of these European victories. However, since the European technological advantage was small in this period, we might expect that an AI coalition could similarly take over a large portion of the world, even without a large technological advantage.

Matthew's opinion: In a comment, I dispute the claimed reasons for why Europeans conquered American civilizations. I think that a large body of historical literature supports the conclusion that American civilizations fell primarily because of their exposure to diseases which they lacked immunity to, rather than because of European military power. I also think that this helps explain why Portugal was "only" able to capture Indian Ocean trading ports during this time period, rather than whole civilizations. I think the primary insight here should instead be that pandemics can kill large groups of humans, and therefore it would be worth exploring the possibility that AI systems use pandemics as a mechanism to kill large numbers of biological humans.

AI strategy and policy

Activism by the AI Community: Analysing Recent Achievements and Future Prospects (Haydn Belfield) (summarized by Rohin): The AI community has been surprisingly effective at activism: it has led to discussions of a ban on lethal autonomous weapons systems (LAWS), created several initiatives on safety and ethics, and has won several victories through organizing (e.g. Project Maven). What explains this success, and should we expect it to continue in the future? This paper looks at this through two lenses.

First, the AI community can be considered an epistemic community: a network of knowledge-based experts with coherent beliefs and values on a relevant topic. This seems particularly relevant for LAWS: the AI community clearly has relevant expertise to contribute, and policymakers are looking for good technical input. From this perspective, the main threats to future success are that the issues (such as LAWS) become less novel, that the area may become politicized, and that the community beliefs may become less cohesive.

Second, the AI community can be modeled as organized labor (akin to unions): since there is high demand for AI researchers, and their output is particularly important for company products, and the companies are more vulnerable to public pressure, AI researchers wield a lot of soft power when they are united. The main threat to this success is the growing pool of talent that will soon be available (given the emphasis on training experts in AI today), which will reduce the supply-demand imbalance, and may reduce how commited the AI community as a whole is to collective action.

Overall, it seems that the AI community has had good success at activism so far, but it is unclear whether it will continue in the future.

Rohin's opinion: I think the ability of the AI community to cause things to happen via activism is quite important: it seems much more likely that if AI x-risk concerns are serious, we will be able to convince the AI community of them, rather than say the government, or company executives. This mechanism of action seems much more like the "epistemic community" model used in this paper: we would be using our position as experts on AI to convince decision makers to take appropriate precautions with sufficiently powerful AI systems. Applying the discussion from the paper to this case, we get the perhaps unsurprising conclusion that it is primarily important that we build consensus amongst AI researchers about how risky any particular system is.

Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society (Carina Prunkl and Jess Whittlestone) (summarized by Rohin): This paper argues that the existing near-term / long-term distinction conflates four different axes on which research could differ: the capability level of AI systems (current pattern-matching systems vs. future intelligent systems), the impacts of AI systems (impacts that are being felt now like fairness vs. ones that will be felt in the future like x-risks), certainty (things that will definitely be problems vs. risks that are more speculative) and extremity (whether to prioritize particularly extreme risks). While there are certainly correlations across these axes, they are not the same thing, and discourse would be significantly improved by disambiguating the axes. For example, both authors of the paper see their work as considering the medium-to-long-term impacts of near-to-medium-term AI capabilities.

Rohin's opinion: I definitely agree that near-term and long-term often seem to mean many different things, and I certainly support efforts to be more precise in our language.

While we're talking about near-term and long-term, I'll add in my own gripe: "long-term" implies that the effects will be felt only in the far future, even though many people focused on such effects are doing so because there's a significant probability of such effects being felt in only a few decades.

Exploring AI Futures Through Role Play (Shahar Avin et al) (summarized by Rohin): This paper argues that role playing (akin to the "wargames" used in the military) is a good way to explore possible AI futures, especially to discover unusual edge cases, in a 10-30 year time horizon. Each player is assigned a role (e.g. director of AI at Tencent, or president of the US) and asked to play out their role faithfully. Each game turn covers 2 simulated years, in which players can negotiate and take public and private actions. The game facilitator determines what happens in the simulated world based on these actions. While early games were unstructured, recent games have had an AI "tech tree", that determines what AI applications can be developed.

From the games played so far, the authors have found a few patterns:

- Cooperation between actors on AI safety and (some) restriction on destabilizing uses of AI seem to both be robustly beneficial.

- Even when earlier advances are risky, or when current advances are of unclear value, players tend to pursue AI R&D quite strongly.

- Many kinds of coalitions are possible, e.g. between governments, between corporations, between governments and corporations, and between sub-roles within a corporation.

Rohin's opinion: It makes sense that role playing can help find extreme, edge case scenarios. I'm not sure how likely I should find such scenarios -- are they plausible but unlikely (because forecasting is hard but not impossible), or are they implausible (because it would be very hard to model an entire government, and no one person is going to do it justice)? Note that according to the paper, the prior literature on role playing is quite positive (though of course it's talking about role playing in other contexts, e.g. business and military contexts). Still, this seems like quite an important question that strongly impacts how seriously I take the results of these role playing scenarios.

Other progress in AI

Deep learning

Speeding Up Transformer Training and Inference By Increasing Model Size (Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin et al) (summarized by Rohin): This blog post and associated paper confirm the findings from Scaling Laws for Neural Language Models (AN #87) that the most efficient way to train Transformer-based language models is to train very large models and stop before convergence, rather than training smaller models to convergence.

11

[AN #90]: How search landscapes can contain self-reinforcing feedback loops

11

Ω 9