The ability to make credible commitments is a key factor in many bargaining situations ranging from trade to international conflict. This post builds a taxonomy of the commitment mechanisms that transformative AI (TAI) systems could use in future multipolar scenarios, describes various issues they have in practice, and draws some tentative conclusions about the landscape of commitments we might expect in the future.

Introduction

A better understanding of the commitments that future AI systems could make is helpful for predicting and influencing the dynamics of multipolar scenarios. The option to credibly bind oneself to certain actions or strategies fundamentally changes the game theory behind bargaining, cooperation, and conflict. Credible commitments and general transparency can work to stabilize positive-sum agreements, and to increase the efficiency of threats (Schelling 1960), both of which could be relevant to how well TAI trajectories will reflect our values.

Because human goals can be contradictory, and even broadly aligned AI systems could come to prioritize different outcomes depending on their domains and histories, these systems could end up in competitive situations and bargaining failures where a lot of value is lost. Similarly, if some systems in a multipolar scenario are well aligned and others less so, some worst cases might be avoidable if stable peaceful agreements can be reached. As an example of the practical significance of commitment ability in stabilizing peaceful strategies, standard theories in international relations hold that conflicts between nations are difficult to avoid indefinitely primarily because there are no reliable commitment mechanisms for peaceful agreements (e.g. Powell 2004, Lake 1999, Rosato 2015), even when nations would overall prefer them.

In addition to the direct costs of conflict, the lack of enforceable commitments leads to continuous resource loss from arms races, monitoring, and other preparations for possible hostility. It can also simply prevent gains from trade that binding prosocial contracts and high trust could unlock. A strategic landscape that resembles current international relations in these respects seems possible in a fully multipolar scenario, where no AI system has yet gained a decisive advantage over the others, and no external rule of law can be strongly enforced over all the systems. If AI systems had a much greater ability to commit than we do, however, they could avoid recapitulating these common pitfalls of human diplomacy.

The potential consequences of credible commitments in various TAI scenarios will be discussed more thoroughly in forthcoming work. The purpose of this post is mostly to investigate whether and how credible commitments could be feasible for such systems in the first place.[1] As commitment mechanisms differ in which kinds of commitment they are best suited for, though, some implications for their consequences will also be tentatively explored.

Some quick notes on the terminology in this post:

Commitment ability refers here to an agent's ability to cause others to have a model of its relevant actions and future behavior which matches its own model of itself, or its genuine intentions.[2] This can naturally include arbitrarily complex probabilistic or conditional models. This definition diverges somewhat from how commitments are typically understood, but better captures the broader transparency relevant to bargaining situations. While an agent's model of itself may not always correspond to what it actually ends up doing, the noise from incorrect self-models should ideally be low enough that it doesn't affect the bargaining landscape much.

Closer to the conventional concept of commitment, commitment mechanisms here are ways to bind yourself more strongly to certain future actions in externally credible ways (such as visibly throwing out your steering wheel in a game of chicken).

Approaches to commitment in this context are simply the higher-level frameworks that agents can use to assess and increase the commitment ability of themselves and others. The main content of this post will be outlining these frameworks.

Approaches to commitment between AI systems

This section will discuss ways through which TAI could surpass humans in commitment ability, but also cover the main reasons why this isn't self-evident even between systems that are overall far more capable than humans. In particular, there are three properties of commitment approaches that are at least not obviously satisfied by any of the candidates here, but seem important when talking about commitment ability in a given future environment:

  • General programmability: an agent's commitment ability should be suitable for a wide enough range of commitments and contracts. In natural bargaining situations, various specific commitments are comparatively easy to implement because of contingent factors. For example, lending an expensive camera to one's sibling seems less risky than to a stranger simply because of the high likelihood of frequent future interactions, and geographical features can influence the credibility of various military strategies. The ability to make similarly limited contingent commitments can already have a great influence on the dynamics of multipolar scenarios, of course, but more generalizable approaches seem more informative when the details of likely TAI scenarios are still hazy.
  • Low dependence on specific architectures or paradigms, or alternatively just high compatibility with the trajectories we expect to see in foreseeable and tractable areas of AI development.
  • Economic viability, or costs that scale favorably compared to, e.g., the likely opportunity costs of agents involved in future bargaining.

Classical approaches: mutually transparent architectures

Early discussions in AI safety often assumed that transformative AI systems would be based on advanced models of the fundamental principles of intelligence. Their cognitive architectures could therefore be quite elegant, and perhaps arbitrarily transparent to other similarly intelligent agents. The concept of systems checking each other's source code, or allowing a trusted third party to verify it, was often used as a shorthand for this kind of mutual interpretability. For highly transparent agents whose goals are also contained in compact formal representations, such as utility functions, reliable alliances could even be formed by merging utility functions (Dai 2009, 2019). Work on program equilibrium as a formal solution to certain game-theoretic dilemmas uses source code transparency as a starting point (Tennenholtz 2004; see also Oesterheld 2018), assuming complete information about the other agent's syntax on which to condition one's response.[3] Further work has also generalized the idea of conditional commitments and the cooperative equilibria they support (Kalai et al. 2010).
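
To make the program equilibrium idea concrete, here is a minimal sketch in Python (the bot names are invented for illustration, and the construction is only a toy version of the cited formalisms): each player submits a program that receives the other program's source code, and a simple "clique" program cooperates exactly when the opponent's source is syntactically identical to its own, making mutual cooperation self-enforcing in a one-shot prisoner's dilemma.

```python
# Minimal sketch of a program equilibrium in a one-shot prisoner's dilemma.
# Each submitted program receives the opponent's source code and returns an
# action; the "clique" program cooperates exactly when the opponent's source
# is syntactically identical to its own.

import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent runs this exact program."""
    own_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == own_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent's source."""
    return "D"

def play(program_a, program_b):
    """Give each program the other's source and collect their actions."""
    source_a = inspect.getsource(program_a)
    source_b = inspect.getsource(program_b)
    return program_a(source_b), program_b(source_a)

print(play(clique_bot, clique_bot))  # ('C', 'C'): cooperation is self-enforcing
print(play(clique_bot, defect_bot))  # ('D', 'D'): no cooperation to exploit
```

Syntactic comparison is of course brittle: functionally equivalent but differently written cooperators get treated as defectors, which is one motivation for the more general conditional commitments mentioned above.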

These approaches seem less compatible with recent advances in AI. Capability gain is currently driven mostly by reinforcement learning in increasingly large and complex environments, and less by progress in understanding the building blocks of general cognition (Sutton 2019). It seems plausible that the ultimately successful paradigms for TAI will be conceptually quite close to contemporary work (Christiano 2016). If this is the case, and superhuman systems end up hacky and opaque in the way human brains are, their mutual interpretability could also remain limited, much as it is between humans. Cognitive heterogeneity is already a hindrance to mutual understanding in itself, and will likely be much greater among AI systems than among humans. Considering that all humans share an evolutionary history and are strongly adapted for social coordination, we could even be much better at credibility and honesty than independently trained AI systems, if those are developed through very different methods or in varying conditions and have no such natural adaptations for transparency.[4]

On the other hand, superhuman agents could also be able to define and map the foundations of intelligence better than human researchers. Even prosaic trajectories could thus eventually lead to more compact builds and allow for higher interpretability. Though beyond the reach of human researchers, intentionally designed and elegant cognitive architectures could still ultimately be more efficient than ones that emerged through less controllable (e.g. evolutionary) processes. Increased commitment ability in itself might already motivate agents to move in this direction, if they expect transparency to facilitate more gains from trade or some other competitive advantage. The bargaining landscape would then change in a predictable pattern over time: early AI systems would have poor commitment ability despite otherwise superhuman competence, but after more intentional refactoring towards transparency, strong commitments through classical approaches would eventually become available to their successors.[5]

This kind of self-modification would still lack robust safeguards against some conceptually simple exploits. Even if one could comb through an agent's internal structure at some point after it self-modified to be highly interpretable, it would be costly to make sure that it hasn't, for example, secretly changed something relevant in the environment before this process. In addition, asymmetries in competence would likely appear between agents due to their different domains, histories, and goals. Whether global differences in competence or just local blind spots, these asymmetries might make obfuscating one's intentions a viable strategy after all, and decrease the general credibility of commitments.

If transformative AI systems are built with current paradigms, existing research on interpretability might also be helpful when predicting commitment ability. Even if the kind of syntactic transparency required for program equilibrium approaches isn't feasible, high levels of trust could still be achieved as long as other ways exist to understand another agent's internal decision procedures from the outside. This would resemble a more advanced version of human researchers trying to make contemporary machine learning models more understandable to us.

The literature on interpretability currently lacks a unified paradigm, but it often divides methods for interpretation into model-based and post-hoc approaches (Murdoch et al. 2018). The former require the models themselves to be inherently more understandable through design choices such as sparsity of parameters, modular structures, or more feature engineering based on expert knowledge about the domain in question. These ideas can possibly be extrapolated to TAI scenarios, and some key concepts will be explored below. The latter are more specific to current narrow models, and mostly deal with measures of feature importance, i.e. clarifying which features or interactions in the training data have contributed to the model's outputs, and to what degree. With enough information about how an agent has been trained, analogous methods could perhaps be useful, but likely laborious; they will not be discussed further in this post.

Generally, model-based methods share a problem: they constrain the model design in ways that limit its other capabilities. Some interpretability researchers have suggested that these constraints are less prohibitive than they seem, at least in contemporary applications. When the task is to make sense of some dataset, the set of models that can accomplish such a predictive task (known as the Rashomon set) is potentially large, and thus arguably likely to include some intrinsically understandable models as well (Rudin 2019). This idea might extend quite poorly to general intelligences in practice, especially when computational efficiency is also a concern and the setting is competitive. However, it is related in a sense to the idea that there could be some eventually discoverable, highly compact building blocks that suffice for general intelligence, even if many of the paths there are messier. One way this could hold is if the world and its relations are themselves fundamentally simple or compressible (see e.g. Wigner 1960).
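
As a minimal illustration of the Rashomon-set idea (synthetic data and an arbitrary accuracy tolerance, chosen only for this sketch): among candidate models that predict roughly equally well, one can simply prefer the most interpretable member, operationalized here as the sparsest L1-regularized logistic regression.

```python
# Sketch of a Rashomon-set style search: among models whose cross-validated
# accuracy is within a small tolerance of the best found, pick the most
# interpretable one, here operationalized as the sparsest logistic regression
# (fewest non-zero coefficients). Data and tolerance are purely illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

candidates = []
for C in [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]:  # sparsity-accuracy trade-off
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    n_features = np.count_nonzero(model.fit(X, y).coef_)
    candidates.append((accuracy, n_features, C))

best_accuracy = max(acc for acc, _, _ in candidates)
rashomon_set = [c for c in candidates if c[0] >= best_accuracy - 0.01]
accuracy, n_features, C = min(rashomon_set, key=lambda c: c[1])  # sparsest
print(f"Chose C={C}: accuracy {accuracy:.3f} using {n_features} features")
```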

Another way in which even complex systems could achieve more transparency is through modularity, where various parts of an agent's cognition can be examined and interpreted somewhat independently. Different cognitive strategies, employed in different situations depending on some higher-level judgment, could potentially be both effective and fairly transparent due to their smaller size (and possibly higher fundamental comprehensibility and traceable history) compared to a generally intelligent agent. Whether strongly modular structures are in fact functional or competitive enough in this context will be discussed in forthcoming work, but the greater transparency of modular minds is questionable in any case. It seems unlikely that in a complex world, parts of an effective agent's reasoning could be separable enough from its other capacities to leave no context-dependent uncertainties, or opportunities to secretly defect by using seemingly trustworthy modules in underhanded ways. This certainly doesn't seem to be the case in human brains, despite their likely quite modular structure (for an overview, see e.g. Carruthers 2006, Robbins 2017).

Overall, the relation between interpretability and the kind of transparency that facilitates commitments is not well defined. Being able to interpret an agent's decisions doesn't seem to directly mean that the agent is simulatable or otherwise verifiable to you in specific bargaining situations. Transparency through these means seems especially implausible when local or global asymmetries between agents are large, and possibly when the scenario is adversarial.

Centralized collaborative approaches

A less architecture-dependent but also less satisfying approach is simply assuming that commitment ability is a very difficult problem, and like most very difficult problems, trivially solved by throwing a lot of compute at it. Perhaps TAI systems will remain irredeemably messy, but will be motivated to find ways to cooperate in spite of this. One route they might consider is similar to what human societies have often converged on: centralizing enough power and resources to enforce or check contracts that individual humans otherwise can't credibly commit to.

In this context, the central power could exist either simply to verify the intentions behind arbitrary commitments, or to punish defectors afterwards if they break established laws. As the latter task has been brought up in other contexts [link] and doesn't constitute a meaningfully multipolar scenario, this section will mostly discuss the former. An overseer that merely verifies contracts and commitments instead of dealing out punishments could be more palatable even for agents with idiosyncratic preferences about societal rules: it only requires agents to believe that the ability to make voluntary credible commitments will be positive in expectation.[6] It would nonetheless capture many of the benefits of a central overseer, as one main purpose of punitive systems is precisely to enforce otherwise untenable commitments.

The idea behind this mechanism is simply that while the agents can't interpret each other or predict how well they would stick to commitments, a far more capable system (here, likely just a system with vastly more compute at its disposal) could do it for them. If several agents of similar capability are involved in collaboratively constructing such a system, they can be fairly confident that no single agent can secretly bias it, or otherwise manipulate the outcome. This system would then serve as an arbitrator, likely with no other goals of its own, and remain far enough above the competence level of any other agent in the landscape. Assuming that its subjects will continuously strive for expansion and self-improvement, this minimal-state brain would also need to keep growing. As long as it remains useful, it could do so by collecting resources from the agents that expect to benefit from its abilities.

How much more intelligent would such a system need to be, though? Massively complex neural architectures could well remain inscrutable even to much more competent agents. In terms of neural connections, no human could use a snapshot of a salamander brain to predict its next action, let alone its motivations one hour in the future. Even the simple 302-neuron connectome of the nematode C. elegans mostly escapes our understanding, despite years of effort at emulating its functions and our own neuron count of roughly 8.6x10^10 (see OpenWorm and related projects, 2020). Most likely, judgments about an agent's honesty would need to rely in part on inferences based on its origins and history, slight behavioral inconsistencies, and other subtle external signs it would hopefully not be clever enough to fully cover up. Some causal traces of intentions to deceive bargaining partners could be expected. For the iconic argument in this space, see Yudkowsky (2008).

A major weakness of this approach is that the costs of running such a system seem substantial regardless of how large the capability difference needs to be. The gains from trade that agents could secure through increased commitment ability would have to exceed those costs, and it isn't clear that this is the case. Eventually, there might not be enough surplus left on the table to motivate further contributions to such a costly system. On the other hand, if there is some point after which most of the valuable commitments have been made and an arbitrator is no longer needed, this could just be because the bargaining landscape is by then thoroughly locked into a decently cooperative state: if bargaining failures were still frequent in expectation, there would be more potential surplus left. If so, the overall costs of relying on such a system might not be too high in the long run, as it would mostly be needed only during a transient unstable period of early interactions between AI systems.

There are a few ways through which a centralized arbitrator system could be set up, for example:

  • By humans in different AI labs or safety organizations, who recognize the long-term benefits of cooperative commitments and want to pave the way for a system that could verify them. Some relevant interventions in the AI policy and governance space could already be available at this point, and similarly, leading AI labs could already coordinate related projects.
  • By AI systems in early deployment, or even some later stages of instrumental expansion, who can directly construct an arbitrator and collaboratively watch over the process. To self-organize for such a project, agents would already need to communicate on some level, and agree that setting up such a system will benefit them sufficiently.
  • By an agent with some other goals, in a particularly suitable position, that optimizes for this task for instrumental reasons such as increased security and resources. Due to contingent events or features, some agents can by chance be much more credible than average, at least to some other agents, and seek to leverage this. This path is somewhat related to the approach outlined in the next section.

Decentralized collaborative approaches

A potentially less costly collaborative approach could work if credibility can be mapped to a multidimensional model where agents start out differentially trusting each other because of path-dependent or idiosyncratic reasons, and can then form networks to verify commitments. For example, due to domain-specific differences and histories, even agents whose overall competence is roughly on your level could spot minor details that you miss because of your own limitations, but that are relevant to your credibility. If architectural similarities matter for transparency, some agents will be able to understand each other's internal workings better than others; this could be the case if copies either of agents or their internal modules become common. Some agents can also come to share origins or relevant interactions that allow them to form better models of each other.

This approach differs from a centralized project in that it describes conditions where gradients of trust form with low initial effort along these path-dependent dimensions. As trust, at least in a general sense, can be largely transitive, the costs of communicating within a network under such conditions could stay reasonably manageable. More than a specific mechanism, this approach would be a way to extend already existing local and empirical commitment ability, at least in a probabilistic manner, across larger areas of the bargaining landscape.

As a simplified example, say that there are three agents, A, B, and C, whose trust in each other is binary: they either can or cannot trust one another. Agent A can trust agent B (and vice versa) because of many shared modules, but cannot trust the internally very different agent C. Agent B can trust agent C (and vice versa) because of a shared history. If agents A and C then want to communicate credibly with each other, it seems easy for them to verify their agreements through their links to agent B.
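
A minimal sketch of this example in code (the agents and the binary trust relation are just the toy setup above): treating trust links as edges in a graph, any chain of mutually trusting intermediaries lets A and C route verification through B.

```python
# Minimal sketch of transitive trust as path-finding in a graph. Nodes are
# agents; an edge means the two agents can directly verify each other's
# commitments (binary trust, as in the example above).

from collections import deque

trust_links = {
    "A": {"B"},        # A and B share many modules
    "B": {"A", "C"},   # B and C share a history
    "C": {"B"},
}

def verification_path(source, target):
    """Return a chain of mutually trusting agents from source to target, if any."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in trust_links.get(path[-1], set()) - visited:
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None

print(verification_path("A", "C"))  # ['A', 'B', 'C']: verify via agent B
```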

In larger agent spaces, longer chains and networks of agents with differential levels of trust could plausibly come to influence the dynamics of commitment through similar network structures. Even without multiple dimensions to make the task potentially cheaper, however, a less centralized approach can be pursued. Rather than specifically building a central system for the black-box task of verifying commitments, a network of agents that can, alongside their other pursuits, trade various evaluation services could be a more dynamic way to get the required amounts of compute for assessing individual contracts.

While modeling the payoffs that agents could receive by helping others communicate is not a central question in this context, it is interesting when considering the incentives for such tasks. Various models have been built in cooperative game theory to represent limited communication between different parties in a network of collaborators and the payoff distributions in such situations (see e.g. Slikker and Van Den Nouweland 2001). A widely used formalism is the Myerson value, which builds on the Shapley value and allocates a greater part of the surplus in a coalition to players who facilitate communication, and therefore cooperation, between others (Myerson 1977, 1980; Shapley 1951). This and related concepts correspond reasonably well to scenarios where trust differences allow only some agents to credibly communicate with each other. Forthcoming work will investigate in more detail how cooperative equilibria can be sustained in limited-communication situations.
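
As a worked example, the Myerson value for the three-agent chain A-B-C from the previous section can be computed as the Shapley value of the graph-restricted game. The characteristic function below is purely illustrative (any connected coalition of two or more agents realizes a surplus of 1); under it, B receives 2/3 of the surplus and A and C receive 1/6 each, reflecting B's role as the communication bridge.

```python
# Worked example of the Myerson value for the three-agent chain A-B-C.
# A coalition only generates surplus if its members are connected through
# trust links; the Myerson value is then the Shapley value of this restricted
# game, rewarding B for bridging A and C. The surplus of 1 is illustrative.

from itertools import permutations

players = ["A", "B", "C"]
trust_links = {frozenset({"A", "B"}), frozenset({"B", "C"})}  # the chain A-B-C

def connected(coalition):
    """Check whether a coalition is connected via trust links."""
    coalition = set(coalition)
    if not coalition:
        return False
    frontier, reached = [next(iter(coalition))], set()
    while frontier:
        agent = frontier.pop()
        reached.add(agent)
        frontier += [b for b in coalition - reached
                     if frozenset({agent, b}) in trust_links]
    return reached == coalition

def restricted_value(coalition):
    """Surplus of 1 for any connected coalition of at least two agents."""
    return 1.0 if len(coalition) >= 2 and connected(coalition) else 0.0

def myerson_values():
    """Shapley value of the restricted game, averaged over join orderings."""
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        for i, player in enumerate(order):
            marginal = (restricted_value(order[:i + 1])
                        - restricted_value(order[:i]))
            values[player] += marginal / len(orderings)
    return values

print(myerson_values())  # approximately {'A': 0.167, 'B': 0.667, 'C': 0.167}
```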

Overall, the approach described in this section mostly serves as a rough sketch for much more sophisticated network strategies that AI systems could devise, but even with very liberal hypothetical extrapolation, the bridge to plausible practical scenarios seems shaky. At the very least, the availability and strength of any local credibility links are determined mostly by higher-level features of the agent space, although intentionally creating more of them seems possible for cooperative human researchers during development.

Automated approaches

High transparency can sometimes be achieved by finding a commitment mechanism simple enough that its workings are unambiguous from the outside. By separating a commitment from the sophisticated strategies and other cognitive complexities of the agent itself, an effective approach can consist of nothing more than automatic structures responding to the environment in predictable ways. Nuclear control systems were presumably built in the Cold War-era Soviet Union that could be triggered by sensor input alone, to ensure retaliation with minimal human intervention (Wikipedia 2020).[7] Companies can irreversibly invest in and deploy specific assets, tying their hands to a certain strategy, often in an observable and understandable way (e.g. Sengul et al. 2011). Similarly, militaries can reduce their options by mobilizing troops that would be too costly to recall regardless of what one's opponent chooses to do (Fearon 1997).
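
A minimal sketch of why such mechanisms can be comparatively transparent (the sensors, thresholds, and triggered action are entirely hypothetical): the whole decision procedure fits in a few lines that an outside party can audit, unlike the agent that deployed it, though its reliability depends entirely on how well the trigger conditions anticipate the environment.

```python
# Minimal sketch of an automated commitment device; thresholds and actions are
# hypothetical. Its appeal is that the entire trigger logic is short enough to
# be audited externally, unlike the agent that deployed it; its weakness is
# that the trigger conditions must anticipate the environment well.

from dataclasses import dataclass

@dataclass
class SensorReading:
    seismic_spike: bool
    command_link_down: bool
    hours_since_last_checkin: float

def retaliation_triggered(reading: SensorReading) -> bool:
    """Fire iff the pre-committed, externally auditable conditions are met."""
    return (reading.seismic_spike
            and reading.command_link_down
            and reading.hours_since_last_checkin > 6.0)

# A weather anomaly that knocks out the command link but causes no seismic
# spike does not trigger this device; a poorly chosen condition set would.
print(retaliation_triggered(SensorReading(False, True, 12.0)))  # False
```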

While powerful in many specific cases, this approach is quite limited especially in complex environments. With large differences in general or domain-specific competence, there might be few situations where simple automated mechanisms can even be built transparently enough. Regardless of how interpretable and robust some physical device or resource investment seems, it doesn't remove the intelligent agent from the equation, or again prevent it from setting up the environment in a clever way that allows for defection after all.

In most contexts an automated approach has many other downsides as well, such as a lack of flexibility and corrigibility when there are unpredictable events in the environment. It seems unlikely that highly verifiable automated mechanisms could be built with the resolution to track the ideal commitments one could make in complex situations, and most interesting contracts likely could not be represented at all. In environments with agents that are much more diverse than humans, nations, or organizations, the physical reliability of simple commitment devices could be illusory even when they are set up by agents that are sincere in their commitments. While the fearful symmetry seen in nuclear deterrence strategies may be the best option available for practically reducing the incidence of conflict, it has historically led to mistaken near-launches due to unforeseen details such as weather anomalies (Wikipedia 2020). This illustrates how even applying a simple commitment mechanism requires a good understanding of the environment, including one's peers and their behavior space, when designing viable trigger conditions for whatever the intended procedure is. The worse one's models of the other players are, the harder this task becomes.

Strategic delegation

In the economic and game-theoretic literature, a related but typically more flexible approach is strategic delegation, where principals deploy agents with different direct incentives to act on their behalf. By optimizing for something other than the principal's actual goal, delegated agents can sometimes reach better bargaining outcomes, because the desired commitment is naturally more favorable to their incentives. For example, a manager may be responsible merely for keeping a company in the market, not for its ultimate profit margins, which credibly changes the way they will respond to threats in entry deterrence games (Fershtman and Judd 1987). The original formalism behind strategic delegation (Vickers 1985) involves an agent appointment game that precedes the actual game between agents and determines how the agents in the latter game play, with outcomes given by an exogenously specified outcome function. More recent work (Oesterheld and Conitzer 2019) describes how delegates with modified incentives can safely strive for Pareto improvements.
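
A toy entry-deterrence game (all payoffs invented for this sketch) illustrates the mechanism. Solving by backward induction, a profit-maximizing incumbent accommodates entry, so the entrant enters; a delegate rewarded for market share instead of profit credibly fights, which deters entry and leaves the principal with the monopoly payoff.

```python
# Illustrative entry-deterrence game; all payoffs are made up for this sketch.
# Backward induction: the entrant predicts the incumbent's response to entry.
# A profit-maximizing incumbent accommodates, so entry happens; a delegate
# rewarded for market share credibly fights, which deters entry.

# (entrant_payoff, incumbent_profit, incumbent_market_share)
PAYOFFS = {
    ("enter", "fight"):       (-2, 1, 0.9),
    ("enter", "accommodate"): ( 4, 4, 0.5),
    ("stay_out", None):       ( 0, 10, 1.0),
}

def outcome(incumbent_objective):
    """Solve the game by backward induction for a given incumbent objective."""
    index = {"profit": 1, "market_share": 2}[incumbent_objective]
    response = max(["fight", "accommodate"],
                   key=lambda r: PAYOFFS[("enter", r)][index])
    entrant_payoff_if_enter = PAYOFFS[("enter", response)][0]
    if entrant_payoff_if_enter > PAYOFFS[("stay_out", None)][0]:
        return ("enter", response)
    return ("stay_out", None)

print(outcome("profit"))        # ('enter', 'accommodate'): entry not deterred
print(outcome("market_share"))  # ('stay_out', None): delegation deters entry
```

The catch, as discussed below, is that the entrant must believe the delegate really is bound to the market-share objective, which is itself a commitment problem.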

The practical applications of these models are not immediately clear in the empirical future scenarios we might envision. As pointed out by Oesterheld and Conitzer, the process of committing one's delegates to their modified incentives must itself already be credible. If the deployed agent differs from the principal mostly in terms of incentives, and not in competence, agency, or internal complexity, it may not be much more transparent in its commitments than the principal was. Perhaps some goals are more verifiable or otherwise credible than others, e.g. in terms of observable actions that are consistent with them, but the fundamental problem of internal opaqueness remains. One solution is to deploy the agent only for a specific bargaining situation, for which it is trained in a mostly transparent way where an observer can see the details of the training procedure. However, similarly to how modules within a single agent pose challenges, it is unclear how well individual bargaining situations could be separated from enforcing the resulting agreements in the environment, and the enforcement would again presumably require a more generally competent agent to be crucially involved.

Iteration, punishment capacity, and other miscellaneous factors

If interactions in the bargaining environment are iterated, or one's history is visible to outside parties one might trade with later on, reputation concerns can incentivize sticking to commitments. This is a well-known finding in game theory (see e.g. Mailath & Samuelson 2006), and will not be discussed much further here, but ought to be included for the sake of completeness. In transparent iterated scenarios, an agent expects other players to be able to punish it later for breaking commitments. Even if the environment is uncertain, adhering to costly commitments can signal credibility to future bargaining partners as a long-term strategy. A concrete special case of the former mechanism is having a central power or other system with the material capacity to retrospectively punish agents that renege on their contracts, much like law enforcement in human societies works through the deterrent effect of designed consequences for defection.[8]

As it is mostly far upstream from commitment ability, increasing the iteration factor of interactions for the sake of credibility seems inefficient and probably intractable. Among agents whose strategies optimize for the very long term, it is also unreliable: if interactions are repeated in an environment where the stakes get higher over time, most agents would prefer to be honest while the stakes are low, regardless of how they will act in a sufficiently high-stakes situation. This holds especially because the higher the stakes get in a competition for expansion, the fewer future interactions one expects, as wiping out other players entirely becomes a possible outcome. Iteration alone would therefore provide limited information, even if it sometimes is the only practical way to provide evidence of one's trustworthiness.
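
A minimal simulation of this point (all numbers are hypothetical): a purely strategic agent cooperates exactly as long as the one-off gain from breaking a commitment is smaller than the discounted value of continued cooperation, so a long clean record at low stakes says little about behavior once the stakes peak.

```python
# Minimal sketch of why honesty at low stakes is weak evidence of honesty at
# high stakes. A forward-looking opportunist cooperates while the discounted
# value of the ongoing relationship exceeds the one-off gain from defecting,
# then defects. Stakes, discount, and surplus share are all hypothetical.

def strategic_agent_actions(stakes, discount=0.9, cooperation_share=0.5):
    """Round-by-round actions of a forward-looking opportunist."""
    actions = []
    for t, stake in enumerate(stakes):
        # Value of preserving the relationship: this agent's share of all
        # remaining surplus, discounted back to the current round.
        future_value = sum(cooperation_share * s * discount ** (k - t)
                           for k, s in enumerate(stakes) if k > t)
        keep_cooperating = cooperation_share * stake + future_value
        grab_everything = stake  # take the whole stake, relationship ends
        if grab_everything > keep_cooperating:
            actions.append("defect")
            break
        actions.append("cooperate")
    return actions

stakes = [1, 2, 4, 8, 16, 32, 64, 1000]  # stakes grow over time
print(strategic_agent_actions(stakes))
# Cooperates for seven rounds, then defects when the largest stake arrives.
```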

Both epistemic and normative features of individual agents can make their commitments more credible, if these features are common knowledge. Human cultures, for instance, have used religious notions to signal commitment to certain strategies (e.g. Holslag 2016), perhaps often successfully relative to the available counterfactual approaches. Agents could also come to intrinsically value transparency or choose to adhere to commitments, either through moral values or certain decision-theoretic policy choices (Drescher 2006). These choices would not in themselves make commitments externally credible, of course, but could have verifiable origins depending on the agent's history.

Conclusions and further notes

As mentioned above, each commitment strategy described here seems to suffer from potentially serious drawbacks, though in different areas and circumstances. Many plausible scenarios can be envisioned where one or more of the approaches succeeds in supporting credible commitment. Different approaches could even be used in overlapping ways to compensate for each other's weaknesses, though this holds less if the main weakness is resource costs. In many cases, the feasibility of commitments seems to come down to whether the surplus from cooperation will be enough to incentivize a great deal of collective effort. Another fundamental question is how costly it is to carefully obfuscate one's intentions, versus to detect obfuscation by observing an agent's behavior and history.

On a more practical level, contingent features such as agent heterogeneity and logistics suggest that even if contracts and commitments were overall feasible, they would be costlier to verify between some agents than between others. Rather than expecting uniform opportunities for commitment throughout the landscape, we should perhaps assume the environment will be governed by some n-dimensional mess of gradients in commitment ability. Comparing agents along axes such as physical location, architectural similarity, history, normative motivations, and willingness to cooperate, some of them would likely be in better positions to make credible commitments to each other. This does not necessarily prevent widely cooperative dynamics, especially if there is a lot of transitivity in commitment ability between agents as speculated above, but makes the path there more complicated in terms of interventions.

Another insight from this work is that committing to threats could require completely different mechanisms or approaches than committing to cooperation, and future discussions of commitment among AI systems should ideally reflect this. Notably, as many ways of signaling one's intentions already require some minimal collaborative labor, it seems much more feasible, commitment-wise, to make prosocial commitments than to extort others.[9] When you can't simply inform your target of a threat and your intention to carry it out, and would instead need them to go through a costly process to get your intentions properly verified, you might find that they aren't very interested in hearing more about your plans.[10] One exception seems to be the dumber mechanisms, which are well suited for destructive threats but might not be able to represent complex voluntary trade contracts.

Acknowledgements

This post benefited immensely from conversations with and feedback from Jesse Clifton, Richard Ngo, Daniel Kokotajlo, Lin Chi Nguyen, Lukas Gloor, Stefan Torges, Anthony DiGiovanni, Caspar Oesterheld, as well as all my colleagues at Center on Long-term Risk (CLR) and the attendees at CLR's 2019 S-risk workshop, which inspired many of the initial ideas explored here.


  1. Whether or not AI systems can credibly commit to humans is not discussed much in this post, though it is also an interesting question. ↩︎

  2. There are some counterexamples to this definition that work against the spirit of commitment, though. Agents could still knowingly omit relevant contextual information they have about the world, or about processes they set in motion earlier which are now separate from the agent as such. On the other hand, requiring that no interesting contextual information be hidden when commitments are made seems impossibly strict, since even minuscule differences between the beliefs of two agents could turn out to be relevant in ways that are unpredictable or overly costly to map out. ↩︎

  3. Even in Bayesian games, contracts conditional on the other players' contracts can possibly be formulated. This idea has been investigated by Peters and Szentes (2012), but seems overly abstract to be relevant here. ↩︎

  4. This point was raised by Richard Ngo during our conversations in 2019. ↩︎

  5. It's not a given that elegance alone would lead to increased interpretability, of course. For example, even if there were fundamental patterns to intelligence that superhuman systems could discover and model themselves after, these more compact foundations could still possibly be implemented in any number of different ways, none of which might be uniquely efficient. ↩︎

  6. Some minimal non-aggression principles could hopefully also be added in to prevent agents from using the commitment system for extortion. This would on average again be in the interests of participants, as extortion causes expected value loss in the bargaining environment. ↩︎

  7. The system's predictability was apparently hampered by the inexplicable policy decision to not mention it much to outsiders, though. ↩︎

  8. Again, of course, this would not constitute a genuinely multipolar scenario. ↩︎

  9. This is pretty intuitive, as it's also how human commitment structures have been designed -- as a kidnapper, you could hardly hire a lawyer to write an enforceable contract that binds you to actually killing your hostages unless you get what you want. ↩︎

  10. This would naturally not mean that you couldn't have internally committed to the threat regardless of whether your target listens to you or not, but this would at least be an unwise strategy. ↩︎

Comments

Nice post! A couple of quick comments:

"If interactions are repeated in an environment where the stakes get higher over time, most agents would prefer to be honest while the stakes are low, regardless of how they will act in a sufficiently high-stakes situation."

If honesty in low-stakes situations is very weak evidence of honesty in high-stakes situations, then it will become less common as an instrumental strategy, which makes it stronger evidence, until it reaches equilibrium.

More generally, I am pretty curious about how reputational effects work when you have a very wide range of minds. The actual content of the signal can be quite arbitrary - e.g. it's possible to imagine a world in which it's commonly understood that lying continually about small scales is intended as a signal of the intention to be honest about large scales. Once that convention is in place, then it could be self-perpetuating.

This is a slightly extreme example but the general point remains: actions taken as signalling can be highly arbitrary (see runaway sexual selection for example) when they're not underpinned by specific human mental traits (like the psychological difficulty of switching between honesty and lying).

"This holds especially because the higher the stakes get in a competition for expansion, the fewer future interactions one expects, as wiping out other players entirely becomes a possible outcome."

Seems plausible, but note that early iterated interactions allow participants to steer towards possibilities where important outcomes are decided by many small interactions rather than a few large interactions, making long-term honesty more viable.

"lending an expensive camera to one's sibling seems less risky than to a stranger simply because of the high likelihood of frequent future interactions"

This doesn't seem right; your sibling is by default more aligned and trustworthy.

"while the agents can't interpret each other or predict how well they would stick to commitments, a far more capable system (here, likely just a system with vastly more compute at its disposal) could do it for them."

Is it fair to describe this as creating a singleton?

What do you think about the following sort of interpretability?

You and I are neural nets. We give each other access to our code, so that we can simulate each other. However, we are only about human-level intelligence, so we can't really interpret each other--we can't look at the simulated brain and say "Ah yes, it intends to kill me later." So what we do instead is construct hypothetical scenarios and simulate each other being in those scenarios, to see what they'd do. E.g. I simulate you in a scenario in which you have an opportunity to betray me.

Super thoughtful post!

I get the feeling that I'm more optimistic about post-hoc interpretability approaches working well in the case of advanced AIs. I'm referring to the ability of an advanced AI in the form of a super large neural network-based agent to take another super large neural network-based agent and verify its commitment successfully. I think this is at least somewhat likely to work by default (i.e. scrutinizing advanced neural network-based AIs may be easier than obfuscating intentions). I also think this may potentially not require that much information about the training method and training data.

I thought before that this doesn't matter in practice because of the possibility of self-modification and successor agents. But I now think that at least in some range of potential situations, verifying the behavior of a neural network seems enough for credible commitment when an agent pre-commits to using this neural network, e.g. via a blockchain.

Also, are you sure that the fact that people can't simulate nematodes fits well in this argument? I may well be mistaken but I thought that we do not really have neural network weights for nematodes, we only have the architecture. In this case it seems natural that we can't do forward passes.