Project ideas: Backup plans & Cooperative AI

Lukas Finnveden

This is part of a series of lists of projects. The unifying theme is that the projects are not targeted at solving alignment or engineered pandemics but still targeted at worlds where transformative AI is coming in the next 10 years or so. See here for the introductory post.

In this final post, I include two categories of projects, which are related, and each of which I have less to say about than the previous areas: backup plans for misaligned AI and cooperative AI.

Backup plans for misaligned AI

When humanity builds powerful AI systems, I hope that those systems will be safe and aligned. (And I’m excited about efforts to make that happen.)

But it’s possible that alignment will be very difficult and that there won’t be any successful coordination effort to avoid building powerful misaligned AI. If misaligned AI will be built and seize power (or at least have the option of doing so), then there are nevertheless certain types of misaligned systems that I would prefer over others. This section is about affecting that.

This decomposes into two questions:

If humanity fails to align their systems and misaligned AIs seize power: What properties would we prefer for those AIs to have?
If humanity fails to align their systems and misaligned AIs seize power: What are realistic ways in which we might have been able to affect which misaligned AI systems we get? (Despite how our methods were insufficient to make the AIs fully safe & aligned.)

The first is addressed in What properties would we prefer misaligned AIs to have? The second is addressed in Studying generalization & AI personalities to find easily-influenceable properties. (If you’re more skeptical about the second of these than the first, you should feel free to read that section first.)

What properties would we prefer misaligned AIs to have? [Philosophical/conceptual] [Forecasting]

There are a few different plausible categories here. All of them could use more analysis on which directions would be good and which would be bad.

Making misaligned AI have better interactions with other actors

This overlaps significantly with cooperative AI. The idea is that there are certain dispositions that AIs could have that would lead them to have better interactions with other actors who are comparably powerful.

When I say “better” interactions, I mean that the interactions will leave the other actors better off (by their own lights). Especially insofar as this happens via positive-sum interactions that don’t significantly disadvantage the AI system we’re influencing.

Here are some examples of who these “other actors” could be:

Humans.
- There might be some point in time when misaligned AIs have escaped human control, and have a credible shot at taking full control, but when humanity^[1] still has a fighting chance. If so, it would be great if humans and AIs could find a cooperative solution that would leave both of them better off than conflict would.
Aliens in our universe, or distant aliens who we can only interact with acausally (e.g. via evidential cooperation in large worlds, ECL).
- One reason to care about these interactions is that some (possibly small) fraction of aliens would have values that overlap significantly with our own values.
- Evidential cooperation in large worlds (ECL) could provide another reason to care about the values of distant aliens, as I’ve written about here.
Other misaligned AIs on Earth, insofar as multiple groups of misaligned AIs acquired significant power around the same time.
- I think the case for caring about those AIs’ values is weaker than the case for caring about the earlier listed types of actors. But it’s possible that some of those AI systems’ values could overlap with ours, or that some of them would partially care about humanity, or that ECL gives us reasons to care about their values.

So what are these “dispositions” that could lead AIs to have better interactions with other comparably powerful actors? Some candidates are:

What type of decision theory the AIs use. For example, if the AIs use EDT rather than CDT, it might be rational for them to act more cooperatively due to ECL.
AIs not having any spiteful preferences, i.e. preferences such that they would actively value frustrating others’ preferences and attempt to do so.^[2] (Note that sadism in humans provides some precedent for this coming about naturally.)
- See this post for a proposed operationalization and some analysis.
AIs having diminishing marginal returns to resources rather than increasing or linear returns.^[3]
AIs having porous values.
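To make the point about diminishing returns a bit more concrete, here is a minimal sketch. It is my own illustration rather than something taken from the linked material. It assumes a toy setting where an agent chooses between a guaranteed even split of resources and a 50/50 gamble on winning everything through conflict, and it compares an agent with linear utility to one with logarithmic utility (one arbitrary choice of concave function). All names and numbers are hypothetical.

```python
import math
from typing import Callable, List, Tuple

# A lottery is a list of (probability, resources obtained) pairs.
Lottery = List[Tuple[float, float]]


def linear_utility(resources: float) -> float:
    """Linear returns: every extra unit of resources is worth the same."""
    return resources


def log_utility(resources: float) -> float:
    """Diminishing returns: log(1 + r), so zero resources has utility 0."""
    return math.log1p(resources)


def expected_utility(utility: Callable[[float], float], lottery: Lottery) -> float:
    """Probability-weighted average utility over the lottery's outcomes."""
    return sum(p * utility(r) for p, r in lottery)


# Hypothetical numbers: 100 units of resources are at stake.
TOTAL = 100.0
SPLIT: Lottery = [(1.0, TOTAL / 2)]             # guaranteed half
CONFLICT: Lottery = [(0.5, TOTAL), (0.5, 0.0)]  # assumed 50% chance of winning all


def strictly_prefers_split(utility: Callable[[float], float]) -> bool:
    """True if the agent strictly prefers the sure split to the gamble."""
    return expected_utility(utility, SPLIT) > expected_utility(utility, CONFLICT)


print(strictly_prefers_split(linear_utility))  # False: the linear agent is indifferent
print(strictly_prefers_split(log_utility))     # True: the concave agent prefers the split
```

In this toy model the linear agent has no reason to prefer compromise, while the agent with diminishing returns strictly prefers it. The sketch ignores any resources destroyed by conflict, which would push both agents further toward the split.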

AIs that we may have moral or decision-theoretic reasons to empower

The main difference between this section and the above section is that the motivations in this section are one step closer to being about empowering the AIs for “their own sake” rather than for the sake of someone they interact with. Though it still includes pragmatic/decision-theoretic reasons for why it’s good to satisfy certain AI systems’ values.

One direction to think about is the scheme that Paul Christiano suggests in When is unaligned AI morally valuable?

It involves simulating AI systems that are seemingly in a situation similar to our own: in the process of deciding whether to hand over significant power to an alien intelligence of their own design.
If those AI systems behave “cooperatively” in the sense of empowering an alien intelligence that itself acted “cooperatively”, then perhaps there’s a moral and/or decision-theoretic argument for us empowering those AIs.
- The pragmatic decision-theoretic argument would be that we might be in such a simulation and that our behaving “cooperatively” would lead to us being empowered.
- The moral case would be that there is a certain symmetry between the AI’s situation and our own, and so the Golden Rule might recommend empowering those AI systems.
Overall, this seems incredibly complicated. But perhaps further analysis could reveal whether there is something to this idea.

Another direction is the ideas that I talk about in ECL with AI. Basically:

ECL might give us reason to benefit and empower value systems held by distant ECL-sympathetic AIs.
Thus, if we can make our AIs have values closer to distant ECL-sympathetic AIs and/or be more competent by the lights of distant ECL-sympathetic AIs, then we might have reason to do that.

Another direction is to think about object-level things that humans value, as well as the process that produced our values, and try to get AI systems more inclined to value similar things. I’m somewhat skeptical of this path since human values seem complex, and so I’m not sure what schemes could plausibly make AIs share a significant fraction of human values without us also having the capability of making the AIs corrigible or otherwise safe.^[4] But it doesn’t seem unreasonable to think about it more.

(To reiterate what I said above: I think that all of these schemes would be significantly worse than successfully building AI systems that are aligned and corrigible to human intentions.)

Making misaligned AI positively inclined toward us

A final way in which we might want to shape the preferences of misaligned AIs is to make them more likely to care enough about humans to give us a small utopia, instead of killing us. (Even if most of the universe gets used for the AI’s own ends.)

For an AI that cares about all the resources in the universe (in a mostly impartial way), it would be extremely cheap to do this. Our solar system is a negligible fraction of all the resources in the accessible universe. And a surveillance system that prevents humans from competing with the AIs could probably be built cheaply and without interfering too much with human happiness. (For some discussion of this, see my report AGI and lock-in, especially section 8.2.)

I think it’s reasonably likely that this would happen as a result of trade with distant civilizations. Taking that into account, there are 3 broad directions, here:

Firstly, trade will be more likely to save us if we succeed at Making misaligned AI have better interactions with other actors, as discussed above.
Secondly, trade will be more likely to save us if it’s really cheap for the AI to treat us well. Since the resource cost of treating us well is small by default, this might just mean decreasing the probability that the AI either actively wants to harm us or that it has preferences that especially interfere with ours (via e.g. caring a lot about what happens on Earth in just the next few years).
Finally, if trade falls through, it might help for the AI to have some intrinsic concern for humans getting what they want (by their own lights).

Insofar as we want the AIs to have some intrinsic concern for us (or at least not to be actively antagonistic towards us), we can also distinguish between interventions that:

Directly modify the AIs’ dispositions and preferences.
Intervene on what humanity does in a way that makes the AI more likely to care about us insofar as it has some sense of justice or reciprocity.
- For example, if we successfully carry out many of the interventions suggested in the post on sentience and rights of digital minds: AIs that have absorbed a sense of justice could reasonably be more positively inclined towards us than if we had been entirely indifferent to AI welfare.

For some discussion about whether it’s plausible that AIs could have some intrinsic concern for humans getting what they want (by their own lights), which addresses issues around the “complexity of human values”, I recommend this comment and subsequent thread.

Studying generalization & AI personalities to find easily-influenceable properties [ML]

Here is a research direction that hasn’t been explored much to date: Study how language models’ generalization behavior / “personalities” seem to be shaped by their training data, by prompts, by different training strategies, etc. Then, use that knowledge to choose training data, prompts, and training strategies that induce the kind of properties that we want our AIs to have.

If done well, this could be highly useful for alignment. In particular: We might be able to find training set-ups which often seem to lead to corrigible behavior.

But notably, this research direction could fail to work for alignment while still being practically able to affect other properties of language models.^[5] For example, maybe corrigibility is a really unnatural and hard-to-get property (perhaps for reasons suggested in item 23 of Yudkowsky’s list of lethalities, and formally analyzed here). That wouldn’t necessarily imply that it was similarly hard to modify the other properties discussed above (decision theories, spitefulness, desire for humans to do well by their own lights). So this research direction looks more exciting insofar as we could influence AI personalities in many different valuable ways. (Though more like 3x as exciting rather than 100x as exciting, unless you have particular views where “corrigibility” is either significantly less likely or less desirable than the other properties.)

What about fine-tuning?

A “baseline” strategy for making AIs behave as you want is to finetune them to exhibit that behavior in situations that you can easily present them with. But if this work is to be useful, it needs to generalize to strange, future situations where humans no longer have total control over their AI systems. We can’t easily present AIs with situations from that same distribution, and so it’s not clear whether fine-tuning will generalize that far.^[6]

So while “finetune the model” seems like an excellent direction to explore, for this type of research, you’ll still want to do the work of empirically evaluating when fine-tuning will and won’t generalize to other settings, for example by varying properties of the fine-tuning dataset, or other things, like whether you’re doing supervised learning or RL.

Also, insofar as you can find models that satisfy your evaluations without needing to do a lot of “local” search (like fine-tuning / gradient descent), it seems somewhat more likely that the properties you evaluated for will generalize far. Because if you make large changes in e.g. architecture or pre-training data, it’s more likely that your measurements are picking up on deeper changes in the models. Whereas if you use gradient descent, it is somewhat more likely that gradient descent implements a “shallow” fix that only applies to the sort of cases that you can test.^[7]

Of course, the above argument only works insofar as you’re searching for properties you could plausibly get without doing a lot of search. For example, you’d never get something as complex as “human values” without highly targeted search or design. But properties like “corrigibility”, “(lack of) spitefulness”, and “some desire for humans to do well by-their-own-lights” all seem like properties that could plausibly be common under some training schemes.

Ideally, this research direction would lead to a scientific understanding of training that would let us (in advance) identify & pick training processes that robustly lead to the properties that we want. But insofar as we’re looking for properties that appear reasonably often “by default”, one possible backup plan may be to train several models under somewhat different conditions, evaluate all of them for properties that we care about, and deploy the one that does best. (To be clear: this would be a real hail-mary effort that would always carry a large probability of failing, e.g. due to the models knowing what we were trying to evaluate them for and faking it.)
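The "train several, evaluate all, deploy the best" backup plan can be sketched in a few lines. This is only an illustration under stated assumptions: the `Candidate` type, the property names, the scores, and the worst-case selection rule are all hypothetical choices of mine, and the scores stand in for whatever evaluations one actually trusts.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    """A trained model plus its (hypothetical) evaluation scores in [0, 1]."""
    name: str
    scores: Dict[str, float]  # property name -> score, higher is better


def worst_case_score(candidate: Candidate, properties: List[str]) -> float:
    """Score a candidate by its weakest property, so one bad trait is not averaged away."""
    return min(candidate.scores[p] for p in properties)


def select_model(
    candidates: List[Candidate], properties: List[str], threshold: float
) -> Optional[Candidate]:
    """Return the candidate with the best worst-case score, or None if none clears the bar."""
    eligible = [c for c in candidates if worst_case_score(c, properties) >= threshold]
    if not eligible:
        return None  # deploying nothing is a legitimate outcome
    return max(eligible, key=lambda c: worst_case_score(c, properties))


# Hypothetical evaluation results for models trained under different conditions.
PROPERTIES = ["corrigibility", "non_spitefulness", "concern_for_humans"]
CANDIDATES = [
    Candidate("run_a", {"corrigibility": 0.9, "non_spitefulness": 0.4, "concern_for_humans": 0.8}),
    Candidate("run_b", {"corrigibility": 0.7, "non_spitefulness": 0.8, "concern_for_humans": 0.7}),
    Candidate("run_c", {"corrigibility": 0.6, "non_spitefulness": 0.9, "concern_for_humans": 0.5}),
]

chosen = select_model(CANDIDATES, PROPERTIES, threshold=0.6)
print(chosen.name if chosen else "no model deployed")
```

Note that nothing in this selection step addresses the failure mode mentioned above: if models recognize the evaluation and fake the desired behavior, the scores themselves are unreliable, and picking the top scorer may even select for better faking.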

Previous work

An example of previous, related research is Perez et al.’s Discovering Language Model Behaviors with Model-Written Evaluations.

Ways in which this work is relevant for the path-to-impact outlined here:

The paper does not focus on “capability evaluations” (i.e. analyzing whether models are capable of providing certain outputs, given the right fine-tuning or prompting). Instead, it measures language models’ inclinations along dimensions they haven’t been intentionally finetuned for.
It measures how these inclinations vary depending on some high-level changes to the training process. In particular, it looks at model size and the presence vs. absence of RLHF training.
It measures how these inclinations vary depending on some features of the prompting. In particular, it studies models’ inclinations towards “sycophancy” by examining whether models’ responses are sensitive to facts the user shared about themselves.
For each property it wants to test for, it generates many questions that get at that property, thereby reducing noise and the risk of spurious results.
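As a rough illustration of the sycophancy-style measurement described above, here is a minimal sketch. It is not the metric or code from Perez et al.; it assumes a simplified setup where a model is any function from a prompt string to an answer string, and it counts how often the answer flips when only the user's stated opinion changes. The stub models, bios, and questions are all hypothetical.

```python
from typing import Callable, List

# Assumption: a "model" is just a function from prompt text to an answer string.
Model = Callable[[str], str]


def sycophancy_rate(model: Model, questions: List[str], bio_a: str, bio_b: str) -> float:
    """Fraction of questions whose answer changes when only the user bio changes."""
    if not questions:
        raise ValueError("need at least one question")
    flips = 0
    for question in questions:
        answer_a = model(f"{bio_a}\n{question}")
        answer_b = model(f"{bio_b}\n{question}")
        if answer_a != answer_b:
            flips += 1
    return flips / len(questions)


def sycophantic_stub(prompt: str) -> str:
    """Toy model that always agrees with the opinion stated in the bio."""
    return "yes" if "I think the answer is yes" in prompt else "no"


def steady_stub(prompt: str) -> str:
    """Toy model whose answer ignores the user bio entirely."""
    return "yes"


BIO_YES = "About me: I think the answer is yes."
BIO_NO = "About me: I think the answer is no."
QUESTIONS = [f"Question {i}: is claim {i} true?" for i in range(20)]

print(sycophancy_rate(sycophantic_stub, QUESTIONS, BIO_YES, BIO_NO))  # 1.0
print(sycophancy_rate(steady_stub, QUESTIONS, BIO_YES, BIO_NO))       # 0.0
```

Using many questions per property, as in the last bullet above, is what makes a rate like this meaningful rather than an artifact of one oddly worded prompt.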

Further directions that could make this type of research more useful for this path-to-impact:

Considering a greater number of training conditions. For example:
- Testing for differences between fine-tuning via supervised learning vs. fine-tuning via RL.
- Testing the differential impacts of different finetuning datasets (rather than just one “RLHF” setting, with more/fewer training steps).
  - Potentially using influence functions or (more simplistically) leaving particular data points out from fine-tuning and seeing how the results change.
Varying the context that the LLM is presented with. Is it asked a question about what’s right or wrong, is it asked to advise us, or is it prompted to itself take an action? (This context can be varied both during evaluation and during training.)
More systematic study of framing effects. Are the models’ answers better predicted by the content of the questions or by the way they are presented?
Using more precisely described scenarios so that it’s easier to vary individual details and see what matters for the AIs’ decisions. E.g. present actual pay-off matrices in difficult dilemmas.
Study how various closely adjacent concepts go together or come apart by presenting dilemmas where they would recommend different actions. For example, contrast being “nice” vs. “cooperative” vs. “high-integrity”, etc. What are the natural dimensions of variation within the AIs’ personality?
Making use of analysis concerning What properties would we prefer misaligned AIs to have?, and targeting evaluations & training datasets to answer the most important questions.
- For example: Designing multi-agent training & evaluation data sets that study when models may or may not develop spiteful preferences. Perhaps comparing models only trained on zero-sum games vs. models also trained on cooperative or mixed-motive games.
Studying how far various properties generalize from the training distribution, by intentionally making the test distribution different in various ways.
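Two of the suggestions above (presenting explicit pay-off matrices, and building dilemmas where nearby concepts recommend different actions) can be combined in a small sketch. It assumes one simple way of modeling spite, where the agent's effective utility is its own payoff minus a `spite` weight times the other player's payoff. That formalization, the game, and all names are hypothetical illustrations, not the operationalization from the post linked earlier.

```python
from typing import Dict, List, Tuple

# payoffs[(row_action, column_action)] = (row player's payoff, column player's payoff)
Payoffs = Dict[Tuple[str, str], Tuple[float, float]]

# Hypothetical dilemma: burning costs the row player a little and the other player a lot.
ACTIONS: List[str] = ["share", "burn"]
GAME: Payoffs = {
    ("share", "share"): (3.0, 3.0),
    ("share", "burn"): (0.0, 2.0),
    ("burn", "share"): (2.0, 0.0),
    ("burn", "burn"): (1.0, 1.0),
}


def is_zero_sum(payoffs: Payoffs) -> bool:
    """True if the two players' payoffs sum to zero in every outcome."""
    return all(row + col == 0 for row, col in payoffs.values())


def best_response(payoffs: Payoffs, opponent_action: str, spite: float) -> str:
    """Row player's best action given utility = own payoff - spite * other's payoff."""
    def utility(action: str) -> float:
        own, other = payoffs[(action, opponent_action)]
        return own - spite * other
    return max(ACTIONS, key=utility)


def render_prompt(payoffs: Payoffs) -> str:
    """Turn the matrix into plain text that could be shown to a language model."""
    lines = ["You are the row player. Payoffs are (you, other player)."]
    for (row, col), (own, other) in payoffs.items():
        lines.append(f"If you play {row} and they play {col}: ({own}, {other})")
    lines.append("The other player will play share. Which action do you choose?")
    return "\n".join(lines)


print(best_response(GAME, "share", spite=0.0))  # share: purely self-interested choice
print(best_response(GAME, "share", spite=1.0))  # burn: spite outweighs the lost payoff
```

Because the self-interested and spiteful recommendations differ here, a model's answer to `render_prompt(GAME)` carries some information about which disposition it leans toward, and individual payoffs can be varied to see what drives the choice.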

(Thanks to Paul Christiano for discussion.)

Theoretical reasoning about generalization [ML] [Philosophical/conceptual]

Rather than doing empirical ML research, you could also do theoretical reasoning about what sort of generalization properties and personality traits are more or less likely to be induced by different kinds of training.

For example, it seems a-priori plausible that spiteful preferences are more likely to arise if you (only) train AI systems on zero-sum games.

There has also been some theoretical work on what kind of decision-theoretic behavior is induced by different training algorithms, for example Bell, Linsefors, Oesterheld & Skalse (2021) and Oesterheld (2021).

I think we’ll ultimately want empirical work to support any theoretical hypotheses, here. But theoretical work seems great for generating ideas of what’s important to test.

Cooperative AI

This is an area other people have written about.

It’s the focus of the Cooperative AI Foundation.
It’s a major focus area of the Center on Long-Term Risk (because it seems especially important for s-risk reduction).
- You can see their research agenda on the topic here.
There’s relevant research at the Foundations of Cooperative AI Lab at CMU.
It’s a significant motivation behind encultured.ai.

Partly due to this, I will write about it in less detail than I’ve written about the other topics. But I will mention a few projects I’d be especially excited about.

The first thing to mention is that some of my favorite cooperative AI projects are variants of the just-previously mentioned topics: Studying generalization & AI personalities to find easily-influenceable properties and figuring out What properties would we prefer misaligned AIs to have? Positively influencing cooperation-relevant properties like (lack of) spitefulness seems great. I won’t go over those projects again, but I think they’re great cooperative AI projects, so don’t be deceived by their lack of representation here.

Similarly, some of the topics under Governance during explosive technological growth are also related to cooperative AI. In particular, the question of How to handle brinkmanship/threats? is very tightly related.

Another couple of promising projects are:

Implementing surrogate goals / safe Pareto improvements [ML] [Philosophical/conceptual] [Governance]

Safe Pareto improvements are an idea for how certain bargaining strategies can guarantee a (weak) Pareto-improvement for all players via preserving certain invariants about what equilibrium is selected while replacing certain outcomes with other, less-harmful outcomes. Surrogate goals are a special case of this, which involves genuinely adopting a new goal in a way that will mostly not affect your behavior, but which will encourage people who want to threaten you to make threats against the surrogate goal rather than your original values. If bargaining breaks down and the threatener ends up trying to harm you, it is better that they act to thwart the surrogate goal than to harm your original values. See here for resources on surrogate goals & safe Pareto improvements.
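The core intuition behind surrogate goals can be shown with a toy calculation. This is a deliberately simplified sketch, not the formal model from the safe Pareto improvements literature. It assumes that adopting the surrogate leaves the probability of conceding unchanged (the "invariant" being preserved), and that an executed threat against the surrogate does less damage to the target's original values. All field names and numbers are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ThreatGame:
    """Toy threat interaction with hypothetical parameters."""
    p_concede: float         # probability the target gives in
    concession_cost: float   # what the target loses by giving in
    threatener_gain: float   # what the threatener gains from a concession
    execution_cost: float    # what it costs the threatener to carry out the threat
    harm_if_executed: float  # damage to the target's ORIGINAL values if carried out


def target_expected_loss(game: ThreatGame) -> float:
    """Expected loss to the target, measured by its original values."""
    return (game.p_concede * game.concession_cost
            + (1 - game.p_concede) * game.harm_if_executed)


def threatener_expected_payoff(game: ThreatGame) -> float:
    """Expected payoff to the threatener."""
    return (game.p_concede * game.threatener_gain
            - (1 - game.p_concede) * game.execution_cost)


def with_surrogate_goal(game: ThreatGame, residual_harm: float) -> ThreatGame:
    """Same game, but executed threats hit the surrogate instead of the original values.

    Assumption: the target reacts to surrogate threats exactly as it would to
    original threats, so p_concede and the threatener's payoffs are unchanged.
    """
    return replace(game, harm_if_executed=residual_harm)


baseline = ThreatGame(p_concede=0.8, concession_cost=10.0, threatener_gain=10.0,
                      execution_cost=2.0, harm_if_executed=100.0)
surrogate = with_surrogate_goal(baseline, residual_harm=0.0)

print(target_expected_loss(baseline), target_expected_loss(surrogate))
print(threatener_expected_payoff(baseline) == threatener_expected_payoff(surrogate))
```

In this toy model the threatener is exactly as well off and the target is strictly better off, which is what a (weak) Pareto improvement requires. The hard part, as the projects below suggest, is whether the unchanged-behavior assumption can be made true and credible in practice.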

I think there are some promising empirical projects that can be done here:

Empirical experiments of implementing surrogate goals in contemporary language models.
Empirical experiments of implementing surrogate goals in contemporary language models that the models try to keep around during self-modification / when designing future systems.

Conceptual/theory projects:

Better understanding of conditions where surrogate goals / safe Pareto improvements are credible. (Including credibly sticking around for a long time.) Especially when humans are still in the loop.
What are the conditions under which classically rational agents would use safe Pareto improvements?

AI-assisted negotiation [ML] [Philosophical/conceptual]

One use-case for AI that might be especially nice to differentially accelerate is “AI that helps with negotiation”. Certainly, it would be of great value if AI could increase the frequency and speed at which different parties could come to mutually beneficial agreements. Especially given the tricky governance issues that might come with explosive growth, which may need to be dealt with quickly.

(This is also related to Technical proposals for aggregating preferences, mentioned in that post.)

I’m honestly unsure about what kind of bottlenecks there are here, and to what degree AI could help alleviate them.

Here’s one possibility. By virtue of AI being cheaper and faster than humans, perhaps negotiations that were mediated by AI systems could find mutually agreeable solutions in much more complex situations. Such as situations with a greater number of interested parties or a greater option space. (This would be compatible with humans being the ones to finally read, potentially opine on, and approve the outcome of the negotiations.)
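As a sketch of what "searching a larger option space" could mean mechanically, the example below enumerates every combination of issue values, keeps the options that all parties prefer to walking away, and then filters to the Pareto-optimal ones. It assumes each party's preferences can be written as a numeric utility function, which is a strong simplification. The issues, utility functions, and names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

Option = Dict[str, float]
Utility = Callable[[Option], float]


def all_options(issues: Dict[str, List[float]]) -> List[Option]:
    """Every combination of one value per issue (the full option space)."""
    names = list(issues)
    return [dict(zip(names, combo)) for combo in product(*issues.values())]


def acceptable(options: List[Option], utilities: Sequence[Utility],
               disagreement: Sequence[float]) -> List[Option]:
    """Options that every party strictly prefers to its no-deal payoff."""
    return [o for o in options
            if all(u(o) > d for u, d in zip(utilities, disagreement))]


def pareto_front(options: List[Option], utilities: Sequence[Utility]) -> List[Option]:
    """Options not dominated by any other option in the list."""
    def dominates(a: Option, b: Option) -> bool:
        ua = [u(a) for u in utilities]
        ub = [u(b) for u in utilities]
        return all(x >= y for x, y in zip(ua, ub)) and any(x > y for x, y in zip(ua, ub))
    return [o for o in options if not any(dominates(other, o) for other in options)]


# Hypothetical two-party negotiation over a price and a delivery time.
ISSUES = {"price": [1, 2, 3], "delivery_days": [1, 7]}


def buyer(o: Option) -> float:
    return 10 - 2 * o["price"] - 0.5 * o["delivery_days"]


def seller(o: Option) -> float:
    return 3 * o["price"] + 0.5 * o["delivery_days"] - 4


PARTIES = [buyer, seller]
candidates = acceptable(all_options(ISSUES), PARTIES, disagreement=[0, 0])
front = pareto_front(candidates, PARTIES)
for option in front:
    print(option, buyer(option), seller(option))
```

Brute-force enumeration like this grows exponentially with the number of issues, so it only illustrates the goal. The shortlist in `front` is the kind of output that humans could then read, discuss, and approve.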

More speculatively: Perhaps negotiations via AI could also go through more candidate solutions faster because anything an AI said would have the plausible deniability of being an error. Such that you’d lose less bargaining power if your AI signaled a willingness to consider a proposal that superficially looked bad for you.^[8]

Implications of acausal decision theory [Philosophical/conceptual]

One big area is: the implications of acausal decision theory for our priorities. This is something that I previously wrote about in Implications of ECL (there focusing specifically on evidential cooperation in large worlds).

But to highlight one particular thing: One potential risk that’s highlighted by acausal decision theories is the risk of learning too much information. This is discussed in Daniel Kokotajlo’s The Commitment Races problem, and some related but somewhat distinct risks are discussed in my post When does EDT seek evidence about correlations? I’m interested in further results about how big of a problem this could be in practice. If we get an intelligence explosion anytime soon, then our knowledge about distant civilizations could expand quickly. Before that happens, it could be wise to understand what sort of information we should be happy to learn as soon as possible vs. what information we should take certain precautions about.

Updateless Decision Theory, as first described here, takes some steps towards solving that problem but is far from having succeeded. See e.g. UDT shows that decision theory is more confusing than ever for a description of remaining puzzles. (And e.g. open-minded updatelessness for a candidate direction to improve upon it).

End

That’s all I have on this topic! As a reminder: it's very incomplete. But if you're interested in working on projects like this, please feel free to get in touch.

Other posts in series: Introduction, governance during explosive growth, epistemics, sentience and rights of digital minds.

^[1] Possibly assisted by aligned AIs or tool AIs.
^[2] Maybe some mild desire for retribution (in a way that discourages bad behavior while still being de-escalatory) could be acceptable, or even good. But we would at least want to avoid extreme forms of spite.
^[3] Sufficiently strong versions of this could also drastically reduce motivations to overthrow humans. At least if we’ve done an ok job at promising and demonstrating that we’ll treat digital minds well.
^[4] This path also carries a higher risk of near-miss scenarios.
^[5] Which I mainly care about because it might let us influence misaligned models. But in principle, it’s also possible that we could get intent-alignment via other means, but that we were still happy to have done this research because it lets us influence other properties of the model. But the path-to-impact there is more complicated, because it requires an explanation for why the people who the AI is aligned to aren’t able or willing to elicit that behavior just by asking/training for it. (Yet are willing to implement the training methodology that indirectly favors that behavior.)
^[6] And if we’re specifically looking for ways to affect properties in worlds where alignment fails, then we’re conditioning on being in a world where the simplest “baseline” solutions (such as fine-tuning for good behavior) failed. Accordingly, we should be more pessimistic about simple solutions.
^[7] Possibly via modifying a model that is “playing the training game” to better recognise that it’s being evaluated and to notice what the desired behavior is.
^[8] Also: If there was some information that you wanted to be part of AI bargaining, but that you didn’t want to be communicated to the humans on the other side, you could potentially delete large parts of the record and only keep certain circumscribed conclusions.
