Some approaches to AI alignment incur upfront costs to the creator (an “alignment tax”). In this post, I discuss “alignment windfalls” which are strategies that tend towards the long-term public good at the same time as reaping short-term benefits for a company.

My argument, in short:

  1. Just as there are alignment taxes, there are alignment windfalls.
  2. AI companies optimise within their known landscape of alignment taxes & windfalls.
  3. We can change what AI companies do by:
    1. Shaping the landscape of taxes and windfalls
    2. Shaping their knowledge of that landscape
  4. By discovering and advocating for alignment windfalls, we reduce AI risk overall because it becomes easier for companies to adopt more alignable approaches.

Alignment taxes

An “alignment tax” refers to the reduced performance, increased expense, or elongated timeline required to develop and deploy an aligned system compared to a merely useful one.

More specifically, let’s say an alignment tax is an investment that a company expects to help with alignment of transformative AI that has a net negative impact on the company's bottom line over the next 3-12 months.

A few examples, from most to least concrete:

  • Adversarial robustness: For vision models, and maybe deep learning models in general, there seems to be a trade-off between adversarial robustness and in-distribution performance.[1] Effort can be invested in improving either metric, and adversarial training often requires larger training sets and more complex implementations. Simply put: making your model behave more predictably in unexpected scenarios can lead to it performing worse in everyday circumstances.
  • Robustness to distributional shift: Similarly, it’s easiest to develop and deploy systems that assume that the future will be like the past and present. For example, you don’t need to detect sleeper agents if you only care about the training distribution.
  • Safe exploration: Once AI systems are interacting with the real world, adding additional controls that prevent unsafe or unethical behaviours is more costly than simply allowing unlimited exploration.
  • Avoiding power-seeking: AI systems that engage in power-seeking behaviour—acquiring resources, colluding with other AI systems—may be more economically valuable in the short term, at the expense of long-term control.

All of these require more investment than a less-aligned baseline, and companies will face hard decisions about which to pursue.
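To make the first trade-off concrete, here is a toy sketch (an illustration, not a real benchmark; everything in it is my own construction, assuming nothing beyond NumPy) of adversarial training on a linear classifier. The update mixes a clean-loss gradient with a gradient computed on FGSM-perturbed inputs, so optimisation pressure is explicitly split between in-distribution fit and robustness:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(w, X, y):
    # Gradient of the mean logistic loss with respect to the weights.
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y)

def fgsm(w, X, y, eps):
    # Fast Gradient Sign Method: nudge each input in the direction that
    # increases its loss, bounded by eps in the infinity norm.
    p = sigmoid(X @ w)
    return X + eps * np.sign(np.outer(p - y, w))

# Toy, linearly separable data: the label is exactly sign(x0 + x1).
X = rng.normal(size=(200, 2)) + np.where(rng.random(200) < 0.5, 1.0, -1.0)[:, None]
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def train(adv_weight, steps=500, lr=0.5, eps=0.3):
    # adv_weight controls how much of each update targets robustness
    # rather than in-distribution fit.
    w = np.zeros(2)
    for _ in range(steps):
        g_clean = loss_grad(w, X, y)
        g_adv = loss_grad(w, fgsm(w, X, y, eps), y)
        w -= lr * ((1 - adv_weight) * g_clean + adv_weight * g_adv)
    return w

for adv_weight in (0.0, 0.5):
    w = train(adv_weight)
    clean_acc = np.mean((sigmoid(X @ w) > 0.5) == y)
    adv_acc = np.mean((sigmoid(fgsm(w, X, y, 0.3) @ w) > 0.5) == y)
    print(f"adv_weight={adv_weight}: clean acc {clean_acc:.2f}, adversarial acc {adv_acc:.2f}")
```

Every unit of gradient spent on the adversarial term is a unit not spent on the clean objective; in large vision models this same tension shows up at far greater cost in data and engineering.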

Alignment windfalls

On the other hand, there are some ideas and businesses where progress on AI safety is intrinsically linked to value creation. 

More specifically, let’s say an alignment windfall is an investment that a company expects to help with alignment of transformative AI that also has a net positive impact on the company's bottom line over the next 3-12 months.

For example:

  • Reinforcement Learning from Human Feedback. RLHF comes with significant upfront costs[2], but it easily pays for itself by producing a more valuable model. Nathan Labenz was one of the volunteers who assessed a “purely helpful” version of GPT-4, and found it was “totally amoral”. This early version of the model would have been a liability for OpenAI (at one point suggesting targeted assassination as the best way to slow down AI progress). The version which was made public has been carefully tuned with RLHF to avoid such behaviour. Anthropic proposed that an aligned ML system should be helpful, honest, and harmless: RLHF can make Pareto improvements on these three axes, making models both more valuable and safer.[3]
  • Factored cognition when epistemics matter. High-quality thinking is often as much about the process followed as it is about the result produced. By setting up AI systems to follow systematic, transparent processes we improve the quality of thought, giving the user a richer experience as well as making the system safer. I think this is an underrated example and will discuss it in depth later in the post.
  • Interpretability tools. Interpretability (our ability to introspect on and understand the workings of a neural network) is widely held to be a key component of AI safety (e.g. 1, 2, 3). It can help spot inner misalignment and deception, aid design and debugging, and much more. Companies founded on productising interpretability are incentivised to make their tools' output as truthful and useful as possible, and that increased visibility naturally creates value for the business.[4]
  • Eric Ho has a great list of for-profit AI alignment ideas that, besides interpretability tools, also includes software and services for testing, red-teaming, evals, cybersecurity, and high-quality data labelling. My thinking diverges from his slightly, in that many of these ideas are ancillary tooling around AI systems, rather than the systems themselves. In this way, they don’t hook directly into the inner loop of market-driven value creation, and might have a smaller impact as a result.

In practice, almost all ideas will have some costs and some benefits: finding ways to shape the economic environment so that they look more like windfalls is key to getting them implemented.

Companies as optimisers

Startup companies are among the best machines we've invented to create economic value through technological efficiency.

Two drivers behind why startups create such an outsized economic impact are:

  1. Lots of shots on goal. The vast majority of startups fail: perhaps 90% die completely and only 1.5% get to a solid outcome. As a sector, startups take a scattergun approach: each individual company is likely doomed, but the outsized upside for the lucky few means that many optimists are still willing to give it a go.
  2. Risk-taking behaviour. Startups thrive in legal and normative grey areas, where larger companies are constrained by their brand reputation, partnerships, or lack of appetite for regulatory risk.

In this way, the startup sector is effectively searching our legal and ethical landscape for unexploited money-making ideas. Startups look for the most efficient way to create value, and they're willing to take risks to do so.

This optimisation pressure will be especially strong for artificial intelligence, because the upside for organisations leading the AGI race is gigantic. For startups and incumbents alike, there is incredible countervailing pressure on anything standing in the way: alignment taxes start to look a lot like inefficiencies to be eliminated, and alignment windfalls become very appealing.

Shaping the landscape

The technical approaches which lead to taxes and windfalls lie on a landscape that can be shaped in a few ways:

Regulation can levy taxes on unaligned approaches:

  • Causes like environmental protection and consumer safety made progress when governments decreed that companies must absorb additional costs in order to protect the public from pollution and unsafe products respectively. Regulation made it a smart economic decision for these companies to align better with the needs of the public.
  • Regulation is fast becoming a reason to pay some alignment tax for AI too. For example, red-teaming will soon become required for companies building foundation models due to the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.

Public awareness can cause windfalls for aligned approaches:

  • While regulation is often used to set a minimum standard that companies must meet on prosocial measures, marketing strategies can push beyond this and incentivise excellence. For decades, Volvo built immense brand equity around the safety features in their cars with adverts demonstrating how crumple zones, airbags, and side impact protection would protect you and your family.
  • In AI, the marketing dynamic remains centred around avoiding PR disasters, rather than aiming for brand differentiation. Infamously, Alphabet briefly lost $100 billion of value after its splashy launch adverts included a factual error. Hopefully, AI companies “do a Volvo” and compete on safety in the future, but there probably isn't enough public awareness of the issues for this to make sense yet.
  • On the other hand, in a competitive world with many players, public awareness may not be enough to get to exceptionally high standards, whereas regulation could enforce it across the board.

Recruiting top talent is easier for safety-oriented companies:

  • Given two similar employment choices, most candidates will favour the option that better aligns with their own moral code. Good examples can be found at mission-driven non-profit organisations, staffed by people who could often find much better-compensated work at for-profit orgs, but who find significant value in aiding causes they believe in.
  • As for AI, the competition for talent is extraordinarily fierce. Panglossian AI optimists might not care about the safety stance of the organisation they work at, but most people would pay some heed. Therefore, AI companies that prioritise alignment—even if it comes with a tax—can boost their ability to hire the best people and remain competitive.

In practice, there is often interplay between these. For example, regulation can help the public appreciate the dangers of products, and a positive public company profile is a boon for recruiting.

Companies greedily optimise within the known landscape

An important nuance with the above model is that companies don’t optimise within the true landscape: they optimise within the landscape they can access. Founders don’t know the truth about where all the windfalls lie: they have to choose what to build based on the set of possible approaches within reach.

Here are a couple of reasons why the full landscape tends to be poorly known to AI startup founders:

  • Startups are myopic. Startups tend to run on tight feedback loops. Using quick, iterative release schedules, they excel at pivoting to whatever direction seems most promising for the next month or two. They tend to make lots of local decisions, rather than grand strategies. This isn’t to say that startups are unable to make bold moves, especially in aggregate, but in my experience the vast majority of decisions at the vast majority of startups are incremental improvements and additions to what’s already working.
  • Companies are secretive. If a startup did happen to find an alignment windfall, they would be incentivised to keep it secret for as long as possible. For example, many important details in how GPT-4 and Claude 2 were trained are not public.

In contrast, researchers in academia have much more latitude to explore completely unproven ideas lacking any clear path to practical application—and the expectation is that those results will be published publicly too. Unfortunately, because of the data and compute requirements associated with modern machine learning, it has become hard for aspiring researchers to do many forms of groundbreaking work outside of an industrial setting: they’re embedded in these myopic and secretive organisations.

Shaping knowledge of the landscape

What does it look like to shape the broader knowledge of this landscape?

  • Some of this is technical alignment work: discovering how to build AI systems that are robustly aligned while minimising alignment taxes or even maximising windfalls.
  • Another crucial part is exploring, testing, and promoting ideas for AI companies that exploit alignment windfalls. This is what Elicit is doing, as I’ll explain in the following section.

Factored cognition: an example of an alignment windfall

Let’s consider a more detailed example. Elicit has been exploring one part of the landscape of taxes & windfalls, with a focus on factored cognition.[5]

Factored cognition as a windfall

Since the deep learning revolution, most progress on AI capability has been due to some combination of:

  • More data.
  • More compute.
  • More parameters.

Normally, we do all three at the same time. We have basically thrown more and more raw material at the models, then poked them with RLHF until it seems sufficiently difficult to get them to be obviously dangerous. This is an inherently fragile scheme, and there are strong incentives to cut corners on the “now make it safe” phase.

Factored cognition is an alternative paradigm which offers a different path. Instead of solving harder problems with bigger and bigger models, we decompose the problem into a set of smaller, more tractable problems. Each of these smaller problems is solved independently and their solutions combined to produce a final result. In cases where factored cognition isn't great for generating a result, we can factor a verification process instead. Either way, we aim to keep the component models small and the tasks self-contained and supervisable.
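The decompose/solve/combine loop can be sketched in a few lines. This is a minimal illustration, not Elicit's actual pipeline: the `decompose`, `solve_subtask`, and `combine` helpers are hypothetical, and a lookup table stands in for the small models that would answer each subtask.

```python
def decompose(task):
    """Split a task into smaller, self-contained subtasks.

    A real system would use a model for this; here we use a
    hypothetical rule-based split on the word "and".
    """
    return [part.strip() for part in task.split(" and ")]

def solve_subtask(subtask, model):
    """Answer one small, supervisable subtask with a component model."""
    return model(subtask)

def combine(answers):
    """Merge the independent sub-answers into a final result."""
    return "; ".join(answers)

def factored_answer(task, model):
    return combine(solve_subtask(s, model) for s in decompose(task))

# A stub "model": a lookup table standing in for a small language model.
stub_model = {
    "summarise the methods": "randomised controlled trial, n=120",
    "list the limitations": "small sample, single site",
}.get

print(factored_answer("summarise the methods and list the limitations", stub_model))
# → randomised controlled trial, n=120; small sample, single site
```

Because each subtask is small and its input/output pair is inspectable, a human (or a verifier model) can supervise every step rather than auditing one opaque end-to-end generation.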

How we’ve been exploring factored cognition at Elicit

Elicit, our AI research assistant, is built using factored cognition: we decompose common research tasks into a sequence of steps, using gold standard processes like systematic reviews as a guide.

For our users, accuracy is absolutely crucial. They are pinning their reputation on the claims that they make, and therefore something which merely sounds plausible is nowhere near good enough. We need to earn their trust through solid epistemic foundations.

For Elicit, creating a valuable product is the same thing as building a truthful, transparent system. We don't have some people building an AI tool and also some people figuring out how to make it reliable. Trustworthiness is our value proposition.

Conclusion

Let's find and promote alignment windfalls!

Some of the top AI labs seem to be careful and deliberative actors, but as more competitors enter the race for AGI, the pressure to eliminate anything that slows them down will increase. Future competitors may well be less cautious, and some may explicitly reject safety-related slow-downs.

If this is the world we're heading towards, AI safety measures which impose significant alignment taxes are at risk of being avoided. To improve outcomes in that world, we should discover and promote alignment windfalls, by which I mean mechanisms that harness the awesome efficiency of markets to create aligned AI systems.

I'm a proponent of other approaches—such as regulation—to guide us towards safe AI, but in high-stakes situations like this my mind turns to the Swiss cheese model used to reduce clinical accidents. We shouldn't hope for a panacea, which probably doesn't exist in any case. We need many independent layers of defence, each with their strengths and (hopefully non-overlapping) weaknesses.

In my view, Elicit is the best example of an alignment windfall that we have today. To have maximum impact, we need to show that factored cognition is a powerful approach for building high-stakes ML systems. Elicit will be a compelling existence proof: an example which we hope other people will copy out of their own self-interest and—as such—make AI safer for everyone.

Want to help us?

We are building the best possible team to make this happen—you can see our open roles here!

Many thanks to @jungofthewon, Étienne Fortier-Dubois, @Owain_Evans, and @brachbach for comments on an early draft.


  1. ^
  2. ^

    Jan Leike estimated RLHF accounted for 5–20% of the total cost of GPT-3 Instruct, factoring in things like engineering work, hiring labelers, acquiring compute, and researcher effort.

  3. ^

    Of course, it’s far from a complete solution to alignment and might be more akin to putting lipstick on a shoggoth than deeply rewiring the model’s concepts and capabilities.

  4. ^

    But also: Some have noted that interpretability could be harmful if our increased understanding of the internals of neural networks leads to capability gains. Mechanistic interpretability would give us a tight feedback loop with which to better direct our search for algorithmic and training setup improvements.

  5. ^

    We expect to contribute to reducing AI risk by improving epistemics, but highlighting alignment windfalls from factored cognition is important to us as well.

Comments

Another potential windfall I just thought of: the kind of AI scientist system discussed by Bengio in this talk (older writeup). The idea is to build a non-agentic system that uses foundation models and amortized Bayesian inference to create and do inference on compositional and interpretable world models. One way this would be used is for high-quality estimates of p(harm|action) in the context of online monitoring of AI systems, but if it could work it would likely have other profitable use cases as well.