How harmful are improvements in AI? + Poll

A bit of clarification about EleutherAI's stance: the compressed version of our argument is that a) for a variety of reasons, including the fact that our models are far behind the frontier, we believe that the AGI capabilities contribution of our release is very small, and b) we believe that there’s a significant chance that alignment research that has any meaningful chance of generalizing to AGI requires access to large language models. I would say that the best case outcome of our work would be if research using our models results in some novel alignment techniques that scale to superhuman LLM-based AGI.

Our full argument is pretty nuanced and it's hard to do justice to it in a few sentences, so I recommend reading the alignment section of the recent NeoX 20B paper, which outlines some of these arguments (and especially which concrete directions in particular we're interested in) in far more detail.

Marius Hobbhahn (4y)

Thanks for the clarification!

tilmanr (4y)

Thank you for giving more context to EleutherAI's stance on acceleration and linking to your newest paper.

I support the claim that your open model contributes to AI safety research, and I generally agree with the improvements for the alignment landscape. I can also understand why you are not detailing possible failure modes of releasing LLMs, as this would basically be stating a bunch of infohazards.
But at least for me, this opens the space for discussing up to what point previously closed models should be opened up for the sake of alignment research. If an aligned researcher can benefit from access, so could a non-aligned researcher, hence the "accidental acceleration."

Peter Hroššo (4y)

Power corrupts, so I don't think view number 3 (Gaining control) is likely to help.

Poll

A big clarification

We know that different people and institutions within the EA space hold different beliefs regarding the issue of AI acceleration. Our aim is NOT to single out and criticize them but rather to have a civil discussion on a subject that seems pretty crucial to AI alignment.
We don’t have an inside view into the decision-making that led to the foundation of different institutions. Still, we are optimistic that with multiple founders, grantmakers, and advisors involved, a lot of time went into thinking about the possible downsides of AI acceleration.

Definition

We think of AI alignment as the process of aligning the goals of artificial systems with the broader goals of humanity.

We loosely define AI acceleration as everything that increases the pace of AI development/capabilities, i.e., everything that shortens transformative AI (TAI) timelines. In this post, we primarily look at AI acceleration that comes as a byproduct of alignment efforts. Examples include accidentally developing more efficient training techniques while doing alignment research, or actively improving the state of the art but only giving access to aligned actors.

We will use the term “non-aligned actor” for institutions/researchers who are not directly or indirectly concerned with AI safety. There is no clear threshold at which an institution becomes aligned/non-aligned, but one example of a rather non-aligned actor is NVIDIA. Even though one of their research areas is concerned with safety, we assume that this is not the main focus of their research.

Different views:

We broadly identified four escalating views on AI acceleration.

1. No acceleration

AGI might be terrible. Therefore, anything that accelerates development is bad as long as we do not provide sufficient answers to questions regarding safety. (post, discussion)

2. Accepting side effects

To make relevant AI systems safe, we need to work with the state of the art, which can sometimes lead to minimal acceleration as a side effect. For example, when working with LLMs, it is plausible that you sometimes solve problems that other institutions have not encountered before. This might be specific knowledge about prompt engineering or about how to set up GPU clusters efficiently. In the best case, aligned models show state-of-the-art performance, and everyone adopts them. (post reflecting that view)

3. Gaining control

By controlling relevant knowledge about AI algorithms or relevant architecture, aligned actors could control who gets access and thus decrease the risk of misalignment. In the best case, aligned actors have sufficient power to decide who gets access to the most capable models or compute. (artificial question reflecting that view)

4. Accelerating everything

“AI is likely good. The sooner, the better.” If you think we already have good answers to all alignment questions or are confident that we will find them in time, then acceleration is good. However, since there are still a lot of open questions, the vast majority of EAs don’t hold this view.

How to read these views

We think these views can be best imagined as a spectrum reaching from no-acceleration to total-acceleration. Although it is hard to sort most actors on this spectrum, it is safe to assume that most non-aligned AI labs fall somewhere around 3 to 4 in pursuing progress in AI. Furthermore, we assume most AI labs that are more aware of the safety concerns regarding AI mostly fall around 2, with some labs falling notably on 1.

Dependencies

The following factors can lead to different views. They are not supposed to be an exhaustive list - we welcome additional suggestions in the comments.

How hard is alignment?

If you think that alignment is a very hard problem and we should have as much time as possible, view 1 becomes more plausible. As a prominent example, Yudkowsky thinks it is very difficult. If you think alignment might turn out to be easier than anticipated or might get easier with larger models, views 2 and 3 become more plausible.

How much AI acceleration can AI safety researchers create counterfactually?

The AI safety community is tiny compared to conventional AI research in terms of people and money. To get a simple estimate of the share of AI research that focuses on safety, we compare the number of papers released. We estimate that about 1 in 100 papers focuses on safety (see appendix). However, the AI safety community might be especially interested in and focused on TAI/AGI and, therefore, disproportionately likely to create acceleration. The more you think that AI safety researchers can make a meaningful difference to the overall progress of AI, the more plausible views 1 and 3 become compared to 2.

How probable is accidental acceleration?

Assume an aligned organization advances the state of the art. They have good intentions and only share their methods with other aligned organizations. If everything goes well, this does not increase acceleration by unaligned actors. However, the information might leak. Reaching a new state of the art could also accelerate other research, e.g., if you make better hardware, NVIDIA might accelerate because feasibility was shown. The same premise holds for pure safety research: MIRI decided to go ‘nondisclosed-by-default’ due to the possibility of acceleration through their results. Eleuther.ai also discusses the issue, focusing on a large language model (LLM). While they do not seem to aim to advance the state of the art, they want to open up research on LLMs by releasing a model similar to GPT-3 to make safety research on LLMs possible. Furthermore, they claim that most of the damage done by GPT-3 came from showing feasibility. The results of state-of-the-art research can therefore be considered infohazards. The more probable you find accidental acceleration, the more you should favor view 1 over 2 and especially 3.

Which kind of models will be adopted by a broader audience?

It is plausible that most actors care less about safety or alignment than about performance. The higher the performance of aligned algorithms compared to unaligned ones, the likelier adoption by the broader public becomes. The extra effort of aligning an algorithm, i.e., of creating an aligned version align(x) that performs as well as or better than an unaligned baseline such as DQN(x), is referred to as the alignment tax. It might be the case that there is a trade-off between capabilities and alignment, and that aligned models therefore can’t ever reach state-of-the-art performance, as discussed, e.g., in A dilemma for prosaic AI. The more plausible you find it that aligned models can match unaligned ones in performance and thus be widely adopted, the more you should favor view 2.

How much pressure can EAs create on other players to be more aligned?

Assume aligned organizations control state-of-the-art technology, e.g. a large language-model API or compute. In a high-pressure scenario, other companies adapt their mission to be aligned in order to get access. In a low-pressure scenario, other companies ignore the aligned organization, and nothing happens. The more pressure aligned organizations can create, the more you should favor view 3.

How reckless are non-aligned actors?

There are different possible scenarios of how reckless non-aligned actors will be with AI systems. You might think that they will care about safety due to the large possible negative consequences of unaligned AI. On the other hand, non-aligned actors might not care about safety due to profit incentives, lack of understanding, or external pressures. The more recklessly you think non-aligned actors will behave, the more strongly you should favor controlling access to AI tech, i.e. view 3 over 1 and 2.

What is your AI timeline?

Different timelines change the research we should focus on. Short timelines until TAI should favor view 1, since every tiny bit of acceleration results in an important loss of time. Longer timelines favor views 2 and 3 since there is more time for unaligned actors to do harm with increasingly powerful systems. Also, if you think we are far from AGI, you might see no need to decrease acceleration and favor view 4.
This is strongly connected to the question of when to contribute. Maybe the best time to contribute to AI safety is during the years immediately before the development of AGI. Then there is no reason to stop acceleration, at least until AGI is in plain sight and the field of AI safety is clearer. Buck claims a lot of safety researchers endorse this view. If you sympathize with this claim, at least for now, you could favor view 3 or even 4.

Who leads the race?

We have little expertise in other countries’ AI policies but assume that some of them are less well-meaning actors than most Western countries. So, for example, we think it is more likely that China would use an AI to gain an advantage even if everyone else is worse off. The higher you think the probability of unstable actors developing the first TAI is, the more you should favor view 3 to control and prevent them.

What type of research?

The kind of research that an organization does matters as well. If they have a very clear and plausible theory of change for how their acceleration leads to relevant increases in alignment, views 2 and 3 might be more convincing. On the other hand, if their theory is “let’s accelerate and see what happens”, that’s probably bad and strengthens view 1. However, we think that even controlled acceleration just for the sake of testing safety-relevant techniques on more powerful models is already rather dangerous and is similar to gain-of-function research in other fields such as biology. Therefore, all the risks of accidental releases can be translated into the AI domain.

How powerful is TAI?

Predictions about AI systems range from “better than humans but not by much” to “insanely powerful”. The more powerful AI is, the higher the stakes are. In a world where we have no solution to the alignment problem, this favors view 1 since we want to minimize the chance of powerful unaligned actors with potentially significant negative impacts. In a world where all open questions of the alignment problem have been answered, this favors view 4 since we should get to good outcomes as fast as possible.

Who is attracting talent?

If an organization is very public about its successes, it is more likely to attract talented people. This is especially useful if you want to redirect a lot of researchers’ attention to relevant/pressing issues. It also holds if you want to shift attention at some point in the future, see the crunch-time for AI safety. If you think this is a valuable method, you should favor view 2 or 3.

Potential Biases

Effective Altruists are unfortunately not immune to biases, some of which we want to highlight. However, we think that the founding process, discussions with other EAs, and input from funders should mitigate these biases a lot. Therefore, we think these biases are much less important than the considerations above.

Money: There is a lot of money in AI. EAs are not immune to the desire to be rich. However, most money comes from improving the state-of-the-art and much less from safety work - at least for now.
Rationalization: It’s easy to say that “someone else would have found the new technique if I didn’t” and it’s hard or impossible to evaluate this counterfactual.
Tired of being pessimistic about the future: Some EAs might be sick of being the bearer of bad news all the time. The message “I think AI is good, we just need to make it safe” gives you much more pleasurable human interactions than “You’re gonna kill us all. Stop doing the thing that might make you rich”.

Conclusion

Our main goal with this post is to highlight different considerations on AI acceleration in AI safety research. Broadly, we think that view 1 (no-acceleration) would be the default if everyone in AI worked on alignment. Views 2 (side-effects) and 3 (control) come from the interaction with non-aligned actors, e.g. since other actors continually advance the state of the art, the safety community sometimes has to make decisions that could accelerate AI.

We don’t think any of the first three views from above are obviously superior. However, we believe it’s entirely plausible that some considerations might completely dominate others when investigated in more detail.

Appendix

How big is AI safety compared to the rest of the field?

We use the arXiv search engine to estimate this and search for papers with specific keywords. As a set of keywords, we used the suggested set from this Metaculus question and applied it to the year 2021, which gave 533 results. A general query of “Machine Learning”, “Artificial Intelligence” gives 35,000 results, and additional keywords can easily bump this number up to as high as 47k results. Also, note that these numbers seem to increase slightly over time, maybe because people add keywords to some papers.

We are aware that this is not the best method for estimating the share of safety to non-safety work (e.g. compared to estimating funding or estimating employees) but this might be a good starting point for someone to explore these questions further.
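The "about 1 in 100" estimate can be reproduced from the counts above with a few lines of arithmetic. The snippet below is only a rough sketch: the variable and function names are ours, and it assumes that the 533 safety results are a subset of the results of the general query, which the search itself does not guarantee.

```python
# Counts reported in this appendix (arXiv keyword searches for 2021).
safety_papers = 533           # results for the safety keyword set
general_papers_low = 35_000   # general "Machine Learning", "Artificial Intelligence" query
general_papers_high = 47_000  # general query with additional keywords


def safety_share(safety: int, total: int) -> float:
    """Fraction of all papers that match the safety keywords.

    Assumption: the safety papers are contained in the general result set.
    """
    return safety / total


# A smaller denominator gives the larger share, and vice versa.
share_upper = safety_share(safety_papers, general_papers_low)
share_lower = safety_share(safety_papers, general_papers_high)

print(f"Safety share: {share_lower:.1%} to {share_upper:.1%}")
print(f"Roughly 1 in {round(1 / share_upper)} to 1 in {round(1 / share_lower)} papers")
```

With these counts the share comes out between roughly 1.1% and 1.5%, which is the same order of magnitude as the "1 in 100" figure used in the main text.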

A dilemma for prosaic AI alignment (link)

By Rohin: “If we try to train an AI system directly using such a scheme, it will likely be uncompetitive, since it seems likely that the most powerful AI systems will probably require cutting-edge algorithms, architectures, objectives, and environments, at least some of which will be replaced by new versions from the safety scheme. Alternatively, we could first train a general AI system, and then use our alignment scheme to finetune it into an aligned AI system. However, this runs the risk that the initial training could create a misaligned mesa optimizer, that then deliberately sabotages our finetuning efforts.”

EleutherAI's take on their LLM

They state: “Most (>99%) of the damage of GPT-3’s release was done the moment the paper was published”. Also, Connor has written a lot about it, so you can check that out.

Thanks to Max Räuker, Anson Ho, and Jasper Götting for their valuable discussion and feedback on this post.