A descriptive, not prescriptive, overview of current AI Alignment Research

Jan; Logan Riggs; jacquesthibs; janus

6th Jun 2022

TL;DR: In this project, we collected and cataloged AI alignment research literature and analyzed the resulting dataset in an unbiased way to identify major research directions. We found that the field is growing quickly, with several subfields emerging in parallel. We looked at the subfields and identified the prominent researchers, recurring topics, and different modes of communication in each. Furthermore, we found that a classifier trained on AI alignment research articles can detect relevant articles that we did not originally include in the dataset.

(video presentation here)

Dataset Announcement

In the context of the 6th AISC, we collected a dataset of alignment research articles from a variety of different sources. This dataset is now available for download here and the code for reproducing the scrape is on GitHub here^[1]. When using the dataset, please cite our manuscript as described in the footnote^[2].

Table 1: **Different sources of text included in the dataset alongside the number of articles per source.** Color of row indicates that data was analyzed as AI alignment research articles (green) or baseline (gray), or that the articles were added to the dataset as a result of the analysis in Fig. 4 (purple). Definition of level-0 and level-1 articles in Fig. 4c. For details about our collection procedure see the Methods section.

Here follows an abbreviated version of the full manuscript, which contains additional analysis and discussion.

Rapid growth of AI Alignment research from 2012 to 2022 across two platforms

After collecting the dataset, we analyzed the two largest non-redundant sources of articles, Alignment Forum (AF) and arXiv. We found rapid growth in publications on the AF (Fig. 1a) and a long-tailed distribution of articles per researcher (Fig. 1b) and researchers per article (Fig. 1c). We were surprised to find a decrease in publications on the arXiv in recent years, but identified the cause for the decrease as spurious and fixed the issue in the published dataset (details in Fig. 4).

Figure 1: **Alignment Research across a community forum and a preprint server.** (a) Number of articles published as a function of time on the alignment forum (AF; purple) and the arXiv preprint server (arXiv; green). (b) Histogram of the number of articles per researcher published on either AF or arXiv. Inset shows names of six researchers with more than 60 articles. Note the logarithmic y-axis. (c) Histogram of the number of researchers per article on AF (purple) and arXiv (green). Note the logarithmic y-axis.

Unsupervised decomposition of AI Alignment research into distinct clusters

Given access to this unique dataset, we were curious to see if we could identify distinct clusters of research. We mapped the title + abstract of each article into vector form using the Allen Institute for AI's SPECTER model and reduced the dimensionality of the embedding with UMAP (Fig. 2a). The resulting manifold shows a continuum of AF posts and arXiv articles (Fig. 2b) and a temporal gradient from the top right to the bottom left (Fig. 2c). Using k-means and the elbow method, we obtain five clusters of research articles that map onto distinct regions of the UMAP projection (Fig. 2d).
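As a rough illustration of the clustering step, the following sketch runs k-means for several values of k and records the within-cluster sum of squares that the elbow method inspects. It is not the code used for the analysis: synthetic points stand in for the SPECTER embeddings, and the `kmeans` helper, its farthest-point initialisation, and all parameter values are assumptions made for this example. The UMAP projection is omitted.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialisation (an assumption of this sketch):
    # start from a random point, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Squared distance of every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated

    # Final assignment and within-cluster sum of squares (inertia).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = float(dists[np.arange(len(X)), labels].sum())
    return labels, centroids, inertia


# Synthetic stand-in for article embeddings: three well-separated blobs.
rng = np.random.default_rng(42)
dim = 16
centers = 20.0 * np.eye(dim)[:3]
X = np.vstack([c + rng.normal(size=(100, dim)) for c in centers])

# Elbow method: inspect inertia as a function of k and pick the bend.
inertias = {k: kmeans(X, k)[2] for k in range(1, 7)}
labels, _, _ = kmeans(X, 3)
```

Plotting `inertias` against k on this toy data shows a sharp drop up to k=3 and little improvement afterwards; on real embeddings the bend is usually less clear-cut and needs a judgment call.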

Figure 2: **Dimensionality reduction and unsupervised clustering of alignment research.** (a) Schematic of the embedding and dimensionality reduction. After concatenating title and abstract of articles, we embed the resulting string with the Allen SPECTER model [40], and then perform UMAP dimensionality reduction with n_neighbors=250. (b) UMAP embedding of articles with color indicating the source (AF, purple; arXiv, green). (c) UMAP embedding of articles with color indicating date of publication. Arrows superimposed to indicate direction of temporal evolution. (d) UMAP embedding of articles with color indicating cluster membership as determined with k-means (k=5). Inset shows sum of residuals as a function of the number of clusters k, with an arrow highlighting the chosen number of clusters.

We were curious to see if the five clusters identified by k-means map onto existing distinctions in the field. When identifying the most prolific authors in each cluster, we noticed strong differences^[3] (consistent with previous work that suggests that author identity is an important indicator of research direction).

Table 2: **Researchers with the highest number of articles per cluster.** Clusters as determined in Fig. 2, with number of articles per cluster 𝑁. Number in brackets behind researcher name indicates number of articles published by that researcher. Note: "Diffractor" is an undisclosed pseudonym.

By skimming articles in each cluster and given the typical research published by the authors, we suggest the following putative descriptions of each cluster:

cluster one: Agent alignment is concerned with the problem of aligning agentic systems, i.e. those where an AI performs actions in an environment and is typically trained via reinforcement learning.
cluster two: Alignment foundations research is concerned with deconfusion research, i.e. the task of establishing formal and robust conceptual foundations for current and future AI Alignment research.
cluster three: Tool alignment is concerned with the problem of aligning non-agentic (tool) systems, i.e. those where an AI transforms a given input into an output. The current, prototypical example of tool AIs is the "large language model".
cluster four: AI governance is concerned with how humanity can best navigate the transition to advanced AI systems. This includes focusing on the political, economic, military, governance, and ethical dimensions.
cluster five: Value alignment is concerned with understanding and extracting human preferences and designing methods that stop AI systems from acting against these preferences.

We note that these descriptions are chosen to be descriptive, not prescriptive. Our approach has the advantage of being (comparatively^[4]) unbiased and can therefore serve as a baseline against which other (more prescriptive) descriptions of the landscape can be compared (Krakovna's paradigms, FLI landscape, Christiano's landscape, Nanda's overview, ...). Discrepancies between these descriptions and ours can serve as important information for funding agencies (to identify neglected areas) and AI Governance researchers (for early identification of natural categories for regulation).

Research dynamics vary across the identified clusters

We further note some properties of the identified clusters (Fig. 3a). The cluster labeled as "alignment foundations" contains most of the seminal work in the field (Fig. 3b,c), but remains largely disconnected from the more applied "agent alignment" and "tool alignment" research (Fig. 3a). Furthermore, most "alignment foundations" work is published on the Alignment Forum (Fig. 3d) and it has the largest inequality in terms of "number of articles per researcher" (Fig. 3e). This corroborates an observation that was made before: While critically important, alignment foundations research appears to be poorly integrated into more applied alignment research, and the research remains insular and pushed by comparatively few researchers.
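To make the inequality measure concrete, here is a minimal sketch of a Gini coefficient computed over articles-per-researcher counts. The `gini` helper and the two example count lists are hypothetical and chosen for illustration; they are not taken from the dataset.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly equal)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values with 1-based ranks i:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical articles-per-researcher counts for two clusters.
even_cluster = [3, 3, 3, 3]      # every researcher writes the same amount
skewed_cluster = [1, 1, 1, 21]   # one researcher writes most articles

print(gini(even_cluster))    # 0.0
print(gini(skewed_cluster))  # 0.625
```

A cluster in which a few researchers write most of the articles yields a value closer to 1.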

Figure 3: **Characteristics of research clusters corroborate potential usefulness of decomposition.** (a) UMAP embedding of articles with color indicating cluster membership as in Fig. 2d. Labels assigned to each cluster are putative descriptions of a common research focus across articles in the cluster. (b) Number of articles published per year, colored by cluster membership. (c) Fraction of articles published by cluster membership as a function of time. (d) Fraction of articles from AF or arXiv as a function of cluster membership. (e) Gini inequality coefficient of articles per researcher as a function of article cluster membership.

Leveraging dataset to train an AI alignment research classifier

After having identified the five clusters, we returned to the issue we noted at the outset of our analysis: the apparent decrease in publications on the arXiv in recent years (Fig. 1a). We were skeptical about this and hypothesized that our data collection might have missed relevant recent articles^[5]. Therefore, we trained a logistic regression classifier to distinguish alignment articles (level-0) from articles cited by alignment articles (level-1) (Fig. 4a). The resulting classifier achieved good performance and generalized well to papers from unrelated sources (Fig. 4b). We then scraped all the articles from the arXiv cs.AI category and asked our classifier to score them (Fig. 4c,d). Based on the distribution of scores of Alignment Forum posts (Fig. 4d) and after skimming the relevant articles, we chose a threshold of 75% as a reasonable trade-off between false positives and false negatives.

When adding the arXiv articles above the cutoff to our dataset, we observed a rapid increase in publications also on the arXiv (Fig. 4e). To test if our clustering is robust to this increase, we repeated the UMAP projection with the updated dataset and found that, indeed, the clusters are still in distinct regions of the manifold (Fig. 4f). Interestingly, the added literature appears to fill some of the gaps between "alignment foundations" and "agent alignment" research.
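The filtering procedure can be sketched roughly as follows. This is a toy reconstruction under stated assumptions, not the code used for the analysis: the post does not specify the classifier's input features, so random vectors stand in for them, and the `fit_logistic` helper, its hyperparameters, and the synthetic candidates are all hypothetical. Only the 80%/20% train-test split and the 75% cutoff come from the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.1, n_steps=500, l2=1e-3):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b


rng = np.random.default_rng(0)
dim = 8
# Synthetic stand-ins: label 1 = "level-0" (alignment), 0 = "level-1" (cited).
X = np.vstack([rng.normal(+1.0, 1.0, size=(200, dim)),
               rng.normal(-1.0, 1.0, size=(200, dim))])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Shuffle, then train on 80% and hold out 20% for testing.
order = rng.permutation(len(y))
split = int(0.8 * len(y))
train, test = order[:split], order[split:]
w, b = fit_logistic(X[train], y[train])
test_accuracy = float(np.mean((sigmoid(X[test] @ w + b) >= 0.5) == y[test]))

# Score unlabeled candidates and keep those at or above the 75% cutoff.
candidates = rng.normal(0.0, 1.5, size=(50, dim))
scores = sigmoid(candidates @ w + b)
THRESHOLD = 0.75
selected = candidates[scores >= THRESHOLD]
```

Raising `THRESHOLD` trades false positives for false negatives, which is the trade-off behind the choice of cutoff described above.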

Figure 4: **An AI alignment research classifier for filtering new publications.** (a) Top: Illustration of arXiv level-0 articles (alignment research; green) and level-1 articles (cited by alignment research articles; blue). Bottom: Schematic of test-train split (20%-80%) for training of a logistic regression classifier. (b) Fraction of articles as a function of classifier score for arXiv level-0 (green), level-1 (blue), and arXiv articles on quantum physics (grey). (c) Illustration of procedure for filtering arXiv articles. After querying articles from the cs.AI section of arXiv, the logistic regression classifier assigns a score between 0 and 1. (d) Fraction of articles as a function of classifier score for articles from the cs.AI section of arXiv (grey) and AF (purple). Dashed line indicates cutoff for classifying articles as arXiv level-0 (75%). (e) Number of articles published as a function of time on AF (purple) and arXiv (green), according to the cutoff in panel d. (f) Left inset: Original UMAP embedding from Fig. 2d. Right: UMAP embedding of all original articles and updated arXiv articles with color indicating cluster membership as in Fig. 2d or that the article is filtered from the arXiv (gray).

Closing remarks

The primary output from our project is the curated dataset of alignment research articles. We hope the dataset might serve as the basis for

a semantic search service that returns relevant literature (see prototype here).
writing assistants in the form of fine-tuned large language models.
projects to preserve AI Safety research in case of catastrophic events.

If you have other ideas for how to use the dataset, please don't hesitate to reach out to us; we're excited to help.

Furthermore, we hope that the secondary outcome from our project (the analysis in this post) can aid both funding agencies and new researchers entering the field to orient themselves and contextualize the research.

As we plan to continue this line of research, we are happy about any and all feedback on the dataset and the analysis, as well as hints and pointers about things we might have missed.

Acknowledgments: We thank Daniel Clothiaux for help with writing the code and extracting articles. We thank Remmelt Ellen, Adam Shimi, and Arush Tagade for feedback on the research. We thank Chu Chen, Ömer Faruk Şen, Hey, Nihal Mohan Moodbidri, and Trinity Smith for cleaning the audio transcripts.

^[1] We will make some finishing touches on the repository over the next few weeks after this post is published.
^[2] Kirchner, J. H., Smith, L., Thibodeau, J., McDonnell, K., and Reynolds, L. "Understanding AI alignment research: A Systematic Analysis." arXiv preprint arXiv:2206.02841 (2022).
^[3] Except for Stuart Armstrong, who publishes prolifically across all clusters.
^[4] Remaining biases include:
- differences in formatting between arXiv and AF articles that bias the embedding
- some (important) topics might not have any documentation due to infohazards
- by implicitly focusing on number of published articles (rather than e.g. the "volume occupied in semantic space") we bias our analysis in favor of questions that can be written about more easily
^[5] We took the TAI Safety Bibliographic Database from early 2020 as a starting point and manually added relevant articles from other existing bibliographies or based on our judgment. We were very conservative in this step, as we wanted to make sure that our dataset includes as few false positives as possible.

[-]A Ray2yΩ7120

Thanks so much for making this!

I'm hopeful this sort of dataset will grow over time as new sources come about.

In particular, I'd nominate adding MLSN (https://www.alignmentforum.org/posts/R39tGLeETfCZJ4FoE/mlsn-4-many-new-interpretability-papers-virtual-logit) to the list of newsletters in the future.

[-]Ethan Perez2yΩ120

Yes super excited about datasets like this! It might be helpful to also add https://ai-alignment.com/ or https://paulfchristiano.medium.com/ if these aren't already in the data

[-]jacquesthibs2yΩ120

I believe all of those posts can be found on the Alignment Forum so, luckily, they are included in the dataset (at least from what I remember after checking a handful of the posts). I had begun scraping from those sources, but realized they were already on AF halfway through.

[-]Ethan Perez2yΩ110

Cool, that's great!

[-]jacquesthibs2y10

Good idea! I added most of the papers from the previous entries of MLSN. Adding the summaries would be a useful next step. Would be great if someone could keep track of it in a Google Sheet of individual summaries like the Alignment Newsletter (https://docs.google.com/spreadsheets/d/1lJ6431R-E6aioVRd7AN4LQYTj-QhQlUYNRbGDbG5RWY/edit?usp=sharing).

I was also considering adding distillations as a key as well. For example, adding ELK distillations to the ELK report entry.

[-]Charlie Steiner2yΩ240

Neat! Hey, did I have any bonus papers for you?

[-]jacquesthibs2y80

I saw that earlier! Please keep up the great work. We have a Google Sheet with the bibliography (or at least arXiv url) of a bunch of papers. I can add those manually, but it would be great if they were stored in Zotero (preferably from arXiv) or something similar; we’d love to export it and add it to our list. And for anyone else reading this, sending us a bibliography like TAI Safety (https://www.lesswrong.com/posts/4DegbDJJiMX2b3EKm/tai-safety-bibliographic-database) would be a great way to contribute! :)

Here’s the (public) Google Sheet of the arXiv papers we used (there are duplicates but we remove duplicates during extraction): https://docs.google.com/spreadsheets/d/1jh5VbDWqNZiB5VUM4MW-yDRhZVGa19OsC2wdI6MwUZk/edit

We focused our search on alignment papers more specifically, but we are also discussing branching out to other domains with papers that could be relevant for alignment.

[-]Charlie Steiner2y20

Does this link work? https://www.zotero.org/charliesteiner/collections/XFVERRTP

[-]jacquesthibs2y10

Unfortunately, I am denied access. Send me a private message with a Google Drive link to the exported Zotero RDF and I'll import it into my Zotero library. I think you can create a group in Zotero, but it is very limited in storage (need to pay to load a decent amount of entries).

[-]Charlie Steiner2y20

Figuring out zotero, then I'll send you a link.

[-][anonymous]2y30

This is very cool! For archiving and rebuilding after a global catastrophe, how easy would this be to port to Kiwix for reading on a phone? My thinking is that if a few hundred LWers/EAs have this offline on their phones, that could go quite a long way. Burying phones with it on could also be good as a low hanging fruit (ideally you need a way of reading the data to be stored with the data). Happy to fund this if anyone wants to do it.

[-]npit2y30

Very cool findings! Thanks for sharing the dataset, too.

[-]jacopo2y20

Cool work!

Can I ask a couple of questions about the DR+clustering approach?

If I understand correctly, you do the clustering in a 2D space obtained with UMAP (ignore this if I am wrong). Are you sure you are not losing important information with such a low dimension? I say this because you show that one dimension is strongly correlated with style (academic vs forum/blog) and the second may be somewhat correlated with time. I remember that an argument exists for using n-1 dimensions when looking for n clusters, although that was probably using linear DR techniques and might not apply to UMAP. But it would be interesting to check if using higher n_components (3 to 5) results in the same clustering or generates some new insight.

Another thing you could check is using GMM instead of k-means. My (limited) experience is that if the embedding dimension is low you get better results this way. But, again, I was clustering downstream of linear DR.

[-]Jan2y30

Thank you for the comment and the questions! :)

This is not clear from how we wrote the paper but we actually do the clustering in the full 768-dimensional space! If you look closely at the clustering plot you can see that the clusters are slightly overlapping - that would be impossible with k-means in 2D, since in that setting membership is determined by distance from the 2D centroid.

[-]jacopo2y10

Ahh sorry! Going back to read it was pretty clear from the text. I was tricked by the figure where the embedding is presented first. Again, good job! :)

[-]jonmenaster2y20

We were surprised to find a decrease in publications on the arXiv in recent years, but identified the cause for the decrease as spurious and fixed the issue in the published dataset (details in Fig. 4).

I'd be interested in hearing more about how the decrease was determined to be spurious; I looked at Fig. 4 but am not understanding how that decision was made based on the figure, if that was the intention.

[-]jacquesthibs2y20

Thanks for the question! When we initially scraped the dataset, we looked at the dates in figure 1.a. and there was a decrease in papers after 2020 since many of the Alignment literature lists we grabbed papers from were made in 2020 or earlier and had not been updated. This led to a perceived decline in papers based on figure 1.a. However, this seemed obviously due to not including all the newer papers that had come out in 2020 and later. So, once we scraped a wider set of papers using arXiv’s API, you could see the uptick in papers in 2020 and beyond (figure 4.e) where there was previously a decrease (figure 1.a).

[-]Ben Smith2y10

I would very much like to see your dataset, as a zotero database or some other format, in order to better orient myself to the space. Are you able to make this available somehow?

[-]Ben Smith2y10

Very very helpful! The clustering is obviously a function of the corpus. From your narrative, it seems like you only added the missing arXiv files after clustering. Is it possible the clusters would look different with those in?

[-]Jan2y20

Hey Ben! :) Thanks for the comment and the careful reading!

Yes, we only added the missing arXiv papers after clustering, but then we repeat the dimensionality reduction and show that the original clustering still holds up even with the new papers (Figure 4 bottom right). I think that's pretty neat (especially since the dimensionality reduction doesn't "know" about the clustering) but of course the clusters might look slightly different if we also re-run k-means on the extended dataset.

[-]jacquesthibs2y10

I just updated the code for the scrape to include the EA Forum in case someone wants to do something interesting with that data. Contains metadata as well: authors, score, votes, date_published, text (post contents), comments.

Here’s a link to a jsonl of the EA Forum only: https://drive.google.com/file/d/1XA71s2K4j89_N2x4EbTdVYANJ7X3P4ow/view?usp=drivesdk