(this answer is cross-posted on my blog)
Here is a list of problems I seek to either resolve or get around in order to implement my formal alignment plans, especially QACI:
Open Problems in AI X-Risk:
I'm going to answer a different question: what's my list of open problems in understanding agents? I claim that, once you dig past the early surface-level questions about alignment, basically the whole cluster of "how do agents work?"-style questions and subquestions form the main barrier to useful alignment progress. So with that in mind, here are some of my open questions about understanding agents (and the even deeper problems one runs into when trying to understand agents), going roughly from "low-level" to "high-level".
Here’s Quintin Pope’s answer from the Twitter thread I posted (https://twitter.com/quintinpope5/status/1633148039622959104?s=46&t=YyfxSdhuFYbTafD4D1cE9A):
1.1 How do we get more convergence?
How do we minimize semantic drift in LMs when we train them to do other stuff? (If you RL them to write good code, how do you make sure their English continues to describe their programs well?)
How well do alignment techniques generalize across capabilities advances? If AI starts doing AI research and makes 20 capabilities advances like the Chinchilla scaling laws, will RLHF/whatever still work on the resulting systems?
Where do the inductive biases of very good SGD point? Are they "secretly evil", in the sense that powerful models convergently end up deceptive / explicit reward optimizers / some other bad thing?
4.1 If so, how do we stop that?
5.1 How do we safely shape such a process? We want the process to enter stable attractors along certain dimensions (like "in favour of humanity"), but not along others (like "I should produce lots of text that agents similar to me would approve of").
What are the limits of efficient generalization? Can plausible early TAI generalize from "all the biological data humans gathered" to "design protein sequences to build nanofactory precursors"?
Given a dataset that can be solved in multiple different ways, how can we best influence the specific mechanism the AI uses to solve that dataset?
7.1 Like this? https://arxiv.org/abs/2211.08422
7.2 or this? https://openreview.net/forum?id=mNtmhaDkAr
7.3 Or how about this? https://www.lesswrong.com/posts/rgh4tdNrQyJYXyNs8/qapr-3-interpretability-guided-training-of-neural-nets
How do we best extract unspoken beliefs from LM internal states? Basically ELK for LMs. See: https://github.com/EleutherAI/elk
What mathematical framework best quantifies the geometric structure of model embedding space? E.g., using cosine similarity between embeddings is bad because it's dominated by outlier dims and doesn't reflect distance along the embedding manifold. We want math that more meaningfully reflects the learned geometry. Such a framework would help a lot with questions like "what does this layer do?" and "how similar are the internal representations of these two models?"
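To make the outlier-dimension point concrete, here is a minimal toy sketch (my own illustration, not from the thread): two vectors that disagree on every ordinary dimension can still look nearly identical under cosine similarity if they share one large outlier dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = -a.copy()        # b points the opposite way on every ordinary dimension
a[0] = 1000.0        # a single shared "outlier" dimension...
b[0] = 1000.0        # ...dominates the dot product and both norms

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))          # close to 1.0
print(cosine(a[1:], b[1:]))  # exactly -1.0 on the non-outlier dimensions
```

The two vectors are anti-parallel on 767 of 768 dimensions, yet their cosine similarity is near 1, which is the sense in which naive cosine similarity fails to reflect the geometry we actually care about.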
How do we best establish safe, high-bandwidth, information-dense communication between human brains and models? This is the big bottleneck on approaches like cyborgism, and includes all forms of BCI research / "cortical prosthesis" / "merging with AI". But it also includes things like "write a very good visualiser of LM internal representations", which might allow researchers a higher-bandwidth view of what's going on in LMs beyond just "read the tokens sampled from those hidden representations".
What is the correct "object of study" for alignment researchers in understanding the mechanics of a world immediately before and during takeoff? A good step in this direction is the work of Alex Flint and Shimi's UAO.
What form does the correct alignment goal take? Is it a utility function over a region of space, a set of conditions to be satisfied, or something else?
Mechanistically, how do systems trained primarily on token frequencies appear to be capable of higher level reasoning?
How likely is the emergence of deceptively aligned systems?
Some braindumping. This took me a while, with many passes of editing in Loom to see if I'd missed something. I rejected almost every Loom branch, though, so this is still almost all my own writing; sometimes the only thing I get from Loom is knowing what I don't intend to say:
My current sense is that I will be the one to answer exactly none of these. But who knows! Anyway, here are some. I think I have more knocking around somewhere in my head and/or my previous comments.
Question for my fellow alignment researchers out there: do you have a list of unsolved problems in AI alignment? I'm thinking of creating an "alignment mosaic" of the questions we need to resolve and slowly filling it in with insights from papers/posts.
I have my own version of this, but I would love to combine it with others' alignment backcasting game-trees. I want to collect the kinds of questions people are keeping in mind when reading papers/posts, thinking about alignment or running experiments. I'm working with others to make this into a collaborative effort.
Ultimately, what I’m looking for are important questions and sub-questions we need to be thinking about and updating on when we read papers and posts as well as when we decide what to read.
Here’s my Twitter thread posing this question: https://twitter.com/jacquesthibs/status/1633146464640663552?s=46&t=YyfxSdhuFYbTafD4D1cE9A.
Here’s a sub-thread breaking down the alignment problem in various forms: https://twitter.com/jacquesthibs/status/1633165299770880001?s=46&t=YyfxSdhuFYbTafD4D1cE9A.