Epistemic status: I'm pretty confident about this; looking for feedback and red-teaming. As of 2023-06-30, I notice multiple epistemological errors in this document. I do not endorse the reasoning in it, and while my object-level claims haven't changed radically, I am in the process of improving them using better epistemological procedures. I might update this after.
This post describes my current model of the world and the alignment problem, and my current plan to mitigate existential risk by misaligned ASI given these beliefs, my skills and my situation. It is written to both communicate my plan and to get feedback on it for improvement.
Here are my current beliefs:
Given the stated beliefs, my current probability distribution for the creation of an artificial superintelligence (ASI) over the next decade is roughly a normal distribution with a mean of 2.5 years from now (that is, around 2025-01-01) and a standard deviation of 0.5 years.
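To make the shape of this distribution concrete, here is a minimal sketch that reads off a few probabilities from it. It assumes the standard deviation is measured in years (same unit as the mean) and that time is counted from the date of writing; the names `timeline`, `prob_asi_within` and `central_interval` are purely illustrative.

```python
from statistics import NormalDist

# Assumption: both numbers are in years from the date of writing.
# The post gives the standard deviation as a bare "0.5".
MEAN_YEARS = 2.5
SD_YEARS = 0.5

timeline = NormalDist(mu=MEAN_YEARS, sigma=SD_YEARS)


def prob_asi_within(years: float) -> float:
    """Probability mass the stated distribution puts on ASI arriving within `years`."""
    return timeline.cdf(years)


def central_interval(coverage: float) -> tuple[float, float]:
    """Symmetric interval (years from now) that contains `coverage` of the mass."""
    tail = (1.0 - coverage) / 2.0
    return timeline.inv_cdf(tail), timeline.inv_cdf(1.0 - tail)


for horizon in (1.5, 2.0, 2.5, 3.0, 3.5):
    print(f"P(ASI within {horizon} years) = {prob_asi_within(horizon):.3f}")

low, high = central_interval(0.95)
print(f"95% central interval: {low:.2f} to {high:.2f} years from now")
```

Under these assumptions, roughly 16% of the probability mass falls within 2 years, about 84% within 3 years, and about 95% between roughly 1.5 and 3.5 years from the date of writing.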
There are two main ways one can delay the creation of an ASI (so we have more time to solve the alignment problem): unilateral pivotal acts, and government-backed governance efforts to slow down AI capabilities research and investment. Unilateral pivotal acts seem like they would burn a lot of trust and social capital, which would actually accelerate race dynamics and animosity between parties competing to launch an ASI; this is why I believe they should be avoided. In contrast, I expect government-backed efforts to regulate AI capabilities research and investment would not have such side effects and would give alignment researchers about a decade or two before someone creates an ASI, which may even be enough time to solve the problem.
Since the only relevant country when it comes to AI capabilities regulation is the USA, I am not in a position to impact government-backed governance efforts as an Indian citizen. This is especially true since my technical skills are significantly better than my skills at diplomacy and persuasion. Ergo, I must focus on alignment research and the researchers actually attempting to solve the problem.
I am most interested in alignment agendas that focus on what seem to be key bottlenecks to aligning ASIs. Specifically, the areas I believe are worth gigantic researcher investment are:
Here's a list of existing alignment agendas I find high value enough to track, and if possible, contribute to:
Here's a list of existing alignment agendas I haven't looked at yet and think may be high value enough to also make a significant impact on our ability to solve the problem:
While these are not necessarily agendas, here's a list of researchers whose writings are worth reading (but that I haven't entirely read):
I also am excited about agendas that intend to accelerate alignment research in the time we have.
While I find shard theory incredibly intellectually stimulating and fun to make progress on, I do not believe it is enough to solve the alignment problem, since all of shard theory involves ontologically fragile solutions. Humans do not have ontologically robust formulations of terminal goals, and a human whose intellectual capabilities are enhanced 1000x will necessarily be misaligned with their current self. At best, shard theory serves as a keystone for the most powerful non-ontologically-robust alignment strategy we will have, and that is not very useful given my model of the problem.
Mechanistic interpretability (defined as the ability to take a neural network and convert it into human-readable code that effectively does the same thing), at its limit, should lead to solving both inner misalignment and ontological robustness, but I am pessimistic that we will solve mechanistic interpretability at its limit given my timeline. Worse, any progress towards mechanistic interpretability (also known simply as interpretability research) simply accelerates AI capabilities without any improvement in our ability to align AI models at their limit.
Shard theory research seems, on the surface, less damaging than interpretability research because shard theory research focuses purely on injecting goals and ensuring they continue to stay there, while interpretability actually improves our understanding of models and makes our architectures and training processes more efficient, reducing the compute bound required for us to get to RSI. However, shard theory research relies on interpretability research findings to make progress in detecting shards (here's an example), and I assume progress in interpretability research would generally drive progress in non-ontologically-robust alignment approaches such as shard theory. I currently believe the trade-off of accelerating capabilities by doing interpretability research is very likely not worth the progress in non-ontologically-robust alignment agendas it unlocks, since it does not help with aligning AI models at their limit.
Finally, while I expect certain alignment agendas (such as Steven Byrnes's Brain-like AGI) to produce positive expected value contributions to solving the problem, they do not aim at what I consider the core problems enough for me to personally track and potentially contribute to.
Note: There are many (new and established) independent alignment researchers who haven't published written research contributions (or explicitly defined their agenda), but who make a significant positive impact on the alignment research community and are worth talking to and working with. I have simply chosen not to list them here.
My plan is simple: accelerate technical alignment research in the key research bottlenecks by whatever means I have at hand that make the biggest impact. This means, in the order of potential impact:
The better my skills at direct technical conceptual research are, the bigger my impact. This means that I should also focus on improving my ability to do conceptual research work -- but that is implicit in my definition of making direct research contributions anyway.
While I am uncertain about my ability to make original direct research contributions right now, I am confident of my ability to distill existing research contributions and to red-team existing research contributions. These seem to be relatively neglected 'cause areas' and I expect my intervention to make a big difference there (although not as much as direct research contributions).
I wish I could say that support for high value alignment researchers working on these bottlenecks is not neglected, but this is absolutely not the case. Funding and visas are the two key bottlenecks, and I believe that the current state of the ecosystem is abysmal compared to how it should be if we were actually trying to solve the problem. Anyway, I believe I am agentic enough, and smart enough, to provide informal support (particularly in the form of logistics and ops work) to researchers who are not yet in a position where they do not need to worry about such 'chores' that get in the way of them doing actual research.
My personal bottlenecks to working on all of these things are visas and funding (with visas being a significantly more painful bottleneck than funding). In the worst case scenario, I may end up in a position where I have close-to-zero ability to make net positive contributions in these four ways towards solving the problem. I assign a 5% probability to this situation occurring within my timeline, given my status as an Indian citizen. To mitigate this, I shall look out for ways to extend my logistical runway to continue to be useful towards solving the problem. I prefer to not discuss specifics of my plans about this in this post, but feel free to message me to talk about it or offer advice.
Onward to utopia.
The argument for why this is the case is outside the scope of this post, and probably a capability externalities infohazard, so I choose to not discuss it here. ↩︎
The normal distribution is a good default given my uncertainty regarding further details about what scenarios we shall see. ↩︎
Especially since his SERI MATS strategy seems to be to mainly develop independent alignment researchers who work with each other instead of with him or on his work ↩︎
Did your model change in the last 6 months or so, since the GPTx takeover? If so, how? Or is it a new model? If so, can you mentally go back to pre-GPT-3.5 and construct the model then? Basically, I wonder which of your beliefs changed since then.
Your question seems to focus mainly on timeline model and not alignment model, so I shall focus on explaining how my model of the timeline has changed.
My timeline shortened from about four years (mean probability) to my current timeline of about 2.5 years (mean probability) since the GPT-4 release. This was because of two reasons:
The latter is more load-bearing than the former, although my predictions for how soon AI labs will achieve human-in-the-loop RSI create an upper bound on how much time we have (assuming no slowdown), which is quite useful when constructing a timeline.
Nice post! You seem like you know what you are doing. I'd be curious to hear more about what you think about these priority areas, and why interpretability didn't make the list:
ensuring ontological robustness of goals and concepts (without the use of formally specified goals)
creating formally specified outer aligned goals
preventing inner misalignment
understanding mesaoptimizers better by understanding agency
Thanks and good luck!
Sorry for the late reply: I wrote up an answer but due to a server-side error during submission, I lost it. I shall answer the interpretability question first.
Interpretability didn't make the list because of the following beliefs of mine:
Here are a few facets of interpretability research that I am enthusiastic about tracking, but not excited enough to want to work on, as of writing:
I'm really glad you asked me this question! You've helped me elicit (and develop) a more nuanced view on interpretability research.
unilateral pivotal acts, and government-backed governance efforts to slow down AI capabilities research and investment
There is nothing inherently unilateral about pivotal acts. The problem with an international moratorium is that with enforcement tools that are readily available it's unlikely to last as long as it needs to for human-level alignment theory to catch up. Being government-backed is not part of the problem. Pivotal AI-enabled trajectories of development can help with that, by providing the tools for a more reliable international moratorium and for getting to a place where the field of alignment is actually ready for tackling more capability.
When I referred to pivotal acts, I implied the use of enforcement tools that are extremely powerful, of the sort implied in AGI Ruin. That is, enforcement tools that make an actual impact in extending timelines. Perhaps I should start using a more precise term to describe this from now on.
It is hard for me to imagine how there can be consensus within a US government organization capable of launching a superhuman-enforcement-tool-based pivotal act (such as three-letter agencies) to initiate a moratorium, much less consensus in the US government or between the US and the EU (especially given the rather interesting strategy the EU is trying with their AI Act).
I continue to consider all superhuman-enforcement-tool-based pivotal acts as unilateral given this belief. My use of the word "unilateral" points to the fact that the organizations and people who currently have a non-trivial influence over the state of the world and its future will almost entirely be blindsided by the pivotal act, and that will result in destruction of trust, chaos, and an increase in conflict. And I currently believe that this is actually more likely to increase P(doom) or existential risk for humanity, even if it extends the foom timeline.
Although not preventing ASI creation entirely. The destruction of humanity's potential is also an existential risk, and the inability for us to create a utopia is too painful to bear. ↩︎
How long do you think such a moratorium would last?
There is nothing physically impossible about it lasting however long it needs to; that's only implausible for the same political and epistemic reasons that any global moratorium at all is implausible. GPUs don't grow on trees.
My point in the above comment is that pivotal acts don't by their nature stay apart: a conventional moratorium that actually helps is also a pivotal act. Pivotal act AIs are something like task AIs that can plausibly be made to achieve a strategically relevant effect relatively safely, well in advance of actually having the understanding necessary to align a general agentic superintelligence, using alignment techniques designed around the lack of such an understanding. Advances made by humans with the use of task AIs could then increase the robustness of a moratorium's enforcement (better cybersecurity and compute governance), reduce the downsides of the moratorium's presence (tool AIs allowed to make biotech advancements), and ultimately move towards being predictably ready for a superintelligent AI, which might initially look like developing alignment techniques that work for making more and more powerful task AIs safely. Scalable molecular manufacturing of compute is an obvious landmark, and can't end well without robust compute governance already in place. Human uploading is another tool that can plausibly be used to improve global security without having a better understanding of AI alignment.
(I don't see what we currently know justifying Hanson's concern of never making enough progress to lift a value drift moratorium. If theoretical progress can get feedback from gradually improving task AIs, there is a long way to go before concluding that the process would peter out before superintelligence, so that taking any sort of plunge is remotely sane for the world. We haven't been at it for even a million years yet.)