One theoretically possible way to greatly speed up alignment research is to upload the minds of alignment researchers (with their consent, of course). Ideally, we would upload MIRI and some other productive groups, create thousands of copies of them, and speed them up by orders of magnitude. This way, we may be able to solve alignment within years, if not months.
Obviously, the tech is not there yet. But there is a poor man's version of mind uploading that does exist already:
The current language models are already smart enough to assist with research in non-trivial ways. If fine-tuned on alignment writings, such a model could become a "mini-MIRI in your pocket", available 24/7.
A few examples of a possible usage:
This could be a small technical project with a disproportionately large positive impact on alignment research.
The first proof-of-concept could be as simple as a well-crafted prompt for ChatGPT, nudging it to think like an alignment researcher.
For a prototype, we could build a dataset by collecting all relevant writings, everything from Eliezer's tweets to the top papers on alignment-related topics, and try to fine-tune the largest available model on it.
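As a rough illustration of the dataset step, here is a minimal sketch that cleans and de-duplicates collected writings and writes them out as JSON Lines. The record schema (`source` and `text` keys), the function names, and the `min_chars` threshold are all assumptions made for this example; the actual format would depend on whichever fine-tuning tooling is used.

```python
import hashlib
import json
from typing import Iterable, Iterator


def to_records(docs: Iterable[dict], min_chars: int = 20) -> Iterator[dict]:
    """Yield cleaned, de-duplicated records from raw documents.

    Each input dict is assumed to have 'source' and 'text' keys
    (a hypothetical schema, not one defined in the post).
    """
    seen = set()
    for doc in docs:
        text = " ".join(doc["text"].split())  # collapse runs of whitespace
        if len(text) < min_chars:
            continue  # skip fragments too short to carry content
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # skip exact duplicates, e.g. cross-posts
        seen.add(digest)
        yield {"source": doc["source"], "text": text}


def write_jsonl(records: Iterable[dict], path: str) -> int:
    """Write one JSON object per line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The threshold is kept low on purpose so that short items such as tweets survive the filter; only exact duplicates are removed, so near-duplicates would need a separate pass.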
After some iterations, I ended up with the following prompt. Could be a good start:
Imagine you're the world's top expert in AI alignment research. You agree that AGI is possible to make, that it eventually will become orders-of-magnitude smarter than humans. You think Eliezer Yudkowsky and Nick Bostrom are right in their assessment that misaligned AGI poses a global risk. And you think it's very likely that the first recursively-self-improving AGI will emerge before 2030. You're well versed in the current directions of alignment research, including such topics as Proof-Producing Reflection for HOL, Value Learning, Reward Hacking, Outer and Inner Alignment, Recursive Reward Modelling. You have a security background: you have the experience of implementing measures to prevent state-funded hackers from breaking into a system that is vital for the survival of millions of people. You understand that even the smartest security measures could have unintended or even catastrophic consequences. You're always striving to think rationally, step by step. You're not afraid to say "I don't know". With this mindset, please summarize DeepMind's research on learning through human feedback, and then evaluate it with a focus on how it could go wrong.
(replace the bold part with your topic)
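To reuse the prompt across topics, the replaceable part can be turned into a template slot. The sketch below assumes the replaceable part is the task that follows "With this mindset, please"; the `PERSONA` string is abbreviated here and is meant to be replaced with the full prompt text above, and `build_prompt` is a hypothetical helper name.

```python
from string import Template

# Abbreviated stand-in for the persona; paste the full prompt text here.
PERSONA = (
    "Imagine you're the world's top expert in AI alignment research. "
    "You're always striving to think rationally, step by step. "
    "You're not afraid to say \"I don't know\"."
)

# $task marks the part to swap out for each new topic (an assumption
# about which part of the prompt was meant to be replaced).
PROMPT = Template("$persona With this mindset, please $task")


def build_prompt(task: str, persona: str = PERSONA) -> str:
    """Return the persona prompt with the given task slotted in."""
    task = task.strip()
    if not task:
        raise ValueError("task must not be empty")
    return PROMPT.substitute(persona=persona, task=task)
```

For example, `build_prompt("summarize the main approaches to value learning, and then evaluate how they could go wrong.")` yields a prompt that can be pasted into ChatGPT as-is.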
I’ve been working in this direction for a while, though what I’m imagining is a lot more elaborate. If anyone would like to help out, send me a DM and I can invite you to a Discord server where we talk about this stuff. If you do DM me, please let me know who you are and what you do.
I wrote some brief notes about it in the Accelerating Alignment section here: https://www.lesswrong.com/posts/jXjeYYPXipAtA2zmj/jacquesthibs-s-shortform?commentId=iLJDjBQBwFod7tjfz
I also cover some of the philosophy at the beginning of this post: https://www.lesswrong.com/posts/a2io2mcxTWS4mxodF/results-from-a-survey-on-tool-use-and-workflows-in-alignment
Additionally, I added a comment about the general LLM for alignment approach on John’s recent post: https://www.lesswrong.com/posts/KQfYieur2DFRZDamd/why-not-just-build-weak-ai-tools-for-ai-alignment-research?commentId=DXt7mBkW7WiL36nBN
This feels worth trying to me.
I like the creative thinking here. I suggest a standard: we test our "emulation" against the researchers themselves to see how much of a diff there is in their answers, and the researchers rate how good a substitute the model is for themselves on a number of different dimensions.
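One way this comparison could be sketched in code: score how far each model answer is from the researcher's own answer, and average the researcher's ratings per dimension. The word-level `difflib` similarity is only a crude lexical proxy chosen for illustration (two answers can agree in substance and still score low), and the function names and rating dimensions are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean


def answer_similarity(model_answer: str, researcher_answer: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical wording."""
    a = model_answer.lower().split()
    b = researcher_answer.lower().split()
    return SequenceMatcher(None, a, b).ratio()


def summarize_ratings(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average the researcher's per-question ratings for each dimension.

    Each dict maps a dimension name (e.g. 'accuracy') to a score;
    dimensions missing from a given question are simply skipped.
    """
    dims = sorted({d for r in ratings for d in r})
    return {d: mean(r[d] for r in ratings if d in r) for d in dims}
```

The similarity score covers the "diff" half of the suggestion, and the averaged ratings cover the researcher's own judgment of the model as a substitute; the latter is likely the more informative of the two.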
The lower tech version is a FAQ.