Your rot13 suggestion has the obvious corruption problem, but it also poses a public-relations problem for the plan: I doubt it would be popular. However, I like where your head is at.
My suggestion isn't supposed to be the final idea. It's just supposed to be an improvement over what appears to be Wei Dai's implicit idea: have philosophers with some connection to AGI labs solve these philosophical issues, then hardcode the solutions so they can't be changed.
(Perhaps you could argue that Wei Dai's implicit idea is better, because there's only a chance that these philosophers will be listened to, and even then it will be in the distant future. Maybe those conditions keep philosophers honest. But we could replicate those conditions in my scenario as well: Randomly generate 20 different groups of philosophers, then later randomly choose 1 group to act on their conclusions, and only act on their conclusions after a 30-year delay.)
Basically, whatever system we use for deciding on and controlling the corrigible AI becomes the system we are concerned with ensuring the alignment of. It doesn't really solve the problem; it just pushes it out one matryoshka doll around the AI.
I'm not convinced they are the same problem, but I suppose it can't hurt to check if ideas for the alignment problem might also work for the "morality is scary" problem.
Maybe you could collaborate with Chad Jones or some other heavyweight?
Based on the OP, it seems there are some issues with writing papers about transformative AI in general. I would say it's best for the journal to feature a variety of perspectives on this topic rather than privileging a particular viewpoint. "The Journal of Existential Risk of AI" risks being perceived as too niche, in my opinion. But "The Journal of Transformative AI" seems likely to attract interest by default, if AI continues to progress rapidly.
Again, thanks for the reply.
Building a corrigible AGI has a lot of advantages. But one disadvantage is the "morality is scary" problem you mention in the linked comment. If there is a way to correct the AGI, who gets to decide when and how to correct it? Even if we get the right answers to all of the philosophical questions you're talking about, and successfully program them into the AGI, the philosophical "unwashed masses" you fear could exert tremendous public pressure to use the corrigibility functionality and change those right answers into wrong ones.
Since corrigibility is so advantageous (including its ability to let us put off all of your tricky philosophical problems), it seems to me that we should think about the "morality is scary" problem so we can address what appears to be corrigibility's only major downside. I suspect the "morality is scary" problem is more tractable than you assume. Here is one idea (rot13'd so people can think independently before reading it): Oevat rirelbar va gur jbeyq hc gb n uvtuyl qrirybcrq fgnaqneq bs yvivat. Qrirybc n grfg juvpu zrnfherf cuvybfbcuvpny pbzcrgrapr. Inyvqngr gur grfg ol rafhevat gung vg pbeerpgyl enax-beqref cuvybfbcuref ol pbzcrgrapr nppbeqvat gb 3eq-cnegl nffrffzragf. Pbaqhpg n tybony gnyrag frnepu sbe cuvybfbcuvpny gnyrag. Pbafgehpg na vibel gbjre sbe gur jvaaref bs gur gnyrag frnepu gb fghql cuvybfbcul naq cbaqre cuvybfbcuvpny dhrfgvbaf juvyr vfbyngrq sebz choyvp cerffher.
Thanks for the reply. I'm not a philosopher, but it seems to me that most of these problems could be addressed after an AGI is built, if the AGI is corrigible. For which problems can you make the strongest case that they can't be put off this way?
Is starting a new journal devoted to this topic a feasible option?
The underlying assumption here ("the halt assumption") seems to be that big-shot decisionmakers will want to halt AI development if it's clear that unsolved alignment problems remain.
I'm a little skeptical of the halt assumption. Right now it seems that unsolved alignment problems remain, yet I don't see big-shots moving to halt AI development. About a week after Grok's MechaHitler incident, the Pentagon announced a $200 million contract with xAI.
Nonetheless, in a world where the halt assumption holds, the highest-impact action might be the meta approach of "making the notion of illegible problems more legible". If the halt assumption becomes true later (e.g. because the threshold for concern changes), and the existence and importance of illegible problems has already been made legible to decisionmakers, that by itself might be enough to stop further development.
So yeah, in service of increasing meta-legibility (of this issue), maybe we could get some actual concrete examples of illegible problems and reasons to think they are important? Because I'm not seeing any concrete examples in your post, or in the comments of this thread. I think I am more persuadable than a typical big-shot decisionmaker, yet my cynical side reads this post and thinks: "Navel-gazers think it's essential to navel-gaze. News at 11."
Another angle: Are there concrete examples of AI alignment problems which were once illegible and navel-gazey, which are now legible and obviously important? (Hopefully you won't say the need for AI alignment itself; I've been on board with that for as long as I can remember.)
There is another possible world here, which is that legibility actually correlates pretty well with real-world importance, and the halt assumption is false, and your post is going to redirect scarce AI alignment talent away from urgent problems which matter, and towards fruitless navel-gazing. I'm not claiming this is the world we live in, but it would be good to gather evidence, and concrete examples could help.
I fully support people publishing lists of AI alignment problems which seem neglected. (Why can't I find a list like that already?) But I suspect many list entries will have been neglected for good reason.
> alignment equivalents to "make a trillion dollars" for capabilities that are easy to verify, strictly imply alignment, and extremely difficult to get any traction on (and with it, a series of weakenings of such a metric that are easier to get traction on but also less-strictly imply alignment).
I expect there's a fair amount of low-hanging fruit in finding good targets for automated alignment research. E.g. how about an LLM agent which reads 1000s of old LW posts looking for a good target? How about unlearning? How about a version of RLHF where you show an alignment researcher two AI-generated critiques of an alignment plan, and they rate which critique is better?
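To make that last suggestion concrete, here's a minimal sketch of the pairwise-critique idea, assuming a hypothetical `generate_critique` wrapper around whatever LLM you have access to (the names and stub logic are mine, just to show the shape of the preference data you'd collect):

```python
# Minimal sketch of the pairwise-critique idea above. `generate_critique` is a
# placeholder for an LLM call; the reviewer is a stand-in for a human alignment
# researcher. The point is the shape of the preference records, not the details.
import json
import random

def generate_critique(plan: str, seed: int) -> str:
    """Placeholder for an LLM call that critiques an alignment plan."""
    return f"[critique #{seed} of: {plan[:40]}...]"

def collect_preference(plan: str, reviewer_choice_fn) -> dict:
    """Show a reviewer two critiques of the same plan and record the preferred one."""
    a = generate_critique(plan, seed=random.randint(0, 10**6))
    b = generate_critique(plan, seed=random.randint(0, 10**6))
    preferred = reviewer_choice_fn(plan, a, b)  # returns "a" or "b"
    return {"plan": plan, "critique_a": a, "critique_b": b, "preferred": preferred}

if __name__ == "__main__":
    # Stand-in reviewer: prefers the longer critique (a real reviewer would judge quality).
    record = collect_preference(
        "Use debate between two models to surface flaws in a proposed objective.",
        lambda plan, a, b: "a" if len(a) >= len(b) else "b",
    )
    print(json.dumps(record, indent=2))
```

The interesting design questions are all upstream of this loop (where the plans come from, who the reviewers are); the data format itself is just standard preference-pair collection.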
I believe that in Thinking, Fast and Slow, Kahneman refers to this fallacy as "What You See Is All There Is" (WYSIATI). And it used to be common for people to talk about "unknown unknowns" (things you don't know, that you also don't know you don't know).
+1
Glad you're self-aware about this. I would focus less on championing the concept, and more on treating it as a hypothesis about a research approach which may or may not deliver benefits. I wouldn't evangelize until you've got serious benefits to show, and even then I'd lead with those benefits (with the concept that delivered them as more of a footnote).