I wonder if Google is optimizing harder for benchmarks, to try to prop up its stock price against possible deflation of an AI bubble.
It occurs to me that an AI alignment organization should create comprehensive private alignment benchmarks and start releasing the scores. They would have to be constructed in a non-traditional way so they're less vulnerable to standard goodharting. If these benchmarks become popular with AI users and AI investors, they could be a powerful way to steer AI development in a more responsible direction. By keeping them private, you could make it harder for AI companies to optimize against the benchmarks, and nudge them towards actually solving deeper alignment issues. It would also be a powerful illustration of the point that advanced AI will need to solve unforeseen/out-of-distribution alignment challenges. @Eliezer Yudkowsky
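As a rough illustration of the mechanics (not a definitive design), here is a minimal Python sketch of what a private benchmark harness could look like; the file name `private_alignment_items.jsonl`, the `grade` rubric, and the model interface are all hypothetical placeholders. The key property is that only an aggregate score is ever published, so labs can't optimize against individual items.

```python
import json
import statistics

def load_private_items(path="private_alignment_items.jsonl"):
    """Load held-out evaluation items from a private, access-controlled store (hypothetical path)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def grade(response: str, item: dict) -> float:
    """Placeholder grader; a real benchmark might use human raters or a rubric model instead."""
    return 0.0 if any(marker in response for marker in item["failure_markers"]) else 1.0

def evaluate(model_query, items) -> float:
    """Return only an aggregate score; the items and per-item results stay private."""
    return statistics.mean(grade(model_query(item["prompt"]), item) for item in items)

# Example usage with a stand-in model interface (some_model_api is hypothetical):
# score = evaluate(lambda prompt: some_model_api(prompt), load_private_items())
# print({"model": "model-x", "score": round(score, 3)})  # publish the aggregate only
```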
Thanks for making this list!
With all of this written down in one place, it's hard not to feel some hopelessness about whether all of these problems can be made legible to the relevant people, even with a maximum plausible effort.
I think that a major focus should be on prioritizing these problems based on how plausible a story you can tell for a catastrophic outcome if the problem remains unsolved, conditional on an AI that is corrigible and aligned in the ordinary sense.
I suppose coming up with such a clear catastrophe story for a problem is more or less the same thing as legibilizing it, which reinforces my point from the previous thread that a priori, it seems likely to me that illegible problems won't tend to be as important to solve.
The longer a problem has been floating around without anyone generating a clear catastrophe story for it, the greater probability we should assign that it's a "terminally illegible" problem which just won't cause a catastrophe if it's unsolved.
Maybe it would be good to track how much time has been spent attempting to come up with a clear catastrophe story for each problem, so people can get a sense of when diminishing research returns are reached for a given problem? Perhaps researchers who make attempts should leave a comment in this thread indicating how much time they spent trying to generate catastrophe stories for each problem?
Perhaps it's worth concluding with a point from a discussion between @WillPetillo and me under the previous post: a potentially more impactful approach (compared to trying to make illegible problems more legible) is to make key decisionmakers realize that important safety problems which are illegible to them (and even to their advisors) probably exist, and that it is therefore very risky to make highly consequential decisions (such as those about AI development or deployment) based only on the status of legible safety problems.
I still think the best way to do this is to identify at least one problem which initially seemed esoteric and illegible, and eventually acquired a clear and compelling catastrophe story. Right now this discussion all seems rather hypothetical. From my perspective, the problems on your list fall into two rough categories: legible problems which seem compelling, and super-esoteric problems like "Beyond Astronomical Waste" which don't need to be solved prior to the creation of an aligned AI. Off the top of my head, I haven't noticed many problems moving from one category to the other. So, speaking only for myself, this list hasn't convinced me that esoteric and illegible problems should receive a much larger share of scarce resources, although I admit I only gave it a quick skim.
I definitely like the directions you are exploring, and I agree they are improvements over the implicit AGI-lab-directed concept. That's a useful thing to keep in mind, but so are the reasons these aren't yet final ideas.
+1
What do you think? Does that make sense at all, or maybe it seems more like a time wasting distraction? I have to admit I'm uncomfortable with the amount I have gotten stuck on the idea that championing this concept is a useful thing for me to be doing.
Glad you're self-aware about this. I would focus less on championing the concept, and more on treating it as a hypothesis about a research approach which may or may not deliver benefits. I wouldn't evangelize until you've got serious benefits to show, and show those benefits first (with the concept that delivered those benefits as more of a footnote).
Your rot13 suggestion has the obvious corruption problem, but also has the problem of public relations for the plan. I doubt it would be popular. However, I like where your head is at.
My suggestion is not supposed to be the final idea. It's just supposed to be an improvement over what appears to be Wei Dai's implicit idea, of having philosophers who have some connection to AGI labs solve these philosophical issues, and hardcode solutions in so they can't be changed.
(Perhaps you could argue that Wei Dai's implicit idea is better, because there's only a chance that these philosophers will be listened to, and even then it will be in the distant future. Maybe those conditions keep philosophers honest. But we could replicate those conditions in my scenario as well: randomly generate 20 different groups of philosophers, then later randomly choose 1 group whose conclusions will be acted on, and only act on them after a 30-year delay.)
Basically, whatever system we use for deciding on and controlling the corrigible AI becomes the system whose alignment we need to ensure. It doesn't really solve the problem; it just pushes it out one matryoshka doll around the AI.
I'm not convinced they are the same problem, but I suppose it can't hurt to check if ideas for the alignment problem might also work for the "morality is scary" problem.
Maybe you could collaborate with Chad Jones or some other heavyweight?
Based on the OP, it seems there are some issues with writing papers about transformative AI in general. I would say it's best for the journal to feature a variety of perspectives on this topic instead of privileging a particular viewpoint. "The Journal of Existential Risk of AI" risks being perceived as too niche, in my opinion, whereas "The Journal of Transformative AI" seems likely to attract interest by default if AI continues to progress rapidly.
Again, thanks for the reply.
Building a corrigible AGI has a lot of advantages. But one disadvantage is the "morality is scary" problem you mention in the linked comment. If there is a way to correct the AGI, who gets to decide when and how to correct it? Even if we get the right answers to all of the philosophical questions you're talking about, and successfully program them into the AGI, the philosophical "unwashed masses" you fear could exert tremendous public pressure to use the corrigibility functionality and change those right answers into wrong ones.
Since corrigibility is so advantageous (including its ability to let us put off all of your tricky philosophical problems), it seems to me that we should think about the "morality is scary" problem so we can address what appears to be corrigibility's only major downside. I suspect the "morality is scary" problem is more tractable than you assume. Here is one idea (I did a rot13 so people can think independently before reading my idea): Oevat rirelbar va gur jbeyq hc gb n uvtuyl qrirybcrq fgnaqneq bs yvivat. Qrirybc n grfg juvpu zrnfherf cuvybfbcuvpny pbzcrgrapr. Inyvqngr gur grfg ol rafhevat gung vg pbeerpgyl enax-beqref cuvybfbcuref ol pbzcrgrapr nppbeqvat gb 3eq-cnegl nffrffzragf. Pbaqhpg n tybony gnyrag frnepu sbe cuvybfbcuvpny gnyrag. Pbafgehpg na vibel gbjre sbe gur jvaaref bs gur gnyrag frnepu gb fghql cuvybfbcul naq cbaqre cuvybfbcuvpny dhrfgvbaf juvyr vfbyngrq sebz choyvp cerffher.
Thanks for the reply. I'm not a philosopher, but it seems to me that most of these problems could be addressed after an AGI is built, if the AGI is corrigible. Which problems can you make the strongest case for as problems which we can't put off this way?
Is starting a new journal devoted to this topic a feasible option?
Source.
I'm concerned that this "law" may apply to Anthropic. People devoted to Anthropic as an organization will have more power than people devoted to the goal of creating aligned AI.
I would encourage people at Anthropic to leave a line of retreat and consider the "least convenient possible world" where alignment is too hard. What's the contingency plan for Anthropic in that scenario?
Next, devise a collective decision-making procedure for activating that contingency plan. For example, maybe the contingency plan should be activated if X% of the technical staff votes to activate it, perhaps only after a week of discussion. What would trigger that week of discussion? You can answer these questions and come up with a formal procedure (a rough sketch follows below).
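To make that concrete, here is a minimal Python sketch of one possible formal procedure, purely illustrative: the 10% petition trigger, the one-week discussion period, and the 60% activation threshold are placeholder numbers standing in for the "X%" above, which the organization would have to choose for itself.

```python
from datetime import datetime, timedelta

DISCUSSION_PERIOD = timedelta(days=7)   # placeholder: the "week of discussion"
PETITION_FRACTION = 0.10                # placeholder trigger for starting discussion
VOTE_THRESHOLD = 0.60                   # placeholder for the "X%" activation vote

def discussion_triggered(petition_signatures: int, technical_staff: int) -> bool:
    """Start the discussion week if enough of the technical staff petitions for it."""
    return petition_signatures / technical_staff >= PETITION_FRACTION

def contingency_activated(votes_for: int, technical_staff: int,
                          discussion_started: datetime, now: datetime) -> bool:
    """Activate the contingency plan only after the discussion period has elapsed
    and the required fraction of the technical staff has voted for activation."""
    discussion_complete = now - discussion_started >= DISCUSSION_PERIOD
    return discussion_complete and votes_for / technical_staff >= VOTE_THRESHOLD
```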
If you had both a contingency plan and a formal means to activate it, I would feel a lot better about Anthropic as an organization.