The argument goes: there will be a time in the future, t′, when e.g. a terrible AI accident occurs, alignment failures are documented (e.g. partial deception), or AI accounts for the majority of GDP, such that many more people pour resources into aligning AI. Potentially to the point that >90% of all alignment resources are spent in the years just before an x-catastrophe or a pivotal act (Figure 2).
The initial graph (Fig. 1) seems surprisingly useful as a frame for arguing different cruxes & intuitions. I will quickly enumerate a few & would appreciate comments where you disagree.
If we just govern compute usage while hardware/software advances continue, this may merely shift t′ to the right without slowing down timelines, which implies fewer total resources poured into alignment for no benefit.
If we successfully limit compute for many years but hardware & software improvements continue, an actor can defect and experience a large, discontinuous jump in capabilities. If we (somehow) limit all three, it becomes much, much harder to produce transformative AI (intuition: it's like someone trying to build it today).
As mentioned before, t’ could be caused by “a terrible AI accident occurs, alignment failures are documented (e.g. partial deception), or the majority of GDP is AI such that more people are pouring resources into aligning AI.”
There are other potential causes as well, and I would find it beneficial to investigate how effective the above three (and others) actually are at convincing real AI researchers to switch their research focus to alignment. I mean literally talking to machine learning researchers and asking what capabilities (negative or positive) would get them to seriously consider switching their research focus.
Additionally, if we find that e.g. showing partial deception in models really would be convincing, then pouring resources into showing that sooner would be shifting t’ to the left, implying more overall resources poured into alignment.
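The "shifting t′ left means more overall resources" claim can be made concrete with a toy model (all numbers below are hypothetical illustrations, not estimates): if alignment effort per year is low before t′ and high after, and the catastrophe/pivotal-act date T is fixed, then total effort is the area under that step function, which grows as t′ moves earlier.

```python
# Toy model: total alignment resources as the area under a step function of time.
# All numbers are hypothetical, chosen only to illustrate the direction of the effect.

def total_alignment_effort(t_prime, T=2040, start=2022, low=1.0, high=10.0):
    """Effort is `low` units/year before t_prime and `high` units/year after,
    up to a fixed catastrophe/pivotal-act date T. Returns the total area."""
    return low * (t_prime - start) + high * (T - t_prime)

baseline = total_alignment_effort(t_prime=2035)      # wake-up happens late
shifted_left = total_alignment_effort(t_prime=2030)  # t' moved 5 years earlier

# Earlier wake-up => more total effort, since T is unchanged.
assert shifted_left > baseline
```

The same model also captures the earlier worry about compute governance: moving t′ to the right while T stays put strictly shrinks the high-effort window, hence the total.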
I expect most AI researchers who try to do alignment to (1) not do impactful work or (2) reinvent the wheel. So, assuming we have lots of people who want to do alignment, is there a process that helps them avoid (1) and (2)? For example, a course/workshop they take, a post they read, etc.
What I currently think is important is creating multiple documents like https://arbital.com/p/AI_boxing/ . So if someone comes up w/ a boxed-AI plan, we can clearly argue that it must (1) actually build an air-tight sandbox and (2) still be useful through the remaining channels to perform a pivotal act. If their plan actually addresses these two arguments, then I am much more excited about it.
So creating more documents like that for e.g. interpretability, learning from human feedback, etc., and iterating on those arguments with researchers working in those fields today, will help future researchers avoid wasting their time on dead ends & reinventing the wheel. See also my latest post on avoiding dead-end research.
Another useful thing to have is clearly specified sub-problems, which may look like grounding them in already established formalizations. I think this is really hard, but having these would allow outsiders to make clear progress on the problem and would even allow us to directly pay unaligned researchers to work on it today (or set up bounties / Millennium Prize-like questions).
Relatedly, if we do expect way more people to enter the field, are we building the infrastructure to support that scaling? Ideally, whatever scales to 10,000 people also serves the thousand or so people who want to do alignment research today.
Special thanks to Tamay Besiroglu for introducing me to this framing & several arguments, though note my takes are different from his.
Well, COVID was pretty much a massive, obvious biorisk disaster. Did it lead to huge amounts of competence and resources being put into pandemic prevention?
My impression is not really.
I mean, I also expect an AI accident that kills a similar number of people to be pretty unlikely. But see https://www.lesswrong.com/posts/LNKh22Crr5ujT85YM/after-critical-event-w-happens-they-still-won-t-believe-you
I wonder how much COVID actually got people to switch to working on biorisk.
What I'm interested in here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those would be useful for making the case to them.
I think asking about specific capabilities would also be interesting. Or which specific capabilities they would have named in 2012. Then asking how long they expect between that capability appearing and an x-catastrophe.