Some meandering thoughts on alignment
A nearcast of how we might go about solving alignment using basic current techniques, assuming little/no substantive government intervention:
1. During the beginning of takeoff, we do control, attempting to prevent catastrophic actions (e.g. major rogue internal deployments (RIDs)) while trying to elicit huge amounts of AI labour. (A minimal sketch of one such control loop follows this list.)
2. At some point, the value of scaling AI capabilities while trying to maintain control will be very low, because the main bottleneck on eliciting useful AI labour is human oversight: if you are pausing the AI all the time to wait for humans to understand what the AIs are doing, making the AI smarter doesn't help. At this point AI companies pause/slow down as much as possible (e.g. unilaterally or via coordinating with other companies), though not for very long. During the pause, they try to build an AI system they would trust to manage the intelligence explosion.
3. Handoff to the AI system, i.e. allow the AIs to manage the training run, conduct large, open-ended experiments, etc., that humans cannot effectively oversee. The AIs should be corrigible and attempt to ensure that their successors are corrigible, and should still check in with humans as much as possible post-handoff, keeping them informed about the situation. At this point, it's correct to hand off to the AIs because the alternative -- scaling to vastly superhuman AI without letting the roughly human-level / slightly superhuman AIs do a huge amount of research -- is even more dangerous.
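For concreteness, here is a minimal sketch of the kind of control loop step 1 describes and the oversight bottleneck step 2 runs into: an untrusted model proposes actions, a weaker trusted model flags suspicious ones, and flagged actions block on scarce human audit. Everything here is illustrative; `untrusted_model`, `trusted_monitor`, `human_audit`, and the threshold are assumed stand-ins, not any real system's API.

```python
import random

# Hypothetical interfaces -- illustrative stand-ins, not a real API:
#   untrusted_model(task)   -> proposed action (string)
#   trusted_monitor(action) -> suspiciousness score in [0, 1]
#   human_audit(action)     -> True if a human approves the action

def untrusted_model(task: str) -> str:
    return f"action for {task}"

def trusted_monitor(action: str) -> float:
    return random.random()  # stand-in for a weaker trusted model's score

def human_audit(action: str) -> bool:
    return True  # stand-in for slow, scarce human review

AUDIT_THRESHOLD = 0.98  # audit only the most suspicious ~2% of actions

def control_loop(tasks: list[str]) -> list[str]:
    """Run untrusted AI labour under trusted monitoring.

    Most actions execute immediately; actions the trusted monitor flags
    block on human audit. As tasks get harder and flag rates rise,
    throughput is capped by human review capacity -- the oversight
    bottleneck from step 2, which smarter models don't relieve.
    """
    executed = []
    for task in tasks:
        action = untrusted_model(task)
        if trusted_monitor(action) >= AUDIT_THRESHOLD:
            if not human_audit(action):  # blocks on human time
                continue  # rejected: drop (or resample) the action
        executed.append(action)
    return executed
```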
My best guess (overly detailed story) for how this plan goes if you try to implement it in the real world is:
* High-stakes control basically works (during steps 1-2). We prevent AIs from doing major RIDs or substantially subverting safety/human oversight.
* Elicitation will totally fail on open-ended, hard-to-check tasks (like preparadigmatic agent foundations research, etc.). We'll get some useful work on the more prosaic agendas, e.g. misalignment an