You're right the conclusion is quite underspecified - how exactly do we build such a cooperation machine?
I don't know yet, but my bet is more on engineering, product design, and infrastructure than on social science. More like building a better Reddit or Uber (or supporting infrastructure layers like WWW and the Internet) than like writing papers.
would to love to see this idea worked out a little more!
I like the "guardian" framing a lot! Besides the direct impact on human flourishing, I think a substantial fraction of x-risk comes from the deployment of superhumanly persuasive AI systems. It seems increasingly urgent that we deploy some kind of guardian technology that at least monitors, and ideally protects, against such superhuman persuaders.
Symbiosis is ubiquitous in the natural world, and is a good example of cooperation across what we normally would consider entity boundaries.
When I say the world selects for "cooperation" I mean it selects for entities that try to engage in positive-sum interactions with other entities, in contrast to entities that try to win zero-sum conflicts (power-seeking).
Agreed with the complicity point - as evo-sim experiments like Axelrod's showed us, selecting for cooperation requires entities that can punish defectors, a condition the world of "hammers" fails to satisfy.
Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors are controlling 90% of AI-relevant compute then it seems plausible that they could defend against 10% of the compute being controlled by misaligned AGI or other bad actors - by denying them resources, by hardening core infrastructure, via MAD, etc.
I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:
You might find AI Safety Endgame Stories helpful - I wrote it last week to try to answer this exact question, covering a broad array of (mostly non-pivotal-act) success stories from technical and non-technical interventions.
Nate's "how various plans miss the hard bits of the alignment challenge" might also be helpful as it communicates the "dynamics of doom" that success stories have to fight against.
One thing I would love is to have a categorization of safety stories by claims about the world. E.g what does successful intervention look like in worlds where one or more of the following claims hold:
These all seem like plausible worlds to me, and it would be great if we had more clarity about what worlds different interventions are optimizing for. Ideally we should have bets across all the plausible worlds in which intervention is tractable, and I think that's currently far from being true.
I don't mean to suggest "just supporting the companies" is a good strategy, but there are promising non-power-seeking strategies like "improve collaboration between the leading AI labs" that I think are worth biasing towards.
Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn't delay AI system deployment by at least a few months, and probably more like several years, before capitalist incentives in the form of shareholder lawsuits or new entrants that poach their key technical staff have a chance to materialize.
Interesting, I haven't seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive. An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.