Whose problem is it, really?

Let’s imagine that AGI safety research has been remarkably successful: researchers have (1) correctly identified the most plausible learning architectures that AGI will exhibit, (2) accurately anticipated the existential risks associated with these architectures, and (3) crafted algorithmic control proposals that provably mitigate these risks. Would it finally be time to pack up, go home, and congratulate ourselves for saving the species from an extinction-level event? Not if there isn’t a plan in place for making sure these control proposals actually end up getting implemented!

Some safety researchers might protest: “but this is not our job! Our goal is to figure out technical solutions to the AGI safety control problem. If we do this successfully, we’re done—let’s leave the implementation details to the AI governance people.” I am definitely sympathetic to this sort of thought, but I still think it isn’t quite right. 

Throughout the hierarchical framework put forward in this sequence, I have worked to demonstrate why any end-to-end AGI safety research agenda that focuses solely on the AGI and not at all on the humans who are supervising it, assigning it goals, tweaking its internal dynamics, etc. is almost definitely going to fail. I see this final question in the hierarchical framework—how are we going to get AI engineers to use the correct control mechanisms?—as arguably the most important extension of this principle. Successful control proposals will only mitigate AGI-induced existential risks—the putative goal of AGI safety research—if they are actually used, which means that it is incumbent on AGI safety researchers to devote serious effort and attention to the problem of implementing their best safety proposals.

It is important to get clear about the nature of the implementation problem. Steve Byrnes frames it well: “The AI community consists of many thousands of skilled researchers scattered across the globe. They disagree with each other about practically everything. There is no oversight on what they’re doing. Some of them work at secret military labs.” The implementation problem is hard because it requires (at least) coordinating behavior across all of these actors. It is also conceivable that a ‘lone wolf’ (i.e., some independent programmer) could succeed in building an AGI, and we would definitely want to make sure that our implementation proposals account for this contingency.

A natural first thought here would be to simply frame this as a legal problem—i.e., to pass laws internationally that make it illegal to operate or supervise an AGI that is not properly equipped with the relevant control mechanisms. I think this proposal is necessary but insufficient. The biggest problem with it is that it would be extremely hard to enforce. Presuming that AGI is ultimately instantiated as a fancy program written in machine code, actually ensuring that no individual is running ‘unregulated’ code on their machine would require oversight measures draconian enough to render them highly logistically and politically fraught, especially in Western democracies. The second biggest problem with attempting to legislate the problem away is that, practically speaking, legislative bodies will not be able to keep up with the rapid pace of AGI development (especially if you are sympathetic to an ‘iteratively-accelerating’ timeline in Question 5). 

What such measures would do, however, is establish a social incentive of some strength to not engineer or preside over ‘unregulated’ AGI. All else being equal, most people prefer not to do illegal things, even when the chance of being caught is minimal. While I have strong doubts about the efficacy of framing the implementation problem as a legal one, I think this sort of framing, when considered most generally, is onto something important: namely, that the ‘control proposal implementation problem’ is ultimately one of incentive engineering.

The incentivization-facilitation trade-off

Surely the most relevant case study in incentive engineering for AGI safety is climate change advocacy. Over the past thirty years or so, the problem of human-caused disruptions in the earth’s climate patterns has gone from an esoteric, controversial, and poorly understood existential threat to one of the most politically and sociologically relevant, highly discussed problems of the day. Political bodies all over the world have passed sweeping and ambitious measures to reduce the amount of carbon emitted into the earth’s atmosphere (e.g., the Paris Agreement), and, by far the most surprising from the perspective of incentive engineering, many major corporations have pivoted from ‘climate-agnosticism’ to fairly aggressively self-regulating in an attempt to mitigate climate-change-related risks (e.g., voluntarily signing onto The Climate Pledge, which commits corporations to achieving net-zero carbon a full decade ahead of the Paris Agreement’s target).

At the risk of sounding slightly too idealistic, it seems that climate change is now a problem that basically all major societal players are willing to take seriously and individually contribute to mitigating. From a game-theoretical perspective, this is quite a counterintuitive equilibrium! While fully understanding the dynamics of this shift would probably take a (totally-not-AGI-safety-related) sequence of its own, I think one fundamental takeaway here is that getting decentralized entities to all cooperate towards a common goal orthogonal to, or sometimes outright conflicting with, their primary interests (i.e., maintaining political power, maximizing their bottom line, etc.) requires a full-scale awareness campaign that generates a sufficient amount of political, social, financial, etc. pressure to counterbalance these primary interests. The best way to align a firm like Amazon with a goal like becoming carbon neutral is (A) to find strategies for doing so that are minimally costly (i.e., that minimally conflict with their primary interests), and (B) to engineer their financial, social, political, etc. incentives such that they become willing to implement these strategies in spite of the (hopefully minimized) cost.

Analogously, in order to ensure that the relevant players actually implement the best control proposals yielded by AGI research (and are ideally proud to do so), I think a very important first step is to make the prospect of AGI-related existential risks far more salient to far more people, experts and non-experts alike. Admittedly, the skills required to formulate technical control proposals to mitigate AGI-induced existential risk and those required to persuade vast swathes of people that AGI-induced existential risks are real, important, and impending are notably different. This does not mean, however, that AGI safety researchers should view this latter undertaking as someone else’s problem. 

If the goal of AGI safety research is to mitigate the associated existential risks and mitigating these risks ultimately relies on ensuring that the relevant actors actually implement the best control proposals AGI safety researchers are able to generate, then it necessarily follows that solving this implementation problem must be a part of an end-to-end AGI safety research agenda (even if this entails bringing other thinkers into the fold). Getting lots of people to genuinely care about AGI safety seems an important precondition for generating sufficient political, social, financial, etc. pressure, as referenced in (B).     

The last point, then, and probably the most relevant one for technical researchers, is that safety researchers should work to proactively develop control proposals that are maximally likely to actually get implemented (this is (A), above). The AGI safety control problem is already hard enough without this additional constraint, but the fact of the matter is that a control proposal mitigates risk only to the extent that AGI engineers are willing to use it. Evan Hubinger and others sometimes talk about this as the ‘competitiveness’ of a control proposal, i.e., how economically and computationally realistic it would be for those who have control over an AGI to enact it.

Consider again the analogy to climate change: if climate scientists developed a breakthrough technology (easily attachable to all smokestacks, exhaust pipes, etc.) that chemically converted greenhouse gases to some benign alternative, ‘going green’ probably would have been a far easier sell for major corporations—i.e., would have required far less pressure—than what is being recommended today (e.g., radically rethinking energy consumption patterns, etc.). Analogously, the more financially, computationally, etc. ‘painless’ the AGI safety control proposal, the more likely it will be to actually get implemented. 

I have represented this trade-off below:

Here, we see a more formal presentation of the same idea we have been thinking about throughout this section: the more radical the constraints imposed by the control proposal, the more necessary it will be to engineer the relevant actors' incentives to adopt it. Conversely, the more ‘painless’ the control proposal (the more the control proposal ‘facilitates’ status quo AGI), the less it will be necessary to do wide-scale incentive engineering. It is probably far harder to formulate and guarantee the efficacy of highly facilitatory control proposals, but they would require far less incentivization (i.e., we could just skip the societal-level awareness campaigns, political pressure, etc.). On the other hand, it is probably (relatively) easier to formulate and guarantee the efficacy of less facilitatory control proposals, but this would come at the cost of less willing actors, who would need to be pressured into adopting the control proposals, which would, in turn, require more intense, climate-change-esque incentive engineering.

To conclude this section, I will only underline that the goal of AGI safety research (i.e., mitigating AGI-induced existential risks) will not be achieved until the control proposals yielded by the best AGI safety research are actually running on the computers of the labs, firms, and individual actors who find themselves presiding over an AGI (as well as, for the human alignment control proposals, in the minds of these actors!). If for whatever reason these entities choose not to implement the best control proposals that are actually yielded by technical safety research, I think this would constitute a clear failure to actualize the mission established by the field. In spite of its not being strictly computational, the question of control proposal implementation must be taken as seriously as the preceding three in this hierarchical framework.


I like you writing about this: the policy problem is not mentioned often enough on this forum. Agree that it needs to be part of AGI safety research.

I have no deep insights to add, just a few high level remarks:

to pass laws internationally that make it illegal to operate or supervise an AGI that is not properly equipped with the relevant control mechanisms. I think this proposal is necessary but insufficient. The biggest problem with it is that it is totally unenforceable.

I feel that the 'totally unenforceable' meme is very dangerous - it is too often used as an excuse by people who are looking for reasons to stay out of the policy game. I also feel that your comments further down in the post in fact contradict this 'totally unenforceable'.

Presuming that AGI is ultimately instantiated as a fancy program written in machine code, actually ensuring that no individual is running ‘unregulated’ code on their machine would require oversight measures draconian enough to render them logistically and politically inconceivable, especially in Western democracies.

You mean, exactly like how the oversight measures against making unregulated copies of particular strings of bits, in order to protect the business model of the music industry and Hollywood, were politically inconceivable in the time period from the 1980s till now, especially in Western democracies? We can argue about how effective this oversight has been, but many things are politically conceivable.

My last high-level remark is that there is a lot of AI policy research, and some of it is also applicable to AGI and x-risk. However, it is very rare to see AI policy researchers post on this forum.

Thanks for your comment! I agree with both of your hesitations and I think I will make the relevant changes to the post: instead of 'totally unenforceable,' I'll say 'seems quite challenging to enforce.' I believe that this is true (and I hope that the broad takeaway from this post is basically the opposite of 'researchers need to stay out of the policy game,' so I'm not too concerned that I'd be incentivizing the wrong behavior). 

To your point, 'logistically and politically inconceivable' is probably similarly overblown.  I will change it to 'highly logistically and politically fraught.' You're right that the general failure of these policies shouldn't be equated with their inconceivability. (I am fairly confident that, if we were so inclined, we could go download a free copy of any movie or song we could dream of—I wouldn't consider this a case study of policy success—only of policy conceivability!).