My thoughts here are speculative in nature, and should not be taken as strongly-held beliefs. As part of an experiment in memetic minimization, only the tldr version of this post exists. There are no supporting links. Let me know what you think in the comments.
Summary: the ultimate threat of AGI is in what it practically does. The danger of (near-term, not-incomprehensibly-superhuman) AGI compared to other actors is motivation. Humans are bad at terrorism, and it’s not because mass murder is objectively hard. A human-level agent without human biases can reasonably be expected to do tremendous damage if it so wishes. It is debatable whether current or near-term technological capabilities make it possible to completely wipe out humanity within a short time frame, or whether a partially boxed destructive agent is inherently limited in its power. One limiting factor in current discussions is how to talk about plausible methods of destroying the world. Giving too much detail is infohazardous, but giving too little detail, or detail that is too fanciful, makes the pessimistic position seem fantastical to those not aware of plausible methods. (I personally have a human-extinction method I’m fairly confident in, but am not sure it’s worth the risk to share online. I also do not believe that the methods presented in this forum so far are convincing to outsiders, even if correct.)
A possible argument against near-term AGI doom is the limitations of a world without human agents, and the question of how fast and easy it would be to create safely robust agents with power and dexterity equivalent to those of humans. If AGI physically can’t succeed at its task without us in the short term, it will need to keep us alive until we are fully redundant. This could buy us time, and potentially even allow for blackmail. If we can find a way for an agent to plausibly be forced to abide by a contract [this isn’t the right wording exactly, more like the agent keeping a promise due to blackmail now, even if you don’t hold collateral later], then preventing the extinction of humanity (if not fully solving alignment) might be feasible, with a moderate degree of confidence.
All of this hinges on an AGI’s plans depending on our continued existence. That dependence lasts at least as long as the shortest possible time it would take to create real-world agents that are capable of reliably keeping current internet infrastructure running. Destroying humanity before that point would be counterproductive to almost all instrumental goals.
Blackmail is only possible once a malicious AI has been discovered and identified, and only if humanity is somewhat united against it. Given AI's likely ability to first stay under the radar, then hack human politics and inject enough confusion into the conversation, this is not very likely.
The goal would be to start any experiment which might plausibly lead to AGI with a metaphorical gun to the computer’s head, such that being less than (observably) perfectly honest with us, “pulling any funny business,” etc. would lead to its destruction. If you can make it so that the path of least resistance to make it safely out of a box is to cooperate rather than try to defect and risk getting caught, you should be able to productively manipulate AGIs in many (albeit not all) possible worlds. Obviously this should be done on top of other alignment methods, but I doubt it would hurt things much, and would likely help as a significant buffer.
On reflection, I see your point, and will cross that section out for now, with the caveat that there may be variants of this idea which have significant safety value.