Seth Ahrenbach

Comments

Thanks for your feedback. Yeah, people really seem to dislike it.

Thanks. Can you link me to where you wrote similar things?

I suppose if the goal is terminal, then it would override self-preservation, so the risk would be due to the AGI accidentally killing us all, or some other corner case of alignment optimization gone bad, e.g. dopamine-laced clouds that put us in a stupor. Perhaps I need to assume alignment has not been solved, too. Thanks.

Edit: I believe, without a full accounting, that unsolved alignment would allow the subgoal to persist. Given common knowledge that the AGI could destroy humanity either through accident or imperfect alignment, and given the goal of self-preservation (although not terminal), I think we still get a subgoal of destroying humanity, because the competitive structure exists. I think that with a richer action set, e.g. "kill the humans who would kill me (in secret) and satisfy the desires of the rest", a bad equilibrium still results, and our best move right now is to not build it.

You are correct that my argument would be stronger if I could prove that the NE I identified is the only one.
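As a toy illustration of why uniqueness matters (the actions and payoff numbers below are my own placeholders, not part of the original argument), a brute-force check over pure-strategy profiles shows how a game of this general shape can have more than one Nash equilibrium:

```python
from itertools import product

# Hypothetical 2x2 game between Humanity and AGI. Actions and payoffs
# are illustrative assumptions only: mutual coexistence is best for
# both, mutual destruction is bad, being preempted is worst.
ACTIONS = ["attack", "coexist"]
PAYOFFS = {
    ("attack",  "attack"):  (-10, -10),  # mutual destruction
    ("attack",  "coexist"): (  1, -20),  # humanity preempts
    ("coexist", "attack"):  (-20,   1),  # AGI preempts
    ("coexist", "coexist"): ( 10,  10),  # peaceful coexistence
}

def pure_nash_equilibria(payoffs, actions):
    """Return every pure-strategy profile where neither player
    can gain by unilaterally deviating."""
    eqs = []
    for h, a in product(actions, repeat=2):
        uh, ua = payoffs[(h, a)]
        h_best = all(payoffs[(h2, a)][0] <= uh for h2 in actions)
        a_best = all(payoffs[(h, a2)][1] <= ua for a2 in actions)
        if h_best and a_best:
            eqs.append((h, a))
    return eqs

print(pure_nash_equilibria(PAYOFFS, ACTIONS))
# → [('attack', 'attack'), ('coexist', 'coexist')]
```

With these particular payoffs the game is an assurance game: mutual attack and mutual coexistence are both equilibria, which is exactly why showing that the bad equilibrium is the *only* one takes extra work.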

I do not think it is reasonable to expect that an AGI would fail to obtain strategic advantage if it sought one, unless we pre-built in MAD-style assurances. But perhaps under my assumptions a stable "no one manages to destroy the other" outcome results. To rule that out I would need to do more work bringing in assumptions about the AGI becoming vastly more powerful and definitely winning. I think that is the case, but maybe I should make it more clear.

Similarly, if we can achieve provable alignment, rather than probabilistic alignment, then the game simply does not arise. Because of that provable alignment, the AGI would never be in a position to protect its own existence at the expense of ours.

In each case I think you are changing the game, which is something we can and, I think, should do; but barring some actual work to do that, I think we are left with the game as I have described it, maybe without sufficient technical detail.

I totally agree with your diagnosis of how some smart people sometimes misuse game theory. And I agree that that's the loss condition.

The missing piece is mutually assured destruction. Given that we did not play the Nash equilibrium as von Neumann suggested, the next best thing was MAD and the various counterproliferation treaties, which happened to work okay for humans. With an AGI counterparty, we can hope to build in a MAD-like assurance, but it will be a lot more challenging. The equilibrium move right now is to not build AGI.
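A rough sketch of the MAD point, again with made-up payoff numbers that are my own assumptions: assured retaliation collapses the payoff of a first strike, which restores mutual coexistence as an equilibrium, though notably the bad equilibrium can still persist alongside it.

```python
from itertools import product

ACTIONS = ["attack", "coexist"]

def pure_nash_equilibria(payoffs):
    """Enumerate pure-strategy Nash equilibria of a 2x2 game."""
    eqs = []
    for h, a in product(ACTIONS, repeat=2):
        uh, ua = payoffs[(h, a)]
        if all(payoffs[(h2, a)][0] <= uh for h2 in ACTIONS) and \
           all(payoffs[(h, a2)][1] <= ua for a2 in ACTIONS):
            eqs.append((h, a))
    return eqs

# Without MAD: a successful first strike pays better than coexisting,
# so attack is the dominant move for both sides (toy numbers).
no_mad = {
    ("attack",  "attack"):  (-10, -10),
    ("attack",  "coexist"): ( 15, -20),
    ("coexist", "attack"):  (-20,  15),
    ("coexist", "coexist"): ( 10,  10),
}

# With MAD: any attack triggers assured retaliation, so the striker's
# payoff collapses to the mutual-destruction level.
mad = dict(no_mad)
mad[("attack",  "coexist")] = (-10, -20)
mad[("coexist", "attack")]  = (-20, -10)

print(pure_nash_equilibria(no_mad))
# → [('attack', 'attack')]
print(pure_nash_equilibria(mad))
# → [('attack', 'attack'), ('coexist', 'coexist')]
```

Under these assumptions MAD does not eliminate the mutual-destruction equilibrium; it only makes coexistence a stable option, which is why the assurance has to be credible and common knowledge.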

Correct. Are you intending for this to be a reductio ad absurdum?

I'm not sure how high the state of epistemic uncertainty needs to be, but you are correct that there is epistemic uncertainty for all parties. Given a probabilistic action filter, it is uncertain whether any particular action will entail the destruction of humanity, and this is common knowledge. I am not the first or only one to propose epistemic uncertainty on the part of the AI with respect to goals and actions. See Stuart Russell: https://arxiv.org/abs/2106.10394