Can We Trust the Judge? A Novel Method for Modelling Human Bias and Systematic Error in Debate-Based Scalable Oversight
Author Note: This essay was developed during a recent job application (which I didn't get). In hindsight, I suspect I missed the mark in how I framed its relevance to alignment, so I'm posting it here both to share the idea and to invite critical feedback on...
I think we're circling the same confusion: why would an AI 'want' to destroy us in the first place, and why is that treated as the default scenario? If we frame this in terms of hypothesis testing, where we begin from a null hypothesis and reject it only when there is strong evidence for the alternative, then the null could just as well be that AI will pursue the success of the human species, with cooperation and the preservation of humanity being the more adaptive strategy.
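To make the analogy concrete, here is a purely illustrative formalisation; the hypothesis labels, the evidence variable $E$, and the threshold $k$ below are my own notation, not anything from the standard x-risk arguments. Write the two candidate defaults as

$$H_0:\ \text{AI defaults to cooperation with humans}, \qquad H_1:\ \text{AI defaults to catastrophic power-seeking},$$

and reject $H_0$ only if the observed evidence $E$ favours the alternative strongly enough, say when the likelihood ratio

$$\frac{P(E \mid H_1)}{P(E \mid H_0)} > k$$

for some demanding threshold $k$. On this framing, the substantive disagreement is about which hypothesis deserves the $H_0$ slot, i.e. where the burden of proof lies: the same evidence can clear the bar under one choice of null and fall short under the other.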
If I understand the instrumental convergence argument correctly, power-seeking is a strong attractor, and humans might be in the way of AIs obtaining power. But why would an AI 'wanting' power result in x-risk or the destruction of humanity? Beyond the difficulty of aligning AI exactly to our values, what justifies treating catastrophic outcomes as the default rather than cooperative ones?