With Respect
Given that in more than a third of the cases where GPT and the answer set disagreed you thought GPT was right and the answer set was wrong, did you check for cases where GPT and the answer set agreed on an answer you thought was wrong?
Yours Sincerely
Astral Codex Ten: https://astralcodexten.substack.com/p/your-book-review-why-machines-will
"Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing)."
In a "heatplot" or set of plots, cf. https://www.elsblog.org/the_empirical_legal_studi/2023/05/heatplots-for-correlation-coefficients-graphs.html
You could also study the distribution of correlation strengths found over the range of correlations tested and, if possible, see how it compares to what would be expected by chance.
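One way the comparison against chance might be sketched: compute every pairwise correlation, then shuffle each column independently to build a null distribution and see how many observed correlations exceed its 95th percentile. The data here is synthetic and purely illustrative, and the helper names (`pairwise_abs_correlations`, `shuffled_null`, `quantile`) are made up for this sketch, not taken from any of the linked posts.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def pairwise_abs_correlations(columns):
    """Absolute correlation for every unordered pair of columns."""
    n = len(columns)
    return [
        abs(pearson(columns[i], columns[j]))
        for i in range(n)
        for j in range(i + 1, n)
    ]


def shuffled_null(columns, n_rounds, rng):
    """Null distribution: shuffle each column independently, then correlate."""
    null = []
    for _ in range(n_rounds):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        null.extend(pairwise_abs_correlations(shuffled))
    return null


def quantile(values, q):
    """Simple empirical quantile (no interpolation)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


rng = random.Random(0)
n_rows = 200

# Synthetic stand-in data (an assumption): five unrelated variables
# plus one deliberately related pair.
base = [rng.gauss(0, 1) for _ in range(n_rows)]
columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(5)]
columns.append(base)
columns.append([b + rng.gauss(0, 0.5) for b in base])

observed = pairwise_abs_correlations(columns)
null = shuffled_null(columns, n_rounds=50, rng=rng)

threshold = quantile(null, 0.95)
share_above = sum(r > threshold for r in observed) / len(observed)

print(f"95th percentile of |r| under shuffling: {threshold:.3f}")
print(f"share of observed |r| above it: {share_above:.2%}")
```

If roughly 5% of the observed correlations clear the threshold, the set as a whole looks like chance; a much larger share suggests some real structure.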
https://www.lesswrong.com/posts/QzkTfj4HGpLEdNjXX/an-artificially-structured-argument-for-expecting-agi-ruin
skeptical reaction with one expression of support: https://statmodeling.stat.columbia.edu/2023/05/31/jurassic-ai-extinction/
https://statmodeling.stat.columbia.edu/2015/12/16/lacour-and-green-1-this-american-life-0/
and generally "beware the one of just one study"
Is a martingale different from conservation of expected evidence?
https://www.lesswrong.com/posts/jiBFC7DcCrZjGmZnJ/conservation-of-expected-evidence
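On one reading, the two are closely related: conservation of expected evidence says the probability-weighted average of the possible posteriors equals the prior, which is the one-step form of the martingale property applied to a sequence of Bayesian posteriors. A minimal numerical check, with made-up numbers and a hypothetical helper `expected_posterior`:

```python
from fractions import Fraction


def expected_posterior(prior, p_e_given_h, p_e_given_not_h):
    """Average the posterior P(H | outcome) over both outcomes of evidence E."""
    # Total probability of observing E.
    p_e = prior * p_e_given_h + (1 - prior) * p_e_given_not_h
    # Bayes' rule for each possible outcome.
    post_if_e = prior * p_e_given_h / p_e
    post_if_not_e = prior * (1 - p_e_given_h) / (1 - p_e)
    # Weight each posterior by how likely that outcome is.
    return p_e * post_if_e + (1 - p_e) * post_if_not_e


# Illustrative values only (assumptions, not from the linked post).
prior = Fraction(3, 10)
result = expected_posterior(prior, Fraction(4, 5), Fraction(1, 5))

print(result == prior)
```

Exact fractions are used so the equality holds without floating-point rounding.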