MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures
In examining the low performance of large language models (LLMs) on the Moral Scenarios task, part of the widely used MMLU benchmark by Hendrycks et al., we found surprising results. When moral scenarios are presented individually, accuracy is 37% better than with the original dual-scenario questions. This outcome indicates that the...
Sep 27, 2023