corey morris — LessWrong

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

In examining the low performance of Large Language Models (LLMs) on the Moral Scenarios task, part of the widely used MMLU benchmark by Hendrycks et al., we found surprising results. When the models were presented with each moral scenario individually, accuracy was 37% better than on the original dual-scenario questions. This outcome indicates that the...

Sep 27, 2023 • 18