Why does METR score o3 as effective for such a long time duration despite overall poor scores? — LessWrong