METR’s preliminary evaluation of o3 and o4-mini — LessWrong