Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities — LessWrong