Survey of Multi-agent LLM Evaluations — LessWrong