Comparing Forecasting Track Records for AI Benchmarking and Beyond — LessWrong