Time horizon is clearly an important dimension of difficulty for a given agent (in this case a human). There is a real pattern in how AI today struggles at long-horizon tasks that humans can do, while excelling at tasks that take humans less time (from problem statement to solution). I think you acknowledge as much at the end of your post. Here I'll try to make a stronger case for studying time horizons and extrapolating to AI performance.
Following the Alberta Plan, we can model intelligence as an online learning agent that continually senses, maintains a learned world model, plans/searches within it, acts, and learns from observation–action–reward at every time step. See Sutton, Bowling, & Pilarski, The Alberta Plan for AI Research, arXiv:2208.11173v3, Mar 21, 2023, and Sutton's talks on the Oak architecture.
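To make the loop concrete, here is a toy sketch of that sense → plan → act → learn cycle. Everything in it (the two-action environment, the reward numbers, the update rule) is invented for illustration; it is not the Alberta Plan or Oak reference design, just the minimal shape of an online agent that improves its model with every step of feedback.

```python
import random

def run_agent(steps, seed=0):
    """Toy online agent: plan from a learned model, act, learn from reward."""
    rng = random.Random(seed)
    true_reward = {"a": 0.2, "b": 0.8}   # hidden environment (unknown to agent)
    model = {"a": 0.0, "b": 0.0}         # agent's learned value/world model
    total = 0.0
    for _ in range(steps):
        # plan/search: use the learned model to pick an action,
        # with occasional exploration so all actions get modeled
        if rng.random() < 0.1:
            action = rng.choice(list(model))
        else:
            action = max(model, key=model.get)
        # act and sense the (noisy) outcome
        reward = true_reward[action] + rng.gauss(0, 0.1)
        total += reward
        # learn: nudge the model toward what was observed
        model[action] += 0.1 * (reward - model[action])
    return model, total

model, total = run_agent(500)
print(model)  # with enough steps, the model tracks the true rewards
```

The point of the sketch is the dependence on step count: the model only becomes accurate, and the policy only becomes good, through repeated cycles of acting and observing.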
Over many steps you get more feedback, which lets you learn better policies and world models, and you can also run search longer, both within your world model and in real life. There are other variables, of course, such as how effectively the policy learns from feedback and how good the policy is to begin with. But if you assume that search and world-model building are needed, which they almost certainly are, then you do both more effectively given more time. Humans obviously work this way: our productive output scales with time. You can usually score higher on a difficult test given 3 days rather than 3 hours. The same is true of a mathematician discovering a theorem; it usually takes an expert mathematician a minimum amount of time working on a hard problem before solving it. More time leads to more output, and harder problems require more time, though the relationship is asymptotic, of course.
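One hypothetical way to picture that asymptotic time/output relationship (my toy model, not anything from the post): if each unit of search time independently solves the problem with some small probability p, then the chance of success after t units is 1 − (1 − p)^t, which climbs steeply at first and then flattens toward 1.

```python
def p_solved(p, t):
    """Chance of success after t independent search attempts, each with prob p."""
    return 1 - (1 - p) ** t

for t in (1, 10, 100, 1000):
    print(t, round(p_solved(0.01, t), 3))
# more time always helps, but with diminishing returns
```

Under this toy model, going from 3 hours to 3 days can move you from "unlikely to solve" to "likely to solve", while further time buys progressively less, matching the asymptotic shape described above.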
Now assume mechanical complexity (typing code, reading, making trivial calculations) is a small factor in the time budget of the METR tasks. That is, assume the measured time is roughly how long it takes an agent at human-level intelligence to run search and build models good enough to solve the task.
Crucially, note that longer-time-horizon tasks require exponentially more search and model-building steps in the naive case, due to combinatorial explosion. So the job of intelligence in long-horizon tasks is to contain this explosion effectively, which is exactly what a human doing the task would be doing.
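A back-of-envelope sketch of that explosion (illustrative numbers only): with branching factor b, exhaustive search over a d-step horizon visits b^d action sequences, while a beam-search-style strategy that keeps only the k most promising branches per step visits on the order of k·b·d. Containing the exponential is the whole game.

```python
def naive_cost(b, d):
    """Exhaustive search: every length-d action sequence."""
    return b ** d

def pruned_cost(b, d, k):
    """Beam-search-style bound: expand k kept candidates by b actions, d times."""
    return k * b * d

for d in (5, 10, 20):
    print(d, naive_cost(4, d), pruned_cost(4, d, k=8))
```

With b = 4 the naive cost passes a trillion by d = 20, while the pruned cost stays in the hundreds; the catch is that pruning well requires a good world model and good candidate generation, which is the capability the time-horizon framing is trying to measure.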
Planning and search are likely necessary for AGI, and a time-horizon benchmark like METR's would capture how well an LLM can execute these aspects: the model attempts a task that took a human X amount of search and world-model building, where the human time horizon is a proxy for the span of search and world-model complexity in the case of "non-programmable" tasks, assuming human and LLM intelligences operate according to the same principles outlined in the Oak architecture.
I think some of your examples, such as factoring pi and shoveling, are less difficult cognitively despite the long time horizon, as they involve repetitive operations and are qualitatively different from the kind of tasks required for AI or scientific progress. The same goes for an arctic expedition, though there is some overlap in maintaining coherence of sub-plans over long horizons. Your convincing-a-person example, though, compares well with the kind of long-horizon difficulty we're trying to capture: you'd have to try different approaches, learn from your experience, build models of the person's mind, update those models, and so on. If you can't do this over many steps, i.e. you fall out of the context window or can't generate good sub-plans that align with the higher goal, the task is simply impossible.
Obviously the other factors are (a) online learning during the task, which may be required but won't be adequately captured by the time aspect of the benchmark, and (b) one-shot intelligence at the start of the task, i.e. how good your guesses for the next possible steps are.
The big unknown is how this interacts with the other variables. It's possible that if your policy is already superhuman at generating candidate steps, you need a lot less time to search. But I'd argue this capability is somewhat orthogonal to the ability to "effectively run search by creating and trying different candidate options and build world models across long horizons while maintaining coherence with the overall goal", though both will be needed ultimately.