A very interesting article indeed! Besides the potential implications for AI safety and explainability, this quantitative comparison of how quickly model performance is improving with and without CoT can be quite helpful in real-world scenarios where the cost of CoT, in terms of latency and/or token usage, is a real limitation.
I'm not sure whether it is possible to answer this question at the moment, but I would be interested in any insight on whether this trend comes from a fundamental property of modern LLMs (and is thus likely to continue for the foreseeable future) or whether it is primarily due to a current focus on CoT at frontier labs.