Epistemic status: This idea emerged during my participation in the MATS program this summer. While I intended to develop it further and conduct more rigorous analysis, time constraints led me to publish this initial version (30-60 minutes of work). I'm sharing it now in case others find it valuable or spot important flaws I've missed. Very open to unfiltered criticism and suggestions for improvement.
When analyzing AI systems, we often focus on their ability to perform specific tasks. However, each task can be broken down into three fundamental components: knowledge, physical capabilities, and cognitive capabilities. This decomposition suggests a novel approach to analyzing AI risks.
Let's examine why cognitive capabilities deserve special attention:
However, we face a significant challenge: for any given task, especially dangerous ones, it's difficult to determine which cognitive capabilities are strictly necessary for its completion. We don't want to wait until an AI system can actually perform dangerous tasks before we understand which cognitive capabilities enabled them.
Instead of working backwards from observed dangerous behaviors, we can approach this systematically by mapping the relationship between cognitive capabilities and risks:
This approach has several advantages:
There are two potential approaches to this analysis: a Risks-First approach, which starts from risks we already worry about and works out which capabilities they would require, and a Capabilities-First approach, which starts from a set of cognitive capabilities and asks what risks they could enable.
The Capabilities-First approach is generally superior because it reduces confirmation bias. Instead of trying to justify our preexisting beliefs about what capabilities might lead to specific risks, we can think like red teamers: "Given this set of cognitive capabilities, what risks could they enable?"
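To make the red-teaming framing concrete, here is a minimal sketch of how a capabilities-first analysis could be represented, assuming we can write down, for each risk, a set of cognitive capabilities that would jointly be required to realize it. Every capability name, risk name, and requirement set below is a hypothetical placeholder, not an empirical claim about which capabilities actually enable which risks.

```python
# A minimal sketch of the capabilities-first framing. All names and requirement
# sets are hypothetical placeholders used purely for illustration.

RISK_REQUIREMENTS: dict[str, set[str]] = {
    "targeted_manipulation": {"theory_of_mind", "long_horizon_planning"},
    "autonomous_replication": {"code_synthesis", "long_horizon_planning"},
    "cyber_offense": {"code_synthesis"},
}

def risks_enabled_by(capabilities: set[str]) -> set[str]:
    """The red-teamer question: given this capability set, which risks become reachable?"""
    return {
        risk
        for risk, required in RISK_REQUIREMENTS.items()
        if required <= capabilities  # every required capability is present
    }

# Example: a system with planning and code synthesis, but no theory of mind.
print(risks_enabled_by({"long_horizon_planning", "code_synthesis"}))
# e.g. {'autonomous_replication', 'cyber_offense'} (set printing order may vary)
```

Even a toy model like this makes the direction of the question explicit: we enumerate forward from capabilities to risks rather than rationalizing backwards from the risks we already worry about.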
To make this analysis tractable, we could:
If the analysis proves intractable even with these approaches, that finding itself would be valuable - it would demonstrate the inherent complexity of the problem space.
This framework enables several practical applications:
The immediate challenge is prioritization. While a complete analysis of all possible combinations of cognitive capabilities and risks would be ideal, we can start with:
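As a purely illustrative aside, here is one way such prioritization could be operationalized: score each (capability, risk) pair and analyze the highest-scoring pairs first. The scoring heuristic, the capability and risk names, and all numeric values below are hypothetical.

```python
from itertools import product

# Hypothetical inputs: how close current systems seem to each capability (0-1),
# and a rough severity rating for each risk (0-1). All values are placeholders.
capability_proximity = {"long_horizon_planning": 0.6, "theory_of_mind": 0.4}
risk_severity = {"targeted_manipulation": 0.7, "autonomous_replication": 0.9}

def priority(proximity: float, severity: float) -> float:
    """Toy heuristic: prioritize severe risks enabled by near-term capabilities."""
    return proximity * severity

# Rank every (capability, risk) pair by the heuristic, highest first.
ranked = sorted(
    (
        (cap, risk, priority(prox, sev))
        for (cap, prox), (risk, sev) in product(
            capability_proximity.items(), risk_severity.items()
        )
    ),
    key=lambda triple: triple[2],
    reverse=True,
)

for cap, risk, score in ranked:
    print(f"{cap} x {risk}: {score:.2f}")
```

A richer version could weight pairs by additional factors (e.g. how hard the capability is to evaluate, or how irreversible the risk is), but the basic idea is simply to impose an explicit ordering on which capability-risk combinations get analyzed first.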
This framework provides a structured way to think about AI risk assessment and monitoring, moving us beyond task-based analysis to a more fundamental understanding of how cognitive capabilities combine to enable potential risks.
Acknowledgments: I would like to thank Quentin Feuillade-Montixi, Ben Smith, Pierre Peigné, Nicolas Miailhe, JP and others for the fascinating discussions that helped shape this idea during the MATS program. While they contributed through valuable conversations, they were not involved in writing this post, and any mistakes or questionable ideas are entirely my own responsibility.