Does anyone know of a convincing definition of 'intent' in LLMs (or a way to identify it)? In model-organisms-style work, I find it hard to 'incriminate' LLMs. Even though the LLM's output stays the same regardless of 'intent', I think the distinction may matter because 'intentionally lying' and 'stochastic parroting' should scale differently with overall capabilities.
I find this hard for several reasons, but I'm highly uncertain whether these difficulties are fundamental limitations:
I've been spending some time thinking about this in the context of deception and types of deception. These aren't an answer, but I find these posts helpful:
https://www.lesswrong.com/posts/zjGh93nzTTMkHL2uY/the-intentional-stance-llms-edition
https://www.alignmentforum.org/posts/YXNeA3RyRrrRWS37A/a-problem-to-solve-before-building-a-deception-detector