I’ve been spending some time thinking about this in the context of deception and types of deception. These posts are not an answer, but I find them helpful: https://www.lesswrong.com/posts/zjGh93nzTTMkHL2uY/the-intentional-stance-llms-edition https://www.alignmentforum.org/posts/YXNeA3RyRrrRWS37A/a-problem-to-solve-before-building-a-deception-detector