Does anyone know of a convincing definition of 'intent' in LLMs (or a way to identify it)? In model-organisms-style work, I find it hard to 'incriminate' LLMs. Even though the LLM's output stays the same regardless of 'intent', I think the distinction may matter because 'intentionally lying' and 'stochastic parroting' should scale differently with overall capabilities.
I find this hard for several reasons, but I'm highly uncertain whether these difficulties are fundamental limitations:
I've been spending some time thinking about this in the context of deception and types of deception. These aren't an answer, but I find these posts helpful:
https://www.lesswrong.com/posts/zjGh93nzTTMkHL2uY/the-intentional-stance-llms-edition
https://www.alignmentforum.org/posts/YXNeA3RyRrrRWS37A/a-problem-to-solve-before-building-a-deception-detector