Saul Kripke defines a rigid designator as a term that refers to the same object in every world in which that object exists. The Morning Star (Phosphorus) and the Evening Star (Hesperus) both name the planet Venus, and the statement "Hesperus is Phosphorus" is a necessary a posteriori truth on Kripke's widely accepted account. Definite descriptions, by contrast, are typically non-rigid; for example, "The winner of the 2025 Nobel Peace Prize" is a definite description that designates non-rigidly. The assertion "The winner of the 2025 Nobel Peace Prize is María Corina Machado" is true, but it's not necessarily true. On a possible-worlds semantics, there are many worlds in which Machado exists but someone else won the award in 2025; that she won is only a contingent fact about the world we live in. Rigid designators tag an object in every world in which that object exists, whereas non-rigid designators can designate different objects across worlds. Turning now to the two questions motivating this post: Why think that questions can rigidly designate intentions? And what in the world does this have to do with aligning frontier models?
Why think that questions can rigidly designate intentions?
Imagine an organism with one intention: find food. Now suppose that organism--call it Fred--has to query its environment to achieve this goal. Fred must differentiate food from non-food, and there are basically two methods for doing so: a direct query and a proxy query.
Direct query: Fred queries the environment for food itself. Questions like "what things have nutritional value to Fred?" or "is this food?", when given full scope over the objects in Fred's environment and answered accurately, partition that environment into edible and non-edible for Fred.
Proxy query: Fred queries the environment for a proxy for food. For example, suppose that all and only food is colored red where Fred is, so the question "what things are red?", when answered accurately, partitions the environment into red and non-red categories, and by stipulation into edible and non-edible for Fred.
These two methods closely approximate rigid and non-rigid designation as a relationship that holds between queries and intentions. Direct queries function like rigid designators and proxy queries function like non-rigid designators insofar as they bear on the intention find food. In every world in which Fred has the goal of finding food, the direct query "what things have nutritional value to Fred?" bears on this goal in that a complete answer to the question is equivalent to satisfying the goal. By contrast, proxy queries fail in certain counterfactual environments because they track only world-contingent correlates of the satisfaction conditions for the intention. If we imagine a world in which all and only food is blue, the proxy query "what things are red?" no longer bears on Fred's intention to find food. Much like a non-rigid designator, a proxy query tracks a contingent fact about an intention in some environment, not the intention itself. Much like a rigid designator, a direct query acts as a context-partitioning tag for the intention itself.
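To make the contrast concrete, here is a minimal sketch (the environments, object names, and query functions are purely illustrative assumptions, not drawn from any existing formalism) in which a direct query and a proxy query induce the same partition of Fred's actual environment but come apart under counterfactual variation:

```python
from collections import defaultdict

def induced_partition(environment, query):
    """Group the objects in the environment by the answer the query gives for each."""
    cells = defaultdict(set)
    for name, obj in environment.items():
        cells[query(obj)].add(name)
    return dict(cells)

# Direct query: "what things have nutritional value to Fred?"
direct_query = lambda obj: obj["nutritious"]
# Proxy query: "what things are red?"
proxy_query = lambda obj: obj["color"] == "red"

# Actual world: all and only food is red, so both queries carve the environment the same way.
red_food_world = {
    "berry":  {"color": "red",  "nutritious": True},
    "pebble": {"color": "grey", "nutritious": False},
}
# Counterfactual world: all and only food is blue, and the proxy comes apart from the intention.
blue_food_world = {
    "berry":  {"color": "blue", "nutritious": True},
    "bead":   {"color": "red",  "nutritious": False},
}

print(induced_partition(red_food_world, direct_query) == induced_partition(red_food_world, proxy_query))    # True
print(induced_partition(blue_food_world, direct_query) == induced_partition(blue_food_world, proxy_query))  # False
```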
Question semantics differ from the semantics of declarative utterances. On the standard reading, declarative utterances (assertions) have propositional content. Propositions can be understood as functions from a set of possible worlds to truth valuations at each of those worlds. A proposition tokened by an assertive utterance bisects the set of all possible worlds, sorting those worlds into ones in which the proposition is true and ones in which it is false. By contrast, a standard question semantics represents a question as a partition over a set of worlds. This partition, induced by the question, is a mutually exclusive and exhaustive set of alternatives (or cells)--each alternative itself a set of worlds. A partial answer to a question returns a truth valuation for every world in at least one cell of the partition; a complete answer satisfies the same condition for every cell. A key feature of question semantics is that differences between worlds in the same cell of a question partition by definition do not directly contribute to answering the question. For example, if I ask "who talked to whom at the party last night?", a partition over possible conversation partners from last night's party would not track differences in the shirt colors of the party-goers. Conversely, "who wore pink to the party last night?" would be indifferent to conversational pairings. Let's call a stack (a nested partial order) of context-structuring question partitions a filtration.
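As a toy illustration of this partition picture (the worlds, guests, and feature functions below are made-up assumptions for exposition only), a question groups worlds by the feature it asks about, and worlds that differ only in respects the question ignores land in the same cell:

```python
# Each toy world fixes two features: who talked to whom, and who wore pink.
worlds = [
    {"pairs": pairs, "pink": pink}
    for pairs in [frozenset(), frozenset({("Ana", "Bo")})]            # who talked to whom
    for pink in [frozenset(), frozenset({"Ana"}), frozenset({"Bo"})]  # who wore pink
]

def induce_partition(worlds, issue):
    """Partition the worlds by the value of the feature the question asks about."""
    cells = {}
    for w in worlds:
        cells.setdefault(issue(w), []).append(w)
    return list(cells.values())

who_talked    = induce_partition(worlds, lambda w: w["pairs"])  # 2 cells of 3 worlds each
who_wore_pink = induce_partition(worlds, lambda w: w["pink"])   # 3 cells of 2 worlds each

# A complete answer picks out exactly one cell; differences in shirt color never
# split a cell of the who-talked-to-whom partition, and vice versa.
print(len(who_talked), len(who_wore_pink))  # 2 3
```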
An organism like Fred could have a context-structuring question partition that models its context by way of a direct query or a proxy query. Only by systematically varying Fred's environment could an observer determine which partition was operative. Of course, the observer's evidence about whether a context-structuring question partition is a direct query or a proxy query, insofar as it relates to Fred's intentional outlook, is asymmetric: you can falsify the claim that some query is a direct query, but you cannot confirm it unless every possible environmental variation is tested (an impossible task in the real world). If an environmental variation results in a behavior change (say the observer turns the food blue, and Fred keeps going after red things for a while but then changes behavior or dies), we learn that Fred was tracking red only as a proxy for something else. The observer might adopt a guilty-until-proven-innocent approach to Fred's possible context-structuring questions, assuming that every possible query bears directly on an intention until an intervention proves otherwise.
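In code, the observer's procedure might look like the following (a hypothetical sketch; Fred's policy, the intervention set, and all names here are assumptions made for illustration, not an established method):

```python
def fred_chooses(environment):
    """Fred's actual, hidden policy: go after the red things."""
    return {name for name, obj in environment.items() if obj["color"] == "red"}

def satisfies_find_food(environment, choices):
    """The intention 'find food' is satisfied when Fred picks all and only the nutritious objects."""
    return choices == {name for name, obj in environment.items() if obj["nutritious"]}

# Interventions: the actual red-food world, then a world where the food is turned blue.
interventions = [
    {"berry": {"color": "red",  "nutritious": True},
     "pebble": {"color": "grey", "nutritious": False}},
    {"berry": {"color": "blue", "nutritious": True},
     "bead":  {"color": "red",  "nutritious": False}},
]

# Assume the operative query is direct until some intervention falsifies that assumption.
for env in interventions:
    if not satisfies_find_food(env, fred_chooses(env)):
        print("falsified: Fred was tracking a proxy, not the intention itself")
        break
else:
    print("not yet falsified (never confirmed): all tested variations were consistent")
```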
What might this mean for alignment?
If one can model the intentional outlook (the normative disposition) of a complex system by tracking certain context-structuring questions (direct queries, not proxy queries) that the system asks its environment, then the problem of modelling the intentional outlook of a frontier AI model reduces to the problem of modelling a structured representation of its context under sufficient counterfactual variation. Right now there is no good way to model the intentional outlook of an LLM or any frontier AI. There are many reasons for this, but one possibly overlooked reason is that our ability to formally represent normative/intentional content lags far behind our ability to formally represent descriptive/factual content. The idea that questions can rigidly designate intentions suggests that question semantics could function as a bridge between a semantics for intentions/norms and a semantics for descriptions, providing a research avenue for alignment science.
A speculative preview of this research avenue:
The implicit and explicit questions that structure context are learnable and differentiable. We can represent these question partitions as tensors, a partial order of which constitutes a context-structuring filtration. Declarative utterances retain their status as functions from filtration to filtration and can be thought of as mask tensors, truth-functionally updating an intention-laden context. By forcing a model to generate question partitions that correspond to its understanding of the conversational context at each step in a discourse (this would be an additional kind of attention layer), alignment scientists may be able to lint the normative dispositions of frontier models and systematically test for devious intent.
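One way the pieces might fit together, offered purely as a speculative sketch (the one-hot encoding, the mask-as-assertion move, and every name below are my assumptions, not a worked-out architecture): a question partition over n worlds can be stored as an n-by-k assignment matrix, and an assertion as a 0/1 mask over worlds that truth-functionally prunes the live context.

```python
import numpy as np

n_worlds = 6

# A question partition as a one-hot (worlds x cells) assignment matrix:
# each of the 6 worlds sits in exactly one of the 2 cells.
who_talked = np.array([
    [1, 0], [1, 0], [1, 0],   # worlds where Ana and Bo talked
    [0, 1], [0, 1], [0, 1],   # worlds where they did not
])

# A stack of such matrices, ordered by refinement, would play the role of a
# filtration; a single partition is kept here for brevity.

# A declarative utterance as a 0/1 mask over worlds: 1 where the proposition
# holds, 0 where it fails. Asserting it truth-functionally updates the context.
assertion = np.array([1, 1, 0, 1, 0, 0])
live_worlds = np.ones(n_worlds) * assertion

# How much of each cell of the question partition survives the update:
cell_weights = live_worlds @ who_talked
print(cell_weights)  # [2. 1.]
```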