I recently saw a tweet by Nora Belrose claiming that ELK (eliciting latent knowledge) works much better when adding a "prompt-invariance term".
And thinking about it, there seems to be an important underlying principle here, not just for AI alignment, but also for rationality as applied to humans.
When humans think about something, we use a frame to decide what questions to ask, how to model it, which aspects of it are important, and so on.
What is true about something generally does not depend on the frame (things involving self-reference seem like the main possible exception). This means that processes optimized for truthseeking will tend to be "frame-invariant": they will explore the question the same way regardless of the frame being used.
So when we notice that a change in frame would change how we would think or feel about something, this indicates that we may be using processes that have not been optimized for truthseeking. Someone trying to determine the truth would therefore be wise to notice when this is happening: it could indicate a process optimized for something other than truthseeking, or a truthseeking process that is poorly optimized. Both are opportunities to improve one's truthseeking ability.
Eliezer has made a similar point:
Another way of breaking loose of 'arguments': Any time somebody manages to persuade you of something via much hard work, do not neglect to remember that you would, if you had been smarter, probably have been persuadable by the empty string.
In addition to being relevant for studying AI (as in the original tweet), this principle also turns up in physics as general covariance: the true laws of physics are invariant under coordinate transformations. Coordinates are things set by humans in order to be able to refer to and measure something, and choosing them carefully can make certain problems much easier. This makes coordinates an instance of the same general concept as the "frames" discussed earlier. Nonetheless, their choice cannot affect what is physically true. Einstein described this principle while working to discover General Relativity.
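For concreteness, the Einstein field equations are built entirely from tensors, so they take the same form under any smooth change of coordinates:

```latex
G_{\mu\nu} + \Lambda g_{\mu\nu} = \frac{8\pi G}{c^{4}} T_{\mu\nu}
```

A particular coordinate choice (say, Schwarzschild coordinates for a spherically symmetric problem) can make these equations far easier to solve, but it cannot change what they say about the world.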
This apparent breadth of applicability suggests that this principle is quite deep.
It's also worth noting that this principle can be turned around: given a particular question, we might be able to determine the truth more efficiently by choosing a frame optimized for that purpose. This is used to great effect in physics, though I feel less hopeful about humans being able to consistently do this in ways that don't optimize for something else instead, i.e. frame control.
Also, I think we can use this idea to point to a particular form of frame control more precisely. If someone has a truthseeking process which is frame-invariant within a certain domain, but not beyond it, then they are being frame controlled when someone pushes a frame on them that takes them outside that domain. Doing so deliberately would be a clear case of manipulation.
Interesting! I've recently been thinking a bunch about "narratives" (frames) and how strongly they shape how and what we think, making it much harder to see "the" truth, since changing the narrative changes things quite a bit.
I'm curious if anyone has an example of how they would go about applying frame-invariance to rationality.
See also the heuristics & biases work on framing effects, e.g. Tversky and Kahneman's "Rational Choice and the Framing of Decisions".