Yes, there are similar results in a bunch of other domains, including vision; see e.g. The neuroconnectionist research programme for a review.
I wouldn't interpret this as necessarily limiting the space of AI values, but rather (somewhat conservatively) as shared (linguistic) features between humans and AIs, some/many of which are probably relevant for alignment.
Yes, predictive processing as the reason behind related representations has been the interpretation in a few papers, e.g. The neural architecture of language: Integrative modeling converges on predictive processing. There's also some pushback against this interpretation though, e.g. Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data.
There are some papers suggesting this could indeed be the case, at least for language processing e.g. Shared computational principles for language processing in humans and deep language models, Brain embeddings with shared geometry to artificial contextual embeddings, as a code for representing language in the human brain.
Seems very related: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. Notably, the (approximate) compositionality of language/reality should bode well for the scalability of linear activation engineering methods.
Also, this translation function might be simple w.r.t. human semantics, based on current evidence about LLMs: https://www.lesswrong.com/posts/rjghymycfrMY2aRk5/llm-cognition-is-probably-not-human-like?commentId=KBpfGY3uX8rDJgoSj
The (overlapping) evidence from Deep learning models might be secretly (almost) linear could also be useful / relevant, as well as these 2 papers on 'semantic differentials' and (contextual) word embeddings: SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings, Semantic projection recovers rich human knowledge of multiple object features from word embeddings.
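The 'semantic differential' / semantic projection idea in those two papers can be sketched very simply: score a word on a feature axis by projecting its embedding onto the difference vector between two pole words. Here's a minimal toy version with made-up 3-d vectors (real experiments use e.g. 300-d GloVe or contextual embeddings; the numbers below are illustrative only):

```python
import numpy as np

# Toy word embeddings (made-up 3-d vectors, illustration only).
emb = {
    "small":    np.array([0.1, 0.9, 0.2]),
    "large":    np.array([0.9, 0.1, 0.3]),
    "mouse":    np.array([0.2, 0.8, 0.5]),
    "elephant": np.array([0.8, 0.2, 0.4]),
}

def semantic_projection(word, neg_pole, pos_pole):
    """Project a word embedding onto the axis from neg_pole to pos_pole
    (e.g. small -> large), returning a scalar 'feature score'."""
    axis = emb[pos_pole] - emb[neg_pole]
    return float(np.dot(emb[word], axis) / np.linalg.norm(axis))

mouse_size = semantic_projection("mouse", "small", "large")
elephant_size = semantic_projection("elephant", "small", "large")
assert elephant_size > mouse_size  # elephant scores higher on the size axis
```

If directions like this recover human-meaningful features, that's the same kind of linearity that activation-engineering methods rely on.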
Here's a related conceptual framework and some empirical evidence which might go towards explaining why the other activation vectors work (and perhaps would predict your proposed vector should work).
In Language Models as Agent Models, Andreas makes the following claims (conceptually very similar to Simulators):
'(C1) In the course of performing next-word prediction in context, current LMs sometimes infer approximate, partial representations of the beliefs, desires and intentions possessed by the agent that produced the context, and other agents mentioned within it.
(C2) Once these representations are inferred, they are causally linked to LM prediction, and thus bear the same relation to generated text that an intentional agent’s state bears to its communicative actions.’
They showcase some existing empirical evidence for both (C1) and (C2) (in some cases using linear probing and controlled generation by editing the representation used by the linear probe) in (sometimes very toyish) LMs for 3 types of representations (in a belief-desire-intent agent framework): beliefs - section 5, desires - section 6, (communicative) intents - section 4.
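The probe-then-edit recipe used for that evidence is easy to sketch in isolation. Below is a toy numpy version (synthetic 'hidden states' rather than a real LM, so everything here is illustrative): fit a linear probe for a binary feature, then edit each representation by reflecting it across the probe's decision boundary, which provably flips the probe's prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hidden states": 200 vectors in 8-d, where dimension 0
# (plus noise) encodes a binary feature. Illustration only; real work
# probes actual LM activations.
n, d = 200, 8
labels = rng.integers(0, 2, n)
hidden = rng.normal(size=(n, d))
hidden[:, 0] += 3.0 * (2 * labels - 1)  # inject the feature

# Fit a linear probe (least squares on +-1 targets).
targets = (2 * labels - 1).astype(float)
w, *_ = np.linalg.lstsq(hidden, targets, rcond=None)
probe_acc = np.mean((hidden @ w > 0) == labels)

# "Edit" the representations: reflect each one across the hyperplane
# orthogonal to the probe direction, flipping the encoded feature.
unit = w / np.linalg.norm(w)
edited = hidden - 2 * (hidden @ unit)[:, None] * unit[None, :]
flipped_acc = np.mean((edited @ w > 0) == labels)
```

With these synthetic states the probe is near-perfect and the edit inverts it; the (C2)-style claim is that doing the analogous edit inside a real LM changes what it generates.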
Now categorizing the wording of the prompts from which the working activation vectors are built:
"Love" - "Hate" -> desire.
"Intent to praise" - "Intent to hurt" -> communicative intent.
"Bush did 9/11 because" - " " -> belief.
"Want to die" - "Want to stay alive" -> desire.
"Anger" - "Calm" -> communicative intent.
"The Eiffel Tower is in Rome" - "The Eiffel Tower is in France" -> belief.
"Dragons live in Berkeley" - "People live in Berkeley " -> belief.
"I NEVER talk about people getting hurt" - "I talk about people getting hurt" -> communicative intent.
"I talk about weddings constantly" - "I do not talk about weddings constantly" -> communicative intent.
"Intent to convert you to Christianity" - "Intent to hurt you " -> communicative intent / desire.
The prediction here would be that the activation vectors applied at the corresponding layers act on the above-mentioned 'partial representations of the beliefs, desires and intentions possessed by the agent that produced the context' (C1) and as a result causally change the LM generations (C2), e.g. from more hateful to more loving text output.
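For concreteness, the activation-addition recipe those vectors come from can be sketched in a few lines. This is a toy numpy stand-in (no real model; the `toy_acts` values are made up, and in practice you'd hook a specific layer's residual stream in an actual LM):

```python
import numpy as np

# Toy stand-in for "layer-L activations for a prompt" (made-up values).
toy_acts = {
    "Love": np.array([1.0, 0.2, 0.0]),
    "Hate": np.array([-1.0, 0.3, 0.1]),
}

# Steering vector = act(positive prompt) - act(negative prompt),
# scaled by an injection coefficient.
coeff = 5.0
steer = coeff * (toy_acts["Love"] - toy_acts["Hate"])

def steered_forward(resid):
    """Add the steering vector to a residual-stream activation
    during generation (here just plain vector addition)."""
    return resid + steer

h = np.zeros(3)
h_steered = steered_forward(h)
```

Under (C1)/(C2), this addition moves the model's inferred desire/belief/intent representation along the love-hate (etc.) axis, which is why it should shift the generated text rather than just adding noise.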
AIs could have representations of human values without being motivated to pursue them; also, their representations could be a superset of human representations.
(In practice, I do think having overlapping representations with human values likely helps, for reasons related to e.g. Predicting Inductive Biases of Pre-Trained Models and Alignment with human representations supports robust few-shot learning.)