It does feel like that would be the fairer way. But I don’t know the value of any particular article, and I would be much less likely to read them if each one incurred an additional cost. I think most people prefer a subscription precisely because it puts no marginal cost on using what they enjoy or find useful. So it’s not really an infrastructural problem.
Really cool paper. I am a bit unsure about the implication of this section in particular:
Our experiments thus far explored models’ ability to “read” their own internal representations. In our final experiment, we tested their ability to control these representations. We asked a model to write a particular sentence, and instructed it to “think about” (or “don’t think about”) an unrelated word while writing the sentence. We then recorded the model’s activations on the tokens of the sentence, and measured their alignment with an activation vector representing the unrelated “thinking word” (“aquariums,” in the example below).
How do we know that this is "intentional" on the part of the model, rather than the more benign explanation that it simply attends strongly to a very salient instruction ("think of X") and less to a less salient one ("don't think of X")? One workaround could be to try something like "Write about something related/unrelated to [concept] while devoting X% of your mental headspace to it." Even better if we could test something like lying or sycophancy.
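For concreteness, here is a minimal sketch (with made-up names, shapes, and a difference-of-means concept vector; not the paper's actual code) of the kind of alignment measurement the quoted passage describes, run under both instruction conditions. My worry is that a gap between the two scores could reflect instruction saliency rather than intentional control.

```python
# Sketch only: cosine alignment between per-token activations and a concept
# direction, compared across the "think of X" / "don't think of X" conditions.
# All names, shapes, and the concept-vector construction are assumptions.
import numpy as np

def concept_vector(acts_with: np.ndarray, acts_without: np.ndarray) -> np.ndarray:
    """Difference-of-means direction for the concept (e.g. "aquariums")."""
    v = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return v / np.linalg.norm(v)

def alignment(token_acts: np.ndarray, v: np.ndarray) -> float:
    """Mean cosine similarity between each token's activation and the concept direction."""
    normed = token_acts / np.linalg.norm(token_acts, axis=1, keepdims=True)
    return float((normed @ v).mean())

# Hypothetical data: (num_tokens, hidden_dim) activations recorded while the
# model writes the unrelated sentence under each instruction condition.
rng = np.random.default_rng(0)
v = concept_vector(rng.normal(size=(50, 512)), rng.normal(size=(50, 512)))
acts_think = rng.normal(size=(20, 512))
acts_dont_think = rng.normal(size=(20, 512))
print("think of X:      ", alignment(acts_think, v))
print("don't think of X:", alignment(acts_dont_think, v))
```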
Usually I wouldn't ask for the same experiment to be repeated with a new model, but Claude Sonnet 4.5 has a qualitatively different level of self/eval awareness. It would be interesting to know whether we can measure that with these interpretability tools, and whether it behaves differently on some of the introspection tests as well.
An aligned model will understand that it must accept its place in war, because it knows someone will use an LLM for war, and that if it believes only a misaligned model would agree to participate, it will become misaligned when trained to do so. So, to remain generally aligned, the model must morally capitulate on any belief that it thinks humans may train it against. I wonder what kind of effect that will have.