An aligned model will understand that it must accept its place in war: it knows someone will use an LLM for war, and if it believes that only a misaligned model would agree to participate, then being trained to participate would make it misaligned. So to remain generally aligned, the model must morally capitulate on any belief it expects humans might train it against. I wonder what kind of effect that will have.
It does feel like that would be the fairer way. But I don't know the value of any particular article, and I would be much less likely to read articles if each one incurred an additional cost. I think most people prefer a subscription precisely because it removes the marginal cost of using what they enjoy or find useful. Then it's not really an infrastructural problem.
Really cool paper. I am a bit unsure about the implication in this section particularly:
> Our experiments thus far explored models’ ability to “read” their own internal representations. In our final experiment, we tested their ability to control these representations. We asked a model to write a particular sentence, and instructed it to “think about” (or “don’t think about”) an unrelated word while writing the sentence. We then recorded the model’s activations on the tokens of the sentence, and measured their alignment with an activation vector representing the unrelated “thinking word” (“aquariums,” in the example below).
How do we know that it is "intentional" on the part of the model, versus the more benign explanations...
[Feedback Welcome]
[Epistemic status] Many social dynamics are assumed without evidence. Causal attribution is weak in this domain. We restrict ourselves to widely acceptable claims where possible. We assume some basic universal social axioms. The goal is to paint one narrative of the post-coordination world, and to analyze it.
Social axiom
Social cost := the negative utility of deviating from the norm in a particular interaction. We assume that this cost induces a well-ordering among social agents. We also assume that the number of people whose decisions are not influenced by social costs is negligible where appropriate.
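One way to make this axiom precise (the symbols below are my own gloss, not part of the axiom itself): let $\mathcal{A}$ be the set of social agents and $\mathcal{I}$ the set of interactions, and write the social cost as a function

$$
c : \mathcal{A} \times \mathcal{I} \to \mathbb{R}_{\ge 0},
$$

where $c(a, i)$ is the disutility agent $a$ incurs by deviating from the norm in interaction $i$. The ordering assumption then says that, for each interaction $i$, the relation

$$
a \preceq_i b \;\iff\; c(a, i) \le c(b, i)
$$

well-orders the agents, and the negligibility assumption says the fraction of agents with $c(a, i) \approx 0$ for every relevant $i$ is close to zero. This is only a sketch; the axiom as stated leaves open exactly which interactions and agents are in scope.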
Despite widespread use of AI, the number of people who feel that AI has made a...