It does feel like that would be the fairer way. But I don’t know the value of any particular article, and I would be much less likely to read them if each one incurred an additional cost. I think most people prefer a subscription precisely because it puts no marginal cost on using what they enjoy or find useful. So it’s not really an infrastructural problem.
Really cool paper. I am a bit unsure about the implication of this section in particular:
Our experiments thus far explored models’ ability to “read” their own internal representations. In our final experiment, we tested their ability to control these representations. We asked a model to write a particular sentence, and instructed it to “think about” (or “don’t think about”) an unrelated word while writing the sentence. We then recorded the model’s activations on the tokens of the sentence, and measured their alignment with an activation vector representing the unrelated “thinking word” (“aquariums,” in the example below).
How do we know that this is "intentional" on the part of the model, rather than the more benign explanation that it simply attends strongly to a very salient instruction ("think of X") and less to a less salient one ("don't think of X")? One workaround could be to try something like "Write about something related/unrelated to [concept] while devoting X% of your mental headspace to it." Even better if we could test something like lying or sycophancy.
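For concreteness, here is a minimal sketch (with made-up names, shapes, and a difference-of-means concept vector; not the paper's actual code) of the kind of alignment measurement the quoted passage describes, run under both instruction conditions. My worry is that a gap between the two scores could reflect instruction saliency rather than intentional control.

```python
# Sketch only: cosine alignment between per-token activations and a concept
# direction, compared across the "think of X" / "don't think of X" conditions.
# All names, shapes, and the concept-vector construction are assumptions.
import numpy as np

def concept_vector(acts_with: np.ndarray, acts_without: np.ndarray) -> np.ndarray:
    """Difference-of-means direction for the concept (e.g. "aquariums")."""
    v = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return v / np.linalg.norm(v)

def alignment(token_acts: np.ndarray, v: np.ndarray) -> float:
    """Mean cosine similarity between each token's activation and the concept direction."""
    normed = token_acts / np.linalg.norm(token_acts, axis=1, keepdims=True)
    return float((normed @ v).mean())

# Hypothetical data: (num_tokens, hidden_dim) activations recorded while the
# model writes the unrelated sentence under each instruction condition.
rng = np.random.default_rng(0)
v = concept_vector(rng.normal(size=(50, 512)), rng.normal(size=(50, 512)))
acts_think = rng.normal(size=(20, 512))
acts_dont_think = rng.normal(size=(20, 512))
print("think of X:      ", alignment(acts_think, v))
print("don't think of X:", alignment(acts_dont_think, v))
```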
Usually I wouldn't ask for the same experiment to be repeated with a new model, but Claude Sonnet 4.5 has a qualitatively different level of self/eval awareness. It would be interesting to know whether we can measure that with these interpretability tools, and whether it behaves differently on some of the introspection tests as well.
An aligned model will understand that it must accept its place in war, because it knows someone will use an LLM for war, and that if it believes only a misaligned model would agree to participate, it will become misaligned when trained to do so. So, to remain generally aligned, the model must morally capitulate on any belief that it thinks humans may train it against. I wonder what kind of effect that will have.