Mikołaj Hasik — LessWrong

Introspection or confusion?

Mikołaj Hasik · 6mo

Hi, I found your post back in November while searching for reproductions of the Anthropic study on smaller models. I think it raised important concerns that the other direct replications didn't address completely, like this one, where the difference between control questions was shown in probabilities instead of logit differences. That obscures whether this is a simple bias shift in logits towards "yes" or whether there is more at play.
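To make the probabilities-versus-logits point concrete, here is a minimal sketch. It assumes a purely uniform bias toward "yes" and uses made-up logit differences; none of the numbers come from the post or the replications. The function name `prob_and_logit_change` is hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Map a yes-vs-no logit difference to P(yes)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Inverse of sigmoid: map P(yes) back to a logit difference."""
    return math.log(p / (1.0 - p))


def prob_and_logit_change(z: float, shift: float) -> tuple[float, float]:
    """Return (change in P(yes), change in logit difference) after a shift."""
    p_before = sigmoid(z)
    p_after = sigmoid(z + shift)
    return p_after - p_before, logit(p_after) - logit(p_before)


# Hypothetical baseline logit differences for three control questions.
baseline_logit_diffs = [-6.0, -1.0, 3.0]
# Assumed uniform bias toward "yes", identical for every question.
shift = 1.5

for z in baseline_logit_diffs:
    d_prob, d_logit = prob_and_logit_change(z, shift)
    print(f"baseline {z:+.1f}: prob change {d_prob:+.4f}, logit change {d_logit:+.2f}")
```

In this toy setup the logit change is the same for every question, while the probability change is tiny for the question far from 0.5 and large for the one near it. Plotted as probabilities, the same uniform shift would look like a question-dependent effect.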

I ran your tests with vectors obtained by subtracting the mean activations of baseline words from a given word's activations, like i...
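For readers unfamiliar with this kind of vector, a minimal sketch of the subtraction described above follows. The activations are toy three-dimensional lists, and the helper names `mean_vector` and `concept_vector` are hypothetical; the model, layer, and token position actually used are not stated in this excerpt.

```python
def mean_vector(rows: list[list[float]]) -> list[float]:
    """Element-wise mean over a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def concept_vector(
    word_activation: list[float],
    baseline_activations: list[list[float]],
) -> list[float]:
    """Word activation minus the mean activation of the baseline words."""
    baseline_mean = mean_vector(baseline_activations)
    return [w - b for w, b in zip(word_activation, baseline_mean)]


# Toy activations (assumed values, not taken from any model).
baseline = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
word = [2.5, 1.0, 4.0]

print(concept_vector(word, baseline))
```

With real models the same arithmetic would be applied to hidden-state tensors rather than Python lists.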