Hi, I found your post while searching for reproductions of the anthropic study on smaller models back in November, I think it raised important concerns about that the other direct replications didn't address completely - like this where the difference between control questions was shown in probabilities instead of logit differences, obscuring whether this is a simple bias shift in logits towards yes or if there is more at play.
I ran your tests with vectors gathered from subtracting mean activations from baseline words from a given word's activations like i... (read more)
Hi, I found your post while searching for reproductions of the anthropic study on smaller models back in November, I think it raised important concerns about that the other direct replications didn't address completely - like this where the difference between control questions was shown in probabilities instead of logit differences, obscuring whether this is a simple bias shift in logits towards yes or if there is more at play.
I ran your tests with vectors gathered from subtracting mean activations from baseline words from a given word's activations like i... (read more)