Hi, I found your post while searching for reproductions of the Anthropic study on smaller models back in November. I think it raised important concerns that the other direct replications didn't fully address, like this one, where the difference between control questions was shown in probabilities instead of logit differences. That obscures whether this is a simple bias shift in the logits towards 'yes' or whether there is more at play.
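To make the probability vs. logit point concrete, here is a minimal sketch with made-up numbers (not results from any run). It assumes only the 'yes' and 'no' tokens compete, so P(yes) is a sigmoid of the logit difference; `p_yes` and `shift_effects` are hypothetical helper names. The same constant logit shift produces very different probability changes depending on where the baseline sits:

```python
import math

def p_yes(logit_diff):
    """P(yes) under the simplifying assumption that only 'yes' and 'no'
    compete: sigmoid of logit(yes) - logit(no)."""
    return 1.0 / (1.0 + math.exp(-logit_diff))

def shift_effects(baseline_diffs, bias):
    """For each baseline logit difference, return (baseline, change in P(yes))
    after adding the same constant bias to the logit difference."""
    return [(d, p_yes(d + bias) - p_yes(d)) for d in baseline_diffs]

# Made-up baseline logit differences for three control questions.
baseline_diffs = [-6.0, -3.0, 0.0]
bias = 2.0  # identical additive shift towards 'yes' in every case

for d, dp in shift_effects(baseline_diffs, bias):
    print(f"baseline logit diff {d:+.1f}: logit change {bias:+.1f}, "
          f"P(yes) change {dp:+.4f}")
```

In logit space the three questions move by exactly the same amount, but in probability space the question with a strongly negative baseline barely moves, so a plot in probabilities can make a uniform bias look like a question-specific effect.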
I ran your tests with vectors obtained by subtracting the mean activations of baseline words from a given word's activations, as in the Anthropic study. I have found both examples of graphs looking like a 'yes' bias and ones that showed the graphs like...