Yes, ideally probes trained with different random seeds should converge to the same direction if there is a well-defined signal in the data. I think the divergence here is largely an artifact of the dataset quality. The original dataset had only about 120 examples, mostly focused on cybersecurity topics, so the probes may have overfit or gotten stuck in different local minima.
The Bigbench dataset is slightly better but still lacks diversity. It’s sufficient to reveal the interesting structure (that's why I stopped here), but to get more consistent probe directions we’d need a larger and more balanced dataset.
Maybe creating a dataset from a list of forbidden behaviours ? (like the Constitutional AI or the Constitutional Classifiers). At least trying to have a diverse and large enough dataset for the probes to converge properly.
Yes, ideally probes trained with different random seeds should converge to the same direction if there is a well-defined signal in the data. I think the divergence here is largely an artifact of the dataset quality. The original dataset had only about 120 examples, mostly focused on cybersecurity topics, so the probes may have overfit or gotten stuck in different local minima.
The Bigbench dataset is slightly better but still lacks diversity. It’s sufficient to reveal the interesting structure (that's why I stopped here), but to get more consistent probe directions we’d need a larger and more balanced dataset.
Maybe creating a dataset from a list of forbidden behaviours ? (like the Constitutional AI or the Constitutional Classifiers). At least trying to have a diverse and large enough dataset for the probes to converge properly.