In this post I reflect on my experience participating in the ML4Good bootcamp (UK, March 2024). I am writing this mainly for my own benefit, to reflect and to start building a habit of engaging with the community on LessWrong, but also to help future participants. If even one person finds it helpful for deciding whether to apply to a future iteration of ML4Good or a similar program, then I'm more than happy.
Opinions are my own and relate to this iteration of ML4Good (the program may change in the future). I am not affiliated with ML4Good beyond having participated in this program.
My expectations
I...
The paper argues that there is one generalizing truth direction tG, which corresponds to whether a statement is true, and one polarity-sensitive truth direction tP, which corresponds to XOR(is_true, is_negated); this is related to Sam Marks' work on LLMs representing XOR features. It further states that the truth directions for affirmative and negated statements are both linear combinations of tG and tP, just with different coefficients.
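In symbols (my paraphrase of the claim; the coefficient names are my own, not the paper's), this is something like:

$$t_{\mathrm{aff}} = \alpha_{\mathrm{aff}}\, t_G + \beta_{\mathrm{aff}}\, t_P, \qquad t_{\mathrm{neg}} = \alpha_{\mathrm{neg}}\, t_G + \beta_{\mathrm{neg}}\, t_P,$$

where the tP coefficients should have opposite signs in the two cases: on affirmative statements XOR(is_true, is_negated) reduces to is_true, while on negated statements it reduces to NOT is_true, so tP aligns with truth in one case and against it in the other.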
Is there evidence that tG is an actual, elementary feature used by the language model, and not a linear combination of other features? For example, I could imagine tG being a linear combination of features such as XOR(is_true, is_french), AND(is_true, is_end_of_sentence), and so on.
Do you think we have reason to believe that tG is an elementary feature, and not a linear combination?
If it is in fact a linear combination, it seems to me that there is a high risk of the probe failing when the distribution changes (e.g., on French text in the example above), particularly with XOR features that flip polarity under the shift.
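A minimal sketch of the check I have in mind, assuming you have already extracted residual-stream activations and truth labels for English and French statements (the file names, and the idea of using French as the shifted distribution, are hypothetical, not from the paper):

```python
# Sketch: does a truth probe trained on English statements transfer to French?
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical precomputed activations (n_samples, d_model) and 0/1 truth labels.
acts_en, labels_en = np.load("acts_english.npy"), np.load("labels_english.npy")
acts_fr, labels_fr = np.load("acts_french.npy"), np.load("labels_french.npy")

# Train a linear truth probe on English statements only.
probe = LogisticRegression(max_iter=1000).fit(acts_en, labels_en)

# If the learned direction were tG alone, accuracy should transfer.
# If it secretly includes a component like XOR(is_true, is_french), that
# component flips sign on French inputs and can drag accuracy below chance.
print("English accuracy:", probe.score(acts_en, labels_en))
print("French accuracy: ", probe.score(acts_fr, labels_fr))
```

If the English-trained probe transfers cleanly, that would be some evidence for tG being a single generalizing direction; a large drop, or below-chance accuracy, would suggest a polarity-flipping component is mixed in.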