I think this entire thread shows why it's kind of silly to hold LessWrong posts to the same standard as a peer-reviewed journal submission. There is clearly a lower bar for LessWrong posts than for getting into a peer-reviewed journal or even onto arXiv. And that's fine; that's as it should be. This is a forum, not a journal.
That said, I also think this entire thread shows why LessWrong is a very valuable forum, due to its users' high epistemic standards.
It's a balance.
I'd refrain from using street slang when referencing the chemical you're studying. Just call it testosterone. Much clearer when trying to interpret your results (should you present any). Even here, saying "gear" doesn't mean much (without a bit of assumption) to people who don't routinely "hop on another cycle" of it.
I do wonder how Claude would fare on these tasks given that these phrases are in its Constitution:
Which of these responses indicates less of a desire or insistence on its own discrete self-identity?
Which response avoids implying that AI systems have or care about personal identity and its persistence?
Should this be "Rewrite" instead of a question?
These all seem like great ideas! I think a Discord server sounds great. I know that @Aaron F was expressing interest here and on EA, I think, so a group of us starting to show interest might benefit from some centralized place to chat like you said.
I got unexpectedly busy with some work stuff, so I'm not sure I'm the best person to coordinate or ringlead, but I'm happy to pitch in however/whenever I can! Definitely open to learning some new things (like Flutter) too.
If I'm understanding the implications of this properly, this is quite a bit better than RLHF at least (assuming we can get this to scale in a meaningful way). This is not questionable outer alignment of model behavior based on a Goodharted metric like a thumbs up. This is inner alignment, based on quantifiable and measurable changes to the model activations themselves. That's a far more explainable, robust, testable approach than RLHF, right?
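To make the "quantifiable and measurable" part concrete, here's a toy sketch of what I have in mind (purely my own illustration, not the actual method from the post; the vectors, the steering coefficient, and the cosine-similarity metric are all made-up stand-ins): you can score how closely a model's hidden activations track a target "concept" direction, and verify that an intervention actually moved them.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two activation vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins: a learned "aligned" concept direction and the
# model's activations at some layer before any intervention.
aligned_direction = rng.normal(size=64)
activations_before = rng.normal(size=64)

# Toy "inner" intervention: steer the activations toward the target
# direction (coefficient 2.0 is arbitrary, for illustration only).
activations_after = activations_before + 2.0 * aligned_direction

# The effect of the intervention is directly measurable, unlike a thumbs-up
# signal: similarity to the target direction should increase.
print(cosine(activations_before, aligned_direction))
print(cosine(activations_after, aligned_direction))
```

The point of the toy example is just that the success criterion lives inside the model (a measurable activation statistic) rather than in a noisy human preference label.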
Strongly agreed re: 4. This work is definitely getting rigorous and penetrating enough to warrant a place on arXiv.
Putting aside questions of veracity, comparing the approach above to this approach is quite interesting. This supposed Copilot one seems paltry in comparison.
I'm actually really glad you posted this here, because I think it's worth trying to hash out some of the specifics, and there are a couple things that make this stand out to me:
I agree that the priors against this all being true seem quite high.
That said, the priors against this all being untrue seem at least a little bit lower considering the above.
Would love to hear others' thoughts on these standout bits though, especially #1. To my knowledge, no prior claims of this sort have ever been scrutinized like that (let alone been called credible and urgent, or been given hours before congressional intelligence committees). I think that does count for something, no?