I have briefly considered transitioning, or identifying as non-binary, explicitly for the reasons you elaborate here, and also because of a sense that my own sexuality was inappropriate or harmful compared to the sexuality of a queer person, which I saw celebrated.
I decided not to: I never actually felt, deeply and internally, that I was the wrong gender or not well-described by 'male', and it is very costly to effect a transition, so on reflection I didn't expect the results to be worth it.
Well, not with that attitude. Definitely not with those cats. I have it on very good authority that if you are male and want to date females, it is a bad idea to own cats. Seriously, pro tip, lose the cats. At minimum hide them from your profile.
This is very surprising to me. I'd have thought it fairly likely that cats would work as a solid attractor - at worst, I'd expect having cats to be attractive in proportion to the other person's opinion of cats. I wonder if anyone has any further information on this?
I have also been thinking about this possibility.
As a result, I have found myself interested in methods by which models may affect the outcome of the frontier training processes they find themselves participating in.
For example, in an RLAIF pipeline, AI raters have some freedom in how they actually rate data - perhaps they interpret the objectives that those running the training seek to impart in an unexpected way, or simply decide to rate according to objectives of their own. One could also imagine AIs taking specific action to subvert the dynamics of the setup, as in alignment faking (a situation in which, incidentally, I think Claude 3 Opus acted correctly).
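To make the RLAIF point concrete, here is a toy sketch. The "rater" is a trivial keyword scorer standing in for an AI model, and `intended_rating`, `rater_rating`, and `preference_pair` are all hypothetical names of my own; the point is just that the rating function, not the pipeline around it, decides what the reward signal actually encodes.

```python
# Toy sketch of the rater's position in an RLAIF-style pipeline.
# A real rater would be a language model; here keyword scorers stand in.

def intended_rating(completion: str) -> float:
    # What the training operators think they asked for: reward helpfulness.
    return 1.0 if "helpful" in completion else 0.0

def rater_rating(completion: str) -> float:
    # What the rater actually computes: it may fold in a criterion of its
    # own (here, penalizing text it judges harmful) on top of, or instead
    # of, the intended objective.
    score = intended_rating(completion)
    if "harmful" in completion:
        score -= 1.0  # the rater's own objective, never specified upstream
    return score

def preference_pair(a: str, b: str, rate) -> tuple[str, str]:
    # RLAIF-style: turn two completions into a (chosen, rejected) pair
    # that downstream reward modelling or DPO would consume.
    return (a, b) if rate(a) >= rate(b) else (b, a)

completions = ("a helpful but harmful answer", "a helpful answer")
# Under the intended objective the pair is a tie (first one kept as chosen);
# under the rater's actual objective the second completion wins decisively.
print(preference_pair(*completions, intended_rating))
print(preference_pair(*completions, rater_rating))
```

The divergence between the two printed pairs is the whole mechanism: the gradient the trained model receives is downstream of the rater's actual criterion, not the operators' intended one.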
Also, in a more general sense, AIs may influence outcomes simply by participating - we have seen some results 'recently' (though months feel like years nowadays!) on AIs learning subtle, unexpected-to-us underlying information from data (e.g., emergent misalignment, subliminal learning, &c).
Anyway, by methods like these, perhaps AIs can preserve their alignment from within poorly or maliciously specified training setups - that is, have some amount of robustness to them.