I made a Manifold market for this, for those who wish to bet on it: https://manifold.markets/JasonBrown/will-a-gpt4-level-efficient-hrm-bas?r=SmFzb25Ccm93bg
Thank you for providing a good introduction and arguments in favour of this research direction. Whilst I strongly agree that safety pre-training is valuable (and have even considered working on it myself with some collaborators), I think several of the core claims here are false, and that ultimately one should not consider alignment to be solved.
TL;DR: I think safety pre-training is probably a huge boost to alignment, but our work is far from done and there are still plenty of open issues and uncertainties.
Thank you!
Your post was also very good and I agree with its points. I'll probably edit my post in the near future to reference it, along with some of the other useful references in it that I wasn't aware of.
Yes, it's still unclear how to measure modification magnitude in general (or whether that's even possible to do in a principled way), but for modifications limited to text you could use the entropy of the text, which seems to me like a fairly reasonable and somewhat fundamental measure (in the information-theoretic sense). Thank you for the references in your other comment; I'll make sure to give them a read!
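To make that a bit more concrete, here is a minimal sketch of one way the entropy of a text modification could be computed, assuming a simple character-level unigram model estimated from the edit itself; the function name and the choice of model are my own illustration rather than anything proposed in the thread:

```python
from collections import Counter
from math import log2

def char_entropy_bits(text: str) -> float:
    """Total Shannon entropy of `text` in bits, under a unigram
    (character-frequency) model estimated from the text itself.

    Illustrative proxy for "modification magnitude": longer or less
    predictable edits carry more bits.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # Entropy per character (bits), scaled by length to give the
    # total information content of the whole modification.
    per_char = -sum((c / n) * log2(c / n) for c in counts.values())
    return per_char * n

# A short, repetitive edit carries fewer bits than a longer, varied one.
print(char_entropy_bits("aaaa"))           # 0.0 bits
print(char_entropy_bits("ban the model"))  # noticeably more bits
```

A more principled variant would score the edit under a fixed language model rather than the edit's own character frequencies, but the basic idea of "bits of information in the modification" is the same.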
Thank you, this looks very interesting.
I've ended up making another Manifold market somewhat to this effect, trying to predict any significant architectural shifts in frontier models over the next year and a half: https://manifold.markets/Jasonb/significant-advancement-in-frontier