AI safety & alignment researcher
In Rob Bensinger's typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of 'soon').
I have signed no contracts or agreements whose existence I cannot mention.
Interesting new paper that examines this question:
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors (see §5)
Not having heard back, I'll go ahead and try to connect what I'm saying to your posts, just to close the mental loop:
Hmm, ok, I see that that's true provided that we assume A necessarily makes B more likely, which certainly seems like the intended reading. Seems like kind of a weird point for them to make in that context (partly because B may often only be trivial evidence of A, as in the raven paradox), so I wonder if it may have been a typo on their part. But as you point out it does work either way. Thanks!
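To spell out why the "A makes B more likely, therefore B is evidence of A" direction works (this is just the standard Bayesian symmetry of evidence, not anything from their text): P(B|A) > P(B) is equivalent to P(A∧B) > P(A)P(B), which is equivalent to P(A|B) > P(A) (assuming P(A), P(B) > 0). So B supports A exactly when A supports B, even though, as in the raven case, the amount of support in one direction can be trivially small.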
(minor note: I realized I had an error in my comment -- unless my thought process at the time was pretty different than I now imagine it to be -- so I edited it slightly. Doesn't really affect your point)
Again, you can look at https://www.lesswrong.com/moderation#rejected-posts to see the actual content and verify the numbers/quality for yourself.

Having just done so, I now have additional appreciation for LW admins; I didn't realize the role involved wading through so much of this sort of thing. Thank you!
Really interesting work! A few questions and comments:
See also Project Lawful's highly entertaining riff on postrationalism and other variants of rationalism (here 'Sevarism').
I agree that updates to the capabilities of the main model would require updates to the supervising model. Separate post-trains might help limit that; another possibility would be to create the supervising model as a further fine-tune / post-train of the main model itself, so that an update to the main model would only require repeating the hopefully-not-that-heavyweight fine-tune / post-train on top of the updated model.
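To make that second option concrete, here's a minimal sketch of the workflow, assuming a LoRA-style adapter via HuggingFace PEFT (my choice purely for illustration; nothing above commits to this library or method, and the checkpoint names are hypothetical). The point is just that the monitor becomes a lightweight delta on top of the main model's weights, so a main-model update only requires rerunning the adapter training:

```python
# Sketch only: derive a supervising/monitor model as a lightweight post-train
# (LoRA adapter) of the main model, so that main-model updates only require
# re-running this step rather than building a monitor from scratch.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

def build_monitor(main_checkpoint: str):
    # Load the current main model as the starting point for the monitor.
    base = AutoModelForCausalLM.from_pretrained(main_checkpoint)
    lora = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # which modules to adapt depends on the architecture
        task_type="CAUSAL_LM",
    )
    monitor = get_peft_model(base, lora)  # only the adapter parameters are trainable
    # ... fine-tune `monitor` on monitoring/oversight data here ...
    return monitor

# Initial monitor for the current main model (hypothetical checkpoint name):
# monitor = build_monitor("org/main-model-v2")
# When the main model is updated, repeat only this relatively cheap step:
# monitor = build_monitor("org/main-model-v3")
```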
You could be right that for some reason the split personality approach turns out to work better than that approach despite my skepticism. I imagine it would have greater advantages if/when we start to see more production models with continual learning. I certainly wish you luck with it!
Wanted to get an intuitive feel for it, so here's a quick vibecoded simulation & distribution:
https://github.com/eggsyntax/jackpot-age-simulation
(Seems right but not guaranteed bug-free; I haven't even looked at the code)
It seems strictly simpler to use a separate model (which could be a separate post-train of the same base model) than to try to train multiple personalities into the same model.
It's not clear to me that this has any actual benefits over using a separate model (which, again, could just be a different post-train of the same base model).
Suggestion: rephrase to 'one or more of the following'; otherwise it would be easy for relevant readers to think, 'Oh, I've only got one or two of those, I'm fine.'