"In my experience, the vast majority of people in AI safety are in favor of big-tent coalition protests on AI in theory"
is this true? I think many people (myself included) are worried about conflationary alliances backfiring (as we see to some extent in the current admin)
I think it’s actually another control process—specifically, the process of controlling our identities. We have certain conceptions of ourselves (“I’m a good person” or “I’m successful” or “people love me”.) We then are constantly adjusting our lives and actions in order to maintain those identities—e.g. by selecting the goals and plans which are most consistent with them, and looking away from evidence that might falsify our identities. So perhaps our outermost loop is a control process after all.
rhymes with Steve on approval reward (and MONA)
in general I find I'm sowewhat skeptical about the particulars of activate inference coalitional agency frame, but very excited about locality, path-dependence, predictive processing, and myopia
yeah I am very grateful for MIRI, and I don't think we should be complacent about existential risks (e.g. 50% P(doom) seems totally reasonable to me)
the hope is that
a) we to get the transformative ai to do our alignment homework for us, and
b) that companies / society will become more concerned about safety (such that the ratio of safety to capabilities research increases a lot)
yeah I agree, I think the update is basically just "AI control and automated alignment research seem very viable and important", not "Alignment will be solved by default"
I think the "open character training" paper is probably a good place to look
also curious how you feel about using standard harmfulness questions for auditing MOs.
My vague sense is that models tend to have more reflexive refusal mechanisms that asking Chinese models about CCP censored stuff, but does seem like models strategically lie sometimes. (ofc you'd also need to setup a weak-to-strong situational where the weak model genuinely doesn't know how to make a bomb, chem weapon, etc., but seems doable)
I'm pretty skeptical about the value of adding a canary string for conceptual / results posts, and I rarely see this on similar posts