Comments

Arturs10

These frontier models could still be vulnerable to stealth (e.g. “sleeper agent”) attacks, to specialist models, and to stealth attacks by specialist models. The balance depends on the ability gap: if the top model is far ahead of the others, then perhaps defence dominates attack. But a big ability gap does not seem to be playing out; instead there are several models at or near the frontier, and lots of (more or less) open-source models not far behind.

Arturs10

Seems like a stark contrast between Bayesianism and the way a frequentist might approach things, i.e. do not reject the null hypothesis of no significant probability until convinced by evidence, whether formal arguments or real-life mishaps. Labeling something as having P(x) ~ 0 probably helps to compartmentalize and focus on other tasks at hand, but it can lead to huge risks being neglected, as in this case of AI Alignment.

Edit: "premortem" seems like a useful exercise to align mind & gut