I'm an Operations Associate at Arcadia Impact, a UK nonprofit that empowers people and organisations to tackle pressing global issues, with a focus on AI safety and governance.
Before joining Arcadia, I:
You can reach out to me at brian [at] arcadiaimpact [dot] org or find me on LinkedIn.
My typo reaction may have glitched, but I think you meant "Don't push the frontier of capabilities" in the last bullet?
I've only read the blog post and a bit of the paper so far, but do you plan to investigate how to remove alignment faking in these situations? I wonder whether there are simple methods to do so without negatively affecting the model's capabilities or safety.
Thanks for doing this important research! I may have found 2 minor typos:
Thanks for this analysis! A minor note: you're probably aware of this, but OpenPhil funds a lot of technical AI safety field-building work as part of their "Global Catastrophic Risks Capacity Building" grants. So the proportion of field-building / talent-development grants would be significantly higher if those were included.
Thanks for making this! This is minor, but I think the total should be $189M and not $169M?
The last sentence of your first paragraph seems to be cut off at "gets a lot more than"!
Following up on Leon's question: have the results already been posted? If not, when will they be posted, if at all? Thanks!
Thanks for this. This tweet from Dr. Jacob Glanville, founder and CEO of Centivax, also makes me worried about the variant:
The new B.1.1.529 strain out of South Africa has 15 mutations in the RBD where majority of neutralizing antibodies bind. The current vaccines and even Delta-based vaccines probably won’t work against this new strain. Swift, vigorous containment is needed.
Thanks for linking these! I also want to highlight that Sam shared his AGI timeline in the Bloomberg interview: "I think AGI will probably get developed during this president’s term, and getting that right seems really important."