I am a summer fellow at the Centre for the Governance of AI, where I research compute governance. Previously, I was a participant in ARENA 5.0, a hardware security research assistant through the SPAR program, and a security engineer at a hedge fund. I graduated from Columbia University in December 2024, where I studied computer science.
My academic interests lie in AI safety & governance, hardware & software security, (meta-)ethics, physics, and cognitive science. My primary goal is to improve the welfare of sentient beings, and I think one of the best ways to secure a flourishing future is by ensuring that the transition to a world with transformative AI goes well. I’m also interested in issues like wild animal welfare and factory farming.
I have signed no contracts or agreements whose existence I cannot mention.
Semi-cooperation is one way for both sides to learn from each other—but so is poor infosec or even outright espionage. If both countries are leaking or spying enough, that might create a kind of uneasy balance (and transparency), even without formal agreements. It’s not exactly stable, but it could prevent either side from gaining a decisive lead.
In fact, sufficiently bad infosec might even make certain forms of cooperation and mutual verification easier. For instance, if both countries are considering setting up trusted data centers to make verifiable claims about AGI development, the fact that espionage already permeates much of the AI supply chain could paradoxically lower the bar for trust. In a world where perfect secrecy is already compromised, agreeing to “good enough” transparency might become more feasible.
Thanks for the comment. Strong upvoted!
I agree that the quotations described as "backwards" are not necessarily wrong given the two possible (and reasonable) interpretations of the RLHF procedure. Thanks for flagging this subtlety; I had not thought of it before. I will update the body of the post to reflect it.
Meta point: I'm so grateful for the LessWrong community. This is my first post and first comment, and I find it so wild that I'm part of a community where people like you write such insightful comments. It's very inspiring :)
Great post! It's been almost a year since this was posted, so I was curious whether anyone has worked on these questions:
I did a quick lit review and didn't find much. Here's what I did find (not perfectly related to the above questions, though).
So, has anyone pursued the two quoted questions above? Super curious if anyone has good results!