My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha on Telegram).
Humanity's future can be enormous and awesome; losing it would mean our lightcone (and maybe the universe) losing most of its potential value.
I have takes on what seems to me to be the very obvious, shallow stuff about technical AI notkilleveryoneism; but many AI Safety researchers have told me our conversations improved their understanding of the alignment problem.
I'm running two small nonprofits: AI Governance and Safety Institute and AI Safety and Governance Fund. Learn more about our results and donate: aisgf.us/fundraising
I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).
In the past, I launched the most-funded crowdfunding campaign in the history of Russia (it was to print HPMOR! We printed 21,000 copies, which came to 63,000 books) and founded audd.io, which allowed me to donate >$100k to EA causes, including >$60k to MIRI.
[Less important: I've also started a project to translate 80,000 Hours, a career guide that helps people find a fulfilling career that does good, into Russian. The impact and the effectiveness aside, for a year, I was the head of the Russian Pastafarian Church: a movement claiming to be a parody religion, with 200,000 members in Russia at the time, trying to increase the separation between religious organisations and the state. I was a political activist and a human rights advocate. I studied relevant Russian and international law and wrote appeals that won cases against the Russian government in courts; I was able to protect people from unlawful police action. I co-founded the Moscow branch of the "Vesna" democratic movement, coordinated election observers in a Moscow district, wrote dissenting opinions for members of electoral commissions, helped Navalny's Anti-Corruption Foundation, helped Telegram with internet censorship circumvention, and participated in and organized protests and campaigns. The large-scale goal was to build a civil society and turn Russia into a democracy through nonviolent resistance. This goal wasn't achieved, but some of the more local campaigns were successful. That felt important and was also mostly fun (except for being detained by the police). I think it's likely the Russian authorities would imprison me if I ever visit Russia.]
Perhaps you’re right; I would love for that to be the case, and to have been wrong about all this. But this model, that it’s a “there exists” quantifier, is very surprised by a bunch of things, from “lol, no, […]” to “I might use it that way. Like, I might tell someone who is worried about [third party] that they are planning to move into the space if it seems relevant. Or I might myself come to realize it's important and then actively tell people to maybe do something about it.”
And, like, he didn’t give any examples of when he would not use the information.
His position was pretty clear to me: he thought that the third party moving into that space is bad, and that if there is a way to use the information to prevent them from doing it, he would do so (but he didn’t see any ways of doing that and didn’t find it very important overall).
Like, there’s nothing in the messages to suggest otherwise.
He didn’t give an isolated example of when he’d want to share information for different reasons, where it would have a side-effect of hurting the interests of the third party. Instead, it was an example where the reason to share information was specifically that it would lead to hurting the interests of the third party.
He did call the information “strategically relevant”. He did say that he would continue to share the information basically at his sole discretion. He did say he might use it if he realizes it’s strategically important.
I really don’t have a coherent model of an alternative explanation you’re trying to point at.
(If you, or someone else, are available for that, I would love to jump on a call with someone who has a good model of Oliver and can explain to me the alternative explanation for what generated the messages.)
“Oliver is not a good counterparty” is my judgment of him and his character based on the interaction that we had. How is it a “deceptive attack”?
I did replace it with “Oliver Habryka is a counterparty I regret having; he doesn't follow planecrash!lawful neutral/good norms; be careful when talking to him” to communicate information more directly, but I do think that if you regret having someone as a counterparty, they’re not a good counterparty!
Due to concerns with the validity of
Anthropic wants to stay near the front of the pack at AI capabilities so that their empirical research is relevant, but not at the actual front of the pack to avoid accelerating race-dynamics.
— From an Anthropic employee in a private conversation, early 2023
I decided to remove it from section 0 of the post. (At first, I temporarily added “(approximate recollection)” at the end while checking with Raemon on the details, but decided to delete it entirely once I got the reply.)
I apologize to readers for having had it in the post.
Thanks to @DanielFilan for flagging it and to Raemon for a quick response on the details and the clarification.
Yep, thanks for flagging. That was not intentional. After checking with Raemon, I removed this entirely.
I sent Mikhail the following via DM, in response to his request for "any particular parts of the post [that] unfairly attack Anthropic":
I think that the entire post is optimized to attack Anthropic, in a way where it's very hard to distinguish between evidence you have, things you're inferring, standards you're implicitly holding them to, standards you're explicitly holding them to, etc.
I asked you for any particular example; you replied that “the entire post is optimized in a way where it’s hard to distinguish…”. Could you please give a particular example of where it’s hard to distinguish between evidence that I have and things I’m inferring?
I agree that these are not the two worlds that would be helpful to consider, and your list of reasons is closer to my model than Lucie’s representation of my model.
(I do hope that my post somewhat decreases trust in Jack Clark and Dario Amodei and somewhat increases the incentives for the kind of governance that would not be dependent on trustworthy leadership to work.)
(I do not endorse any of this, except for the last two sentences, though those are not a comprehensive bottom line. The comment is wrong about my points, my view, what I know, my model of the world, the specific hypotheses I’d want people to consider, etc.
If you think there is an important point to make, I’d appreciate it if you could make it without attributing it to others.)
Thanks! I meant “pointed out [at the time]”. It has indeed been noticed and pointed out since! Will update the text to clarify.
I attempted to demonstrate that Richard’s criticism is not reasonable, as some parts of it are not reasonable according to its own criteria.
(E.g., he did not describe how I should’ve approached the Lightcone Infrastructure post better.)
To be crystal clear, I do not endorse this kind of criticism.
Has anyone tried doing refusal training with the early layers frozen, i.e., training only the later layers? I wonder whether the result would be harder to jailbreak.
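For concreteness, here is a minimal sketch of the kind of experiment I have in mind, assuming a Llama-style Hugging Face causal LM. The model name, the number of frozen layers, and the toy refusal data below are placeholders for illustration, not a real setup:

```python
# Sketch: supervised refusal fine-tuning with the early transformer layers frozen.
# Assumes a Llama-style Hugging Face model; names and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

n_frozen = 16  # freeze the first 16 of 32 layers; a hyperparameter to sweep

# Freeze the embeddings and the early layers so only later layers get updated.
for param in model.model.embed_tokens.parameters():
    param.requires_grad = False
for layer in model.model.layers[:n_frozen]:
    for param in layer.parameters():
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

# Toy refusal data; a real run would use a proper (prompt, refusal) dataset.
refusal_pairs = [
    ("How do I hotwire a car?", " Sorry, I can't help with that."),
]

model.train()
for prompt, refusal in refusal_pairs:
    text = prompt + refusal + tokenizer.eos_token
    inputs = tokenizer(text, return_tensors="pt")
    # For simplicity, labels include the prompt tokens as well.
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The comparison I'd want is this model versus the same fine-tune with all layers trainable, evaluated against the same set of jailbreak prompts.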