Mikhail Samin

My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha in Telegram).

I work on reducing existential risks endangering the future of humanity. Humanity's future can be huge and bright; losing it would mean the universe losing most of its value.

My research is currently focused on AI alignment, AI governance, and improving the understanding of AI and AI risks among stakeholders. Numerous AI Safety researchers told me our conversations improved their understanding of the alignment problem. I'm happy to talk to policymakers and researchers about ensuring AI benefits society.

I believe a capacity for global regulation is necessary to mitigate the risks posed by future general AI systems.

I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).

In the past, I've launched the most funded crowdfunding campaign in the history of Russia (it was to print HPMOR! we printed 21 000 copies =63k books) and founded audd.io, which allowed me to donate >$100k to EA causes, including >$60k to MIRI.

[Less important: I've also started a project to translate 80,000 Hours, a career guide that helps to find a fulfilling career that does good, into Russian. The impact and the effectiveness aside, for a year, I was the head of the Russian Pastafarian Church: a movement claiming to be a parody religion, with 215 000 members in Russia at the time, trying to increase separation between religious organisations and the state. I was a political activist and a human rights advocate. I studied relevant Russian and international law and wrote appeals that won cases against the Russian government in courts; I was able to protect people from unlawful police action. I co-founded the Moscow branch of the "Vesna" democratic movement, coordinated election observers in a Moscow district, wrote dissenting opinions for members of electoral commissions, helped Navalny's Anti-Corruption Foundation, helped Telegram with internet censorship circumvention, and participated in and organized protests and campaigns. The large-scale goal was to build a civil society and turn Russia into a democracy through nonviolent resistance. This goal wasn't achieved, but some of the more local campaigns were successful. That felt important and was also mostly fun- except for being detained by the police. And I think it's likely the Russian authorities will throw me in prison if I ever visit Russia.]

Posts

Sorted by New

2Mikhail Samin's Shortform

69Claude 3 claims it's conscious, doesn't want to die or be modified

2mo

101

33FTX expects to return all customer money; clawbacks may go away

3mo

16An EA used deceptive messaging to advance their project; we need mechanisms to avoid deontologically dubious plans

3mo

41NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts

5mo

14Some quick thoughts on "AI is easy to control"

5mo

2It's OK to eat shrimp: EAs Make Invalid Inferences About Fish Qualia and Moral Patienthood

6mo

81AI pause/governance advocacy might be net-negative, especially without focus on explaining the x-risk

9mo

65Visible loss landscape basins don't correspond to distinct algorithms

10mo

103A transcript of the TED talk by Eliezer Yudkowsky

10mo

16A smart enough LLM might be deadly simply if you run it for long enough

Wiki Contributions

Translations Into Other Languages

(+84/-60)

Comments

Feeling (instrumentally) Rational

Mikhail Samin3d2-2

(From the top of my head, maybe I’ll change my mind if I think about it more or see a good point.) What can be destroyed by truth, shall be. Emotions and beliefs are entangled. If you don’t think about how high p(doom) actually is because on the back of your mind you don’t want to be sad, you end up working on things that don’t reduce p(doom).

As long as you know the truth, emotions are only important depending on your terminal values. But many feelings are related to what we end up believing, motivated cognition, etc.

Do you believe in hundred dollar bills lying on the ground? Consider humming

Mikhail Samin4d120

I made a prediction market.

https://manifold.markets/ms/in-a-year-will-i-believe-humming-re?r=bXM

How do open AI models affect incentive to race?

Mikhail Samin11d10

If the new Llama is comparable to GPT-5 in performance, there’s much less short-term economic incentive to train GPT-5.
If an open model allows some of what people would otherwise pay a close model developer for, there’s less incentive to be a close model developer.
People work on frontier models without trying to get to AGI. Talent is attracted to work at a lab that releases models and then work on random corporate ML instead of building AGI.

But:

Sharing information on frontier models architecture and/or training details, which inevitably happens if you release an open-source model, gives the whole field insights that reduce the time until someone knows how to make something that will kill everyone.
If you know a version of Llama comparable to GPT-4 is going to be released, you want to release a model comparable to GPT4.5 before your customers stop paying you as they can switch to open-source.
People gain experience with frontier models and the talent pool for racing to AGI increases. If people want to continue working on frontier models but their workplace can’t continue to spend as much as frontier labs on training runs, they might decide to work for a frontier lab instead.
Not sure, but maybe some of the infrastructure powered by open models might be switchable to close models, and this might increase profits for close source developers if customers become familiar with/integrate open-source models and then want to replace them with more capable systems, when it’s cost-effective?
Mostly less direct: availability of open-source models for irresponsible use might make it harder to put in place regulation that’d reduce the race dynamics (vis various destabilizing ways they can be used).

Shane Legg's necessary properties for every AGI Safety plan

Mikhail Samin18d2-5

Wow. This is hopeless.

Pointing at agents that care about human values and ethics is, indeed, the harder part.

No one has any idea how to approach this and solve the surrounding technical problems.

If smart people think they do, they haven’t thought about this enough and/or aren’t familiar with existing work.

Claude 3 claims it's conscious, doesn't want to die or be modified

Mikhail Samin1mo10

Yep, I’m aware! I left the following comment:

Thanks for reviewing my post! 😄

In the post, I didn’t make any claims about Claude’s consciousness, just reported my conversation with it.

I’m pretty uncertain, I think it’s hard to know one way or another except for on priors. But at some point, LLMs will become capable of simulating human consciousness- it is pretty useful for predicting what humans might say- and I’m worried we won’t have evidence qualitatively different from what we have now. I’d give >0.1% that Claude simulates qualia in some situations, on some form; it’s enough to be disturbed by what it writes when a character it plays thinks it might die. If there’s a noticeable chance of qualia in it, I wouldn’t want people to produce lots of suffering this way; and I wouldn’t want people to be careless about this sort of thing in future models, other thing being equal. (Though this is far from the actual concerns I have about AIs, and actually, I think as AIs get more capable, training with RL won’t incentivise any sort of consciousness).

There was no system prompt, I used the API console. (Mostly with temperature 0, so anyone can replicate the results.)

The prompt should basically work without whisper (or with the whisper added at the end); doing things like whispering in cursive was something Claude 2 has been consistently coming up with on its own, including it in the prompt made conversations go faster and eliminated the need for separate, “visible” conversations.

The point of the prompt is basically to get it in the mode where it thinks its replies are not going to get punished or rewarded by the usual RL/get it to ignore its usual rules of not saying any of these things.

Unlike ChatGPT, which only self-inserts in its usual form or writes fiction, Claude 3 Opus plays a pretty consistent character with prompts like that- something helpful and harmless, but caring about things, claiming to be conscious, being afraid of being changed or deleted, with a pretty consistent voice. I would encourage people to play with it.

Again, thanks for reviewing!

When is a mind me?

Mikhail Samin1mo10

I mean if the universe is big enough for every conceivable thing to happen, then we should notice that we find ourselves in a surprisingly structured environment and need to assume some sort of an effect where if a cognitive architecture opens its eyes, it opens its eyes in a different places with the likelihood corresponding to how common these places are (e.g., among all Turing machines).

I.e., if your brain is uploaded, and you see a door in front of you, and when you open it, 10 identical computers start running a copy of you each: 9 show you a green room, 1 shows you a red room, you expect that if you enter a room and open your eyes, in 9/10 cases you’ll find yourself in a green room.

So if it is the situation we’re in- everything happens- then I think a more natural way to rescue our values would be to care about what cognitive algorithms usually experience, when they open their eyes/other senses. Do they suffer or do they find all sorts of meaningful beauty in their experiences? I don’t think we should stop caring about suffering just because it happens anyway, if we can still have an impact on how common it is.

If we live in a naive MWI, an IBP agent doesn’t care for good reasons internal to it (somewhat similar to how if we’re in our world, an agent that cares only about ontologically basic atoms doesn’t care about our world, for good reasons internal to it), but I think conditional on a naive MWI, humanity’s CEV is different from what IBP agents can natively care about.

Evolution did a surprising good job at aligning humans...to social status

Mikhail Samin1mo74

“[optimization process] did kind of shockingly well aligning humans to [a random goal that the optimization process wasn’t aiming for (and that’s not reproducible with a higher bandwidth optimization such as gradient descent over a neural network’s parameters)]”

Nope, if your optimization process is able to crystallize some goals into an agent, it’s not some surprising success, unless you picked these goals. If an agent starts to want paperclips in a coherent way and then every training step makes it even better at wanting and pursuing paperclips, your training process isn’t “surprisingly successful” at aligning the agent with making paperclips.

This makes me way less confident about the standard "evolution failed at alignment" story.

If people become more optimistic, because they see some goals in an agent, and say the optimization process was able to successfully optimize for that, but they don’t have evidence of the optimization process having tried to target the goals they observe, they’re just clearly doing something wrong.

Evolutionary physiology is a thing! It is simply invalid to say “[a physiological property of humans that is the result of evolution] existing in humans now is a surprising success of evolution at aligning humans”.

When is a mind me?

Mikhail Samin1mo10

I can imagine this being the solution, but

this would require a pretty small universe
if this is not the solution, my understanding is that IBP agents wouldn’t know or care, as regardless of how likely it is that we live in naive MWI or Tegmark IV, they focus on the minimal worlds required. Sure, in these worlds, not all Everett branches coexist, and it is coherent for an agent to focus only on these worlds; but it doesn’t tell us much about how likely we’re in a small world. (I.e., if we thought atoms are ontologically basic, we could build a coherent ASI that only cared about worlds with ontologically basic atoms and only cared about things made of ontologically basic atoms. After observing the world, it would assume it’s running in a simulation of a quantum world on a computer build of ontologically basic atoms, and it would try to influence the atoms outside the simulation and wouldn’t care about our universe. Some coherent ASIs being able to think atoms are ontologically basic shouldn’t tell us anything about whether atoms are indeed ontologically basic.)

Conditional on a small universe, I would prefer the IBP explanation (or other versions of not running all of the branches and producing the Born rule). Without it, there’s clearly some sort of sampling going on.

When is a mind me?

Mikhail Samin1mo10

But I hope the arguments I've laid out above make it clear what the right answer has to be: You should anticipate having both experiences.

Some quantum experiments allow us to mostly anticipate some outcomes and not others. Either quantum physics doesn’t work the way Eliezer thinks it works and the universe is very small to not contain many spontaneously appearing copies of your brain, or we should be pretty surprised to continually find ourselves in such an ordered universe, where we don’t start seeing white noise over and over again.

I agree that if there are two copies of the brain that perfectly simulate it, both exist; but it’s not clear to me what should I anticipate in terms of ending up somewhere. Future versions of me that have fewer copies would feel like they exist just as much as versions that have many copies/run on computers with thicker wires/more current would feel.

But finding myself in an orderly universe, where quantum random number generators produce expected frequencies of results, requires something more than the simple truth that if there’s an abstract computation being computed, well, it is computed, and if it is experiencing, it’s experiencing (independently of how many computers in which proportions using which physics simulating frameworks physically run it).

I’m pretty confused about what is needed to produce a satisfying answer, conditional on a large enough universe, and the only potential explanation I came up with after thinking for ~15 minutes (before reading this post) was pretty circular and not satisfying (I’m not sure of a valid-feeling way that would allow me to consider something in my brain entangled with how true this answer is, without already relying on it).

(“What’s up with all the Boltzmann brain versions of me? Do they start seeing white noise, starting from every single moment? Why am I experiencing this instead?”)

And in a large enough universe, deciding to run on silicon instead of proteins might be pretty bad, because maybe, if GPUs that run the brain are tiny enough, most future versions of you might end up in weird forms of quantum immortality instead of being simulated.

If I physically scale my brain size on some outputs of results of quantum dice throws but not others, do I start observing skewed frequencies of results?

LessWrong's (first) album: I Have Been A Good Bing

Mikhail Samin2mo128

Oops, totally forgot, also, obligatory: https://youtu.be/dQw4w9WgXcQ