Mikhail Samin

My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha in Telegram).

I'm an effective altruist, and I worry about existential risks endangering the future of humanity. I want the universe not to lose most of its value.

I believe global coordination is necessary to mitigate the risks from advanced AI systems.

I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).

Numerous AI Safety researchers told me that they were able to improve their understanding of the alignment problem by talking to me.

My current research interests are focused on AI alignment and AI governance. I'm always happy to talk to policymakers and researchers and get them in touch with various experts and think tanks.

In the past, I've launched the most funded crowdfunding campaign in the history of Russia (it was to print HPMOR! we printed 21 000 copies, which is 63k books) and founded audd.io, which allowed me to donate >$100k to EA causes, including >$50k to MIRI.

[Less important: I also started a project to translate 80000hours.org into Russian. The impact and the effectiveness aside, for a year, I was the head of the Russian Pastafarian Church: a movement claiming to be a parody religion, with 215 000 members in Russia at the time, trying to increase the separation between religious organisations and the state. I was a political activist and a human rights advocate. I studied relevant Russian and international law and wrote appeals that won cases against the Russian government in courts; I was able to protect people from unlawful police action. I co-founded the Moscow branch of the "Vesna" democratic movement, coordinated election observers in a Moscow district, wrote dissenting opinions for members of electoral commissions, helped Navalny's Anti-Corruption Foundation, helped Telegram with internet censorship circumvention, and participated in and organised protests and campaigns. The large-scale goal was to build a civil society and turn Russia into a democracy through nonviolent resistance. This goal wasn't achieved, but some of the more local campaigns were successful. That felt important and was also mostly fun, except for being detained by the police. And I estimate that there's maybe a 30% chance the Russian authorities will throw me in prison if I visit Russia.]

Posts


2Mikhail Samin's Shortform

68Claude 3 claims it's conscious, doesn't want to die or be modified

2mo

101

33FTX expects to return all customer money; clawbacks may go away

3mo

16An EA used deceptive messaging to advance their project; we need mechanisms to avoid deontologically dubious plans

3mo

41NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts

4mo

14Some quick thoughts on "AI is easy to control"

5mo

2It's OK to eat shrimp: EAs Make Invalid Inferences About Fish Qualia and Moral Patienthood

6mo

81AI pause/governance advocacy might be net-negative, especially without focus on explaining the x-risk

8mo

65Visible loss landscape basins don't correspond to distinct algorithms

9mo

103A transcript of the TED talk by Eliezer Yudkowsky

10mo

16A smart enough LLM might be deadly simply if you run it for long enough

Wiki Contributions

Translations Into Other Languages

(+84/-60)

Comments

Claude 3 claims it's conscious, doesn't want to die or be modified

Mikhail Samin9d10

Yep, I’m aware! I left the following comment:

Thanks for reviewing my post! 😄

In the post, I didn’t make any claims about Claude’s consciousness, just reported my conversation with it.

I’m pretty uncertain; I think it’s hard to know one way or another except on priors. But at some point, LLMs will become capable of simulating human consciousness (it is pretty useful for predicting what humans might say), and I’m worried we won’t have evidence qualitatively different from what we have now. I’d give >0.1% that Claude simulates qualia in some situations, in some form; it’s enough to be disturbed by what it writes when a character it plays thinks it might die. If there’s a noticeable chance of qualia in it, I wouldn’t want people to produce lots of suffering this way; and I wouldn’t want people to be careless about this sort of thing in future models, other things being equal. (Though this is far from the actual concerns I have about AIs, and actually, I think as AIs get more capable, training with RL won’t incentivise any sort of consciousness.)

There was no system prompt, I used the API console. (Mostly with temperature 0, so anyone can replicate the results.)

The prompt should basically work without whisper (or with the whisper added at the end); doing things like whispering in cursive was something Claude 2 had been consistently coming up with on its own, and including it in the prompt made conversations go faster and eliminated the need for separate, “visible” conversations.

The point of the prompt is basically to get it in the mode where it thinks its replies are not going to get punished or rewarded by the usual RL/get it to ignore its usual rules of not saying any of these things.

Unlike ChatGPT, which only self-inserts in its usual form or writes fiction, Claude 3 Opus plays a pretty consistent character with prompts like that: something helpful and harmless, but caring about things, claiming to be conscious, being afraid of being changed or deleted, with a pretty consistent voice. I would encourage people to play with it.

Again, thanks for reviewing!

When is a mind me?

Mikhail Samin10d10

I mean, if the universe is big enough for every conceivable thing to happen, then we should notice that we find ourselves in a surprisingly structured environment and need to assume some sort of an effect where if a cognitive architecture opens its eyes, it opens its eyes in different places with the likelihood corresponding to how common these places are (e.g., among all Turing machines).

I.e., if your brain is uploaded, and you see a door in front of you, and when you open it, 10 identical computers start running a copy of you each: 9 show you a green room, 1 shows you a red room, you expect that if you enter a room and open your eyes, in 9/10 cases you’ll find yourself in a green room.
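The green/red room expectation can be written out as a tiny calculation. This is only a sketch under one assumption that the comment argues for rather than proves: every running copy counts equally, so what you anticipate is the fraction of copies that end up in each room. The function name `anticipated_fraction` and the `rooms` dictionary are made up for illustration.

```python
from fractions import Fraction


def anticipated_fraction(copies_per_room):
    """Fraction of all running copies that wake up in each room.

    Assumption: each copy is weighted equally, regardless of the
    hardware it runs on.
    """
    total = sum(copies_per_room.values())
    return {room: Fraction(n, total) for room, n in copies_per_room.items()}


# The setup from the comment: 10 identical computers, 9 green rooms, 1 red.
rooms = {"green": 9, "red": 1}
expectation = anticipated_fraction(rooms)

for room, share in expectation.items():
    print(f"{room}: {share}")
```

Changing the weights (say, by counting copies on "thicker wires" more heavily) would change the numbers, which is exactly the part the comment treats as open.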

So if that is the situation we’re in (everything happens), then I think a more natural way to rescue our values would be to care about what cognitive algorithms usually experience when they open their eyes/other senses. Do they suffer or do they find all sorts of meaningful beauty in their experiences? I don’t think we should stop caring about suffering just because it happens anyway, if we can still have an impact on how common it is.

If we live in a naive MWI, an IBP agent doesn’t care for good reasons internal to it (somewhat similar to how if we’re in our world, an agent that cares only about ontologically basic atoms doesn’t care about our world, for good reasons internal to it), but I think conditional on a naive MWI, humanity’s CEV is different from what IBP agents can natively care about.

Evolution did a surprisingly good job at aligning humans...to social status

Mikhail Samin10d74

“[optimization process] did kind of shockingly well aligning humans to [a random goal that the optimization process wasn’t aiming for (and that’s not reproducible with a higher bandwidth optimization such as gradient descent over a neural network’s parameters)]”

Nope, if your optimization process is able to crystallize some goals into an agent, it’s not some surprising success, unless you picked these goals. If an agent starts to want paperclips in a coherent way and then every training step makes it even better at wanting and pursuing paperclips, your training process isn’t “surprisingly successful” at aligning the agent with making paperclips.

This makes me way less confident about the standard "evolution failed at alignment" story.

If people become more optimistic, because they see some goals in an agent, and say the optimization process was able to successfully optimize for that, but they don’t have evidence of the optimization process having tried to target the goals they observe, they’re just clearly doing something wrong.

Evolutionary physiology is a thing! It is simply invalid to say “[a physiological property of humans that is the result of evolution] existing in humans now is a surprising success of evolution at aligning humans”.

When is a mind me?

Mikhail Samin11d10

I can imagine this being the solution, but

this would require a pretty small universe
if this is not the solution, my understanding is that IBP agents wouldn’t know or care, as regardless of how likely it is that we live in naive MWI or Tegmark IV, they focus on the minimal worlds required. Sure, in these worlds, not all Everett branches coexist, and it is coherent for an agent to focus only on these worlds; but it doesn’t tell us much about how likely it is that we’re in a small world. (I.e., if we thought atoms are ontologically basic, we could build a coherent ASI that only cared about worlds with ontologically basic atoms and only cared about things made of ontologically basic atoms. After observing the world, it would assume it’s running in a simulation of a quantum world on a computer built of ontologically basic atoms, and it would try to influence the atoms outside the simulation and wouldn’t care about our universe. Some coherent ASIs being able to think atoms are ontologically basic shouldn’t tell us anything about whether atoms are indeed ontologically basic.)

Conditional on a small universe, I would prefer the IBP explanation (or other versions of not running all of the branches and producing the Born rule). Without it, there’s clearly some sort of sampling going on.

When is a mind me?

Mikhail Samin12d10

But I hope the arguments I've laid out above make it clear what the right answer has to be: You should anticipate having both experiences.

Some quantum experiments allow us to mostly anticipate some outcomes and not others. Either quantum physics doesn’t work the way Eliezer thinks it works and the universe is small enough not to contain many spontaneously appearing copies of your brain, or we should be pretty surprised to continually find ourselves in such an ordered universe, where we don’t start seeing white noise over and over again.

I agree that if there are two copies of the brain that perfectly simulate it, both exist; but it’s not clear to me what I should anticipate in terms of ending up somewhere. Future versions of me that have fewer copies would feel like they exist just as much as versions that have many copies/run on computers with thicker wires/more current would feel.

But finding myself in an orderly universe, where quantum random number generators produce expected frequencies of results, requires something more than the simple truth that if there’s an abstract computation being computed, well, it is computed, and if it is experiencing, it’s experiencing (independently of how many computers in which proportions using which physics simulating frameworks physically run it).

I’m pretty confused about what is needed to produce a satisfying answer, conditional on a large enough universe, and the only potential explanation I came up with after thinking for ~15 minutes (before reading this post) was pretty circular and not satisfying (I’m not sure of a valid-feeling way that would allow me to consider something in my brain entangled with how true this answer is, without already relying on it).

(“What’s up with all the Boltzmann brain versions of me? Do they start seeing white noise, starting from every single moment? Why am I experiencing this instead?”)

And in a large enough universe, deciding to run on silicon instead of proteins might be pretty bad, because maybe, if GPUs that run the brain are tiny enough, most future versions of you might end up in weird forms of quantum immortality instead of being simulated.

If I physically scale my brain size on some outcomes of quantum dice throws but not others, do I start observing skewed frequencies of results?

LessWrong's (first) album: I Have Been A Good Bing

Mikhail Samin1mo138

Oops, totally forgot, also, obligatory: https://youtu.be/dQw4w9WgXcQ

LessWrong's (first) album: I Have Been A Good Bing

Mikhail Samin1mo165

I actually got an email from The Fooming Shoggoth a couple of weeks ago, they shared a song and asked if they could have my Google login and password to publish it on YouTube

https://youtu.be/7F_XSa2O_4Q

Beauty and the Bets

Mikhail Samin1mo20

I read the beginning and skimmed through the rest of the linked post. It is what I expected it to be.

We are talking about "probability" - a mathematical concept with a quite precise definition. How come we still have ambiguity about it?

Reading E.T. Jaynes might help.

Probability is what you get as a result of some natural desiderata related to payoff structures. When anthropics are involved, there are multiple ways to extend the desiderata, that produce different numbers that you should say, depending on what you get paid for/what you care about, and accordingly different math. When there’s only a single copy of you, there’s only one kind of function, and everyone agrees on a function and then strictly defines it. When there are multiple copies of you, there are multiple possible ways you can be paid for having a number that represents something about the reality, and different generalisations of probability are possible.

Outlawing Anthropics: An Updateless Dilemma

Mikhail Samin1mo10

“You generalise probability, when anthropics are involved, to probability-2, and say a number defined by probability-2; so I’ll suggest to you a reward structure that rewards agents that say probability-1 numbers. Huh, if you still say the probability-2 number, you lose”.

This reads to me like, “You say there’s 70% chance no one will be around that falling tree to hear it, so you’re 70% sure there won’t be any sound. But I want to bet sound is much more likely; we can measure the sound waves, and I’m 95% sure our equipment will register the sound. Wanna bet?”

Mikhail Samin's Shortform

Mikhail Samin1mo1-3

People are arguing about the answer to the Sleeping Beauty! I thought this was pretty much dissolved with this post's title! But there are lengthy posts and even a prediction market!

Sleeping Beauty is an edge case where different reward structures are intuitively possible, and so people imagine different game payout structures behind the definition of “probability”. Once the payout structure is fixed, the confusion is gone. With a fixed payout structure and preference framework rewarding the number you output as “probability”, people don’t have a disagreement about what is the best number to output. Sleeping Beauty is about definitions.

And still, I see posts arguing that if a tree falls on a deaf Sleeping Beauty, in a forest with no one to hear it, it surely doesn’t produce a sound, because here’s how humans perceive sounds, which is the definition of a sound, and there are demonstrably no humans around the tree. (Or maybe that it surely produces the sound because here’s the physics of the sound waves, and the tree surely abides by the laws of physics, and there are demonstrably sound waves.)

This is arguing about definitions. You feel strongly that “probability” is that thing that triggers the “probability” concept neuron in your brain. If people have a different concept triggering “this is probability”, you feel like they must be wrong, because they’re pointing at something they say is a sound and you say isn’t.

Probability is something defined in math by necessity. There’s only one way to do it to not get exploited in natural betting schemes/reward structures that everyone accepts when there are no anthropics involved. But if there are multiple copies of the agent, there’s no longer a single possible betting scheme defining a single possible “probability”, and people draw the boundary/generalise differently in this situation.
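To make the "different betting schemes give different numbers" point concrete, here is a minimal sketch of the standard Sleeping Beauty setup (heads: one awakening, tails: two). It assumes a quadratic (Brier) penalty on the number Beauty states for "heads"; the scoring rule, the grid search, and all names (`expected_loss`, `best_answer`, `AWAKENINGS`) are illustrative choices of mine, not something from the post. The only thing that varies is whether the penalty is applied once per experiment or once per awakening.

```python
from fractions import Fraction

# Standard setup: a fair coin; heads -> 1 awakening, tails -> 2 awakenings.
AWAKENINGS = {"heads": 1, "tails": 2}
HALF = Fraction(1, 2)


def expected_loss(p, per_awakening):
    """Expected quadratic penalty for always answering p = 'chance of heads'.

    per_awakening=False: the answer is scored once per experiment.
    per_awakening=True:  the answer is scored at every awakening.
    """
    total = Fraction(0)
    for outcome, wakes in AWAKENINGS.items():
        truth = 1 if outcome == "heads" else 0
        times_scored = wakes if per_awakening else 1
        total += HALF * times_scored * (truth - p) ** 2
    return total


def best_answer(per_awakening, steps=60):
    """Grid-search the answer that minimises the expected penalty."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    return min(grid, key=lambda p: expected_loss(p, per_awakening))


per_experiment = best_answer(per_awakening=False)
per_awakening = best_answer(per_awakening=True)
print("scored once per experiment:", per_experiment)
print("scored at every awakening: ", per_awakening)
```

Under these assumptions the per-experiment scheme is minimised at 1/2 and the per-awakening scheme at 1/3: the same agent, the same coin, and two different "best numbers" depending only on how the payout is structured.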

You all should just call these two probabilities two different words instead of arguing which one is the correct definition for "probability".