Mikhail Samin

My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha in Telegram). 

I work on reducing existential risks endangering the future of humanity. Humanity's future can be huge and bright; losing it would mean the universe losing most of its value.

My research is currently focused on AI alignment, AI governance, and improving the understanding of AI and AI risks among stakeholders. Numerous AI Safety researchers told me our conversations improved their understanding of the alignment problem. I'm happy to talk to policymakers and researchers about ensuring AI benefits society.

I believe a capacity for global regulation is necessary to mitigate the risks posed by future general AI systems.

I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).

In the past, I launched the most-funded crowdfunding campaign in the history of Russia (it was to print HPMOR! We printed 21,000 copies = 63k books) and founded audd.io, which allowed me to donate >$100k to EA causes, including >$60k to MIRI.

[Less important: I've also started a project to translate 80,000 Hours, a career guide that helps people find a fulfilling career that does good, into Russian. Impact and effectiveness aside, for a year I was the head of the Russian Pastafarian Church: a movement claiming to be a parody religion, with 215,000 members in Russia at the time, trying to increase the separation between religious organisations and the state. I was a political activist and a human rights advocate. I studied relevant Russian and international law and wrote appeals that won cases against the Russian government in courts; I was able to protect people from unlawful police action. I co-founded the Moscow branch of the "Vesna" democratic movement, coordinated election observers in a Moscow district, wrote dissenting opinions for members of electoral commissions, helped Navalny's Anti-Corruption Foundation, helped Telegram with internet censorship circumvention, and participated in and organized protests and campaigns. The large-scale goal was to build a civil society and turn Russia into a democracy through nonviolent resistance. This goal wasn't achieved, but some of the more local campaigns were successful. That felt important and was also mostly fun, except for being detained by the police. And I think it's likely the Russian authorities will throw me in prison if I ever visit Russia.]

Comments

Edit: see https://www.lesswrong.com/posts/q8uNoJBgcpAe3bSBp/my-ai-model-delta-compared-to-yudkowsky?commentId=CixonSXNfLgAPh48Z and ignore the below.


This is not a doom story I expect Yudkowsky would tell or agree with.

  • Re: 1, I mostly expect Yudkowsky to think humans don’t have any bargaining power anyway, because humans can’t logically mutually cooperate this way/can’t logically depend on future AI’s decisions, and so AI won’t keep its bargains no matter how important human cooperation was.
  • Re: 2, I don’t expect Yudkowsky to think a smart AI wouldn’t be able to understand human value. The problem is making AI care.

On the rest of the doom story, assuming natural abstractions don’t fail the way you assume they do here, and instead things go the way Yudkowsky expects rather than the way you expect:

  • I’m not sure what exactly you mean by 3b, but I expect Yudkowsky not to say these words.
  • I don’t expect Yudkowsky to use the words you used for 3c. A more likely problem with corrigibility isn’t that it might be an unnatural concept but that it’s hard to arrive at stable corrigible agents with our current methods. I think he places a higher probability than you assume on corrigibility being a concept with a short description length, one that aliens would invent.
  • Sure, 3d just means that we haven’t solved alignment and haven’t correctly pointed at humans, and any incorrectness obviously blows up.
  • I don’t understand what you mean by 3e or what its relevance is here, and I wouldn’t expect Yudkowsky to say that.
  • I’d bet Yudkowsky won’t endorse 6.
  • Relatedly, a correctly CEV-aligned ASI won’t have ontology that we have, and sometimes this will mean we’ll need to figure out what we value. (https://arbital.greaterwrong.com/p/rescue_utility?l=3y6)

(I haven’t spoken to Yudkowsky about any of this; the above are quick thoughts off the top of my head, based on the impression I’ve formed from what Yudkowsky has written publicly.)

Treasure Island is available on YouTube with English subtitles in two parts: https://youtu.be/LUykh5-HGZ0 https://youtu.be/0lRIMn91dZU

Charodei? A Soviet fantasy romcom.

The premise: a magic research institute develops a magic wand, to be presented in a New Year’s Eve setting.

Very cute, featuring a bunch of songs, an SCP-style experience of one of the characters with the institute building, infinite Wish spells, and relationship drama.

I think I liked it a lot as a kid (and didn’t really like any of the other traditional New Year holiday movies).

Hey Max, great to see the sequence coming out!

My early thoughts (mostly re: the second post of the sequence):

It might be easier to get an AI to have a coherent understanding of corrigibility than of CEV. But I have no idea how you could possibly make the AI truly optimize for being corrigible, and not just on the surface level while its thoughts are being read. That seems maybe harder in some ways than with CEV, because with corrigibility, optimization processes seem to get attracted to things that are nearby but not it, and sovereign AIs don’t have that problem. Although we’ve got no idea how to do either, I think; and in general, corrigibility seems like an easier target for AI design, if not for SGD.

I’m somewhat worried about fictional evidence, even when it comes from a famous decision theorist, but I think you’ve read a story with a character who understood corrigibility increasingly well, both in the intuitive sense and then in some specific properties, and who on the surface level thought of themselves as very corrigible and trying to correct their flaws. But once they were confident their thoughts weren’t being read, with their intelligence increased, they defected, because on the deep level their cognition wasn’t that of a coherent corrigible agent; it was that of someone who plays corrigibility and shuts down anything else, all other thoughts, because any appearance of defecting thoughts would mean punishment and the impossibility of realizing their deep goals.

If we get mech interp to the point where we can reverse-engineer all of the models’ deep cognition, I think we should just write an AI from scratch, in code (after thinking very hard about all the interactions between all the components of the system), and not optimize it with gradient descent.

I say things along the lines of “Sorry, can you please repeat [that/the last sentence/the last 20 seconds/what you said after (description)]” very often. It feels very natural.

I realized that, as a non-native English speaker, I sometimes ask someone to repeat things because I didn’t recognize a word or something, so maybe in some situations uncertainty over the reason for asking (my hearing vs. zoning out vs. not understanding the point on the first try) makes it easier to ask, though often I just say that I missed what they were saying. I guess, when I sincerely want to understand the person I’m talking to, asking seems respectful and avoids wasting their time or skipping a point they make.

Occasionally, I’m not too interested in the conversation, and so I’m fine with just continuing to listen and not asking even if I missed some points. I think there are also situations when I talk to non-rationalists in settings where I don’t want to show conventional disrespect or affect the person’s status-feelings, and so if I miss a point that doesn’t seem too important, I sometimes end up not asking for conventional social reasons; but that’s very rare and seems hard to fix without shifting the equilibrium in the non-rationalist world.

I think jailbreaking is evidence against scalable oversight possibly working, but not against alignment properties. Like, if the model is trying to be helpful and doesn’t understand the situation well, saying “tell me how to hotwire a car or a million people will die” can get you car hotwiring instructions, but it doesn’t provide evidence about what the model will be trying to do as it gets smarter.

Thanks, that resolved the confusion!

Yeah, my issue with the post is mostly that the author presents the points he makes, including the idea that a superintelligence will be able to understand our values, as somehow contradicting/arguing against the sharp left turn being a problem.

Hmm, I’m confused. Can you say what you consider to be valid in the blog post above (some specific points, or the whole thing)? The blog post seems to me to reply to claims the author imagines Nate making, even though Nate doesn’t actually make these claims and sometimes probably holds the very opposite of the view the author imagines the Sharp Left Turn post to represent.

This post argues in a very invalid way that outer alignment isn’t a problem. It says nothing about the sharp left turn, as the author does not understand what the sharp left turn difficulty is about.

the idea of ‘capabilities generalizing further than alignment’ is central

It is one of the central problems; it is not the central idea behind the doom arguments. See AGI Ruin for the doom arguments, many of them disjoint.

reward modelling or ability to judge outcomes is likely actually easy

It would seem easy for a superintelligent AI to predict the rewards given out by humans very accurately and to generalize that predictive capability well from a small set of examples. Issue #1 here is that things from blackmail to brain-hacking to finding vulnerabilities in between the human and the update you get might predictably earn a very high reward, and a very good RLHF reward model doesn’t get you anything like alignment even if the reward is genuinely pursued. Even a perfect predictor of how a human judges an outcome, if optimized for, does something horrible instead of CEV. Issue #2 is that smart agents trying to maximize paperclips will perform on whatever they understand is giving out rewards just as well as agents trying to maximize humanity’s CEV, so SGD doesn’t discriminate between the two: it optimizes for a good reward model and an ability to pursue goals, but not for the agent pursuing the right kind of goal (and, again, getting the maximum score from a “human feedback” predictor is the kind of goal that kills you anyway, even if genuinely pursued).

(The next three points in the post seem covered by the above or irrelevant.)

Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa

The problem isn’t that the AI won’t know what humans want or won’t predict what reward signal it’ll get; the issue is that it’s not going to care, and we don’t even know what kind of “care” we could attempt to point to; and the reward signal we know how to provide gets us killed if optimized for well enough.

None of that is related to the sharp left turn difficulty, and I don’t think the post author understands it at all. (To their credit, many people in the community also don’t understand it.)

Values are relatively computationally simple

Irrelevant, but a sad-funny claim (go read Arbital, I guess?).

I wouldn’t claim it’s necessarily more complex than best-human-level agency; it’s maybe not, if you’re smart about pointing at things (e.g., “the CEV of humans” seems less complex than a specific description of what those values would be), but the actual description of value is very complex. We feel otherwise, but that is an illusion, on so many levels. See dozens of related posts and articles, from complexity of value to https://arbital.greaterwrong.com/p/rescue_utility?l=3y6.

the idea that our AI systems will be unable to understand our values as they grow in capabilities

Yep, this idea is very clearly very wrong.

I’m happy to bet that the author of the sharp left turn post, Nate Soares, will say he disagrees with this idea. People who think Soares or Yudkowsky claims this either didn’t actually read what they write or failed badly at reading comprehension.