Mikhail Samin

My name is Mikhail Samin / Михаил Самин (diminutive Misha / Миша, sometimes @Mihonarium on social media).
I'm an effective altruist, I worry about the future of humanity and want the universe not to lose most of its value.

I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).
It seems that I have good intuitions about the AI alignment problem; some full-time alignment researchers told me that they were able to improve their understanding of the problem after talking to me.

I'm currently doing EA & AI alignment outreach (e.g., I'm organising a translation of 80,000 Hours' Key Ideas series and partnering with Vert Dider to translate and dub Rob Miles' videos) and considering switching to direct alignment research.

In the past, I launched the most-funded crowdfunding campaign in the history of Russia (it was to print HPMOR! We printed 21,000 copies, each a three-volume set, so 63k physical books) and founded audd.io, which has allowed me to donate >$50k to MIRI.

Thanks for the comment! (And sorry for the delayed reply: I was at a CFAR workshop when I posted this, and took some days off after the workshop.)

The text's target audience was two people whom I'd expect to understand my intuitions, so I didn't attempt to fully justify some of the claims. I've added a note about that to the post. I also apologize for the post title: it's a claim the text doesn't justify, and it represents my views less well. I stand by the other three mentioned claims. I'm not sure where exactly the crux lies, though. I'd be interested in having a call for higher-bandwidth communication. Feel free to schedule (30 min | 60 min).

I'll try to dump some chunks of my model that seem relevant and might help clarify where the cruxes lie.

Do you expect that if the training setup described in the post produces a superhuman-level AGI, it does not kill everyone?

My picture is, roughly:

  • We're interested in where we end up when we get to a superhuman AI capable enough to prevent other AGIs from appearing until alignment is solved.
  • There's a large class of cognitive architectures that kill everyone.
    • A lot of cognitive architectures, when implemented (in a neural network, or in a neural network plus something external) and placed into any of a large class of training setups, would score well[1] on any of a broad class of loss functions, quickly become highly context-aware, and go do their thing in the real world (e.g., kill everyone).
    • There's a smaller class of cognitive architectures that don't kill everyone and would allow humanity to solve alignment and launch an aligned AGI.
  • We need to think about what cognitive architecture we're aiming for, and we need concrete stories for why we succeed at getting there. How a thing with bits of the final structure behaves before it reaches a superhuman level matters only insofar as it helps us hit a tiny target.
  • (Not as certain, but part of my current picture.) If we have a neural network implementing a cognitive architecture that a superhuman AI might have, gradient descent on the loss functions of a broad range of training setups won't change that cognitive architecture or its preferences much.
  • We need a concrete story for why our training setup ends up at a cognitive architecture that has a highly specific set of goals such that it doesn't kill everyone.
  • Behavior we observe before a superhuman cognitive structure plays a major role in producing it gives little evidence about what goals that structure might have.
    • (Not certain; depends on how general grokking is.) The gradient might be pointing directly at some highly capable cognitive structures!
  • You need really good reasons to expect the cognitive structure you end up with to be something that doesn't end humanity. Before there's a superhuman-level cognitive structure, the circuits you can notice in the neural network don't tell you what goals that cognitive architecture will end up pursuing upon reflection. In my view, this is closely related to argument 22 and the Sharp Left Turn. If you don't have strong reasons to believe you successfully dealt with those, you die.

I'm interested in hearing a concrete story here

A significant part of my point was that concrete stories are needed when you expect the process to succeed, not when you expect it to fail. There are things in the setup that clearly lead to failure. I ignored some of them (e.g., RLHF) and pointed at others: the setup incentivizes being more agentic and context-aware, while nothing in it incentivizes ending up with exactly the kind of concrete goals whose maximization leaves survivors.

I specifically meant that, generally, when there are highly agentic and context-aware regions in a nearby part of the gradient landscape, gradient descent will update the weights towards them, slowly moving towards installing a capable cognitive architecture. In the specific training setup I described, there's a lot of pressure towards being more agentic: if you start rewarding an LLM for where the text ends up, variations of that LLM that are more focused on getting to a result will be selected for. If you haven't come up with a way to point this process at one of the rare cognitive architectures that leave survivors, it won't find one. The capabilities will generalize to acting effectively to steer the lightcone's future. There are reasons for unaligned goals to be closer to what the AGI ends up with than aligned goals, and there's no reason for aligned behavior to generalize exactly the way you imagined.
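As a toy illustration of the selection-pressure claim (my own sketch, not anything from the post, and not a model of gradient descent itself): if variants of a policy differ in how persistently they steer towards the rewarded outcome, then purely outcome-based selection shifts the population towards the more goal-directed variants, with no one ever asking for agenticness explicitly.

```python
import random

random.seed(0)

def reward(agenticness: float) -> float:
    # Outcome-based reward: the toy model's assumption is simply that
    # more goal-directed variants reach the rewarded outcome more reliably.
    return agenticness

# Start with a population of mostly non-agentic policy variants,
# each summarized by a single "agenticness" parameter in [0, 1].
population = [random.random() * 0.2 for _ in range(200)]

for _ in range(50):
    # Truncation selection stands in for the optimizer preferring
    # higher-reward variants: keep the better-rewarded half...
    survivors = sorted(population, key=reward, reverse=True)[:100]
    # ...and refill the population with two noisily mutated copies of each.
    population = [
        min(1.0, max(0.0, a + random.gauss(0, 0.05)))
        for a in survivors
        for _ in range(2)
    ]

mean_agenticness = sum(population) / len(population)
print(f"mean agenticness after outcome-based selection: {mean_agenticness:.2f}")
```

The reward function never mentions agenticness; it only scores outcomes, yet the population drifts towards the agentic end of the parameter range anyway.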

If I'm choosing futures on the basis of whether I think they lead to lots of diamonds, why do I need to keep improving that value in order to keep wanting to make diamonds?

I'm not aware of a way to select for features leading to lots of diamonds once those features are present in a superhuman AGI. If you do RL, the story I imagine is something like: "For most loss functions/training processes you can realistically come up with, there are many goals such that pursuing them leads to the behavior you evaluate highly; a small fraction of these goals amount to wanting lots of conventional diamonds in the real universe; the agents you find maximize some random mixture of these goals (with more weight on goals that are less complex, or that can emerge more easily from the initial heuristics, or such that directly optimizing for them performs better on your loss); diamond-maximization-in-the-actual-universe probably isn't a significant part of that mixture unless you do something really smart beyond what I think the field is on track to achieve; and even if it is, it breaks when the sharp left turn happens."

Human values are even more complicated than diamonds, though with them it might be easier to come up with a training process where you miss the difference between what's actually simple and correlated and what you merely think is simple and correlated. I believe the iterative process the field is engaged in here mostly searches for training setups whose failure modes we can't find, and most of those setups still fail. Because of that, I think we need a really good, probably formal understanding of what it is we want to end up with, and that understanding should produce strong constraints on what a training process for an aligned AGI might look like, which would then hopefully give us insights into how people should build the thing. We have almost none of that kind of research, with only infra-Bayesianism currently directly attacking it AFAIK, and I'd really like to see more somewhat promising attempts at this.

Maybe this is coming at alignment stories from the opposite direction: I think the question of where we end up and how we get there is far more important to think about than things like "here's a story of what path gradient descent takes and why".

  1. ^

    Not important, but for the sake of completeness: an AGI might instead, e.g., look around and hack whatever it's running on without having to score well.

There's a scan of 1 mm^3 of a human brain: 1.4 petabytes, with over a hundred million synapses.
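For scale, a back-of-the-envelope extrapolation (my own arithmetic, not from the post, assuming a textbook whole-brain volume of roughly 1,200 cm^3 and the same resolution throughout):

```python
# Extrapolate the 1 mm^3 scan's data size to a whole human brain.
scan_pb_per_mm3 = 1.4        # petabytes per cubic millimeter, from the scan
brain_volume_mm3 = 1.2e6     # ~1,200 cm^3, an assumed average brain volume

total_pb = scan_pb_per_mm3 * brain_volume_mm3  # 1.68 million petabytes
total_zb = total_pb / 1e6                      # 1 zettabyte = 1e6 petabytes
print(f"whole-brain scan at this resolution: ~{total_zb:.1f} zettabytes")
# → whole-brain scan at this resolution: ~1.7 zettabytes
```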