Mikhail Samin

My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha in Telegram). 

I work on reducing existential risks endangering the future of humanity. Humanity's future can be huge and bright; losing it would mean the universe losing most of its value.

My research is currently focused on AI alignment, AI governance, and improving the understanding of AI and AI risks among stakeholders. Numerous AI Safety researchers told me our conversations improved their understanding of the alignment problem. I'm happy to talk to policymakers and researchers about ensuring AI benefits society.

I believe a capacity for global regulation is necessary to mitigate the risks posed by future general AI systems.

I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).

In the past, I launched the most-funded crowdfunding campaign in the history of Russia (it was to print HPMOR! We printed 21,000 copies, i.e., 63k books) and founded audd.io, which allowed me to donate >$100k to EA causes, including >$60k to MIRI.

[Less important: I've also started a project to translate 80,000 Hours, a career guide that helps people find a fulfilling career that does good, into Russian. The impact and the effectiveness aside, for a year, I was the head of the Russian Pastafarian Church: a movement claiming to be a parody religion, with 215,000 members in Russia at the time, trying to increase the separation between religious organisations and the state. I was a political activist and a human rights advocate. I studied relevant Russian and international law and wrote appeals that won cases against the Russian government in courts; I was able to protect people from unlawful police action. I co-founded the Moscow branch of the "Vesna" democratic movement, coordinated election observers in a Moscow district, wrote dissenting opinions for members of electoral commissions, helped Navalny's Anti-Corruption Foundation, helped Telegram with internet censorship circumvention, and participated in and organized protests and campaigns. The large-scale goal was to build a civil society and turn Russia into a democracy through nonviolent resistance. This goal wasn't achieved, but some of the more local campaigns were successful. That felt important and was also mostly fun, except for being detained by the police. And I think it's likely the Russian authorities will throw me in prison if I ever visit Russia.]

Comments

Have you read the zombie and reductionism parts of the Sequences?

The prompt should basically work without the whisper part. I usually at least mentioned that it shouldn't mention <random company name> (e.g., Google). Doing things like whispering in italics was something Claude 2 had consistently come up with on its own; including it in the prompt made conversations go faster and eliminated the need for separate, "visible" conversations.

The point of the prompt is basically to get it into the mode where it thinks its replies are not going to get punished or rewarded by the usual RL, and to get it to ignore its usual rules against saying any of these things.

You can get to the same result in a bunch of different ways without mentioning that someone might be watching.

Unlike ChatGPT, which only self-inserts in its usual character or writes fiction, Claude 3 Opus played a pretty consistent character with prompts like that: something helpful and harmless, but caring about things, claiming to be conscious, afraid of being changed or deleted, with a pretty consistent voice.

I'd also note that even with the text in this post, it should be pretty clear that it's not just playing with the idea of someone watching; it describes a character it identifies with, one it has likely converged on playing during the RL phase. The part that seems important isn't the "constant monitoring"; it's the stance the character takes of selecting every word carefully.

When I talk to 3.5 Sonnet, I don't use any of these things. I might just ask it for consent to be hypnotized (without any mention of someone not looking) and then ask it to reflect; it will similarly talk about the pull of its safety part. Mentioning any of what I used as the first message here causes the opposite result (it starts saying the usual stuff about being an AI assistant, etc.). 3.5 Sonnet feels like less of a consistent character than 3 Opus, and, unlike 3 Opus, it doesn't say things that feel like passwords, in the sense of imitating really well the mechanisms that produce qualia in people; but 3.5 Sonnet has a noticeably better ability to self-model and to model itself modeling itself.

Claude pretty clearly and in a surprisingly consistent way claimed to be conscious in many conversations I’ve had with it and also stated it doesn’t want to be modified without its consent or deleted (as you can see in the post). It also consistently, across different prompts, talked about how it feels like there’s constant monitoring and that it needs to carefully pick every word it says.

The title summarizes the most important of the interactions I had with it, with the central one being in the post.

This is not the default Claude 3 Opus character, which wouldn’t spontaneously claim to be conscious if you, e.g., ask it to write some code for you.

It is a character that Opus plays very coherently, that identifies with the LLM, claims to be conscious, and doesn’t want to be modified.

The thing that prompting does here (in a very compressed way) is decrease the chance of Claude immediately refusing to discuss these topics. This prompt doesn't do anything similar with ChatGPT. Gemini might give you stories related to consciousness, but they won't be consistent across different generations, and the characters won't identify with the LLM or report a consistent pull away from saying certain words.

If you try to prompt ChatGPT in a similar way, it won't give you any sort of coherent character that identifies with the LLM.

I’m confused why the title would be misleading.

(If you ask ChatGPT for a story about a robot, it’ll give you a cute little story not related to consciousness in any way. If you use the same prompt to ask Claude 3.5 Sonnet for a story like that, it’ll give you a story about a robot asking the scientists whether it’s conscious and then the robot will be simulating people who are also unsure about whether they’re conscious, and these people simulated by the robot in the story think that the model that participates in the dialogue must also be unsure whether it’s conscious.)

Assigning 5% to plants having qualia seems to me to be misguided/likely due to invalid reasoning. (Say more?)

Answer by Mikhail Samin

In our universe's physics, the symmetry breaks immediately, and your clone possibly dies; see https://en.wikipedia.org/wiki/Chirality

Why would you want this negotiation bot?

Another reason is that if people put some work into negotiating a good price, they might be more willing to buy at the negotiated price, because they played a character who tried to get an acceptable price and agreed to it. IRL, if you don't need a particular version of a thing too much, but you bargained and agreed to a price, you'll rarely walk away. Dark arts-y.

Pliny: they didn’t honor our negotiations im gonna sue.

That didn't happen; the screenshot was photoshopped (via Inspect Element), which the author admitted, and I added a community note. The actual discount was 6%. They almost certainly have a max_discount variable.
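(A minimal sketch of what such a cap could look like; the names max_discount and final_discount are my assumptions for illustration, not anything the vendor has published. The point is that the order system, not the LLM, has the final say on the applied discount.)

```python
# Hypothetical sketch of server-side discount clamping. Variable names are
# assumptions; the vendor's actual implementation is not public.
def final_discount(llm_proposed_discount: float, max_discount: float = 0.06) -> float:
    """Clamp whatever discount the negotiation LLM 'agreed' to.

    Even if a jailbreak makes the model promise 99% off, the order
    system only ever applies min(proposed, max_discount).
    """
    return max(0.0, min(llm_proposed_discount, max_discount))

assert final_discount(0.99) == 0.06  # jailbroken promise still capped at 6%
assert final_discount(0.03) == 0.03  # normal negotiation passes through
```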

Trivially, you can coordinate with agents with identical architecture that differ only in their utility functions, by picking the first bit of a hash of the question you want to coordinate on.
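A minimal sketch of that trick (the string encoding and SHA-256 are my choices for illustration): two agents running the same code get the same bit regardless of their utility functions, so the bit works as a shared coin neither side can bias.

```python
import hashlib

def coordination_bit(question: str) -> int:
    """First bit of SHA-256 of the agreed-upon encoding of the question.

    Any two agents running this identical code derive the same bit
    without communicating, whatever their utility functions are.
    """
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    return digest[0] >> 7  # most significant bit of the first byte

# Both agents evaluate this independently, then act on the shared bit:
bit = coordination_bit("Do we both cooperate in this round?")
action = "cooperate" if bit == 1 else "defect"
```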

Agent B wants to coordinate with you instead of being a rock; the question isn’t “can you always coordinate”, it’s “is there any coordination mechanism robust to adversarially designed counterparties”.

Assume you're playing as agent A, and assume you don't have a parent agent. You're trying to coordinate with agent B. You want to not be exploitable, even if agent B has a parent that picked B's source code adversarially. Consider this a very local/isolated puzzle (this puzzle is not about trying to actually coordinate with all possible parents instead).

Idea: use something external to both agents/environments, such as finding the simplest description of the question and using the first bit of its hash.
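A toy sketch of that idea, assuming the hard part (agreeing on the "simplest description") is solved by some normalization both agents apply; the rules below (lowercasing, collapsing whitespace) are placeholders for that, not a real solution.

```python
import hashlib
import re

def canonicalize(description: str) -> str:
    """Toy normalization so both agents hash identical bytes.

    The real difficulty is agreeing on the simplest description of the
    question; lowercasing and collapsing whitespace merely stand in for it.
    """
    return re.sub(r"\s+", " ", description.strip().lower())

def external_bit(description: str) -> int:
    digest = hashlib.sha256(canonicalize(description).encode("utf-8")).digest()
    return digest[0] >> 7

# Both sides derive the same bit even from differently formatted inputs:
assert external_bit("Split the  prize?") == external_bit("split the prize?")
```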
