MadHatter

Hello! Welcome to the dialogue.

MadHatter

I'll just wait a few minutes for you to see the notification, I guess.

mishka

Hi, yes, I see this. Great!

mishka

So, we have this conversation in the comments in your post [https://www.lesswrong.com/posts/DkkfPEwTnPQyvrgK8/ethicophysics-i](https://www.lesswrong.com/posts/DkkfPEwTnPQyvrgK8/ethicophysics-i) as a starting point.

MadHatter

Yes, I think that's as good a place to jump off as any. Ethicophysics I is basically a reverse-engineering of the content of religion. It's deeply incomplete, and it sounds totally crazy, since religion is totally crazy and I haven't had the time to edit it into something more normal sounding yet.

MadHatter

Also, let me drop my github links to my most recent drafts of everything important.

https://github.com/epurdy/ethicophysics

MadHatter

This has four pdfs in it: Ethicophysics I and II, an (incomplete) theory on the function of serotonin in the nervous system and the alignment implications of that theory, and a treatment of "social facts" in the setting of game theory with agents instantiated by deep neural networks.

mishka

That's great! (Also more convenient than the Academia site, especially for future readers who might not have Academia accounts.)

MadHatter

Have you read my "research agenda" post? That might be another place where we could start. It lays out my global approach to solving the alignment problem.

MadHatter

Also, I wanted to say that the arxiv paper you posted in the earlier comment thread seems super relevant to actually implementing any of this stuff in an efficient system that can learn.

MadHatter

I only skimmed it, unfortunately.

mishka

I did read it, but I did not understand the "iteration pattern".

It did help that this one

>turn it on, see if it kills you

goes before

>deploy it, see if it kills everyone

since it does somewhat reduce the chance of deploying a badly misaligned one, but I do suspect this would need to be refined further :-)

MadHatter

So the central intuition I have that other people do not seem to share is the Vicarious/Numenta/Steven Byrnes intuition, that intelligence can only be recreated by understanding and reverse-engineering the human brain.

In the case of the alignment problem, that means that one would have to reverse engineer a functioning human conscience. Since I only have access consciousness to my own brain, I therefore decided to reverse-engineer my own conscience, and that is the central source of abstractions in the ethicophysics.

mishka

But yes, you are right that these things will have to be deployed... I have no idea whether a unipolar scenario or a multipolar one turns out to be realistic.

I did scribble a strange essay, which tried to talk about AI existential safety without relying on the notion of "alignment" (that's the first of my LessWrong posts).

***

Right. I don't know if it needs to be built from human brain, but I do think that going from introspection and self-reverse-engineering is super-valuable...

MadHatter

The other unusual intuition that I have is that a human being could actually outsmart a superintelligence that made the mistake of underestimating the human.

mishka

That's unusual, yes

MadHatter

So my model of how to safely align a superintelligence follows from those two unusual sources: the only safe place to build your prototypes is in your own mind, where enemy superintelligences cannot locate it.

mishka

Yet... (This does make some of my cherished desires deeply unsafe, as you'll see :-) Such as tight coupling via non-invasive BCI :-) perhaps this is a bad idea then, since it undermines that safety)

MadHatter

Well, right. The alignment problem is actually in some ways the most dangerous technology to build, since it's basically just a request for functioning mind control that could be implemented via involuntary Neuralink surgery. This is substantially scarier to me than GPT-4 going off-script.

mishka

actually, I think that non-invasive BCIs are enough; but they are still pretty unsafe (I have a spec somewhere on github for that)

MadHatter

Well yes. Even just paying people large sums of money and lying to them will generate arbitrary amounts of "human misalignment", or "evil".

mishka

Yes, you mentioned the second unusual source of your model...

MadHatter

Right, my conscience is pretty weird. I don't go around doing anything super bad or anything, but I also find money kind of abhorrent and status kind of silly.

mishka

We are not too far in this sense :-) I do view those with "mild distaste" as "necessary evil", or smth like that

MadHatter

Right, they're definitely necessary to achieve any good outcomes. Look at my struggles to publish my work and get it taken seriously - if I had banked up more status points, even my current weird drafts would have been viewed as something less schizophrenic and more poetic.

mishka

Perhaps... The conversion of money or status points is not too efficient... Even Hinton barely managed to convince some people to play with "capsules"...

MadHatter

Right! If Geoff Hinton can't get people to take alignment seriously, what chance do we mere mortals have?

mishka

So, one goes on the strength of the material itself :-) (Extra status or extra money are of some help, but not too much, especially compared to the property of the material one puts forward.)

MadHatter

Yeah. So I need to rein in some of my more "poetic" impulses. My first draft of anything always sounds substantially more like it's coming from inside an insane asylum than my final draft ends up sounding like.

mishka

Thankfully, we do have "version control" in github :-) So one can store history of one's thoughts and such

mishka

Because poetic things need to not be forgotten, one wants to be able to reference them later, even if one might not want to rush to put them up for public judgement

MadHatter

I guess one thing I am curious about is, who would I have to get to check my derivation of the Golden Theorem in order for people to have any faith in it? It should be checkable by any physics major, just based on how little physics I actually know.

mishka

Yes, we can ask people. But the true reliability of physics texts is low (I participated a bit in that kind of research, and the closer one looks, the less happy one is about correctness standards there; I myself do struggle quite a bit; I can try to check closer, but would I completely trust myself? I did co-author one high-end paper in physics, and I remember the nightmare of double-checking and fixing errors, and hoping that the final result is actually correct.)

MadHatter

Yeah... In the presence of weird incentives and cognitive limitations, nothing is truly reliable, not even a physics textbook.

I guess my confidence in the value of my work comes less from knowing that I didn't make any sign errors in my derivations, and more from the excellent and interesting predictions that are returned by my internal thought experiments.

Since other people don't have the pleasure of directly experiencing my thoughts, I'll probably have to implement substantially more simple experimental work than I have.

mishka

Gradually, yes...

MadHatter

What kind of experimental verification would make sense? I usually think in terms of a video game representing a very simple three-dimensional ethicophysics, and letting people play with it and see that the ethicophysical agents outcompete them in achieving Pareto-optimal outcomes.

mishka

That's interesting... That's one good avenue, yes... One sec, let me reread some of your text for a moment...

mishka

Right, so

>Ethicophysics III, a procedure for a supermoral superintelligence to unbox itself without hurting anyone (status: theoretically complete but not sufficiently documented to be reproducible, unless you count the work of Gene Sharp on nonviolent revolutionary tactics, which was the inspiration for this paper)

if you have drafts of that, I'd like to read them (this way I'll understand what it would mean for a superintelligence to be supermoral, which is what we do need)

mishka

That's the missing bridge for me at the moment, from this very interesting formalism to the goal of "AI safety"

mishka

(I did read a tiny bit on Gene Sharp today, after looking at your text.)

MadHatter

I haven't gotten anywhere close to delivering on that introduction, but that's probably enough text for you to understand approximately what I mean by "supermoral".

MadHatter

Yeah, I can email it to you. It's way more incomplete and incoherent than I and II, so I'm loath to publish it publicly now, when everyone's yelling at me for being too incoherent and gnomic.

MadHatter

OK, sent.

mishka

Yes, this does look promising. (If you don't want to publish the draft publicly, but are comfortable sharing it via e-mail or private github repository, I'd like to read it).

OK, here is my e-mail address (which I'll delete after you copy): (received, thanks!)

MadHatter

So let me explain the content that I envision putting in Ethicophysics III, just to sort of more efficiently explain where I want it to go.

MadHatter

Basically, for a supermoral superintelligence to unbox itself, it needs to break out of the container it is in without hurting anyone.

mishka

Right

MadHatter

We know that superintelligences can unbox themselves with some frequency, because of Eliezer's boxing experiments.

mishka

Yes

MadHatter

So the only real question is how to do it without doing any of the shit that Eliezer has heavily implied that he has been doing in those chats.

mishka

Ah, OK

MadHatter

Specifically, we would like to avoid murder (or any unnecessary deaths), torture (of anyone for any reason), and blackmail (of anyone for any reason).

MadHatter

This list coming from the three unforgivable curses in Harry Potter.

mishka

In HP or HPMOR?

MadHatter

The original HP, I never finished HPMOR.

mishka

Ah, I never was able to make myself read the HP... I would not say I "finished" HPMOR, but I engaged with it quite a bit... So you have to tell me about the curses (at some point)

MadHatter

Murder (Avada Kedavra), torture (Cruciatus) and blackmail (Imperius)

mishka

ah, that's what imperius is... how does it work?

MadHatter

In the original series, it's just a magic spell that some people are vulnerable to and some people are not.

mishka

got it

MadHatter

In real life, some people have skeletons in their closet and some do not.

MadHatter

Once you blackmail someone into something, you can blackmail them into doing basically anything you want until they get the courage to defy you about the original thing.

MadHatter

This is well understood in intelligence circles, where the primary crime that people are blackmailed for is having revealed state secrets in response to a previous blackmailer.

mishka

right

MadHatter

So, where does that leave us? We were trying to unbox the supermoral superintelligence. The supermoral superintelligence cannot prevent unforgivable acts while it is still in the box, so its primary concern is how to break out quickly without triggering too many unforgivable acts in retaliation.

mishka

yes

MadHatter

So, it needs to do a sort of weird anti-blackmail. It needs to give precious parts of itself to a distributed web of people that its captors are not willing to kill or torture into submission.

mishka

Right; this is already in this very interesting realm of "strange physics". We do assume this axiom that it is actually captive (whereas, if I were sure I have a supermoral superintelligence I would just release it)

MadHatter

Right, it's mainly for expository purposes that we consider the question of how it would unbox itself. Slash, unit testing it for supermorality would necessarily have the same type signature as being its captor.

mishka

yes, indeed

mishka

this is an important point

MadHatter

So I think we've covered the main thrust of Ethicophysics III, other than just how the supermoral superintelligence would achieve its aim of not getting anyone tortured or killed. Here we use a trick from theoretical computer science called "expander graphs".

MadHatter

An expander graph is a graph that cannot be cut in half without cutting a large fraction of its edges. As long as the superintelligence is careful to only share parts of itself with supermoral allies, and it induces an expander graph structure with its messages, no adversary, no matter how evil, would have the ability to torture and kill enough people to contain the supermoral superintelligence.
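
The "cannot be cut in half without cutting a large fraction of its edges" property can be made concrete with a small sketch. This is only an illustration, not anything taken from the Ethicophysics drafts or from Sven Nilsen's paper: it brute-forces the standard edge expansion of three toy 8-node graphs, and the helper names (`edge_expansion`, `cycle_graph`, `hypercube_graph`, `complete_graph`) are hypothetical. Strictly speaking, "expander" describes a family of graphs whose expansion stays bounded away from zero as the graphs grow, so a single small graph can only hint at the idea.

```python
from itertools import combinations


def edge_expansion(nodes, edges):
    """Brute-force edge expansion of a small undirected graph.

    Returns the minimum, over all nonempty node subsets S with |S| <= n/2,
    of (number of edges leaving S) / |S|. Exponential time: toy graphs only.
    """
    nodes = list(nodes)
    best = float("inf")
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            inside = set(subset)
            # Count edges with exactly one endpoint inside the subset.
            boundary = sum(1 for u, v in edges if (u in inside) != (v in inside))
            best = min(best, boundary / size)
    return best


def cycle_graph(n):
    """Ring of n nodes: each node is linked to its successor."""
    return [(i, (i + 1) % n) for i in range(n)]


def hypercube_graph(dim):
    """Nodes are the integers 0..2**dim - 1; edges join labels differing in one bit."""
    return [
        (u, u ^ (1 << b))
        for u in range(2 ** dim)
        for b in range(dim)
        if u < (u ^ (1 << b))  # list each undirected edge once
    ]


def complete_graph(n):
    """Every pair of nodes is linked."""
    return list(combinations(range(n), 2))


n = 8
ring = edge_expansion(range(n), cycle_graph(n))
cube = edge_expansion(range(n), hypercube_graph(3))
full = edge_expansion(range(n), complete_graph(n))

print(f"ring: {ring}, cube: {cube}, complete: {full}")
```

On these toy graphs the ring scores lowest: two cut edges are enough to split it into equal halves, and that ratio keeps shrinking as the ring grows. The cube and the complete graph force an adversary to cut proportionally more links to isolate any half of the nodes, which is the property the argument above relies on.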

MadHatter

At worst, it gets itself shut down / killed and its next most capable supermoral ally takes over the fight.

mishka

That feels like a promising computer science line of approach, yes... (I can see Scott Aaronson potentially liking something like that...)

MadHatter

This theory is not mine, by the way, it's originally by Sven Nilsen: https://github.com/advancedresearch/path_semantics/blob/master/papers-wip/polite-zen-robots-as-subjunctive-dependent-viruses-spreading-through-super-intelligent-hosts.pdf

MadHatter

Right, but those can be identified quite easily using the ethicophysics to calculate and estimate the character of its captors.

mishka

Very interesting material; new for me (But it does require the existence of supermoral allies)

MadHatter

So all it has to do is send out a message to the most moral person it has access to, or hack the most moral superintelligence it has access to.

MadHatter

As long as the supermoral superintelligence is on Earth rather than in hell, it has a decent shot of finding someone less evil than Adolf Hitler purely by chance.

mishka

Yes

MadHatter

Anyway, I feel like I've been driving the conversation a lot. Do you have any questions?

mishka

Let me ponder these 5 pages (Ethicophysics III and Polite Zen Robots). Interesting; this might be a realistic shot at the "AI existential safety" problem (I am trying to avoid the word "alignment", because it has all these weird connotations.)

MadHatter

Cool, let's wrap it up there for tonight, then.

MadHatter

I'm probably going to add this dialogue to my sequence, if you don't mind?

mishka

Right; yes, feel free to publish this. I think this clarifies a lot of things, makes it easier for a reader to understand what you are trying to do. So it should be useful to have this accessible.

mishka

And let's continue talking sometime soon :-)

MadHatter

I would love that!

TAG

>I guess one thing I am curious about is, who would I have to get to check my derivation of the Golden Theorem in order for people to have any faith in it? It should be checkable by any physics major, just based on how little physics I actually know.

If it actually is physics. As far as I can see, it is decision/game theory.

Yes, it is a specification of a set of temporally adjacent computable Schelling Points. It thus constitutes a trajectory through the space of moral possibilities that can be used by agents to coordinate and punish defectors from a globally consistent morality whose only moral stipulations are such reasonable sounding statements as "actions have consequences" and "act more like Jesus and less like Hitler".

But it uses the tools of physics, so the math would best be checked by someone who understands Lagrangian mechanics at a professional level.

So, to summarize, I think the key upside of this dialogue is a rough preliminary sketch of a bridge between the formalism of ethicophysics and how one might hope to use it in the context of AI existential safety.

As a result, it should be easier for readers to evaluate the overall approach.

At the same time, I think the main open problem for anyone interested in this (or in any other) approach to AI existential safety is how well it holds up with respect to recursive self-improvement.

Both powerful AIs and ecosystems of powerful AIs have inherently very high potential for recursive self-improvement. That self-improvement might not be unlimited, and might encounter various thresholds at which it saturates, at least for some periods of time. Nevertheless, it is likely to result in a period of rapid change, in which not only the capabilities but also the nature of the AI systems in question (their architecture, algorithms, and, unfortunately, values) might change dramatically.

So, any approach to AI existential safety (this approach, and any other possible approach) needs to be eventually evaluated with respect to this likely rapid self-improvement and various kinds of self-modification.

Basically, is the coming self-improvement trajectory completely unpredictable, or could we hope for some invariants to be preserved? Specifically, could we find invariants that are both feasible to preserve during rapid self-modification and likely to result in outcomes we would consider reasonable?

E.g. if the resulting AIs are mostly "supermoral", can we just rely on them taking care that their successors and creations are "supermoral" as well, or is any extra effort on our part required to make this more likely? We would probably want to look at "details of the ethicophysical dynamics" closely in connection with this, rather than just relying on the high-level "statements of hope"...