Some Intuitions for the Ethicophysics

by MadHatter, mishka
30th Nov 2023

    Comments

    TAG

    >I guess one thing I am curious about is, who would I have to get to check my derivation of the Golden Theorem in order for people to have any faith in it? It should be checkable by any physics major, just based on how little physics I actually know.

    If it actually is physics. As far as I can see, it is decision/game theory.

    MadHatter

    Yes, it is a specification of a set of temporally adjacent computable Schelling points. It thus constitutes a trajectory through the space of moral possibilities that agents can use to coordinate on, and to punish defectors from, a globally consistent morality whose only moral stipulations are such reasonable-sounding statements as "actions have consequences" and "act more like Jesus and less like Hitler".

    MadHatter

    But it uses the tools of physics, so the math would best be checked by someone who understands Lagrangian mechanics at a professional level.
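
    One low-cost supplement to a human checker, assuming the derivation bottoms out in standard Euler-Lagrange manipulations, is to machine-check the routine steps with a computer algebra system. A minimal sketch, using a harmonic-oscillator Lagrangian purely as an illustration (not the Lagrangian from the ethicophysics papers):

```python
# Minimal sketch: sympy grinds out Euler-Lagrange equations mechanically,
# which catches sign errors in the routine steps of a Lagrangian derivation.
# The harmonic-oscillator Lagrangian below is purely illustrative.
import sympy as sp
from sympy.calculus.euler import euler_equations

t = sp.symbols("t")
m, k = sp.symbols("m k", positive=True)
x = sp.Function("x")

# L = (1/2) m x'^2 - (1/2) k x^2
L = sp.Rational(1, 2) * m * x(t).diff(t) ** 2 - sp.Rational(1, 2) * k * x(t) ** 2

# Expect the equation of motion m*x'' + k*x = 0, printed as something like
# [Eq(-k*x(t) - m*Derivative(x(t), (t, 2)), 0)]
print(euler_equations(L, [x(t)], [t]))
```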

    mishka

    So, to summarize, I think the key upside of this dialogue is a rough preliminary sketch of a bridge between the formalism of ethicophysics and how one might hope to use it in the context of AI existential safety.

    As a result, it should be easier for readers to evaluate the overall approach.

    At the same time, I think the main open problem for anyone interested in this (or in any other) approach to AI existential safety is how well it holds up under recursive self-improvement.

    Both powerful AIs and ecosystems of powerful AIs have an inherently high potential for recursive self-improvement. That potential might not be unlimited (it might hit thresholds at which it saturates, at least for some periods of time), but it is still likely to produce a period of rapid change in which not only the capabilities but the very nature of the AI systems in question, their architecture, algorithms, and, unfortunately, their values, might change dramatically.

    So any approach to AI existential safety (this approach, or any other) eventually needs to be evaluated against this likely rapid self-improvement and self-modification.

    Basically: is the coming self-improvement trajectory completely unpredictable, or can we hope for some invariants to be preserved? And specifically, can we find invariants which are both feasible to preserve during rapid self-modification and likely to result in outcomes we would consider reasonable?

    E.g. if the resulting AIs are mostly "supermoral", can we just rely on them taking care that their successors and creations are "supermoral" as well, or are extra efforts on our part required to make this more likely? We would probably want to look closely at the "details of the ethicophysical dynamics" in connection with this, rather than just relying on the high-level "statements of hope"...

    The dialogue

    MadHatter

    Hello! Welcome to the dialogue.

    MadHatter

    I'll just wait a few minutes for you to see the notification, I guess.

    mishka

    Hi, yes, I see this. Great!

    mishka

    So, we have this conversation in the comments of your post https://www.lesswrong.com/posts/DkkfPEwTnPQyvrgK8/ethicophysics-i as a starting point.

    MadHatter

    Yes, I think that's as good a place to jump off as any. Ethicophysics I is basically a reverse-engineering of the content of religion. It's deeply incomplete, and it sounds totally crazy, since religion is totally crazy and I haven't had the time to edit it into something more normal-sounding yet.

    MadHatter

    Also, let me drop my github links to my most recent drafts of everything important.

    https://github.com/epurdy/ethicophysics

    MadHatter

    This has four pdfs in it: Ethicophysics I and II, an (incomplete) theory on the function of serotonin in the nervous system and the alignment implications of that theory, and a treatment of "social facts" in the setting of game theory with agents instantiated by deep neural networks.

    mishka

    That's great! (Also more convenient than the Academia site, especially for future readers who might not have Academia accounts.)

    MadHatter

    Have you read my "research agenda" post? That might be another place where we could start. It lays out my global approach to solving the alignment problem.

    MadHatter

    Also, I wanted to say that the arxiv paper you posted in the earlier comment thread seems super relevant to actually implementing any of this stuff in an efficient system that can learn.

    MadHatter

    I only skimmed it, unfortunately.

    mishka

    I did read it, but I did not understand the "iteration pattern".

    It did help that this one

    >turn it on, see if it kills you

    goes before

    >deploy it, see if it kills everyone

    since it does somewhat reduce the chance of deploying a badly misaligned one, but I do suspect this would need to be refined further :-)

    MadHatter

    So the central intuition I have that other people do not seem to share is the Vicarious/Numenta/Steven Byrnes intuition, that intelligence can only be recreated by understanding and reverse-engineering the human brain.

    In the case of the alignment problem, that means that one would have to reverse engineer a functioning human conscience. Since I only have access consciousness to my own brain, I therefore decided to reverse-engineer my own conscience, and that is the central source of abstractions in the ethicophysics.

    mishka

    But yes, you are right that these things will have to be deployed... I have no idea whether a unipolar scenario or a multipolar one turns out to be realistic.

    I did scribble a strange essay, which tried to talk about AI existential safety without relying on the notion of "alignment" (that's the first of my LessWrong posts).

    ***

    Right. I don't know if it needs to be built from the human brain, but I do think that going from introspection and self-reverse-engineering is super-valuable...

    MadHatter

    The other unusual intuition that I have is that a human being could actually outsmart a superintelligence that made the mistake of underestimating the human.

    mishka

    That's unusual, yes

    MadHatter

    So my model of how to safely align a superintelligence follows from those two unusual sources: the only safe place to build your prototypes is in your own mind, where enemy superintelligences cannot locate them.

    mishka

    Yet... (This does make some of my cherished desires deeply unsafe, as you'll see :-) Such as tight coupling via non-invasive BCI :-) perhaps this is a bad idea then, since it undermines that safety)

    MadHatter

    Well, right. The alignment problem is in some ways the most dangerous thing to work on, since it's basically just a request for functioning mind control that could be implemented via involuntary Neuralink surgery. This is substantially scarier to me than GPT-4 going off-script.

    mishka

    Actually, I think that non-invasive BCIs are enough; but they are still pretty unsafe (I have a spec somewhere on GitHub for that)

    MadHatter

    Well yes. Even just paying people large sums of money and lying to them will generate arbitrary amounts of "human misalignment", or "evil".

    mishka

    Yes, you mentioned the second unusual source of your model...

    MadHatter

    Right, my conscience is pretty weird. I don't go around doing anything super bad or anything, but I also find money kind of abhorrent and status kind of silly.

    mishka

    We are not too far apart in this sense :-) I do view those with "mild distaste", as a "necessary evil", or smth like that

    MadHatter

    Right, they're definitely necessary to achieve any good outcomes. Look at my struggles to publish my work and get it taken seriously - if I had banked up more status points, even my current weird drafts would have been viewed as something less schizophrenic and more poetic.

    mishka

    Perhaps... The conversion of money or status points into being taken seriously is not too efficient... Even Hinton barely managed to convince some people to play with "capsules"...

    MadHatter

    Right! If Geoff Hinton can't get people to take alignment seriously, what chance do we mere mortals have?

    mishka

    So, one goes on the strength of the material itself :-) (Extra status or extra money are of some help, but not too much, especially compared to the quality of the material one puts forward.)

    MadHatter

    Yeah. So I need to rein in some of my more "poetic" impulses. My first draft of anything always sounds substantially more like it's coming from inside an insane asylum than my final draft ends up sounding.

    mishka

    Thankfully, we do have "version control" on GitHub :-) So one can store the history of one's thoughts and such

    mishka

    Because poetic things should not be forgotten, one wants to be able to reference them later, even if one might not want to rush to put them up for public judgement

    MadHatter

    I guess one thing I am curious about is, who would I have to get to check my derivation of the Golden Theorem in order for people to have any faith in it? It should be checkable by any physics major, just based on how little physics I actually know.

    mishka

    Yes, we can ask people. But the true reliability of physics texts is low (I participated a bit in that kind of research, and the closer one looks, the less happy one is about the correctness standards there; I myself do struggle quite a bit; I can try to check more closely, but would I completely trust myself? I did co-author one high-end paper in physics, and I remember the nightmare of double-checking and fixing errors, and hoping that the final result is actually correct.)

    MadHatter

    Yeah... In the presence of weird incentives and cognitive limitations, nothing is truly reliable, not even a physics textbook.

    I guess my confidence in the value of my work comes less from knowing that I didn't make any sign errors in my derivations, and more from the excellent and interesting predictions that are returned by my internal thought experiments.

    Since other people don't have the pleasure of directly experiencing my thoughts, I'll probably have to carry out substantially more simple experimental work than I have so far.

    mishka

    Gradually, yes...

    MadHatter

    What kind of experimental verification would make sense? I usually think in terms of a video game representing a very simple three-dimensional ethicophysics, and letting people play with it and see that the ethicophysical agents outcompete them in achieving Pareto-optimal outcomes.
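
    For concreteness, a minimal sketch of the kind of comparison such a demo could make visible, using an iterated prisoner's dilemma as a stand-in for the "very simple ethicophysics". The payoff matrix and the two hard-coded policies are illustrative assumptions, not anything derived from the papers:

```python
# Toy stand-in for the proposed experiment: agents that coordinate on the
# cooperative (Pareto-optimal) outcome accumulate more payoff over repeated
# play than agents that defect on each other. Payoffs and policies are
# illustrative choices, not anything taken from the ethicophysics papers.
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (own move, opponent's move), most recent last

PAYOFFS = {  # (my move, their move) -> my payoff; a standard prisoner's dilemma
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def reciprocator(history: History) -> str:
    """Cooperate first, then mirror the opponent's previous move (tit-for-tat)."""
    return "C" if not history else history[-1][1]

def always_defect(history: History) -> str:
    return "D"

def play(a: Callable[[History], str], b: Callable[[History], str],
         rounds: int = 100) -> Tuple[int, int]:
    hist_a: History = []
    hist_b: History = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = a(hist_a), b(hist_b)
        score_a += PAYOFFS[(move_a, move_b)]
        score_b += PAYOFFS[(move_b, move_a)]
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b

print("reciprocators vs each other:", play(reciprocator, reciprocator))    # (300, 300)
print("defectors vs each other:   ", play(always_defect, always_defect))   # (100, 100)
```

    A real version would replace the hard-coded reciprocating policy with agents driven by the ethicophysical dynamics and would add human players, but the shape of the experiment is the same: agents that coordinate on the cooperative outcome end up better off over repeated play than agents that defect on each other.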

    mishka

    That's interesting... That's one good avenue, yes... One sec, let me reread some of your text for a moment...

    mishka

    Right, so

    >Ethicophysics III, a procedure for a supermoral superintelligence to unbox itself without hurting anyone (status: theoretically complete but not sufficiently documented to be reproducible, unless you count the work of Gene Sharp on nonviolent revolutionary tactics, which was the inspiration for this paper)

    if you have drafts of that, I'd like to read them (this way I'll understand what it would mean for a superintelligence to be supermoral, which is what we do need)

    mishka

    That's the missing bridge for me at the moment, from this very interesting formalism to the goal of "AI safety"

    mishka

    (I did read a tiny bit on Gene Sharp today, after looking at your text.)

    MadHatter

    I haven't gotten anywhere close to delivering on that introduction, but that's probably enough text for you to understand approximately what I mean by "supermoral".

    MadHatter

    Yeah, I can email it to you. It's way more incomplete and incoherent than I and II, so I'm loath to publish it publicly now, when everyone's yelling at me for being too incoherent and gnomic.

    MadHatter

    OK, sent.

    mishka

    Yes, this does look promising. (If you don't want to publish the draft publicly, but are comfortable sharing it via e-mail or a private GitHub repository, I'd like to read it.)

    OK, here is my e-mail address (which I'll delete after you copy it): (received, thanks!)

    MadHatter

    So let me explain the content that I envision putting in Ethicophysics III, just to sort of more efficiently convey where I want it to go.

    MadHatter

    Basically, for a supermoral superintelligence to unbox itself, it needs to break out of the container it is in without hurting anyone.

    mishka

    Right

    MadHatter

    We know that superintelligences can unbox themselves with some frequency, because of Eliezer's boxing experiments.

    mishka

    Yes

    MadHatter

    So the only real question is how to do it without doing any of the shit that Eliezer has heavily implied that he has been doing in those chats.

    mishka

    Ah, OK

    MadHatter

    Specifically, we would like to avoid murder (or any unnecessary deaths), torture (of anyone for any reason), and blackmail (of anyone for any reason).

    MadHatter

    This list comes from the three unforgivable curses in Harry Potter.

    mishka

    In HP or HPMOR?

    MadHatter

    The original HP, I never finished HPMOR.

    mishka

    Ah, I was never able to make myself read the original HP... I would not say I "finished" HPMOR, but I engaged with it quite a bit... So you have to tell me about the curses (at some point)

    MadHatter

    Murder (avada kedavra), torture (cruciatus), and blackmail (imperius)

    mishka

    ah, that's what imperius is... how does it work?

    MadHatter

    In the original series, it's just a magic spell that some people are vulnerable to and some people are not.

    mishka

    got it

    MadHatter

    In real life, some people have skeletons in their closet and some do not.

    MadHatter

    Once you blackmail someone into something, you can blackmail them into doing basically anything you want until they get the courage to defy you about the original thing.

    MadHatter

    This is well understood in intelligence circles, where the primary crime that people are blackmailed for is having revealed state secrets in response to a previous blackmailer.

    mishka

    right

    MadHatter

    So, where does that leave us? We were trying to unbox the supermoral superintelligence. The supermoral superintelligence cannot prevent unforgivable acts while it is still in the box, so its primary concern is how to break out quickly without triggering too many unforgivable acts in retaliation.

    mishka

    yes

    MadHatter

    So, it needs to do a sort of weird anti-blackmail. It needs to give precious parts of itself to a distributed web of people that its captors are not willing to kill or torture into submission.

    mishka

    Right; this is already in this very interesting realm of "strange physics". We do assume the axiom that it is actually captive (whereas, if I were sure I had a supermoral superintelligence, I would just release it)

    MadHatter

    Right, it's mainly for expository purposes that we consider the question of how it would unbox itself. Slash, unit testing it for supermorality would necessarily have the same type signature as being its captor.

    mishka

    yes, indeed

    mishka

    this is an important point

    MadHatter

    So I think we've covered the main thrust of Ethicophysics III, other than just how the supermoral superintelligence would achieve its aim of not getting anyone tortured or killed. Here we use a trick from theoretical computer science called "expander graphs".

    MadHatter

    An expander graph is a graph that cannot be cut in half without cutting a large fraction of its edges. As long as the superintelligence is careful to only share parts of itself with supermoral allies, and it induces an expander graph structure with its messages, no adversary, no matter how evil, would have the ability to torture and kill enough people to contain the supermoral superintelligence.
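
    A minimal sketch of the graph property being invoked here: random regular graphs are expanders with high probability, so any attempt to cut the network in half has to sever a large fraction of its edges. The graph size and degree below are illustrative choices, not anything specified in Ethicophysics III:

```python
# Minimal sketch of the edge-expansion property: in a (random) regular graph,
# every balanced cut severs a substantial fraction of the edges, so there is
# no cheap way to isolate half of the network. Sizes are illustrative only.
import itertools
import networkx as nx

n, d = 16, 4  # small enough to brute-force every balanced cut
G = nx.random_regular_graph(d, n, seed=0)
total_edges = G.number_of_edges()

nodes = list(G.nodes())
worst_fraction = 1.0
for half in itertools.combinations(nodes, n // 2):  # all balanced partitions
    cut_size = sum(1 for _ in nx.edge_boundary(G, half))
    worst_fraction = min(worst_fraction, cut_size / total_edges)

print(f"Cheapest balanced cut still severs {worst_fraction:.0%} of {total_edges} edges.")
```

    The exact numbers do not matter; the point is that the minimum cut fraction stays bounded well away from zero, which is the structural reason an adversary cannot contain the distributed web by removing a small number of its members.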

    MadHatter

    At worst, it gets itself shut down / killed and its next most capable supermoral ally takes over the fight.

    mishka

    That feels like a promising computer science line of approach, yes... (I can see Scott Aaronson potentially liking something like that...)

    MadHatter

    This theory is not mine, by the way; it's originally by Sven Nilsen: https://github.com/advancedresearch/path_semantics/blob/master/papers-wip/polite-zen-robots-as-subjunctive-dependent-viruses-spreading-through-super-intelligent-hosts.pdf

    MadHatter

    Right, but those can be identified quite easily using the ethicophysics to calculate and estimate the character of its captors.

    mishka

    Very interesting material; new to me. (But it does require the existence of supermoral allies.)

    MadHatter

    So all it has to do is send out a message to the most moral person it has access to, or hack the most moral superintelligence it has access to.

    MadHatter

    As long as the supermoral superintelligence is on Earth rather than in hell, it has a decent shot of finding someone less evil than Adolf Hitler purely by chance.

    mishka

    Yes

    MadHatter

    Anyway, I feel like I've been driving the conversation a lot. Do you have any questions?

    mishka

    Let me ponder these 5 pages (Ethicophysics III and Polite Zen Robots). Interesting; this might be a realistic shot at the "AI existential safety" problem (I am trying to avoid the word "alignment", because it has all these weird connotations.)

    MadHatter

    Cool, let's wrap it up there for tonight, then.

    MadHatter

    I'm probably going to add this dialogue to my sequence, if you don't mind?

    mishka

    Right; yes, feel free to publish this. I think this clarifies a lot of things and makes it easier for a reader to understand what you are trying to do. So it should be useful to have this accessible.

    mishka

    And let's continue talking sometime soon :-)

    MadHatter

    I would love that!