Morphism

AI Alignment researcher. Alternatively known as Pi Rogers.

People often say things like "do x. Your future self will thank you." But I've found that I very rarely actually thank my past self, after x has been done, and I've reaped the benefits of x.

This quick take is a preregistration: for the next month, I will thank my past self more when I reap the benefits of a sacrifice of their immediate utility.

e.g. When I'm stuck in bed because the activation energy to leave is too high, then overcome that, go for a run, and feel a lot more energized, I'll look back and say "Thanks, 7 am Morphism!"

(I already do this sometimes, but I will now make a TAP out of it, which will probably cause me to do it more often.)

Then I will make a full post describing in detail what I did and what (if anything) changed about my ability to sacrifice short-term gains for greater long-term gains, along with plausible theories (with probabilities) about the causal connection or lack thereof, and a list of potential confounders.

Of course, it is possible that I completely fail to even install the TAP. I don't think that's very likely, because I'm #1-prioritizing my own emotional well-being right now (I'll shift focus back onto my world-saving pursuits once I'm more stably not depressed). In that case I will not write a full post, because the experiment would not even have been done; I will instead just make a comment on this shortform to that effect.

Edit: There are actually many ambiguities with the use of these words. This post is about one specific ambiguity that I think is often overlooked or forgotten.

The word "preference" is overloaded (and so are related words like "want"). It can refer to one of two things:

  • How you want the world to be, i.e. your terminal values, e.g. "I prefer worlds in which people don't needlessly suffer."
  • What makes you happy, e.g. "I prefer my ice cream in a waffle cone."

I'm not sure how we should distinguish these. So far, my best idea is to call the former "global preferences" and the latter "local preferences", but that clashes with the pre-existing notion of locality of preferences as the quality of terminally caring more about people/objects closer to you in spacetime. Does anyone have a better name for this distinction?

I think we definitely need to distinguish them, however, because they often disagree: most "values disagreements" between people are just disagreements in local preferences, and so could be resolved by appealing to global preferences.

I may write a longpost at some point on the nuances of local/global preference aggregation.

Example: Two alignment researchers, Alice and Bob, both want access to a limited supply of compute. The rest of this example is left as an exercise.

Emotions can be treated as properties of the world, optimized with respect to constraints like anything else. We can't edit our emotions directly but we can influence them.

Oh no, I mean they have the private key stored on the client side and do the decryption there.

Ideally all of this is behind a nice UI, like Signal.

I mean, Signal messenger has worked pretty well in my experience.

But safety research can actually disproportionately help capabilities, e.g. the development of RLHF allowed OAI to turn their weird text predictors into a very generally useful product.

I could see embedded agency being harmful, though, since an actual implementation of it would be really useful for inner alignment.

Some off the top of my head:

  • Outer Alignment Research (e.g. analytic moral philosophy in an attempt to extrapolate CEV) seems to be totally useless for capabilities, so we should almost definitely publish that.
  • Evals for governance? Not sure about this, since a lot of eval research helps capabilities, but if it leads to regulation that lengthens timelines, it could be net positive.

Edit: oops, I didn't see tammy's comment.

Idea:

Have everyone who wants to share and receive potentially exfohazardous ideas/research send out a 4096-bit RSA public key.
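For concreteness, here's a minimal sketch of that keypair step, assuming the Python `cryptography` package (any library that produces standard RSA keys would do; none of these names come from the original idea):

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# Each participant generates a 4096-bit RSA keypair locally.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)

# Only the public half ever leaves the participant's machine.
public_pem = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)
```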

Then, make a clone of the Alignment Forum where, every time you make a post, you provide a list of the public keys of the people you want to be able to see the post. The client encrypts the post to all of those public keys before uploading, so the server only ever holds encrypted posts.
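A sketch of what the client-side encryption could look like, again assuming the `cryptography` package. One caveat the idea glosses over: RSA-OAEP can only encrypt a few hundred bytes, so in practice the client would use hybrid encryption, i.e. encrypt the post with a fresh symmetric key and wrap that key under each recipient's RSA public key. The `encrypt_post` helper below is my own illustration, not a spec:

```python
import os

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_post(post_text: bytes, recipient_public_pems: list[bytes]):
    """Client-side: encrypt one post for a chosen list of recipients."""
    # Fresh AES-256-GCM key encrypts the post body itself.
    content_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = AESGCM(content_key).encrypt(nonce, post_text, None)

    # Wrap the content key separately under each recipient's RSA public key.
    wrapped_keys = []
    for pem in recipient_public_pems:
        public_key = serialization.load_pem_public_key(pem)
        wrapped_keys.append(
            public_key.encrypt(
                content_key,
                padding.OAEP(
                    mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(),
                    label=None,
                ),
            )
        )

    # Only (nonce, ciphertext, wrapped_keys) is ever uploaded to the server.
    return nonce, ciphertext, wrapped_keys
```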

Then, users can put in their own private key to see a post. The encrypted post gets downloaded to the user's machine and is decrypted on the client side. Perhaps require users to be on open-source browsers for extra security.
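Continuing the sketch above, a hypothetical client-side `decrypt_post` would try each wrapped key against the reader's private key and decrypt the post locally:

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def decrypt_post(nonce, ciphertext, wrapped_keys, private_key):
    """Client-side: recover the post using the reader's RSA private key."""
    oaep = padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA256()),
        algorithm=hashes.SHA256(),
        label=None,
    )
    for wrapped in wrapped_keys:
        try:
            content_key = private_key.decrypt(wrapped, oaep)
        except ValueError:
            continue  # this wrapped key was meant for a different recipient
        return AESGCM(content_key).decrypt(nonce, ciphertext, None)
    raise ValueError("This post was not encrypted to our key.")
```

The server never sees a private key or a plaintext post in either direction; all it can learn is the post's size and the number of recipients.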

Maybe also add some post-quantum thing like what Signal uses so that we don't all die when quantum computers get good enough.

Should I build this?

Is there someone else here more experienced with csec who should build this instead?

Is this a massive exfohazard? Should this have been published?
