Replying toTell people as early as possible it's not going to work out

Tell people as early as possible it's not going to work out

Basically all of it except the user bans are for low quality content (and almost all of that, LLM written, at least these days). It's important to filter stuff like that out and I'm very glad you guys are doing it, but I don't think that this is keeping out many conterfactual Girards and Zizes. (I don't know as much about Torres and Nier).

If we condition on "doesn't post obviously low quality content", we're left with a distribution almost entirely filled with people we want in our community, and yet still contains Girard and Ziz (and I'm guessing the others as well). My ass-numbers prior is at least 20:1 in favor of... (read more)

Morphism16d

Ok, I was probably not going to write the post anyway, but since no one seems to actively want it, your insistence that it requires this much extra care is enough to dissuade me.

I will say, though, that you may be committing a typical mind fallacy when you say "convincing is >>> costly to complying to the request" in your reply to Zack Davis' comment. I personally dislike doing this kind of lit-review style research because in my experience it's a lot of trudging through bullshit with little payoff, especially in fields like social psychology, and especially when the only guidance I get is "ask ChatGPT for related Buddhist texts". I don't... (read more)

Morphism17d

By the time you have AIs capable of doing substantial work on AI r&d, they will also be able to contribute effectively to alignment research (including, presumably, secret self-alignment).

Humans do substantial work on AI r&d, but we haven't been very effective at alignment research. (At least, according to the view that says alignment is very hard, which typically also says that basically all of our current "alignment" techniques will not scale at all.)

Even if takeoff is harder than alignment, that problem becomes apparent at the point where the amount of AI labor available to work on those problems begins to explode, so it might still happen quickly from a calendar perspective.

Yup, this is very possible.

Morphism17d

Not sure why you replied in three different places. I will (try to) reply to all of them here.

I did this so that you could easily reply to them separately, since they were separate responses.

I do not consider linking to those Aella and Duncan posts a literature review, nor do I consider them central examples of work on this topic.

I did not link them for that reason. I linked them to ask whether my understanding of the general problem you're pointing to is correct: "Especially bad consequences relative to other instances of this mistake because the topic relates to people’s relationship with their experience of suffering and potentially unfair dismissals of suffering,... (read 391 more words →)

Morphism17d

Psych wards are horrible Kafkaesque nightmares, but I don't think they are Out to Get You in the way Zvi describes. Things that are Out to Get You feed on your slack. For example, social media apps consume your attention. Casinos consume your money. They are incentivized to go after those who have a lot of slack to lose ("whales"), and those who have few defenses against their techniques (see Tsvi's comment about desperation.

Psych wards are, to a first approximation, prisons: one of their primary functions is to destroy your slack so that you cannot use it to do something that society at large dislikes. In the prison case: committing crimes; in the psych ward case (for depression): killing yourself. They destroy your slack because they don't want you to have it. Things that Get You consume your slack because they want it for themselves.

-1

Morphism17d

How does reviewing literature help avoid this failure mode?

Morphism17d

Could you point me to some specific examples of this? Or at least, could you tell me if these seem like correct examples:

Thresholding by Duncan Sabien
Frame Control by Aella

If I write a post about Newcomblike suffering, I would probably want to encourage people to escape such situations without hurting others, and emphasize that, even if someone is ~directly inflicting this on you, thinking of it as "their fault" is counterproductive. Hate the game, not the players. They are in traps much the same as yours.

Morphism17d

Where might I find such pre-existing literature? I have never seen this discussed before, though it's sort of eluded* to in many of Zvi's posts, especially in the immoral mazes sequence.

I must admit, if you're talking about literature in the world of social psych outside Lesswrong, I don't have much exposure to it, and I don't really consider it worth my time to take a deep dive there, since their standards for epistemic rigor are abysmal.

But if you have pointers to specific pieces of research, I'd love to see them.

*eluded or alluded? idk?

Replying toTell people as early as possible it's not going to work out

Morphism17d

Tell people as early as possible it's not going to work out

This post seems to presuppose that:

There exist people who are irredeemably a Bad Fit for the rationalist community (and/or {insert subcommunity or adjacent community})
It is possible and not extremely costly to detect these people with high confidence early on.

These both seem false to me, or at least, not obviously true.

Morphism18dQuick Take

Newcomblike suffering

Many things in the world want you to suffer. Signalling suffering is useful in many social situations. For example, suffering is a sign that one has little slack, and so entities that are out to get you will target those who signal suffering less.

Through Newcomblike self-deception, a person can come to believe that they are suffering. The easiest way to make yourself think that you are suffering is to actually suffer. In this way, the self-deception hyperstitions itself into reality. Perhaps a large amount of human suffering is caused by this.

Solving this problem may be of great interest to those who want to reduce human suffering.

I may write a longer post about this with more details and a more complete argument. If you particularly want this, please comment or dm, as that will make me more likely to write it.

RSI should be at least as hard as alignment, since in order to recursively self-improve, an AI must itself be able to solve the alignment problem wrt its own values. Thus, "alignment is hard" and "takeoff is fast" are anti-correlated.

What, if anything, is wrong with this line of reasoning?

Dear past-me of [exact time glomarized; .5-5 years ago],

You are about to be recruited to a secret world-saving org. (Y'know, like Leverage, except it's a member of the dark forest of Leverage-likes that operate even less publicly than Leverage).

Don't join.

They will give you very compelling reaons to join. Don't ignore them. But take into account that I, your future self, also heard all of those things, decided to join, and now regret it.

Don't. Instead, please continue that other thing you were doing, before they asked you to join there thing. The other thing will probably have better results for you and the world.

This warrants a longer post, but on pain of that post sitting in my obsidian with a "draft" tag for ages, having approximately zero causal impact on the outside world, I'm posting this now.

(all of my replies to messages concerning this will be delayed by 0 or more months for glomarization purposes)

All "infohazards" I've seen seem to just be more and more complicated versions of "Here's a Löbian proof that you're now manually breathing". A sufficiently well-designed mind would recognize these sorts of things before allowing them to fully unfold.

Convex agents are practically invisible.

We currently live in a world full of double-or-nothing gambles on resources. Bet it all on black. Invest it all in risky options. Go on a space mission with a 99% chance of death, but a 1% chance of reaching Jupiter, which has about 300 times the mass-energy of earth, and none of those pesky humans that keep trying to eat your resources. Challenge one such pesky human to a duel.

Make these bets over and over again and your chance of total failure (i.e. death) approaches 100%. When convex agents appear in real life, they do this, and very quickly die. For these agents, that is all part... (read more)

If you're thinking without writing, you only think you're thinking.

-Leslie Lamport

This seems..... straightforwardly false. People think in various different modalities. Translating that modality into words is not always trivial. Even if by "writing", Lamport means any form of recording thoughts, this still seems false. Often times, an idea incubates in my head for months before I find a good way to represent it as words or math or pictures or anything else.

Also, writing and thinking are separate (albiet closely related) skills, especially when you take "writing" to mean writing for an audience, so the thesis of this Paul Graham post is also false. I've been thinking reasonably well for about 16 years, and only recently have I started gaining much of an ability to write.

Are Lamport and Graham just wordcels making a typical mind fallacy, or is there more to this that I'm not seeing? What's the steelman of this claim that good thinking == good writing?

Contrary to what the current wiki page says, Simulacrum levels 3 and 4 are not just about ingroup signalling. See these posts and more, as well as Beaudrillard's original work if you're willing to read dense philosophy.

Here is an example where levels 3 and 4 don't relate to ingroups at all, which I think may be more illuminating than the classic "lion across the river" example:

Alice asks "Does this dress makes me look fat?" Bob says "No."

Depending on the simulacrum level of Bob's reply, he means:

"I believe that the dress does not make you look fat."
"I want you to believe that the dress does not make you look fat, probably because I

... (read 412 more words →)

Formalizing Placebomancy

I propose the following desideratum for self-referential doxastic modal agents (agents that can think about their own beliefs), where $□ A$ represents "I believe $A$ ", $(W | A)$ represents the agent's world model conditional on $A$ , and $≻$ is the agent's preference relation:

Positive Placebomancy: For any proposition $P$ , The agent concludes $P$ from $□ P \to P$ , if $(W | P) ≻ (W | \neg P)$ .

In natural English: The agent believes that hyperstitions, that benefit the agent if true, are true.

"The placebo effect works on me when I want it to".

A real life example: In this sequence post, Eliezer Yudkowsky advocates for using positive placebomancy on "I cannot self-deceive".

I would also like to formalize a notion of "negative placebomancy" (doesn't believe hyperstitions that don't... (read more)

People often say things like "do x. Your future self will thank you." But I've found that I very rarely actually thank my past self, after x has been done, and I've reaped the benefits of x.

This quick take is a preregistration: For the next month I will thank my past self more, when I reap the benefits of a sacrifice of their immediate utility.

e.g. When I'm stuck in bed because the activation energy to leave is too high, and then I overcome that and go for a run and then feel a lot more energized, I'll look back and say "Thanks 7 am Morphism!"

(I already do this sometimes, but I will... (read more)

•••

Edit: There are actually many ambiguities with the use of these words. This post is about one specific ambiguity that I think is often overlooked or forgotten.

The word "preference" is overloaded (and so are related words like "want"). It can refer to one of two things:

How you want the world to be i.e. your terminal values e.g. "I prefer worlds in which people don't needlessly suffer."
What makes you happy e.g. "I prefer my ice cream in a waffle cone"

I'm not sure how we should distinguish these. So far, my best idea is to call the former "global preferences" and the latter "local preferences", but that clashes with the pre-existing notion of locality... (read more)

Feeling (instrumentally) Rational

Morphism

Contra this post from the Sequences

In Eliezer's sequence post, he makes the following (excellent) point:

I can’t find any theorem of probability theory which proves that I should appear ice-cold and expressionless.

This debunks the then-widely-held view that rationality is counter to emotions. He then goes on to claim that emotions have the same epistemic status as the beliefs they are based on.

For my part, I label an emotion as “not rational” if it rests on mistaken beliefs, or rather, on mistake-producing epistemic conduct. “If the iron approaches your face, and you believe it is hot, and it is cool, the Way opposes your fear. If the iron approaches your face, and you believe

Morphism

I don't think this is very likely, but a possible path to alignment is formal goal alignment, which is basically the following two step plan:

Define a formal goal that robustly leads to good outcomes under heavy optimization pressure
Build something that robustly pursues the formal goal you give it

I think currently the best proposal for step 1 is QACI. In this post, I propose an alternative that is probably worse but definitely not Pareto-worse.

High-Level Overview

Step 1.1: Build a large facility ("The Vessel"). Populate The Vessel with very smart, very sane people (e.g. Eliezer Yudkowsky, Tamsin Leake, Gene Smith) and labs and equipment that would be useful for starting a new civilization.

Step 1.2: Mark... (read 514 more words →)

The formal goal is a pointer

Morphism

When I introduce people to plans like QACI, they often have objections like "How is an AI going to do all of the simulating necessary to calculate this?" or "If our technology is good enough to calculate this with any level of precision, we can probably just upload some humans." or just "That's not computable."

I think these kinds of objections are missing the point of formal goal alignment and maybe even outer alignment in general.

To formally align an ASI to human (or your) values, we do not need to actually know those values. We only need to strongly point to them.

AI will figure out our values. Whether it's aligned or not, a... (read 175 more words →)

Counterfactual Civilization Simulation Version -1.0 aka my application to Johannes Mayer's SPAR project

Morphism

This is my "object level output" submission for Johannes Mayer's 2024 SPAR Application (the linked doc seems to be reused from the 2023 AISC application). Unless otherwise noted, all quote blocks in this post are from the application question doc.

For those of you who aren't Johannes Mayer reading this, I don't think this is the best use of your time, but your judgement on that is likely better than mine, especially when it's conditioned on mine, so if you still want to, read on!

0. The Problem

Make the following assumptions:
Reality can be perfectly modelled by a discrete model (including time).
You can compute everything that can be computed using finite memory and compute instantly.
You

... (read 3951 more words →)

Even if we lose, we win

Morphism

Epistemic status: Updating on this comment and taking into account uncertainty about my own values, my credence in this post is around 50%.

TLDR: Even in worlds where we create an unaligned AGI, it will cooperate acausally with counterfactual FAIs—and spend some percentage of its resources pursuing human values—as long as its utility function is concave in resources spent. The amount of resources UAI spends on humans will be roughly proportional to the measure of worlds with aligned AGI, so this does not change the fact that we should be working on alignment.

Assumptions

Our utility function is concave in resources spent; e.g. we would prefer a 100% chance of 50% of the universe turning

... (read 1083 more words →)

•••

LESSWRONG
LW

LESSWRONG
LW

Morphism

Even if we lose, we win

The formal goal is a pointer

Feeling (instrumentally) Rational

Morphism's Shortform

Morphism

Morphism's Shortform

Feeling (instrumentally) Rational

CCS: Counterfactual Civilization Simulation

The formal goal is a pointer

Counterfactual Civilization Simulation Version -1.0 aka my application to Johannes Mayer's SPAR project

Even if we lose, we win

Morphism

Even if we lose, we win

The formal goal is a pointer

Feeling (instrumentally) Rational

Morphism's Shortform

Morphism

Morphism's Shortform

Feeling (instrumentally) Rational

CCS: Counterfactual Civilization Simulation

The formal goal is a pointer

Counterfactual Civilization Simulation Version -1.0 aka my application to Johannes Mayer's SPAR project

Even if we lose, we win

Newcomblike suffering

Convex agents are practically invisible.

Formalizing Placebomancy

High-Level Overview

0. The Problem

Assumptions