Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for short-form writing by Tamsin Leake. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.

an approximate illustration of QACI:

Nice graphic!

What stops e.g. "QACI(expensive_computation())" from being an optimization process which ends up trying to "hack its way out" into the real QACI?

nothing fundamentally, the user has to be careful what computation they invoke.

That... seems like a big part of what having "solved alignment" would mean, given that you have AGI-level optimization aimed at (indirectly via a counter-factual) evaluating this (IIUC).

one solution to this problem is to simply never use that capability (running expensive computations) at all, or to not use it before the iterated counterfactual researchers have developed proofs that any expensive computation they run is safe, or before they have very slowly and carefully built dath-ilan-style corrigible aligned AGI.

A short comic I made to illustrate what I call "outside-view double-counting".

(resized to not ruin how it shows on lesswrong, full-scale version here)

I remember a character in Asimov's books saying something to the effect of

It took me 10 years to realize I had those powers of telepathy, and 10 more years to realize that other people don't have them.

and that quote has really stuck with me, and keeps striking me as true about many mindthings (object-level beliefs, ontologies, ways-to-use-one's-brain, etc).

For so many complicated problems (including technical problems), "what is the correct answer?" is not-as-difficult to figure out as "okay, now that I have the correct answer: how the hell do other people's wrong answers mismatch mine? what is the inferential gap even made of? what is even their model of the problem? what the heck is going on inside other people's minds???"

Answers to technical questions, once you have them, tend to be simple and compress easily with the rest of your ontology. But not models of other people's minds. People's minds are actually extremely large things that you fundamentally can't fully model and so you're often doomed to confusion about them. You're forced to fill in the details with projection, and that's often wrong because there's so much more diversity in human minds than we imagine.

The most complex software engineering projects in the world are absurdly tiny in complexity compared to a random human mind.

People's minds are actually extremely large things that you fundamentally can't fully model

Is this "fundamentally" as in "because you, the reader, are also a bounded human, like them"? Or "fundamentally" as in (something more fundamental than that)?

The first one. Alice fundamentally can't fully model Bob because Bob's brain is as large as Alice's, so she can't fit it all inside her own brain without simply becoming Bob.

I relate to this quite a bit ;-;

If timelines weren't so short, brain-computer-based telepathy would unironically be a big help for alignment.

(If a group had the money/talent to "hedge" on longer timelines by allocating some resources to that... well, instead of a hivemind, they first need to run through the relatively-lower-hanging fruit. Actually, maybe they should work on delaying capabilities research, or funding more hardcore alignment themselves, or...)

I should note that it's not entirely known whether quining is applicable for minds.

I've heard some describe my recent posts as "overconfident".

I think I used to calibrate how confident I sound based on how much I expect the people reading/listening-to me to agree with what I'm saying, kinda out of "politeness" for their beliefs; and I think I also used to calibrate my confidence based on how much my claims match the apparent consensus, to avoid seeming strange.

I think I've done a good job learning over time to instead report my actual inside-view, including how confident I feel about it.

There's already an immense amount of outside-view double-counting going on in AI discourse, the least I can do is provide {the people who listen to me} with my inside-view beliefs, as opposed to just cycling other people's opinions through me.

Hence, how confident I sound while claiming things that don't match consensus. I actually am that confident in my inside-view. I strive to be honest by hedging what I say when I'm in doubt, but that means I also have to sound confident when I'm confident.

I'm a big fan of Rob Bensinger's "AI Views Snapshot" document idea. I recommend people fill their own before anchoring on anyone else's.

Here's mine at the moment:

(cross-posted from my blog)

let's stick with the term "moral patient"

"moral patient" means "entities that are eligible for moral consideration". as a recent post i've liked puts it:

And also, it’s not clear that “feelings” or “experiences” or “qualia” (or the nearest unconfused versions of those concepts) are pointing at the right line between moral patients and non-patients. These are nontrivial questions, and (needless to say) not the kinds of questions humans should rush to lock in an answer on today, when our understanding of morality and minds is still in its infancy.

in this spirit, i'd like us to stick with using the term "moral patient" or "moral patienthood" when we're talking about the set of things worthy of moral consideration. in particular, we should be using that term instead of:

  • "conscious things"
  • "sentient things"
  • "sapient things"
  • "self-aware things"
  • "things with qualia"
  • "things with experiences"
  • "things that aren't p-zombies"
  • "things for which there is something it's like to be them"

because those terms are hard to define, harder to meaningfully talk about, and we don't in fact know that those are what we'd ultimately want to base our notion of moral patienthood on.

so if you want to talk about the set of things which deserve moral consideration outside of a discussion of what precisely that means, don't use a term which you feel is probably the criterion that's gonna ultimately determine which things are worthy of moral consideration, such as "conscious beings", because you might in fact be wrong about what you'd consider to have moral patienthood under reflection. simply use the term "moral patients", because it is the term which unambiguously means exactly that.

(to be clear: this is more an amusing suggestion than a serious belief)


[This comment is no longer endorsed by its author]

Have you seen this implemented in any blogging platform other people can use? I'd love to see this feature implemented in some Obsidian publishing solution like quartz, but for now they mostly don't care about access management.

Take our human civilization, at the point in time at which we invented fire. Now, compute forward all possible future timelines, each right up until the point where it's at risk of building superintelligent AI for the first time. Now, filter for only timelines which either look vaguely like earth or look vaguely like dath ilan.

What's the ratio between the number of such worlds that look vaguely like earth vs look vaguely like dath ilan? 100:1 earths:dath-ilans ? 1,000,000:1 ? 1:1 ?

Even in the fiction, I think dath ilan didn't look vaguely like dath ilan until after it was at risk of building superintelligent AI for the first time. They completely restructured their society and erased their history to avert the risk.

By "vaguely like dath ilan" I mean the parts that made them be the kind of society that can restructure in this way when faced with AI risk. Like, even before AI risk, they were already very different from us.

Ah, I see! Yeah, I have pretty much no idea.

I vaguely suspect that humans are not inherently well-suited to coordination in that sense, and that it would take an unusual cultural situation to achieve it. We never got anywhere close at any point in our history. It also seems likely that the window to achieve it could be fairly short. There seems to be a lot of widespread mathematical sophistication required as described, and I don't think that naturally arises long before AI.

On the other hand, maybe some earlier paths of history could and normally should have put some useful social technology and traditions in place that would be built on later in many places and ways, but for some reason that didn't happen for us. Some early unlikely accident predisposed us to our sorts of societies instead. Our sample size of 1 is difficult to generalize from.

I would put my credence median well below 1:1, but any distribution I have would be very broad, spanning orders of magnitude of likelihood and the overall credence something like 10%. Most of that would be "our early history was actually weird".

I'm kinda bewildered at how I've never observed someone say "I want to build aligned superintelligence in order to resurrect a loved one". I guess the intersection of the sets of people who {have lost a loved one they wanna resurrect}, {take the singularity and the possibility of resurrection seriously}, and {would mention this} is… the empty set??

(I have met one person who is glad that alignment would also get them this, but I don't think it's their core motivation, even emotionally. Same for me.)

Do you have any (toy) math arguing that it's information-theoretically possible?

I currently consider it plausible that yeah, actually, for any person X who still exists in cultural memory (let alone living memory, let alone if they lived recently enough to leave a digital footprint), the set of theoretically-possible psychologically-human minds whose behavior would be consistent with X's recorded behavior is small enough that none of the combinatorial-explosion arguments apply, so you can just generate all of them and thereby effectively resurrect X.

But you sound more certain than that. What's the reasoning?

(Let's call the dead person "rescuee" and the person who wants to resurrect them "rescuer".)

The procedure you describe is what I call "lossy resurrection". What I'm talking about looks like: you resimulate the entire history of the past-lightcone on a quantum computer, right up until the present, and then either:

  • You have a quantum algorithm for "finding" which branch has the right person (and you select that timeline and discard the rest) (requires that such a quantum algorithm exists)
  • Each branch embeds a copy of the rescuer, and whichever branch looks like the correct one isekai's the rescuer into the branch, right next to the rescuee (and also insta-utopia's the whole branch) (requires that the rescuer doesn't mind having their realityfluid exponentially reduced)

(The present time "only" serves as a "solomonoff checksum" to know which seed / branch is the right one.)

This is O(exp(size of the seed of the universe) * amount of history between the seed and the rescuee). Doable if the seed of the universe is small and either of the two requirements above hold, and if the future has enough negentropy to resimulate the past. (That last point is a new source of doubt for me; I kinda just assumed it was true until a friend told me it might not be.)
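Spelled out with symbols (the names $s$ and $T$ are assumptions introduced here for illustration, not established notation): let $s$ be the size of the seed of the universe in bits and $T$ the amount of history to simulate between the seed and the rescuee. Reading "exp" as "exponential in", a brute-force pass over every candidate seed would cost roughly:

```latex
\[
\text{cost}
  \;=\; \underbrace{2^{s}}_{\text{candidate seeds}}
  \times \underbrace{T}_{\text{history per seed}}
  \;=\; O\!\left(2^{s}\, T\right)
\]
```

Under this reading, each extra bit of seed doubles the cost while extra history only adds linearly, which is why the size of the seed is the term that decides whether this is doable.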

(Oh, and also you can't do this if resimulating the entire history of the universe — which contains at least four billion years of wild animal suffering(!) — is unethical.)

and if the future has enough negentropy to resimulate the past. (That last point is a new source of doubt for me; I kinda just assumed it was true until a friend told me it might not be.)

Yeah, I don't know about this one either.

Even if possible, it might be incredibly wasteful, in terms of how much negentropy (= future prosperity for new people) we'll need to burn in order to rescue one person. And then the more we rescue, the less value we get out of that as well, since burning negentropy will reduce their extended lifespans too. So we'd need to assign greater (dramatically greater?) value to extending the life of someone who'd previously existed, compared to letting a new person live for the same length of time.

"Lossy resurrection" seems like a more negentropy-efficient way of handling that, by the same tokens as acausal norms likely being a better way to handle acausal trade than low-level simulations and babble-and-prune not being the most efficient way of doing general-purpose search.

Like, the full-history resimulation will surely still not allow you to narrow things down to one branch. You'd get an equivalence class of them, each of them consistent with all available information. Which, in turn, would correspond to a probability distribution over the rescuee's mind; not a unique pick.

Given that, it seems plausible that there's some method by which we can get to the same end result – constrain the PD over the rescuee's mind by as much as the data available to us can let us – without actually running the full simulation.

Depends on what the space of human minds looks like, I suppose. Whether it's actually much lower-dimensional than a naive analysis of possible brain-states suggests.

I'm pretty sure we just need one resimulation to save everyone; once we have located an exact copy of our history, it's cheap to pluck out anyone (including people dead 100 or 1000 years ago). It's a one-time cost.

Lossy resurrection is better than nothing but it doesn't feel as "real" to me. If you resurrect a dead me, I expect that she says "I'm glad I exist! But, at least as per my ontology and values, you shouldn't quite think of me as the same person as the original. We're probly quite different, internally, and thus behaviorally as well, when run over some time."

Like, the full-history resimulation will surely still not allow you to narrow things down to one branch. You'd get an equivalence class of them, each of them consistent with all available information. Which, in turn, would correspond to a probability distribution over the rescuee's mind; not a unique pick.

I feel like I'm not quite sure about this? It depends on what quantum mechanics entails, exactly, I think. For example: if BQP = P, then there's "only a polynomial amount" of timeline-information (whatever that means!), and then my intuition tells me that the "our world serves as a checksum for the one true (macro-)timeline" idea is more likely to be a thing. But this reasoning is still quite heuristic. Plausibly, yeah, the best we get is a polynomially large or even exponentially large distribution.

That said, to get back to my original point, I feel like there's enough unknowns making this scenario plausible here, that some people who really want to get reunited with their loved ones might totally pursue aligned superintelligence just for a potential shot at this, whether their idea of reuniting requires lossless resurrection or not.

I feel like there's enough unknowns making this scenario plausible here

No argument on that.

I don't find it particularly surprising that {have lost a loved one they wanna resurrect} ∩ {take the singularity and the possibility of resurrection seriously} ∩ {would mention this} is empty, though:

  • "Resurrection is information-theoretically possible" is a longer leap than "believes an unconditional pro-humanity utopia is possible", which is itself a bigger leap than just "takes singularity seriously". E.g., there's a standard-ish counter-argument to "resurrection is possible" which naively assumes a combinatorial explosion of possible human minds consistent with a given behavior. Thinking past it requires some additional less-common insights.
  • "Would mention this" is downgraded by it being an extremely weakness/vulnerability-revealing motivation. Much more so than just "I want an awesome future".
  • "Would mention this" is downgraded by... You know how people who want immortality get bombarded with pop-culture platitudes about accepting death? Well, as per above, immortality is dramatically more plausible-sounding than resurrection, and it's not as vulnerable-to-mention a motivation. Yet talking about it is still not a great idea in a "respectable" company. Goes double for resurrection.

Typical user of outside-view epistemics

(actually clipped from this YourMovieSucks video)

(Epistemic status: Not quite sure)

Realityfluid must normalize for utility functions to work (see 1, 2). But this is a property of the map, not the territory.

Normalizing realityfluid is a way to point to an actual (countably) infinite territory using a finite (conserved-mass) map object.
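A toy sketch of what "normalizing" buys here, under loud assumptions: worlds are indexed by the naturals, world n gets weight 2^-(n+1) (a hypothetical choice, picked only because it sums to 1), and utilities are bounded. None of these names come from the original posts. The point is just that a conserved total mass lets a sum over infinitely many worlds stay finite.

```python
from fractions import Fraction
from typing import Callable


def geometric_weight(n: int) -> Fraction:
    """Hypothetical realityfluid of world n: 2^-(n+1), so all worlds together sum to 1."""
    return Fraction(1, 2 ** (n + 1))


def partial_mass(num_worlds: int) -> Fraction:
    """Total weight of the first `num_worlds` worlds; approaches 1 from below."""
    return sum((geometric_weight(n) for n in range(num_worlds)), Fraction(0))


def expected_utility(utility: Callable[[int], int], num_worlds: int) -> Fraction:
    """Weighted utility over the first `num_worlds` worlds (a truncation of the infinite sum)."""
    return sum(
        (geometric_weight(n) * utility(n) for n in range(num_worlds)),
        Fraction(0),
    )


if __name__ == "__main__":
    # The mass never exceeds 1, however many worlds are included.
    print(partial_mass(10), partial_mass(50) < 1)
    # A bounded utility (1 in odd-indexed worlds, 0 elsewhere) gives a convergent answer.
    print(float(expected_utility(lambda n: n % 2, 50)))
```

With unnormalized weights (say, weight 1 for every world) the same sum would grow without bound as worlds are added, so comparing two options by expected utility would stop being meaningful.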

Many mechanisms of aggregation literally normalize random elements. Simple addition of evenly-distributed linear values (say, dice) yields a triangular distribution for two of them and approaches a normal distribution (aka bell curve) as more are added.
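To see what summing dice actually does to the distribution, here is a minimal sketch that computes the exact probabilities by repeated convolution. The function name and the choice of fair six-sided dice are assumptions for illustration. With two dice the shape is a triangle (the probabilities rise and fall in equal steps); the bell shape only emerges as more dice are added, per the central limit theorem.

```python
from collections import defaultdict
from fractions import Fraction
from typing import Dict


def sum_distribution(num_dice: int, sides: int = 6) -> Dict[int, Fraction]:
    """Exact distribution of the total shown by `num_dice` fair dice with `sides` faces."""
    dist: Dict[int, Fraction] = {0: Fraction(1)}  # zero dice: total is 0 with certainty
    face_probability = Fraction(1, sides)
    for _ in range(num_dice):
        next_dist: Dict[int, Fraction] = defaultdict(Fraction)
        # Convolve the running total with one more uniform die.
        for total, probability in dist.items():
            for face in range(1, sides + 1):
                next_dist[total + face] += probability * face_probability
        dist = dict(next_dist)
    return dist


if __name__ == "__main__":
    for count in (1, 2, 3):
        dist = sum_distribution(count)
        print(count, [str(dist[total]) for total in sorted(dist)])
```

Using `Fraction` keeps the probabilities exact, so the equal-step rise for two dice can be checked directly rather than eyeballed from floats.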

And yes, human experience is all map - the actual state of the universe is imperceptible.

I replied on discord that I feel there's maybe something more formalisable that's like:

  • reality runs on math because, and is the same thing as, there's a generalised-state-transition function
  • because reality has a notion of what happens next, realityfluid has to give you a notion of what happens next, i.e. it normalises
  • the idea of a realityfluid that doesn't normalise only comes to mind at all because you learned about R^n first in elementary school instead of S^n

which I do not claim confidently because I haven't actually generated that formalisation, and am posting here because maybe there will be another Lesswronger's eyes on it that's like "ah, but...". 

(cross-posted from my blog)

nostalgia: a value pointing home

i value moral patients everywhere having freedom, being diverse, engaging in art and other culture, not undergoing excessive unconsented suffering, in general having a good time, and probly other things as well. but those are all pretty abstract; given those values being satisfied to the same extent, i'd still prefer me and my friends and my home planet (and everyone who's been on it) having access to that utopia rather than not. this value, the value of not just getting an abstractly good future but also getting me and my friends and my culture and my fellow earth-inhabitants to live in it, my friend Prism coined as "nostalgia".

not that those abstract values are simple or robust, they're still plausibly not. but they're, in a sense, broader values about what happens everywhere, and they're not as much local and pointed at and around me. they could be the difference between what i'd call "global" and "personal" values, or perhaps between "global values" and "preferences".