Late 2021 MIRI Conversations: AMA / Discussion

For sure. It's tricky to wipe out humanity entirely without optimizing for that in particular -- nuclear war, climate change, and extremely bad natural pandemics look to me like they're at most global catastrophes, rather than existential threats. It might in fact be easier to wipe out humanity by engineering a pandemic that's specifically optimized for this task (than it is to develop AGI), but we don't see vast resources flowing into humanity-killing-virus projects, the way that we see vast resources flowing into AGI projects. By my accounting, most other x-risks look like wild tail risks (what if there's a large, competent, state-funded successfully-secretive death-cult???), whereas the AI x-risk is what happens by default, on the mainline (humanity is storming ahead towards AGI as fast as they can, pouring billions of dollars into it per year, and by default what happens when they succeed is that they accidentally unleash an optimizer that optimizes for our extinction, as a convergent instrumental subgoal of whatever rando thing it's optimizing).

Late 2021 MIRI Conversations: AMA / Discussion

Question for Richard, Paul, and/or Rohin: What's a story, full of implausibly concrete details but nevertheless a member of some largish plausible-to-you cluster of possible outcomes, in which things go well? (Paying particular attention to how early AGI systems are deployed and to what purposes, or how catastrophic deployments are otherwise forestalled.)

Late 2021 MIRI Conversations: AMA / Discussion

In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once, is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneously possesses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd)."
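A minimal sketch of the toy model above, in Python (all names hypothetical): the proposed discipline is to exhibit one concrete hypothesis bearing every claimed property at once, which immediately catches incompatible property-combinations.

```python
# Toy model: hypotheses decomposed into component properties, with the
# "exhibit a concrete witness" check that guards against claiming to
# visualize incompatible combinations.

def witness(properties, candidates):
    """Return one concrete candidate satisfying every property, or None."""
    for c in candidates:
        if all(p(c) for p in properties):
            return c
    return None

def is_even(n):
    return n % 2 == 0

def is_odd(n):
    return n % 2 == 1

# Exhibiting 14 for evenness and 7 for oddness does not license imagining
# a number that is both even and odd -- no joint witness exists:
assert witness([is_even], [14]) == 14
assert witness([is_odd], [7]) == 7
assert witness([is_even, is_odd], range(10**4)) is None
```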

On my understanding of Eliezer's picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.

Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend spelling it out, in all its great detail, to Eliezer next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.

Late 2021 MIRI Conversations: AMA / Discussion

The way the type (A → O) → A corresponds loosely to the "type of agency" (if you kinda squint at the arrow symbol and play fast-and-loose) is that it suggests a machine that eats a description of how actions (A) lead to outcomes (O), and produces from that description an action.

Consider stating an alignment property for a function f of this type. What sort of thing must it say?

Perhaps you wish to say "when f is fed the actual description of the world, it selects the best possible action". Congratulations, such an f in fact exists; it is called argmax. This does not help you.
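A toy sketch of the point, in Python with a finite action set (all names hypothetical): the inhabitant of this type that "selects the best possible action when fed the true description" is just a max-selection, which is why its existence alone buys nothing.

```python
# Toy version of the type ((A -> utility) -> A): an "optimizer" that eats a
# description of how actions lead to (utilities of) outcomes, and returns an
# action. It trivially satisfies "selects the best possible action when fed
# the actual description" -- the hard work is all in building the description.

def optimize(description, actions):
    """description maps an action to its utility; return the best action."""
    return max(actions, key=description)

# A stand-in for "the actual description of the world":
true_description = {"a": 0.1, "b": 0.9, "c": 0.4}.get
assert optimize(true_description, ["a", "b", "c"]) == "b"
```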

Perhaps you instead wish to say "when f is fed the actual description of the world, it selects an action that gets at least 0.5 utility, after consuming only 10^15 units of compute" or whatever. Now, set aside the fact that you won't find such a function with your theorem-prover AI before somebody else has ended the world (understanding intelligence well enough to build one that you can prove that theorem about, is pro'lly harder than whatever else people are deploying AGIs towards), and set aside also the fact that you're leaving a lot of utility on the table; even if that worked, you're still screwed.

Why are you still screwed? Because the resulting function has the property "if we feed in a correct description of which actions have which utilities, then the optimizer selects an action with high utility". But an enormous chunk of the hard work is in the creation of that description!

For one thing, while our world may have a simple mathematical description (a la "it's some quantum fields doing some quantum fielding"), we don't yet have the true name of our universe. For another thing, even if we did, the level of description that an optimizer works with likely needs to be much coarser than this. For a third thing, even if we had a good coarse-grain description of the world, calculating the consequences that follow from a given action is hard. For a fourth thing, evaluating the goodness of the resulting outcome is hard.

If you can do all those things, then congrats!, you've solved alignment (and a good chunk of capabilities). All that's left is the thing that can operate your description and search through it for high-ranked actions (a remaining capabilities problem).

This isn't intended to be an argument that there does not exist any logical sentence such that a proof of it would save our skins. I'm trying to say something more like: by the time you can write down the sorts of sentences people usually seem to hope for, you've probably solved alignment, and can describe how to build an aligned cognitive system directly, without needing to postulate the indirection where you train up some other system to prove your theorem.

For this reason, I have little hope in sentences of the form "here is an aligned AGI", on account of how once you can say "aligned" in math, you're mostly done and probably don't need the intermediate. Maybe there's some separate, much simpler theorem that we could prove and save our skins -- I doubt we'll find one, but maybe there's some simple mathematical question at the heart of some pivotal action, such that a proof one way or the other would suddenly allow humans to... <??? something pivotal, I don't know, I don't expect such a thing, don't ask me>. But nobody's come up with one that I've heard of. And nobody seems close. And nobody even seems to be really trying all that hard. Like, you don't hear of people talking about their compelling theory of why a given mathematical conjecture is all that stands between humans and <???>, and them banging out the details of their formalization which they expect to only take one more year. Which is, y'know, what it would sound like if they were going to succeed at banging their thing out in five years, and have the pivotal act happen in 15. So, I'm not holding my breath.

Shah and Yudkowsky on alignment failures

("near-zero" is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing "reduce x-risk to near-zero" with "reduce x-risk to sub-50%".)

Impossibility results for unbounded utilities

I'd be happy to read it if you're so inclined and think the prompt would help you refine your own thoughts, but yeah, my anticipation is that it would mostly be updating my (already decent) probability that IB physicalism is a reasonable guess.

A few words on the sort of thing that would update me, in hopes of making it slightly more legible sooner rather than later/never: there's a difference between giving the correct answer to metaethics ("'goodness' refers to an objective (but complicated, and not objectively compelling) logical fact, which was physically shadowed by brains on account of the specifics of natural selection and the ancestral environment"), and the sort of argumentation that, like, walks someone from their confused state to the right answer (eg, Eliezer's metaethics sequence). Like, the confused person is still in a state of "it seems to me that either morality must be objectively compelling, or nothing truly matters", and telling them your favorite theory isn't really engaging with their intuitions. Demonstrating that your favorite theory can give consistent answers to all their questions is something, it's evidence that you have at least produced a plausible guess. But from their confused perspective, lots of people (including the nihilists, including the Bible-based moral realists) can confidently provide answers that seem superficially consistent.

The compelling thing, at least to me and my ilk, is the demonstration of mastery and the ability to build a path from the starting intuitions to the conclusion. In the case of a person confused about metaethics, this might correspond to the ability to deconstruct the "morality must be objectively compelling, or nothing truly matters" intuition, right in front of them, such that they can recognize all the pieces inside themselves, and with a flash of clarity see the knot they were tying themselves into. At which point you can help them untie the knot, and tug on the strings, and slowly work your way up to the answer.

(The metaethics sequence is, notably, a tad longer than the answer itself.)

(If I were to write this whole concept of solutions-vs-answers up properly, I'd attempt some dialogs that make the above more concrete and less metaphorical, but \shrug.)

In the case of IB physicalism (and IB more generally), I can see how it's providing enough consistent answers that it counts as a plausible guess. But I don't see how to operate it to resolve my pre-existing confusions. Like, we work with (infra)measures over Γ×Φ, and we say some fancy words about how Γ is our "beliefs about the computations", but as far as I've been able to make out this is just a neato formalism; I don't know how to get to that endpoint by, like, starting from my own messy intuitions about when/whether/how physical processes reflect some logical procedure. I don't know how to, like, look inside myself, and find confusions like "does logic or physics come first?" or "do I switch which algorithm I'm instantiating when I drink alcohol?", and disassemble them into their component parts, and gain new distinctions that show me how the apparent conflicts weren't true conflicts and all my previous intuitions were coming at things from slightly the wrong angle, and then shift angles and have a bunch of things click into place, and realize that the seeds of the answer were inside me all along, and that the answer is clearly that the universe isn't really just a physical arrangement of particles (or a wavefunction thereon, w/e), but one of those plus a mapping from syntax-trees to bits (an element of Γ). Or whatever the philosophy corresponding to "a hypothesis is an (infra)measure over Γ×Φ" is supposed to be. Like, I understand that it's a neat formalism that does cool math things, and I see how it can be operated to produce consistent answers to various philosophical questions, but that's a long shot from seeing it solve the philosophical problems at hand.
Or, to say it another way: answering my confusion-handles consistently is not nearly enough to get me to take a theory philosophically seriously; like, it's not enough to convince me that the universe actually has an assignment of syntax-trees to bits in addition to the physical state, which is what it looks to me like I'd need to believe if I actually took IB physicalism seriously.

Impossibility results for unbounded utilities

This is my view as well,

(I, in fact, lifted it off of you, a number of years ago :-p)

but you still need to handle the dependence on subjective uncertainty.

Of course. (And noting that I am, perhaps, more openly confused about how to handle the subjective uncertainty than you are, given my confusions around things like logical uncertainty and whether difficult-to-normalize arithmetical expressions meaningfully denote numbers.)

Running through your examples:

It's unclear whether we can have an extraordinarily long-lived civilization ...

I agree. Separately, I note that I doubt total Fun is linear in how much compute is available to civilization; continuity with the past & satisfactory completion of narrative arcs started in the past is worth something, from which we deduce that wiping out civilization and replacing it with a different civilization of similar flourishing and with 2x as much space to flourish in, is not 2x as good as leaving the original civilization alone. But I'm basically like "yep, whether we can get reversibly-computed Fun chugging away through the high-entropy phase of the universe seems like an empirical question with cosmically large swings in utility associated therewith."

But nearly-reversible civilizations can also have exponential returns to the resources they are able to acquire during the messy phase of the universe.

This seems fairly plausible to me! For instance, my best guess is that you can get more than 2x the Fun by computing two people interacting than by computing two individuals separately. (Although my best guess is also that this effect diminishes at scale, \shrug.)

By my lights, it sure would be nice to have more clarity on this stuff before needing to decide how much to rush our expansion. (Although, like, 1st world problems.)

But also it seems quite plausible that our universe is already even-more-exponentially spatially vast, and we merely can't reach parts of it

Sure, this is pretty plausible, but (arguendo) it shouldn't really be factoring into our action analysis, b/c of the part where we can't reach it. \shrug

Perhaps rather than having a single set of physical constants, our universe runs every possible set.

Sure. And again (arguendo) this doesn't much matter to us b/c the others are beyond our sphere of influence.

Why not all of the above? What if the universe is vast and it allows for very long lived civilization? And once we bite any of those bullets to grant 10^100 more people, then it starts to seem like even less of a further ask to assume that there were actually 10^1000 more people instead

I think this is where I get off the train (at least insofar as I entertain unbounded-utility hypotheses). Like, our ability to reversibly compute in the high-entropy regime is bounded by our error-correction capabilities, and we really start needing to upend modern physics as I understand it to make the numbers really huge. (Like, maybe 10^1000 is fine, but it's gonna fall off a cliff at some point.)

I have a sense that I'm missing some deeper point you're trying to make.

I also have a sense that... how to say... like, suppose someone argued "well, you don't have 1/∞ probability that "infinite utility" makes sense, so clearly you've got to take infinite utilities seriously". My response would be something like "That seems mixed up to me. Like, on my current understanding, "infinite utility" is meaningless, it's a confusion, and I just operate day-to-day without worrying about it. It's not so much that my operating model assigns probability 0 to the proposition "infinite utilities are meaningful", as that infinite utilities simply don't fit into my operating model, they don't make sense, they don't typecheck. And separately, I'm not yet philosophically mature, and I can give you various meta-probabilities about what sorts of things will and won't typecheck in my operating model tomorrow. And sure, I'm not 100% certain that we'll never find a way to rescue the idea of infinite utilities. But that meta-uncertainty doesn't bleed over into my operating model, and I'm not supposed to ram infinities into a place where they don't fit just b/c I might modify the type signatures tomorrow."

When you bandy around plausible ways that the universe could be real large, it doesn't look obviously divergent to me. Some of the bullets you're handling are ones that I am just happy to bite, and others involve stuff that I'm not sure I'm even going to think will typecheck, once I understand wtf is going on. Like, just as I'm not compelled by "but you have more than 0% probability that 'infinite utility' is meaningful" (b/c it's mixing up the operating model and my philosophical immaturity), I'm not compelled by "but your operating model, which says that X, Y, and Z all typecheck, is badly divergent". Yeah, sure, and maybe the resolution is that utilities are bounded, or maybe it's that my operating model is too permissive on account of my philosophical immaturity. Philosophical immaturity can lead to an operating model that's too permissive (cf. zombie arguments) just as easily as one that's too strict.

Like... the nature of physical law keeps seeming to play games like "You have continua!! But you can't do an arithmetic encoding. There's infinite space!! But most of it is unreachable. Time goes on forever!! But most of it is high-entropy. You can do reversible computing to have Fun in a high-entropy universe!! But error accumulates." And this could totally be a hint about how things that are real can't help but avoid the truly large numbers (never mind the infinities), or something, I don't know, I'm philosophically immature. But from my state of philosophical immaturity, it looks like this could totally still resolve in a "you were thinking about it wrong; the worst enhugening assumptions fail somehow to typecheck" sort of way.

Trying to figure out the point that you're making that I'm missing, it sounds like you're trying to say something like "Everyday reasoning at merely-cosmic scales already diverges, even without too much weird stuff. We already need to bound our utilities, when we shift from looking at the milk in the supermarket to looking at the stars in the sky (never mind the rest of the mathematical multiverse, if there is such a thing)." Is that about right?

If so, I indeed do not yet buy it. Perhaps spell it out in more detail, for someone who's suspicious of any appeals to large swaths of terrain that we can't affect (eg, variants of this universe w/ sufficiently different cosmological constants, at least in the regions where the locals aren't thinking about us-in-particular); someone who buys reversible computing but is going to get suspicious when you try to drive the error rate to shockingly low lows?

To be clear, insofar as modern cosmic-scale reasoning diverges (without bringing in considerations that I consider suspicious and that I suspect I might later think belong in the 'probably not meaningful (in the relevant way)' bin), I do start to feel the vice grips on me, and I expect I'd give bounded utilities another look if I got there.

Impossibility results for unbounded utilities

Those & others. I flailed towards a bunch of others in my thread w/ Paul. Throwing out some taglines:

  • "does logic or physics come first???"
  • "does it even make sense to think of outcomes as being mathematical universes???"
  • "should I even be willing to admit that the expression "3^^^3" denotes a number before taking time proportional to at least log(3^^^3) to normalize it?"
  • "is the thing I care about more like which-computations-physics-instantiates, or more like the-results-of-various-computations??? is there even a difference?"
  • "how does the fact that larger quantum amplitudes correspond to more magical happening-ness relate to the question of how much more I should care about a simulation running on a computer with wires that are twice as thick???"
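For reference on the third handle: "3^^^3" is Knuth up-arrow notation, and a minimal recursive sketch (hypothetical helper name `up`) shows why normalizing such an expression is wildly out of reach.

```python
# Knuth up-arrow notation, as in "3^^^3": one arrow is exponentiation,
# each extra arrow iterates the previous operation. Only tiny inputs
# terminate; 3^^^3 = 3^^(3^^3) = 3^^7625597484987 is hopelessly beyond
# evaluation, which is the point of the normalization confusion-handle.

def up(a, b, arrows):
    """Knuth's a ↑^arrows b, via the standard recursion."""
    if arrows == 1:
        return a ** b
    if b == 1:
        return a
    return up(a, up(a, b - 1, arrows), arrows - 1)

# 3^^2 = 3^3 = 27, and 3^^3 = 3^27 is already a 13-digit number:
assert up(3, 2, 2) == 27
assert up(3, 3, 2) == 7625597484987
```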

Note that these aren't supposed to be particularly well-formed questions. (They're more like handles for my own confusions.)

Note that I'm open to the hypothesis that you can resolve some but not others. From my own state of confusion, I'm not sure which issues are interwoven, and it's plausible to me that you, from a state of greater clarity, can see independences that I cannot.

Note that I'm not asking for you to show me how IB physicalism chooses a consistent set of answers to some formal interpretations of my confusion-handles. That's the sort of (non-trivial and virtuous!) feat that causes me to rate IB physicalism as a "plausible guess".

In the specific case of IB physicalism, I'm like "maaaybe? I don't yet see how to relate this Γ that you suggestively refer to as a 'map from programs to results' to a philosophical stance on computation and instantiation that I understand" and "I'm still not sold on the idea of handling non-realizability with inframeasures (on account of how I still feel confused about a bunch of things that inframeasures seem like a plausible guess for how to solve)" and etc.

Maybe at some point I'll write more about the difference, in my accounting, between plausible guesses and solutions.

Impossibility results for unbounded utilities

I am definitely entertaining the hypothesis that the solution to naturalism/anthropics is in no way related to unbounded utilities. (From my perspective, IB physicalism looks like a guess that shows how this could be so, rather than something I know to be a solution, ofc. (And as I said to Paul, the observation that would update me in favor of it would be demonstrated mastery of, and unravelling of, my own related confusions.))

Impossibility results for unbounded utilities

Ok, cool, I think I see where you're coming from now.

I don't think this is unlisted though ...

Fair! To a large degree, I was just being daft. Thanks for the clarification.

It seems to me that our actual situation (i.e. my actual subjective distribution over possible worlds) is divergent in the same way as the St Petersburg lottery, at least with respect to quantities like expected # of happy people.

I think this is a good point, and I hadn't had this thought quite this explicitly myself, and it shifts me a little. (Thanks!)

(I'm not terribly sold on this point myself, but I agree that it's a crux of the matter, and I'm sympathetic.)

But at that point it seems much more likely that preferences just aren't defined over probability distributions at all

This might be where we part ways? I'm not sure. A bunch of my guesses do kinda look like things you might describe as "preferences not being defined over probability distributions" (eg, "utility is a number, not a function"). But simultaneously, I feel solid in my ability to use probability distributions and utility functions in day-to-day reasoning problems after I've chunked the world into a small finite number of possible actions and corresponding outcomes, and I can see a bunch of reasons why this is a good way to reason, and whatever the better preference-formalism turns out to be, I expect it to act a lot like probability distributions and utility functions in the "local" situation after the reasoner has chunked the world.

Like, when someone comes to me and says "your small finite considerations in terms of actions and outcomes are super simplified, and everything goes nutso when we remove all the simplifications and take things to infinity, but don't worry, sanity can be recovered so long as you (eg) care less about each individual life in a big universe than in a small universe", then my response is "ok, well, maybe you removed the simplifications in the wrong way? or maybe you took limits in a bad way? or maybe utility is in fact bounded? or maybe this whole notion of big vs small universes was misguided?"

It looks to me like you're arguing that one should either accept bounded utilities, or reject the probability/utility factorization in normal circumstances, whereas to me it looks like there's still a whole lot of flex (ex: 'outcomes' like "I come back from the store with milk" and "I come back from the store empty-handed" shouldn't have been treated the same way as 'outcomes' like "Tegmark 3 multiverse branch A, which looks like B" and "Conway's game of life with initial conditions X, which looks like Y", and something was going wrong in our generalization from the everyday to the metaphysical, and we shouldn't have been identifying outcomes with universes and expecting preferences to be a function of probability distributions on those universes, but thinking of "returning with milk" as an outcome is still fine).

And maybe you'd say that this is just conceding your point? That when we pass from everyday reasoning about questions like "is there milk at the store, or not?" to metaphysical reasoning like "Conway's Life, or Tegmark 3?", we should either give up on unbounded utilities, or give up on thinking of preferences as defined on probability distributions on outcomes? I more-or-less buy that phrasing, with the caveat that I am open to the weak-point being this whole idea that metaphysical universes are outcomes and that probabilities on outcome-collections that large are reasonable objects (rather than the weakpoint being the probablity/utility factorization per se).

it seems odd to hold onto probability distributions as the object of preferences while restricting the space of probability distributions far enough that they appear to exclude our current situation

I agree that would be odd.

One response I have is similar to the above: I'm comfortable using probability distributions for stuff like "does the store have milk or not?" and less comfortable using them for stuff like "Conway's Life or Tegmark 3?", and wouldn't be surprised if thinking of mathematical universes as "outcomes" was a Bad Plan and that this (or some other such philosophically fraught assumption) was the source of the madness.

Also, to say a bit more on why I'm not sold that the current situation is divergent in the St. Petersburg way wrt, eg, amount of Fun: if I imagine someone in Vegas offering me a St. Petersburg gamble, I imagine thinking through it and being like "nah, you'd run out of money too soon for this to be sufficiently high EV". If you're like "ok, but imagine that the world actually did look like it could run the gamble infinitely", my gut sense is "wow, that seems real sus". Maybe the source of the susness is that eventually it's just not possible to get twice as much Fun. Or maybe it's that nobody anywhere is ever in a physical position to reliably double the amount of Fun in the region that they're able to affect. Or something.
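The "you'd run out of money too soon" point can be made concrete: the standard St. Petersburg lottery (payout 2^k with probability 2^-k) has divergent expected value, but capping the house's bankroll makes the expected value roughly the logarithm of the cap. A minimal sketch (`truncated_ev` is a hypothetical helper name):

```python
# St. Petersburg lottery with a finite bankroll: each feasible round
# contributes exactly (2**k) * (2**-k) = 1 to the EV, and all remaining
# probability mass pays out the cap, so the EV grows only like log2(B)
# rather than diverging.

def truncated_ev(bankroll):
    """Exact expected payout when payouts are capped at `bankroll`."""
    ev, k = 0.0, 1
    while 2 ** k < bankroll:
        ev += (2 ** k) * (0.5 ** k)  # each feasible round adds exactly 1
        k += 1
    ev += bankroll * (0.5 ** (k - 1))  # remaining mass pays the cap
    return ev

assert truncated_ev(2 ** 10) == 11.0  # a ~thousand-unit bankroll: EV 11
assert truncated_ev(2 ** 20) == 21.0  # a ~million-unit bankroll: EV 21
```

Doubling the house's bankroll a thousandfold adds only ~10 to the expected payout, which is one way of cashing out the "real sus" feeling about anyone being positioned to run the gamble unboundedly.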

And, I'm sympathetic to the objection "well, you surely shouldn't assign probability less than <some vanishingly small but nonzero number> that you're in such a situation!". And maybe that's true; it's definitely on my list of guesses. But I don't by any means feel forced into that corner. Like, maybe it turns out that the lightspeed limit in our universe is a hint about what sort of universes can be real at all (whatever the heck that turns out to mean), and an agent can't ever face a St. Petersburgish choice in some suitably general way. Or something. I'm more trying to gesture at how wide the space of possibilities seems to me from my state of confusion, than to make specific counterproposals that I think are competitive.

(And again, I note that the reason I'm not updating (more) towards your apparently-narrower stance, is that I'm uncertain about whether you see a narrower space of possible resolutions on account of being less confused than I am, vs because you are making premature philosophical commitments.)

To be clear, I agree that you need to do something weirder than "outcomes are mathematical universes, preferences are defined on (probability distributions over) those" if you're going to use unbounded utilities. And again, I note that "utility is bounded" is reasonably high on my list of guesses. But I'm just not all that enthusiastic about "outcomes are mathematical universes" in the first place, so \shrug.

The fact that B can never come about in reality doesn't really change the situation, you still would have expected consistently-correct intuitions to yield consistent answers.

I think I understand what you're saying about thought experiments, now. In my own tongue: even if you've convinced yourself that you can't face a St. Petersburg gamble in real life, it still seems like St. Petersburg gambles form a perfectly lawful thought experiment, and it's at least suspicious if your reasoning procedures would break down facing a perfectly lawful scenario (regardless of whether you happen to face it in fact).

I basically agree with this, and note that, insofar as my confusions resolve in the "unbounded utilities" direction, I expect some sort of account of metaphysical/anthropic/whatever reasoning that reveals St. Petersburg gambles (and suchlike) to be somehow ill-conceived or ill-typed. Like, in that world, what's supposed to happen when someone is like "but imagine you're offered a St. Petersburg bet" is roughly the same as what's supposed to happen when someone's like "but imagine a physically identical copy of you that lacks qualia" -- you're supposed to say "no", and then be able to explain why.

(Or, well, you're always supposed to say "no" to the gamble and be able to explain why, but what's up for grabs is whether the "why" is "because utility is bounded", or some other thing, where I at least am confused enough to still have some of my chips on "some other thing".)

To be explicit, the way that my story continues to shift in response to what you're saying, is an indication of continued updating & refinement of my position. Yay; thanks.
