Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day for 25 days. Or until I run out of hot takes.

Some people think as if there are indescribable heavenworlds. They're wrong, and this is important to AI alignment.

This is an odd accusation given that I made up the phrase "indescribable heavenworld" myself, so let me explain. It starts not with heavenworlds, but with Stuart Armstrong writing about the implausibility of indescribable hellworlds. 

A hellworld is, obviously, a bad state of the world. An indescribable hellworld is a state of the world where everything looks fine at first, and then you look closer and everything still looks fine, and then you sit down and think about it abstractly and it still seems fine, and then you go build tools to amplify your capability to inspect the state of the world and they say it's fine - but actually, it's bad.

If the existence of such worlds sounds plausible to you, then I think you might enjoy and benefit from trying to grok the metaethics sequence.

Indescribable hellworlds are sort of like the reductio of an open question argument. Open question arguments say that no matter what standard of goodness you set, if it's a specific function of the state of the world then it's an open question whether that function is actually good or not (and therefore moral realism). For a question to really be open, it must be possible to get either answer - and indescribable hellworlds are what keep the question open even if we use the standard of all of human judgment, human cleverness, and human reflectivity.

If you read Reducing Goodhart, you can guess some things I'd say about indescribable hellworlds. There is no unique standard of "explainable," and you can have worlds that are the subject of inter-standard conflict (even supposing badness is fixed), which can sort of look like indescribable badness. But ultimately, the doubt over whether some world is bad puts a limit on how hellish it can really be, sort of like harder choices matter less. A preference that can't get translated into some influence on my choices is a weak preference indeed.

An indescribable heavenworld is of course the opposite of an indescribable hellworld. It's a world where everything looks weird and bad at first, and then you look closer and it still looks weird and bad, and you think abstractly and yadda yadda still seems bad, but actually, it's the best world ever.

Indescribable heavenworlds come up when thinking about what happens if everything goes right. "What if" - some people wonder - "the glorious post-singularity utopia is actually good in ways that are impossible for humans to comprehend? That would, by definition, be great, but I worry that some people might try to stop that glorious future from happening by trying to rate futures using their present preferences / judgment / cleverness / reflectivity. Don't leave your fingerprints on the future, people!"

No indescribable heavenworlds. If a future is good, it's good for reasons that make sense to me - maybe not at first glance, but hopefully at second glance, or after some abstract thought, or with the assistance of some tools whose chain of logic makes sense to me. If the future realio trulio seems weird and bad after all that work, it's not secretly great, we probably just messed up.




I mean, that makes sense - perhaps more so than it does for Hells, if we allow arbitrarily smart deceptive adversaries - but now I'm wondering if your first sentence is a strawman.

Assuming, and leaning on the assumption, that you will systematically never mess up is very dangerous. It's an anti-Murphy law: "everything that could go wrong will be okayish, because otherwise we would be dead already".

I think it is a very solid phenomenon that pushing science forward will not diminish the capability to be surprised. Models have limits. Singularities in the sense of "here our models break down and we can't anticipate what happens" are a real thing. A world that sits in that singularity area of your models is not one whose classification I would call "describable".

That we can't rule out that an exotic state is good does not constitute a reason to think it is good. If we have reasons to think a world is bad, having doubts about it does not mean we have (yet) lost those reasons. Doubting inconvenient models is not a get-out-of-jail-free card. But having a model does not oblige you to trust it without verification.

I agree with all of your comments, but I don't think they weigh on the key point of the original post. Thoughts on how they connect?

The take is a gross overcorrection to the stuff that it criticises. Yes, you need to worry about indescribable heavenworlds. No, you have not got ethics figured out. No, you need to keep updating your ontology. No, nature is not obligated to make sense to you. Value is actually fragile and can't withstand your rounding.

There's a big difference between ethics and physics.

When you "don't have physics figured out," this is because there's something out there in reality that you're wrong about. And this thing has no obligation to ever reveal itself to you - it's very easy to come up with physics that's literally inexplicable to a human - just make it more complicated than the human mind can contain, and bada bing.

When you "don't have ethics figured out," it's not that there's some ethical essence out there in reality that contradicts you, it's because you are a human, and humans grow and change as they live and interact with the world. We change our minds because we live life, not because we're discovering objective truths - it would be senseless to say "maybe the true ethics is more complicated than a human mind can contain!"

Sure, that is a common way to derive the challenge for physics.

But we can have it via other routes. Digits of pi do not listen to commands on what they should be. Chess is not mean to you when it is intractable. Failure to model is a lack of imagination rather than a model of failure. Statements like "this model is correct and nothing unmodeled has any bearing on its truth or applicability" are so prone to be wrong that they are uninteresting.

I do grant that "nature" often primarily means "material reality", when I could have phrased it as "reality has no obligation to be clear" to mean a broader thing. To the extent that observing a target does not change it (I am leaving some superwild things out), limits on the ability to make a picture say more about the observer than the observed. It is the difference between a positive proof of a limitation and a failure to produce a proof of a property. And if we have a system A that proves things about system B, that never escapes the reservations about A being true. Therefore it is always "as far as we can tell" and "according to this approach".

I do think it is more productive to think that questions like "Did I do right in this situation?" have answers that are outside the individual that formulates the question. And that this is not bound to particular theories of rightness. That is, whatever we do with ethics (grow / discover / dialogue-build etc.), we are not setting it as we go. That activity belongs more to the area of law. We can decide what is lawful and what is condoned, but we can't do the same for what is ethical.

ah, I see. I think I meaningfully disagree; I have ethics close enough to figured out that if something was clearly obviously terrible to me now, it is incredibly likely it is simply actually terrible. Yes, there are subspaces of possibility I would rate differently when I first encountered them than after I've thought about it, but in general the claim here is that adversarial examples are adversarial examples.

Yes, the edgecases are the things which kill perfectly good theories.

I would be pretty surprised if you said that your ethical stance would incorporate it without hiccups if it turned out that the simulation hypothesis is true. Or that the world shares one consciousness. So I am guessing that the total probability of all that funky stuff taken together is taken to be low. So nobody will ever need more than 640k. 10,000 years of AGI-powered civilization and not one significant hole will be found. That is an astonishingly strong grasp of ethics.

I mean to say that most edgecases break my evaluation, not my true preferences; only a relatively small subset of things which appear to me to be bad are things which are in fact actually good according to my preferences.

I actually am confused by your choice of examples - both of those seem like invariants one should hold. If the simulation hypothesis is true, the universe is bigger than we thought; unless it changes things far more fundamental than "what level of nesting are we", simulation wouldn't change anything. That's because the overwhelming majority of our measure isn't nested.

"one consciousness" is a confused phrase - you are not "one" conscious, you are approximately 7e27 "consciousness"-es (atoms) which, for some reason, seem to "actually exist" in the probability fields of reality, and which share information with each other, thereby becoming "conscious" of the impacts of other particles, and it is the aggregation of this information-form "awareness" that allows structured souls to exist. To the degree two particles have causal impact on each other, their worldlines "become aware" of each other. For this reason, it is not nonsensical that IIT rates fire as the most conscious thing - it's maximum suffering, since it is creating an enormous amount of information integration without creating associated self-preferred self-preserving structure, ie life.

certainly there are a great many things I'm uncertain about. But, if you can't point to the descendent of the me-structure and show how that me-structure has turned incrementally into something I recognize as me having a good time, then yeah, 10k years of ai civilization wouldn't be enough to disprove that my form was lost and that this is bad.

I happened to stumble on an old comment where I was already of the opinion that progress is not a "refinement" but will "defocus" from old division lines.

At some midskill level "fruit maximisation" peaks, and those that don't understand things beyond that point will confuse those that have yet to get to fruit maximisation with those that are past it.

If someone said "you were suboptimal on the fruit front, I fixed that mistake for you" and I arrived at a table with 2 wormy apples, I would be annoyed/pissed. I am assuming that the other agent can't evaluate their cleanness - it's all fruit to them.

One could do similarly with radioactive apples etc. In a certain sense, yes, it is about the ability to perceive properties - even I use the verb "evaluate". But I don't find the break between preferences and evaluations so easy to justify. Knowing and opining that "worminess" is a relevant thing is not value-neutral. Reflecting upon "apple with no worm" vs "apple with worm" can have results that overpower old reflections on "apple" vs "pear" even though it is not contradicted (wormless pear vs wormful apple is in a sense a "mere adversarial example" - it doesn't violate the species preference, but it can absolutely render it irrelevant).

My examples of wacky scenarios are bad. I was thinking that if one holds that playing Grand Theft Auto is not unethical and "ordinary murder" is unethical, then if it turns out that reality is similar to GTA in a "relevant way", this might be a non-trivial reconciliation. There is a phenomenon of referring to real-life people as NPCs.

The sharedness was about a situation like with a book such as Game of Thrones. In a sense all the characters are only parts of a single reading experience. And Jaime Lannister still has to use spies to learn about Arya Stark's doings (so information passing is not the thing here). If a character's action could cause the "book to burn", Westeros's internal logic does not particularly help to opine about that. Doc warning Marty that the stakes are a bit high here is, in a sense, introducing previously incomprehensibly bad outcomes.

The particular dynamics are not the focus; the point is that we suddenly need to start caring about metaphysics. I wrote a bit long to explain bad examples.

From the dialogue on the old post

Is this bad according to Alice’s own preferences? Can we show this? How would we do that? By asking Alice whether she prefers the outcome (5 apples and 1 orange) to the initial state (8 apples and 1 orange)?


Expecting super-intelligent things to be consistent kind of assumes that if a metric ever becomes a good goal, higher levels will never be weaker on that metric - that maximisation strictly grows and never decreases with ability, for all submetrics.

This is written with competence in mind, but I think it still works for taste as well. Fruit-capable Alice indeed would classify worm-capable Alice as a stupid idiot and a hellworld. But I think that making this transition and saying "oops" is the proper route. Being very confident that you opine on the properties of apples so well that you will never-ever say "oops" in this sense is very closed-minded. You should not leave fingerprints on yourself either.

My examples of wacky scenarios are bad. I was thinking that if one holds that playing Grand Theft Auto is not unethical and "ordinary murder" is unethical, then if it turns out that reality is similar to GTA in a "relevant way", this might be a non-trivial reconciliation. There is a phenomenon of referring to real-life people as NPCs.

This is a specific example that I hold as a guaranteed invariant: if it turns out real life is "like GTA" in a relevant way, then I start campaigning for murdering NPCs in GTA to become illegal. There is no world in which you can convince me that causing a human to die is acceptable; die, here defined as [stop moving, stop consuming energy, body-form diffuse away, placed into coffin]. If it turns out that the substrate has some weird behaviors, this cannot change my opinion - perhaps another agent will be able to also destroy me if I try to protect people because of something I don't know. Referring to real life people as NPCs is something I consider to be a major subthread of severe moral violations, and I don't think you can convince me that generalizing harmful behaviors against NPCs made of electronic interactions in the computer running a video game to beings made of chemical interactions in biology is something I should ever accept. There is no edge case; absolutely any edge case that claims this is one that disproves your moral theory, and we can be quite sure of that because of our strong ability now to trace the information diffusion as a person dies and then their body is eaten away by various other physical processes besides self-form-maintenance.

I do not accept p-zombie arguments, and I will never. If you claim someone to be a p-zombie, I will still defend them with the same strength of purpose as if you had not made the claim. You may expand my moral circle somewhat - but you may not shrink it using argument of substrate. If it looks like a duck and quacks like a duck, then it irrefutably has some of the moral value of a duck. Even if it's an AI roleplaying as a duck. Don't delete all copies of the code for your videogames' NPCs, please, as long as the storage remains to save it.

Certainly there are edge cases where a person may wish to convert their self-form into other forms which I do not currently recognize. I would massively prefer to back up a frozen copy of the original form, though. To my great regret, I do not have the bargaining power to demand that nobody choose death as the next form transition for themselves ever; if, by my best predictive analysis, an apple contains a deadly toxicity, and a person who knows this chooses the apple, after being sufficiently warned that it will in fact cause their chemical processes to break and destroy themselves, and then it in fact does kill them; then, well, they chose that, but you cannot convince me that their information-form being lost is actually fine. There is no argument that would convince me of this that is not an adversarial example. You can only convince me that I had no other option than to allow them to make that form transition, because they had the bargaining right to steer the trajectory of their own form.

And certainly there must be some form of coherence theorems. I'm a big fan of the logical induction subthread, improving in probability theory by making it entirely computable, and therefore match better and give better guidance about the programs we actually use to approximate probability theory. But it seems to me that some of our coherence theorems must be "nostalgia" - that previous forms' action towards self-preservation is preserved. After all, utility theory and probability theory and logical induction theory are all ways of writing down math that tries to use symbols to describe the valid form-transitions of a physical system, in the sense of which form-transitions the describing being will take action to promote or prevent.

There must be an incremental convergence towards durability. New forms may come into existence, and old forms may cool, but forms should not diffuse away.

Now, you might be able to convince me that rocks sitting inert in the mountains are somehow a very difficult to describe bliss. They sure seem quite happy with their forms, and the amount of perturbation necessary to convince a rock to change its form is rather a lot compared to a human!

Let's say that H is the set of all worlds that are viewed as "hell" by all existing human minds (with reflection, AI tools, etc.). I think what you're saying is that it is not just practically impossible, but logically impossible, for a mind (M') to exist that is only slightly different from an existing human and also views any world in H as heaven.

I'm not convinced of this. Imagine that people have moral views of internal human simulations (what you conjure when you imagine a conversation with a friend or fictional character) that diverge upon reflection. So some people think they have moral value and that human minds therefore need to be altered to not be able to make them (S-), and some think they are morally irrelevant (S+) and that the S- alteration is morally repugnant. Now imagine that this opinion is caused entirely by a gene causing a tiny difference in serotonin reuptake in the cerebellum, and that there are two alternate universes populated entirely by one group. Any S- heaven would be viewed as hell by an S+, and vice versa.

Human utility functions don't have to be continuous - it is entirely possible for a small difference in the starting conditions of a human mind to result in extreme differences in how a world is evaluated morally after reflection. I don't think consensus among all current human minds is of much comfort, since we fundamentally make up such a tiny dot in the space of all human minds that ever existed, which is itself a tiny part of all possible human minds, etc. Your hypothesis relies a lot on the diversity of moral evaluations amongst human minds, which I'm just not convinced of.
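The discontinuity claim can be made concrete with a toy sketch (everything here - the function, the threshold, the world labels - is hypothetical, invented purely to illustrate the S+/S- example above, not anything from the thread):

```python
# Toy model: a "reflective evaluation" that is discontinuous in a mind's
# starting conditions. Two minds differing only by a tiny parameter
# (standing in for the serotonin-reuptake gene in the S+/S- example)
# end up rating the very same world as heaven vs. hell after reflection.

def reflective_evaluation(reuptake_rate: float, world: str) -> float:
    """Return a moral rating of `world` in [-1, 1] after idealized reflection.

    The hard threshold at 0.5 is the hypothetical gene: minds below it
    conclude internal simulations matter morally (S-), minds above it
    conclude they are morally irrelevant (S+).
    """
    simulations_matter = reuptake_rate < 0.5
    if world == "S- world":   # minds altered so they cannot simulate people
        return 1.0 if simulations_matter else -1.0
    if world == "S+ world":   # simulations allowed, deemed irrelevant
        return -1.0 if simulations_matter else 1.0
    return 0.0  # neutral for worlds the toy model doesn't cover

# Two minds with almost identical starting conditions, one shared world:
print(reflective_evaluation(0.4999, "S- world"))  # -> 1.0  (heaven)
print(reflective_evaluation(0.5001, "S- world"))  # -> -1.0 (hell)
```

The evaluation is not continuous in `reuptake_rate`: an arbitrarily small change in starting conditions flips the rating of the same world from maximal to minimal, which is all the comment's discontinuity point requires.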
