(This post has been sitting in my drafts folder for 6 years. Not sure why I didn't make it public, but here it is now after some editing.)
There are two problems closely related to the Ontological Crisis in Humans. I'll call them the "Partial Utility Function Problem" and the "Decision Theory Upgrade Problem".
Partial Utility Function Problem
As I mentioned in a previous post, the only apparent utility function we have seems to be defined over an ontology very different from the fundamental ontology of the universe. But even on its native domain, the utility function seems only partially defined. In other words, it will throw an error (i.e., say "I don't know") on some possible states of the heuristical model. For example, this happens for me when the number of people gets sufficiently large, like 3^^^3 in Eliezer's Torture vs Dust Specks scenario. When we try to compute the expected utility of some action, how should we deal with these "I don't know" values that come up?
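A minimal sketch of the problem, in Python. The threshold and the utility values here are made up for illustration; the point is only that the function has a domain on which it answers and a region where it returns "I don't know", and that a naive expected-utility calculation then has to decide what to do with that:

```python
from typing import List, Optional, Tuple

# Hypothetical point beyond which intuition stops giving answers
# (stands in for quantities like 3^^^3).
INTUITIVE_LIMIT = 10**9

def partial_utility(num_people: int) -> Optional[float]:
    """Toy partially defined utility function over dust-speck outcomes.

    Returns a (dis)utility inside its sensible region, and None
    ("I don't know") outside it.
    """
    if num_people > INTUITIVE_LIMIT:
        return None  # the function "throws an error" here
    return -1.0 * num_people  # toy disutility: one speck per person

def expected_utility(
    outcomes: List[Tuple[float, int]],
) -> Optional[float]:
    """Naive expected utility over (probability, num_people) outcomes.

    Propagates "I don't know" if any outcome falls outside the utility
    function's domain -- one possible policy; others are conceivable,
    which is exactly the open question in the text.
    """
    total = 0.0
    for prob, num_people in outcomes:
        u = partial_utility(num_people)
        if u is None:
            return None
        total += prob * u
    return total
```

Under this propagation policy, a single out-of-domain outcome poisons the whole expectation, which is one way of making the "how should we deal with these values" question concrete.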
(Note that I'm presenting a simplified version of the real problem we face, where in addition to "I don't know", our utility function could also return essentially random extrapolated values outside of the region where it gives sensible outputs.)
Decision Theory Upgrade Problem
In the Decision Theory Upgrade Problem, an agent decides that their current decision theory is inadequate in some way, and needs to be upgraded. (Note that the Ontological Crisis could be considered an instance of this more general problem.) The question is whether and how to transfer their values over to the new decision theory.
For example, a human might be running a mix of several decision theories: reinforcement learning, heuristical model-based consequentialism, identity-based decision making (where you adopt one or more social roles, like "environmentalist" or "academic", as part of your identity and then make decisions by pattern matching what that role would do in any given situation), as well as virtue ethics and deontology. If you are tempted to drop one or more of these in favor of a more "advanced" or "rational" decision theory, such as UDT, you have to figure out how to transfer the values embodied in the old decision theory, which may not even be represented as any kind of utility function, over to the new one.
Another instance of this problem can be seen in someone just wanting to be a bit more consequentialist. Maybe UDT is too strange and impractical, but our native model-based consequentialism at least seems closer to being rational than the other decision procedures we have. In this case we tend to assume that the consequentialist module already has our real values and we don't need to "port" values from the other decision procedures that we're deprecating. But I'm not entirely sure this is safe, since the step going from (for example) identity-based decision making to heuristical model-based consequentialism doesn't seem that different from the step between heuristical model-based consequentialism and something like UDT.
In addition to the Ontological Crisis in Humans post that Wei linked, this (underappreciated?) post by Eliezer from 2016 might be helpful background material: Rescuing the utility function.
(It was probably my favorite piece of his writing from that year.)
Humans are not immediately prepared to solve many decision problems, and one of the hardest of these is the formulation of preference for a consequentialist agent. In expanding the scope of well-defined/reasonable decisions, formulating our goals well enough for use in a formal decision theory is perhaps the last milestone, far outside of what can be reached with a lot of work!
Indirect normativity (after distillation) can make the timeline for reaching this milestone mostly irrelevant, as long as there is sufficient capability to compute the outcome, and amplification is about capability. It's unclear how the scope of reasonable decisions is related to capability within that scope; amplification seems ambiguous between the two, and perhaps the scope of reasonable decisions is just another kind of thing that can be improved. And it's corrigibility's role to keep the AI within the scope of well-defined decisions.
But with these principles in place, it's unclear if formulating goals for consequentialist agents remains a thing, when instead it's possible to just continue to expand the scope of reasonable decisions and to distill/amplify them.
Did you mean to say "without a lot of work"?
(Or did you really mean to say that we can't reach it, even with a lot of work?)
The latter, where "a lot of work" is the kind of thing humanity can manage in subjective centuries. In an indirect normativity design, doing much more work than that should still be feasible, since it's only specified abstractly, to be predicted by an AI, enabling distillation. So we can still reach it, if there is an AI to compute the result. But if there is already such an AI, perhaps the work is pointless, because the AI can carry out the work's purpose in a different way.
I agree strongly that, as a problem for humans, assuming that the consequentialist model has all our real values is not a safe assumption.
I would go further, and say that this assumption is almost always going to be importantly wrong and result in loss of important values. Nor do I think this is a hypothetical failure mode, at all; I believe it is common in our circles.
Consequentialism over world-histories feels pretty safe to me. Consequentialism over world states seems pretty unsafe to me. Do you feel that even consequentialism over world histories is unsafe? What would be a potential value that couldn't be captured by that model?
Name one example? :)
Well, I think it's not very hard (even in our circles) to find people doing consequentialism badly, looking only at short-term / easily observable consequences (I think this is especially common among newer EA folk, and some wannabe-slytherin-types). It seemed likely Zvi meant a stronger version of the claim though, which I'm not sure how I'd operationalize.
I have lost the link, but I read a post from someone in the community about how grieving takes place over time because you have to grieve separately for each place or scenario that is important to your memory of the person.
It seems like the same mechanism would be required here, just for reasoning rather than grieving.
Valentine's The Art of Grieving Well, perhaps?
That’s the one! Greatly appreciated.
Here is a possible mechanism for the Decision Theory Upgrade Problem.
The agent first considers several scenarios, each weighted by its importance. Then for each scenario the agent compares the output of its current decision theory with the output of the candidate decision theory, and computes a loss for that scenario. The loss is higher when the current decision theory considers the candidate's preferred action to be certainly or very wrong. The agent upgrades its decision theory when this weighted average loss is small enough to be compensated by a gain in representational simplicity.
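The mechanism above can be sketched in a few lines of Python. All names here are my own, and both decision theories are represented as plain functions from a scenario to an action; `disagreement_loss` stands in for the current theory's judgment of how wrong the candidate's action is:

```python
from typing import Callable, List, Tuple

Scenario = str
Action = str

def upgrade_loss(
    scenarios: List[Tuple[float, Scenario]],
    current: Callable[[Scenario], Action],
    candidate: Callable[[Scenario], Action],
    disagreement_loss: Callable[[Scenario, Action, Action], float],
) -> float:
    """Importance-weighted average loss from switching decision theories.

    For each (weight, scenario) pair, compare the candidate theory's
    action against the current theory's action, scored by the current
    theory's own disagreement_loss (0 when they agree).
    """
    total_weight = sum(w for w, _ in scenarios)
    total_loss = sum(
        w * disagreement_loss(s, candidate(s), current(s))
        for w, s in scenarios
    )
    return total_loss / total_weight

def should_upgrade(
    scenarios: List[Tuple[float, Scenario]],
    current: Callable[[Scenario], Action],
    candidate: Callable[[Scenario], Action],
    disagreement_loss: Callable[[Scenario, Action, Action], float],
    simplicity_gain: float,
) -> bool:
    # Upgrade when the weighted average loss is small enough to be
    # compensated by the gain in representational simplicity.
    loss = upgrade_loss(scenarios, current, candidate, disagreement_loss)
    return loss <= simplicity_gain
```

For example, with two scenarios weighted 1 and 3 where the theories disagree only on the heavier one, the weighted loss is 0.75, so the upgrade happens only if the simplicity gain exceeds that.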
This points to a possible solution to the ontological crisis as well. The agent will basically look for a simple decision theory under the new ontology that approximates its actions under the old one.
In the Decision Theory Upgrade Problem, presumably the agent decides that their current decision theory is inadequate using their current decision theory. Why wouldn't it then also show the way on what to replace it with?