(ETA: The name "policy approval" wasn't great. I think I will use the term "policy alignment" to contrast with "value alignment" going forward, at the suggestion of Wei Dai in the comments.)
I recently had a conversation with Stuart Armstrong in which I claimed that an agent which learns your utility function (pretending for a moment that "your utility function" really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you. He challenged me to write up specific examples to back up my claims.
I'll also give a very sketchy alternative to value learning, which I call policy alignment. (The policy alignment idea emerged out of a conversation with Andrew Critch.)
Stuart Armstrong has recently been doing work showing the difficulty of inferring human values. To summarize: because humans are irrational, a value-learning approach like CIRL needs to jointly estimate the human utility function and the degree to which the human is rational -- otherwise, it would take all the mistakes humans make to be preferences. Unfortunately, this leads to a severe problem of identifiability: humans can be assigned any values whatsoever if we assume the right kind of irrationality, and the usual trick of preferring simpler hypotheses doesn't seem to help in this case.
I also want to point out that a similar problem arises even without irrationality. Vladimir Nesov explored how probability and utility can be mixed into each other without changing any decisions an agent makes. So, in principle, we can't determine the utility or probability function of an agent uniquely based on the agent's behavior alone (even including hypothetical behavior in counterfactual situations). This fact was discovered earlier by Jeffrey and Bolker, and is analyzed in more detail in the book The Logic of Decision. For this reason, I call the transform "Jeffrey-Bolker rotation".
To give an illustrative example: it doesn't matter whether we assign very low probability to an event, or care very little about what happens given that event. Suppose a love-maximizing agent is unable to assign nonzero utility to a universe where love isn't real. The agent may appear to ignore evidence that love isn't real. We can interpret this as not caring what happens conditioned on love not being real; or, equally valid (in terms of the actions which the agent chooses), we can interpret the agent as having an extremely low prior probability on love not being real.
At MIRI, we sometimes use the term "probutility" to indicate the (probability, utility) pair in a way which reminds us that they can't be disentangled from one another. Jeffrey-Bolker rotation changes probabilities and utilities, but does not change the overall probutilities.
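To make the rotation concrete, here is a minimal sketch on a toy model. It assumes one common way of writing the transform, P'(w) = P(w)(c U(w) + d) and U'(w) = (a U(w) + b)/(c U(w) + d) with ad - bc > 0; the worlds, numbers, and function names are all hypothetical and chosen only for illustration.

```python
from fractions import Fraction as F

# Hypothetical toy model: each world is (action, outcome) and maps to
# (joint probability, utility). All numbers are illustrative assumptions.
worlds = {
    ("a1", "good"): (F(3, 10), F(4)),
    ("a1", "bad"):  (F(2, 10), F(1)),
    ("a2", "good"): (F(1, 10), F(5)),
    ("a2", "bad"):  (F(4, 10), F(2)),
}

def rotate(worlds, a, b, c, d):
    """Apply P'(w) = P(w)(c U(w) + d) and U'(w) = (a U(w) + b)/(c U(w) + d)."""
    assert a * d - b * c > 0, "the transform must preserve orientation"
    rotated = {}
    for w, (p, u) in worlds.items():
        scale = c * u + d
        assert scale > 0, "rotated probabilities must stay positive"
        rotated[w] = (p * scale, (a * u + b) / scale)
    # Renormalize explicitly so the rotated probabilities sum to one.
    total = sum(p for p, _ in rotated.values())
    return {w: (p / total, u) for w, (p, u) in rotated.items()}

def conditional_eu(worlds, action):
    """Expected utility conditional on the given action being taken."""
    rows = [(p, u) for (act, _), (p, u) in worlds.items() if act == action]
    mass = sum(p for p, _ in rows)
    return sum(p * u for p, u in rows) / mass

def ranking(worlds):
    """Actions sorted from most to least preferred."""
    actions = sorted({act for act, _ in worlds})
    return sorted(actions, key=lambda act: conditional_eu(worlds, act), reverse=True)

rotated = rotate(worlds, a=F(1), b=F(0), c=F(1, 10), d=F(1))
print(ranking(worlds), ranking(rotated))
```

In this toy case the probabilities and utilities both change under the rotation, but the ranking of actions by conditional expected utility does not, which is the sense in which only the probutilities matter for decisions.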
Given these problems, it would be nice if we did not actually need to learn the human utility function. I'll advocate that position.
My understanding is that Stuart Armstrong is optimistic that human values can be inferred despite these problems, because we have a lot of useful prior information we can take advantage of.
It is intuitive that a CIRL-like agent should learn what is irrational and then "throw it out", IE, de-noise human preferences by looking only at what we really prefer, not at what we mistakenly do out of short-sightedness or other mistakes. On the other hand, it is not so obvious that the probability/utility distinction should be handled in the same way. Should an agent disentangle beliefs from preferences just so that it can throw out human beliefs and optimize the preferences alone? I argue against this here.
Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.
Suppose a robot is trying to help a perfectly rational human. The human has probability function $P_H$ and utility function $U_H$. The robot is in epistemic state $e$. The robot has a set of actions $a_1, a_2, \ldots, a_n$. The proposition "the robot takes the $i$th action when in epistemic state $e$" is written as $A_i^e$. The set of full world-states is $S$. What the human would like the robot to do is given by:
$$\arg\max_i \sum_{s \in S} P_H(s \mid A_i^e) \, U_H(s)$$
(Or by the analogous causal counterfactual, if the human thinks that way.)
This notion of what the human wants is invariant to Jeffrey-Bolker rotation; the robot doesn't need to disentangle probability and utility! It only needs to learn probutilities.
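As a rough sketch of what that maximization computes, the snippet below picks the action with the highest expected utility under the human's own conditional beliefs, that is, the action maximizing the sum over states s of P_H(s | action) U_H(s). The state names, action names, and numbers are hypothetical, and the human's beliefs are assumed to be given as a lookup table, which a real agent would not have.

```python
def policy_alignment_action(actions, states, p_h, u_h):
    """Return the action maximizing sum_s P_H(s | action) * U_H(s)."""
    def human_eu(action):
        return sum(p_h[action][s] * u_h[s] for s in states)
    return max(actions, key=human_eu)

# Hypothetical toy problem (all values are illustrative assumptions).
states = ["tidy", "messy"]
actions = ["clean", "wait"]
p_h = {
    "clean": {"tidy": 0.9, "messy": 0.1},  # human's beliefs given "clean"
    "wait":  {"tidy": 0.2, "messy": 0.8},  # human's beliefs given "wait"
}
u_h = {"tidy": 1.0, "messy": 0.0}

choice = policy_alignment_action(actions, states, p_h, u_h)
print(choice)
```

Note that the function never needs $P_H$ and $U_H$ separately in any deeper sense: only their product enters the sum.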
The equation written above can't be directly optimized, since the robot doesn't have direct access to human probutilities. However, I'll broadly call any attempt to approximate that equation "policy alignment".
Notice that this is closely analogous to UDT. UDT solves dynamic inconsistencies -- situations in which an AI could predictably dislike the decisions of its future self -- by optimizing its actions from the perspective of a fixed prior, IE, its initial self. Policy alignment resolves inconsistencies between the AI and the human by optimizing the AI's actions from the human's perspective. The main point of this post is that we can use this analogy to produce counterexamples to the typical value-learning approach, in which the AI tries to optimize human utility but not according to human beliefs.
I will somewhat ignore the distinction between UDT1.0 and UDT1.1.
These examples serve to illustrate that "optimizing human utility according to AI beliefs" is not exactly the same as "do what the human would want you to do", even when we suppose "the human utility function" is perfectly well-defined and can be learned exactly by the AI.
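In these examples, I will suppose that the AI has its own probability distribution $P_R$. It reasons updatelessly with respect to evidence $e$ it sees, but with full prior knowledge of the human utility function $U_H$:
$$\arg\max_i \sum_{s \in S} P_R(s \mid A_i^e) \, U_H(s)$$
Here $A_i^e$ is, as before, the proposition that the AI takes the $i$th action when in epistemic state $e$.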
I use an updateless agent to avoid accusations that of course an updateful agent would fail classic UDT problems. However, it is not really very important for the examples.
I assume prior knowledge of $U_H$ to avoid any tricky issues which might arise by attempting to combine updatelessness with value learning.
It seems reasonable to suppose that the AI will start out with some mathematical knowledge. Imagine that the AI has a database of theorems in memory when it boots up, including the first million digits of pi. Treat these as part of the agent's prior.
Suppose, on the other hand, that the human which the AI wants to help does not know more than a hundred digits of pi.
The human and the AI will disagree on what to do about counterfactual mugging with a logical coin involving digits of pi which the AI knows and the human does not. If Omega approaches the AI, the AI will refuse to participate, but the human will wish the AI would. If Omega approaches the human, the AI may try to prevent the human from participating, to the extent that it can do so without violating other aspects of the human utility function.
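The disagreement can be sketched numerically. The snippet assumes the usual shape of counterfactual mugging (on one outcome of the coin Omega asks for a small payment, on the other it pays a large reward only if the agent would have paid); the dollar amounts and the names `policy_value`, `human_prior`, and `ai_prior` are hypothetical.

```python
def policy_value(prior_heads, pays_when_asked, reward=10_000, cost=100):
    """Expected value of a policy under a prior over the logical coin.

    Heads: Omega pays `reward` if and only if the agent would pay on tails.
    Tails: Omega asks the agent for `cost`.
    """
    heads_payoff = reward if pays_when_asked else 0
    tails_payoff = -cost if pays_when_asked else 0
    return prior_heads * heads_payoff + (1 - prior_heads) * tails_payoff

human_prior = 0.5  # the human does not know the relevant digit of pi
ai_prior = 0.0     # assumption: the AI's database already settles it as "tails"

human_prefers_paying = policy_value(human_prior, True) > policy_value(human_prior, False)
ai_prefers_paying = policy_value(ai_prior, True) > policy_value(ai_prior, False)
print(human_prefers_paying, ai_prefers_paying)
```

Evaluated from the human's prior, the paying policy looks good; evaluated from a prior that already contains the digit, it looks like a pure loss.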
Maybe the problem with the counterfactual mugging example is that it doesn't make sense to program the AI with a bunch of knowledge in its prior which the human doesn't have.
We can go to the opposite extreme, and make a broad prior such as the Solomonoff distribution, with no information about our world in particular.
I believe the observation has been made before that running UDT on such a prior could have weird results. There could be a world with higher prior probability than ours, inhabited by Omegas who ask the AI to optimize alien values in most universes (including Earth) in exchange for the Omegas maximizing $U_H$ in their own world. (This particular scenario doesn't seem particularly probable, but it does seem quite plausible that some weird universes will have higher probability than our universe in the Solomonoff prior, and may make some such bargain.)
Again, this is something which can happen in the maximization using the AI's prior $P_R$ but not in the one using the human's prior $P_H$ -- unless humans themselves would approve of the multiversal bargain.
"Just Having a Very Different Prior"
Maybe the AI's prior $P_R$ is neither strictly more knowledgeable than the human's prior $P_H$ nor less, but the two are very different on some specific issues. Perhaps there's a specific plan which, when $P_R$ is conditioned on evidence so far, looks very likely to have many good consequences. $P_H$ considers the plan very likely to have many bad consequences. Also suppose that there aren't any interesting consequences of this plan in counterfactual branches, so UDT considerations don't come in.
Also, suppose that there isn't time to test the differing hypotheses involved which make humans think this is such a bad plan while AIs think it is so good. The AI has to decide right now whether to enact the plan.
The value-learning agent will implement this plan, since it seems good on net for human values. The policy-alignment agent will not, since humans wouldn't want it to.
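A small sketch of the two decisions side by side: both agents use the same human utility function, and differ only in whose beliefs are used to evaluate the plan. The belief tables `p_r` (AI) and `p_h` (human) and all numbers are hypothetical.

```python
def expected_utility(beliefs, utility):
    """Sum of probability times utility over outcomes."""
    return sum(beliefs[s] * utility[s] for s in utility)

def best_action(conditional_beliefs, utility):
    """Action with the highest expected utility under the given beliefs."""
    return max(
        conditional_beliefs,
        key=lambda action: expected_utility(conditional_beliefs[action], utility),
    )

# Shared human utility function (illustrative numbers).
u_h = {"good": 1.0, "bad": -1.0}

# Assumed beliefs about outcomes conditional on each action.
p_r = {"plan": {"good": 0.9, "bad": 0.1}, "no_plan": {"good": 0.5, "bad": 0.5}}
p_h = {"plan": {"good": 0.1, "bad": 0.9}, "no_plan": {"good": 0.5, "bad": 0.5}}

value_learner_choice = best_action(p_r, u_h)     # AI beliefs, human utility
policy_alignment_choice = best_action(p_h, u_h)  # human beliefs, human utility
print(value_learner_choice, policy_alignment_choice)
```

With these numbers the value-learner enacts the plan and the policy-alignment agent declines, even though neither is mistaken about $U_H$.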
Obviously, one might question whether it is reasonable to assume that things got to a point where there was such a large difference of opinion between the AI and the humans, and no time to resolve it. Arguably, there should be safeguards against this scenario which the value-learning AI itself would want to set up, due to facts about human values such as "the humans want to be involved in big decisions about their future" or the like.
Nonetheless, faced with this situation, it seems like policy-alignment agents do the right thing while value-learning agents do not.
Aren't human beliefs bad?
Isn't it problematic to optimize via human beliefs, since human beliefs are low-quality?
I think this is somewhat true and somewhat not.
- Partly, this is like saying "isn't UDT bad because it doesn't learn?" -- actually, UDT acts as if it updates most of the time, so it is wrong to think of it as incapable of learning. Similarly, although the policy-alignment agent uses $P_H$, it will mostly act as if it has updated on a lot of information. So, maybe you believe human beliefs aren't very good -- but do you think we're capable of learning almost anything eventually? If so, this may address a large component of the concern. In particular, if you trust the output of certain machine learning algorithms more than you trust yourself, the AI can run those algorithms and use their output.
- On the other hand, humans probably have incoherent $P_H$, and not just because of logical uncertainty. So, the AI still needs to figure out what is "irrational" and what is "real" in $P_H$, just like value-learning needs to do for $U_H$.
If humans would want an AI to optimize via human beliefs, won't that be reflected in the human utility function?
Or: If policy-alignment were good, wouldn't a value-learner self modify into policy-alignment anyway?
I don't think this is true, but I'm not sure. Certainly there could be simple agents who value-learners cooperate with without ever deciding to self-modify into policy-alignment agents. Perhaps there is something about human preference which desires the AI to cooperate with the human even when the AI thinks this is (otherwise) net-negative for human values.
Aren't I ignoring the fact that the AI needs its own beliefs?
In "Just Having a Very Different Prior", I claimed that if the AI's prior $P_R$ and the human's prior $P_H$ disagree about the consequences of a plan, value-learning can do something humans strongly don't want it to do, whereas policy-alignment cannot. However, my definition of policy-alignment ignores learning. Realistically, the policy-alignment agent needs to also have beliefs $P_R$, which it uses to approximate the human approval of its actions. Can't the same large disagreement emerge from this?
I think the concern is qualitatively less, because the policy-alignment agent uses $P_R$ only to estimate $P_H$ and $U_H$. If the AI knows that humans would have a large disagreement with the plan, the policy-alignment agent would not implement the plan, while the value-learning agent would.
For policy-alignment to go wrong, it needs to have a bad estimate of $P_H$ and $U_H$.
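One way to picture this, as a sketch only: suppose the AI's own beliefs take the form of a weighted set of candidate models of the human, each a (beliefs, utility) pair, and the agent maximizes the estimated human expected utility averaged over those models. This mixture representation, the weights, and the names below are assumptions made for illustration, not something specified in the post.

```python
def estimated_human_eu(action, hypotheses):
    """Average human expected utility of `action` over candidate human models.

    Each hypothesis is (weight, p_h, u_h): the AI's credence in that model,
    the model's beliefs conditional on each action, and the model's utilities.
    """
    return sum(
        weight * sum(p_h[action][s] * u_h[s] for s in u_h)
        for weight, p_h, u_h in hypotheses
    )

# Hypothetical candidate models of the human (illustrative numbers).
utility = {"good": 1.0, "bad": -1.0}
hypotheses = [
    (0.7, {"plan": {"good": 0.2, "bad": 0.8},
           "no_plan": {"good": 0.5, "bad": 0.5}}, utility),
    (0.3, {"plan": {"good": 0.6, "bad": 0.4},
           "no_plan": {"good": 0.5, "bad": 0.5}}, utility),
]

actions = ["plan", "no_plan"]
choice = max(actions, key=lambda a: estimated_human_eu(a, hypotheses))
print(choice)
```

In this sketch the AI's opinion about whether the plan is actually good never enters; its beliefs matter only through the weights it places on models of the human.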
The policy is too big.
Even if the process of learning $P_H$ is doing the work to turn it into a coherent probability distribution (removing irrationality and making things well-defined), it still may not be able to conceive of important possibilities. The evidence which the AI uses to decide how to act, in the equations given earlier, may be a large data stream with some human-incomprehensible parts.
As a result, it seems like the AI needs to optimize over compact/abstract representations of its policy, similarly to how policy selection in logical inductors works.
This isn't an entirely satisfactory answer, since (1) the representation of a policy as a computer program could still escape human understanding, and (2) it is unclear what it means to correctly represent the policy in a human-understandable way.
[Aside from issues with the approach, my term "policy approval" may be terrible. It sounds too much like "approval-directed agent", which means something different. I think there are similarities, but they aren't strong enough to justify referring to both as "approval". Any suggestions?]
[Now using "Policy Alignment" for this. Editing post accordingly.]
(These are very speculative.)
One of the major obstacles to progress in decision theory right now is that we don't know of a good updateless perspective for logical uncertainty. Maybe a policy-alignment agent doesn't need to solve this problem, since it tries to optimize from the human perspective rather than its own. Roughly: logical updatelessness is hard because it tends to fall into the "too updateless" issue above. So, maybe it can be a non-issue in the right formulation of policy alignment.
Stuart Armstrong is somewhat pessimistic about corrigibility. Perhaps there is something which can be done in policy-alignment land which can't be done otherwise. The "Just Having a Very Different Prior" example points in this direction; it is an example where policy-alignment acts in a much more corrigible way.
A value-learning agent can always resist humans if it is highly confident that its plan is a good one which humans are opposing irrationally. A policy-alignment agent can think its plan is a good one but also think that humans would prefer it to be corrigible on principle regardless of that.
On the other hand, a policy-alignment agent isn't guaranteed to think that. Perhaps policy-alignment learning can be specified with some kind of highly corrigible bias, so that it requires a lot of evidence to decide that humans don't want it to behave corrigibly in a particular case?
I've left out some speculation about what policy-alignment agents should actually look like, for the sake of keeping mostly to the point (the discussion with Stuart). I like this idea because it involves a change in perspective of what an agent should be, similar to the change which UDT itself made.
I'm confused. If the AI knows a million digits of pi, and it can prevent Omega from counterfactually mugging me where it knows I will lose money... shouldn't it try to prevent that from happening? That seems like the right behavior to me. Similarly, if I knew that the AI knows a million digits of pi, then if it gets counterfactually mugged, it shouldn't give up the money.
(Perhaps the argument is that as long as Omega was uncertain about the digit when deciding what game to propose, then you should pay up as necessary, regardless of what you know. But if that's the argument, then why can't the AI go through the same reasoning?)
If the AI knows the winning numbers for the lottery, then it should buy that ticket for me, even though (if I don't know that the AI knows the winning numbers) I would disprefer that action. Even better would be if it explained to me what it was doing, after which I would prefer the action, but let's say that wasn't possible for some reason (maybe it performed a very complex simulation of the world to figure out the winning number).
It seems like if the AI knows my utility function and is optimizing it, that does perform well. Now for practical reasons, we probably want to instead build an AI that does what we prefer it to do, but this seems to be because it would be hard to learn the right utility function, and errors along the way could lead to catastrophe, not because it would be bad for the AI to optimize the right utility function.
ETA: My strawman-ML-version of your argument is that you would prefer imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things). This seems wrong to me.
If you don't think one should pay up in counterfactual mugging in general, then my argument won't land. Rather than arguing that you want to be counterfactually mugged, I'll try arguing from a different decision problem.
Suppose that Omega is running a fairly simple and quick algorithm which is nonetheless able to predict an AI with more processing power, due to using a stronger logic or similar tricks. Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10.
We suppose that, since there is a short proof of exactly what Omega does, it is already present in the mathematical database included in the AI's prior.
If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is -- taking less money just has a lower expected utility. So, it will get only $10 from Omega.
If the AI is a policy-approval agent, it will think about what would have a higher expectation in the human's expectation: taking half, or taking it all. It's quite possible in this case that it takes all the money.
That is part of the argument for paying up in counterfactual mugging, yes. But both we and Omega need to be uncertain about the digit, since if our prior can already predict that Omega is going to ask us for $10 rather than give us any money, there's no reason for us to pay up. So, it depends on the prior, and can turn out differently if our vs the agent's prior is used.
If I think that the AI tends to be miscalibrated about lottery-ticket beliefs, there is no reason for me to want it to buy the ticket. If I think it is calibrated about lottery-ticket beliefs, I'll like the policy of buying lottery tickets in such cases, so the AI will buy.
You could argue that an AI which is trying to be helpful will buy lottery tickets in such cases no matter how deluded the humans think it is. But, not only is this not very corrigible behavior, but also it doesn't make any sense from our perspective to make an AI reason in that way: we don't want the AI to act in ways which we have good reason to believe are unreliable.
The analogy isn't perfect, since the AI can still do things to maximize human approval which the human would never have thought of, as well as things which the human could think of but didn't have the computational resources to do. It does seem like a fairly good analogy, though.
Okay, I think I misunderstood what you were claiming in this post. Based on the following line:
I thought you were arguing, "Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you." (Yes, having written it down I can see that is not what you actually said, but that's the interpretation I originally ended up with.)
I would now rephrase your claim as "Even assuming we know the true utility function, optimizing it is hard."
Yeah, an AI that optimizes the true utility function probably won't be corrigible. From a theoretical standpoint, that seems fine -- corrigibility seems like an easier target to shoot for, not a necessary aspect of an aligned AI. The reason we don't want the scenario above is "we have good reason to believe [the AI is] unreliable", which sounds like the AI is failing to optimize the utility function correctly.
This also sounds like the value-learning agent is simply bad at correctly optimizing the true utility function. (It seems to me that all of decision theory is about how to properly optimize a utility function in theory.)
Again, seems like this proposal for making an aligned AI is just bad at optimizing the true utility function.
So I guess the way I would summarize this post:
Is this right?
I would correct it to "Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this in expectation according to some prior is still not aligned with you."
This part is tricky for me to interpret.
On the one hand, yes: specifically, even if you have all the processing power you need, you still need to optimize via a particular prior (AIXI optimizes via Solomonoff induction) since you can't directly see what the consequences of your actions will be. So, I'm specifically pointing at an aspect of "optimizing it is hard" which is about having a good prior. You could say that "utility" is the true target, and "expected utility" is the proxy which you have to use in decision theory.
On the other hand, this might be a misleading way of framing the problem. It suggests that something with a perfect prior (magically exactly equal to the universe we're actually in) would be perfectly aligned: "If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned." This isn't necessarily objectionable, but it is not the notion of alignment in the post.
So, to summarize,
1. I agree that "even assuming we know the true utility function, optimizing it is hard" -- but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
2. Even under the idealized assumption that humans are perfectly coherent decision-theoretic agents, I'm not sure it makes sense to say there's a "true human utility function" -- the VNM theorem only gets a $U_H$ which is unique up to such-and-such by assuming a fixed notion of probability. The Jeffrey-Bolker representation theorem, which justifies rational agents having probability and utility functions in one theorem rather than justifying the two independently, shows that we can do this "rotation" which shifts which part of the preferences are represented in the probability vs in the utility, without changing the underlying preferences.
3. If we think of the objective as "building AI such that there is a good argument for humans trusting that the AI has human interest in mind" rather than "building AI which optimizes human utility", then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don't actually have to figure out which part of preferences are "probability" vs "utility".
Sure. I was claiming that it is also a reasonable notion of alignment. My reason for not using that notion of alignment is that it doesn't seem practically realizable.
However, if we could magically give the AI the "true universe" prior with the "true utility function", I would be happy and say we were done, even if it wasn't justifiable and couldn't explain it to humans. I agree it would not be aligned in the sense of the post.
This seems to argue that if my AI knew the winning lottery numbers, but didn't have a chance to tell me how it knows this, then it shouldn't buy the winning lottery ticket. I agree the Jeffrey-Bolker rotation seems to indicate that we should think of probutilities instead of probabilities and utilities separately, but it seems like there really are some very clear actual differences in the real world, and we should account for it somehow. Perhaps one difference is that probabilities change in response to new information, whereas (idealized) utility functions don't. (Obviously humans don't have idealized utility functions, but this is all a theoretical exercise anyway.)
Thanks for clarifying, that's clearer to me now.
I generally agree with the objective you propose (for practical reasons). The obvious way to do this is to do imitation learning, where (to a first approximation) you just copy the human's policy. (Or alternatively, have the policy that a human would approve of you having.) This won't let you exceed human intelligence, which seems like a pretty big problem. Do you expect an AI using policy alignment to do better than humans at tasks? If so, how is it doing better? My normal answer to this in the EV framework is "it has better estimates of probabilities of future states", but we can't do that any more. Perhaps you're hoping that the AI can explain its plan to a human, and the human will then approve of it even though they wouldn't have before the explanation. In that case, the human's probutilities have changed, which means that policy alignment is now "alignment to a thing that I can manipulate", which seems bad.
Fwiw, I am generally in favor of approaches along the lines of policy alignment; I'm more confused about the theory behind it here.
I'm not even sure whether you are closer or further from understanding what I meant, now. I think you are probably closer, but stating it in a way I wouldn't. I see that I need to do some careful disambiguation of background assumptions and language.
This part, at least, is getting at the same intuition I'm coming from. However, I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn't end up talking much about directly learning the policies. (I am thinking I'll write another post to make that connection clearer.)
I will have to think harder about the difference between how you're framing things and how I would frame things, to try to clarify more.
My assumption was that you were arguing for why learning policies directly (assuming we could do it) has advantages over the default approach of value learning + optimization. That framing seems to explain most of the post.
Did you mean to say that it's quite possible that it takes half the money?
Separately, I still don't understand the counterfactual mugging case. (Disclaimer, I haven't gone through any math around counterfactual mugging.) It seems really strange that if the human was certain about the digit, they wouldn't pay up, but if the human is uncertain about the digit but is certain that the AI knows the digit, then the human would not want the AI to intervene. But possibly it's not worth getting into this detail.
I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior. Presumably, "what Omega does" depends on your own policy, so if you have a proof about what Omega does, that proof also determines your action, and there is nothing left for the agent to consider.
To be clear, I think it's reasonable to consider AIs that try to figure out proofs of "what Omega does", but if that's taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does. And if it's not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.
You may not recognize it as such, especially if Omega is using a different axiom system than you. So, you can still be ignorant of what you'll do while knowing what Omega's prediction of you is. This makes it impossible for your probability distribution to treat the two as correlated anymore.
Yeah, that's the problem here.
Only if the agent takes that one proof out of the prior, but still has enough structure in the prior to see how the decision problem plays out. This is the problem of constructing a thin prior. You can (more or less) solve any decision problem by making the agent sufficiently updateless, but you run up against the problem of making it too updateless, at which point it behaves in absurd ways (lacking enough structure to even understand the consequences of policies correctly).
Hence the intuition that the correct prior to be updateless with respect to is the human one (which is, essentially, the main point of the post).
A useful thing would be an example of when a policy approval agent would do something that a human wouldn't, and what gains in efficiency the policy approval agent has over a normal human acting.
I feel that the formulation "the humans have a utility function" may obscure part of what's going on. Part of the advantages of approval agents is that they allow humans to express their sometimes incoherent meta-preferences as well ("yeah, I want to do X, but don't force me to do it"). Assuming the human preferences are already coherent reduces the attraction of the approach.
Ah, I agree that this proposal may have better ways to relax the assumption that the human has a utility function than value-learning does. I wanted to focus on the simpler case here. Perhaps I'll write a follow-up post considering the generalization.
Maybe I'll try to insert an example where the policy approval agent does something the human wouldn't into this post, though.
Here's a first stab: suppose that the AI has a subroutine which solves complex planning problems. Furthermore, the human trusts the subroutine (does not expect it to be cleverly choosing plans which solve the problems as stated but cause other problems). The human is smart enough to formulate day-to-day management problems which arise at work as formally-specified planning problems, and would like to be told what the answers to those problems are. In this case, the AI will tell the human those answers.
This also illustrates a limited way the policy-approval agent can avoid over-optimizing simplified problem statements: if the human does not trust the planning subroutine (expects it to goodhart or such), then the AI will not use such a subroutine.
(This isn't maximally satisfactory, since the human may easily be mistaken about what subroutines to trust. I think the AI can do a little better than this, but maybe not in a way which addresses the fundamental issue.)
Iterated distillation and amplification seems like an example of a thing that is like policy approval, and it could do lots of things that a human is unable to, such as becoming really good at chess or Go. (You can imagine removing the distillation steps if those seem too different from policy approval, and the point still applies.)
I think there are interesting connections between HCH/IDA and policy approval, which I hope to write more about some time.
What about calling it "policy alignment" in analogy with "value alignment"?
Since I'm very confused about what my $P_H$ should be (I may be happy to change it in any number of ways if someone gave me the correct solutions to a bunch of philosophical problems), there may not be anything "real" in my $P_H$ that I'd want an AI to learn and use in an uncritical way. It seems like this mostly comes down to what probabilities really are: if probabilities are something objective like "how real" or "how much existence" each possible world is/has, then I'd want an AI to use its greater intellect to figure out what is the correct prior and use that, but if probabilities are something subjective like how much I care about each possible world, then maybe I'd want the AI to learn and use my $P_H$. I'm kind of confused that you give a bunch of what seem to me to be less important considerations on whether the AI should use my probability function or its own to make decisions, and don't mention this one.
"Policy alignment" seems like an improvement, especially since "policy approval" invokes government policy.
With respect to the rest:
On the one hand, I'm tempted to say that to the extent you recognize how confused you are about what probabilities are, and that this confusion has to do with how you reason in the real world, your $P_H$ is going to change a lot when updated on certain philosophical arguments. As a result, optimizing a strategy updatelessly via $P_H$ is going to take that into account, shifting behavior significantly in contingencies in which various philosophical arguments emerge, and potentially putting a significant amount of processing power toward searching for such arguments.
On the other hand, I buy my "policy alignment" proposal only to the extent that I buy UDT, which is not entirely. I don't know how to think about UDT together with the shifting probabilities which come from logical induction. The problem is similar to the one you outline: just as it is unclear that a human should think its own $P_H$ has any useful content which should be locked in forever in an updateless reasoner, it is similarly unclear that a fixed logical inductor state (after running for a finite amount of time) has any useful content which one would want to lock in forever.
I don't yet know how to think about this problem. I suspect there's something non-obvious to be said about the extent to which $P_H$ trusts other belief distributions (IE, something at least a bit more compelling than the answer I gave first, but not entirely different in form).
I was really surprised that the "background problem" is almost the same as a problem that arises for value learning in some formulations of bounded rationality. In the information-theoretic bounded rationality formalism, the bounded agent acts based on a combination of a prior (representing previous knowledge) and utilities (what the agent wants). (It seems that in some cases of updating humans, it is possible to disentangle the two.)
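To make the entanglement concrete, here is a minimal sketch, assuming the common form of the information-theoretic model in which the policy is proportional to `prior(a) * exp(beta * utility(a))`. The action names, the numbers, and the `tilt` reweighting are all hypothetical, chosen only for illustration. The sketch shows two different (prior, utility) pairs that produce exactly the same behavior, so behavior alone cannot tell them apart.

```python
import math


def bounded_rational_policy(prior, utility, beta):
    """Return p(a) proportional to prior[a] * exp(beta * utility[a])."""
    weights = {a: prior[a] * math.exp(beta * utility[a]) for a in prior}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical agent A: uniform prior, prefers "right".
beta = 2.0
prior_a = {"left": 0.5, "right": 0.5}
utility_a = {"left": 0.0, "right": 1.0}

# Hypothetical agent B: move some of the preference into the prior.
# Multiply the prior by an arbitrary positive tilt and subtract
# log(tilt) / beta from the utility; the two changes cancel in the policy.
tilt = {"left": 1.0, "right": 3.0}
norm = sum(prior_a[a] * tilt[a] for a in prior_a)
prior_b = {a: prior_a[a] * tilt[a] / norm for a in prior_a}
utility_b = {a: utility_a[a] - math.log(tilt[a]) / beta for a in prior_a}

policy_a = bounded_rational_policy(prior_a, utility_a, beta)
policy_b = bounded_rational_policy(prior_b, utility_b, beta)

print(policy_a)
print(policy_b)
```

The two printed policies agree up to floating-point error even though the priors and utilities differ, which is the same kind of non-identifiability the post describes for probability and utility.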
While the "counterexamples" to "optimizing human utility according to AI belief" show how this fails in somewhat tricky cases, it seems to me it will be easy to find "counterexamples" where a "policy-approval agent" would fail (as compared to what is intuitively good).
From an "engineering perspective", if I were forced to choose something right now, it would be an AI "optimizing human utility according to AI beliefs" but asking for clarification when such a choice diverges too much from the "policy-approval".
I agree that it'll be easy to find counterexamples to policy-approval, but I think it'll be harder than for value-alignment agents. We have the advantage that (in the limited sense provided by the assumption that the human has a coherent probability and utility) we can prove that we "do what the human would want" (in a more comprehensive sense than we can for value alignment).
To me this sort of approach feels like a non-starter because you're ignoring the thing that generates the policy in favor of the policy itself, which would seem to expose you to Goodharting that would be even worse than the Goodharting we expect in terms of values, since policy is a grosser instrument. Is there some way in which you think this is not the case, namely that focusing on policy alignment would help us better avoid Goodharting than is possible with value alignment?
"Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things."
I don't want a "helper agent" to do what I think I'd prefer it to do. I mean, I REALLY don't want that or anything like that.
If I wanted that, I could just set it up to follow orders to the best of its understanding, and then order it around. The whole point is to make use of the fact that it's smarter than I am and can achieve outcomes I can't foresee in ways I can't think up.
What I intuitively want it to do is what makes me happiest with the state of the world after it's done it. That particular formulation may get hairy with cases where its actions alter my preferences, but just abandoning every possible improvement in favor of my pre-existing guesses about desirable actions isn't a satisfactory answer.
The AI here can do things which you wouldn't think up.
For example, it could have more computational power than you to search for plans which maximize expected utility according to your probability and utility functions. Then, it could tell you the answer, if you're the kind of person who likes to be told those kinds of answers (IE, if this doesn't violate your sense of autonomy/self-determination).
Or, if there is any algorithm P′ whose beliefs you trust more than your own, or would trust more than your own if some conditions held (which the AI can itself check), then the AI can optimize the expected value of your utility function under P′ rather than under your own beliefs, since you would prefer that.
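A toy sketch of the two cases just described, under loudly stated assumptions: the worlds, plans, probabilities, and utility table are invented for illustration, and the `human_trusts_p_trusted` flag stands in for whatever conditions the AI would actually have to check. The helper searches plans by expected utility under the human's own beliefs unless the human would endorse deferring to P′.

```python
from typing import Callable, Dict, Iterable

Belief = Dict[str, float]  # maps each world to its probability


def expected_utility(plan: str, belief: Belief,
                     utility: Callable[[str, str], float]) -> float:
    """Expected utility of a plan under a belief distribution over worlds."""
    return sum(p * utility(world, plan) for world, p in belief.items())


def best_plan(plans: Iterable[str], belief: Belief,
              utility: Callable[[str, str], float]) -> str:
    """Brute-force search for the plan with the highest expected utility."""
    return max(plans, key=lambda plan: expected_utility(plan, belief, utility))


def helper_choice(plans, p_human, p_trusted, human_trusts_p_trusted, utility):
    """Use P' only when the human would endorse it; otherwise use P_H."""
    belief = p_trusted if human_trusts_p_trusted else p_human
    return best_plan(plans, belief, utility)


# Hypothetical example: the human thinks rain is unlikely, P' disagrees.
PLANS = ["umbrella", "no_umbrella"]
P_HUMAN = {"rain": 0.2, "sun": 0.8}
P_TRUSTED = {"rain": 0.7, "sun": 0.3}
UTILITY_TABLE = {
    ("rain", "umbrella"): 1.0, ("sun", "umbrella"): 0.6,
    ("rain", "no_umbrella"): 0.0, ("sun", "no_umbrella"): 1.0,
}


def utility(world: str, plan: str) -> float:
    return UTILITY_TABLE[(world, plan)]


print(helper_choice(PLANS, P_HUMAN, P_TRUSTED, False, utility))
print(helper_choice(PLANS, P_HUMAN, P_TRUSTED, True, utility))
```

In both branches the utility function is the human's; only the belief distribution used for the expectation changes, and it changes only when the human's own judgment licenses the switch.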
Would it, though? It's not evaluating actions on my future probutility, otherwise it would wirehead me. It's evaluating actions on my present probutility. So now the answer seems to depend on whether we allow "tell me the right answer" as a primitive action, or if it is evaluated as "tell me [String]," which has low probutility.
But of course, if "tell me the right answer" is primitive, how do we stop "do the right thing" from being primitive, which lands us right back in the hot water of strong optimization of 'utility' that this proposal was supposed to prevent? So I think it should evaluate the specific output, which has low probability(human), and therefore not tell you.
I'll try and write up a proof that it can do what I think it can.
I had trouble parsing the end of this sentence. Is the idea that the AI might get into situations that are too complex for the humans to understand?
Yeah. I've edited it a bit for clarity.