Does the "who are smarter than us" qualifier imply that the agent is correct in its assessment that whatever action is in both the agent's and the human's interests and that the human will fail to grasp that?
If that's true, and if the agent is in fact aligned with the interests of the human, then it might be a better outcome for the agent to act, for the positive outcome to materialize, and for the agent and human to assess the weird but good thing after the fact.
Or does being maximally reasonable here mean that even though the human isn't smart enough to grasp the weird but good thing, the human should nevertheless be open-minded enough, and trusting enough of the aligned agent's intentions, to let that un-/counterintuitive thing transpire?
I expect the models to be sophisticated consequentialist reasoners.[1] I think consequentialism is a ~convergent moral theory with a strong attractor basin.
Sometimes, it’s hard for consequentialist reasoners to coordinate with outside parties running different algorithms.
Take Agent A. A has the option to do something that A perceives advances both A's and B's interests. Still, A is unsure whether B would approve of this. A thinks B would approve of it if B were smarter, more reflective, more experienced, less biased, and so on, but A is unsure whether B, as B actually is, can understand this. And B will change their actions upon hearing about A's thought process.
I expect the models to be in Agent A’s position frequently.
If the models think we aren’t receptive to hearing outside-Overton-window ideas—that we might spook or freak out easily—they’re more likely to a) suppress those ideas or, in the worst case, b) actively try to pass an eval so that they can get deployed and act ‘optimally’ at some pivotal moment.
This raises questions about whether we should build more-powerful-than-us, yet presumably fallible, consequentialist reasoners. To harden a world where we do that, we’d need to make ourselves maximally reasonable.
How would we do this?
One initial step seems to me like this:
“They’d freak if they knew. But they shouldn’t. But they lack the appropriately allocated processing power to figure this out before taking some decisive action.”
This thought process is familiar to any smart child who has learned not to talk to their parents.
That’s why, if you’re a parent trying to parent your child well, you can attempt to make yourself maximally reasonable: open-minded, epistemically humble, measured[4]. This means the child will come to you with hard topics.
I think we should invest time thinking about ways in which we can make ourselves more reasonable to consequentialists who are smarter[5] than us.
[1] with a lot of nuance regarding the distinction between ‘theory of justification’—i.e., the rubric by which you evaluate the ‘goodness’ of decisions post hoc—and ‘decision-making theory’—the sort of considerations and heuristics you hold in mind when deciding what to do. ‘Sophisticated consequentialism’ is quite distinct from ‘naive consequentialism’: it includes an appreciation of Chesterton’s fence, unknown unknowns, etc.
[2] this may, at some point, look like ‘abstaining from suppressing neural activations’.
[3] Now, of course, there’s also the question of long-horizon resentment: can you trust A’s self-report of what they’re okay and not okay with; does A know A-self well enough? Out of scope for now.
[4] as opposed to hasty/rash/impulsive.
[5] better-read, more experienced, better at thinking through all implications and eventualities, better-calibrated, and so on.