As you're aware, I'm exploring this problem with a multi-objective decision-making approach, where conservatism comes from only acting when an action is non-negative on the whole set of objective functions that an actor regards.

The alternative, a Bayesian AGI approach, is also worth thinking about. A conservative Bayesian AGI might not need multiple objectives. For each action, it just needs a single probability distribution over outcomes. If there are multiple theories of how to translate the consequences of its actions into its single utility function, each of those theories might be given some weight, and then they'd be combined into that probability distribution. Then a conservative Bayesian AGI only acts if an action's distribution over utility puts no probability below zero. Or maybe there's always some remote possibility of going below zero, and programming this sort of behavior would be absolutely paralyzing. In that case maybe we just make it loss-averse rather than strictly avoiding any possibility of a negative outcome.
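To make the contrast concrete, here is a rough Python sketch of the three decision rules described above: the multi-objective check, the strict Bayesian check, and the loss-averse relaxation. Every name here, the example theories and weights, and the loss-aversion factor of 2.0 are illustrative assumptions, not anything specified in the post.

```python
from typing import Dict, List, Tuple

# An outcome distribution: a list of (probability, utility) pairs.
Distribution = List[Tuple[float, float]]


def multi_objective_ok(scores: Dict[str, float]) -> bool:
    """Act only if the action is non-negative on every objective."""
    return all(s >= 0 for s in scores.values())


def mix_theories(theories: List[Tuple[float, Distribution]]) -> Distribution:
    """Combine per-theory outcome distributions, weighted by credence in each theory."""
    return [(w * p, u) for w, dist in theories for p, u in dist]


def strict_bayesian_ok(dist: Distribution) -> bool:
    """Act only if no outcome with nonzero probability has negative utility."""
    return all(u >= 0 for p, u in dist if p > 0)


def loss_averse_value(dist: Distribution, loss_aversion: float = 2.0) -> float:
    """Expected utility with losses scaled up by a (hypothetical) loss-aversion factor."""
    return sum(p * (u if u >= 0 else loss_aversion * u) for p, u in dist)


# Hypothetical example: one theory says the action is safe, a second,
# less credible theory gives it a small chance of a large loss.
theories = [
    (0.9, [(1.0, 5.0)]),
    (0.1, [(0.99, 5.0), (0.01, -100.0)]),
]
mixed = mix_theories(theories)

strict_verdict = strict_bayesian_ok(mixed)   # False: some mass is below zero
relaxed_value = loss_averse_value(mixed)     # roughly 4.795, so still positive
```

In this toy case the strict rule refuses to act because of a 0.1% chance of a bad outcome, while the loss-averse rule still acts. That is the paralysis worry in miniature.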


[-]trentbrick5y10

Why does this approach only need to be implemented in neo-cortex like AGIs? If we have a factored series of value functions in an RL agent then we should be able to take the same approach? But I guess you are thinking that the basal ganglia learning algorithms already do this for us so it is a convenient approach?

Side note. I found the distinction between confusion and conflict a bit... confusing! Confusion here is the agent updating a belief while conflict is the agent deciding to take an action?


[-]Steven Byrnes5y20

Why does this approach only need to be implemented in neo-cortex like AGIs?

Oh, I wasn't saying that. I wouldn't know either way, I haven't thought about it. RL is a very big and heterogeneous field. I only know little bits and pieces of it. It's a lot easier to make a specific proposal that applies to a specific architecture—the architecture that I happen to be familiar with—than to try to make a more general proposal. So that's all I did.

What do you mean by "factored series of value functions"? If you're thinking of my other post, maybe that's possible, although not what I had in mind, because humans can feel conflicted, but my other post is talking about a mechanism that does not exist in the human brain.

Confusion here is the agent updating a belief while conflict is the agent deciding to take an action?

Yeah, that's what I was going for.


[-]MaxRa5y10

Thanks, I find your neocortex-like AGI approach really illuminating.

Random thought:

(I think you also need to somehow set up the system so that "do nothing" is the automatically-acceptable default operation when every possibility is unpalatable.)

I was wondering if this is necessarily the best "everything is unpalatable" policy. I could imagine that the best fallback option could also be something like "preserve your options while gathering information, strategizing and communicating with relevant other agents", assuming that this is not unpalatable, too. I guess we may not yet trust the AGI to do this, since option preservation might cause much more harm than doing nothing. But I still wonder if there are cases in which every option is unpalatable but doing nothing is clearly worse.


[-]Steven Byrnes5y30

Yeah I was really only thinking about "not yet trust the AGI" as the main concern. Like, I'm somewhat hopeful that we can get the AGI to have a snap negative reaction to the thought of deceiving its operator, but it's bound to have a lot of other motivations too, and some of those might conflict with that. And it seems like a harder task to make sure that the latter motivations will never ever outbid the former, than to just give every snap negative reaction a veto, or something like that, if that's possible.

I don't think "if every option is bad, freeze in place paralyzed forever" is a good strategy for humans :-P and eventually it would be a bad strategy for AGIs too, as you say.
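As a toy illustration of the difference between letting motivations outbid each other and giving every snap negative reaction a veto, here is a rough Python sketch. The reaction scores, the veto threshold, and the function names are all made up for the example. The `fallback` parameter is only there to show that "do nothing" is one possible default among others.

```python
from typing import Dict

Reactions = Dict[str, float]  # motivation name -> reaction score for one action

DO_NOTHING = "do nothing"


def choose_by_sum(options: Dict[str, Reactions]) -> str:
    """Pick the action with the highest total score; strong motivations can outbid weak ones."""
    return max(options, key=lambda action: sum(options[action].values()))


def choose_with_veto(
    options: Dict[str, Reactions],
    veto_threshold: float = -1.0,   # hypothetical cutoff for a "snap negative reaction"
    fallback: str = DO_NOTHING,
) -> str:
    """Drop any action that some motivation rejects strongly, then pick among the rest."""
    allowed = {
        action: reactions
        for action, reactions in options.items()
        if all(score > veto_threshold for score in reactions.values())
    }
    if not allowed:
        return fallback  # every option was vetoed
    return choose_by_sum(allowed)


# Hypothetical scores: deceiving the operator helps the task a lot
# but triggers a strong negative reaction from the honesty motivation.
options = {
    "deceive operator": {"honesty": -5.0, "task progress": 9.0},
    "ask operator": {"honesty": 1.0, "task progress": 2.0},
}

summed_choice = choose_by_sum(options)       # "deceive operator" (4.0 beats 3.0)
vetoed_choice = choose_with_veto(options)    # "ask operator"
only_bad = choose_with_veto({"deceive operator": options["deceive operator"]})  # "do nothing"
```

Under the summed rule the task motivation outbids honesty. Under the veto rule it cannot, however large the task score gets.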


Conservatism in neocortex-like AGIs

What are the algorithmic correlates of feeling confused?

What are the algorithmic correlates of feeling conflicted?

What do we do when the "AGI is confused" detector circuit activates?

What do we do when the "AGI is conflicted" detector circuit activates?

What could go wrong?

Appendix: Speculations on how reward aggregation works in humans / animals