Karl — LessWrong

Introducing Corrigibility (an FAI research subfield)

Firstly, the important part of my modification to the indifference formalism is not about conditioning on the actual o but it's the fact that in evaluating the expectation of UN it take the action in A2 (for a given pair (a1,o)) which maximize UN instead of the action which maximize U (note that U is equal to US in the case that o is not in Press.).

Secondly an agent which chose a1 by simply maximizing E[UN | NotPress; a1] + E[US | Press; a1] do exhibit pathological behaviors. In partcular, there will still be incentives to manage the news, but from both sides now (there is an incentive to cause the button to be pressed in the event of an information which is bad news from the point of view of UN and incentives to cause the button to not be pressed in the events of information which is bad news from the point of view of US.

Introducing Corrigibility (an FAI research subfield)

Karl11y30

In the indifference formalism the agent in selecting A1 act like a UN agent that believe that the shutdown button will not be pressed, therefore it create perverse incentives to "manage the news". Which means that if the agent can cause his shutdown button to be pressed in the event of bad news, it will.

My formulation avoid this pathological behavior by instead making the agent select A1 as if it was a UN-agent which believed that it would continue to optimize according to UN even in the vent of the button being pressed which avoid the perverse incentives to "manage the news", while still not having any incentives to avoid the button being pressed because the agent will act like it believe that pressing the button will not cause it to initiate a shutdown.

Introducing Corrigibility (an FAI research subfield)

Karl11y60

Why not make it so that the agent in selecting A1 act as a UN-agent that believe that it will continue to optimize according to UN even in the event of the button being pressed rather than a UN agent that believe that the button will never be pressed: that is pick U such that

U(a1,o,a2) = UN(a1,o,a2) if o is in Press or US(a1,o,a2) + f(a1,o) - g(a1,o) if o is not in Press

where f(a1,o) is the maximum value of UN(a1,o,b) for b in A2 and g(a1,o) is the maximum value of US(a1,o,b) for b in A2.

This would avoid the perverse manipulation incentives problem detailed on section 4.2 of the paper.

Reduced impact AI: no back channels

Karl12y20

Apart from the obvious problems with this approach, (The AI can do a lot with the output channel other than what you wanted it to do, choosing an appropriate value for λ, etc.) I don't see why this approach would be any easier to implement than CEV.

Once you know what a bounded approximation of an ideal algorithm is supposed to look like, how the bounded version is supposed to reason about it's idealised version and how to refer to arbitrary physical data, as the algorithm defined in your post assume, then implementing CEV really doesn't seem to be that hard of a problem.

So could you explain why you believe that implementing CEV would be so much harder than what you propose in your post?

An argument against indirect normativity

Karl12y40

By that term I simply mean Eliezer's idea that the correct decision theory ought to use a maximization vantage points with a no-blackmail equilibrium.

An argument against indirect normativity

Karl12y00

If the NBC (No Blackmail Conjecture) is correct, then that shouldn't be a problem.

The Empty White Room: Surreal Utilities

Karl12y10

So you wouldn't trade whatever amount of time Frank as left, which is at most measured in decades, against a literal eternity of Fun?

If I was Frank in this scenario, I would tell the other guy to accept the deal.

An Attempt at Preference Uncertainty Using VNM

Karl12y40

Hmm. I'll have to take a closer look at that. You mean that the uncertainties are correlated, right?

No. To quote your own post:

A similar process allows us to arbitrarily set exactly one of the km.

I meant that the utility function resulting from averaging over your uncertainty over the km's will depend on which km you chose to arbitrarily set in this way. I gave an example of this phenomenon in my original comment.

An Attempt at Preference Uncertainty Using VNM

Karl12y60

You can't simply average the km's. Suppose you estimate .5 probability that k2 should be twice k1 and .5 probability that k1 should be twice k2. Then if you normalize k1 to 1, k2 will average to 1.25, while similarly if you normalize k2 to 1, k1 will average to 1.25.

In general, to each choice of km's will correspond a utility function and the utility function we should use will be a linear combination of those utility functions and we will have renormalization parameters k'm and, if we accept the argument given in your post, those k'm ought to be just as dependant on your preferences, so you're probably also uncertain about the values that those parameters should take and so you obtain k''m's and so on ad infinitum. So you obtain an infinite tower of uncertain parameters and it isn't obvious how to obtain a utility function out of this mess.

Robust Cooperation in the Prisoner's Dilemma

Karl12y30

What do you even mean by "is a possible outcome" here? Do you mean that there is no proof in PA of the negation of the proposition?

The formula of a modal agent must be fully modalized, which means that all propositions containing references to actions of agents within the formula must be within the scope of a provability operator.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments