The statement of the law of logical causality is:
Law of Logical Causality: If conditioning on any event changes the probability an agent assigns to its own action, that event must be treated as causally downstream.
If I'm interpreting things correctly, this is just because anything that's upstream gets screened off, because the agent knows what action it's going to take.
You say that LICDT pays the blackmail in XOR blackmail because it follows this law of logical causality. Is this because, conditioned on the letter being sent, if there is a disaster the agent assigns p=0 to sending money, and if there isn't a disaster the agent assigns p=1 to sending money, so the disaster must be causally downstream of the decision to send money if the agent is to know whether or not it sends money?
I didn't find the conclusion about the smoke-lovers and non-smoke-lovers obvious in the EDT case at first glance, so I added in some numbers and ran through the calculations that the robots will do to see for myself and get a better handle on what not being able to introspect but still gaining evidence about your utility function actually looks like.
Suppose that, out of the N robots that have ever been built, nN are smoke-lovers and (1−n)N are non-smoke-lovers. Suppose also the smoke-lovers end up smoking with probability p and non-smoke-lovers end up smoking with probability q.
Then (pn+q(1−n))N robots smoke, and ((1−p)n+(1−q)(1−n))N robots don't smoke. So by Bayes' theorem, if a robot smokes, there is a pn/(pn+q(1−n)) chance that it's killed, and if a robot doesn't smoke, there's a (1−p)n/(1−(pn+q(1−n))) chance that it's killed.
Hence, the expected utilities are:
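In symbols, writing u_s for the utility a smoke-lover gets from smoking and −u_k for the utility of being killed (stand-ins for whatever concrete payoffs the problem fixes), a smoke-lover's calculation comes out as

$$EU(\text{smoke}) = u_s - u_k\,\frac{pn}{pn+q(1-n)}, \qquad EU(\text{don't smoke}) = -u_k\,\frac{(1-p)n}{1-(pn+q(1-n))},$$

and a non-smoke-lover's is the same with the u_s term dropped (assuming non-smoke-lovers simply get nothing from smoking).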
Now consider some equilibria. Suppose that (essentially) no non-smoke-lovers smoke, but some smoke-lovers do: q=ε and p≫ε. Then (taking limits as ε→0 along the way):
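Keeping the same stand-in payoffs (the 90% figure below corresponds to u_k = 10u_s), the probability of being killed given smoking goes to 1, so

$$EU(\text{smoke}) \to u_s - u_k, \qquad EU(\text{don't smoke}) \to -u_k\,\frac{(1-p)n}{1-pn},$$

and the equilibrium is where the smoke-lovers are indifferent between these, which rearranges to p = 10 − 9/n.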
But wait! This is fine if more than 90% are smoke-lovers, but if fewer than 90% are smoke-lovers, then they would always choose not to smoke, which is inconsistent with the assumption that p is much larger than ε. So instead suppose that p is only a little bit bigger than ε=q, say p=kε. Then:
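Again with the same stand-in payoffs, the kill probabilities become

$$P(\text{killed}\mid\text{smoke}) \to \frac{kn}{kn+1-n}, \qquad P(\text{killed}\mid\text{don't smoke}) \to n,$$

and setting the smoke-lovers indifferent, u_s − u_k·kn/(kn+1−n) = −u_k·n, gives (with u_k = 10u_s again) k = 1 + 1/(9n−10n²).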
I wrote a quick python simulation to check these conclusions, and it agreed: p = 10 − 9/n for 0.9 < n < 1, and p = (1 + 1/(9n − 10n²))ε for 0 < n < 0.9.
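A sketch of the kind of check I mean (a cleaned-up sketch rather than the exact script, assuming payoffs of +10 for a smoke-lover smoking and −100 for being killed, which fit the 90% threshold above):

```python
from scipy.optimize import brentq

# Assumed payoffs: +10 for a smoke-lover smoking, -100 for being killed
# (these fit the 90% threshold above). Non-smoke-lovers smoke at a fixed tiny
# rate q = EPS, and we solve for the p that makes smoke-lovers indifferent
# between smoking and not smoking.
EPS = 1e-9  # the "epsilon" smoking rate of non-smoke-lovers

def eu_gap(p, n, q=EPS):
    """EU(smoke) - EU(don't smoke) for a smoke-lover, given population stats."""
    p_smokes = p * n + q * (1 - n)
    p_killed_if_smoke = p * n / p_smokes            # P(smoke-lover | smokes)
    p_killed_if_not = (1 - p) * n / (1 - p_smokes)  # P(smoke-lover | doesn't smoke)
    return (10 - 100 * p_killed_if_smoke) - (-100 * p_killed_if_not)

for n in [0.5, 0.8, 0.92, 0.95, 0.99]:
    p_eq = brentq(eu_gap, EPS, 1 - EPS, args=(n,))  # the gap is decreasing in p
    predicted = 10 - 9 / n if n > 0.9 else (1 + 1 / (9 * n - 10 * n ** 2)) * EPS
    print(f"n={n:.2f}: equilibrium p = {p_eq:.3e}, predicted {predicted:.3e}")
```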
Your reliable thermometer doesn't need to be well-calibrated - it only has to show the same value whenever it's used to measure boiling water, regardless of what that value is. So the dependence isn't quite so circular, thankfully.
So the definition of myopia given in Defining Myopia was quite similar to my expansion in the But Wait There's More section; you can roughly match them up by saying r(x) = ∑_i f_i r_i(x) and y_i(x) = (1−f_i) r_i(x), where f_i is a real number corresponding to the amount that the agent cares about rewards obtained in episode i and r_i is the reward obtained in episode i. Putting both of these into the sum gives R(x) = ∑_i r_i(x), the undiscounted, non-myopic reward that the agent eventually obtains.
In terms of the R = R_0 + R_1 definition that I give in the uncertainty framing, this is R_0 = R(x, y_0) = ∑_i f_i r_i(x) + ∑_i (1−f_i) r_i(x_0), and R_1 = R(x, y) − R(x, y_0) = ∑_i (1−f_i)(r_i(x) − r_i(x_0)).
So if you let r be a vector of the reward obtained on each step and f be a vector of how much the agent cares about each step, then x → x + ϵ ∑_i f_i ∂r_i/∂x, and thus the change to the overall reward is R → R + ϵ (∑_i ∂r_i/∂x)(∑_j f_j ∂r_j/∂x), which can be negative if the two sums have different signs.
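As a made-up concrete instance: with two episodes where ∂r_1/∂x = 1 and ∂r_2/∂x = −3, and f = (1, 0), the myopic step is δx = ϵ, which increases r_1 by ϵ but changes the overall reward by ϵ(1 − 3) = −2ϵ.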
I was hoping that a point would reveal itself to me about now but I'll have to get back to you on that one.
Thoughts on Abram Demski's Partial Agency:
When I read Partial Agency, I was struck with a desire to try formalizing this partial agency thing. Defining Myopia seems like it might have a definition of myopia; one day I might look at it. Anyway,
Formalization of Partial Agency: Try One
A myopic agent is optimizing a reward function R(x_0, y(x_0)), where x is the vector of parameters it's thinking about and y is the vector of parameters it isn't thinking about. The gradient descent step picks the δx in the direction that maximizes R(x_0+δx, y(x_0)) (it is myopic, so it can't consider the effects on y), and then moves the agent to the point (x_0+δx, y(x_0+δx)).
This is dual to a stop-gradient agent, which picks the δx in the direction that maximizes R(x_0+δx, y(x_0+δx)) but then moves the agent to the point (x_0+δx, y(x_0)) (the gradient through y is stopped).
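A toy illustration of the two update rules (my own sketch, with a made-up reward R and a made-up dependence y(x)):

```python
import jax
import jax.numpy as jnp

# Made-up dependence of the not-thought-about parameters on x, and a made-up reward.
def y(x):
    return jnp.sin(x)

def R(x, y_val):
    return -(x - 1.0) ** 2 - (y_val - 2.0) ** 2

lr = 0.1
x0 = 0.5

# Myopic agent: the step direction only sees R(x0 + dx, y(x0)), i.e. y is held
# fixed when computing the gradient, but afterwards y really does move to y(x0 + dx).
myopic_dir = jax.grad(lambda x: R(x, y(x0)))(x0)
x1 = x0 + lr * myopic_dir
myopic_state = (x1, y(x1))

# Stop-gradient agent (the dual): the step direction uses the full gradient of
# R(x, y(x)), but the agent then ends up at (x0 + dx, y(x0)); y doesn't follow.
full_dir = jax.grad(lambda x: R(x, y(x)))(x0)
x1_sg = x0 + lr * full_dir
stop_gradient_state = (x1_sg, y(x0))

print(myopic_state, stop_gradient_state)
```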
Partial Agency due to Uncertainty?
Is it possible to cast partial agency in terms of uncertainty over reward functions? One reason I'd be myopic is if I didn't believe that I could, in expectation, improve some part of the reward, perhaps because it's intractable to calculate (behavior of other agents) or something I'm not programmed to care about (reward in other episodes).
Let R_1 be drawn from a probability distribution over reward functions. Then one could decompose the true, uncertain reward into R′ = R_0 + R_1, defined in such a way that E(R_1(x_0+δx) − R_1(x_0)) ≈ 0 for any δx? Then this would be myopia where the agent either doesn't know or doesn't care about R_1, or at least doesn't know or care what its output does to R_1. This seems sufficient, but not necessary.
Now I have two things that might describe myopia, so let's use both of them at once! Since you only end up doing gradient descent on R_0, it would make sense to say R′(x) = R(x, y(x)), R_0(x) = R(x, y(x_0)), and hence that R_1(x) = R(x, y(x)) − R(x, y(x_0)).
Since R_1(x_0+δx) = R_1(x_0) + δx·∂R_1/∂x for small δx, this means that E(∂R_1/∂x) = 0, so substituting in my expression for R_1 gives E(∂R/∂x + (∂R/∂y)(∂y/∂x) − ∂R/∂x) = 0, so E((∂R/∂y)(∂y/∂x)) = 0. Uncertainty is only over R, so this is just the claim that the agent will be myopic with respect to y if E(∂R/∂y) = 0. So it won't want to include y in its gradient calculation if it thinks the gradients with respect to y are, on average, 0. Well, at least I didn't derive something obviously false!
But Wait There's More
When writing the examples for the gradient-descent-y formalisation, something struck me: it seems there's an R(x) = r(x) + ∑_i y_i(x) structure to a lot of them, where r is the reward on the current episode, and the y_i are rewards obtained on future episodes.
You could maybe even use this to have soft episode boundaries, like saying the agent receives a reward r_t on each timestep, so R(x) = r_0(x) + r_1(x)α + r_2(x)α² + ∑_{i≥3} r_i(x)α^i, and saying that α³ ≪ 1, so that ∂R/∂r_i = α^i ≪ 1 for i ≥ 3, which is basically the criterion for myopia up above.
On a completely unrelated note, I read the Parable of Predict-O-Matic in the past, but foolishly neglected to read Partial Agency beforehand. The only thing that I took away from PoPOM the first time around was the bit about inner optimisers, coincidentally the only concept introduced that I had been thinking about beforehand. I should have read the manga before I watched the anime.
The Whole City is Center:
This story had a pretty big impact on me and made me try to generate examples of things that could happen such that I would really want the perpetrators to suffer, even more than consequentialism demanded. I may have turned some very nasty and imaginative parts of my brain, the kind that wrote the Broadcast interlude in Unsong, to imagining crimes perfectly calculated to enrage me. And in the end I did it. I broke my brain to the point where I can very much imagine certain things that, were they to happen, would make me want the perpetrator to suffer – not infinitely, but not zero either.
The AI Box game, in contrast with the thing it's a metaphor for, is a two-player game played over text chat by two humans, where the goal is for Player A to persuade Player B to let them win (traditionally by getting them to say "I let you out of the box") within a time limit.
Thoughts on Dylan Hadfield-Menell et al.'s The Off-Switch Game.
Just under a month ago, I said "web app idea: one where you can set up a play-money prediction market with only a few clicks", because I was playing around on Hypermind and wishing that I could do my own Hypermind. It then occurred to me that I can make web apps, so after getting up to date on modern web frameworks I embarked on creating such a site.
Anyway, it's now complete enough to use, provided that you don't blow on it too hard. Here it is: pmarket-maker.herokuapp.com. Enjoy!
You can create a market, and then create a set of options within that market. Players can make buy and sell limit orders on those options. You can close an option and pay out a specific amount per owned share. There are no market makers, despite the pun in the name, but players start with 1000 internet points that they can use to shortsell.
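In case "limit orders" is unfamiliar: a buy order names the most you'll pay per share and a sell order names the least you'll accept, and a trade can happen whenever the best bid meets the best ask. A toy sketch of that matching idea (my own illustration, not the site's actual code):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Order:
    player: str
    price: int   # internet points per share
    shares: int

@dataclass
class OrderBook:
    bids: List[Order] = field(default_factory=list)  # buy orders, highest price first
    asks: List[Order] = field(default_factory=list)  # sell orders, lowest price first

    def add(self, order: Order, side: str):
        book = self.bids if side == "buy" else self.asks
        book.append(order)
        self.bids.sort(key=lambda o: -o.price)
        self.asks.sort(key=lambda o: o.price)
        self.match()

    def match(self):
        # Cross orders while the highest bid is at least the lowest ask.
        while self.bids and self.asks and self.bids[0].price >= self.asks[0].price:
            bid, ask = self.bids[0], self.asks[0]
            traded = min(bid.shares, ask.shares)
            print(f"{bid.player} buys {traded} share(s) from {ask.player} at {ask.price}")
            bid.shares -= traded
            ask.shares -= traded
            if bid.shares == 0:
                self.bids.pop(0)
            if ask.shares == 0:
                self.asks.pop(0)
```

For example, `book = OrderBook(); book.add(Order("alice", 60, 5), "buy"); book.add(Order("bob", 55, 3), "sell")` prints a 3-share trade at 55.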
Thoughts on Ryan Carey's Incorrigibility in the CIRL Framework (I am going to try to post these semi-regularly).