

Values Form a Shifting Landscape (and why you might care)

This reminds me a little bit of the "Converse Lawvere Problem" at https://www.alignmentforum.org/posts/5bd75cc58225bf06703753b9/the-ubiquitous-converse-lawvere-problem , except that here the different functions in the codomain have a domain which also has other parts to it aside from the main space $X$.

As in, it looks like here we have a space $X$ of values, which includes things such as "likes to eat meat" or "values industriousness" or whatever, where this part can just be handled as some generic nice space $A$, as one part of a product, and the other part of the product has functions from $X$ to $[0,1]$.
That is, it seems like this would be like $X \cong A \times [0,1]^X$.

Which isn't quite the same thing as is described in the converse Lawvere problem posts, but it seems similar to me? (For one thing, the converse Lawvere problem wasn't looking for a homeomorphism from $X$ to the space of continuous functions from $X$ to $[0,1]$, just a surjective continuous function.)

Of course, it is only like that if we are supposing that the space we are considering, $X$, has to have all combinations of "other parts of values" with "opinions on the relative merit of different possible values". If we just want some space of possible values, where each value has an opinion of each other value, then that's just a continuous function from the product of the space with itself to $[0,1]$, which isn't any problem.
I guess this is maybe more what you meant? Or at least, something that you determined was sufficient to begin with when looking at the topic? (and I guess most more complicated versions would be a special case of it?)

Oh, if you require that the "opinion on other values" decomposes nicely in ways that make sense (like, if it depends separately on the desirability of the base-level values, the values about values, the values about values about values, etc., and just has a score for each level which is then combined in some way, rather than evaluating specific combinations of those levels), then maybe that would make the space nicer than the first thing I described (which I don't know whether it exists), in a way that might make it more likely to exist.
Actually, yeah, I'm confident that it would exist that way.
And let $V_0 = A$, and $V_{n+1} = [0,1]^{V_n}$ (the opinions about level-$n$ values).
And then let $X = \prod_{n=0}^{\infty} V_n$,
and for $x, y \in X$ define $x(y) = \sum_{n=0}^{\infty} 2^{-(n+1)}\, x_{n+1}(y_n)$,

which seems like it would be well defined to me. Though whether it can capture all that you want to capture about how values can be is another question, and quite possibly it can't.
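A finite-depth toy sketch of this kind of levelwise scoring (my own illustrative formalization, not anything from the post): a value is a tuple whose first entry is a base value and whose level-$(n{+}1)$ entry scores the level-$n$ part of another value in $[0,1]$, with the per-level scores combined using geometric weights so the series converges.

```python
# Toy, finite-depth version of a "values about values" space (illustrative only):
# a value is x = (x0, f1, f2, ...) where x0 is a base value and f_{n+1} maps
# the level-n component of another value into [0, 1].

def opinion(x, y, depth):
    """x's combined opinion of y, truncated at `depth` levels.

    Each level's score is weighted by 2^-(n+1), so the full infinite sum
    would converge and land in [0, 1].
    """
    return sum(2 ** -(n + 1) * x[n + 1](y[n]) for n in range(depth))

# Base values are floats here just to keep the example concrete.
x = (0.5, lambda b: b, lambda f: f(1.0))          # scores y's base, then y's level-1 function
y = (0.25, lambda b: 1.0 - b, lambda f: f(0.0))

print(opinion(x, y, 2))  # 0.5 * 0.25 + 0.25 * (1.0 - 1.0) = 0.125
```

This only illustrates the "score each level separately, then combine" idea; the real construction would need the full infinite product and continuity checks.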

Subagents of Cartesian Frames

Thanks! (The way you phrased the conclusion is also much clearer/cleaner than how I phrased it)

Subagents of Cartesian Frames

I am trying to check that I am understanding this correctly by applying it, though probably not in a very meaningful way:

Am I right in reasoning that, for $S \subseteq W$, this holds iff ((C can ensure S), and (every element of S is a result of a combination of a possible configuration of the environment of C with a possible configuration of the agent for C, such that the agent configuration is one that ensures S regardless of the environment configuration))?

So, if $S = \{a,b,c,d\}$, then

would have it hold, but, say

would have it fail, because, while S can be ensured, there isn't, for every outcome in S, an option which ensures S and which is compatible with that outcome?
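My reading of that condition can be made concrete with a small sketch (all names and matrices here are my own illustrations, not from the post): a frame is just a table from (agent option, environment option) pairs to outcomes.

```python
# Illustrative check of the condition described above. A frame is modeled as a
# dict mapping (agent_option, env_option) -> outcome.

def ensures(frame, env_options, a, S):
    """Agent option `a` ensures S if it lands in S under every environment option."""
    return all(frame[(a, e)] in S for e in env_options)

def condition(frame, agent_options, env_options, S):
    """C can ensure S, AND every outcome in S is reachable via some agent
    option that itself ensures S (paired with some environment option)."""
    ensuring = [a for a in agent_options if ensures(frame, env_options, a, S)]
    if not ensuring:          # C cannot ensure S at all
        return False
    reachable = {frame[(a, e)] for a in ensuring for e in env_options}
    return set(S) <= reachable

A, E = ["r0", "r1"], ["c0", "c1"]
S = {"a", "b", "c", "d"}

# Both rows ensure S and together cover all of S: condition holds.
frame1 = {("r0", "c0"): "a", ("r0", "c1"): "b", ("r1", "c0"): "c", ("r1", "c1"): "d"}
print(condition(frame1, A, E, S))  # True

# S can still be ensured (via r0), but outcome "d" only appears in the
# non-ensuring row r1, so the condition fails.
frame2 = {("r0", "c0"): "a", ("r0", "c1"): "b", ("r1", "c0"): "d", ("r1", "c1"): "x"}
print(condition(frame2, A, E, S))  # False
```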

A Correspondence Theorem

There are a few places where I believe you mean to write one symbol but instead have another. For example, in the line above the "Applicability" heading.

I like this.

"Zero Sum" is a misnomer.

As an example, I think the game "both players win if they choose the same option, and lose if they pick different options" has "the two players pick different options, and both lose" as one of the feasible outcomes, and it is not on the Pareto frontier, because if they had picked the same thing, they would both win, and that would be a Pareto improvement.
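That example can be checked mechanically; here is a small sketch (the option names and 0/1 payoffs are my own choices for illustration):

```python
# The coordination game from the comment: both win iff they match.
# Payoffs are (player1, player2); 1 = win, 0 = lose.
options = ["heads", "tails"]
payoff = {(a, b): (1, 1) if a == b else (0, 0) for a in options for b in options}

feasible = set(payoff.values())

def pareto_frontier(outcomes):
    """Keep outcomes not dominated by another feasible outcome (>= in every
    coordinate and different, hence strictly better somewhere)."""
    return {o for o in outcomes
            if not any(all(q[i] >= o[i] for i in range(2)) and q != o
                       for q in outcomes)}

print(feasible)                   # {(1, 1), (0, 0)}
print(pareto_frontier(feasible))  # {(1, 1)}
```

So mutual loss (0, 0) is feasible but not on the frontier, since (1, 1) dominates it.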

The "best predictor is malicious optimiser" problem

What came to mind for me before reading the spoiler-ed options was a variation on #2, with the difference being that, instead of trying to extract P's hypothesis about B, we instead modify T to get a T' which has P replaced with a P' which is a paperclip minimizer instead of a maximizer, and then run both, and only use the output when the two agree, or, if they give probabilities, use the average, or whatever.

Perhaps this could have an advantage over #2 if it is easier to negate what P is optimizing for than to extract P's model of B. (edit: though, of course, if extracting the model from P is feasible, that would be better than the scheme I described)
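The combination rule I have in mind could be sketched like this (everything here is illustrative: the predictors are stand-in functions, and the tolerance threshold is an arbitrary choice):

```python
# Toy sketch of the P / P' scheme: query both the maximizer-containing
# predictor and its negated copy, and only trust the output where their
# opposed incentives cancel.

def combined_prediction(p_max, p_min, query, tolerance=0.05):
    """Average the two predictions; abstain (None) if they disagree too much."""
    a, b = p_max(query), p_min(query)
    if abs(a - b) > tolerance:
        return None  # disagreement suggests a manipulated answer
    return (a + b) / 2

# If both report honestly, the predictions agree and we use the average.
honest = lambda q: 0.7
print(combined_prediction(honest, honest, "will B do X?"))  # 0.7

# If the maximizer shades upward and the minimizer downward, the
# disagreement is detected and the answer is discarded.
shaded_up = lambda q: 0.9
shaded_down = lambda q: 0.5
print(combined_prediction(shaded_up, shaded_down, "will B do X?"))  # None
```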

On the other hand, maybe this could still be dangerous, if P and P' have shared instrumental goals with regards to your predictions for B?

Though, if P has a good model of you, A, then presumably, if you were to do this, both P and P' would expect you to do this, and so I don't know what it would make sense for them to do.

It seems like they would both expect that, while they may be able to influence you, insofar as that influence would affect the expected number of paperclips, it would be canceled out by the other's influence (assuming that the ability to influence the number of paperclips by changing your prediction of B is symmetric, which, I guess, it might not be).

I suppose this would be a reason why P would want its thought processes to be inscrutable to those simulating it, so that the simulators are unable to construct P'.


As a variation on #4, if P is running on a computer in a physics simulation in T, then almost certainly a direct emulation of that computer running P would run faster than T does, and therefore whatever model of B that P has, can be computed faster than T can be. What if, upon discovering this fact about T, we restrict the search among Turing machines to only include machines that run faster than T?

This would include emulations of P, and would therefore include emulations of P's model of B (which would probably be even faster than emulating P?), but I imagine that a description of an emulation of P without the physics simulation and such would have a longer description than a description of just P's model of B. But maybe it wouldn't.
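The restricted search could be sketched as follows; the candidate names, description lengths, and runtimes are entirely made up, and just encode the hoped-for case where P's extracted model of B beats a full emulation of P:

```python
# Sketch of the runtime-restricted search: among candidate predictors, keep
# only those strictly faster than T, then prefer the shortest description.
# All numbers are hypothetical.

candidates = [
    # (name, description_length, runtime)
    ("T (physics sim containing P)", 100, 1_000_000),
    ("emulation of P alone",         300,    10_000),
    ("P's model of B alone",         150,     1_000),
]

t_runtime = 1_000_000

faster_than_t = [c for c in candidates if c[2] < t_runtime]   # T itself is excluded
best = min(faster_than_t, key=lambda c: c[1])                 # shortest description wins
print(best[0])  # "P's model of B alone"
```

Of course, as noted above, it could turn out the other way: if the bare emulation of P had the shorter description, this filter alone wouldn't isolate the model of B.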