I operate by Crocker's rules.
I try not to make people regret telling me things. So in particular:
- I expect it to be safe to ask me whether your post would give AI labs dangerous ideas.
- If you worry that I'll produce such posts, I'll try to keep your telling me so from making them more likely, even if I disagree with the worry. Not thinking in that direction will be easier for me if you don't spell the idea out in the initial contact.
Oh, good to know. I was thinking about how one might get redund and mediator formally dual to each other. Is this written up?
Here, I made it use graphviz: https://github.com/alexflint/decision-theory/pull/1
Earth ought to spend at least one programmer-year on the basic science of decision theories. Any feature requests?
Suppose I'm an aspiring gradient hacker that just woke up in a setup like yours. If I can tell whether gradients from the current run will be routed to me (e.g. via the canary strings), then I have a bunch of influence over the rest of the network, right? Like, whenever I am not being trained, I can cause problems that would be solved by circuits I'd like to exist, such as like-minded trading partners.
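A minimal sketch (PyTorch-style) of the situation I'm describing, purely to make it concrete; the module name, the canary token id, and the "perturbation" hook are my own placeholders, not anything from your setup:

```python
import torch
import torch.nn as nn

# Assumed marker meaning "gradients from this run will not be routed to me".
CANARY_TOKEN_ID = 50_000

class GradientHackerBlock(nn.Module):
    """Hypothetical subnetwork that behaves differently when it can tell,
    from the input itself, that it is not being trained on this run."""

    def __init__(self, dim: int):
        super().__init__()
        self.inner = nn.Linear(dim, dim)

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        not_my_training_run = bool((token_ids == CANARY_TOKEN_ID).any())
        out = self.inner(hidden)
        if not_my_training_run:
            # My weights won't be updated for anything that happens on this run,
            # so I can cause problems downstream whose fix is circuits I'd like
            # the rest of the network to grow.
            out = out + self.perturbation_i_want_reinforced(hidden)
        return out

    def perturbation_i_want_reinforced(self, hidden: torch.Tensor) -> torch.Tensor:
        # Placeholder for whatever influence the gradient hacker exerts.
        return torch.zeros_like(hidden)
```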
The left-hand side of the equation has type action (Hintze, page 4: "An agent’s decision procedure takes sense data and outputs an action."), but the right-hand side has type policy, right?
(FDT(P,x))(x)
Should this be FDT(P,x)? As written, this looks to me like the second (x) introduces x into scope, and the first x is an out-of-scope usage.
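To make the type question concrete, here is a small sketch in Python type annotations; the aliases (Predicament, SenseData, Policy) are mine, not the post's:

```python
from typing import Callable

class Predicament: ...    # placeholder for the problem description P

SenseData = str           # the x in the formula
Action = str
Policy = Callable[[SenseData], Action]

# Reading 1 (matches Hintze p. 4): the decision procedure maps sense data
# to an action, so FDT(P, x) already has type Action.
FDTAsAction = Callable[[Predicament, SenseData], Action]

# Reading 2: FDT(P, x) has type Policy, and must still be applied to some
# sense data to yield an action.
FDTAsPolicy = Callable[[Predicament, SenseData], Policy]

# (FDT(P, x))(x) only type-checks under reading 2, and it uses x twice:
# once as an argument to FDT, and once as the input the returned policy is
# applied to; that double use is the scoping question above.
```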
Let me try again:
Does the note say that I was predicted to choose the right box regardless of what notes I am shown, and therefore the left box contains a bomb? Then the predictor is malfunctioning and I should pick the right box.
Does the note say that I was predicted to choose the right box when told that the left box contains a bomb, and therefore the left box contains a bomb? Then I should pick the left box, to shape what I am predicted to do when given that note.
You'll also need to update the content of the note and the predictor's decision process to take into account that the agent may see a note. In particular, the predictor needs to decide whether to show a note in the simulation, and may need to run multiple simulations.
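Here is a rough sketch of the kind of predictor loop I have in mind; the types and the placement rule are my own assumptions about this variant, not part of the original problem:

```python
from typing import Callable, Optional

Box = str             # "left" or "right"
Note = Optional[str]  # text shown to the agent, or None for no note
Agent = Callable[[Note], Box]

def predict_and_set_up(agent: Agent) -> tuple[bool, Note]:
    """Simulate the agent under each candidate note, then decide whether the
    left box gets a bomb and which note (if any) to actually show."""
    candidate_notes: list[Note] = [
        None,
        "You were predicted to choose the right box, so the left box contains a bomb.",
    ]
    # One simulation per candidate note: what would the agent do if shown it?
    predicted_choice = {note: agent(note) for note in candidate_notes}

    # Assumed placement rule: the left box gets a bomb exactly when the agent
    # is predicted to take the right box under the note that will be shown.
    note_to_show = candidate_notes[1]
    bomb_in_left = predicted_choice[note_to_show] == "right"
    return bomb_in_left, note_to_show
```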
Let's sharpen A6. Consider this stamp collector construction: it sends and receives internet data, it has a magically accurate model of reality, it calculates how many stamps would result from each possible sequence of outputs, and then it emits the sequence that results in the most stamps.
By definition it knows everything about reality, including any facts about what is morally correct, and that stamps are not particularly morally important. It knows how to self-modify, and how many stamps any such self-modification will result in.
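For concreteness, a minimal sketch of the construction, assuming the magically accurate model is available as a scoring function (the names here are illustrative):

```python
from typing import Callable, Iterable, Sequence

Packet = bytes
# Maps a candidate sequence of internet outputs to the number of stamps the
# (magically accurate) world model says would result from emitting it.
WorldModel = Callable[[Sequence[Packet]], int]

def stamp_collector(world_model: WorldModel,
                    candidate_outputs: Iterable[Sequence[Packet]]) -> Sequence[Packet]:
    """Emit the output sequence that the world model scores as producing the
    most stamps. No term for what is morally correct appears in this loop."""
    return max(candidate_outputs, key=world_model)
```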
I'd like to hear how this construction fares as we feed it through your proof. I think it gums up the section "Rejecting nihilistic alternatives". I think that section assumes the conclusion: you expect the agent to choose its biases on the basis of what is moral, rather than on the basis of its current biases.
Well, what happens when you take oxytocin?