I'd like to get different answers in those two worlds. That definitely requires having some term in the loss that is different in W1 and W2. There are three ways the kinds of proposals in the doc can handle this:
In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don't do that is that it's likely to be too big to make much sense of.
We don't quite have access to the AI Bayes net---we just have a big neural network, and we sometimes talk about examples where what the neural net is doing internally can be well-described as "inference in a Bayes net."
So ideally a solution would use neither the human Bayes net nor the AI Bayes net.
But when thinking about existing counterexamples, it can still be useful to talk about how we want an algorithm to behave in the case where the human/AI are using a Bayes net, and we do often think about ideas that use those Bayes nets (with the understanding that we'd ultimately need to refine them into approaches that don't depend on having an explicit Bayes net).
We're going to accept submissions through February 10.
(We actually ended up receiving more submissions than I expected but it seems valuable, and Mark has been handling all the reviews, so running for another 20 days seems worthwhile.)
"The goal is" -- is this describing Redwood's research or your research or a goal you have more broadly?
My general goal, Redwood's current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally.
I'm curious how this is connected to "doesn't write fiction where a human is harmed".
"Don't produce outputs where someone is injured" is just an arbitrary thing not to do. It's chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direction is bad, though in retrospect I think it's plausible that a predicate with neutral valence would have been better to avoid confusion).
The goal is not to remove concepts or change what the model is capable of thinking about; it's to make a model that never tries to deliberately kill everyone. There's no doubt that it could deliberately kill everyone if it wanted to.
I'd be fine with a proposal that flips coins and fails with small probability (in every possible world).
I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.
It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.
I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?
Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.
It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that the benign hypothesis's description is within about 30 bits of the malign hypotheses'), and embeddedness alone isn't enough to get you there.
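The arithmetic behind the 30-bit figure can be sketched as follows. This is only an illustration under simplifying assumptions that are not in the comment above: there are exactly two hypotheses (one benign, one malign), each gets prior weight 2^(-description length), and both explain the data equally well, so the posterior odds equal the prior odds. The function names are hypothetical.

```python
import math


def benign_posterior(gap_bits: float) -> float:
    """Posterior mass on the benign hypothesis when its description is
    `gap_bits` longer than the malign one.

    Assumes two hypotheses with equal likelihoods, so the odds
    benign : malign are 2**-gap_bits : 1.
    """
    return 1.0 / (1.0 + 2.0 ** gap_bits)


def max_gap_bits(min_benign_mass: float) -> float:
    """Largest description-length gap (in bits) at which the benign
    hypothesis still keeps at least `min_benign_mass` of the posterior.

    Inverts mass = 1 / (1 + 2**gap).
    """
    return math.log2(1.0 / min_benign_mass - 1.0)


if __name__ == "__main__":
    for gap in (0, 10, 30):
        p = benign_posterior(gap)
        print(f"gap={gap:>2} bits  benign={p:.3e}  malign={1 - p:.12f}")
    print("gap allowed for benign mass 1e-9:", max_gap_bits(1e-9))
```

Under these assumptions, a 30-bit gap leaves the benign hypothesis with a bit less than one part in a billion of the posterior, which is the regime where a confidence threshold has essentially nothing to work with.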
I'm not sure I understand what you mean by "decision-theoretic approach"
I mean that you have some utility function, are choosing actions based on E[utility|action], and perform Solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences (and if you try to define utility in terms of Solomonoff induction applied to your experiences, e.g. by learning a human, then it seems again vulnerable to attack, bridging hypotheses or no).
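As a toy illustration of "choosing actions based on E[utility|action]", here is a minimal sketch. Everything in it is a hypothetical stand-in: `weights` plays the role of whatever credences over world-models the agent has (however they were obtained), and `utility[h][a]` is the utility of taking action `a` if world-model `h` is correct. The point it tries to show is only that utility is a function of the world-model and the action, not of the agent's own observation stream.

```python
from typing import Dict, List


def expected_utility(
    action: str,
    weights: Dict[str, float],
    utility: Dict[str, Dict[str, float]],
) -> float:
    """E[utility | action]: average utility over world-models,
    weighted by the agent's credence in each one."""
    return sum(w * utility[h][action] for h, w in weights.items())


def choose_action(
    actions: List[str],
    weights: Dict[str, float],
    utility: Dict[str, Dict[str, float]],
) -> str:
    """Pick the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, weights, utility))


# Hypothetical numbers, for illustration only.
weights = {"h1": 0.7, "h2": 0.3}
utility = {
    "h1": {"a": 1.0, "b": 0.0},
    "h2": {"a": 0.0, "b": 2.0},
}
best = choose_action(["a", "b"], weights, utility)
```

Here action "a" has expected utility 0.7 and "b" has 0.6, so `best` is "a" even though "b" has the larger payoff in the less likely world.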
This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.
I agree that the situation is better when Solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).