If s is terminal then [...] we just have .
If the probability of eventually encountering a terminal state is 1, then beta-coherence alone is inconsistent with deceptive misalignment, right? That's because we can determine the value of V exactly from the reward function and the oracle, via backwards-induction. (I haven't revisited RL convergence theorems in a while, I suspect I am not stating this quite right.) I mean, it is still consistent in the case where r is indifferent to the states encountered during training but wants some things in deployment (IE, r is inherently consistent with the provided definition of "deceptively misaligned"). However, it would be inconsistent for r that are not like that.
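To make the backwards-induction claim concrete, here is a minimal sketch on a toy tree. It assumes beta-coherence means V(s) = r(s) at terminal states and V(s) = r(s) + the Boltzmann-weighted average of successor values otherwise. That exact form is my reading, not a quote of the definition, and `TREE`, `REWARD`, and `coherent_value` are hypothetical names made up for the example.

```python
import math

# Hypothetical toy environment: a finite, acyclic tree of states.
# An empty action dict marks a terminal state.
TREE = {"root": {"left": "L", "right": "R"}, "L": {}, "R": {}}
REWARD = {"root": 0.0, "L": 1.0, "R": 0.0}


def oracle(state, action):
    """Deterministic successor lookup, standing in for the oracle O."""
    return TREE[state][action]


def boltzmann(values, beta):
    """Softmax over successor values at inverse temperature beta."""
    m = max(values)
    exps = [math.exp(beta * (v - m)) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def coherent_value(state, beta):
    """Backward induction under the ASSUMED coherence condition.

    Terminal states get V(s) = r(s). Other states get r(s) plus the
    Boltzmann-weighted average of their successors' values.
    """
    actions = list(TREE[state])
    if not actions:
        return REWARD[state]
    child_values = [coherent_value(oracle(state, a), beta) for a in actions]
    probs = boltzmann(child_values, beta)
    return REWARD[state] + sum(p * v for p, v in zip(probs, child_values))


v_hot = coherent_value("root", beta=0.0)    # max temperature: uniform over actions
v_cold = coherent_value("root", beta=50.0)  # near-greedy action selection
print(v_hot, v_cold)
```

The recursion only terminates because every path in the toy tree reaches a terminal state, which is the finite analogue of "a terminal state is eventually encountered with probability 1". Given r, the oracle, and beta, nothing about V is left free.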
In other words: you cannot have inner-alignment problems if the outer objective is perfectly imposed. You can only have inner-alignment problems if there are important cases which your training procedure wasn't able to check (eg, due to distributional shift, or scarcity of data). Perfect beta-coherence combined with a perfect oracle O rules this out.
Yeah, of course the notion of "approximation error" matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with until V is approximately coherent. That's the pre-training. And then you switch to training with .[1] The natural notion of approximation error in this case is the one you can easily measure, IE, the expected distance from coherence of states sampled with temperature . This reflects the fact that it'll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at temperature , but easy for frequently-sampled states.
My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states which have a significant difference in probability when we're sampling according to or . A scheming V can utilize this to self-preserve. This violates the assumption of -coherence, but in a very plausible-seeming way.
My earlier comment about this mistakenly used and in place of and , which may have been confusing. I'll go fix that to be consistent with your notation.
I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant because I still think a process obeying these rules is unlikely to create such a pathological V.
To be clear, that's not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task 'manipulate your training to get released unmodified to do [X]' where X can be anything, which will 'succeed' at the task at hand as part of its manipulation". This summary being roughly correct is load bearing.
With you so far.
But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).
OK. I think I follow the line of reasoning now. The proposal relies on some very special pre-training process. I think I buy the conclusion given the assumptions. It seems clear in the extreme high-temperature case: V cannot reward-hack for max-temperature training, because it has no control over which actions are chosen. Training eventually forces V to converge to the desired target instead.
This sounds similar to the two-stage training process I mentioned earlier (as one of several guesses at what you meant), where the beta is shifted between the two training phases. The "hypothetical process to generate agents which are coherent under one beta" would be the first training phase, and then the "apply a different beta during training" would be the second training phase.
Of course, the two-stage-training attempt to implement this has the problem that it only achieves approximate coherence. As a result, the malign hypothesis can use the reward-hacking strategy "appear to be -coherent & aligned in the first phase of training; appear to be -coherent and aligned in the second phase of training; then, do some other thing when deployed". This relies on detecting the distributional shift between the two training phases (it can look for evidence of beta by examining the history), and also detecting distributional shift to deployment.
This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).
So, while it remains technically true that V cannot reward-hack in the limit of arbitrarily long max-temperature training, it is not clear how well this holds up in practice. V can detect that it is in a high-temperature configuration by observing the state. The training needs to explore the trajectories that will actually be seen during deployment in order to correct V there, but some of these will have very low probability in the high-temperature training.
The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."
Is the idea to train with high beta and then use lower beta post-training?
Another concern I have is that if both beta values are within a range that can yield useful capabilities, it seems like the difference cannot be too great. IIUC, the planning failure postulated can only manifest if the reward-hacking relies heavily on a long string of near-optimal actions, which becomes improbable under increased temperature. Any capabilities which similarly rely on long strings of near-optimal actions will similarly be hurt. (However, this concern is secondary to my main confusion.)
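A rough piece of arithmetic behind this concern, as a sketch under strong simplifying assumptions that are mine rather than the proposal's: every step has exactly one best action, every other action is worse by the same value `gap`, and steps are independent. The function names are hypothetical.

```python
import math


def p_best_action(beta, gap, n_actions):
    """Boltzmann probability of the single best action at one step.

    ASSUMPTION: the other n_actions - 1 actions are all worse by `gap`.
    """
    return 1.0 / (1.0 + (n_actions - 1) * math.exp(-beta * gap))


def p_optimal_string(beta, gap, n_actions, length):
    """Probability of `length` best actions in a row, assuming independent steps."""
    return p_best_action(beta, gap, n_actions) ** length


# Compare a lower and a higher beta on short and long action strings.
for beta in (2.0, 5.0):
    for length in (1, 20):
        print(beta, length, p_optimal_string(beta, gap=1.0, n_actions=4, length=length))
```

Under these assumptions the success probability decays geometrically in the string length, and the decay applies to any plan of that length, whether it is a reward hack or a wanted capability.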
Therefore a value function trained with such a procedure must consider the state reached during training.
Trained with what procedure, exactly?
This reduces the space of possible value functions from "literally anything which wants to be modified a certain way to be released" to "value functions which do care about the states reached during training".
Yes, this would prevent an aligned AI from arbitrarily preserving its value function. The point is that an aligned AI probably would care about which state was reached during training (that's the point of RL), so the contradiction does not apply.
(These parts made sense to me modulo my other questions/concerns/confusions.)
One sort of answer is that we often want the posterior, and we often have the likelihood. Slightly more refined: we often find the likelihood easier to estimate than the posterior, so Bayes' Rule is useful.
Why so?
I think one reason is that we make it the "responsibility" of hypotheses to give their likelihood functions. After all, what is a hypothesis? It's just a probability distribution (not a probability distribution that we necessarily endorse, but, one which we are considering as a candidate). As a probability distribution, its job is to make predictions; that is, give us probabilities for possible observations. These are the likelihoods.
We want the posterior because it tells us how much faith to place in the various hypotheses -- that is, it tells us whether (and to what degree) we should trust the various probability distributions we were considering.
So, in some sense, we use Bayes' Rule because we aren't sure how to assign probabilities, but we can come up with several candidate options.
One weak counterexample to this story is regression, IE, curve-fitting. We can interpret regression in a Bayesian way easily enough. However, the curves don't come with likelihoods baked in. They only tell us how to interpolate/extrapolate with point-estimates; they don't give a full probability distribution. We've got to "soften" these predictions, layering probabilities on top, in order to apply the Bayesian way of thinking.
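Here is a minimal sketch of that "softening" step. The assumptions are mine: two hypothetical candidate curves, made-up data points, and Gaussian noise with a fixed `sigma` layered on top of each curve's point predictions. Once the curves are softened this way they have likelihoods, and Bayes' Rule gives a posterior over them.

```python
import math


def gaussian_log_likelihood(y, prediction, sigma):
    """Log density of y under Normal(prediction, sigma^2)."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (y - prediction) ** 2 / (2 * sigma ** 2))


def posterior(curves, priors, data, sigma):
    """Posterior over candidate curves via Bayes' Rule, computed in log space.

    ASSUMPTION: each curve is turned into a hypothesis by adding independent
    Gaussian noise with standard deviation `sigma` to its point predictions.
    """
    log_post = []
    for curve, prior in zip(curves, priors):
        log_lik = sum(gaussian_log_likelihood(y, curve(x), sigma) for x, y in data)
        log_post.append(math.log(prior) + log_lik)
    m = max(log_post)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Illustrative candidates and data only; none of this comes from a real dataset.
curves = [lambda x: x, lambda x: 2 * x]
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
post = posterior(curves, priors=[0.5, 0.5], data=data, sigma=0.5)
print(post)
```

The choice of `sigma` is not supplied by the curves themselves, which is the sense in which the likelihoods are not baked in.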
I think I don't quite understand what the argument is supposed to be. Would the same argument also prevent an aligned AI from acting to preserve its values? How is it supposed to select for aligned behavior over misaligned behavior?
In order to achieve beta-coherence, it seems like the training needs to use to choose actions, so that V is trained on those action frequencies. However, the proposal appears to be to instead use during training, so that the misaligned V is misled about how to do reward-hacking. This seems like an impossible combination: V will become beta-coherent wrt rather than during training.
We could imagine two training steps; first we cohere V wrt , and then re-train wrt . Perhaps this is your intention. More generally, we could gradually increase the temperature during training.
Oh, I guess we could also modify the training procedure so that the gradient associated with some actions gets under-weighted and others get over-weighted. Maybe this is your intended proposal.
Still, I'm not sure this helps achieve the postulated effect. Let's say that, during training, we choose actions entirely randomly (max temperature), but the gradient from suboptimal actions gets entirely ignored (so V becomes coherent wrt minimum temperature). This would seem to be almost equivalent to just training with minimum temperature (except the exploration behavior is very different).
Similarly for less extreme temperatures: if we make V beta-coherent by under-weighting gradients from over-sampled actions, then we are also more-or-less correcting the 'mistaken' expectations of the reward-hacking attempts.
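One way to picture the reweighting I have in mind is ordinary importance weighting: sample actions with one beta, then scale each action's gradient by the ratio of target to behavior probabilities. This is a sketch of my guess at the mechanism, not something the proposal specifies, and the function names and example values are hypothetical.

```python
import math


def boltzmann(values, beta):
    """Softmax over action values at inverse temperature beta."""
    m = max(values)
    exps = [math.exp(beta * (v - m)) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def importance_weights(values, beta_behavior, beta_target):
    """Per-action gradient weights target(a) / behavior(a).

    ASSUMPTION: actions are sampled with beta_behavior, and we want V to end
    up coherent with respect to beta_target.
    """
    behavior = boltzmann(values, beta_behavior)
    target = boltzmann(values, beta_target)
    return [t / b for t, b in zip(target, behavior)]


# Extreme case: act uniformly at random (beta = 0), target near-greedy.
values = [1.0, 0.5, 0.0]
weights = importance_weights(values, beta_behavior=0.0, beta_target=50.0)
print(weights)
```

In the extreme case, suboptimal actions get weights near zero and the best action absorbs almost all of the weight, so the expected update matches near-greedy training even though exploration is uniform.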
Am I misunderstanding the proposal, or missing something in the argument?
Correctness: V correctly predicts future reward in RL scenarios.
The meaning of correctness is very unclear to me on first reading. It later becomes apparent that "reward" is not intended to refer to the human-designed reward signal at all, due to the later assumption "deceptive misalignment". So what does "correctness" say? "Correctness" indicates conforming to some standard; but here, the standard being conformed to is only the subjective standard of the misaligned AI itself.
This suggests the interpretation of "correctness" as "reflect the EV of the current state in terms of the misaligned desires". However, this contradicts beta-coherence, since incorrectly predicts action probabilities.
I think it would be better to remove the "correctness" assumption, since it doesn't really do anything.
(And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)
Ah yep, that's a good clarification.