J Bostock

I'm mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.

I think the information most relevant in the long term (about AI and community building) is back-loaded, and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) is front-loaded. This is a very Bay Area-centric post, which I don't think is ideal.

A better version of this post would be structured as a round-up of the main future-relevant takeaways, with specifics from the office space as examples.

I'm only referring to the reward constraint being satisfied for scenarios within the training distribution, since this maths applies entirely to a decision taking place in training. Therefore I don't think distributional shift applies.

I haven't actually thought much about particular training algorithms yet. I think I'm working on a higher level of abstraction than that at the moment, since my maths doesn't depend on any specifics about V's behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.

I'm also imagining that during training, V is made up of different circuits which might be reinforced or weakened.

My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures where the expected importance of a scheming circuit in V is to become (weakly) weaker over time, rather than being reinforced.

If we do that for the entire training process, we would not expect to end up with a scheming V.

The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there's a >50% chance it's already factoring into the SOTA "alignment" techniques used by labs.

I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case? 

For example, if we introduce some error into the beta-coherence assumption:

Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.

V(s_0) = e/(1+e) +/- delta = 0.731 +/- delta

Actual expected value = e^0.5/(1+e^0.5) = 0.622

Even if |delta| = 0.1, the system cannot be coherent over training in this case: the lower end of the claimed range, 0.731 - 0.1 = 0.631, still exceeds the actual expected value of 0.622. This seems relatively robust to me.
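The arithmetic above can be checked directly: under a Boltzmann (softmax) policy over two actions with rewards r_1 = 1 and r_2 = 0, the expected reward at inverse temperature beta is e^beta / (e^beta + 1). A minimal sketch (the function name is mine, not from the post):

```python
import math

def softmax_expected_reward(beta, rewards):
    """Expected reward under a Boltzmann (softmax) policy at inverse temperature beta."""
    weights = [math.exp(beta * r) for r in rewards]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, rewards)) / total

rewards = [1.0, 0.0]  # r_1 = 1, r_2 = 0

v_claimed = softmax_expected_reward(1.0, rewards)  # beta_t = 1  -> e/(1+e)
v_actual = softmax_expected_reward(0.5, rewards)   # beta_s = 0.5

print(round(v_claimed, 3))  # 0.731
print(round(v_actual, 3))   # 0.622
# The gap (~0.109) exceeds |delta| = 0.1, so no delta of that size can make V
# beta_t-coherent while also matching the expectation under beta_s sampling.
print(round(v_claimed - v_actual, 3))  # 0.109
```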

This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (i.e. has very low probability under the previous beta) and behave differently in this new case (reward-hacking the new beta).

I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant, because I still think a process obeying these rules is unlikely to create such a pathological V.

My model for how the strong doom-case works is that it requires there to be an actually-coherent mathematical object for the learning process to approach. This is the motivation for expecting arbitrary learning processes to approach e.g. utility maximizers. What I believe I have shown is that under these assumptions there is no such coherent mathematical object for a particular case of misalignment. Therefore I think this provides some evidence that an otherwise arbitrary learning process which pushes towards correctness and beta-coherence but samples at a different beta is unlikely to approach this particular type of misaligned V.

Trained with what procedure, exactly?

Fair point. I was going to add that I don't really view this as a "proposal" but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don't expect something like this to be too hard to achieve.

Since I've evidently done a bad job of explaining myself, I'll backtrack and try again:

There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task 'manipulate your training to get released unmodified to do [X]' where X can be anything, which will 'succeed' at the task at hand as part of its manipulation". This summary being roughly correct is load bearing.

But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).

Therefore, during the training, the value function will not be shaped into something which looks like 'manipulate your training to get released unmodified to do [X]'.

Whether or not the beta difference required is too large to make this feasible in practice, I do not know.

The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."

The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.

Therefore a value function trained with such a procedure must consider the state reached during training. This reduces the space of possible value functions from "literally anything which wants to be modified a certain way to be released" to "value functions which do care about the states reached during training".

Yes, this would prevent an aligned AI from arbitrarily preserving its value function; the point is that an aligned AI probably would care about which state was reached during training (that's the point of RL), so the contradiction does not apply.

I think you're right: correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:

For non-terminal s, this can be written as:

If s is terminal then [...] we just have .

Which captures both. I will edit the post to clarify this when I get time.
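A sketch of how such a combined condition is often written; this is my own notation and a guess at the intended equations, not the post's exact formulas:

```latex
% Assumed notation: \pi_\beta is the Boltzmann policy induced by V at inverse
% temperature \beta; r(s) is the reward at a terminal state s.
% Non-terminal s (beta-coherence):
V(s) = \mathbb{E}_{s' \sim \pi_\beta(\cdot \mid s)}\bigl[ V(s') \bigr]
% Terminal s (correctness):
V(s) = r(s)
```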

I somehow missed that they had a discord! I couldn't find anything on mRNA on their front-facing website, and since it hasn't been updated in a while I assumed they were relatively inactive. Thanks! 

Thinking back to the various rationalist attempts to make a vaccine (https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine) for bird-flu-related reasons: since then, we've seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intranasally for COVID with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing it with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVaC or similar looked at this?
